A short guide, with a code example, on how to return URLs as well as Scrapy items from a Scrapy spider running under Frontera.
I'm using the Scrapy framework for a project that scrapes data from an external site. I wanted to scale the application out and move from one spider to multiple spiders. The go-to framework for this is Frontera, which provides a distributed crawl frontier on top of Scrapy.
I set up an environment on Amazon AWS and installed Frontera as per the quick-start user guide.
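For context, wiring Frontera into a Scrapy project mostly happens in the project's settings module. This is a sketch of the Scrapy-side hooks the quick-start guide walks through; the middleware and scheduler paths follow the Frontera documentation, while the project name `myproject` is a placeholder assumption.

```python
# settings.py (sketch) -- hand scheduling over to Frontera instead of
# Scrapy's built-in scheduler, per the Frontera quick-start guide.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}

# Module holding the Frontera-specific settings (backend, max depth, etc.);
# 'myproject' is a placeholder for your own package name.
FRONTERA_SETTINGS = 'myproject.frontera_settings'
```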
[Issue] Can't extract URLs and data
I uploaded my custom spider script and set the crawl framework and spider running. But where was the data? The spider was returning URLs and passing them to the crawl framework, yet no data was being extracted from the pages, and I couldn't figure out why.
[Fix] Yield the extracted URLs as well as the Scrapy items
The spider must yield both URLs and Scrapy items, but there was no example of how to do this. The core piece of code is below: you basically bolt the item code on after the link-extractor logic, then deal with the item in a pipeline.
import logging

from scrapy import Request, Spider
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

logger = logging.getLogger(__name__)

# TestItem and Util are defined elsewhere in the project.


class GeneralSpider(Spider):
    name = 'generalSpider'
    allowed_domains = ['yourDomain.com']

    def __init__(self, *args, **kwargs):
        super(GeneralSpider, self).__init__(*args, **kwargs)
        self.le = LinkExtractor(allow_domains=self.allowed_domains)

    def parse(self, response):
        if not isinstance(response, HtmlResponse) or response.status != 200:
            logger.debug('Not a usable response object.')
            return

        # extract the links from the page and yield them as requests
        for link in self.le.extract_links(response):
            cleanUrl = Util.RemoveQueryString(link.url)
            r = Request(url=cleanUrl, callback=self.parse)
            r.meta.update(link_text=link.text)
            logger.debug('Extracted a link: ' + cleanUrl)
            yield r

        # create the scrapy item and yield it alongside the requests
        item = TestItem()
        item['url'] = response.url
        item['response'] = response
        yield item
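The spider above leans on two pieces that aren't shown: the `Util.RemoveQueryString` helper and the pipeline that receives the yielded items. Neither implementation is in the original, so this is a minimal sketch of both; `TestItemPipeline` and its `seen` list are hypothetical names used here for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


class Util:
    """Stand-in for the Util helper the spider calls; the original
    implementation isn't shown, so this is a guess at its behaviour."""

    @staticmethod
    def RemoveQueryString(url):
        # Drop the query string (and any fragment) from a URL.
        scheme, netloc, path, _query, _fragment = urlsplit(url)
        return urlunsplit((scheme, netloc, path, '', ''))


class TestItemPipeline:
    """Hypothetical pipeline showing where the yielded items end up.
    Scrapy calls process_item() once for every item a spider yields;
    requests, by contrast, are routed to the scheduler (Frontera here)."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        # item supports dict-style access ('url', 'response' from TestItem)
        self.seen.append(item['url'])
        return item
```

The pipeline would then be registered in `settings.py` under `ITEM_PIPELINES` so the framework invokes it for each yielded item.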
Hope this helps someone; Scrapy and Frontera are both awesome projects.