Scraping a bookstore
A demo bookstore website, available at http://books.toscrape.com/, has been built for learning web scraping.
Following the thought process shown in the lecture, we build the following spider in a file named book_spider.py. It starts from the 50 catalogue pages, each listing 20 books:
import scrapy


class BookSpider(scrapy.Spider):
    name = "toscrape-css"
    # one start URL per catalogue page (pages 1 to 50)
    start_urls = [
        'http://books.toscrape.com/catalogue/page-%s.html' % page
        for page in range(1, 51)
    ]

    def parse(self, response):
        for link in response.css("ol.row article.product_pod h3 a"):
            # we can extract a title with the following line, although long titles are shortened with (...)
            short_title = link.css("::text").extract_first()  # not used further
            # url to an individual book page where complete information can be scraped;
            # the href is relative, so it is resolved against the current page URL
            book_url = response.urljoin(link.css("::attr(href)").extract_first())
            yield scrapy.Request(book_url, callback=self.parse_book)

    def parse_book(self, response):
        # one item per book page; 'stars' keeps only the rating word from the class attribute
        yield {
            'title': response.css('div.product_main h1::text').extract_first(),
            'price': response.css('div.product_main p.price_color::text').extract_first(),
            'image_url': response.css('#product_gallery img::attr(src)').extract_first(),
            'stars': response.css('div.product_main p.star-rating::attr(class)').extract_first().replace('star-rating ', '')
        }
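Links found on a listing page are typically relative, so they have to be resolved against the URL of the page they came from before they can be requested. The sketch below illustrates this with the standard library's urljoin, which Scrapy's response.urljoin builds on. The href and src values are illustrative assumptions, not values copied from the site.

```python
from urllib.parse import urljoin

# URL of a listing page (one of the 50 start URLs)
listing_url = "http://books.toscrape.com/catalogue/page-1.html"
# Hypothetical relative href, as it might appear in an <a> tag on that page
relative_href = "a-light-in-the-attic_1000/index.html"

# Resolve the relative link against the listing page URL
book_url = urljoin(listing_url, relative_href)
print(book_url)

# The same applies to a relative image path on a book page (hypothetical src value)
relative_src = "../../media/cache/example.jpg"
image_url = urljoin(book_url, relative_src)
print(image_url)
```

The same idea could be applied to the 'image_url' field if absolute image URLs are wanted in the output.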
Running the spider with the following command makes Scrapy crawl the listing pages, follow the links to the individual book pages, and store the scraped data in a CSV file:
scrapy runspider book_spider.py -o bookstore.csv
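The scraped values are stored as plain strings, so a small post-processing step can be useful before analysis. The following is a minimal sketch, assuming the CSV has the columns title, price, image_url, and stars produced by the spider, that prices look like '£51.77', and that ratings are the words One to Five. The helper names and the sample row are hypothetical.

```python
import csv
import io

# Assumed rating vocabulary taken from the star-rating CSS class
STAR_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(text):
    """Convert a price string such as '£51.77' to a float."""
    # Keep only digits and the decimal point, dropping the currency symbol
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    return float(cleaned)


def parse_stars(word):
    """Map a rating word to an integer, or None if the word is unknown."""
    return STAR_WORDS.get(word.strip())


def clean_rows(csv_file):
    """Yield rows from an open CSV file with numeric price and stars."""
    for row in csv.DictReader(csv_file):
        row["price"] = parse_price(row["price"])
        row["stars"] = parse_stars(row["stars"])
        yield row


# Illustrative in-memory sample standing in for bookstore.csv
sample = io.StringIO(
    "title,price,image_url,stars\n"
    "Example Book,£51.77,../../media/cache/example.jpg,Three\n"
)
rows = list(clean_rows(sample))
print(rows)
```

To process the real output, the sample could be replaced with open("bookstore.csv", newline="", encoding="utf-8").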