Scraping a bookstore

The bookstore website at http://books.toscrape.com/ has been built for learning web scraping.

Following the thought process shown in the lecture, we build the following spider in a file named book_spider.py. It starts from the 50 catalogue pages, each listing 20 books:

import scrapy


class BookSpider(scrapy.Spider):
    name = "toscrape-css"
    start_urls = [
        'http://books.toscrape.com/catalogue/page-%s.html' % page
        for page in range(1, 51)
    ]


    def parse(self, response):
        for link in response.css("ol.row article.product_pod h3 a"):
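            # The link text gives a shortened title (long ones are truncated with "..."),
            # so it is not used further; the full title comes from the book page.
            short_title = link.css("::text").extract_first()
            # URL of the individual book page, where complete information can be scraped.
            # The href is relative, so resolve it against the current page URL.
            book_url = response.urljoin(link.css("::attr(href)").extract_first())
            yield scrapy.Request(book_url,
                                 callback=self.parse_book)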

    def parse_book(self, response):
        yield {
                'title': response.css('div.product_main h1::text').extract_first(),
                'price': response.css('div.product_main p.price_color::text').extract_first(),
                'image_url': response.css('#product_gallery img::attr(src)').extract_first(),
                'stars': response.css('div.product_main p.star-rating::attr(class)').extract_first().replace('star-rating ', '')
                }
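
The links and image sources found on these pages are relative paths, so they have to be resolved against the URL of the page they appear on before they can be requested. The sketch below illustrates that resolution with the standard library's urljoin, which follows the same resolution rules. The book path and image path are hypothetical placeholders, not values taken from the site.

```python
from urllib.parse import urljoin

# URL of a listing page, as generated in start_urls
listing_url = "http://books.toscrape.com/catalogue/page-1.html"

# Hypothetical relative href, standing in for what a listing link might contain
relative_href = "some-book_1/index.html"

# Resolve the href against the listing page to get an absolute book URL
book_url = urljoin(listing_url, relative_href)
print(book_url)

# Hypothetical relative image path, standing in for an img src on a book page
relative_src = "../../media/cache/xx.jpg"

# Each ".." climbs one directory up from the book page's location
image_url = urljoin(book_url, relative_src)
print(image_url)
```

If absolute image links are wanted in the output, the same resolution could be applied to the image_url field inside parse_book.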

From there, Scrapy scrapes the website, follows the links to the individual book pages, and stores the data in a CSV file when the spider is run with the command scrapy runspider book_spider.py -o bookstore.csv.
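
The fields written to the CSV file are raw strings as they appear on the page. As a minimal sketch of a possible post-processing step, the function below converts a price string to a float and a rating word to an integer. The helper name clean_book, the sample row, and the assumption that ratings are stored as the words "One" to "Five" are illustrative and should be checked against the actual output.

```python
import re

# Assumed mapping from the rating word left after stripping 'star-rating '
WORD_TO_INT = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def clean_book(row):
    """Return a copy of one scraped row with numeric price and stars."""
    # Pick out the numeric part of the price, ignoring the currency symbol
    price_match = re.search(r"\d+(?:\.\d+)?", row["price"])
    return {
        "title": row["title"].strip(),
        "price": float(price_match.group()) if price_match else None,
        # Unknown rating words map to None instead of raising an error
        "stars": WORD_TO_INT.get(row["stars"].strip()),
    }


# Hypothetical row, shaped like the items yielded by parse_book
sample = {"title": "A Hypothetical Book", "price": "£12.34", "stars": "Three"}
print(clean_book(sample))
```

The same function could be applied to each row read back from bookstore.csv with csv.DictReader.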