Jazz transcriptions - follow links

Scrapy for a single webpage

Scrapy is a Python package for web scraping. As a simple example, the spider below retrieves data from a single webpage: for each link in the list on the books page, it yields the link text and the link target.

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'jazzspider'
    start_urls = ['https://blueblackjazz.com/en/books']

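    def parse(self, response):
        # Each matching <a> element is one entry in the list of books.
        for link in response.css('div.row > div > ul > li > a'):
            yield {
                'title': link.css('::text').get(),  # first text node of the link
                'url': link.attrib['href'],         # href exactly as written in the page
            }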

Save the spider as scr.py and run it with the command scrapy runspider scr.py -o my_data.csv, which writes the scraped items to my_data.csv.
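To check the result, the exported file can be read back with Python's standard csv module. The following is only a minimal sketch: it assumes the export has a header row with the columns title and url (the keys yielded by the spider), and the sample row is a made-up stand-in for my_data.csv, not real scraped data. The helper name read_scraped_rows is hypothetical.

```python
import csv
import io


def read_scraped_rows(csv_file):
    """Return the rows of a CSV export as a list of dicts keyed by column name."""
    return list(csv.DictReader(csv_file))


# Hypothetical sample standing in for my_data.csv; real titles and URLs will differ.
sample = io.StringIO(
    "title,url\n"
    "Example Book,/en/books/example-book\n"
)

rows = read_scraped_rows(sample)
for row in rows:
    print(row["title"], "->", row["url"])

# With the real file, one would instead open it from disk:
# with open("my_data.csv", newline="", encoding="utf-8") as f:
#     rows = read_scraped_rows(f)
```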

Following links

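We continue to scrape data from the jazz transcriptions website. Starting from the URL https://blueblackjazz.com/en/books, we can follow the links to the individual transcription pages, which are listed on the right of that page. Each link is requested in turn, and its response is handled by a second callback, parse_transcription_page.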

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'jazzspider'
    start_urls = ['https://blueblackjazz.com/en/books']

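    def parse(self, response):
        for link in response.css('div.row > div > ul > li > a'):
            # Turn a possibly relative href into an absolute URL.
            another_url = response.urljoin(link.attrib['href'])
            # Schedule the linked page; its response goes to the second callback.
            yield scrapy.Request(another_url, callback=self.parse_transcription_page)

    def parse_transcription_page(self, response):
        for title in response.css('h2.transcriptionTitle'):
            yield {
                # Text of the first <small> element inside the heading.
                'Musician': title.css('small::text').get(),
                # All text nodes joined with ';', newlines replaced by '_'.
                'Title': ';'.join(title.css('::text').getall()).replace('\n', '_'),
            }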
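Two small steps in this spider can be tried without any network access. The sketch below uses urllib.parse.urljoin from the standard library as a stand-in for response.urljoin, on the assumption that the page declares no <base> tag, so that links resolve against the page's own URL. The href values and text fragments are invented for illustration, and the helper names absolute_url and flatten_title are hypothetical.

```python
from urllib.parse import urljoin

START_URL = "https://blueblackjazz.com/en/books"


def absolute_url(href, base=START_URL):
    """Resolve an href against the page URL, roughly as response.urljoin would."""
    return urljoin(base, href)


def flatten_title(fragments):
    """Join text fragments with ';' and replace newlines with '_', as in the spider."""
    return ";".join(fragments).replace("\n", "_")


# Hypothetical href values, not taken from the real page.
print(absolute_url("/en/example-page"))           # root-relative link
print(absolute_url("https://example.com/other"))  # already absolute, left unchanged

# Hypothetical text nodes, as a list of strings from a selector might look.
print(flatten_title(["Example Title\n", "Example Musician"]))
```

Because relative links are resolved against the start URL, the spider can request them directly, while absolute links pass through untouched.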