Jazz transcriptions - follow links
scrapy for a single webpage
scrapy is a Python package that helps with web scraping. A simple example follows. For a single webpage, we can retrieve data using
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'jazzspider'
    start_urls = ['https://blueblackjazz.com/en/books']

    def parse(self, response):
        # Each matching <a> element yields one item with its link text and href.
        for link in response.css('div.row > div > ul > li > a'):
            yield {'title': link.css('::text').get(),
                   'url': link.attrib['href']}
and then running the command scrapy runspider scr.py -o my_data.csv (assuming the spider is saved as scr.py), which writes the scraped items to my_data.csv.
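As a rough sketch of how the exported file could be inspected afterwards, the snippet below reads CSV data with the two columns title and url using the standard csv module. The sample rows and the helper name load_rows are hypothetical placeholders for illustration, not actual scraped data or part of scrapy.

```python
import csv
import io

# Hypothetical stand-in for the contents of my_data.csv; the real file
# would hold whatever titles and URLs the spider yielded.
SAMPLE_CSV = """title,url
Example Book One,/en/books/example-one
Example Book Two,/en/books/example-two
"""


def load_rows(handle):
    """Return the exported items as a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["title"], "->", row["url"])
```

To read the real export instead, the same helper can be given an open file, for example with open('my_data.csv', newline='', encoding='utf-8') as f: rows = load_rows(f).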
Following links
We continue to scrape data from the jazz transcriptions website. Starting from the URL https://blueblackjazz.com/en/books, we can follow the links to the individual transcriptions that can be seen on the right.
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'jazzspider'
    start_urls = ['https://blueblackjazz.com/en/books']

    def parse(self, response):
        for link in response.css('div.row > div > ul > li > a'):
            # Turn the (possibly relative) href into an absolute URL.
            another_url = response.urljoin(link.attrib['href'])
            # Schedule the linked page; its response goes to the second callback.
            yield scrapy.Request(another_url, callback=self.parse_transcription_page)

    def parse_transcription_page(self, response):
        for title in response.css('h2.transcriptionTitle'):
            yield {'Musician': title.css('small::text').extract_first(),
                   # Join all text nodes with ';' and replace newlines with '_'.
                   'Title': ';'.join(title.css('::text').extract()).replace('\n', '_')
                   }
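Two steps in this spider are plain string handling and can be tried without running a crawl. The sketch below assumes that response.urljoin behaves much like the standard urllib.parse.urljoin applied to the page URL (scrapy may additionally take a base tag in the HTML into account), and it repeats the join/replace expression used for the Title field. All href values and text nodes here are hypothetical, chosen only to show the behaviour.

```python
from urllib.parse import urljoin

# The start URL used by the spider.
BASE_URL = "https://blueblackjazz.com/en/books"

# Hypothetical href values covering three common cases.
relative_href = "example-page"                    # relative to the current path
root_relative_href = "/fr/example-page"           # relative to the site root
absolute_href = "https://example.com/other"       # already absolute

resolved = [
    urljoin(BASE_URL, href)
    for href in (relative_href, root_relative_href, absolute_href)
]
for url in resolved:
    print(url)

# Hypothetical text nodes, standing in for title.css('::text').extract().
text_nodes = ["\nExample Tune\n", "Example Musician"]

# Same expression as in parse_transcription_page: join with ';',
# then replace every newline with an underscore.
title = ";".join(text_nodes).replace("\n", "_")
print(title)
```

The relative href replaces the last path segment of the base URL, the root-relative one replaces the whole path, and the absolute one is returned unchanged. Stray underscores in the joined title come from newlines inside the text nodes, so applying strip() to each node before joining may give cleaner output.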