Getting data off the web by scraping

Scraping data from websites

Web scraping in Python

  • requests package: make pure HTTP requests
  • BeautifulSoup package: parse an HTML file from webpages to find data in specific parts of the page
  • lxml package: parse an HTML file from webpages to find data in specific parts of the page
  • selenium/playwright: render a webpage as a web browser would, including executing any JavaScript. Provides programmatic access to the fully rendered webpage. playwright is the more recent iteration of this idea.
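A minimal sketch of how the first two packages divide the work: requests only downloads the HTML text, and BeautifulSoup only parses it. The helper names fetch_html and page_title and the sample page are hypothetical, chosen for illustration; the parsing step runs on an inline string so that no network access is needed.

```python
from typing import Optional

import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page with a plain HTTP GET; no JavaScript is executed."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def page_title(html: str) -> Optional[str]:
    """Parse HTML text and return the contents of <title>, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


# Hypothetical inline page, so the parsing step can be tried offline.
SAMPLE_HTML = "<html><head><title> Faculty </title></head><body></body></html>"
print(page_title(SAMPLE_HTML))
```

For a live page, the two steps chain as page_title(fetch_html(url)). If the data of interest is filled in by JavaScript after the page loads, the downloaded text will not contain it, which is the case selenium/playwright address.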

Installation:

The Document Object Model (DOM)

Quote from https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction

What is the DOM? The Document Object Model (DOM) is a programming interface for web documents. It represents the page so that programs can change the document structure, style, and content. The DOM represents the document as nodes and objects; that way, programming languages can interact with the page.
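To make "the document as nodes" concrete, here is a rough sketch that turns a small HTML string into a tree of nodes using only the standard library. The Node and TreeBuilder classes are hypothetical stand-ins written for this illustration, not the browser's DOM API, and the builder assumes well-nested HTML (it does not handle void elements such as <br>).

```python
from html.parser import HTMLParser


class Node:
    """A minimal stand-in for a DOM element node (illustration only)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree from well-nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # A start tag opens a new child of the node we are currently inside.
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # An end tag closes the current node: step back up to its parent.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def outline(node, depth=0):
    """Return one indented line per node, in depth-first order."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


SAMPLE_HTML = "<html><body><h1>Faculty</h1><ul><li>A</li><li>B</li></ul></body></html>"
builder = TreeBuilder()
builder.feed(SAMPLE_HTML)
builder.close()
print("\n".join(outline(builder.root)))
```

The printed outline shows html containing body, which contains h1 and ul, with two li nodes under ul. BeautifulSoup and lxml build a richer version of the same kind of tree.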

DOM: the Rutgers homepage

[Figure: DOM tree of the Rutgers homepage]

How to navigate the HTML

  • navigate the tree of DOM nodes with a loop
  • or: use CSS selectors to select node(s)
  • or: use XPath (supported by lxml)
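The three approaches can be compared side by side on a small example. This is a sketch under assumptions: the markup, class names, and addresses below are invented for illustration, and all three snippets are meant to extract the same list of names.

```python
from bs4 import BeautifulSoup
from lxml import html as lxml_html

# Hypothetical markup standing in for a real page.
SAMPLE_HTML = """
<div id="people">
  <div class="person"><span class="name">Ada</span><a href="mailto:ada@example.org">email</a></div>
  <div class="person"><span class="name">Grace</span><a href="mailto:grace@example.org">email</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 1. Loop over the tree: visit every tag and filter by hand.
names_loop = []
for node in soup.find_all(True):  # True matches every tag
    if node.name == "span" and "name" in node.get("class", []):
        names_loop.append(node.get_text())

# 2. CSS selector: the same filter written as a single pattern.
names_css = [node.get_text() for node in soup.select("div.person > span.name")]

# 3. XPath, evaluated by lxml on its own parse of the same HTML.
tree = lxml_html.fromstring(SAMPLE_HTML)
names_xpath = tree.xpath('//div[@class="person"]/span[@class="name"]/text()')

print(names_loop, names_css, names_xpath)
```

The loop is the most flexible but the most verbose; selectors and XPath expressions are shorter and can often be copied from the browser devtools and then simplified.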

How to find CSS selectors from your browser devtools

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://statistics.rutgers.edu/people-pages/faculty"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on an HTTP error (4xx/5xx)
soup = BeautifulSoup(response.text, "html.parser")

# Find all <div> elements with class "latestnews-item" (one per faculty member)
faculty_items = soup.find_all("div", class_="latestnews-item")
data = []

for item in faculty_items:
    data.append({
        # the first <span> in the item holds the name
        "name": item.find("span").text,
        "title": item.find("span", class_="detail_data").text,
        # the email is the text of the first link whose href contains "mailto"
        "email": item.find("a", href=lambda href: href and "mailto" in href).text,
    })

pd.DataFrame(data)
    name                  title                                              email
0   Pierre Bellec         Associate Professor                                pcb71@stat.rutgers.edu
1   Matteo Bonvini        Assistant Professor                                mb1662@stat.rutgers.edu
2   Steve Buyske          Associate Professor; Undergraduate Co-Director...  buyske@stat.rutgers.edu
3   Javier Cabrera        Professor                                          cabrera@stat.rutgers.edu
4   Rong Chen             Distinguished Professor and Chair                  rongchen@stat.rutgers.edu
5   Yaqing Chen           Assistant Professor                                yqchen@stat.rutgers.edu
6   Harry Crane           Professor                                          hcrane@stat.rutgers.edu
7   Tirthankar DasGupta   Professor and Co-Graduate Director                 tirthankar.dasgupta@rutgers.edu
8   Ruobin Gong           Assistant Professor                                ruobin.gong@rutgers.edu
9   Zijian Guo            Associate Professor                                zijguo@stat.rutgers.edu
10  Qiyang Han            Associate Professor                                qh85@stat.rutgers.edu
11  Donald R. Hoover      Professor                                          drhoover@stat.rutgers.edu
12  Ying Hung             Professor                                          yhung@stat.rutgers.edu
13  Koulik Khamaru        Assistant Professor                                kk1241@stat.rutgers.edu
14  John Kolassa          Distinguished Professor                            kolassa@stat.rutgers.edu
15  Regina Y. Liu         Distinguished Professor                            rliu@stat.rutgers.edu
16  Gemma Moran           Assistant Professor                                gm845@stat.rutgers.edu
17  Nicole Pashley        Assistant Professor                                np755@stat.rutgers.edu
18  Harold B. Sackrowitz  Distinguished Professor and Undergraduate Dire...  sackrowi@stat.rutgers.edu
19  Michael L. Stein      Distinguished Professor                            ms2870@stat.rutgers.edu
20  Zhiqiang Tan          Distinguished Professor                            ztan@stat.rutgers.edu
21  David E. Tyler        Distinguished Professor                            dtyler@stat.rutgers.edu
22  Guanyang Wang         Assistant Professor                                guanyang.wang@rutgers.edu
23  Sijian Wang           Professor and Co-Director of FSRM and MSDS pro...  sijian.wang@stat.rutgers.edu
24  Han Xiao              Professor and Co-Graduate Director                 hxiao@stat.rutgers.edu
25  Minge Xie             Distinguished Professor and Director, Office o...  mxie@stat.rutgers.edu
26  Min Xu                Assistant Professor                                mx76@stat.rutgers.edu
27  Cun-Hui Zhang         Distinguished Professor and Co-Director of FSR...  czhang@stat.rutgers.edu
28  Linjun Zhang          Assistant Professor                                linjun.zhang@rutgers.edu
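The scraper above raises an AttributeError as soon as one item lacks a name, title, or email link, because find returns None for a missing element. A possible variant, sketched here with CSS selectors, records None for the missing field instead. The helper names text_or_none and parse_faculty are hypothetical, and the sample markup is invented: it only imitates the class names used above, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the scraper assumes;
# the second item deliberately has no email link.
SAMPLE_HTML = """
<div class="latestnews-item">
  <span>Ada Lovelace</span>
  <span class="detail_data">Professor</span>
  <a href="mailto:ada@example.org">ada@example.org</a>
</div>
<div class="latestnews-item">
  <span>Grace Hopper</span>
  <span class="detail_data">Associate Professor</span>
</div>
"""


def text_or_none(node):
    """Return the stripped text of a node, or None if the node was not found."""
    return node.get_text(strip=True) if node is not None else None


def parse_faculty(html):
    """Extract one dict per faculty item; missing fields become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("div.latestnews-item"):
        rows.append({
            "name": text_or_none(item.select_one("span")),
            "title": text_or_none(item.select_one("span.detail_data")),
            # attribute selector: links whose href starts with "mailto:"
            "email": text_or_none(item.select_one('a[href^="mailto:"]')),
        })
    return rows


rows = parse_faculty(SAMPLE_HTML)
print(rows)
```

On the live page, the same function would be called as pd.DataFrame(parse_faculty(response.text)), assuming the page still uses these class names.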