Scraping faculty page with BeautifulSoup
HTTP request
We start by making an HTTP request to retrieve the HTML of the faculty page.
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = "https://statistics.rutgers.edu/people-pages/faculty"
response = requests.get(url)
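A request can fail or hang, so it may help to check the response before parsing it. The following is a minimal sketch, not part of the original walkthrough: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text, raising on HTTP errors.

    `fetch_html` is a hypothetical helper; the default timeout is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with the faculty page URL should return the same text as `response.text`, but it stops early with an exception if the server reports an error.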
BeautifulSoup object of the page
The next step is to create a BeautifulSoup object that will let us navigate the page.
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify()[:2000])
<!DOCTYPE html>
<html dir="ltr" lang="en-gb">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="The School of Arts and Sciences, Rutgers, The State University of New Jersey" name="description"/>
<meta content="Joomla! - Open Source Content Management" name="generator"/>
<title>
Faculty
</title>
<link href="/media/templates/site/cassiopeia_sas/images/favicon.ico" rel="alternate icon" type="image/vnd.microsoft.icon"/>
<link color="#000" href="/media/system/images/joomla-favicon-pinned.svg" rel="mask-icon"/>
<link href="/media/system/css/joomla-fontawesome.min.css?7adb883af089f68d28309c3f304b5284" rel="lazy-stylesheet">
<noscript>
<link href="/media/system/css/joomla-fontawesome.min.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
</noscript>
<link href="/media/cache/com_latestnewsenhancedpro/style_articles_blog_100570.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
<link href="/media/syw/css/fonts.min.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
<link href="/media/vendor/chosen/css/chosen.css?1.8.7" rel="stylesheet"/>
<link href="/media/templates/site/cassiopeia/css/global/colors_standard.min.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
<link href="/media/templates/site/cassiopeia/css/template.min.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
<link href="/media/templates/site/cassiopeia/css/vendor/joomla-custom-elements/joomla-alert.min.css?0.2.0" rel="stylesheet"/>
<link href="/media/templates/site/cassiopeia_sas/css/user.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
<link href="/media/com_jce/site/css/content.min.css?badb4208be409b1335b815dde676300e" rel="stylesheet"/>
<style>
.element-invisible { position: absolute !important; height: 1px; width: 1px; overflow: hidden; clip: rect(1px, 1px, 1px, 1px); }
</style>
<style>
@media (min-width: 768px) {#lnepmodal {max-width
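To see what navigating the parsed tree looks like without fetching the live page, here is a small sketch on a hand-written HTML snippet. The snippet is made up for illustration (loosely modeled on the head of the output above) and is not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, loosely modeled on the output above
html = """
<html lang="en-gb">
  <head><title>Faculty</title></head>
  <body><h2 class="newstitle"><a href="/people/1-jane-doe">Jane Doe</a></h2></body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the soup object
page_title = soup.title.string   # text inside <title>
page_lang = soup.html["lang"]    # tag attributes are read like dict keys
first_link = soup.find("a")      # first <a> tag in document order

print(page_title, page_lang, first_link["href"], first_link.text)
```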
Retrieving nodes
By inspecting the webpage with the developer tools of a browser (Firefox, Chrome), we see that the information for each faculty member is in a div node with CSS class latestnews-item:
faculty_items = soup.find_all("div", class_="latestnews-item")
faculty_items[0]
<div class="latestnews-item id-368 catid-130 head_left"> <div class="news"> <div class="innernews"> <div class="newshead picturetype"> <div class="picture"> <div class="innerpicture"> <a aria-label="Read more about Pierre Bellec" class="hasTooltip" href="/people-pages/faculty/people/368-pierre-bellec" title="Pierre Bellec"> <img alt="Pierre Bellec" height="200" loading="eager" src="/media/cache/com_latestnewsenhancedpro/thumb_articles_blog_100570_368.jpg?3ca7e6dd0d3720780e6df12574e73403" width="200"/> </a> </div> </div> </div> <div class="newsinfo"> <h2 class="newstitle"> <a aria-label="Read more about Pierre Bellec" class="hasTooltip" href="/people-pages/faculty/people/368-pierre-bellec" title="Pierre Bellec"> <span>Pierre Bellec</span> </a> </h2> <dl class="item_details before_text"><dt>Information</dt><dd class="newsextra"><span class="detail detail_jfield_text detail_jfield_2"><span class="detail_data">Associate Professor</span></span></dd><dd class="newsextra"><span class="detail detail_jfield_url detail_jfield_5"><span class="detail_data"><a href="mailto:pcb71@stat.rutgers.edu">pcb71@stat.rutgers.edu</a></span></span></dd></dl> </div> </div> </div> </div>
Here is the prettified HTML of this node:
print(faculty_items[0].prettify())
<div class="latestnews-item id-368 catid-130 head_left">
<div class="news">
<div class="innernews">
<div class="newshead picturetype">
<div class="picture">
<div class="innerpicture">
<a aria-label="Read more about Pierre Bellec" class="hasTooltip" href="/people-pages/faculty/people/368-pierre-bellec" title="Pierre Bellec">
<img alt="Pierre Bellec" height="200" loading="eager" src="/media/cache/com_latestnewsenhancedpro/thumb_articles_blog_100570_368.jpg?3ca7e6dd0d3720780e6df12574e73403" width="200"/>
</a>
</div>
</div>
</div>
<div class="newsinfo">
<h2 class="newstitle">
<a aria-label="Read more about Pierre Bellec" class="hasTooltip" href="/people-pages/faculty/people/368-pierre-bellec" title="Pierre Bellec">
<span>
Pierre Bellec
</span>
</a>
</h2>
<dl class="item_details before_text">
<dt>
Information
</dt>
<dd class="newsextra">
<span class="detail detail_jfield_text detail_jfield_2">
<span class="detail_data">
Associate Professor
</span>
</span>
</dd>
<dd class="newsextra">
<span class="detail detail_jfield_url detail_jfield_5">
<span class="detail_data">
<a href="mailto:pcb71@stat.rutgers.edu">
pcb71@stat.rutgers.edu
</a>
</span>
</span>
</dd>
</dl>
</div>
</div>
</div>
</div>
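Note that the node above carries several classes (latestnews-item, id-368, catid-130, head_left), yet it is matched by a single class name. The sketch below illustrates this on a made-up snippet and compares `find_all` with the CSS-selector method `select`; the sample HTML and the names in it are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with the same class layout as the faculty page
html = """
<div class="latestnews-item id-1 catid-130"><span>Jane Doe</span></div>
<div class="latestnews-item id-2 catid-130"><span>John Roe</span></div>
<div class="other-item"><span>Not a faculty card</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches if the value is any one of the element's classes
by_find_all = soup.find_all("div", class_="latestnews-item")

# select() takes a CSS selector, written as in the browser devtools
by_select = soup.select("div.latestnews-item")

names = [div.find("span").text for div in by_select]
print(len(by_find_all), len(by_select), names)
```

Both calls should return the same two nodes here; which one to use is largely a matter of taste.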
Retrieving data from a node
We can see that the faculty name is in the first span tag:
faculty_items[0].find("span")
<span>Pierre Bellec</span>
or, to get the text of the node:
faculty_items[0].find("span").text
'Pierre Bellec'
The title is in a span tag with class detail_jfield_2:
faculty_items[0].find("span", class_="detail_jfield_2").text
'Associate Professor'
Finally, the email is given in a span tag with class detail_jfield_5:
faculty_items[0].find("span", class_="detail_jfield_5").text
'pcb71@stat.rutgers.edu'
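One caveat: `find` returns `None` when nothing matches, so chaining `.text` raises an `AttributeError` if a node lacks one of the fields. A minimal sketch of a more defensive variant follows; the helper name `text_or_none` and the sample HTML are assumptions for illustration, not part of the page or of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def text_or_none(node, *args, **kwargs):
    """Return the stripped text of node.find(...), or None if nothing matches."""
    found = node.find(*args, **kwargs)
    return found.get_text(strip=True) if found is not None else None


# Hypothetical node with a name and a title but no email field
html = """
<div class="latestnews-item">
  <h2 class="newstitle"><a href="#"><span> Jane Doe </span></a></h2>
  <span class="detail detail_jfield_text detail_jfield_2">
    <span class="detail_data">Professor</span>
  </span>
</div>
"""
item = BeautifulSoup(html, "html.parser").find("div", class_="latestnews-item")

name = text_or_none(item, "span")
title = text_or_none(item, "span", class_="detail_jfield_2")
email = text_or_none(item, "span", class_="detail_jfield_5")  # field is absent
print(name, title, email)
```

Here `get_text(strip=True)` also removes the surrounding whitespace that `.text` would keep.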
Full code to obtain a pandas DataFrame
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = "https://statistics.rutgers.edu/people-pages/faculty"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find all elements with class "latestnews-item"
faculty_items = soup.find_all("div", class_="latestnews-item")

data = []
for item in faculty_items:
    data.append({'name': item.find("span").text,
                 'title': item.find("span", class_='detail_jfield_2').text,
                 'email': item.find("span", class_='detail_jfield_5').text})
pd.DataFrame(data)
 | name | title | email |
---|---|---|---|
0 | Pierre Bellec | Associate Professor | pcb71@stat.rutgers.edu |
1 | Matteo Bonvini | Assistant Professor | mb1662@stat.rutgers.edu |
2 | Steve Buyske | Associate Professor; Undergraduate Co-Director... | buyske@stat.rutgers.edu |
3 | Javier Cabrera | Professor | cabrera@stat.rutgers.edu |
4 | Rong Chen | Distinguished Professor and Chair | rongchen@stat.rutgers.edu |
5 | Yaqing Chen | Assistant Professor | yqchen@stat.rutgers.edu |
6 | Harry Crane | Professor | hcrane@stat.rutgers.edu |
7 | Tirthankar DasGupta | Professor and Co-Graduate Director | tirthankar.dasgupta@rutgers.edu |
8 | Ruobin Gong | Assistant Professor | ruobin.gong@rutgers.edu |
9 | Zijian Guo | Associate Professor | zijguo@stat.rutgers.edu |
10 | Qiyang Han | Associate Professor | qh85@stat.rutgers.edu |
11 | Donald R. Hoover | Professor | drhoover@stat.rutgers.edu |
12 | Ying Hung | Professor | yhung@stat.rutgers.edu |
13 | Koulik Khamaru | Assistant Professor | kk1241@stat.rutgers.edu |
14 | John Kolassa | Distinguished Professor | kolassa@stat.rutgers.edu |
15 | Regina Y. Liu | Distinguished Professor | rliu@stat.rutgers.edu |
16 | Gemma Moran | Assistant Professor | gm845@stat.rutgers.edu |
17 | Nicole Pashley | Assistant Professor | np755@stat.rutgers.edu |
18 | Harold B. Sackrowitz | Distinguished Professor and Undergraduate Dire... | sackrowi@stat.rutgers.edu |
19 | Michael L. Stein | Distinguished Professor | ms2870@stat.rutgers.edu |
20 | Zhiqiang Tan | Distinguished Professor | ztan@stat.rutgers.edu |
21 | David E. Tyler | Distinguished Professor | dtyler@stat.rutgers.edu |
22 | Guanyang Wang | Assistant Professor | guanyang.wang@rutgers.edu |
23 | Sijian Wang | Professor and Co-Director of FSRM and MSDS pro... | sijian.wang@stat.rutgers.edu |
24 | Han Xiao | Professor and Co-Graduate Director | hxiao@stat.rutgers.edu |
25 | Minge Xie | Distinguished Professor and Director, Office o... | mxie@stat.rutgers.edu |
26 | Min Xu | Assistant Professor | mx76@stat.rutgers.edu |
27 | Cun-Hui Zhang | Distinguished Professor and Co-Director of FSR... | czhang@stat.rutgers.edu |
28 | Linjun Zhang | Assistant Professor | linjun.zhang@rutgers.edu |
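Some titles in the table above end in "...". This is most likely pandas shortening long strings for display rather than data lost during scraping. The sketch below shows one way to print the full contents; the rows are made-up stand-ins for the scraped `data` list.

```python
import pandas as pd

# Hypothetical rows standing in for the scraped `data` list
data = [
    {"name": "Jane Doe",
     "title": "Distinguished Professor and Director of a Very Long-Named Program",
     "email": "jane.doe@example.edu"},
    {"name": "John Roe",
     "title": "Assistant Professor",
     "email": "john.roe@example.edu"},
]
df = pd.DataFrame(data)

# Temporarily lift the column-width limit so cells are printed in full
with pd.option_context("display.max_colwidth", None):
    print(df)

# The stored strings are complete; only their display is shortened
longest_title = df["title"].str.len().max()
print(longest_title)
```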