Scraping faculty page with BeautifulSoup
HTTP request
We start by making an HTTP request to fetch the page at the given URL:
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = "https://statistics.rutgers.edu/people-pages/faculty"
response = requests.get(url)
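A request can fail or return an error page, in which case the later parsing steps would silently work on the wrong HTML. The following is a minimal sketch of a more defensive fetch; the helper name `fetch_html` and the 10-second `timeout` default are assumptions for illustration and do not come from the original notebook.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the HTML body at `url`, raising on HTTP errors.

    Hypothetical helper: the name and the timeout default are illustrative.
    """
    # Without a timeout, requests.get can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with the `url` defined above would then stand in for `requests.get(url).text`.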
BeautifulSoup object of the page
The next step is to create a BeautifulSoup object that will let us navigate the page.
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify()[:2000])
<!DOCTYPE html>
<html dir="ltr" lang="en-gb">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="The School of Arts and Sciences, Rutgers, The State University of New Jersey" name="description"/>
<meta content="Joomla! - Open Source Content Management" name="generator"/>
<title>
Faculty
</title>
<link href="/media/templates/site/cassiopeia_sas/images/favicon.ico" rel="alternate icon" type="image/vnd.microsoft.icon"/>
<link color="#000" href="/media/system/images/joomla-favicon-pinned.svg" rel="mask-icon"/>
<link href="/media/system/css/joomla-fontawesome.min.css?7adb883af089f68d28309c3f304b5284" rel="lazy-stylesheet">
<noscript>
<link href="/media/system/css/joomla-fontawesome.min.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
</noscript>
<link href="/media/cache/com_latestnewsenhancedpro/style_articles_blog_100570.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
<link href="/media/syw/css/fonts.min.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
<link href="/media/vendor/chosen/css/chosen.css?1.8.7" rel="stylesheet"/>
<link href="/media/templates/site/cassiopeia/css/global/colors_standard.min.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
<link href="/media/templates/site/cassiopeia/css/template.min.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
<link href="/media/templates/site/cassiopeia/css/vendor/joomla-custom-elements/joomla-alert.min.css?0.2.0" rel="stylesheet"/>
<link href="/media/templates/site/cassiopeia_sas/css/user.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
<link href="/media/com_jce/site/css/content.min.css?badb4208be409b1335b815dde676300e" rel="stylesheet"/>
<style>
.element-invisible { position: absolute !important; height: 1px; width: 1px; overflow: hidden; clip: rect(1px, 1px, 1px, 1px); }
</style>
<style>
@media (min-width: 768px) {#lnepmodal {max-width
Retrieving nodes
By inspecting the webpage with devtools in a browser (Firefox, Chrome), we see that the information is in nodes with CSS class .latestnews-item:
faculty_items = soup.find_all("div", class_="latestnews-item")
faculty_items[0]
<div class="latestnews-item id-368 catid-130 head_left"> <div class="news"> <div class="innernews"> <div class="newshead picturetype"> <div class="picture"> <div class="innerpicture"> <a aria-label="Read more about Pierre Bellec" class="hasTooltip" href="/people-pages/faculty/people/368-pierre-bellec" title="Pierre Bellec"> <img alt="Pierre Bellec" height="200" loading="eager" src="/media/cache/com_latestnewsenhancedpro/thumb_articles_blog_100570_368.jpg?3ca7e6dd0d3720780e6df12574e73403" width="200"/> </a> </div> </div> </div> <div class="newsinfo"> <h2 class="newstitle"> <a aria-label="Read more about Pierre Bellec" class="hasTooltip" href="/people-pages/faculty/people/368-pierre-bellec" title="Pierre Bellec"> <span>Pierre Bellec</span> </a> </h2> <dl class="item_details before_text"><dt>Information</dt><dd class="newsextra"><span class="detail detail_jfield_text detail_jfield_2"><span class="detail_data">Associate Professor</span></span></dd><dd class="newsextra"><span class="detail detail_jfield_url detail_jfield_5"><span class="detail_data"><a href="mailto:pcb71@stat.rutgers.edu">pcb71@stat.rutgers.edu</a></span></span></dd></dl> </div> </div> </div> </div>
Here is the prettified HTML of this node:
print(faculty_items[0].prettify())
<div class="latestnews-item id-368 catid-130 head_left">
<div class="news">
<div class="innernews">
<div class="newshead picturetype">
<div class="picture">
<div class="innerpicture">
<a aria-label="Read more about Pierre Bellec" class="hasTooltip" href="/people-pages/faculty/people/368-pierre-bellec" title="Pierre Bellec">
<img alt="Pierre Bellec" height="200" loading="eager" src="/media/cache/com_latestnewsenhancedpro/thumb_articles_blog_100570_368.jpg?3ca7e6dd0d3720780e6df12574e73403" width="200"/>
</a>
</div>
</div>
</div>
<div class="newsinfo">
<h2 class="newstitle">
<a aria-label="Read more about Pierre Bellec" class="hasTooltip" href="/people-pages/faculty/people/368-pierre-bellec" title="Pierre Bellec">
<span>
Pierre Bellec
</span>
</a>
</h2>
<dl class="item_details before_text">
<dt>
Information
</dt>
<dd class="newsextra">
<span class="detail detail_jfield_text detail_jfield_2">
<span class="detail_data">
Associate Professor
</span>
</span>
</dd>
<dd class="newsextra">
<span class="detail detail_jfield_url detail_jfield_5">
<span class="detail_data">
<a href="mailto:pcb71@stat.rutgers.edu">
pcb71@stat.rutgers.edu
</a>
</span>
</span>
</dd>
</dl>
</div>
</div>
</div>
</div>
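The class-based lookup used for `latestnews-item` can also be written as a CSS selector with `select`. The sketch below runs on a small made-up HTML snippet (the names and the `other-item` class are illustrative, not taken from the faculty page) so that it works without a network request.

```python
from bs4 import BeautifulSoup

# A trimmed-down stand-in for the faculty page (illustrative markup only).
html = """
<div class="latestnews-item id-1"><span>Ada Example</span></div>
<div class="latestnews-item id-2"><span>Bob Example</span></div>
<div class="other-item"><span>Not a faculty entry</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any element whose class list contains "latestnews-item",
# even when the element carries additional classes such as "id-1".
by_find_all = soup.find_all("div", class_="latestnews-item")

# The same query expressed as a CSS selector.
by_select = soup.select("div.latestnews-item")

names = [node.find("span").text for node in by_select]
print(names)
```

Both calls return the matching nodes in document order, so either form can feed the extraction steps.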
Retrieving data from a node
We can see that the faculty name is in the first span:
faculty_items[0].find("span")
<span>Pierre Bellec</span>
or to get the text of the node:
faculty_items[0].find("span").text
'Pierre Bellec'
The title is in a span tag with class detail_jfield_2:
faculty_items[0].find("span", class_="detail_jfield_2").text
'Associate Professor'
Finally, the email is given in a span tag with class detail_jfield_5:
faculty_items[0].find("span", class_="detail_jfield_5").text
'pcb71@stat.rutgers.edu'
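The lookups above assume every entry has a name, a title, and an email. `find` returns `None` when nothing matches, so calling `.text` on a missing field raises `AttributeError`. Below is a minimal sketch of a tolerant accessor; the helper `text_or_none` and the sample markup are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup


def text_or_none(node, *args, **kwargs):
    """Return the stripped text of the first match of node.find(...), or None.

    Hypothetical helper, not part of BeautifulSoup or the original notebook.
    """
    found = node.find(*args, **kwargs)
    return found.get_text(strip=True) if found is not None else None


# Illustrative entry with a name and a title but no email field.
html = """
<div class="latestnews-item">
  <span>Ada Example</span>
  <span class="detail detail_jfield_text detail_jfield_2"><span class="detail_data">Professor</span></span>
</div>
"""
item = BeautifulSoup(html, "html.parser").find("div", class_="latestnews-item")

name = text_or_none(item, "span")
title = text_or_none(item, "span", class_="detail_jfield_2")
email = text_or_none(item, "span", class_="detail_jfield_5")  # no such span here
print(name, title, email)
```

With this pattern, an entry lacking a field yields `None` in that column instead of aborting the whole scrape.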
Full code to obtain a pandas DataFrame
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = "https://statistics.rutgers.edu/people-pages/faculty"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# Find all elements with class "latestnews-item"
faculty_items = soup.find_all("div", class_="latestnews-item")
data = []
# Collect one record (name, title, email) per faculty entry
for item in faculty_items:
    data.append({'name': item.find("span").text,
                 'title': item.find("span", class_='detail_jfield_2').text,
                 'email': item.find("span", class_='detail_jfield_5').text})
pd.DataFrame(data)
| | name | title | email |
|---|---|---|---|
| 0 | Pierre Bellec | Associate Professor | pcb71@stat.rutgers.edu |
| 1 | Matteo Bonvini | Assistant Professor | mb1662@stat.rutgers.edu |
| 2 | Steve Buyske | Associate Professor; Undergraduate Co-Director... | buyske@stat.rutgers.edu |
| 3 | Javier Cabrera | Professor | cabrera@stat.rutgers.edu |
| 4 | Rong Chen | Distinguished Professor and Chair | rongchen@stat.rutgers.edu |
| 5 | Yaqing Chen | Assistant Professor | yqchen@stat.rutgers.edu |
| 6 | Harry Crane | Professor | hcrane@stat.rutgers.edu |
| 7 | Tirthankar DasGupta | Professor and Co-Graduate Director | tirthankar.dasgupta@rutgers.edu |
| 8 | Ruobin Gong | Assistant Professor | ruobin.gong@rutgers.edu |
| 9 | Zijian Guo | Associate Professor | zijguo@stat.rutgers.edu |
| 10 | Qiyang Han | Associate Professor | qh85@stat.rutgers.edu |
| 11 | Donald R. Hoover | Professor | drhoover@stat.rutgers.edu |
| 12 | Ying Hung | Professor | yhung@stat.rutgers.edu |
| 13 | Koulik Khamaru | Assistant Professor | kk1241@stat.rutgers.edu |
| 14 | John Kolassa | Distinguished Professor | kolassa@stat.rutgers.edu |
| 15 | Regina Y. Liu | Distinguished Professor | rliu@stat.rutgers.edu |
| 16 | Gemma Moran | Assistant Professor | gm845@stat.rutgers.edu |
| 17 | Nicole Pashley | Assistant Professor | np755@stat.rutgers.edu |
| 18 | Harold B. Sackrowitz | Distinguished Professor and Undergraduate Dire... | sackrowi@stat.rutgers.edu |
| 19 | Michael L. Stein | Distinguished Professor | ms2870@stat.rutgers.edu |
| 20 | Zhiqiang Tan | Distinguished Professor | ztan@stat.rutgers.edu |
| 21 | David E. Tyler | Distinguished Professor | dtyler@stat.rutgers.edu |
| 22 | Guanyang Wang | Assistant Professor | guanyang.wang@rutgers.edu |
| 23 | Sijian Wang | Professor and Co-Director of FSRM and MSDS pro... | sijian.wang@stat.rutgers.edu |
| 24 | Han Xiao | Professor and Co-Graduate Director | hxiao@stat.rutgers.edu |
| 25 | Minge Xie | Distinguished Professor and Director, Office o... | mxie@stat.rutgers.edu |
| 26 | Min Xu | Assistant Professor | mx76@stat.rutgers.edu |
| 27 | Cun-Hui Zhang | Distinguished Professor and Co-Director of FSR... | czhang@stat.rutgers.edu |
| 28 | Linjun Zhang | Assistant Professor | linjun.zhang@rutgers.edu |
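Several titles in the table above end in "..." because pandas shortens long strings when displaying a DataFrame; the underlying data is not truncated. As a sketch, the `display.max_colwidth` option can be lifted temporarily to show full values. The row below is made up for illustration.

```python
import pandas as pd

# Made-up record with a title longer than the default display width.
long_title = "Distinguished Professor and Director of an Illustrative Office With a Long Name"
df = pd.DataFrame([{"name": "Ada Example", "title": long_title}])

# Within this context, cell contents are displayed without truncation;
# the previous display setting is restored on exit.
with pd.option_context("display.max_colwidth", None):
    full_repr = repr(df)

print(full_repr)
```

The stored value can also be read directly, for example with `df.loc[0, "title"]`, which is unaffected by display options.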