Scraping a faculty page with BeautifulSoup

HTTP request

We start by making an HTTP request to fetch the HTML of the page:

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = "https://statistics.rutgers.edu/people-pages/faculty"
response = requests.get(url)
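
The request above assumes the server answers promptly and with a success status. As a minimal sketch of a more defensive variant, the helper below (the name fetch_html and the 10-second default are illustrative assumptions, not part of the original code) sets a timeout and raises on HTTP error codes, so that an error page is not parsed by mistake:

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the body of `url` as text, raising on HTTP errors.

    Hypothetical helper: the name and default timeout are assumptions.
    """
    # The timeout stops the call from hanging if the server never answers.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

With this helper, the request in this section could be written as html = fetch_html(url), and the returned string passed to BeautifulSoup in place of response.text.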

BeautifulSoup object of the page

The next step is to create a BeautifulSoup object that will let us navigate the page.

soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify()[:2000])
<!DOCTYPE html>
<html dir="ltr" lang="en-gb">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="The School of Arts and Sciences, Rutgers, The State University of New Jersey" name="description"/>
  <meta content="Joomla! - Open Source Content Management" name="generator"/>
  <title>
   Faculty
  </title>
  <link href="/media/templates/site/cassiopeia_sas/images/favicon.ico" rel="alternate icon" type="image/vnd.microsoft.icon"/>
  <link color="#000" href="/media/system/images/joomla-favicon-pinned.svg" rel="mask-icon"/>
  <link href="/media/system/css/joomla-fontawesome.min.css?7adb883af089f68d28309c3f304b5284" rel="lazy-stylesheet">
   <noscript>
    <link href="/media/system/css/joomla-fontawesome.min.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
   </noscript>
   <link href="/media/cache/com_latestnewsenhancedpro/style_articles_blog_100570.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
   <link href="/media/syw/css/fonts.min.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
   <link href="/media/vendor/chosen/css/chosen.css?1.8.7" rel="stylesheet"/>
   <link href="/media/templates/site/cassiopeia/css/global/colors_standard.min.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
   <link href="/media/templates/site/cassiopeia/css/template.min.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
   <link href="/media/templates/site/cassiopeia/css/vendor/joomla-custom-elements/joomla-alert.min.css?0.2.0" rel="stylesheet"/>
   <link href="/media/templates/site/cassiopeia_sas/css/user.css?7adb883af089f68d28309c3f304b5284" rel="stylesheet"/>
   <link href="/media/com_jce/site/css/content.min.css?badb4208be409b1335b815dde676300e" rel="stylesheet"/>
   <style>
    .element-invisible { position: absolute !important; height: 1px; width: 1px; overflow: hidden; clip: rect(1px, 1px, 1px, 1px); }
   </style>
   <style>
    @media (min-width: 768px) {#lnepmodal {max-width
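
To illustrate what navigating the parsed page looks like, here is a small sketch that reads a few values out of the head section. It runs on a shortened stand-in for the HTML printed above (an assumption made so the example works offline), not on the live page:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page; only the tags used below matter.
html = """
<html dir="ltr" lang="en-gb">
 <head>
  <meta content="Joomla! - Open Source Content Management" name="generator"/>
  <title>
   Faculty
  </title>
 </head>
 <body></body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns the first tag with that name.
page_title = soup.title.get_text(strip=True)

# find() accepts attribute filters; tag["content"] reads an attribute value.
# The attrs dict is needed because `name` is already find()'s first argument.
generator = soup.find("meta", attrs={"name": "generator"})["content"]

# Attributes of the root tag are read the same way.
language = soup.html["lang"]

print(page_title, generator, language, sep=" | ")
```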

Retrieving nodes

By inspecting the webpage with devtools in a browser (Firefox, Chrome), we see that the information for each faculty member is in a div node with CSS class .latestnews-item:

faculty_items = soup.find_all("div", class_="latestnews-item")
faculty_items[0]
<div class="latestnews-item id-368 catid-130 head_left"> <div class="news"> <div class="innernews"> <div class="newshead picturetype"> <div class="picture"> <div class="innerpicture"> <a aria-label="Read more about Pierre Bellec" class="hasTooltip" href="/people-pages/faculty/people/368-pierre-bellec" title="Pierre Bellec"> <img alt="Pierre Bellec" height="200" loading="eager" src="/media/cache/com_latestnewsenhancedpro/thumb_articles_blog_100570_368.jpg?3ca7e6dd0d3720780e6df12574e73403" width="200"/> </a> </div> </div> </div> <div class="newsinfo"> <h2 class="newstitle"> <a aria-label="Read more about Pierre Bellec" class="hasTooltip" href="/people-pages/faculty/people/368-pierre-bellec" title="Pierre Bellec"> <span>Pierre Bellec</span> </a> </h2> <dl class="item_details before_text"><dt>Information</dt><dd class="newsextra"><span class="detail detail_jfield_text detail_jfield_2"><span class="detail_data">Associate Professor</span></span></dd><dd class="newsextra"><span class="detail detail_jfield_url detail_jfield_5"><span class="detail_data"><a href="mailto:pcb71@stat.rutgers.edu">pcb71@stat.rutgers.edu</a></span></span></dd></dl> </div> </div> </div> </div>

Here is the prettified HTML of this node:

print(faculty_items[0].prettify())
<div class="latestnews-item id-368 catid-130 head_left">
 <div class="news">
  <div class="innernews">
   <div class="newshead picturetype">
    <div class="picture">
     <div class="innerpicture">
      <a aria-label="Read more about Pierre Bellec" class="hasTooltip" href="/people-pages/faculty/people/368-pierre-bellec" title="Pierre Bellec">
       <img alt="Pierre Bellec" height="200" loading="eager" src="/media/cache/com_latestnewsenhancedpro/thumb_articles_blog_100570_368.jpg?3ca7e6dd0d3720780e6df12574e73403" width="200"/>
      </a>
     </div>
    </div>
   </div>
   <div class="newsinfo">
    <h2 class="newstitle">
     <a aria-label="Read more about Pierre Bellec" class="hasTooltip" href="/people-pages/faculty/people/368-pierre-bellec" title="Pierre Bellec">
      <span>
       Pierre Bellec
      </span>
     </a>
    </h2>
    <dl class="item_details before_text">
     <dt>
      Information
     </dt>
     <dd class="newsextra">
      <span class="detail detail_jfield_text detail_jfield_2">
       <span class="detail_data">
        Associate Professor
       </span>
      </span>
     </dd>
     <dd class="newsextra">
      <span class="detail detail_jfield_url detail_jfield_5">
       <span class="detail_data">
        <a href="mailto:pcb71@stat.rutgers.edu">
         pcb71@stat.rutgers.edu
        </a>
       </span>
      </span>
     </dd>
    </dl>
   </div>
  </div>
 </div>
</div>
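
Note that the node carries several classes (latestnews-item, id-368, catid-130, head_left), yet the class_ filter matches on latestnews-item alone. The sketch below shows this on a small made-up snippet (the names and ids are placeholders), together with the equivalent CSS-selector form, select, which mirrors the .latestnews-item notation seen in devtools:

```python
from bs4 import BeautifulSoup

# Made-up snippet imitating the structure of the faculty cards.
html = """
<div class="latestnews-item id-1 catid-130 head_left"><span>Person A</span></div>
<div class="latestnews-item id-2 catid-130 head_left"><span>Person B</span></div>
<div class="something-else"><span>Not a faculty card</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag if ANY of its classes equals the given value.
by_class = soup.find_all("div", class_="latestnews-item")

# The same query written as a CSS selector.
by_selector = soup.select("div.latestnews-item")

names = [item.find("span").get_text(strip=True) for item in by_selector]
print(len(by_class), names)
```

Either form works here; select is convenient when the selector is copied directly from the browser inspector.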

Retrieving data from a node

We can see that the faculty member's name is in the first span tag of the node:

faculty_items[0].find("span")
<span>Pierre Bellec</span>

Or, to get only the text of the node:

faculty_items[0].find("span").text
'Pierre Bellec'

The title is in a span tag with class detail_jfield_2:

faculty_items[0].find("span", class_="detail_jfield_2").text
'Associate Professor'

Finally, the email is given in a span tag with class detail_jfield_5:

faculty_items[0].find("span", class_="detail_jfield_5").text
'pcb71@stat.rutgers.edu'
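
The lookups above assume every node contains all three fields. If a field were missing, find would return None and accessing .text on it would raise an AttributeError. As a hedged sketch, the hypothetical helper field_text below returns None instead and also strips surrounding whitespace; the HTML is a made-up snippet in which the second card deliberately lacks a title and an email:

```python
from bs4 import BeautifulSoup

# Made-up snippet; the second card has no title or email on purpose.
html = """
<div class="latestnews-item">
 <h2 class="newstitle"><a href="/people/1"><span> Person A </span></a></h2>
 <span class="detail detail_jfield_text detail_jfield_2">
  <span class="detail_data">Professor</span>
 </span>
 <span class="detail detail_jfield_url detail_jfield_5">
  <span class="detail_data"><a href="mailto:a@example.org">a@example.org</a></span>
 </span>
</div>
<div class="latestnews-item">
 <h2 class="newstitle"><a href="/people/2"><span>Person B</span></a></h2>
</div>
"""


def field_text(item, css_class=None):
    """Return the stripped text of the first matching span, or None.

    Hypothetical helper, not part of BeautifulSoup.
    """
    if css_class is None:
        node = item.find("span")
    else:
        node = item.find("span", class_=css_class)
    return node.get_text(strip=True) if node is not None else None


items = BeautifulSoup(html, "html.parser").find_all("div", class_="latestnews-item")
records = [
    {
        "name": field_text(item),
        "title": field_text(item, "detail_jfield_2"),
        "email": field_text(item, "detail_jfield_5"),
    }
    for item in items
]
print(records)
```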

Full code to obtain a pandas DataFrame

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = "https://statistics.rutgers.edu/people-pages/faculty"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find all elements with class "latestnews-item"
faculty_items = soup.find_all("div", class_="latestnews-item")
data = []

# Build one record (dict) per faculty member
for item in faculty_items:
    data.append({"name": item.find("span").text,  # first span holds the name
                 "title": item.find("span", class_="detail_jfield_2").text,
                 "email": item.find("span", class_="detail_jfield_5").text})

pd.DataFrame(data)
    name                  title                                              email
 0  Pierre Bellec         Associate Professor                                pcb71@stat.rutgers.edu
 1  Matteo Bonvini        Assistant Professor                                mb1662@stat.rutgers.edu
 2  Steve Buyske          Associate Professor; Undergraduate Co-Director...  buyske@stat.rutgers.edu
 3  Javier Cabrera        Professor                                          cabrera@stat.rutgers.edu
 4  Rong Chen             Distinguished Professor and Chair                  rongchen@stat.rutgers.edu
 5  Yaqing Chen           Assistant Professor                                yqchen@stat.rutgers.edu
 6  Harry Crane           Professor                                          hcrane@stat.rutgers.edu
 7  Tirthankar DasGupta   Professor and Co-Graduate Director                 tirthankar.dasgupta@rutgers.edu
 8  Ruobin Gong           Assistant Professor                                ruobin.gong@rutgers.edu
 9  Zijian Guo            Associate Professor                                zijguo@stat.rutgers.edu
10  Qiyang Han            Associate Professor                                qh85@stat.rutgers.edu
11  Donald R. Hoover      Professor                                          drhoover@stat.rutgers.edu
12  Ying Hung             Professor                                          yhung@stat.rutgers.edu
13  Koulik Khamaru        Assistant Professor                                kk1241@stat.rutgers.edu
14  John Kolassa          Distinguished Professor                            kolassa@stat.rutgers.edu
15  Regina Y. Liu         Distinguished Professor                            rliu@stat.rutgers.edu
16  Gemma Moran           Assistant Professor                                gm845@stat.rutgers.edu
17  Nicole Pashley        Assistant Professor                                np755@stat.rutgers.edu
18  Harold B. Sackrowitz  Distinguished Professor and Undergraduate Dire...  sackrowi@stat.rutgers.edu
19  Michael L. Stein      Distinguished Professor                            ms2870@stat.rutgers.edu
20  Zhiqiang Tan          Distinguished Professor                            ztan@stat.rutgers.edu
21  David E. Tyler        Distinguished Professor                            dtyler@stat.rutgers.edu
22  Guanyang Wang         Assistant Professor                                guanyang.wang@rutgers.edu
23  Sijian Wang           Professor and Co-Director of FSRM and MSDS pro...  sijian.wang@stat.rutgers.edu
24  Han Xiao              Professor and Co-Graduate Director                 hxiao@stat.rutgers.edu
25  Minge Xie             Distinguished Professor and Director, Office o...  mxie@stat.rutgers.edu
26  Min Xu                Assistant Professor                                mx76@stat.rutgers.edu
27  Cun-Hui Zhang         Distinguished Professor and Co-Director of FSR...  czhang@stat.rutgers.edu
28  Linjun Zhang          Assistant Professor                                linjun.zhang@rutgers.edu
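
Some titles in the output above end in "..." because pandas shortens long strings when displaying a DataFrame; the underlying data is not cut. As a small sketch, assuming a recent pandas version in which the display.max_colwidth option accepts None, the display limit can be lifted temporarily. The one-row DataFrame is made-up placeholder data:

```python
import pandas as pd

# Made-up row with a title long enough to be shortened by default.
df = pd.DataFrame(
    [
        {
            "name": "Person A",
            "title": "Distinguished Professor and Director of a Hypothetical "
                     "Office With a Long Name",
        }
    ]
)

# Inside the context manager, column text is displayed without truncation;
# the previous setting is restored on exit.
with pd.option_context("display.max_colwidth", None):
    full_text = str(df)

print(full_text)
```

The same option_context call can wrap the display of the faculty DataFrame to show the complete titles.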