Pandas string manipulations

We start from some text values scraped in a previous lecture: the course headings on the Rutgers MSDS course-description page.

import pandas as pd
from bs4 import BeautifulSoup
import requests

# Course-description page of the Rutgers MSDS program
url = "https://msds-stat.rutgers.edu/msds-academics/msds-coursedesc"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of the <h2> heading inside each "blog-item" block
serie = pd.Series([element.find('h2').text
                   for element in soup.find_all("div", class_='blog-item')])
serie
0     \n                                    16:198:5...
1     \n                                    16:198:5...
2     \n                                    16:198:5...
3     \n                                    16:198:5...
4     \n                                    16:332:5...
5     \n                                    16:954:5...
6     \n                                    16:954:5...
7     \n                                    16:954:5...
8     \n                                    16:954:5...
9     \n                                    16:954:5...
10    \n                                    16:954:5...
11    \n                                    16:958:5...
12    \n                                    16:958:5...
13    \n                                    16:958:5...
dtype: object
serie = serie.astype('string')
serie.dtype
string[python]

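Accessing string methods of pandas Series/DataFrame string values

We can apply string methods element-wise to a Series of strings through the .str accessor. A DataFrame has no .str accessor of its own, so select a column (which is a Series) first.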
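As a minimal sketch of the accessor, the block below uses a small hand-typed sample instead of the scraped data. The names `sample`, `upper`, `df` and `lengths` are illustrative assumptions, and the two strings only mimic the scraped headings.

```python
import pandas as pd

# Hypothetical sample mimicking two scraped headings (not the live page).
sample = pd.Series(
    ["\n   16:198:521 Linear Programming (3)\n",
     "\n   16:958:588 Financial Data Mining (3)  "],
    dtype="string",
)

# .str applies a string method to every value of the Series;
# calls can be chained, each one returning a new Series.
upper = sample.str.strip().str.upper()

# A DataFrame has no .str accessor: pick a column (a Series) first.
df = pd.DataFrame({"heading": sample})
lengths = df["heading"].str.strip().str.len()
```

Each `.str` call returns a new Series and leaves `sample` unchanged, so the result must be assigned if it is needed later.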

serie.str.strip()
0     16:198:512 Introduction to Data Structures and...
1                     16:198:521 Linear Programming (3)
2            16:198:539 Database Management Systems (3)
3           16:198:541 Advanced Database Management (3)
4     16:332:509 Convex Optimization for Engineering...
5     16:954:534 Statistical Learning for Data Scien...
6       16:954:567 Statistical Models and Computing (3)
7     16:954:577 Advanced Analytics using Statistica...
8     16:954:581 Probability and Statistical Inferen...
9     16:954:596 Regression and Time Series Analysis...
10          16:954:597 Data Wrangling and Husbandry (3)
11           16:958:587 Advanced Simulation Methods (3)
12                 16:958:588 Financial Data Mining (3)
13    16:958:589 Advanced Programming for Financial ...
dtype: string
serie.str.strip().str.removesuffix('(3)')
0     16:198:512 Introduction to Data Structures and...
1                        16:198:521 Linear Programming 
2               16:198:539 Database Management Systems 
3              16:198:541 Advanced Database Management 
4     16:332:509 Convex Optimization for Engineering...
5     16:954:534 Statistical Learning for Data Science 
6          16:954:567 Statistical Models and Computing 
7     16:954:577 Advanced Analytics using Statistica...
8     16:954:581 Probability and Statistical Inferen...
9     16:954:596 Regression and Time Series Analysis...
10             16:954:597 Data Wrangling and Husbandry 
11              16:958:587 Advanced Simulation Methods 
12                    16:958:588 Financial Data Mining 
13    16:958:589 Advanced Programming for Financial ...
dtype: string
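Note that the output above still ends with a space, because removesuffix('(3)') removes only the literal suffix and not the blank before it. A possible way to finish the cleanup, sketched on a hand-typed sample (the names `sample` and `cleaned` are illustrative), is to chain one more rstrip. As far as I know, Series.str.removesuffix needs pandas 1.4 or later.

```python
import pandas as pd

# Hypothetical sample mimicking two scraped headings.
sample = pd.Series(
    ["  16:198:521 Linear Programming (3) \n",
     "16:954:597 Data Wrangling and Husbandry (3)"],
    dtype="string",
)

cleaned = (
    sample.str.strip()            # drop surrounding whitespace and newlines
          .str.removesuffix("(3)")  # drop the literal credit suffix only
          .str.rstrip()           # drop the blank left before the suffix
)
```

Since the suffix is matched literally, a heading ending in a different credit value would be left untouched by this chain.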

Counting occurrences of a substring

serie.str.count(':')
0     2
1     2
2     2
3     2
4     2
5     2
6     2
7     2
8     2
9     2
10    2
11    2
12    2
13    2
dtype: Int64
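One point worth keeping in mind: str.count interprets its argument as a regular expression, not as a plain substring. The sketch below, on a hand-typed sample (names are illustrative), counts a plain character, a character class, and a literal parenthesis escaped with re.escape.

```python
import re
import pandas as pd

# Hypothetical sample mimicking two cleaned headings.
sample = pd.Series(
    ["16:198:521 Linear Programming (3)",
     "16:958:588 Financial Data Mining (3)"],
    dtype="string",
)

colons = sample.str.count(":")     # ":" has no special meaning in a regex
digits = sample.str.count(r"\d")   # a regex class: counts every digit

# "(" is a regex metacharacter, so passing it unescaped would raise an error.
# re.escape turns it into the literal pattern "\(".
parens = sample.str.count(re.escape("("))
```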

Regular expressions in pandas

serie.str.strip()
0     16:198:512 Introduction to Data Structures and...
1                     16:198:521 Linear Programming (3)
2            16:198:539 Database Management Systems (3)
3           16:198:541 Advanced Database Management (3)
4     16:332:509 Convex Optimization for Engineering...
5     16:954:534 Statistical Learning for Data Scien...
6       16:954:567 Statistical Models and Computing (3)
7     16:954:577 Advanced Analytics using Statistica...
8     16:954:581 Probability and Statistical Inferen...
9     16:954:596 Regression and Time Series Analysis...
10          16:954:597 Data Wrangling and Husbandry (3)
11           16:958:587 Advanced Simulation Methods (3)
12                 16:958:588 Financial Data Mining (3)
13    16:958:589 Advanced Programming for Financial ...
dtype: string
expression = r"^(?P<school>\d{2}):(?P<program>\d{3})"
serie.str.strip().str.extract(expression)
    school  program
0       16      198
1       16      198
2       16      198
3       16      198
4       16      332
5       16      954
6       16      954
7       16      954
8       16      954
9       16      954
10      16      954
11      16      958
12      16      958
13      16      958
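The named groups (?P<school>...) and (?P<program>...) become the column names of the DataFrame returned by str.extract. As a small sketch on a hand-typed sample (the second string is a made-up non-matching value), rows where the pattern does not match come back as missing values rather than raising an error.

```python
import pandas as pd

# Hypothetical sample: one heading that matches, one string that does not.
sample = pd.Series(
    ["16:198:521 Linear Programming (3)", "no course code here"],
    dtype="string",
)

expression = r"^(?P<school>\d{2}):(?P<program>\d{3})"

# One column per named group; one row per input value.
parts = sample.str.extract(expression)

# Boolean mask of the rows where the pattern did not match.
unmatched = parts["school"].isna()
```

Checking `unmatched.sum()` after an extraction is a cheap way to see how many values the pattern missed.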
expression_full = r"(?P<school>\d{2}):(?P<program>\d{3}):(?P<course>\d{3})\s(?P<title>[\w\s]+)\s\((?P<credits>\d+)\)"
serie.str.strip().str.extract(expression_full)
    school  program  course                                               title  credits
0       16      198     512      Introduction to Data Structures and Algorithms        3
1       16      198     521                                  Linear Programming        3
2       16      198     539                         Database Management Systems        3
3       16      198     541                        Advanced Database Management        3
4       16      332     509    Convex Optimization for Engineering Applications        3
5       16      954     534               Statistical Learning for Data Science        3
6       16      954     567                    Statistical Models and Computing        3
7       16      954     577       Advanced Analytics using Statistical Software        3
8       16      954     581   Probability and Statistical Inference for Data...        3
9       16      954     596  Regression and Time Series Analysis for Data S...        3
10      16      954     597                        Data Wrangling and Husbandry        3
11      16      958     587                         Advanced Simulation Methods        3
12      16      958     588                               Financial Data Mining        3
13      16      958     589  Advanced Programming for Financial Statistics ...        3
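Two follow-ups on this extraction, sketched on a hand-typed sample. First, every extracted column is still text, so credits has to be converted before doing arithmetic. Second, the title group [\w\s]+ accepts only word characters and whitespace, so a title containing punctuation would not match. The third heading below is a made-up example with a hyphen, not a course from the page, and the names `courses`, `matched` and `total_credits` are illustrative.

```python
import pandas as pd

# Hypothetical sample: two real-looking headings and one invented heading
# whose title contains a hyphen.
sample = pd.Series(
    ["16:198:521 Linear Programming (3)",
     "16:954:597 Data Wrangling and Husbandry (3)",
     "16:000:000 Data-Driven Methods (3)"],
    dtype="string",
)

expression_full = (
    r"(?P<school>\d{2}):(?P<program>\d{3}):(?P<course>\d{3})"
    r"\s(?P<title>[\w\s]+)\s\((?P<credits>\d+)\)"
)
courses = sample.str.extract(expression_full)

# The hyphenated title does not match, so its whole row is missing.
matched = courses.dropna()

# Extracted values are strings; convert before summing.
total_credits = pd.to_numeric(matched["credits"]).sum()
```

If such titles needed to be captured, one option would be to widen the character class of the title group, at the cost of a less strict pattern.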