Lec 12 - Split-apply-combine paradigm

Split

Create ‘groups’ or ‘splits’ of the data, for instance:

  • groups based on one or more categorical variables
  • rolling windows for time-series data

Apply (aggregate, transform, filter)

  • Aggregation: compute a summary statistic (or statistics) for each group.
    → Compute group sums or means, or sizes/counts within each group

  • Transformation: perform some group-specific computations and return a like-indexed object. Some examples:
    → Standardize data within a group (z-score, or subtracting the min and dividing by max − min).
    → Fill NAs within groups with a value derived from each group (e.g., the group mean)

  • Filtration: discard some groups, according to a group-wise computation that evaluates to True or False. Some examples:
    → Discard data that belong to groups with fewer than 10 rows.
    → Filter out groups that have more than 30% missing data.

Combine

After the aggregation, transformation or filtration is done, combine the data back together as one:

  • Combine the aggregated summary statistics of all groups into a pivot table
  • Combine the transformed columns back into a DataFrame of the same length as the original
  • After a filtration, get back the original data with the filtered-out rows discarded
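
A minimal sketch of all three steps on a toy table (the names here are made up for illustration):

import pandas as pd

toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b'],
                    'value': [1.0, 3.0, 2.0, 4.0, 6.0]})
g = toy.groupby('group')['value']        # split

g.mean()                                 # aggregate: one number per group
g.transform(lambda x: x - x.mean())      # transform: same length as toy
g.filter(lambda x: len(x) >= 3)          # filter: keep rows of groups with >= 3 members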
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('rates.csv')
df.index = df['Time'].astype(np.dtype('datetime64[ns]'))  # parse the Time column as a datetime index
df = df.drop(columns='Time')
df = df.sort_index()
df.head()
USD JPY BGN CZK DKK GBP CHF
Time
2024-01-18 1.0875 160.89 1.9558 24.734 7.4571 0.85773 0.9432
2024-01-19 1.0887 161.17 1.9558 24.813 7.4575 0.85825 0.9459
2024-01-22 1.0890 160.95 1.9558 24.758 7.4585 0.85575 0.9458
2024-01-23 1.0872 160.88 1.9558 24.824 7.4574 0.85493 0.9446
2024-01-24 1.0905 160.46 1.9558 24.786 7.4568 0.85543 0.9415

Row or column-wise function application

df.apply(...) with functions returning scalars (“reduce”)

df.apply('mean')
USD      1.082711
JPY    162.195484
BGN      1.955800
CZK     25.199081
DKK      7.456484
GBP      0.855201
CHF      0.957208
dtype: float64
f = lambda x: (x+1)**2
f(3)
16
# lambda is a quick way to construct a function, usually in one line
df.apply(lambda col: np.mean(col)/np.std(col))
USD    1.615824e+02
JPY    1.007120e+02
BGN    4.404070e+15
CZK    1.151246e+02
DKK    3.479387e+03
GBP    5.131824e+02
CHF    6.359713e+01
dtype: float64
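
Note that this cell mixes NumPy and pandas conventions: np.std computes the population standard deviation (ddof=0), while Series.std defaults to the sample one (ddof=1), so the two differ slightly. A quick check:

s = pd.Series([1.0, 2.0, 3.0, 4.0])
np.std(s)            # 1.118..., population std (ddof=0)
s.std()              # 1.290..., sample std (ddof=1)
np.std(s, ddof=1)    # matches s.std()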

df.apply(...) with functions returning Series (“broadcast”)

# centering and normalizing
df.apply(lambda col: (col - np.mean(col)) / col.std(), axis=0)
USD JPY BGN CZK DKK GBP CHF
Time
2024-01-18 0.708874 -0.804050 0.991903 -2.107564 0.285174 1.505410 -0.923163
2024-01-19 0.886510 -0.631598 0.991903 -1.749567 0.470313 1.814921 -0.745227
2024-01-22 0.930919 -0.767096 0.991903 -1.998805 0.933160 0.326887 -0.751817
2024-01-23 0.664465 -0.810209 0.991903 -1.699719 0.424028 -0.161188 -0.830900
2024-01-24 1.152964 -1.068888 0.991903 -1.871920 0.146319 0.136419 -1.035197
... ... ... ... ... ... ... ...
2024-04-10 0.486828 1.659558 0.991903 0.765476 1.349722 -0.030241 1.567942
2024-04-11 -1.452368 1.222268 0.991903 0.874235 1.812570 0.029281 1.416367
2024-04-12 -2.592200 0.594048 0.991903 0.624997 1.766285 -0.571885 0.948461
2024-04-15 -2.532988 1.142200 0.991903 0.566086 1.905139 -0.684975 1.007773
2024-04-16 -2.814246 1.443993 0.991903 0.049482 2.043993 -0.476651 0.922100

62 rows × 7 columns

# these dates are not in the index, so .loc creates new rows (appended at the end)
df.loc['2023-11-07', 'USD'] = pd.NA
df.loc['2023-11-08', 'JPY'] = pd.NA
df.loc['2023-11-06', 'BGN'] = pd.NA
df.loc['2023-11-09', 'BGN'] = pd.NA
df.head()
USD JPY BGN CZK DKK GBP CHF
Time
2024-01-18 1.0875 160.89 1.9558 24.734 7.4571 0.85773 0.9432
2024-01-19 1.0887 161.17 1.9558 24.813 7.4575 0.85825 0.9459
2024-01-22 1.0890 160.95 1.9558 24.758 7.4585 0.85575 0.9458
2024-01-23 1.0872 160.88 1.9558 24.824 7.4574 0.85493 0.9446
2024-01-24 1.0905 160.46 1.9558 24.786 7.4568 0.85543 0.9415
df.apply(pd.Series.interpolate).head()  # interpolate each column; head() is unchanged since the NAs sit in the appended rows
USD JPY BGN CZK DKK GBP CHF
Time
2024-01-18 1.0875 160.89 1.9558 24.734 7.4571 0.85773 0.9432
2024-01-19 1.0887 161.17 1.9558 24.813 7.4575 0.85825 0.9459
2024-01-22 1.0890 160.95 1.9558 24.758 7.4585 0.85575 0.9458
2024-01-23 1.0872 160.88 1.9558 24.824 7.4574 0.85493 0.9446
2024-01-24 1.0905 160.46 1.9558 24.786 7.4568 0.85543 0.9415

Aggregation API: several .apply at once

.agg with a single function is equivalent to .apply

df.agg('mean')
USD      1.082711
JPY    162.195484
BGN      1.955800
CZK     25.199081
DKK      7.456484
GBP      0.855201
CHF      0.957208
dtype: float64
df.agg('mean') - df.apply('mean')
USD    0.0
JPY    0.0
BGN    0.0
CZK    0.0
DKK    0.0
GBP    0.0
CHF    0.0
dtype: float64
def maximum_plus_two(col):
    return col.max() + 2

df.agg(['mean',
        'median',
        lambda col: col.max(),
        maximum_plus_two
       ])
USD JPY BGN CZK DKK GBP CHF
mean 1.082711 162.195484 1.9558 25.199081 7.456484 0.855201 0.957208
median 1.083900 162.150000 1.9558 25.297000 7.456500 0.855125 0.956150
<lambda> 1.093900 164.970000 1.9558 25.460000 7.460900 0.858460 0.984600
maximum_plus_two 3.093900 166.970000 3.9558 27.460000 9.460900 2.858460 2.984600
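
The lambda shows up as <lambda> in the result’s index, while the def’d function keeps its name. As with the partial objects in the custom .describe below, setting __name__ gives an anonymous function a readable label (a sketch):

col_max = lambda col: col.max()
col_max.__name__ = "max"    # label used in the result's index
df.agg(['mean', col_max])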
df.agg(['mean',
        'median',
        lambda col: col.max(),
        maximum_plus_two
       ], axis=1) # applying these aggregation functions across rows does not really make sense here
/home/pierro/mambaforge/lib/python3.10/site-packages/numpy/lib/nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
(the warning repeats once for each all-NaN row)
mean median <lambda> maximum_plus_two
Time
2024-01-18 28.275047 1.9558 160.89 162.89
2024-01-19 28.327021 1.9558 161.17 163.17
2024-01-22 28.287550 1.9558 160.95 162.95
2024-01-23 28.286276 1.9558 160.88 162.88
2024-01-24 28.220861 1.9558 160.46 162.46
... ... ... ... ...
2024-04-16 28.865143 1.9558 164.54 166.54
2023-11-07 NaN NaN NaN NaN
2023-11-08 NaN NaN NaN NaN
2023-11-06 NaN NaN NaN NaN
2023-11-09 NaN NaN NaN NaN

66 rows × 4 columns

Different aggregation functions for different columns

df.agg({"USD": ["mean", "median"],
        "JPY": "sum"})
USD JPY
mean 1.082711 NaN
median 1.083900 NaN
sum NaN 10056.12
penguins = sns.load_dataset("penguins")
penguins.agg({'body_mass_g': ['mean', 'std'],
              'sex': 'unique'})
body_mass_g sex
mean 4201.754386 NaN
std 801.954536 NaN
unique NaN [Male, Female, nan]

Custom .describe

from functools import partial
q_05 = partial(pd.Series.quantile, q=0.05)
q_05.__name__ = "05%"
q_95 = partial(pd.Series.quantile, q=0.95)
q_95.__name__ = "95%"
df.agg(["count", "mean", "std", "min", q_05, "median", q_95, "max"])
USD JPY BGN CZK DKK GBP CHF
count 62.000000 62.000000 6.200000e+01 62.000000 62.000000 62.000000 62.000000
mean 1.082711 162.195484 1.955800e+00 25.199081 7.456484 0.855201 0.957208
std 0.006755 1.623635 2.238572e-16 0.220672 0.002161 0.001680 0.015174
min 1.063700 158.960000 1.955800e+00 24.734000 7.453600 0.850980 0.931500
05% 1.071380 159.710500 1.955800e+00 24.759400 7.453905 0.852504 0.934705
median 1.083900 162.150000 1.955800e+00 25.297000 7.456500 0.855125 0.956150
95% 1.092455 164.682500 1.955800e+00 25.414800 7.460395 0.857946 0.981000
max 1.093900 164.970000 1.955800e+00 25.460000 7.460900 0.858460 0.984600

Transform API

The transform() method returns an object that is indexed the same (same size) as the original

df.transform(np.abs)
USD JPY BGN CZK DKK GBP CHF
Time
2024-01-18 1.0875 160.89 1.9558 24.734 7.4571 0.85773 0.9432
2024-01-19 1.0887 161.17 1.9558 24.813 7.4575 0.85825 0.9459
2024-01-22 1.0890 160.95 1.9558 24.758 7.4585 0.85575 0.9458
2024-01-23 1.0872 160.88 1.9558 24.824 7.4574 0.85493 0.9446
2024-01-24 1.0905 160.46 1.9558 24.786 7.4568 0.85543 0.9415
... ... ... ... ... ... ... ...
2024-04-16 1.0637 164.54 1.9558 25.210 7.4609 0.85440 0.9712
2023-11-07 NaN NaN NaN NaN NaN NaN NaN
2023-11-08 NaN NaN NaN NaN NaN NaN NaN
2023-11-06 NaN NaN NaN NaN NaN NaN NaN
2023-11-09 NaN NaN NaN NaN NaN NaN NaN

66 rows × 7 columns

df.transform(lambda col: np.mean(col)) # error: the result must keep the same index/shape as the input
ValueError: Function did not transform
df.transform(lambda col: col - np.mean(col))
USD JPY BGN CZK DKK GBP CHF
Time
2024-01-18 0.004789 -1.305484 -2.220446e-16 -0.465081 0.000616 0.002529 -0.014008
2024-01-19 0.005989 -1.025484 -2.220446e-16 -0.386081 0.001016 0.003049 -0.011308
2024-01-22 0.006289 -1.245484 -2.220446e-16 -0.441081 0.002016 0.000549 -0.011408
2024-01-23 0.004489 -1.315484 -2.220446e-16 -0.375081 0.000916 -0.000271 -0.012608
2024-01-24 0.007789 -1.735484 -2.220446e-16 -0.413081 0.000316 0.000229 -0.015708
... ... ... ... ... ... ... ...
2024-04-16 -0.019011 2.344516 -2.220446e-16 0.010919 0.004416 -0.000801 0.013992
2023-11-07 NaN NaN NaN NaN NaN NaN NaN
2023-11-08 NaN NaN NaN NaN NaN NaN NaN
2023-11-06 NaN NaN NaN NaN NaN NaN NaN
2023-11-09 NaN NaN NaN NaN NaN NaN NaN

66 rows × 7 columns

def norm_mean_std(col):
    return (col - col.mean())/col.std()

def norm_min_max(col):
    return (col - col.min())/(col.max() - col.min())

df.transform([norm_mean_std, norm_min_max])
USD JPY BGN CZK DKK GBP CHF
norm_mean_std norm_min_max norm_mean_std norm_min_max norm_mean_std norm_min_max norm_mean_std norm_min_max norm_mean_std norm_min_max norm_mean_std norm_min_max norm_mean_std norm_min_max
Time
2024-01-18 0.708874 0.788079 -0.804050 0.321131 -0.991903 NaN -2.107564 0.000000 0.285174 0.479452 1.505410 0.902406 -0.923163 0.220339
2024-01-19 0.886510 0.827815 -0.631598 0.367720 -0.991903 NaN -1.749567 0.108815 0.470313 0.534247 1.814921 0.971925 -0.745227 0.271186
2024-01-22 0.930919 0.837748 -0.767096 0.331115 -0.991903 NaN -1.998805 0.033058 0.933160 0.671233 0.326887 0.637701 -0.751817 0.269303
2024-01-23 0.664465 0.778146 -0.810209 0.319468 -0.991903 NaN -1.699719 0.123967 0.424028 0.520548 -0.161188 0.528075 -0.830900 0.246704
2024-01-24 1.152964 0.887417 -1.068888 0.249584 -0.991903 NaN -1.871920 0.071625 0.146319 0.438356 0.136419 0.594920 -1.035197 0.188324
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2024-04-16 -2.814246 0.000000 1.443993 0.928453 -0.991903 NaN 0.049482 0.655647 2.043993 1.000000 -0.476651 0.457219 0.922100 0.747646
2023-11-07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2023-11-08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2023-11-06 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2023-11-09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

66 rows × 14 columns

df.transform({'USD': [norm_min_max, norm_mean_std],
              'JPY': norm_min_max})
USD JPY
norm_min_max norm_mean_std norm_min_max
Time
2024-01-18 0.788079 0.708874 0.321131
2024-01-19 0.827815 0.886510 0.367720
2024-01-22 0.837748 0.930919 0.331115
2024-01-23 0.778146 0.664465 0.319468
2024-01-24 0.887417 1.152964 0.249584
... ... ... ...
2024-04-16 0.000000 -2.814246 0.928453
2023-11-07 NaN NaN NaN
2023-11-08 NaN NaN NaN
2023-11-06 NaN NaN NaN
2023-11-09 NaN NaN NaN

66 rows × 3 columns

df[['USD', 'JPY']].transform([norm_min_max, norm_mean_std])
USD JPY
norm_min_max norm_mean_std norm_min_max norm_mean_std
Time
2024-01-18 0.788079 0.708874 0.321131 -0.804050
2024-01-19 0.827815 0.886510 0.367720 -0.631598
2024-01-22 0.837748 0.930919 0.331115 -0.767096
2024-01-23 0.778146 0.664465 0.319468 -0.810209
2024-01-24 0.887417 1.152964 0.249584 -1.068888
... ... ... ... ...
2024-04-16 0.000000 -2.814246 0.928453 1.443993
2023-11-07 NaN NaN NaN NaN
2023-11-08 NaN NaN NaN NaN
2023-11-06 NaN NaN NaN NaN
2023-11-09 NaN NaN NaN NaN

66 rows × 4 columns

Transforming only specific columns with certain functions

df.transform({'USD': [norm_mean_std, 
                      norm_min_max],
              'JPY': np.abs}).head(9)
USD JPY
norm_mean_std norm_min_max absolute
Time
2024-01-18 0.708874 0.788079 160.89
2024-01-19 0.886510 0.827815 161.17
2024-01-22 0.930919 0.837748 160.95
2024-01-23 0.664465 0.778146 160.88
2024-01-24 1.152964 0.887417 160.46
2024-01-25 0.975328 0.847682 160.81
2024-01-26 0.649662 0.774834 160.62
2024-01-29 -0.060883 0.615894 160.13
2024-01-30 0.279586 0.692053 159.97

Entry-wise application of arbitrary functions with .map

# entry-wise function application
df.USD.map(lambda u: "more than 1" if u>1 else "less than 1") # convert entries to string
Time
2024-01-18    more than 1
2024-01-19    more than 1
2024-01-22    more than 1
2024-01-23    more than 1
2024-01-24    more than 1
                 ...     
2024-04-16    more than 1
2023-11-07    less than 1
2023-11-08    less than 1
2023-11-06    less than 1
2023-11-09    less than 1
Name: USD, Length: 66, dtype: object
# convert to string and replace dot with the text -DOT-
df.USD.map(lambda u: str(u).replace('.', '-DOT-')) 
Time
2024-01-18    1-DOT-0875
2024-01-19    1-DOT-0887
2024-01-22     1-DOT-089
2024-01-23    1-DOT-0872
2024-01-24    1-DOT-0905
                 ...    
2024-04-16    1-DOT-0637
2023-11-07           nan
2023-11-08           nan
2023-11-06           nan
2023-11-09           nan
Name: USD, Length: 66, dtype: object
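
.map as used here is a Series method; for element-wise application over a whole DataFrame there is applymap, renamed to DataFrame.map in pandas 2.1. A sketch:

# element-wise over every entry of the frame (pandas >= 2.1; use .applymap on older versions)
df[['USD', 'JPY']].map(lambda v: f"{v:.2f}").head()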

Rolling and expanding windows

Rolling windows

df = pd.read_csv('rates.csv')  # reload the clean data (without the NAs inserted above)
df.index = df['Time'].astype(np.dtype('datetime64[ns]'))
df = df.drop(columns='Time')
df = df.sort_index()
df.head()
USD JPY BGN CZK DKK GBP CHF
Time
2024-01-18 1.0875 160.89 1.9558 24.734 7.4571 0.85773 0.9432
2024-01-19 1.0887 161.17 1.9558 24.813 7.4575 0.85825 0.9459
2024-01-22 1.0890 160.95 1.9558 24.758 7.4585 0.85575 0.9458
2024-01-23 1.0872 160.88 1.9558 24.824 7.4574 0.85493 0.9446
2024-01-24 1.0905 160.46 1.9558 24.786 7.4568 0.85543 0.9415
df.rolling(window=3).std()  # rolling std over 3-row windows; the first two rows have too few observations
USD JPY BGN CZK DKK GBP CHF
Time
2024-01-18 NaN NaN NaN NaN NaN NaN NaN
2024-01-19 NaN NaN NaN NaN NaN NaN NaN
2024-01-22 0.000794 0.147422 0.0 0.040501 0.000721 0.001319 0.001531
2024-01-23 0.000964 0.151327 0.0 0.035360 0.000608 0.001729 0.000723
2024-01-24 0.001652 0.265016 0.0 0.033126 0.000862 0.000413 0.002219
... ... ... ... ... ... ... ...
2024-04-10 0.002364 0.291433 0.0 0.013013 0.000306 0.001401 0.000624
2024-04-11 0.007773 0.434856 0.0 0.012000 0.000721 0.000827 0.001650
2024-04-12 0.010516 0.869617 0.0 0.027574 0.000551 0.000557 0.004900
2024-04-15 0.004335 0.555188 0.0 0.036097 0.000153 0.000645 0.003866
2024-04-16 0.001002 0.699595 0.0 0.069874 0.000300 0.000175 0.000666

62 rows × 7 columns

df.rolling(window=3, min_periods=2).sum()  # emit a value as soon as 2 observations are in the window
USD JPY BGN CZK DKK GBP CHF
Time
2024-01-18 NaN NaN NaN NaN NaN NaN NaN
2024-01-19 2.1762 322.06 3.9116 49.547 14.9146 1.71598 1.8891
2024-01-22 3.2652 483.01 5.8674 74.305 22.3731 2.57173 2.8349
2024-01-23 3.2649 483.00 5.8674 74.395 22.3734 2.56893 2.8363
2024-01-24 3.2667 482.29 5.8674 74.368 22.3727 2.56611 2.8319
... ... ... ... ... ... ... ...
2024-04-10 3.2550 494.29 5.8674 76.102 22.3772 2.56973 2.9436
2024-04-11 3.2456 494.04 5.8674 76.140 22.3788 2.56703 2.9416
2024-04-12 3.2241 492.23 5.8674 76.097 22.3801 2.56464 2.9313
2024-04-15 3.2037 491.39 5.8674 76.053 22.3813 2.56354 2.9228
2024-04-16 3.1945 491.75 5.8674 75.871 22.3818 2.56269 2.9153

62 rows × 7 columns

df.rolling(window=5, center=True).mean()  # center=True labels each window at its middle row, so both ends are NaN
USD JPY BGN CZK DKK GBP CHF
Time
2024-01-18 NaN NaN NaN NaN NaN NaN NaN
2024-01-19 NaN NaN NaN NaN NaN NaN NaN
2024-01-22 1.08858 160.870 1.9558 24.7830 7.45746 0.856418 0.94420
2024-01-23 1.08894 160.854 1.9558 24.7874 7.45724 0.855948 0.94396
2024-01-24 1.08862 160.744 1.9558 24.7744 7.45672 0.855034 0.94270
... ... ... ... ... ... ... ...
2024-04-10 1.07862 164.326 1.9558 25.3662 7.45958 0.855844 0.97878
2024-04-11 1.07528 164.250 1.9558 25.3602 7.45994 0.855064 0.97714
2024-04-12 1.07068 164.164 1.9558 25.3262 7.46032 0.854618 0.97500
2024-04-15 NaN NaN NaN NaN NaN NaN NaN
2024-04-16 NaN NaN NaN NaN NaN NaN NaN

62 rows × 7 columns

Expanding windows (cumulative since the start)

df.expanding().max() # useful to compute the highest exchange rate seen so far, since the start of the data
USD JPY BGN CZK DKK GBP CHF
Time
2024-01-18 1.0875 160.89 1.9558 24.734 7.4571 0.85773 0.9432
2024-01-19 1.0887 161.17 1.9558 24.813 7.4575 0.85825 0.9459
2024-01-22 1.0890 161.17 1.9558 24.813 7.4585 0.85825 0.9459
2024-01-23 1.0890 161.17 1.9558 24.824 7.4585 0.85825 0.9459
2024-01-24 1.0905 161.17 1.9558 24.824 7.4585 0.85825 0.9459
... ... ... ... ... ... ... ...
2024-04-10 1.0939 164.97 1.9558 25.460 7.4605 0.85846 0.9846
2024-04-11 1.0939 164.97 1.9558 25.460 7.4605 0.85846 0.9846
2024-04-12 1.0939 164.97 1.9558 25.460 7.4605 0.85846 0.9846
2024-04-15 1.0939 164.97 1.9558 25.460 7.4606 0.85846 0.9846
2024-04-16 1.0939 164.97 1.9558 25.460 7.4609 0.85846 0.9846

62 rows × 7 columns
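
For the maximum, the expanding window is just the cumulative maximum, so df.cummax() gives the same result; a quick check:

(df.expanding().max() == df.cummax()).all()    # True for every column here (the reloaded data has no NaNs)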

Prediction of the value at the next timepoint, and avoiding contamination

df.rolling(window=2, closed='left').mean()  # closed='left' excludes the current row: each value uses only past data
USD JPY BGN CZK DKK GBP CHF
Time
2024-01-18 NaN NaN NaN NaN NaN NaN NaN
2024-01-19 NaN NaN NaN NaN NaN NaN NaN
2024-01-22 1.08810 161.030 1.9558 24.7735 7.45730 0.857990 0.94455
2024-01-23 1.08885 161.060 1.9558 24.7855 7.45800 0.857000 0.94585
2024-01-24 1.08810 160.915 1.9558 24.7910 7.45795 0.855340 0.94520
... ... ... ... ... ... ... ...
2024-04-10 1.08450 164.700 1.9558 25.3670 7.45890 0.857290 0.98130
2024-04-11 1.08635 164.930 1.9558 25.3740 7.45920 0.855890 0.98145
2024-04-12 1.07945 164.535 1.9558 25.3800 7.45990 0.855200 0.97985
2024-04-15 1.06905 163.670 1.9558 25.3645 7.46035 0.854745 0.97515
2024-04-16 1.06540 163.605 1.9558 25.3305 7.46045 0.854145 0.97205

62 rows × 7 columns

Example: predicting the current point with the average of the seven previous ones (excluding the current point)

df['USD'].plot(label='USD exchange rate')
df['USD'].rolling(window=7, closed='left').mean().plot(
    label='estimated by moving average of the last 7 timepoints')
plt.legend()
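
How good is this predictor? A minimal sketch of its mean absolute error (the seven-point moving average vs. the actual rate):

pred = df['USD'].rolling(window=7, closed='left').mean()
(df['USD'] - pred).abs().mean()    # mean absolute prediction error; the initial NaNs are skipped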

EWM: exponential moving window

Only a small number of aggregation functions are implemented for exponentially weighted moving averages: https://pandas.pydata.org/docs/reference/window.html#exponentially-weighted-window-functions

df['USD'].plot(
    label='USD exchange rate')
df.USD.ewm(0.5).mean().plot(  # the first positional argument is com (center of mass)
    label='exponentially weighted average (com=0.5)')
df.USD.ewm(halflife='10 days', times=df.index).mean().plot(
    label='exponentially weighted average (halflife=10 days)')
plt.legend()
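
What ewm(0.5).mean() computes: with alpha = 1/(1+com), the observation i steps in the past gets weight (1-alpha)^i, and the weights are renormalized at every step (pandas's default adjust=True). A minimal sketch verifying this by hand:

s = pd.Series([1.0, 2.0, 3.0, 4.0])
alpha = 1 / (1 + 0.5)                            # com=0.5  ->  alpha = 2/3

def manual_ewm_mean(values, alpha):
    out = []
    for t in range(len(values)):
        w = (1 - alpha) ** np.arange(t, -1, -1)  # weight (1-alpha)^i for the point i steps back
        out.append((w * values[:t + 1]).sum() / w.sum())
    return pd.Series(out)

manual_ewm_mean(s.to_numpy(), alpha)
s.ewm(com=0.5).mean()                            # should match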

Group by: the ‘split’ in split-apply-combine

import seaborn as sns

penguins = sns.load_dataset("penguins")
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

Iterating over groups, selecting groups

penguins.groupby('species').groups.keys()
dict_keys(['Adelie', 'Chinstrap', 'Gentoo'])
penguins.groupby('island').get_group('Dream')
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
30 Adelie Dream 39.5 16.7 178.0 3250.0 Female
31 Adelie Dream 37.2 18.1 178.0 3900.0 Male
32 Adelie Dream 39.5 17.8 188.0 3300.0 Female
33 Adelie Dream 40.9 18.9 184.0 3900.0 Male
34 Adelie Dream 36.4 17.0 195.0 3325.0 Female
... ... ... ... ... ... ... ...
215 Chinstrap Dream 55.8 19.8 207.0 4000.0 Male
216 Chinstrap Dream 43.5 18.1 202.0 3400.0 Female
217 Chinstrap Dream 49.6 18.2 193.0 3775.0 Male
218 Chinstrap Dream 50.8 19.0 210.0 4100.0 Male
219 Chinstrap Dream 50.2 18.7 198.0 3775.0 Female

124 rows × 7 columns

grouped = penguins.groupby(['island'])

for name, df_of_group in grouped[['species', 'bill_depth_mm']]:
    print(name)
    print(df_of_group.head())
Biscoe
   species  bill_depth_mm
20  Adelie           18.3
21  Adelie           18.7
22  Adelie           19.2
23  Adelie           18.1
24  Adelie           17.2
Dream
   species  bill_depth_mm
30  Adelie           16.7
31  Adelie           18.1
32  Adelie           17.8
33  Adelie           18.9
34  Adelie           17.0
Torgersen
  species  bill_depth_mm
0  Adelie           18.7
1  Adelie           17.4
2  Adelie           18.0
3  Adelie            NaN
4  Adelie           19.3

Aggregate within groups

grouped[['bill_depth_mm', 'body_mass_g']].agg(["sum", "mean", "std"])
bill_depth_mm body_mass_g
sum mean std sum mean std
island
Biscoe 2651.1 15.874850 1.820721 787575.0 4716.017964 782.855743
Dream 2274.7 18.344355 1.133116 460400.0 3712.903226 416.644112
Torgersen 939.9 18.429412 1.339447 189025.0 3706.372549 445.107940
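
The “combine into a pivot table” step from the intro can also be done in one call with pivot_table, which groups and reshapes at once; a sketch producing a summary similar to the one above:

penguins.pivot_table(index='island',
                     values=['bill_depth_mm', 'body_mass_g'],
                     aggfunc=['sum', 'mean', 'std'])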

Named aggregation: specific functions applied to specific columns

grouped.agg(
    min_depth=pd.NamedAgg(column="bill_depth_mm", aggfunc="min"),
    max_depth=pd.NamedAgg(column="bill_depth_mm", aggfunc="max"),
    average_weight=pd.NamedAgg(column="body_mass_g", aggfunc="mean"),
)
min_depth max_depth average_weight
island
Biscoe 13.1 21.1 4716.017964
Dream 15.5 21.2 3712.903226
Torgersen 15.9 21.5 3706.372549
grouped.agg({
    "body_mass_g": "sum",
    "bill_depth_mm": "max"
})
body_mass_g bill_depth_mm
island
Biscoe 787575.0 21.1
Dream 460400.0 21.2
Torgersen 189025.0 21.5

Filtration

Filtering groups that have only very few members

penguins.shape
(344, 7)
grouped = penguins.groupby(['island'])
grouped.filter(lambda x: len(x) > 60).shape # keep only groups with more than 60 members
(292, 7)
penguins.island.unique()
array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)
grouped.filter(lambda x: len(x) > 60).island.unique() # the Torgersen group was discarded
array(['Biscoe', 'Dream'], dtype=object)
(penguins.island == 'Torgersen').sum() # indeed, Torgersen has fewer than 60 members
52
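
The same filtration can also be written as a boolean mask via a group-wise transform, which broadcasts each group's size back to its rows; a sketch:

# keep rows whose island group has more than 60 members
mask = penguins.groupby('island')['island'].transform('size') > 60
penguins[mask].shape    # (292, 7), same as grouped.filter above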

Filtering out groups with missing values

def true_if_no_missing_value(x):
    # True when x contains no missing value at all
    return not x.isna().values.any()

# re-insert some NAs into the freshly reloaded rates (again as new rows)
df.loc['2023-11-07', 'USD'] = pd.NA
df.loc['2023-11-08', 'JPY'] = pd.NA
df.loc['2023-11-06', 'BGN'] = pd.NA
df.loc['2023-11-09', 'BGN'] = pd.NA
df.head()
USD JPY BGN CZK DKK GBP CHF
Time
2024-01-18 1.0875 160.89 1.9558 24.734 7.4571 0.85773 0.9432
2024-01-19 1.0887 161.17 1.9558 24.813 7.4575 0.85825 0.9459
2024-01-22 1.0890 160.95 1.9558 24.758 7.4585 0.85575 0.9458
2024-01-23 1.0872 160.88 1.9558 24.824 7.4574 0.85493 0.9446
2024-01-24 1.0905 160.46 1.9558 24.786 7.4568 0.85543 0.9415
true_if_no_missing_value(df['CHF'])
False
true_if_no_missing_value(df['USD'])
False
penguins
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

penguins.loc[0, 'body_mass_g'] = pd.NA
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 NaN Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

Filtering out any group that has a missing value

grouped['body_mass_g'].filter(lambda s: true_if_no_missing_value(s))
30     3250.0
31     3900.0
32     3300.0
33     3900.0
34     3325.0
        ...  
215    4000.0
216    3400.0
217    3775.0
218    4100.0
219    3775.0
Name: body_mass_g, Length: 124, dtype: float64
# discard groups with any missing value in body_mass_g or bill_length_mm columns
grouped.filter(lambda d: true_if_no_missing_value(d['body_mass_g']) and
                         true_if_no_missing_value(d['bill_length_mm']))
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
30 Adelie Dream 39.5 16.7 178.0 3250.0 Female
31 Adelie Dream 37.2 18.1 178.0 3900.0 Male
32 Adelie Dream 39.5 17.8 188.0 3300.0 Female
33 Adelie Dream 40.9 18.9 184.0 3900.0 Male
34 Adelie Dream 36.4 17.0 195.0 3325.0 Female
... ... ... ... ... ... ... ...
215 Chinstrap Dream 55.8 19.8 207.0 4000.0 Male
216 Chinstrap Dream 43.5 18.1 202.0 3400.0 Female
217 Chinstrap Dream 49.6 18.2 193.0 3775.0 Male
218 Chinstrap Dream 50.8 19.0 210.0 4100.0 Male
219 Chinstrap Dream 50.2 18.7 198.0 3775.0 Female

124 rows × 7 columns

Transformation: fill missing values with the group average

grouped['body_mass_g'].transform(lambda x: x.fillna(value=x.mean()))
0      3705.500000
1      3800.000000
2      3250.000000
3      3705.500000
4      3450.000000
          ...     
339    4716.017964
340    4850.000000
341    5750.000000
342    5200.000000
343    5400.000000
Name: body_mass_g, Length: 344, dtype: float64
grouped['body_mass_g'].agg('mean')
island
Biscoe       4716.017964
Dream        3712.903226
Torgersen    3705.500000
Name: body_mass_g, dtype: float64
from pandas.api.types import is_numeric_dtype
# impute the mean, over each group, for all numeric columns
grouped.transform(lambda x: x.fillna(value=x.mean()) if is_numeric_dtype(x) else x)
species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie 39.100000 18.700000 181.000000 3705.500000 Male
1 Adelie 39.500000 17.400000 186.000000 3800.000000 Female
2 Adelie 40.300000 18.000000 195.000000 3250.000000 Female
3 Adelie 38.950980 18.429412 191.196078 3705.500000 NaN
4 Adelie 36.700000 19.300000 193.000000 3450.000000 Female
... ... ... ... ... ... ...
339 Gentoo 45.257485 15.874850 209.706587 4716.017964 NaN
340 Gentoo 46.800000 14.300000 215.000000 4850.000000 Female
341 Gentoo 50.400000 15.700000 222.000000 5750.000000 Male
342 Gentoo 45.200000 14.800000 212.000000 5200.000000 Female
343 Gentoo 49.900000 16.100000 213.000000 5400.000000 Male

344 rows × 6 columns
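
The intro also mentioned standardizing within groups; with transform this is a one-liner (a sketch z-scoring body mass within each island):

penguins.groupby('island')['body_mass_g'].transform(
    lambda x: (x - x.mean()) / x.std())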