Lec 1 - Introduction to data-wrangling

Email from colleague

Dear data-science colleague,

We will need to run a report/visualization daily or weekly to plot the change in exchange rates for USD, BGN, DKK, GBP and CHF. The European Central Bank publishes these rates but I don’t understand the format:https://www.ecb.europa.eu/stats/policy_and_exchange_rates/euro_reference_exchange_rates/html/index.en.html

The most recent data should be pulled automatically so the plots can be updated as new rates are published in the future.

Please make at least two plots, one with EUR as a baseline, and one with USD as a baseline.

Thanks, Your colleague

Parsing XML with XSL sheets

ecb_sheet.xsl
<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
xmlns:ecb="http://www.ecb.int/vocabulary/2002-08-01/eurofxref"
exclude-result-prefixes="gesmes ecb">
<xsl:output method="text" omit-xml-declaration="yes" indent="no"/>

<xsl:template match="/gesmes:Envelope">
    <xsl:text>Time,USD,JPY,BGN,CZK,DKK,GBP,CHF&#10;</xsl:text>

    <xsl:for-each select="ecb:Cube/ecb:Cube">
        <xsl:value-of select="@time"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="ecb:Cube[@currency='USD']/@rate"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="ecb:Cube[@currency='JPY']/@rate"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="ecb:Cube[@currency='BGN']/@rate"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="ecb:Cube[@currency='CZK']/@rate"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="ecb:Cube[@currency='DKK']/@rate"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="ecb:Cube[@currency='GBP']/@rate"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="ecb:Cube[@currency='CHF']/@rate"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Makefile
rates.csv: eurofxref-hist-90d.xml
	xsltproc ecb_sheet.xsl eurofxref-hist-90d.xml > rates.csv

eurofxref-hist-90d.xml:
	wget https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist-90d.xml

From this, we can execute both commands in the Makefile with make rates.csv.

Import the csv into a pandas DataFrame in python

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

df = pd.read_csv('rates.csv') # wrong, dates are not ordered correctly
df.dtypes
Time     object
USD     float64
JPY     float64
BGN     float64
CZK     float64
DKK     float64
GBP     float64
CHF     float64
dtype: object
# make sure the first column is interpreted as a date
df = pd.read_csv('rates.csv', parse_dates=['Time'])
df.dtypes
Time    datetime64[ns]
USD            float64
JPY            float64
BGN            float64
CZK            float64
DKK            float64
GBP            float64
CHF            float64
dtype: object
df.columns = ['Time', 'rate_USD', 'rate_JPY', 'rate_BGN', 'rate_CZK', 'rate_DKK', 'rate_GBP', 'rate_CHF']
df.columns
df
Time rate_USD rate_JPY rate_BGN rate_CZK rate_DKK rate_GBP rate_CHF
0 2024-04-16 1.0637 164.54 1.9558 25.210 7.4609 0.85440 0.9712
1 2024-04-15 1.0656 164.05 1.9558 25.324 7.4606 0.85405 0.9725
2 2024-04-12 1.0652 163.16 1.9558 25.337 7.4603 0.85424 0.9716
3 2024-04-11 1.0729 164.18 1.9558 25.392 7.4604 0.85525 0.9787
4 2024-04-10 1.0860 164.89 1.9558 25.368 7.4594 0.85515 0.9810
... ... ... ... ... ... ... ... ...
57 2024-01-24 1.0905 160.46 1.9558 24.786 7.4568 0.85543 0.9415
58 2024-01-23 1.0872 160.88 1.9558 24.824 7.4574 0.85493 0.9446
59 2024-01-22 1.0890 160.95 1.9558 24.758 7.4585 0.85575 0.9458
60 2024-01-19 1.0887 161.17 1.9558 24.813 7.4575 0.85825 0.9459
61 2024-01-18 1.0875 160.89 1.9558 24.734 7.4571 0.85773 0.9432

62 rows × 8 columns

Visualization

Once we are within python, we can comfortably transform the DataFrame into a suitable format fore visualizations.

df_long = pd.wide_to_long(df, stubnames=['rate_'], i='Time', j='cur', suffix='[A-Z]*')
sns.relplot(df_long, x='Time', y='rate_', hue='cur', kind='line')
plt.ylim(0.7, 1.2)

df_long.columns
Index(['rate_'], dtype='object')