Python Library for Russian Macroeconomics Data

The article introduces a Python library called sophisthse that provides access to Russian macroeconomic statistics time series published by the National Research University Higher School of Economics (HSE). The library aims to simplify the process of working with Russian macroeconomic data and facilitate the analysis of economic trends in the country. The article demonstrates how to use the library to list available time series, download data, and visualize the time series data.

Acknowledgments

The library is a port of the sophisthse R package developed by Boris Demeshev. The original package provides similar functionality for working with Russian macroeconomic data in R. The Python version of the library is designed to offer the same features to Python users interested in analyzing Russian economic indicators.

Source Data

The sophist.hse.ru website contains a wide range of Russian macroeconomic time series, including data on GDP, inflation, unemployment, industrial production, retail sales, and more. The data originally comes from the Federal State Statistics Service of the Russian Federation (Rosstat) and other official sources.

Features

The sophisthse library provides the following features:

Listing available tables with time series data
Downloading and caching time series data

Installation

You can install the library using pip:

pip install sophisthse

Usage

Here is an example of how to use the library to download and plot the time series data.

List Available Time Series

Time series are published on the sophist.hse.ru website. Initializing the sophisthse class downloads the list of available tables, which you can then inspect with the list_tables method:

from sophisthse import sophisthse

sph = sophisthse()
tables = sph.list_tables()
tables

	date	name
0	2020-05-27 15:46:00	AGR_M_I - Copy
1	2024-08-08 09:10:00	AGR_M_I
2	2024-08-08 09:10:00	AGR_Q_I
3	2024-02-11 19:10:00	AGR_Y_DIRI
4	2024-07-26 20:04:00	APCI3N
...	...	...
284	2010-03-29 12:57:00	vrp
285	2010-03-30 12:41:00	vrp_r
286	2024-08-08 09:12:00	WAG_M
287	2024-06-07 15:03:00	WAG_Q
288	2024-03-13 19:36:00	WAG_Y

289 rows × 2 columns

The date column contains the timestamp of each table’s latest update. Some of these dates suggest that certain tables are out of date. Let’s look at how the latest-update dates are distributed by year.

tables["year"] = tables["date"].dt.year
tables.groupby("year").size().plot(kind="bar")
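The bar chart can be backed by exact counts. The sketch below is only an illustration: it rebuilds the ten rows visible in the truncated listing above (not the full 289-row frame) and counts tables per year of last update. The names sample, per_year and share_2024 are introduced here for the example.

```python
import pandas as pd

# The ten rows visible in the truncated listing above; the real frame has 289 rows.
sample = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2020-05-27 15:46:00",
                "2024-08-08 09:10:00",
                "2024-08-08 09:10:00",
                "2024-02-11 19:10:00",
                "2024-07-26 20:04:00",
                "2010-03-29 12:57:00",
                "2010-03-30 12:41:00",
                "2024-08-08 09:12:00",
                "2024-06-07 15:03:00",
                "2024-03-13 19:36:00",
            ]
        ),
        "name": [
            "AGR_M_I - Copy", "AGR_M_I", "AGR_Q_I", "AGR_Y_DIRI", "APCI3N",
            "vrp", "vrp_r", "WAG_M", "WAG_Q", "WAG_Y",
        ],
    }
)

# Number of tables per year of last update, oldest year first.
per_year = sample["date"].dt.year.value_counts().sort_index()
print(per_year)

# Share of the sample whose last update falls in 2024.
share_2024 = per_year.get(2024, 0) / len(sample)
print(f"{share_2024:.0%} of the sample was updated in 2024")
```

Applied to the full tables frame, the same two lines give the exact number of tables behind each bar.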

Of the 289 tables, 116 were last updated in 2024. Let’s take a look at the most recently updated tables.

tables["date"] = tables["date"].dt.date
(
    tables.groupby("date")
    .agg({"name": lambda x: ", ".join(x)})
    .reset_index()
    .sort_values("date", ascending=False)
    .head()
)

	date	name
38	2024-08-08	AGR_M_I, AGR_Q_I, BBR_EA2_M_I, BBR_EA2_Q_I, CN...
37	2024-07-26	APCI3N, BMPM3N, CCPM3N, CPPNF3N, ECOG3N, EMAM3...
36	2024-06-07	WAG_Q
35	2024-06-06	GOV_Q, M_Q
34	2024-04-05	INVFC_Q, INVFC_Q_I

Download and Plot Time Series

According to the sophist.hse.ru website, the AGR_M_I table provides the index of real agricultural production in the AGR_M_DIRI column and its seasonally adjusted version in the AGR_M_DIRI_SA column. Let’s download it and take a look at the data.

agr_m_i = sph.get_table("AGR_M_I")
agr_m_i.tail()

	AGR_M_DIRI	AGR_M_DIRI_SA
T
2024-02	110.0	353.3
2024-03	169.6	355.7
2024-04	190.3	357.2
2024-05	208.4	358.0
2024-06	213.2	358.5

Note

After calling get_table, the data is cached locally. The next time get_table is called, the data will be loaded from the cache. However, if the table is updated on the server, which is checked when the sophisthse class is initialized, the new data will be loaded from the website.
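The caching rule in this note can be sketched as follows. This is not the library’s actual implementation: load_with_cache, fake_fetch, and the JSON cache layout are hypothetical and only show the "reuse unless the server copy is newer" logic.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Callable


def load_with_cache(
    name: str,
    remote_updated: datetime,
    cache_dir: Path,
    fetch: Callable[[str], dict],
) -> dict:
    """Return table `name`, downloading only if the cached copy is stale.

    Hypothetical helper: `fetch` stands in for the real download and
    `remote_updated` for the timestamp taken from the table listing.
    """
    path = cache_dir / f"{name}.json"
    if path.exists():
        cached = json.loads(path.read_text())
        # Reuse the cache when it is at least as new as the server copy.
        if datetime.fromisoformat(cached["updated"]) >= remote_updated:
            return cached["data"]
    data = fetch(name)
    path.write_text(json.dumps({"updated": remote_updated.isoformat(), "data": data}))
    return data


# Fake downloader that records how often it is called.
calls = []


def fake_fetch(name: str) -> dict:
    calls.append(name)
    return {"AGR_M_DIRI": [208.4, 213.2]}


cache_dir = Path(tempfile.mkdtemp())
stamp = datetime(2024, 8, 8, 9, 10)
first = load_with_cache("AGR_M_I", stamp, cache_dir, fake_fetch)   # downloads
second = load_with_cache("AGR_M_I", stamp, cache_dir, fake_fetch)  # served from cache
newer = load_with_cache("AGR_M_I", datetime(2024, 9, 1), cache_dir, fake_fetch)  # downloads again
print(len(calls))
```

Comparing timestamps rather than file contents keeps the check cheap, because the listing already carries the latest-update date for every table.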

The get_table method returns a pandas DataFrame, so we can plot the data using the plot method.

agr_m_i.plot()

Additional Example

According to the sophist.hse.ru website, the FINMAR_M table contains data on the financial market.

finmar = sph.get_table("FINMAR_M")
finmar.tail()

	RDEXRO_M	RDEXRM_M	RTS_M	IB_M	GKO_M	DEP_M	CR_M
T
2024-03	92.37	92.35	1131.21	NaN	NaN	NaN	NaN
2024-04	91.78	93.05	1165.23	NaN	NaN	NaN	NaN
2024-05	89.79	90.05	1185.57	NaN	NaN	NaN	NaN
2024-06	85.75	85.75	1134.30	NaN	NaN	NaN	NaN
2024-07	86.33	86.11	1092.61	NaN	NaN	NaN	NaN

The first three columns are:

RDEXRO_M - Official exchange rate of the ruble against the US dollar.
RDEXRM_M - Exchange rate of the US dollar on the Moscow Exchange.
RTS_M - Average RTS index.

The last four columns show NaNs for the most recent periods. Let’s take a look at the missing values.

import seaborn as sns
sns.heatmap(~finmar.isnull(), cbar=False, cmap="Blues")

The heatmap shows that the data in the IB_M, GKO_M, DEP_M, and CR_M columns is not available for the recent years, while the RDEXRO_M, RDEXRM_M, and RTS_M columns have complete data.
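The heatmap can be complemented by a numeric check that reports the last period with data for each column. The sketch below uses a small synthetic frame, demo, as a stand-in for finmar, and assumes the real table is indexed by monthly periods, as the output above suggests.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `finmar`: one complete column, one that stops early.
idx = pd.period_range("2023-01", "2023-06", freq="M")
demo = pd.DataFrame(
    {
        "complete": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "discontinued": [1.0, 2.0, 3.0, np.nan, np.nan, np.nan],
    },
    index=idx,
)

# Last period with a non-missing value, per column.
last_obs = {col: demo[col].last_valid_index() for col in demo.columns}
# Number of missing observations, per column.
n_missing = demo.isna().sum()

print(last_obs)
print(n_missing)
```

Replacing demo with finmar would show when each of the last four series stops being reported.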

Let’s plot the RTS_M column. The RTS index is quite volatile, so we add a 12-month centered moving average to the plot to make the trend easier to see.

import matplotlib.pyplot as plt

finmar["RTS_M"].plot(label="RTS index")
finmar["RTS_M"].rolling(12, center=True).mean().plot(label="12-month MA")
plt.legend()
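One consequence of center=True is that the smoothed line stops before the last observation. The sketch below compares a centered and a trailing 12-month average on a synthetic series (the values 0 to 23, not real data).

```python
import pandas as pd

# Two years of synthetic monthly values: 0, 1, ..., 23.
s = pd.Series(
    range(24),
    index=pd.period_range("2022-01", periods=24, freq="M"),
    dtype="float64",
)

centered = s.rolling(12, center=True).mean()
trailing = s.rolling(12).mean()

# Either way, a full 12-month window is unavailable for 11 points.
print(centered.isna().sum(), trailing.isna().sum())

# The trailing average has all its gaps at the start, while the centered
# one splits them between both ends, so its latest months are NaN.
print(trailing.dropna().index[0], centered.dropna().index[-1])
```

A centered window follows turning points without lag, which suits a retrospective plot. A trailing window is the better choice when the most recent months matter.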

Further Investigation

Looking at the tables, one can assume that the suffixes _M, _Q and _Y in the names mean that the data was sampled monthly, quarterly and annually, respectively. Let’s check this assumption.

def guess_period(name: str):
    if "_M" in name:
        return "month"
    if "_Q" in name:
        return "quarter"
    if "_Y" in name:
        return "year"
    return "unknown"


tables["period"] = tables["name"].apply(guess_period)

tables.groupby("period").size().plot(kind="bar")

The bar plot shows that the assumption only partially holds: some table names carry none of these suffixes, so their sampling frequency remains unknown. It can be determined from the actual series after downloading the data. That is not necessary for demonstration purposes, so we will skip this step.
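For readers who do want to check, here is a minimal sketch of reading the frequency from a downloaded table. It assumes get_table returns a DataFrame with a pandas PeriodIndex, which the outputs above suggest but which is not confirmed here. The helper period_from_index is hypothetical and is tried only on synthetic frames.

```python
import pandas as pd


def period_from_index(df: pd.DataFrame) -> str:
    """Map a PeriodIndex frequency to 'month', 'quarter', 'year' or 'unknown'."""
    if not isinstance(df.index, pd.PeriodIndex):
        return "unknown"
    # freqstr looks like "M", "Q-DEC", "A-DEC" or "Y-DEC" depending on the
    # pandas version, so only the first letter is inspected.
    first_letter = df.index.freqstr[0]
    return {"M": "month", "Q": "quarter", "A": "year", "Y": "year"}.get(
        first_letter, "unknown"
    )


# Synthetic frames standing in for downloaded tables.
monthly = pd.DataFrame({"x": [1, 2, 3]}, index=pd.period_range("2024-01", periods=3, freq="M"))
quarterly = pd.DataFrame({"x": [1, 2, 3]}, index=pd.period_range("2024-01", periods=3, freq="Q"))
yearly = pd.DataFrame({"x": [1, 2, 3]}, index=pd.period_range("2022", periods=3, freq="Y"))
plain = pd.DataFrame({"x": [1, 2, 3]})  # no PeriodIndex at all

print([period_from_index(d) for d in (monthly, quarterly, yearly, plain)])
```

Unlike the name-based guess, this requires downloading each table first, so it is best reserved for the tables that the suffix rule leaves as unknown.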

Let’s calculate the correlation matrix for the monthly data.

from time import sleep
from random import randint
import pandas as pd


monthly_tables = tables.query("period == 'month' & year >= 2023")

df = pd.DataFrame(index=pd.period_range(start="2014-01-01", end="2025-01-01", freq="M"))
for table in monthly_tables["name"]:
    df_add = sph.get_table(table)
    # Uncomment the next line to pause between requests
    # and avoid overloading the server
    # sleep(randint(1, 3))
    df = df.merge(df_add, left_index=True, right_index=True)

# We do not want to deal with NaNs, at least at this time :)
df = df.dropna(how="any", axis=1)

# Keep only columns with "_M" suffix to skip derived columns
columns = [col for col in df.columns if col.endswith("_M")]

corr = df[columns].corr()

sns.heatmap(
    corr,
    vmin=-1.0,
    vmax=1.0,
    xticklabels=corr.columns.values,  # type: ignore
    yticklabels=corr.columns.values,  # type: ignore
    cmap="coolwarm",
)

The last column (or row) of the heatmap is WAG_C_M, a monthly series of the average nominal wage. It correlates with many economic indicators, and it is worth noting that CBEX_M (Consolidated Budget Expenditures) explains almost 84% of the variability of WAG_C_M. Let’s take a look at these two series.

df_norm = df[columns].apply(lambda x: (x - x.mean()) / x.std())
df_norm[["WAG_C_M", "CBEX_M"]].plot()

The plot shows that the WAG_C_M series follows the CBEX_M series with great accuracy. This is an interesting observation that can be used in further analysis.
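The 84% figure quoted above reads naturally as a squared correlation coefficient: when one series is regressed on another with an intercept, the share of variance explained (R²) equals the square of Pearson’s r. That is an assumption about how the number was obtained. The sketch below checks the identity on synthetic data, not on the real series.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)  # synthetic data, not the real series

# Pearson correlation coefficient.
r = np.corrcoef(x, y)[0, 1]

# R^2 of the least-squares line y ~ a * x + b.
a, b = np.polyfit(x, y, deg=1)
residuals = y - (a * x + b)
r_squared = 1.0 - residuals.var() / y.var()

print(round(r**2, 6), round(r_squared, 6))  # the two values agree

# The correlation that corresponds to 84% of explained variance.
print(round(0.84**0.5, 3))  # 0.917
```

Under that reading, the corresponding cell of corr would hold a coefficient of roughly 0.92.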

Conclusion

The sophisthse library provides a convenient way to access and work with Russian macroeconomic time series data. The library simplifies the process of downloading and caching data, making it easier to analyze economic trends in Russia. The library is a valuable resource for researchers, analysts, and anyone interested in studying the Russian economy.