library(dplyr)
library(tidyr)
library(WDI)
library(fixest)
Tidy Fixed Effects Regressions: fixest vs pyfixest
Fixed effects regressions are crucial in econometrics for controlling for unobservable variables that vary across entities but are constant over time. They allow researchers to isolate the impact of independent variables on the dependent variable, which is for instance, essential for educing omitted variable bias in causal inference.
The fixest
R package is a powerful and efficient tool for econometric analysis, specializing in fixed effects models. It excels in handling large datasets and supports a wide range of regression models including linear, count, and logistic models. Its key strength lies in the fast estimation of models with multiple fixed effects, crucial for controlling unobserved heterogeneity in econometric analysis. With a user-friendly syntax, it integrates well with the tidyverse
, facilitating easy data manipulation and visualization. fixest
offers robust features for hypothesis testing, clustering standard errors, and instrumental variable estimation, all backed by C++ code for speed.
pyfixest
is a Python implementation of the fixest
package with the goal to mimic fixest
syntax and functionality as closely as Python allows. The great thing is, if you know either package, then you can switch between R and Python (almost) seamlessly!
To illustrate both packages, I use real-world data and a meaningful example: we download annual country-level indicators and run various regressions, accounting for different types of unobserved heterogeneity. Don’t take the results to seriously, though, as this post serves an illustrative purpose and is not part of a proper research project.
Loading packages
For data manipulation, we use dplyr
in R and pandas
in Python. We also load the WDI
package and wbdata
library to download World Development Indicators (WDI) from the World Bank database.
The WDI dataset provides a comprehensive set of economic development data, including social, economic, financial, natural resources, and environmental indicators for over 200 countries over many years. This data is excellent to illustrate country-level fixed effects regressions.
If you want to load pyfixest
in RStudio using reticulate
on Mac, then you might run into a very peculiar error: https://github.com/rstudio/rstudio/issues/13967. However, pyfixest
should run in Jupyter or Python console in any case.
import numpy as np
import pandas as pd
from wbdata import get_dataframe
from pyfixest.estimation import feols
from pyfixest.summarize import etable
Downloading and preparing data
We want to estimate a simple model where we use GDP per capita as the dependent variable and education expenditure, health expenditure, and CO2 emissions as independent variables. We divide the education and health expenditure shares by 100 to get actual percentages and we log GDP per capita and CO2 emissions to reduce the impact of their skewed distributions.
<- c(
indicators "gdp_per_capita" = "NY.GDP.PCAP.KD",
"edu_exp_share" = "SE.XPD.TOTL.GD.ZS",
"health_exp_share" = "SH.XPD.CHEX.GD.ZS",
"co2_emissions" = "EN.ATM.CO2E.PC",
"age_dependency_ratio" = "SP.POP.DPND"
)
<- WDI(
wdi_data country = "all", start = 2002, end = 2022,
indicators,
)
<- as_tibble(wdi_data) |>
wdi_data mutate(
log_gdp_per_capita = log(gdp_per_capita),
edu_exp_share = edu_exp_share / 100,
health_exp_share = health_exp_share / 100,
log_co2_emissions = log(co2_emissions),
age_dependency_ratio = age_dependency_ratio / 100
|>
) drop_na(log_gdp_per_capita, edu_exp_share,
health_exp_share, log_co2_emissions)
= {
indicators "NY.GDP.PCAP.KD": "gdp_per_capita",
"SE.XPD.TOTL.GD.ZS": "edu_exp_share",
"SH.XPD.CHEX.GD.ZS": "health_exp_share",
"EN.ATM.CO2E.PC": "co2_emissions",
"SP.POP.DPND": "age_dependency_ratio"
}
= (get_dataframe(indicators, date = ("2002", "2022"))
wdi_data
.reset_index()= {"date": "year"})
.rename(columns
)
= (pd.DataFrame(wdi_data)
wdi_data
.assign(= lambda x: np.log(x["gdp_per_capita"]),
log_gdp_per_capita = lambda x: x["edu_exp_share"] / 100,
edu_exp_share = lambda x: x["health_exp_share"] / 100,
health_exp_share = lambda x: np.log(x["co2_emissions"]),
log_co2_emissions = lambda x: x["age_dependency_ratio"] / 100
age_dependency_ratio
)= ["log_gdp_per_capita", "edu_exp_share",
.dropna(subset "health_exp_share", "log_co2_emissions",
"age_dependency_ratio"])
)
Linear regressions
We focus on linear regressions with fixed effects in the examples below because these models are the most common and simple ones. By using fixed effects, the model controls for all time-invariant differences between the countries, which means that any unobserved factors that do not change over time and that could influence the dependent variable (GDP per capita) are accounted for. This significantly reduces the bias in the estimated coefficients of the independent variables, leading to more reliable and interpretable results.
<- feols(
fe_model ~ edu_exp_share + health_exp_share + log_co2_emissions | country + year,
log_gdp_per_capita data = wdi_data,
vcov = "iid")
summary(fe_model)
OLS estimation, Dep. Var.: log_gdp_per_capita
Observations: 3,387
Fixed-effects: country: 226, year: 19
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
edu_exp_share -0.784350 0.226210 -3.46736 5.3266e-04 ***
health_exp_share -1.006972 0.196732 -5.11851 3.2650e-07 ***
log_co2_emissions 0.280575 0.008417 33.33364 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.082722 Adj. R2: 0.996154
Within R2: 0.278266
= feols(
fe_model "log_gdp_per_capita ~ edu_exp_share + health_exp_share + log_co2_emissions | country + year",
= wdi_data,
data = "iid")
vcov
fe_model.summary()
###
Estimation: OLS
Dep. var.: log_gdp_per_capita, Fixed effects: country+year
Inference: iid
Observations: 3387
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5 % | 97.5 % |
|:------------------|-----------:|-------------:|----------:|-----------:|--------:|---------:|
| edu_exp_share | -0.784 | 0.218 | -3.600 | 0.000 | -1.212 | -0.357 |
| health_exp_share | -1.007 | 0.190 | -5.314 | 0.000 | -1.379 | -0.635 |
| log_co2_emissions | 0.281 | 0.008 | 34.605 | 0.000 | 0.265 | 0.296 |
---
RMSE: 0.083 R2: 0.996 R2 Within: 0.278
Clustering standard errors
Standard errors are a critical part of each FE estimation, in particular when you want to draw inference. Which standard errors to pick depends on your setting and convention in your field. fixest
comes with many different implementations of standard errors as you can see in the documentation here.
In financial economics (where I come from), you typically use clustered standard errors. So in the application below, I cluster standard errors by country. We already see a significant drop in \(t\)-statistics through clustering compared to the iid results from above.
= feols(
fe_model_clustered ~ edu_exp_share + health_exp_share + log_co2_emissions | country + year,
log_gdp_per_capita data = wdi_data,
vcov = ~country)
summary(fe_model_clustered)
OLS estimation, Dep. Var.: log_gdp_per_capita
Observations: 3,387
Fixed-effects: country: 226, year: 19
Standard-errors: Clustered (country)
Estimate Std. Error t value Pr(>|t|)
edu_exp_share -0.784350 0.421681 -1.86005 0.064183 .
health_exp_share -1.006972 0.622018 -1.61888 0.106874
log_co2_emissions 0.280575 0.028154 9.96560 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.082722 Adj. R2: 0.996154
Within R2: 0.278266
= feols(
fe_model_clustered "log_gdp_per_capita ~ edu_exp_share + health_exp_share + log_co2_emissions | country + year",
= wdi_data,
data = {"CRV1": "country"})
vcov
fe_model_clustered.summary()
###
Estimation: OLS
Dep. var.: log_gdp_per_capita, Fixed effects: country+year
Inference: CRV1
Observations: 3387
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5 % | 97.5 % |
|:------------------|-----------:|-------------:|----------:|-----------:|--------:|---------:|
| edu_exp_share | -0.784 | 0.420 | -1.865 | 0.063 | -1.613 | 0.044 |
| health_exp_share | -1.007 | 0.620 | -1.623 | 0.106 | -2.229 | 0.215 |
| log_co2_emissions | 0.281 | 0.028 | 9.994 | 0.000 | 0.225 | 0.336 |
---
RMSE: 0.083 R2: 0.996 R2 Within: 0.278
Multiple models
It usually makes sense to run multiple models with various specifications to better understand the impact of fixed effects on coefficient estimates. fixest
and pyfixest
come with an amazing set of tools to support multiple estimations with minimal syntax (see the documentation here).
In the example below, we run the regression from above but iteratively add fixed effects using the csw0()
helper function. Check out the documentation for more helpers in this direction. Note that I generally prefer looking at \(t\)-stats rather than standard errors because the latter are typically very hard to interpret across coefficients that vary in size. The \(t\)-statistics provide a consistent way to interpret changes in estimation uncertainty across different model specifications. Unfortunately, pyfixest
does not yet support \(t\)-stats in etable()
, but the feature will come soon (see this issue).
= feols(
fe_models ~ edu_exp_share + health_exp_share + log_co2_emissions | csw0(country, year),
log_gdp_per_capita data = wdi_data,
vcov = ~country)
etable(fe_models, coefstat = "tstat", digits = 3, digits.stats = 3)
fe_models.1 fe_models.2 fe_models.3
Dependent Var.: log_gdp_per_capita log_gdp_per_capita log_gdp_per_capita
Constant 7.39*** (60.4)
edu_exp_share -2.67 (-1.04) -0.215 (-0.333) -0.784. (-1.86)
health_exp_share 11.5*** (7.26) 3.01** (3.14) -1.01 (-1.62)
log_co2_emissions 0.803*** (29.7) 0.441*** (11.1) 0.281*** (9.96)
Fixed-Effects: ------------------ ------------------ ------------------
country No Yes Yes
year No No Yes
_________________ __________________ __________________ __________________
VCOV: Clustered by: country by: country by: country
Observations 3,387 3,387 3,387
R2 0.828 0.992 0.996
Within R2 -- 0.315 0.278
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
= feols(
fe_models "log_gdp_per_capita ~ edu_exp_share + health_exp_share + log_co2_emissions | csw0(country, year)",
= wdi_data,
data = {"CRV1": "country"})
vcov
for j in range(len(fe_models.all_fitted_models))],
etable([fe_models.fetch_model(j) = 3) digits
Model: log_gdp_per_capita~edu_exp_share+health_exp_share+log_co2_emissions
Model: log_gdp_per_capita~edu_exp_share+health_exp_share+log_co2_emissions|country
Model: log_gdp_per_capita~edu_exp_share+health_exp_share+log_co2_emissions|country+year
est1 est2 est3
----------------- ------------------ ------------------ ------------------
depvar log_gdp_per_capita log_gdp_per_capita log_gdp_per_capita
-----------------------------------------------------------------------------
Intercept 7.388*** (0.122)
edu_exp_share -2.673 (2.558) -0.215 (0.647) -0.784 (0.42)
health_exp_share 11.543*** (1.59) 3.005** (0.956) -1.007 (0.62)
log_co2_emissions 0.803*** (0.027) 0.441*** (0.04) 0.281*** (0.028)
-----------------------------------------------------------------------------
country - x x
year - - x
-----------------------------------------------------------------------------
R2 0.828 0.992 0.996
S.E. type by: country by: country by: country
Observations 3387 3387 3387
-----------------------------------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Instrumental variables
Let’s consider an economic model where we are interested in the impact of education expenditure (as a percentage of GDP) on GDP per capita growth. The challenge is that education expenditure might be endogenous; countries might allocate their spending based on expected future GDP growth, or there might be omitted variables affecting both.
A good instrument in this context could be something that influences education expenditure but does not directly affect GDP per capita growth except through its impact on education. An example might be the age dependency ratio outside the working age population (% of working-age population), under the assumption that countries with a higher dependency ratio might prioritize education differently, affecting their education expenditures without directly influencing GDP growth except through this channel.
The examples below show to implement IV regressions. Unfortunately, our model does not indicate any statistically meaningful relationship for our channel, but that’s fine as its purpose is illustrative anyway.
<- feols(
iv_model ~ health_exp_share + log_co2_emissions | country + year | edu_exp_share ~ age_dependency_ratio,
log_gdp_per_capita data = wdi_data,
vcov = ~country)
summary(iv_model)
TSLS estimation, Dep. Var.: log_gdp_per_capita, Endo.: edu_exp_share, Instr.: age_dependency_ratio
Second stage: Dep. Var.: log_gdp_per_capita
Observations: 3,387
Fixed-effects: country: 226, year: 19
Standard-errors: Clustered (country)
Estimate Std. Error t value Pr(>|t|)
fit_edu_exp_share 17.987284 26.101642 0.689125 0.49145486
health_exp_share -1.527878 1.466760 -1.041668 0.29868341
log_co2_emissions 0.238962 0.068493 3.488869 0.00058341 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.147817 Adj. R2: 0.987721
Within R2: -1.30454
F-test (1st stage), edu_exp_share: stat = 7.13703, p = 0.007587, on 1 and 3,365 DoF.
Wu-Hausman: stat = 14.7 , p = 1.306e-4, on 1 and 3,139 DoF.
= feols(
iv_model "log_gdp_per_capita ~ health_exp_share + log_co2_emissions | country + year | edu_exp_share ~ age_dependency_ratio",
= wdi_data,
data = {"CRV1": "country"})
vcov
iv_model.summary()
###
Estimation: IV
Dep. var.: log_gdp_per_capita, Fixed effects: country+year
Inference: CRV1
Observations: 3387
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5 % | 97.5 % |
|:------------------|-----------:|-------------:|----------:|-----------:|--------:|---------:|
| edu_exp_share | 17.987 | 26.028 | 0.691 | 0.490 | -33.303 | 69.278 |
| health_exp_share | -1.528 | 1.463 | -1.045 | 0.297 | -4.410 | 1.354 |
| log_co2_emissions | 0.239 | 0.068 | 3.499 | 0.001 | 0.104 | 0.374 |
---
Conclusion
I am a big fan of fixest
and its sibling pyfixest
, in particular the simple and powerful syntax are winning features. The summary tables are also concise and focus on the most relevant information. I think these packages are ideal for economists and data scientists who require a reliable and efficient solution for complex econometric models.