Tidy Fixed Effects Regressions: fixest vs pyfixest

R
Python
Modeling
A comparison of packages for fast fixed-effects estimation in R and Python
Author

Christoph Scheuch

Follow
Published

February 6, 2024

Fixed effects regressions are crucial in econometrics for controlling for unobservable variables that vary across entities but are constant over time. They allow researchers to isolate the impact of independent variables on the dependent variable, which is for instance, essential for educing omitted variable bias in causal inference.

The fixest R package is a powerful and efficient tool for econometric analysis, specializing in fixed effects models. It excels in handling large datasets and supports a wide range of regression models including linear, count, and logistic models. Its key strength lies in the fast estimation of models with multiple fixed effects, crucial for controlling unobserved heterogeneity in econometric analysis. With a user-friendly syntax, it integrates well with the tidyverse, facilitating easy data manipulation and visualization. fixest offers robust features for hypothesis testing, clustering standard errors, and instrumental variable estimation, all backed by C++ code for speed.

pyfixest is a Python implementation of the fixest package with the goal to mimic fixest syntax and functionality as closely as Python allows. The great thing is, if you know either package, then you can switch between R and Python (almost) seamlessly!

To illustrate both packages, I use real-world data and a meaningful example: we download annual country-level indicators and run various regressions, accounting for different types of unobserved heterogeneity. Don’t take the results to seriously, though, as this post serves an illustrative purpose and is not part of a proper research project.

Loading packages

For data manipulation, we use dplyr in R and pandas in Python. We also load the WDI package and wbdata library to download World Development Indicators (WDI) from the World Bank database.

The WDI dataset provides a comprehensive set of economic development data, including social, economic, financial, natural resources, and environmental indicators for over 200 countries over many years. This data is excellent to illustrate country-level fixed effects regressions.

library(dplyr)
library(tidyr)
library(WDI)
library(fixest)
Note

If you want to load pyfixest in RStudio using reticulate on Mac, then you might run into a very peculiar error: https://github.com/rstudio/rstudio/issues/13967. However, pyfixest should run in Jupyter or Python console in any case.

import numpy as np
import pandas as pd

from wbdata import get_dataframe
from pyfixest.estimation import feols
from pyfixest.summarize import etable

Downloading and preparing data

We want to estimate a simple model where we use GDP per capita as the dependent variable and education expenditure, health expenditure, and CO2 emissions as independent variables. We divide the education and health expenditure shares by 100 to get actual percentages and we log GDP per capita and CO2 emissions to reduce the impact of their skewed distributions.

indicators <- c(
  "gdp_per_capita" = "NY.GDP.PCAP.KD",
  "edu_exp_share" = "SE.XPD.TOTL.GD.ZS",
  "health_exp_share" = "SH.XPD.CHEX.GD.ZS",
  "co2_emissions" = "EN.ATM.CO2E.PC",
  "age_dependency_ratio" = "SP.POP.DPND"
)  

wdi_data <- WDI(
  indicators, country = "all", start = 2002, end = 2022,
) 

wdi_data <- as_tibble(wdi_data) |> 
  mutate(
    log_gdp_per_capita = log(gdp_per_capita),
    edu_exp_share = edu_exp_share / 100,
    health_exp_share = health_exp_share / 100,
    log_co2_emissions = log(co2_emissions),
    age_dependency_ratio = age_dependency_ratio / 100
  ) |> 
  drop_na(log_gdp_per_capita, edu_exp_share,
          health_exp_share, log_co2_emissions)
indicators = {
  "NY.GDP.PCAP.KD": "gdp_per_capita",
  "SE.XPD.TOTL.GD.ZS": "edu_exp_share",
  "SH.XPD.CHEX.GD.ZS": "health_exp_share",
  "EN.ATM.CO2E.PC": "co2_emissions",
  "SP.POP.DPND": "age_dependency_ratio"
}

wdi_data = (get_dataframe(indicators, date = ("2002", "2022"))
  .reset_index()
  .rename(columns = {"date": "year"})
)

wdi_data = (pd.DataFrame(wdi_data)
  .assign(
    log_gdp_per_capita = lambda x: np.log(x["gdp_per_capita"]),
    edu_exp_share = lambda x: x["edu_exp_share"] / 100,
    health_exp_share = lambda x: x["health_exp_share"] / 100,
    log_co2_emissions = lambda x: np.log(x["co2_emissions"]),
    age_dependency_ratio = lambda x: x["age_dependency_ratio"] / 100
  )
  .dropna(subset = ["log_gdp_per_capita", "edu_exp_share",
                    "health_exp_share", "log_co2_emissions",
                    "age_dependency_ratio"])
)

Linear regressions

We focus on linear regressions with fixed effects in the examples below because these models are the most common and simple ones. By using fixed effects, the model controls for all time-invariant differences between the countries, which means that any unobserved factors that do not change over time and that could influence the dependent variable (GDP per capita) are accounted for. This significantly reduces the bias in the estimated coefficients of the independent variables, leading to more reliable and interpretable results.

fe_model <- feols(
  log_gdp_per_capita ~ edu_exp_share + health_exp_share + log_co2_emissions | country + year, 
  data = wdi_data,
  vcov = "iid")

summary(fe_model)
OLS estimation, Dep. Var.: log_gdp_per_capita
Observations: 3,387 
Fixed-effects: country: 226,  year: 19
Standard-errors: IID 
                   Estimate Std. Error  t value   Pr(>|t|)    
edu_exp_share     -0.784350   0.226210 -3.46736 5.3266e-04 ***
health_exp_share  -1.006972   0.196732 -5.11851 3.2650e-07 ***
log_co2_emissions  0.280575   0.008417 33.33364  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.082722     Adj. R2: 0.996154
                 Within R2: 0.278266
fe_model = feols(
  "log_gdp_per_capita ~ edu_exp_share + health_exp_share + log_co2_emissions | country + year", 
  data = wdi_data,
  vcov = "iid")
  
fe_model.summary()
###

Estimation:  OLS
Dep. var.: log_gdp_per_capita, Fixed effects: country+year
Inference:  iid
Observations:  3387

| Coefficient       |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5 % |   97.5 % |
|:------------------|-----------:|-------------:|----------:|-----------:|--------:|---------:|
| edu_exp_share     |     -0.784 |        0.218 |    -3.600 |      0.000 |  -1.212 |   -0.357 |
| health_exp_share  |     -1.007 |        0.190 |    -5.314 |      0.000 |  -1.379 |   -0.635 |
| log_co2_emissions |      0.281 |        0.008 |    34.605 |      0.000 |   0.265 |    0.296 |
---
RMSE: 0.083   R2: 0.996   R2 Within: 0.278

Clustering standard errors

Standard errors are a critical part of each FE estimation, in particular when you want to draw inference. Which standard errors to pick depends on your setting and convention in your field. fixest comes with many different implementations of standard errors as you can see in the documentation here.

In financial economics (where I come from), you typically use clustered standard errors. So in the application below, I cluster standard errors by country. We already see a significant drop in t-statistics through clustering compared to the iid results from above.

fe_model_clustered = feols(
  log_gdp_per_capita ~ edu_exp_share + health_exp_share + log_co2_emissions | country + year, 
  data = wdi_data,
  vcov = ~country)

summary(fe_model_clustered)
OLS estimation, Dep. Var.: log_gdp_per_capita
Observations: 3,387 
Fixed-effects: country: 226,  year: 19
Standard-errors: Clustered (country) 
                   Estimate Std. Error  t value  Pr(>|t|)    
edu_exp_share     -0.784350   0.421681 -1.86005  0.064183 .  
health_exp_share  -1.006972   0.622018 -1.61888  0.106874    
log_co2_emissions  0.280575   0.028154  9.96560 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.082722     Adj. R2: 0.996154
                 Within R2: 0.278266
fe_model_clustered = feols(
  "log_gdp_per_capita ~ edu_exp_share + health_exp_share + log_co2_emissions | country + year", 
  data = wdi_data,
  vcov = {"CRV1": "country"})
  
fe_model_clustered.summary()
###

Estimation:  OLS
Dep. var.: log_gdp_per_capita, Fixed effects: country+year
Inference:  CRV1
Observations:  3387

| Coefficient       |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5 % |   97.5 % |
|:------------------|-----------:|-------------:|----------:|-----------:|--------:|---------:|
| edu_exp_share     |     -0.784 |        0.420 |    -1.865 |      0.063 |  -1.613 |    0.044 |
| health_exp_share  |     -1.007 |        0.620 |    -1.623 |      0.106 |  -2.229 |    0.215 |
| log_co2_emissions |      0.281 |        0.028 |     9.994 |      0.000 |   0.225 |    0.336 |
---
RMSE: 0.083   R2: 0.996   R2 Within: 0.278

Multiple models

It usually makes sense to run multiple models with various specifications to better understand the impact of fixed effects on coefficient estimates. fixest and pyfixest come with an amazing set of tools to support multiple estimations with minimal syntax (see the documentation here).

In the example below, we run the regression from above but iteratively add fixed effects using the csw0() helper function. Check out the documentation for more helpers in this direction. Note that I generally prefer looking at t-stats rather than standard errors because the latter are typically very hard to interpret across coefficients that vary in size. The t-statistics provide a consistent way to interpret changes in estimation uncertainty across different model specifications. Unfortunately, pyfixest does not yet support t-stats in etable(), but the feature will come soon (see this issue).

fe_models = feols(
  log_gdp_per_capita ~ edu_exp_share + health_exp_share + log_co2_emissions | csw0(country, year), 
  data = wdi_data,
  vcov = ~country)
etable(fe_models, coefstat = "tstat", digits = 3, digits.stats = 3)
                         fe_models.1        fe_models.2        fe_models.3
Dependent Var.:   log_gdp_per_capita log_gdp_per_capita log_gdp_per_capita
                                                                          
Constant              7.39*** (60.4)                                      
edu_exp_share          -2.67 (-1.04)    -0.215 (-0.333)    -0.784. (-1.86)
health_exp_share      11.5*** (7.26)      3.01** (3.14)      -1.01 (-1.62)
log_co2_emissions    0.803*** (29.7)    0.441*** (11.1)    0.281*** (9.96)
Fixed-Effects:    ------------------ ------------------ ------------------
country                           No                Yes                Yes
year                              No                 No                Yes
_________________ __________________ __________________ __________________
VCOV: Clustered          by: country        by: country        by: country
Observations                   3,387              3,387              3,387
R2                             0.828              0.992              0.996
Within R2                         --              0.315              0.278
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fe_models = feols(
  "log_gdp_per_capita ~ edu_exp_share + health_exp_share + log_co2_emissions | csw0(country, year)", 
  data = wdi_data,
  vcov = {"CRV1": "country"})
  
etable([fe_models.fetch_model(j) for j in range(len(fe_models.all_fitted_models))],
       digits = 3)
Model:  log_gdp_per_capita~edu_exp_share+health_exp_share+log_co2_emissions
Model:  log_gdp_per_capita~edu_exp_share+health_exp_share+log_co2_emissions|country
Model:  log_gdp_per_capita~edu_exp_share+health_exp_share+log_co2_emissions|country+year
                                 est1                est2                est3
-----------------  ------------------  ------------------  ------------------
depvar             log_gdp_per_capita  log_gdp_per_capita  log_gdp_per_capita
-----------------------------------------------------------------------------
Intercept            7.388*** (0.122)
edu_exp_share          -2.673 (2.558)      -0.215 (0.647)       -0.784 (0.42)
health_exp_share     11.543*** (1.59)     3.005** (0.956)       -1.007 (0.62)
log_co2_emissions    0.803*** (0.027)     0.441*** (0.04)    0.281*** (0.028)
-----------------------------------------------------------------------------
country                             -                   x                   x
year                                -                   -                   x
-----------------------------------------------------------------------------
R2                              0.828               0.992               0.996
S.E. type                 by: country         by: country         by: country
Observations                     3387                3387                3387
-----------------------------------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001

Instrumental variables

Let’s consider an economic model where we are interested in the impact of education expenditure (as a percentage of GDP) on GDP per capita growth. The challenge is that education expenditure might be endogenous; countries might allocate their spending based on expected future GDP growth, or there might be omitted variables affecting both.

A good instrument in this context could be something that influences education expenditure but does not directly affect GDP per capita growth except through its impact on education. An example might be the age dependency ratio outside the working age population (% of working-age population), under the assumption that countries with a higher dependency ratio might prioritize education differently, affecting their education expenditures without directly influencing GDP growth except through this channel.

The examples below show to implement IV regressions. Unfortunately, our model does not indicate any statistically meaningful relationship for our channel, but that’s fine as its purpose is illustrative anyway.

iv_model <- feols(
  log_gdp_per_capita ~ health_exp_share + log_co2_emissions | country + year | edu_exp_share ~ age_dependency_ratio,
  data = wdi_data,
  vcov = ~country)

summary(iv_model)
TSLS estimation, Dep. Var.: log_gdp_per_capita, Endo.: edu_exp_share, Instr.: age_dependency_ratio
Second stage: Dep. Var.: log_gdp_per_capita
Observations: 3,387 
Fixed-effects: country: 226,  year: 19
Standard-errors: Clustered (country) 
                   Estimate Std. Error   t value   Pr(>|t|)    
fit_edu_exp_share 17.987284  26.101642  0.689125 0.49145486    
health_exp_share  -1.527878   1.466760 -1.041668 0.29868341    
log_co2_emissions  0.238962   0.068493  3.488869 0.00058341 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.147817     Adj. R2: 0.987721
                 Within R2: -1.30454
F-test (1st stage), edu_exp_share: stat =  7.13703, p = 0.007587, on 1 and 3,365 DoF.
                       Wu-Hausman: stat = 14.7    , p = 1.306e-4, on 1 and 3,139 DoF.
iv_model = feols(
  "log_gdp_per_capita ~ health_exp_share + log_co2_emissions | country + year | edu_exp_share ~ age_dependency_ratio",
  data = wdi_data,
  vcov = {"CRV1": "country"})
  
iv_model.summary()
###

Estimation:  IV
Dep. var.: log_gdp_per_capita, Fixed effects: country+year
Inference:  CRV1
Observations:  3387

| Coefficient       |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5 % |   97.5 % |
|:------------------|-----------:|-------------:|----------:|-----------:|--------:|---------:|
| edu_exp_share     |     17.987 |       26.028 |     0.691 |      0.490 | -33.303 |   69.278 |
| health_exp_share  |     -1.528 |        1.463 |    -1.045 |      0.297 |  -4.410 |    1.354 |
| log_co2_emissions |      0.239 |        0.068 |     3.499 |      0.001 |   0.104 |    0.374 |
---

Conclusion

I am a big fan of fixest and its sibling pyfixest, in particular the simple and powerful syntax are winning features. The summary tables are also concise and focus on the most relevant information. I think these packages are ideal for economists and data scientists who require a reliable and efficient solution for complex econometric models.