multi-level categorical variable in felm linear regression

Hello,

I am trying to run a two-way fixed-effects regression using lfe::felm in RStudio. I followed this instruction for doing it, although that example uses the lm command.

The problem is that the regression output reports coefficients for all levels of the categorical variable, instead of automatically dropping one of them because of collinearity. There was an earlier post about this problem here, but it received no replies.

Can someone help me with this? Thank you.

I can't answer with confidence simply by reading this explainer.

A reprex (see the FAQ) would be helpful.

The problem that felm() addresses is that an lm() model in the form

lm(y ~ x1+x2+x3 + f1+f2+f3)

where f1,f2,f3 are arbitrary factors, and x1,x2,x3 are other covariates

that performs satisfactorily when the number of factor levels is small may not do so when the number of levels is large, because of collinearities between the factors and the other covariates. When a model has as many factor levels as there are subjects (observations) in a large dataset, for example, neither lm() nor the sparse matrix approaches in {Matrix} are computationally feasible. That suggests felm() may offer little advantage for datasets with a relatively small number of factor levels.

The single-factor case, likewise, does not appear to call for felm(), as the factor can be eliminated through the within-groups transformation. It is the case of two or more factors in the presence of non-factor covariates that felm() is intended to address. It does so by "projecting out" the factors given after the | in the formula before estimating the remaining coefficients. As can be seen in the following reprex, the effect is to omit the coefficients for the factor (categorical) variables from the reported model, leaving only the non-factor covariates. Compared to the full model, the projected model has only as many coefficients as there are non-factor variables, corresponding to fewer model degrees of freedom (2 instead of 33 in the F-statistics below); the residual degrees of freedom (966) are the same.

library(lfe)
#> Loading required package: Matrix
## Simulate data
set.seed(42)
n <- 1e3

d <- data.frame(
  # Covariates
  x1 = rnorm(n),
  x2 = rnorm(n),
  # Individuals and firms
  id = factor(sample(20, n, replace = TRUE)),
  firm = factor(sample(13, n, replace = TRUE)),
  # Noise
  u = rnorm(n)
)

# Effects for individuals and firms
id.eff <- rnorm(nlevels(d$id))
firm.eff <- rnorm(nlevels(d$firm))

# Left hand side
d$y <- d$x1 + 0.5 * d$x2 + id.eff[d$id] + firm.eff[d$firm] + d$u

## Estimate the model and print the results
est <- felm(y ~ x1 + x2 | id + firm, data = d)
summary(est)
#> 
#> Call:
#>    felm(formula = y ~ x1 + x2 | id + firm, data = d) 
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.3751 -0.6768  0.0088  0.6883  2.7803 
#> 
#> Coefficients:
#>    Estimate Std. Error t value Pr(>|t|)    
#> x1  1.04326    0.03228   32.32   <2e-16 ***
#> x2  0.49041    0.03254   15.07   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.005 on 966 degrees of freedom
#> Multiple R-squared(full model): 0.7539   Adjusted R-squared: 0.7455 
#> Multiple R-squared(proj model): 0.5696   Adjusted R-squared: 0.5549 
#> F-statistic(full model):89.69 on 33 and 966 DF, p-value: < 2.2e-16 
#> F-statistic(proj model): 639.2 on 2 and 966 DF, p-value: < 2.2e-16
# Compare with lm
summary(lm(y ~ x1 + x2 + id + firm - 1, data = d))
#> 
#> Call:
#> lm(formula = y ~ x1 + x2 + id + firm - 1, data = d)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.3751 -0.6768  0.0088  0.6883  2.7803 
#> 
#> Coefficients:
#>        Estimate Std. Error t value Pr(>|t|)    
#> x1      1.04326    0.03228  32.319  < 2e-16 ***
#> x2      0.49041    0.03254  15.072  < 2e-16 ***
#> id1     3.74166    0.17650  21.199  < 2e-16 ***
#> id2     0.96200    0.17927   5.366 1.01e-07 ***
#> id3     1.02686    0.20249   5.071 4.74e-07 ***
#> id4     2.13960    0.17190  12.447  < 2e-16 ***
#> id5     1.12131    0.17503   6.406 2.32e-10 ***
#> id6     0.85863    0.18845   4.556 5.87e-06 ***
#> id7     0.85256    0.17839   4.779 2.03e-06 ***
#> id8     1.25744    0.18396   6.835 1.45e-11 ***
#> id9    -0.95332    0.19765  -4.823 1.64e-06 ***
#> id10    0.50332    0.18943   2.657 0.008014 ** 
#> id11    1.29660    0.18697   6.935 7.44e-12 ***
#> id12    2.00367    0.17489  11.457  < 2e-16 ***
#> id13   -0.02849    0.20090  -0.142 0.887257    
#> id14    0.66788    0.18563   3.598 0.000337 ***
#> id15   -0.07461    0.17510  -0.426 0.670153    
#> id16    1.51743    0.17799   8.525  < 2e-16 ***
#> id17    2.10649    0.18372  11.466  < 2e-16 ***
#> id18    1.18966    0.17464   6.812 1.69e-11 ***
#> id19    1.34483    0.18893   7.118 2.13e-12 ***
#> id20   -1.20084    0.18328  -6.552 9.21e-11 ***
#> firm2  -1.50725    0.17093  -8.818  < 2e-16 ***
#> firm3  -1.87472    0.17236 -10.877  < 2e-16 ***
#> firm4  -1.24848    0.16611  -7.516 1.29e-13 ***
#> firm5  -0.74181    0.15959  -4.648 3.81e-06 ***
#> firm6   0.11010    0.16544   0.665 0.505893    
#> firm7  -1.01232    0.16797  -6.027 2.37e-09 ***
#> firm8  -2.48896    0.16741 -14.868  < 2e-16 ***
#> firm9  -1.52025    0.16137  -9.421  < 2e-16 ***
#> firm10 -1.31793    0.15813  -8.334 2.66e-16 ***
#> firm11 -1.14281    0.15977  -7.153 1.68e-12 ***
#> firm12 -0.60866    0.17645  -3.449 0.000586 ***
#> firm13 -1.28568    0.16513  -7.786 1.78e-14 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.005 on 966 degrees of freedom
#> Multiple R-squared:  0.7542, Adjusted R-squared:  0.7455 
#> F-statistic: 87.17 on 34 and 966 DF,  p-value: < 2.2e-16

Created on 2023-05-23 with reprex v2.0.2
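
The within-groups transformation and the idea of "projecting out" factors mentioned above can be illustrated in base R. This is only a minimal sketch on simulated data, not lfe's actual implementation: all names (g, f, demean2, and so on) are made up for the illustration, and the simple alternating-demeaning loop is an assumption about how one might mimic the projection for two factors. In both cases the slope from the demeaned regression should agree with the slope from the dummy-variable lm() fit.

```r
# Simulated data; every name below is illustrative, not from the reprex above
set.seed(1)
n <- 500
g <- factor(sample(10, n, replace = TRUE))  # first grouping factor
f <- factor(sample(7, n, replace = TRUE))   # second grouping factor
x <- rnorm(n)
g_eff <- rnorm(nlevels(g))
f_eff <- rnorm(nlevels(f))

## One factor: within-groups transformation
y1 <- 2 * x + g_eff[g] + rnorm(n)
y1_w <- y1 - ave(y1, g)   # subtract the group mean from each observation
x_w  <- x - ave(x, g)
b_within <- coef(lm(y1_w ~ x_w - 1))[["x_w"]]
b_dummy1 <- coef(lm(y1 ~ x + g))[["x"]]

## Two factors: alternate the demeaning until the values stop changing
demean2 <- function(v, g1, g2, tol = 1e-10, max_iter = 1000) {
  for (i in seq_len(max_iter)) {
    v_new <- v - ave(v, g1)          # remove means of the first factor
    v_new <- v_new - ave(v_new, g2)  # then means of the second factor
    converged <- max(abs(v_new - v)) < tol
    v <- v_new
    if (converged) break
  }
  v
}
y2 <- 2 * x + g_eff[g] + f_eff[f] + rnorm(n)
y2_d <- demean2(y2, g, f)
x_d  <- demean2(x, g, f)
b_proj   <- coef(lm(y2_d ~ x_d - 1))[["x_d"]]
b_dummy2 <- coef(lm(y2 ~ x + g + f))[["x"]]

# Compare the demeaned and dummy-variable estimates
all.equal(b_within, b_dummy1)
all.equal(b_proj, b_dummy2, tolerance = 1e-6)
```

Note that the standard errors from the demeaned lm() fits are not directly comparable, because lm() does not know how many degrees of freedom the demeaning used up.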

Thank you so much for your thorough and informative response. My factor variable takes only four values. The main reason I chose to work with lfe::felm is that I have a two-way fixed-effects model that controls for time and location.

I was not aware of this particular application of lfe that you explained above. Thank you for that.
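
For a setting like this one, with only a handful of time and location levels, a plain lm() with the factors entered as dummy variables may be enough. The sketch below uses simulated data with hypothetical names (d2, time, location); it is not the poster's data or model. It relies only on lm()'s default behavior: with an intercept and treatment contrasts, the first level of each factor becomes the reference level and gets no coefficient of its own.

```r
# Simulated two-way panel; all names and values are hypothetical
set.seed(2)
n <- 200
d2 <- data.frame(
  x        = rnorm(n),
  time     = factor(sample(2001:2004, n, replace = TRUE)),
  location = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
d2$y <- 1.5 * d2$x + as.integer(d2$time) + as.integer(d2$location) + rnorm(n)

# Keep the intercept so that one level per factor is absorbed as the reference
fit <- lm(y ~ x + time + location, data = d2)

# Coefficient names: the first level of each factor (2001 and A) is absent
names(coef(fit))
```

The reference level can be changed with relevel() if a different baseline is easier to interpret.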
