Problem with Clustered SEs

Hi,
So I was trying to replicate results from one of the papers in the JDE.
The issue is that Stata fits the model and corrects for clustered SEs on 32,915 observations, but R fits the same model and corrects for clustered SEs on 34,576 observations.
Because of this, the estimated coefficients differ slightly at the 3rd or 4th decimal place.
I don't understand why the two estimates differ when the mechanics of the correction are the same in both packages.
I am attaching the data set, the results, and the code in Stata and R.
Any help will be appreciated.

library(multiwayvcov)   # for cluster.vcov()
library(clubSandwich)   # for coef_test()
library(lmtest)         # for coeftest()
library(lfe)            # for felm()

data1= subset(master_data_clean, master_data_clean$haryana==1)
model1= lm(latrine~ mboy_post +mboy +post, data= subset(master_data_clean, master_data_clean$haryana==1))
vcov1= cluster.vcov(model1, ~data1$district)
coeftest(model1, vcov1)
summary(model1)
coef_test(model1, vcov = "CR1S", cluster = data1$district, test = "naive-t")
model1_clustered= felm(latrine~ mboy_post +mboy +post | 0 | 0 | district, data=data1)
summary(model1_clustered)

(Screenshots of the Stata and R output were attached here.)

The data set is available at the following link:
https://drive.google.com/file/d/1lEF3QfOPYZ6NtLizsSBhQecmva91kFaZ/view?usp=sharing

Could you please turn this into a self-contained reprex (short for minimal reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

Right now the best way to install reprex is:

# install.packages("devtools")
devtools::install_github("tidyverse/reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

For pointers specific to the community site, check out the reprex FAQ, linked to below.
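
For example, here's a minimal sketch of the workflow (the script name below is just a placeholder for your own file):

library(reprex)
# Option 1: copy your self-contained code to the clipboard, then run:
reprex()
# Option 2: point reprex at a script file:
reprex(input = "my_analysis.R")
# Either way, the rendered code plus output is placed on the clipboard,
# ready to paste into your forum post.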

library(lmtest)
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
library(lfe)
#> Loading required package: Matrix
#> 
#> Attaching package: 'lfe'
#> The following object is masked from 'package:lmtest':
#> 
#>     waldtest
data1= subset(master_data_clean, master_data_clean$haryana==1)
#> Error in subset(master_data_clean, master_data_clean$haryana == 1): object 'master_data_clean' not found
model1= lm(latrine~ mboy_post +mboy +post, data= subset(master_data_clean, master_data_clean$haryana==1))
#> Error in subset(master_data_clean, master_data_clean$haryana == 1): object 'master_data_clean' not found
vcov1= cluster.vcov(model1, ~data1$district)
#> Error in cluster.vcov(model1, ~data1$district): could not find function "cluster.vcov"
coeftest(model1, vcov1)
#> Error in coeftest(model1, vcov1): object 'model1' not found
summary(model1)
#> Error in summary(model1): object 'model1' not found
coef_test(model1, vcov = "CR1S", cluster = data1$district, test = "naive-t")
#> Error in coef_test(model1, vcov = "CR1S", cluster = data1$district, test = "naive-t"): could not find function "coef_test"
sum(master_data_clean$haryana)
#> Error in eval(expr, envir, enclos): object 'master_data_clean' not found
install.packages("clusterSEs")
#> also installing the dependencies 'miscTools', 'bdsmatrix', 'maxLik', 'statmod', 'plm', 'mlogit'
#> package 'miscTools' successfully unpacked and MD5 sums checked
#> package 'bdsmatrix' successfully unpacked and MD5 sums checked
#> package 'maxLik' successfully unpacked and MD5 sums checked
#> package 'statmod' successfully unpacked and MD5 sums checked
#> package 'plm' successfully unpacked and MD5 sums checked
#> package 'mlogit' successfully unpacked and MD5 sums checked
#> package 'clusterSEs' successfully unpacked and MD5 sums checked
#> 
#> The downloaded binary packages are in
#>  C:\Users\Tej\AppData\Local\Temp\RtmpKa9qgx\downloaded_packages

model2= lm(latrine~ as.factor(mboy_post)+ as.factor(mboy) +as.factor(post)+ hhsize+ agehead+ agewife+ educhead+ educwife+ wealth, data= data1)
#> Error in is.data.frame(data): object 'data1' not found
coef_test(model2, vcov = "CR1S", cluster = data1$district, test = "naive-t")
#> Error in coef_test(model2, vcov = "CR1S", cluster = data1$district, test = "naive-t"): could not find function "coef_test"
vcov2= cluster.vcov(model2, ~data1$district)
#> Error in cluster.vcov(model2, ~data1$district): could not find function "cluster.vcov"
coeftest(model2, vcov2)
#> Error in coeftest(model2, vcov2): object 'model2' not found
model1_clustered= felm(latrine~ mboy_post +mboy +post | 0 | 0 | district, data=data1)
#> Error in ..pdata.coerce..(data1): object 'data1' not found
summary(model1_clustered)
#> Error in summary(model1_clustered): object 'model1_clustered' not found
Created on 2018-03-22 by the reprex package (v0.2.0).

I am not an expert in this area, but I know there are a number of ways to calculate robust standard errors.

To find out how these R packages and Stata implement their particular methods, you'll want to check out the documentation.

For the lfe package, check out the docs here: https://cran.r-project.org/web/packages/lfe/lfe.pdf
The line "The package may optionally compute standard errors for the group effects by bootstrapping, but this is a very time and memory-consuming process compared to finding the point estimates." one source of possible discrepancy between the two.

Looking at Stata's documentation https://www.stata.com/manuals13/xtvce_options.pdf
note the line under "Methods and formulas" for the clustered sandwich estimator: "By default, Stata’s maximum likelihood estimators display standard errors based on variance estimates given by the inverse of the negative Hessian (second derivative) matrix. If vce(robust), vce(cluster clustvar), or pweights are specified, standard errors are based on the robust variance estimator..."

I have a feeling you might also be asking, "so how do I replicate Stata's robust SEs in R?"

I might very well be wrong here, but I think this discussion on Cross Validated walks through how to get the same SEs in R.

  • In tchakravarty's reply, note option 3: "To reproduce the Stata default behavior..."

I don't have your data or a Stata license to help you double check this.
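
To make that a bit more concrete, here's a rough sketch of what I believe that approach looks like with the sandwich package. It's untested on your data; model1 and district are the objects/columns from your code above, and the HC1 finite-sample adjustment is my reading of how to approximate Stata's default correction.

library(sandwich)
library(lmtest)

# Cluster-robust covariance matrix, clustered on district.
# type = "HC1" applies a finite-sample degrees-of-freedom correction,
# which (as I understand it) is close to what Stata does by default.
vc <- vcovCL(model1, cluster = ~ district, type = "HC1")
coeftest(model1, vcov = vc)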

library(haven)
master_data_clean <- read_dta("C:/Users/Tej/Desktop/Study Materials/JEP Datasets+R/replication/output/master-data-clean.dta")
library(multiwayvcov)
library(clubSandwich)
library(lmtest)
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
library(lfe)
#> Loading required package: Matrix
#> 
#> Attaching package: 'lfe'
#> The following object is masked from 'package:lmtest':
#> 
#>     waldtest
data1= subset(master_data_clean, master_data_clean$haryana==1)
model1= lm(latrine~ mboy_post +mboy +post, data= subset(master_data_clean, master_data_clean$haryana==1))
vcov1= cluster.vcov(model1, ~data1$district)
coeftest(model1, vcov1)
#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept) 0.2901024  0.0218073 13.3030 < 2.2e-16 ***
#> mboy_post   0.0658099  0.0134400  4.8966 9.796e-07 ***
#> mboy        0.0069083  0.0065960  1.0473 0.2949458    
#> post        0.1417305  0.0388896  3.6444 0.0002684 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model1)
#> 
#> Call:
#> lm(formula = latrine ~ mboy_post + mboy + post, data = subset(master_data_clean, 
#>     master_data_clean$haryana == 1))
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.5046 -0.2970 -0.2901  0.5682  0.7099 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 0.290102   0.004498  64.496  < 2e-16 ***
#> mboy_post   0.065810   0.010428   6.311  2.8e-10 ***
#> mboy        0.006908   0.007001   0.987    0.324    
#> post        0.141731   0.006679  21.221  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4746 on 34572 degrees of freedom
#> Multiple R-squared:  0.03272,    Adjusted R-squared:  0.03264 
#> F-statistic: 389.8 on 3 and 34572 DF,  p-value: < 2.2e-16
coef_test(model1, vcov = "CR1S", cluster = data1$district, test = "naive-t")
#>          Coef Estimate     SE p-val (naive-t) Sig.
#> 1 (Intercept)  0.29010 0.0218         < 0.001  ***
#> 2   mboy_post  0.06581 0.0134         < 0.001  ***
#> 3        mboy  0.00691 0.0066         0.30743     
#> 4        post  0.14173 0.0389         0.00161   **
sum(master_data_clean$haryana)
#> [1] 34576
install.packages("clusterSEs")
#> package 'clusterSEs' successfully unpacked and MD5 sums checked
#> 
#> The downloaded binary packages are in
#>  C:\Users\Tej\AppData\Local\Temp\RtmpCWB669\downloaded_packages

model2= lm(latrine~ as.factor(mboy_post)+ as.factor(mboy) +as.factor(post)+ hhsize+ agehead+ agewife+ educhead+ educwife+ wealth, data= data1)
coef_test(model2, vcov = "CR1S", cluster = data1$district, test = "naive-t")
#>                     Coef  Estimate       SE p-val (naive-t) Sig.
#> 1            (Intercept)  0.018196 0.050059           0.720     
#> 2  as.factor(mboy_post)1  0.068390 0.013280          <0.001  ***
#> 3       as.factor(mboy)1 -0.028986 0.006624          <0.001  ***
#> 4       as.factor(post)1  0.113887 0.035029           0.004   **
#> 5                 hhsize -0.002828 0.002317           0.237     
#> 6                agehead  0.000719 0.000695           0.314     
#> 7                agewife  0.003014 0.000535          <0.001  ***
#> 8               educhead  0.011417 0.002361          <0.001  ***
#> 9               educwife  0.015674 0.001849          <0.001  ***
#> 10                wealth  0.202665 0.008192          <0.001  ***
vcov2= cluster.vcov(model2, ~data1$district)
coeftest(model2, vcov2)
#> 
#> t test of coefficients:
#> 
#>                          Estimate  Std. Error t value  Pr(>|t|)    
#> (Intercept)            0.01819564  0.05005903  0.3635   0.71625    
#> as.factor(mboy_post)1  0.06839049  0.01327998  5.1499 2.621e-07 ***
#> as.factor(mboy)1      -0.02898613  0.00662440 -4.3757 1.214e-05 ***
#> as.factor(post)1       0.11388699  0.03502857  3.2513   0.00115 ** 
#> hhsize                -0.00282751  0.00231716 -1.2202   0.22238    
#> agehead                0.00071881  0.00069532  1.0338   0.30125    
#> agewife                0.00301402  0.00053464  5.6375 1.739e-08 ***
#> educhead               0.01141733  0.00236074  4.8363 1.328e-06 ***
#> educwife               0.01567401  0.00184873  8.4783 < 2.2e-16 ***
#> wealth                 0.20266465  0.00819171 24.7402 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model1_clustered= felm(latrine~ mboy_post +mboy +post | 0 | 0 | district, data=data1)
summary(model1_clustered)
#> 
#> Call:
#>    felm(formula = latrine ~ mboy_post + mboy + post | 0 | 0 | district,      data = data1) 
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.5046 -0.2970 -0.2901  0.5682  0.7099 
#> 
#> Coefficients:
#>             Estimate Cluster s.e. t value Pr(>|t|)    
#> (Intercept) 0.290102     0.021807  13.303  < 2e-16 ***
#> mboy_post   0.065810     0.013440   4.897  9.8e-07 ***
#> mboy        0.006908     0.006596   1.047 0.294946    
#> post        0.141731     0.038890   3.644 0.000268 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4746 on 34572 degrees of freedom
#> Multiple R-squared(full model): 0.03272   Adjusted R-squared: 0.03264 
#> Multiple R-squared(proj model): 0.03272   Adjusted R-squared: 0.03264 
#> F-statistic(full model, *iid*):389.8 on 3 and 34572 DF, p-value: < 2.2e-16 
#> F-statistic(proj model): 39.02 on 3 and 20 DF, p-value: 1.508e-08
Created on 2018-03-22 by the reprex package (v0.2.0).

You might also check out the R package bucky.

The problem seems to be that one of the districts that Stata treats as missing is being treated as valid by R. That district is named "", literally just an empty string. This is an issue with how the data were transferred from Stata format to R. Here's the code that I used to reproduce the Stata output.

dat <- haven::read_dta("master-data-clean.dta")
dat1 <- subset(dat, haryana == 1)
# drop rows whose district label is the empty string "";
# these are the observations that Stata treats as missing
dat1 <- subset(dat1, district != "")
model <- lm(latrine ~ mboy_post + mboy + post, data = dat1)

library(jtools) # for the `summ` function
summ(model, robust = TRUE, robust.type = "HC1", cluster = "district")
#> MODEL INFO:
#> Observations: 32915
#> Dependent Variable: latrine
#> 
#> MODEL FIT: 
#> F(3,32911) = 400.19, p = 0
#> R-squared = 0.04
#> Adj. R-squared = 0.04
#> 
#> Standard errors: Cluster-robust, type = HC1
#>             Est. S.E. t val.    p    
#> (Intercept) 0.28 0.02  12.28 0    ***
#> mboy_post   0.07 0.01   4.78 0    ***
#> mboy        0.01 0.01   0.97 0.33    
#> post        0.15 0.04   3.9  0    ***
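
As a quick sanity check (just a sketch; it assumes the blank district label is the only thing Stata drops), you can count the Haryana rows with an empty district:

sum(dat$haryana == 1 & dat$district == "")
# if the blank label fully explains the gap, this should equal
# 34576 - 32915 = 1661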

There's no specific need for the jtools package; I'm its developer, so I used it because I know how it works and knew that it would work. The robust SEs are calculated with the sandwich package, which you could use directly. You might want to use the digits argument to summ() to compare more decimal places with the Stata output, but it matches exactly on my computer (apart from the degrees of freedom for the F test, which is very low in Stata for reasons I don't understand).

As a side note, in R it is typical to model interactions using the formula syntax (e.g., latrine ~ mboy * post) rather than creating a new variable that is the product of those two variables.
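
For instance, with the variables from above, something like this (just to illustrate the syntax):

# mboy * post expands to both main effects plus their interaction,
# i.e. it is equivalent to latrine ~ mboy + post + mboy:post
model_int <- lm(latrine ~ mboy * post, data = dat1)
summary(model_int)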


Yes, thanks. I also figured it out.

If you figured it out, would you mind either choosing a solution or briefly describing the solution and marking it as such? Thanks.