COMPUTING DIFFERENCE-IN-DIFFERENCES ESTIMATE between two years

Context: I am studying the effect of municipal mergers on public service delivery costs.
Using data from 2007 (the pre-treatment period) and 2008 (the first year of the post-treatment period), I would like to compute the difference-in-differences estimate of the effect of "municipal mergers" on public service delivery costs.

Does anyone know the relevant formula to compute this?

Hi @Hash, and welcome to the RStudio Community :partying_face:

Quick tip: whenever you ask a coding question here, it helps the people reading it (your potential helpers) if you also include your data (or, in case you can't share it, any data that has the same features as your original data). I highly suggest you take a look at this awesome article: FAQ: How to do a minimal reproducible example (reprex) for beginners

Having said that, the difference-in-differences (DID) in R is actually fairly simple to implement. It requires just a bit of manipulation of your data and the standard lm() function. I could have provided help if you had shared a sample dataset :slight_smile:
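
For reference, the regression usually used for a two-group, two-period DID can be written as below. The notation is my own assumption for illustration: D_i is a dummy for belonging to the treatment group and T_t is a dummy for the post-treatment period. Neither symbol comes from your data.

```latex
\[
Y_{it} = \beta_0 + \beta_1 D_i + \beta_2 T_t + \beta_3 (D_i \times T_t) + \varepsilon_{it}
\]
\[
\hat{\beta}_3 =
\left(\bar{Y}_{D=1,\,T=1} - \bar{Y}_{D=1,\,T=0}\right)
- \left(\bar{Y}_{D=0,\,T=1} - \bar{Y}_{D=0,\,T=0}\right)
\]
```

The coefficient on the product term is the DID estimate: the change over time in the treatment group minus the change over time in the control group.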

Hi Gueyenono, thank you for the reply.

The data set is quite large and is a CSV file, so I am not able to upload it. Is there any other way I could share it?

Kind regards.

@Hash,

Yes, you can share just a subset of the data. Let's assume that you import the data into a variable called mydata. Run the code dput(mydata[1:50, ]) and paste the result here. This will be only the first 50 rows of your dataset.

dput(sample[1:50, ])
structure(list(year = c(2005L, 2006L, 2007L, 2008L, 2009L, 2010L,
2011L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2005L,
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2005L, 2006L, 2007L,
2008L, 2009L, 2010L, 2011L, 2005L, 2006L, 2007L, 2008L, 2009L,
2010L, 2011L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L,
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2005L), Y = c(4173.217045,
4277.996451, 4319.290767, 4549.83007, 4435.450694, 4280.088368,
4020.781806, 5877.274684, 5976.119041, 6014.478399, 6216.265093,
6236.914349, 6114.028861, 5889.205762, 6079.081436, 6205.04293,
6146.156668, 6404.103999, 6459.669141, 6366.698522, 6068.932465,
6046.077215, 6147.887524, 6109.346106, 6361.920033, 6372.34715,
6235.21232, 6079.461274, 5685.307721, 5647.604075, 5694.551862,
5985.826031, 6017.33036, 5964.862862, 5760.725342, 5704.078622,
5431.702292, 5582.624809, 5883.925628, 5832.585978, 5891.687208,
5638.702178, 5869.447414, 5945.162792, 5954.481229, 6159.511579,
6189.853063, 6019.20501, 5841.450154, 6081.946856), municipality = c("mu_1",
"mu_1", "mu_1", "mu_1", "mu_1", "mu_1", "mu_1", "mu_2", "mu_2",
"mu_2", "mu_2", "mu_2", "mu_2", "mu_2", "mu_3", "mu_3", "mu_3",
"mu_3", "mu_3", "mu_3", "mu_3", "mu_4", "mu_4", "mu_4", "mu_4",
"mu_4", "mu_4", "mu_4", "mu_5", "mu_5", "mu_5", "mu_5", "mu_5",
"mu_5", "mu_5", "mu_6", "mu_6", "mu_6", "mu_6", "mu_6", "mu_6",
"mu_6", "mu_7", "mu_7", "mu_7", "mu_7", "mu_7", "mu_7", "mu_7",
"mu_8"), region = c("re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1"), treatment = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L)), row.names = c(NA, 50L), class = "data.frame")

Here are links to a few resources on Difference-in-Differences in R:

@Hash It turns out that the chunk of the dataset that I asked you to share is not very representative of the whole dataset. For example, the region variable has a single value: re_re_1. The treatment variable as well has a single value: 0.
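
One way around this is to share a random draw of rows instead of the first 50. The sketch below uses a made-up stand-in for mydata (20 municipalities, the first 10 treated), so the column values are assumptions and not your real data:

```r
# Hypothetical stand-in for the real data: 20 municipalities over 7 years,
# with the first 10 municipalities treated
mydata <- data.frame(
  year = rep(2005:2011, times = 20),
  municipality = rep(paste0("mu_", 1:20), each = 7),
  treatment = rep(c(1L, 0L), each = 70)
)

# The rows are sorted by municipality, so the first 50 cover one group only
unique(mydata[1:50, "treatment"])

# A random draw of 50 rows is far more likely to cover both groups
set.seed(123)  # arbitrary seed so the same rows are drawn each time
rows <- sort(sample(nrow(mydata), 50))
mydata_sample <- mydata[rows, ]
table(mydata_sample$treatment)

# Paste the output of this call into the post
dput(mydata_sample)
```

Checking table() on the key columns before posting is a quick way to confirm that the shared rows include both groups.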

Would you please answer these questions for me to be exactly sure of what you are trying to achieve:

  • Do you want the pre-treatment period to be the year 2007 only or all the years up to 2007 (i.e. 2005, 2006 and 2007)? In the same way, should the post-treatment period be the year 2008 only or all the years from 2008 onward (i.e. 2008, 2009, 2010 and 2011)?

  • What are the control and treatment groups? I suspect that they are in the region column. Do you have a second region (re_re_2 maybe?) which corresponds to the treatment column being equal to 1?
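
To make the first question concrete: the two choices differ only in which rows are kept and how the time dummy is coded. Here is a minimal sketch on a made-up year vector (the names keep, post_two_years and post_all_years are mine, not columns in your data):

```r
# Years present in the shared sample
year <- 2005:2011

# Option 1: compare 2007 with 2008 only
keep <- year %in% c(2007, 2008)
post_two_years <- ifelse(year[keep] == 2008, 1, 0)

# Option 2: keep every year and treat 2008 onward as post-treatment
post_all_years <- ifelse(year >= 2008, 1, 0)

post_two_years
post_all_years
```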

Hi @Hash,

After you provided me with the data privately, I was able to look at it and write code that will help you. Here, I assume that the treatment column identifies the control group (when treatment is 0) and the treatment group (when treatment is 1). I added many comments to the code in order to guide you through the process.

# Read in the full data
cost <- read.csv("diff_in_diff/cost_data.csv.csv")

# Subset the data to keep 2007 and 2008 data only
cost0708 <- cost[cost$year %in% c(2007, 2008), ]

# Create a dummy variable for the time (2007: 0, 2008: 1)
cost0708$time <- ifelse(cost0708$year == 2007, 0, 1)

# Create a variable for the interaction between treatment and time
cost0708$interaction <- cost0708$treatment * cost0708$time

# Run the difference-in-differences estimator (explicit method)
mod_did <- lm(Y ~ treatment + time + interaction, data = cost0708)
summary(mod_did)

Call:
lm(formula = Y ~ treatment + time + interaction, data = cost0708)

Residuals:
     Min       1Q   Median       3Q      Max 
-1590.84  -108.27     3.91   129.22   497.07 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5886.83      48.06 122.501  < 2e-16 ***
treatment    -450.51      59.37  -7.588 2.04e-12 ***
time          253.84      67.96   3.735 0.000256 ***
interaction  -322.76      83.96  -3.844 0.000171 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 263.2 on 170 degrees of freedom
Multiple R-squared:  0.5732,	Adjusted R-squared:  0.5657 
F-statistic: 76.12 on 3 and 170 DF,  p-value: < 2.2e-16

Your variable of interest in this regression output is interaction: its coefficient is your difference-in-differences estimate. The estimate is -322.76, which means that the change in Y from 2007 to 2008 was about 323 lower in the treatment group than in the control group. It has a very low p-value (0.000171), so it is significant at the 1% level. In other words, there is strong evidence in the data that the treatment (the municipal mergers, going by your question) has an impact on the outcome variable Y.
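
If it helps to see where that number comes from, the interaction coefficient can be reproduced by hand from four group means. This is a minimal sketch on made-up data (the toy data frame and its coefficients are invented for illustration, not taken from your file):

```r
# Made-up data, for illustration only: 2 groups x 2 periods, 40 rows each
set.seed(42)
toy <- expand.grid(id = 1:40, treatment = c(0, 1), time = c(0, 1))
toy$Y <- 6000 - 450 * toy$treatment + 250 * toy$time -
  300 * toy$treatment * toy$time + rnorm(nrow(toy), sd = 250)

# Mean of Y in each of the four treatment-by-time cells
# (rows are indexed by treatment, columns by time)
m <- tapply(toy$Y, list(toy$treatment, toy$time), mean)

# (change in treated group) minus (change in control group)
did_by_hand <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# The interaction coefficient from lm() is the same number
did_by_lm <- coef(lm(Y ~ treatment * time, data = toy))[["treatment:time"]]
all.equal(did_by_hand, did_by_lm)
```

The two values agree because, with only the two dummies and their product as regressors, the regression simply fits the four cell means.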

Just for the sake of completeness, there is another way you can run this regression. You do not really need to calculate the interaction variable before running the DID estimator. You can just use the interaction operator * in the lm() function:

# Run the difference-in-differences estimator (implicit method)
mod_did2 <- lm(Y ~ treatment*time, data = cost0708)
summary(mod_did2)

Call:
lm(formula = Y ~ treatment * time, data = cost0708)

Residuals:
     Min       1Q   Median       3Q      Max 
-1590.84  -108.27     3.91   129.22   497.07 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5886.83      48.06 122.501  < 2e-16 ***
treatment       -450.51      59.37  -7.588 2.04e-12 ***
time             253.84      67.96   3.735 0.000256 ***
treatment:time  -322.76      83.96  -3.844 0.000171 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 263.2 on 170 degrees of freedom
Multiple R-squared:  0.5732,	Adjusted R-squared:  0.5657 
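
If you want to convince yourself that the two specifications are the same model, you can compare the coefficients directly. A minimal sketch, again on made-up data (toy is invented for illustration):

```r
# Made-up data, for illustration only
set.seed(1)
toy <- data.frame(
  treatment = rep(c(0, 1), each = 50),
  time = rep(c(0, 1), times = 50)
)
toy$Y <- 6000 - 450 * toy$treatment + 250 * toy$time -
  300 * toy$treatment * toy$time + rnorm(100, sd = 250)
toy$interaction <- toy$treatment * toy$time

# Explicit method: pre-computed interaction column
explicit <- lm(Y ~ treatment + time + interaction, data = toy)

# Implicit method: let the formula expand treatment * time
implicit <- lm(Y ~ treatment * time, data = toy)

# Same estimates; only the label of the last coefficient differs
names(coef(explicit))
names(coef(implicit))
all.equal(unname(coef(explicit)), unname(coef(implicit)))
```

In a formula, treatment * time expands to treatment + time + treatment:time, which is why the only visible difference is the name of the last term.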

I hope this helps you.
