# Beginner desperate for feedback :) Difference in differences

Hello! I am doing some research on patent registration numbers of a specific industry across time, and want to examine the effect of a particular law implemented in 2017 on the patent registration numbers. I have annual patent registration numbers for the years 1994 to 2022. For each year, I have split this annual patent registration number into 2 groups, registered by an SME or by a large corp, and want to see if the regulation had an impact on the annual patents registered for each group, and if there is a significant difference in the differences of the two groups. Therefore, I thought the suitable test to run was a difference in differences regression. My dataframe has 3 columns: the first is called "Annual_Patent" and is the dependent variable. The second is called "Time" and is a binary variable with 0 for before 2017 and 1 for after 2017. The third column is called "Group" and has a 0 for a measurement for an SME and 1 for a large corp. This is the code I used, which is mostly ChatGPT generated, but I am not 100% confident it is doing what I want it to do. Any feedback would be greatly appreciated!!



sep = ";")



# Difference-in-Differences Regression

reg_exp <- "Annual_Patent ~ Time + Group + Time:Group"

model <- lm(reg_exp, data = df)

# Get regression results

summary_model <- summary(model)

r_squared <- summary_model$r.squared

adj_r_squared <- summary_model$adj.r.squared

f_statistic <- summary_model$fstatistic[1]

f_p_value <- pf(f_statistic, summary_model$fstatistic[2], summary_model$fstatistic[3], lower.tail = FALSE)

# Extract coefficients and p-values

coef_table <- as.data.frame(coef(summary_model))

coef_table$p_value <- coef_table[, "Pr(>|t|)"]

# Print regression results

print(summary_model)

cat("\nR-squared:", r_squared)

cat("\nF-statistic:", f_statistic)

cat("\np-value (F-statistic):", f_p_value)

print(coef_table)



is not provided, which prevents the code from operating as a reprex (see the FAQ). I can't tell what "annual patents registered for each group" refers to. Does that mean for any given year that there are as many rows as patents issued? Or are there just the 29 years with the number of patents issued, once for SME and once for large, making 58 rows?

Hi! Thank you for your reply! I am adding data from the dataframe used at the end of this comment.

As for the structure of the dataframe, I have collected annual patent registration numbers for 250 companies. So for each company I have the number of patents they registered in 1994, and then in 1995, and so on until 2022. This annual number of patents registered is in the column "Annual_Patent". Then the Time column refers to whether this measurement was made before 2017 (=0) or during 2017 and onwards (=1). The Group column then shows if that company is an SME (=0) or a large corp (=1).
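A minimal sketch of how the two indicator columns described above could be built from a long-format panel with one row per company per year. The `panel` data frame, the `Year`, `Company`, and `Size` columns, and the three example companies are hypothetical; they are not taken from the actual data.

```r
# Hypothetical panel: one row per company per year (1994-2022)
panel <- expand.grid(Year = 1994:2022,
                     Company = c("A", "B", "C"),
                     stringsAsFactors = FALSE)

# Hypothetical size classification for each example company
size <- c(A = "SME", B = "Large", C = "SME")
panel$Size <- unname(size[panel$Company])

# Time: 0 before 2017, 1 for 2017 and onwards
panel$Time <- as.integer(panel$Year >= 2017)

# Group: 0 for an SME, 1 for a large corp
panel$Group <- as.integer(panel$Size == "Large")

# Cross-tabulate to check how many rows fall in each Time x Group cell
table(Time = panel$Time, Group = panel$Group)
```

The cross-tabulation is a quick way to confirm that all four Time/Group combinations are populated before fitting the regression.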

Thank you again!

Annual_Patent = c(7, 25, 19, 25, 39, 53, 67, 69, 59, 68, 112, 168, 194, 214, 299, 348, 97, 983, 193, 230, 331, 344, 197, 126, 90, 115, 86, 144, 177, 6, 4, 36, 41, 43, 81, 117, 100, 109, 146, 202, 262, 272, 274, 260, 201, 173, 200, 145, 130, 166),

Time = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),

Group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

Did the new law apply to only one of the two groups? That's what you would need for this to be a dif-in-dif.

The results seem to say that nothing is statistically significant.

No, the regulation was applied for both groups (SMEs and large corps). However, there is literature suggesting that depending on the nature of the regulation, the impact it can have on innovation varies depending on the size of the firm we investigate. I am looking into whether this regulation had a bigger/smaller or different impact on innovation on companies of different sizes. And I am using patent registration numbers as a proxy for innovation!

That makes sense. The data you posted doesn't show anything mattering (but it was perfect for helping to comment). Since you have a lot more data maybe your complete data will show something.

If chatGPT wrote this code, it did pretty well. It's probably not the best possible code, but it should give the right answer.
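One way to check that the regression does what a difference in differences should is to compare the `Time:Group` coefficient with the same quantity computed by hand from the four cell means. This is a sketch using only the 50 rows posted above (with `rep()` as shorthand for the posted Time and Group vectors), so the numbers say nothing about the full data set.

```r
df <- data.frame(
  Annual_Patent = c(7, 25, 19, 25, 39, 53, 67, 69, 59, 68, 112, 168, 194, 214,
                    299, 348, 97, 983, 193, 230, 331, 344, 197, 126, 90, 115,
                    86, 144, 177, 6, 4, 36, 41, 43, 81, 117, 100, 109, 146,
                    202, 262, 272, 274, 260, 201, 173, 200, 145, 130, 166),
  Time  = c(rep(0, 23), rep(1, 6), rep(0, 21)),  # same values as the posted vector
  Group = c(rep(1, 25), rep(0, 25))              # same values as the posted vector
)

model <- lm(Annual_Patent ~ Time + Group + Time:Group, data = df)

# Mean of Annual_Patent in each Time x Group cell
cell_means <- with(df, tapply(Annual_Patent,
                              list(Time = Time, Group = Group), mean))

# (after - before) for large corps minus (after - before) for SMEs
did_by_hand <- (cell_means["1", "1"] - cell_means["0", "1"]) -
               (cell_means["1", "0"] - cell_means["0", "0"])

# The interaction coefficient is the regression's estimate of the same thing
did_from_lm <- unname(coef(model)["Time:Group"])

all.equal(unname(did_by_hand), did_from_lm)
```

Because the model has one parameter per cell, the two numbers agree exactly (up to floating-point error).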

The null hypothesis is that the mean number of patents before and after 2017 are identical in one case, and that the mean number of patents by SME and non-SME are similarly identical. We fail to accept the null hypothesis at the 95% level of confidence if the p-value (p adj in the test shown below) exceeds 0.05.

d <- data.frame(
p = c(7, 25, 19, 25, 39, 53, 67, 69, 59, 68, 112, 168, 194, 214, 299, 348, 97, 983, 193, 230, 331, 344, 197, 126, 90, 115, 86, 144, 177, 6, 4, 36, 41, 43, 81, 117, 100, 109, 146, 202, 262, 272, 274, 260, 201, 173, 200, 145, 130, 166),
t = factor(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)),
g = factor(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
)

v_t <-  TukeyHSD(aov(p ~ t, d))
plot(v_t)


v_t
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#>
#> Fit: aov(formula = p ~ t, data = d)
#>
#> $t
#>          diff       lwr      upr     p adj
#> 1-0 -38.56818 -172.1185 94.98216 0.5641901

v_g <- TukeyHSD(aov(p ~ g, d))
plot(v_g)


v_g
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#>
#> Fit: aov(formula = p ~ g, data = d)
#>
#> $g
#>      diff       lwr      upr     p adj
#> 1-0 34.68 -51.83833 121.1983 0.4242508


Created on 2023-06-09 with reprex v2.0.2

Um, typo? Mean "fail to reject?"

The NULL is identical means. We can't accept that the difference is zero. If we could, we'd say "fail to reject." I hold with the old school that declines to use "accept" when referring to either H_0 or H_1. We have no evidence from the test that a difference in means exists at this \alpha. That is being a bit prissy when looking at a population, rather than a sample. Became just boxplot.

I completely agree with you about not saying "accept." I was commenting that a large p-value means "fail to reject" rather than "reject."

It's getting hazy. Are you saying that the means are more or less the same?

In the limited data posted, there wasn't a statistical difference between the means. But the confidence intervals are very large, so one really can't say much of anything.


Keeping nulls straight is the thing I have the most difficulty with, it seems. You're right: no significant difference in means.

d <- data.frame(
p = c(7, 25, 19, 25, 39, 53, 67, 69, 59, 68, 112, 168, 194, 214, 299, 348, 97, 983, 193, 230, 331, 344, 197, 126, 90, 115, 86, 144, 177, 6, 4, 36, 41, 43, 81, 117, 100, 109, 146, 202, 262, 272, 274, 260, 201, 173, 200, 145, 130, 166),
t = factor(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)),
g = factor(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
)

# 1. Between levels of t
t_test <- t.test(p ~ t, data = d)
print(t_test)
#>
#>  Welch Two Sample t-test
#>
#> data:  p by t
#> t = 1.3773, df = 39.062, p-value = 0.1763
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#>  -18.06941  95.20578
#> sample estimates:
#> mean in group 0 mean in group 1
#>        161.5682        123.0000

# 2. Between levels of g
g_test <- t.test(p ~ g, data = d)
print(g_test)
#>
#>  Welch Two Sample t-test
#>
#> data:  p by g
#> t = -0.80594, df = 31.623, p-value = 0.4263
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#>  -122.37102   53.01102
#> sample estimates:
#> mean in group 0 mean in group 1
#>          139.60          174.28

# 3. Between the combinations of t and g
interaction_test <- aov(p ~ t * g, data = d)
summary(interaction_test)
#>             Df  Sum Sq Mean Sq F value Pr(>F)
#> t            1    7854    7854   0.328  0.570
#> g            1   12670   12670   0.529  0.471
#> t:g          1    4455    4455   0.186  0.668
#> Residuals   46 1101023   23935


Created on 2023-06-09 with reprex v2.0.2

The script performs several statistical tests and an analysis of variance (ANOVA) to explore the relationships between variables in the provided dataset. Let's interpret the results of each test:

1. Between levels of t:
The code performs a t-test to compare the values of variable p between two levels of t (0 and 1). The output of the t.test function provides information about the test statistic, the p-value, and confidence intervals. Without the actual output, it is difficult to provide specific interpretations. However, in general, if the p-value is less than a chosen significance level (e.g., 0.05), it suggests that there is a statistically significant difference between the two levels of t in terms of variable p.

2. Between levels of g:
Similar to the previous test, this code performs a t-test to compare the values of p between two levels of g (0 and 1). The interpretation of the results is the same as in the previous test. The output of t.test provides the test statistic, p-value, and confidence intervals to assess the statistical significance of the difference between the levels of g in terms of variable p.

3. Between the combinations of t and g:
In this case, the code performs an analysis of variance (ANOVA) using the aov function to examine the interaction effect between t and g on variable p. The summary function is then used to obtain the ANOVA table with relevant statistics such as the F-value and p-value. Without the actual output, it is challenging to provide a specific interpretation. However, the ANOVA table allows you to assess whether there are significant interactions between t and g on the dependent variable p.

Overall, the script aims to analyze the relationships between the variables p, t, and g using t-tests and an ANOVA. The interpretation of the results depends on the specific output generated by the script, including test statistics, p-values, and confidence intervals.

Based on the provided output, let's interpret the results of the Welch Two Sample t-test and the ANOVA interaction test:

1. Welch Two Sample t-test:
The t-test compares the means of two groups (g = 0 and g = 1) for the variable p. Here is the interpretation of the output:
• t-value: -0.80594
• Degrees of freedom (df): 31.623
• p-value: 0.4263

The null hypothesis in this case is that there is no difference in means between the two groups. Since the p-value (0.4263) is greater than the chosen significance level (e.g., 0.05), we do not have enough evidence to reject the null hypothesis. This suggests that there is no statistically significant difference in the means of variable p between group 0 and group 1.

2. ANOVA Interaction Test:
The ANOVA test examines the interaction effect between t and g on the dependent variable p. Here is the interpretation of the output:
• F-value and p-value for t: F = 0.328, p = 0.570
• F-value and p-value for g: F = 0.529, p = 0.471
• F-value and p-value for interaction t:g: F = 0.186, p = 0.668

For all three terms (t, g, and t:g), the p-values are greater than the chosen significance level (e.g., 0.05). Neither main effect is statistically significant, and there is no statistically significant interaction between t and g. In other words, the data give no evidence that the effect of t on the mean of p differs between the levels of g.

Overall, based on the provided output, there is no evidence to suggest significant differences between groups or interaction effects between t and g in terms of the variable p.
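To make the Welch output for g less of a black box, here is a sketch that recomputes the t statistic and the degrees of freedom by hand from the group means, variances, and sizes, and compares them with what `t.test()` returns. It uses the same 50 posted rows; the helper names (`se2`, `t_by_hand`, `df_by_hand`) are just for illustration.

```r
d <- data.frame(
  p = c(7, 25, 19, 25, 39, 53, 67, 69, 59, 68, 112, 168, 194, 214,
        299, 348, 97, 983, 193, 230, 331, 344, 197, 126, 90, 115,
        86, 144, 177, 6, 4, 36, 41, 43, 81, 117, 100, 109, 146,
        202, 262, 272, 274, 260, 201, 173, 200, 145, 130, 166),
  g = factor(c(rep(1, 25), rep(0, 25)))  # same values as the posted vector
)

x0 <- d$p[d$g == "0"]
x1 <- d$p[d$g == "1"]

# Squared standard error contributed by each group: variance / sample size
se2 <- c(var(x0) / length(x0), var(x1) / length(x1))

# Welch t statistic: difference in means over the combined standard error
t_by_hand <- (mean(x0) - mean(x1)) / sqrt(sum(se2))

# Welch-Satterthwaite approximation to the degrees of freedom
df_by_hand <- sum(se2)^2 / sum(se2^2 / c(length(x0) - 1, length(x1) - 1))

g_test <- t.test(p ~ g, data = d)
all.equal(t_by_hand, unname(g_test$statistic))
all.equal(df_by_hand, unname(g_test$parameter))
```

The non-integer degrees of freedom in the output come from this approximation, which is what lets the Welch test drop the equal-variance assumption.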

Thank you so much for providing the additional Welch Two Sample t-test and analysis of variance!

I see the benefit of conducting these analyses to better understand the relationship between the variables. However, what I don't understand is what the difference is between using this approach and the difference in differences.

Thank you again, really appreciate all the help!

Thank you! Appreciate the help!

Quite often, using a regression model (as in a dif-in-dif) and doing ANOVA are just different ways of thinking about the same problem and give the same statistical results.
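As a sketch of that equivalence on the posted 50 rows: the p-value for the interaction coefficient in `lm()` matches the p-value for `t:g` in the `aov()` table, and the ANOVA F value is the square of the regression t value. This holds here because the interaction is a single-degree-of-freedom term entered last.

```r
d <- data.frame(
  p = c(7, 25, 19, 25, 39, 53, 67, 69, 59, 68, 112, 168, 194, 214,
        299, 348, 97, 983, 193, 230, 331, 344, 197, 126, 90, 115,
        86, 144, 177, 6, 4, 36, 41, 43, 81, 117, 100, 109, 146,
        202, 262, 272, 274, 260, 201, 173, 200, 145, 130, 166),
  t = factor(c(rep(0, 23), rep(1, 6), rep(0, 21))),  # same values as posted
  g = factor(c(rep(1, 25), rep(0, 25)))              # same values as posted
)

fit_lm  <- lm(p ~ t * g, data = d)
fit_aov <- aov(p ~ t * g, data = d)

# Interaction row of the regression coefficient table
lm_row <- coef(summary(fit_lm))["t1:g1", ]

# Interaction row (third term) of the ANOVA table
aov_tab <- summary(fit_aov)[[1]]
f_aov <- aov_tab[["F value"]][3]
p_aov <- aov_tab[["Pr(>F)"]][3]

all.equal(unname(lm_row["t value"])^2, f_aov)  # F equals t squared
all.equal(unname(lm_row["Pr(>|t|)"]), p_aov)   # identical p-values
```

The main-effect rows generally do not match one-to-one, since `aov()` tests terms sequentially while each `lm()` coefficient is tested with all other terms in the model.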

One advantage of the dif-in-dif is that the estimated coefficients tell you about the size of an effect, not just its statistical significance.
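A short sketch of reading the size of the effect off the regression: the `Time:Group` estimate is in units of patents per company-year, and `confint()` gives a range of plausible values around it. Again this uses only the 50 posted rows, so the interval mainly illustrates how wide the uncertainty is in such a small sample.

```r
df <- data.frame(
  Annual_Patent = c(7, 25, 19, 25, 39, 53, 67, 69, 59, 68, 112, 168, 194, 214,
                    299, 348, 97, 983, 193, 230, 331, 344, 197, 126, 90, 115,
                    86, 144, 177, 6, 4, 36, 41, 43, 81, 117, 100, 109, 146,
                    202, 262, 272, 274, 260, 201, 173, 200, 145, 130, 166),
  Time  = c(rep(0, 23), rep(1, 6), rep(0, 21)),  # same values as posted
  Group = c(rep(1, 25), rep(0, 25))              # same values as posted
)

model <- lm(Annual_Patent ~ Time + Group + Time:Group, data = df)

# Point estimate: how much more (or less) large corps changed after 2017
# than SMEs did, in patents per company-year
did_estimate <- coef(model)["Time:Group"]

# 95% confidence interval for that difference in differences
did_ci <- confint(model, "Time:Group", level = 0.95)

did_estimate
did_ci
```

An interval that spans zero, as it does for this sample, is the regression's way of saying the same thing as the non-significant interaction in the ANOVA table.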
