Regression Discontinuity Design

I am having a difficult time coming up with a functional form for an RDD I want to run. I am fairly inexperienced with RStudio, and even more so with RDD in RStudio. Basically, I have a list of salesmen who can sell two products, product A and product B. As of 01/01/2018, these salesmen receive bonuses for selling over a certain dollar amount of product A (let's say the threshold is $100,000). For each salesman, I have sales of each product for the three months before 01/01/2018 and the three months after.

What I would like to do is run an RDD to see if the bonus program is affecting the sales of each product. What would be the best functional form to do this with? Once I have that, what is the best way to run the RDD?

I would check out

In particular, Calonico, Cattaneo and Titiunik (2015): rdrobust: An R Package for Robust Nonparametric Inference in Regression-Discontinuity Designs, R Journal 7(1): 38-51.


There's a package, rdd, which I haven't used, that implements regression discontinuity design. Before plunging in, however, I'd recommend some warm-up exercises.

When you have quantitative variables, even with categorical variables thrown in, it often pays to start out with ordinary least squares regression.

First, set up your data structure. These are the columns in a data frame or tibble called df:

sales_id <chr>
prodA_4Qv <dbl>
prodB_4Qv <dbl>
bonus <dbl>
prodA_1Qv <dbl>
prodB_1Qv <dbl>
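
As a rough sketch, a data frame with these columns could be built in base R like this. The values are made up purely for illustration, and I am assuming that bonus holds the dollar bonus paid out, which the column list does not actually specify.

```r
# Toy data with made-up values; replace with your own records.
df <- data.frame(
  sales_id  = c("s01", "s02", "s03", "s04"),
  prodA_4Qv = c(80000, 95000, 120000, 60000),   # product A volume, 3 months before
  prodB_4Qv = c(40000, 50000, 30000, 70000),    # product B volume, 3 months before
  bonus     = c(500, 0, 1500, 0),               # assumed meaning: bonus paid, in dollars
  prodA_1Qv = c(105000, 99000, 130000, 62000),  # product A volume, 3 months after
  prodB_1Qv = c(35000, 52000, 28000, 69000),    # product B volume, 3 months after
  stringsAsFactors = FALSE
)

# Confirm the column types match the layout above (chr for the id, num elsewhere)
str(df)
```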

Populate df with your data and do a little exploratory data analysis:

  1. What are the 4Q and 1Q product ratios (product A to product B) for each sales_id?
  2. What is the difference between the ratios?
  3. If there's no difference, are you still interested in the effect of the bonus?
  4. If there is a difference, what test statistic and p-value are appropriate for testing whether the difference is due to chance? (More formally, the null hypothesis is that the mean difference between the ratios is zero.) Depending on the number of observations, you might use Student's t, for example, if you only have 20 or so sales_id records.
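
The exploratory steps above might look something like the following sketch. The data are simulated, the column names ratio_4Q, ratio_1Q and ratio_diff are my own, and I am assuming a paired t-test is reasonable because each sales_id is measured in both periods. Check that assumption against your own data.

```r
set.seed(1)
n <- 20

# Simulated volumes for 20 hypothetical salesmen
df <- data.frame(
  sales_id  = sprintf("s%02d", 1:n),
  prodA_4Qv = runif(n, 50000, 150000),
  prodB_4Qv = runif(n, 50000, 150000),
  prodA_1Qv = runif(n, 50000, 150000),
  prodB_1Qv = runif(n, 50000, 150000)
)

# Steps 1 and 2: product ratios per salesman and their difference
df$ratio_4Q   <- df$prodA_4Qv / df$prodB_4Qv
df$ratio_1Q   <- df$prodA_1Qv / df$prodB_1Qv
df$ratio_diff <- df$ratio_1Q - df$ratio_4Q
summary(df$ratio_diff)

# Step 4: paired t-test of the null that the mean difference is zero
tt <- t.test(df$ratio_1Q, df$ratio_4Q, paired = TRUE)
tt
```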

OK, assuming there is a statistically significant difference, let's develop two models.

The first model is trivial f(x) = y, where x is the volume of product A and y is the bonus, which illustrates that we don't really need to model at all for the first half of the data.

The second model has psr, the product sales ratio, as the response variable, with bp (the prior bonus) and bc (the current bonus) as predictors.

fit <- lm(psr ~ bp + bc, data = df)

You might also try adding an interaction term (in an R formula, bp:bc is the interaction alone, while bp*bc is shorthand for bp + bc + bp:bc)

fit2 <- lm(psr ~ bp + bc + bp:bc, data = df)

or just the interaction

fit3 <- lm(psr ~ bp:bc, data = df)
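
For reference, here is a quick way to see what each formula operator expands to. The data are simulated and the coefficients used to generate psr are arbitrary; only the variable names psr, bp and bc come from the models above.

```r
set.seed(42)
n <- 30

# Simulated prior bonus, current bonus, and product sales ratio
df <- data.frame(bp = runif(n, 0, 2000), bc = runif(n, 0, 2000))
df$psr <- 1 + 0.0002 * df$bp + 0.0003 * df$bc + rnorm(n, sd = 0.1)

# Term labels R generates for each formula
attr(terms(psr ~ bp * bc), "term.labels")          # "bp" "bc" "bp:bc"
attr(terms(psr ~ bp + bc + bp:bc), "term.labels")  # "bp" "bc" "bp:bc"
attr(terms(psr ~ bp:bc), "term.labels")            # "bp:bc"

# The * form and the explicit form fit the same model
fit_star  <- lm(psr ~ bp * bc, data = df)
fit_colon <- lm(psr ~ bp + bc + bp:bc, data = df)
all.equal(coef(fit_star), coef(fit_colon))
```

Dropping the main effects and keeping only bp:bc is a strong restriction, so it is worth comparing it against the full model before relying on it.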

Armed with these results, I think you'll have a better idea of how to construct an RDD model.
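
For orientation, here is a minimal sketch of one common RDD specification: a local linear regression with separate slopes on either side of the cutoff, in base R. Everything in it is simulated and hypothetical. x stands for a running variable centered at the cutoff, D is the treatment indicator, h is an arbitrarily chosen bandwidth, and the true jump is set to 2. It is not a substitute for a dedicated package such as rdrobust, which selects the bandwidth from the data.

```r
set.seed(123)
n <- 2000

x <- runif(n, -1, 1)                           # running variable, cutoff at 0
D <- as.integer(x >= 0)                        # 1 at or above the cutoff, else 0
y <- 1 + 0.5 * x + 2 * D + rnorm(n, sd = 0.3)  # simulated outcome, true jump = 2
sim <- data.frame(y = y, x = x, D = D)

h <- 0.5                                       # arbitrary bandwidth, for illustration only
window_df <- subset(sim, abs(x) <= h)          # keep observations near the cutoff

# D * x expands to D + x + D:x, so the slope can differ on each side
rd_fit <- lm(y ~ D * x, data = window_df)

# The coefficient on D estimates the discontinuity at the cutoff
coef(rd_fit)["D"]
```

Which variable plays the role of x (the date, or product A sales relative to the $100,000 threshold) depends on how the question is framed, so treat the names here as placeholders.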


Thank you so much for your incredibly detailed answer! A couple of follow-up questions:

Let's say that I wanted my dependent variable to be a ratio (product A/product B), is that possible, and would I have any issues with multicollinearity?

Since I have essentially two data sets, one for the first time period and one for the second, does it make sense to include them both in the model? If so, how would you go about doing that? For example, I was thinking:

Y = B0 + B1*X1 + B2*X2 + e

where Y is product A / product B in period 2, X1 is product A sales in period 2 minus product A sales in period 1, and X2 is a 0/1 dummy variable for whether the salesman met the bonus threshold.

Am I thinking about this correctly at all? Basically, I want to know what effect, if any, the institution of the bonus program had on the composition of sales between product A and product B, given the data before and after the bonus program was implemented.

Well, sure, that's what I was trying to convey with the product sales ratio psr in the model. I chose a ratio to offset the effect of overall volume changes due to market conditions, etc.

You definitely need to check for collinearity issues.

It doesn't hurt to add in the delta between volumes, but I'd scale it, I think, since what you're looking for in the model initially is variability. And, of course, the error term \epsilon is ever present, but it is, in effect, a model output shown by the residuals.
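
A minimal sketch of the proposed model with a scaled delta and a hand-computed collinearity check might look like this. The data are simulated, and the names ratio2, delta_A and met_bonus are placeholders of mine. The variance inflation factor is computed directly as 1 / (1 - R^2) rather than with an add-on package.

```r
set.seed(7)
n <- 40

A1 <- runif(n, 50000, 150000)   # product A sales, period 1 (simulated)
A2 <- runif(n, 50000, 150000)   # product A sales, period 2 (simulated)
B2 <- runif(n, 50000, 150000)   # product B sales, period 2 (simulated)

d <- data.frame(
  ratio2    = A2 / B2,                      # response: period 2 product ratio
  delta_A   = as.numeric(scale(A2 - A1)),   # change in product A, centered and scaled
  met_bonus = as.integer(A2 > 100000)       # 1 if the assumed threshold was met
)

fit <- lm(ratio2 ~ delta_A + met_bonus, data = d)
summary(fit)

# With two predictors, both share the same VIF: 1 / (1 - R^2)
# from regressing one predictor on the other.
r2  <- summary(lm(delta_A ~ met_bonus, data = d))$r.squared
vif <- 1 / (1 - r2)
vif
```

Because period 2 sales of product A enter the response and both predictors, some mechanical correlation between them is to be expected.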

I'm not sure you need the dummy, because that is already captured in B1, right?

I think you're on the right track.

