Linking CRSP and Compustat in R

Hello everyone,

I have a question with regards to combining two data bases with a linking table:
Has anyone ever tried to combine the monthly stock return data table of CRSP with the yearly accounting data tables of Compustat through the CCM table? WRDS only offers a SAS solution, but unfortunately not one for R.
In detail: I have a table with all stock returns month-wise and their respective PERMNO-code (CRSP). I also have the yearly accounting data of these stocks with their GVKEY (Compustat). Obviously, the accounting data is the same for the company during the year and only changes every 12 months. Now I have the linking table which assigns each GVKEY to a PERMNO. I would like to combine them the following way:

Company | Month/Year | Return for this month | Accounting data columns | Stock Return columns for the prior 20 months

Does anyone have experience in this field?
Thanks in advance!

There are two ways of doing it. If you have ccm then use the link permno, if not use cusip. Just translate the sas code into r, it should be very straightforward.

Hi,
so I have to admit, I am not that proficient with R and having troubles translating the code. Maybe just as clues:

  1. Which packages do I need?
  2. I don't fully understand how the SAS code does it (https://wrds-www.wharton.upenn.edu/pages/support/applications/linking-databases/linking-crsp-and-compustat/). I have all three files as an excel, so I don't have to do a sql query, right? Does that change the code significantly?
    Thanks!

I have this same problem 3 years ago. Are you a PhD student in finance or accounting
?

I am a finance major. So how did you solve it 3 years ago. For me it says, that my data is too large I think (if I run the final merge command based on the PERMNO, the console gives back an error "Error: cannot allocate vector of size 317.9 Mb").

When you download data in CSV and text from wrds, and then read it into r, you may exceed the memory limit. What is your computer RAM, how many rows does your crsp data have? I use wrds cloud with r, so I do not have this issue. If you really want do it locally, use dplyr and data.table

A good introduction to seeking help from coding forums is to pose your questions as a [reproducible example (what many folks call a reprex for short)](FAQ: What's a reproducible example (`reprex`) and how do I do one?, the reprex-r-package is a handy tool to help make this easy).

With a question like this, you might provide samples (small samples) of the two tables you'd like to merge, and a copy of the merged data you'd like to get. (Here's a discussion of how you can take data from your R console and offer it in a reprex-friendly format for coding forums like this, I prefer dput)

In terms of tools for merging two datasets, I personally like the tidyverse's dplyr's two-table verbs for stuff like this.. But I'd be worried about suggesting this until we more clearly understand your desired outcome - since the time series model you want to apply later may require your data is structured in a particular way.

Here is some R code to replicate the Fama French 1992 paper.

https://drive.google.com/file/d/0BxvBvE2V-dFTVnZuLUFhZWNuazA/view

The author links the Compustat data with the CRSP data. He also uses the CCM LNKHIST data as you require. Unfortunately I don`t have access to the CCM data so I have to link the data through the GVKEY, CUSIP / some other method.

I will post my code if anybody is interested.

1 Like

I am one of the authors of this code. If anyone has question about it, feel free to ask.

There are two ways of merging compustat and CRSP. If you do not have the CCM data, this code does not work for you.

Hello Peter,

For my master thesis I am trying to combine the monthly stock return data of CRSP with the annual fundamental data of Compustat. I'm reading the R code as shared by msmith01 and can follow the logic of the code as it is clearly written. However, I have a question about the first section.

The annual data is retrieved from Compustat and the link table from CRSP/Compustat Merged. The link table is merged with the Compustat dataset. The combined data is merged with CRSP.
What is the advantage of retrieving the fundamental data from Compustat and combine that with the link table over directly retrieving the fundamental data from the CRSP/Compustat Merged dataset? As far as I can see the CRSP/Compustat Merged dataset includes the desired linking variables for merging with CRSP. Are there disadvantages with using the CRSP/Compustat Merged dataset?

Thanks in advance.

Kind regards,
Patrick

Patrick,

The CRSP and Compustat have more data than the CCM data. Moreover, merging these two data by yourself allow you to choose data frequency more flexibly. For example, daily CRSP + quarterly Compustat, etc.

Thank you for your reply. I was not aware that the data of CCM was not that extensive as Compustat or CRSP.
Will use your code to merge the two datasets.

I am trying to run your code to see if I can reproduce the result. Unfortunately I am running into an error. I am trying to run the code as is (only change is username/password). The code I am running is from:

When I get to this section:

###############################################################################################################
### COMPUSTAT CLEANING AND VAR CALC ###

# load("180619 data.ccm.RData")
data.comp <- data.ccm %>%
  rename(PERMNO=permno) %>% data.table %>% # ensure col names match crsp's
  group_by(PERMNO) %>% 
  mutate(datadate = as.yearmon(datadate),
         comp.count = row(.)) %>% # number of years in data; future option to cut first year data; works but leads to warnings
    # tests based on BE spread show FF no longer impose this condition (even though mentioned in FF'93)
  ungroup %>% arrange(datadate, PERMNO) %>% data.frame %>%
  distinct(datadate, PERMNO, .keep_all = TRUE) # hasn't been issue but just in case

I get the following error:

Error: Column `comp.count` must be length 1 (the group size), not 12099238
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Is there an input that I am missing?

Could you change comp.count = row(.) to comp.count = n()?

When I have time, I will update the code using a pure data.table approach. Over time, I personally think data.table may be better at handling this kind of daily stock return data.

That worked! Thank you very much. Apologies for the delay in responding, I need to figure out a way to receive email notifications.