How to work with "three-dimensional" data


I am very new to R, so please bear with me. I am mostly used to Excel but need to use R due to the size of the dateset I am working with.

My dataset contains monthly observations of 15 different types of financial data (e.g. share price, dividend, earnings, etc.) for a large number (~8,000) stocks over a 25 year period. However, currently my dataset is two-dimensional such that the 15 types of data for stock 1 is in the first 15 columns, then the same data for stock 2 in columns 16-30, etc... you get the idea.

For each individual stock I need to be able create new vectors for each stock by multiplying current vectors, as well as creating new vectors for each stock based on the value of one of the vectors relative to the cross-section of stocks (e.g. is the stock in the top 10% of highest dividend yields in the sample).

Intuititvely I feel it would be optimal to have the data in "three dimensions" but is this feasible or even possible, and if yes, how would i optimally get to there? Or are there any other good ways to work around this when the dataset reaches this size (currently ~120,000 columns).

Thank you very much!

1 Like

Hi, and welcome!

Your data is suffering from a bad case of untidy, meaning that it too wide to handle well. Plus, you have competing units of observation, the ticker and the date with multiple characteristics recorded at different dates on the same ticker or different tickers on the same date (which you have now).

My instinct is to organize by date, because its a time series and what the econometrician boffins call panel data.

Brief high-level introduction to R. Under the hood, it can look similar to other languages, but to the user it presents primarily as school algebra writ large, f(x) = y. Every function takes an object as its argument. So, the quest is to find an appropriate object for the analysis.

One way to do this is to organize the data into


where metrics is a list object of the characteristics that are to be captured.

That isn't necessarily (or even likely) the best object. I suggest taking a look at the CRAN Task View on Finance under the Finance header at the packages available. I feel confident that this problem has been confronted before now and you'll find the appropriate object, with some digging.

There may be some disaggregation of your data set required. I hope that the packages in the Task View have the tools to do that. Otherwise, it will be necessary to do the preprocessing using system tools like awk or cut or within R. I also hope there is a way to prevent having to deal with 8,000 different objects before reassembly.

As you experiment, I strongly suggest that you take 5 tickers over 5 years and work from there.

Good luck, and come back if you get stuck.

Hi Technocrat,

Thanks for the welcome and response - much appreciated! I am working with NicolaiB on this issue, and we have a follow-up question.

I have attached a snip of how our dataframe looks. So right now along the columns, our data is organized as [Company Name - Financial Metric].

We are trying to define new variables, for each Company, based on these financial metrics. Given that we have hundreds of Company, we need to create a loop that computes it. We know the pattern of occurrence of the financial metrics, e.g. Novo Nordisk A/S - Working Capital will always be the first column entry for a Company and we also know that each Company has exactly 19 financial variables associated to it.

So e.g. we want to create a loop that create a new variable associated to each Company by dividing its Book Value per share by its share price.

Can you help us with such a loop? We have trouble creating a sequence that refers to the columns we need for creating our computation. For book-value per share, it is always the column 4 position for a given Company and share price is always column 14.

Specifically, can you help us with a loop that does the following:

n = number of companies = ncol(dataframe)/financial metrics - 1
for (i to n) { new variable <- column [4 + (i-1)*19]/column [14 + (i-1)*19]},

Thanks for your help!

As @technocrat suggested, your data is what may be called 'untidy' -- a somewhat technical term that applies because there is data in your column names themselves, namely, the names of the companies.

Once the data is tidied, it may be straightforward to accomplish your task, but it would help to share a toy, say, 10-by-10 table that mimics your current structure, and a toy task that mimics the task you want to accomplish. Then folks may be able to help you get started.

1 Like

I would suggest looking at this chapter of R4DS as it will help you take the data and make it tidy. but this example might get you started

#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>     date
exampdat <- tibble(
  Dates=seq(ymd('2000-02-19'),ymd('2001-01-19'), by = 'months'),
  NovoNordisk_Capital=c(rep(6116000, 11), 7858000),
  NovoNordisk_Assets =c(rep(24593000, 11), 2890500),
  NovoNordisk_DividendYield=c(9.73, .7, .94, .73, .74, .66, .58, .56, .58, .59, .62, .55),
  OtherComp_Capital=c(rep(6116000, 11), 7858000),
  OtherComp_Assets =c(rep(24593000, 11), 2890500),
  OtherComp_DividendYield=c(9.73, .7, .94, .73, .74, .66, .58, .56, .58, .59, .62, .55)

exampdat %>%
  pivot_longer(-Dates) %>%
  separate(name, into=c("Company", "Type")) %>%
  pivot_wider(names_from="Type", values_from="value")
#> # A tibble: 24 x 5
#>    Dates      Company     Capital   Assets DividendYield
#>    <date>     <chr>         <dbl>    <dbl>         <dbl>
#>  1 2000-02-19 NovoNordisk 6116000 24593000          9.73
#>  2 2000-02-19 OtherComp   6116000 24593000          9.73
#>  3 2000-03-19 NovoNordisk 6116000 24593000          0.7 
#>  4 2000-03-19 OtherComp   6116000 24593000          0.7 
#>  5 2000-04-19 NovoNordisk 6116000 24593000          0.94
#>  6 2000-04-19 OtherComp   6116000 24593000          0.94
#>  7 2000-05-19 NovoNordisk 6116000 24593000          0.73
#>  8 2000-05-19 OtherComp   6116000 24593000          0.73
#>  9 2000-06-19 NovoNordisk 6116000 24593000          0.74
#> 10 2000-06-19 OtherComp   6116000 24593000          0.74
#> # ... with 14 more rows

Created on 2020-02-24 by the reprex package (v0.3.0)

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.