Rolling mean and standard deviation in data.table

Peter_Griffin · October 22, 2018, 2:37am

I need to calculate rolling mean and standard deviations for a couple of columns in a large data (30 million rows and 11 columns). I use the rollify function in tibbletime with data.table, but the code seems very slow.

I want to know how to do it quickly in data.table without having to use functions that are slow. My code is as below


mywindow <- 90

MyMean <- function(x){
  return(mean(x, na.rm = TRUE))
}


mean_roll <- rollify(MyMean, window = mywindow)


MySD <- function(x){
  return(sd(x, na.rm = TRUE))
}

sd_roll <- rollify(MySD, window = mywindow)

MySkewness <- function(x){
  return(skewness(x, na.rm = TRUE))
}

skewness_roll <- rollify(MySkewness, window = mywindow)


MyKurtosis <- function(x){
  return(kurtosis(x, na.rm = TRUE))
         }
  
kurtosis_roll <- rollify(MyKurtosis, window = mywindow)



sample_filter <- crsp_all[, .N, by = permno][N >= 91][,list(permno)]

crsp_all <- merge(crsp_all, sample_filter, by = "permno")

rm(sample_filter)

crsp_all[
  , ':='(prc = abs(prc), impact = abs(ret)/(prc*vol), 
         baspread = coalesce(ask-bid, askhi-bidlo)/coalesce((ask+bid)/2, (askhi+bidlo)/2),
         stockto = vol/shrout)
  ][
    ,':='(mean_ret = mean_roll(ret), sd_ret = sd_roll(ret), 
          skewness_ret = skewness_roll(ret), kurtosis_ret = kurtosis_roll(ret),
          amihud = mean_roll(impact), stockTO = mean_roll(stockto),
          BAspread = mean_roll(baspread)), by = permno
    
  ]

rensa · October 22, 2018, 2:44am

rollify uses purrr under the hood, so I can't imagine it's going to be super performant. If it's simple statistics you're interested in, you could check out some of the functions in the zoo package. It has rollapply(), which takes an analogous approach to rollify but uses apply instead (so maybe not a big performance increase), and rollmean(), which is a performance-optimised rolling mean. The latter will probably give you the best performance for the mean, but if the others aren't fast enough for the SD, you might have to look into writing a rolling SD function using rcpp()

Peter_Griffin · October 22, 2018, 2:47am

That is what I am thinking. I used to use zoo::rollapply and I will try it now. I really like the ease of use provided by tidyverse ecosystem, but it seems functions from it have a performance issue.

jcblum · October 22, 2018, 3:06am

Is this helpful, at all?

Peter_Griffin · October 22, 2018, 3:22am

I think this could be a great alternative. I was wondering why data.table does not have its own version of rollapply.

Peter_Griffin · October 22, 2018, 3:45am

The rollapply function is still pretty slow, I will use the rcpproll package.

rensa · October 22, 2018, 4:45am

Oooh yeah, that looks great!

rensa · October 22, 2018, 4:49am

Yeah Rolling functions tend to be slow in R because they require iteration, and applying an arbitrary function iteratively means doing the iteration in R, which introduces a lot of overhead. Functions like zoo::rollmean() and those in RcppRoll have been compiled with the iteration built-in (because the function is explicitly defined, not arbitrary), so they tend to be faster.

Peter_Griffin · October 22, 2018, 3:29pm

The RcppRoll really saves my life! Do you know if there are any other high performance packages like this one? Thanks!

mara · October 22, 2018, 3:35pm

There's a High-Performance and Parallel Computing with R CRAN tast view maintained by Dirk Eddelbuettel, which you might want to take a look at.

fderyckel · October 24, 2018, 12:14pm

Hi
The package Rolling Window has been a winner for me.
Check my blog and the comments on rolling functions
https://fderyckel.github.io/2017-05-14-rollin_rollin_rollin_on_the_river/