I need to calculate rolling means and standard deviations for a couple of columns in a large data set (30 million rows and 11 columns). I use the rollify function from tibbletime together with data.table, but the code is very slow.
I want to know how to do this quickly in data.table without relying on slow functions. My code is below.
rollify uses purrr under the hood, so I wouldn't expect it to be especially performant. If you only need simple statistics, check out some of the functions in the zoo package. It has rollapply(), which takes an analogous approach to rollify but uses apply instead (so maybe not a big performance increase), and rollmean(), which is a performance-optimised rolling mean. The latter will probably give you the best performance for the mean, but if rollify and rollapply() aren't fast enough for the SD, you might have to look into writing a rolling SD function with Rcpp.
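Here is a minimal sketch of the zoo approach. The toy data, the column names `id` and `value`, and the window width of 5 are all assumptions for illustration, not taken from the original post:

```r
library(data.table)
library(zoo)

# Toy data; column names and window width are illustrative assumptions
set.seed(1)
dt <- data.table(id = rep(c("a", "b"), each = 10), value = rnorm(20))
k <- 5

# rollmean() is the optimised path for the mean
dt[, roll_mean := rollmean(value, k = k, fill = NA, align = "right"), by = id]

# rollapply() calls sd() from R once per window, so expect it to be slower
dt[, roll_sd := rollapply(value, width = k, FUN = sd, fill = NA, align = "right"),
   by = id]
```

With `align = "right"` and `fill = NA`, the first `k - 1` rows of each group are NA and every later row summarises the current row plus the `k - 1` rows before it.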
That's what I was thinking. I used to use zoo::rollapply() and I will try it again now. I really like the ease of use of the tidyverse ecosystem, but some of its functions seem to have performance issues.
Yeah. Rolling functions tend to be slow in R because they require iteration, and applying an arbitrary function to every window means doing that iteration in R, which introduces a lot of overhead. Functions like zoo::rollmean() and those in RcppRoll have the iteration built in (because the statistic is explicitly defined, not arbitrary), so they avoid the per-window R function call and tend to be faster.
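As a rough sketch of what that looks like with RcppRoll, here is a version using the same kind of toy data. The column names `id` and `value` and the window width are hypothetical. data.table also ships its own frollmean(), shown at the end as one more option for the mean:

```r
library(data.table)
library(RcppRoll)

# Toy data; column names and window width are illustrative assumptions
set.seed(1)
dt <- data.table(id = rep(c("a", "b"), each = 10), value = rnorm(20))
k <- 5

# roll_meanr() / roll_sdr() are the right-aligned variants;
# the loop over windows runs in compiled code rather than in R
dt[, `:=`(
  roll_mean = roll_meanr(value, n = k, fill = NA),
  roll_sd   = roll_sdr(value, n = k, fill = NA)
), by = id]

# data.table's built-in rolling mean (right-aligned, NA-filled by default)
dt[, fmean := frollmean(value, n = k), by = id]
```

On 30 million rows it would be worth benchmarking these on a sample of the real data first, since the relative speed depends on the window width and the number of groups.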