I'm trying to filter a dataframe based on a cumulative volume
column. The idea is that the values in the values in the volume
column are cumulatively summed and as soon as this cumulative sum exceeds some threshold, t
, the corresponding row is filtered.
This is easily coded in a loop, but of course loops do not scale very well (1 000 000+ rows). I would love for a vectorized solution in purr or something tidyverse specific, but at this point anything that manages to avoid the loop or speed it up is welcome. As reference, an identical loop coded in python on 2 000 000 rows runs in seconds. I'm looking for a similar R solution.
I've tried looking online without much success to find a problem like mine. There isn't a solution, or I just didn't word my searches properly. Here is a reproducible example:
library(tibble)
num_rows <- 1000000
# create the dataframe
df <-
tibble(
id = 1:num_rows,
vol = runif(num_rows),
name = replicate(num_rows, paste(sample(letters, 3, replace=TRUE), collapse=''))
)
# set the threshold, t, and loop parameters
t <- 5
cum_vol <- 0
idx <- vector(mode = "numeric")
sampled_vol <- vector(mode = "numeric")
for (i in 1:nrow(df)) {
cum_vol <- cum_vol + df$vol[i]
if (cum_vol >= t) {
idx <- c(idx, i)
sampled_vol <- c(sampled_vol, cum_vol)
cum_vol <- 0
}
}
sample <- tibble(id = idx, cum_vol = sampled_vol)
sampled_df <- inner_join(df, sample, by ="id")