Filter based on a cumulative sum exceeding a threshold value and then resetting

leeutjie98 · April 6, 2023, 2:19pm

I'm trying to filter a dataframe based on a cumulative volume column. The idea is that the values in the values in the volume column are cumulatively summed and as soon as this cumulative sum exceeds some threshold, t, the corresponding row is filtered.

This is easily coded in a loop, but of course loops do not scale very well (1 000 000+ rows). I would love for a vectorized solution in purr or something tidyverse specific, but at this point anything that manages to avoid the loop or speed it up is welcome. As reference, an identical loop coded in python on 2 000 000 rows runs in seconds. I'm looking for a similar R solution.

I've tried looking online without much success to find a problem like mine. There isn't a solution, or I just didn't word my searches properly. Here is a reproducible example:

library(tibble)

num_rows <- 1000000

# create the dataframe
df <- 
  tibble(
    id = 1:num_rows, 
    vol = runif(num_rows), 
    name = replicate(num_rows, paste(sample(letters, 3, replace=TRUE), collapse=''))
  )

# set the threshold, t, and loop parameters
t <- 5
cum_vol <- 0
idx <- vector(mode = "numeric")
sampled_vol <- vector(mode = "numeric")

for (i in 1:nrow(df)) {
  
  cum_vol <- cum_vol + df$vol[i]
  
  if (cum_vol >= t) {
    idx <- c(idx, i)
    sampled_vol <- c(sampled_vol, cum_vol)
    cum_vol <- 0
  }
  
}

sample <- tibble(id = idx, cum_vol = sampled_vol)
sampled_df <- inner_join(df, sample, by ="id")

FJCC · April 6, 2023, 3:03pm

I'll think about a tidyverse version, but this modification runs in about 2 seconds. All I did is predefine the length of the output vectors. You would then have to filter them with

idx <- idx[idx != 0]
sampled_vol <- sampled_vol[sampled_vol != 0]

but that is very fast.

idx <- vector(mode = "numeric", length = 200000)
sampled_vol <- vector(mode = "numeric", length = 200000)
j <- 1
for (i in 1:nrow(df)) {
   
   cum_vol <- cum_vol + df$vol[i]
   
   if (cum_vol >= t) {
     idx[j] <- i
     sampled_vol[j] <- cum_vol
     cum_vol <- 0
     j <- j + 1
   }
   
 }"

system · April 13, 2023, 3:04pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.