dplyr/vectorised approach to computing on lagged value

The Problem:

Translating a for loop which uses the result of its previous iteration to calculate the current value.

Im calculating the current stock of medication a person has based on:

  • The first prescription amount V_0
  • The daily dose used (linear decrease per day).
  • The stock is updated by new prescriptions V_1
  • The amount is positive or 0 since you cannot have a negative ammount of medication

Calculation current medication stock

V_{current} = V_0 - 500_{mg} + V_1
with V_{current} \ge 0

Code

  • Each row in the data represents one day, therefore with each iteration the daily dosis (e.g. 500mg) is substracted
  • The loop starts at row 2 since row 1 contains the initial stock of medication.
roll_fun_long <- function(.,...){
  for (i in 2:nrow(.)){
    .[i, "vol_c"] <- ifelse((.[i-1, "vol_c"] - 500) < 0,
                                0 + .[i, "vol_c"],
                                .[i-1, "vol_c"] - 500+ .[i, "v_1"])
  }
  return(.)
}

Thus the loop references the last value it calculated (i.e. the stock of medication at t_{-1} to calulate the current amount of medication V_{current} at t_1.

Applying the function

  • Splitting the data.frame along person identifiers
  • Applying the loop
  • Recombining into one dataframe
vol_long%<>% split(.$person) %>% map_df(roll_fun_long)

Details

I dont have an idea how to vectorise this function or put it into dplyr / tidyverse logic since all functions I know operate on vectors and you cannot reference intermediate row results. The process is fairly inefficient and not feasable for larger Datasets. I though about using data.table because it supports modifying in place instead of rewriting the object hundrets of times.

Thanks for your suggestions

can you provide a sample of representative data ?

Here the script to arrive at the sample starting data:

sample_data <- data.frame(
     stringsAsFactors = FALSE,
                   id = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
                 date = c("2010-01-01",
                          "2010-02-01","2010-05-01","2010-01-01","2010-03-15",
                          "2010-04-02","2010-04-08","2010-04-03","2010-05-18"),
  volume_prescription = c(5000, 9800, 3000, 0, 7000, 8000, 9000, 4000, 9000),
       rolling_volume = c(NA, NA, NA, NA, NA, NA, NA, NA, NA)
)


sample_data %>% mutate(date = as.Date(date)) %>%
  group_by(id) %>% complete(id,
                            date = seq.Date(min(date),
                                            max(date),
                                            by = "day"),
                            fill = list(volume_prescription = 0))

The End result should be something like this:

  id    date       volume_prescription rolling_volume
   <chr> <date>                   <dbl>          <dbl>
 1 A     2010-01-01                2000           1500
 2 A     2010-01-02                   0           1000
 3 A     2010-01-03                   0            500
 4 A     2010-01-04                   0              0
 5 A     2010-01-05                   0              0
 6 A     2010-01-06                   5000        5000
 7 A     2010-01-07                   0           4500
 8 A     2010-01-08                   0           4000

@nirgrahamuk Hope you don't mind me butting in. I just found this puzzle quite interesting and wanted to take a crack at it.

@16mc1r How about this solution? I've temporarily created a new variable called volume_filled just so that you can check the work. You could just as well use volume_prescription if you don't mind overwriting the NA values.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

sample_data <- data.frame(
  stringsAsFactors = FALSE,
  id = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
  date = c(
    "2010-01-01",
    "2010-02-01", "2010-05-01", "2010-01-01", "2010-03-15",
    "2010-04-02", "2010-04-08", "2010-04-03", "2010-05-18"
  ),
  volume_prescription = c(5000, 9800, 3000, 0, 7000, 8000, 9000, 4000, 9000),
  rolling_volume = c(NA, NA, NA, NA, NA, NA, NA, NA, NA)
)

complete_data <- sample_data %>%
  mutate(date = as.Date(date)) %>%
  group_by(id) %>%
  complete(id,
    date = seq.Date(min(date),
      max(date),
      by = "day"
    ),
    fill = list(volume_prescription = NA)
  )

complete_data %>%
  mutate(volume_filled = volume_prescription, consumption = 500, .before = rolling_volume) %>%
  fill(volume_filled, .direction = "down") %>%
  group_by(id, volume_filled) %>%
  mutate(rolling_volume = lag(if_else(volume_filled - cumsum(consumption) < 0,
    0, volume_filled - cumsum(consumption)
  ))) %>%
  ungroup()
#> # A tibble: 265 x 6
#>    id    date       volume_prescription volume_filled consumption rolling_volume
#>    <chr> <date>                   <dbl>         <dbl>       <dbl>          <dbl>
#>  1 A     2010-01-01                5000          5000         500             NA
#>  2 A     2010-01-02                  NA          5000         500           4500
#>  3 A     2010-01-03                  NA          5000         500           4000
#>  4 A     2010-01-04                  NA          5000         500           3500
#>  5 A     2010-01-05                  NA          5000         500           3000
#>  6 A     2010-01-06                  NA          5000         500           2500
#>  7 A     2010-01-07                  NA          5000         500           2000
#>  8 A     2010-01-08                  NA          5000         500           1500
#>  9 A     2010-01-09                  NA          5000         500           1000
#> 10 A     2010-01-10                  NA          5000         500            500
#> # ... with 255 more rows

Created on 2020-09-14 by the reprex package (v0.3.0)

I certainly don't mind.
I think the simple truth is that dplyr/tidyr isnt intended for this sort of 'algorithm' so attempts to fit it in will be inherintly clumsy, probably a base R solution is best overall.

I think there is a bug in the attempted solution, I modified example data slightly in order to tease it out.

sample_data <- data.frame(
  stringsAsFactors = FALSE,
  id = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"),
  date = c(
    "2010-01-01",
    "2010-01-05",
    "2010-02-01", "2010-05-01", "2010-01-01", "2010-03-15",
    "2010-04-02", "2010-04-08", "2010-04-03", "2010-05-18"
  ),
  volume_prescription  = c(5000, 1000,9800, 3000, 0, 7000, 8000, 9000, 4000, 9000)
) %>% mutate(rolling_volume = NA)

using this as start point and applying the solution gives

 A tibble: 265 x 6
   id    date       volume_prescription volume_filled consumption rolling_volume
   <chr> <date>                   <dbl>         <dbl>       <dbl>          <dbl>
 1 A     2010-01-01                5000          5000         500             NA
 2 A     2010-01-02                  NA          5000         500           4500
 3 A     2010-01-03                  NA          5000         500           4000
 4 A     2010-01-04                  NA          5000         500           3500
 5 A     2010-01-05                1000          1000         500             NA
 6 A     2010-01-06                  NA          1000         500            500
 7 A     2010-01-07                  NA          1000         500              0
 8 A     2010-01-08                  NA          1000         500              0
 9 A     2010-01-09                  NA          1000         500              0
10 A     2010-01-10                  NA          1000         500              0
# ... with 255 more rows

which unfortunately doesn't look correct for the additional 1k on 5th of Jan

ok, here is my solution,
I made adjustment to the original 'slow solution' that I felt were required for the algorithm to be correct and relatively efficient in its way of working. I then used Rcpp to make a custom C++ function which is like cumulative sum except throttled to positive number results only.
When I benchmark the two approaches (and ignore the Rcpp function compilation time (which might be discountable if you store it and load it)) then the new approach is about 500 times faster...

library(tidyverse)

roll_fun_long <- function(., ...) {
  for (i in 2:nrow(.)) {
    .[i, "vol_c"] <- max(0, .[[i - 1, "vol_c"]] - 500 + .[[i - 1, "v_1"]])

    # .[i, "vol_c"] <- ifelse((.[i-1, "vol_c"] - 500)  < 0,
    #                         0 + .[i, "vol_c"],
    #                         .[i-1, "vol_c"] - 500+ .[i, "v_1"])
  }
  return(.)
}

sample_data0 <- data.frame(
  stringsAsFactors = FALSE,
  id = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"),
  date = c(
    "2010-01-01",
    "2010-01-05",
    "2010-02-01", "2010-05-01", "2010-01-01", "2010-03-15",
    "2010-04-02", "2010-04-08", "2010-04-03", "2010-05-18"
  ),
  v_1 = c(5000, 1000, 9800, 3000, 0, 7000, 8000, 9000, 4000, 9000)
) %>% mutate(vol_c = 0)



sample_data <- mutate(sample_data0,
  date = as.Date(date)
) %>%
  group_by(id) %>%
  complete(id,
    date = seq.Date(min(date),
      max(date),
      by = "day"
    ),
    fill = list(
      v_1 = 0,
      vol_c = 0
    )
  )




library("Rcpp")
cppFunction("
NumericVector cumsum_only_pos(NumericVector x){
 	// initialize an accumulator variable
 	double acc = 0;
 	
 	// initialize the result vector
 	NumericVector res(x.size());
 	
 	for(int i = 0; i < x.size(); i++){
 		acc += x[i];
 		acc = std::max(acc,0.0);
 		res[i] = acc;
 	}
 	return res;
 }")
# cumsum_only_pos(c(25,0,10,1)-10)
# cumsum(c(25,0,10,1) -10)


microbenchmark::microbenchmark(
  orig_way = orig_result <- sample_data %>% split(.$id) %>% map_df(roll_fun_long),
  new_way = new_result <- sample_data %>% mutate(vol_c = cumsum_only_pos(replace_na(lag(v_1), 0) - 500)),
  times = 8L
)

all_equal(new_result, orig_result)
1 Like

Thats what I suspected too, but I wasnt sure since there might be a way I dont know of with dplyr or some other hidden R gem.
Also big thanks for the Rcpp solution, I hadnt thought of going that route, but now I finally have an excuse to learn C++ / Rcpp for work :).