dplyr/vectorised approach to computing on lagged value

16mc1r · September 14, 2020, 10:38am

The Problem:

Translating a for loop which uses the result of its previous iteration to calculate the current value.

Im calculating the current stock of medication a person has based on:

The first prescription amount V_0
The daily dose used (linear decrease per day).
The stock is updated by new prescriptions V_1
The amount is positive or 0 since you cannot have a negative ammount of medication

Calculation current medication stock

V_{current} = V_0 - 500_{mg} + V_1
with V_{current} \ge 0

Code

Each row in the data represents one day, therefore with each iteration the daily dosis (e.g. 500mg) is substracted
The loop starts at row 2 since row 1 contains the initial stock of medication.

roll_fun_long <- function(.,...){
  for (i in 2:nrow(.)){
    .[i, "vol_c"] <- ifelse((.[i-1, "vol_c"] - 500) < 0,
                                0 + .[i, "vol_c"],
                                .[i-1, "vol_c"] - 500+ .[i, "v_1"])
  }
  return(.)
}

Thus the loop references the last value it calculated (i.e. the stock of medication at t_{-1} to calulate the current amount of medication V_{current} at t_1.

Applying the function

Splitting the data.frame along person identifiers
Applying the loop
Recombining into one dataframe

vol_long%<>% split(.$person) %>% map_df(roll_fun_long)

Details

I dont have an idea how to vectorise this function or put it into dplyr / tidyverse logic since all functions I know operate on vectors and you cannot reference intermediate row results. The process is fairly inefficient and not feasable for larger Datasets. I though about using data.table because it supports modifying in place instead of rewriting the object hundrets of times.

Thanks for your suggestions

nirgrahamuk · September 14, 2020, 10:47am

can you provide a sample of representative data ?

16mc1r · September 14, 2020, 1:24pm

Here the script to arrive at the sample starting data:

sample_data <- data.frame(
     stringsAsFactors = FALSE,
                   id = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
                 date = c("2010-01-01",
                          "2010-02-01","2010-05-01","2010-01-01","2010-03-15",
                          "2010-04-02","2010-04-08","2010-04-03","2010-05-18"),
  volume_prescription = c(5000, 9800, 3000, 0, 7000, 8000, 9000, 4000, 9000),
       rolling_volume = c(NA, NA, NA, NA, NA, NA, NA, NA, NA)
)


sample_data %>% mutate(date = as.Date(date)) %>%
  group_by(id) %>% complete(id,
                            date = seq.Date(min(date),
                                            max(date),
                                            by = "day"),
                            fill = list(volume_prescription = 0))

The End result should be something like this:

  id    date       volume_prescription rolling_volume
   <chr> <date>                   <dbl>          <dbl>
 1 A     2010-01-01                2000           1500
 2 A     2010-01-02                   0           1000
 3 A     2010-01-03                   0            500
 4 A     2010-01-04                   0              0
 5 A     2010-01-05                   0              0
 6 A     2010-01-06                   5000        5000
 7 A     2010-01-07                   0           4500
 8 A     2010-01-08                   0           4000

siddharthprabhu · September 14, 2020, 3:38pm

@nirgrahamuk Hope you don't mind me butting in. I just found this puzzle quite interesting and wanted to take a crack at it.

@16mc1r How about this solution? I've temporarily created a new variable called volume_filled just so that you can check the work. You could just as well use volume_prescription if you don't mind overwriting the NA values.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

sample_data <- data.frame(
  stringsAsFactors = FALSE,
  id = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
  date = c(
    "2010-01-01",
    "2010-02-01", "2010-05-01", "2010-01-01", "2010-03-15",
    "2010-04-02", "2010-04-08", "2010-04-03", "2010-05-18"
  ),
  volume_prescription = c(5000, 9800, 3000, 0, 7000, 8000, 9000, 4000, 9000),
  rolling_volume = c(NA, NA, NA, NA, NA, NA, NA, NA, NA)
)

complete_data <- sample_data %>%
  mutate(date = as.Date(date)) %>%
  group_by(id) %>%
  complete(id,
    date = seq.Date(min(date),
      max(date),
      by = "day"
    ),
    fill = list(volume_prescription = NA)
  )

complete_data %>%
  mutate(volume_filled = volume_prescription, consumption = 500, .before = rolling_volume) %>%
  fill(volume_filled, .direction = "down") %>%
  group_by(id, volume_filled) %>%
  mutate(rolling_volume = lag(if_else(volume_filled - cumsum(consumption) < 0,
    0, volume_filled - cumsum(consumption)
  ))) %>%
  ungroup()
#> # A tibble: 265 x 6
#>    id    date       volume_prescription volume_filled consumption rolling_volume
#>    <chr> <date>                   <dbl>         <dbl>       <dbl>          <dbl>
#>  1 A     2010-01-01                5000          5000         500             NA
#>  2 A     2010-01-02                  NA          5000         500           4500
#>  3 A     2010-01-03                  NA          5000         500           4000
#>  4 A     2010-01-04                  NA          5000         500           3500
#>  5 A     2010-01-05                  NA          5000         500           3000
#>  6 A     2010-01-06                  NA          5000         500           2500
#>  7 A     2010-01-07                  NA          5000         500           2000
#>  8 A     2010-01-08                  NA          5000         500           1500
#>  9 A     2010-01-09                  NA          5000         500           1000
#> 10 A     2010-01-10                  NA          5000         500            500
#> # ... with 255 more rows

^{Created on 2020-09-14 by the reprex package (v0.3.0)}

nirgrahamuk · September 14, 2020, 3:58pm

I certainly don't mind.
I think the simple truth is that dplyr/tidyr isnt intended for this sort of 'algorithm' so attempts to fit it in will be inherintly clumsy, probably a base R solution is best overall.

I think there is a bug in the attempted solution, I modified example data slightly in order to tease it out.

sample_data <- data.frame(
  stringsAsFactors = FALSE,
  id = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"),
  date = c(
    "2010-01-01",
    "2010-01-05",
    "2010-02-01", "2010-05-01", "2010-01-01", "2010-03-15",
    "2010-04-02", "2010-04-08", "2010-04-03", "2010-05-18"
  ),
  volume_prescription  = c(5000, 1000,9800, 3000, 0, 7000, 8000, 9000, 4000, 9000)
) %>% mutate(rolling_volume = NA)

using this as start point and applying the solution gives

 A tibble: 265 x 6
   id    date       volume_prescription volume_filled consumption rolling_volume
   <chr> <date>                   <dbl>         <dbl>       <dbl>          <dbl>
 1 A     2010-01-01                5000          5000         500             NA
 2 A     2010-01-02                  NA          5000         500           4500
 3 A     2010-01-03                  NA          5000         500           4000
 4 A     2010-01-04                  NA          5000         500           3500
 5 A     2010-01-05                1000          1000         500             NA
 6 A     2010-01-06                  NA          1000         500            500
 7 A     2010-01-07                  NA          1000         500              0
 8 A     2010-01-08                  NA          1000         500              0
 9 A     2010-01-09                  NA          1000         500              0
10 A     2010-01-10                  NA          1000         500              0
# ... with 255 more rows

which unfortunately doesn't look correct for the additional 1k on 5th of Jan

nirgrahamuk · September 15, 2020, 11:00am

ok, here is my solution,
I made adjustment to the original 'slow solution' that I felt were required for the algorithm to be correct and relatively efficient in its way of working. I then used Rcpp to make a custom C++ function which is like cumulative sum except throttled to positive number results only.
When I benchmark the two approaches (and ignore the Rcpp function compilation time (which might be discountable if you store it and load it)) then the new approach is about 500 times faster...

library(tidyverse)

roll_fun_long <- function(., ...) {
  for (i in 2:nrow(.)) {
    .[i, "vol_c"] <- max(0, .[[i - 1, "vol_c"]] - 500 + .[[i - 1, "v_1"]])

    # .[i, "vol_c"] <- ifelse((.[i-1, "vol_c"] - 500)  < 0,
    #                         0 + .[i, "vol_c"],
    #                         .[i-1, "vol_c"] - 500+ .[i, "v_1"])
  }
  return(.)
}

sample_data0 <- data.frame(
  stringsAsFactors = FALSE,
  id = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"),
  date = c(
    "2010-01-01",
    "2010-01-05",
    "2010-02-01", "2010-05-01", "2010-01-01", "2010-03-15",
    "2010-04-02", "2010-04-08", "2010-04-03", "2010-05-18"
  ),
  v_1 = c(5000, 1000, 9800, 3000, 0, 7000, 8000, 9000, 4000, 9000)
) %>% mutate(vol_c = 0)



sample_data <- mutate(sample_data0,
  date = as.Date(date)
) %>%
  group_by(id) %>%
  complete(id,
    date = seq.Date(min(date),
      max(date),
      by = "day"
    ),
    fill = list(
      v_1 = 0,
      vol_c = 0
    )
  )




library("Rcpp")
cppFunction("
NumericVector cumsum_only_pos(NumericVector x){
 	// initialize an accumulator variable
 	double acc = 0;
 	
 	// initialize the result vector
 	NumericVector res(x.size());
 	
 	for(int i = 0; i < x.size(); i++){
 		acc += x[i];
 		acc = std::max(acc,0.0);
 		res[i] = acc;
 	}
 	return res;
 }")
# cumsum_only_pos(c(25,0,10,1)-10)
# cumsum(c(25,0,10,1) -10)


microbenchmark::microbenchmark(
  orig_way = orig_result <- sample_data %>% split(.$id) %>% map_df(roll_fun_long),
  new_way = new_result <- sample_data %>% mutate(vol_c = cumsum_only_pos(replace_na(lag(v_1), 0) - 500)),
  times = 8L
)

all_equal(new_result, orig_result)

16mc1r · September 23, 2020, 5:53am

Thats what I suspected too, but I wasnt sure since there might be a way I dont know of with dplyr or some other hidden R gem.
Also big thanks for the Rcpp solution, I hadnt thought of going that route, but now I finally have an excuse to learn C++ / Rcpp for work :).

system · October 14, 2020, 5:53am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.