Iterate over rows to conditionally calculate cumulative sum?


#1

Howdy!

I’ve been having trouble figuring out how to calculate a conditional cumulative sum for each row in a data frame. I think what I need to do is iterate through the rows (using pmap()?), select rows that meet my criteria and sum the touch_days column for those selected rows. Alternatively, I may be able to use some sort of rolling sum function.

For instance, the entry_date in the fourth row of the table below is 2017-01-20. Given this entry date, I want to find the other rows with entry dates that are within five years prior to this date and have the same id and then sum the touch_days for those rows. In this case, rows 2 and 3 would meet that criteria and the expected sum for that row would be 16.

There are a few extra wrinkles I’ll need to address (such as what to do when a date falls within an entry and exit), but I’m having trouble coming up with even a basic approach to this to get started and would love any guidance.

I have a reprex with my expected output below.

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

data <-
  tribble(
    ~id, ~entry_date, ~exit_date, ~touch_days,
    "a", ymd("2010-01-26"), ymd("2010-01-30"), 5L,
    "a", ymd("2013-01-24"), ymd("2013-01-30"), 7L,
    "a", ymd("2015-01-22"), ymd("2015-01-30"), 9L,
    "a", ymd("2017-01-20"), ymd("2017-01-30"), 11L,
    "b", ymd("2010-02-26"), ymd("2010-03-30"), 33L,
    "b", ymd("2013-02-24"), ymd("2013-03-30"), 35L,
    "b", ymd("2015-02-22"), ymd("2015-03-30"), 37L,
    "b", ymd("2017-02-20"), ymd("2017-03-30"), 39L
  )
data
#> # A tibble: 8 x 4
#>   id    entry_date exit_date  touch_days
#>   <chr> <date>     <date>          <int>
#> 1 a     2010-01-26 2010-01-30          5
#> 2 a     2013-01-24 2013-01-30          7
#> 3 a     2015-01-22 2015-01-30          9
#> 4 a     2017-01-20 2017-01-30         11
#> 5 b     2010-02-26 2010-03-30         33
#> 6 b     2013-02-24 2013-03-30         35
#> 7 b     2015-02-22 2015-03-30         37
#> 8 b     2017-02-20 2017-03-30         39

output <-
  tribble(
    ~id, ~entry_date, ~exit_date, ~touch_days, ~stay_days_5yrs,
    "a", ymd("2010-01-26"), ymd("2010-01-30"), 5L, 0L,
    "a", ymd("2013-01-24"), ymd("2013-01-30"), 7L, 5L,
    "a", ymd("2015-01-22"), ymd("2015-01-30"), 9L, 12L,
    "a", ymd("2017-01-20"), ymd("2017-01-30"), 11L, 16L,
    "b", ymd("2010-02-26"), ymd("2010-03-30"), 33L, 0L,
    "b", ymd("2013-02-24"), ymd("2013-03-30"), 35L, 33L,
    "b", ymd("2015-02-22"), ymd("2015-03-30"), 37L, 68L,
    "b", ymd("2017-02-20"), ymd("2017-03-30"), 39L, 72L
  )

output
#> # A tibble: 8 x 5
#>   id    entry_date exit_date  touch_days stay_days_5yrs
#>   <chr> <date>     <date>          <int>          <int>
#> 1 a     2010-01-26 2010-01-30          5              0
#> 2 a     2013-01-24 2013-01-30          7              5
#> 3 a     2015-01-22 2015-01-30          9             12
#> 4 a     2017-01-20 2017-01-30         11             16
#> 5 b     2010-02-26 2010-03-30         33              0
#> 6 b     2013-02-24 2013-03-30         35             33
#> 7 b     2015-02-22 2015-03-30         37             68
#> 8 b     2017-02-20 2017-03-30         39             72

#2

I’ve came up with this which does what you want. Not sure it’s the most elegant solution, but the main β€œinnovation” is to use dplyr::do :slight_smile:

library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> βœ” ggplot2 2.2.1     βœ” purrr   0.2.4
#> βœ” tibble  1.4.2     βœ” dplyr   0.7.4
#> βœ” tidyr   0.8.0     βœ” stringr 1.3.0
#> βœ” readr   1.1.1     βœ” forcats 0.3.0
#> ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> βœ– dplyr::filter() masks stats::filter()
#> βœ– dplyr::lag()    masks stats::lag()
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

data <- tribble(~id, ~entry_date, ~exit_date, ~touch_days, "a", ymd("2010-01-26"), 
  ymd("2010-01-30"), 5L, "a", ymd("2013-01-24"), ymd("2013-01-30"), 7L, "a", 
  ymd("2015-01-22"), ymd("2015-01-30"), 9L, "a", ymd("2017-01-20"), ymd("2017-01-30"), 
  11L, "b", ymd("2010-02-26"), ymd("2010-03-30"), 33L, "b", ymd("2013-02-24"), 
  ymd("2013-03-30"), 35L, "b", ymd("2015-02-22"), ymd("2015-03-30"), 37L, 
  "b", ymd("2017-02-20"), ymd("2017-03-30"), 39L)

output <- tribble(~id, ~entry_date, ~exit_date, ~touch_days, ~stay_days_5yrs, 
  "a", ymd("2010-01-26"), ymd("2010-01-30"), 5L, 0L, "a", ymd("2013-01-24"), 
  ymd("2013-01-30"), 7L, 5L, "a", ymd("2015-01-22"), ymd("2015-01-30"), 9L, 
  12L, "a", ymd("2017-01-20"), ymd("2017-01-30"), 11L, 16L, "b", ymd("2010-02-26"), 
  ymd("2010-03-30"), 33L, 0L, "b", ymd("2013-02-24"), ymd("2013-03-30"), 35L, 
  33L, "b", ymd("2015-02-22"), ymd("2015-03-30"), 37L, 68L, "b", ymd("2017-02-20"), 
  ymd("2017-03-30"), 39L, 72L)

stay_days <- function(df) {
  dates <- df[["entry_date"]]
  res <- purrr::map_int(dates, function(date) {
    first_date <- date
    lubridate::year(first_date) <- lubridate::year(date) - 5
    df %>% dplyr::filter(entry_date < date & entry_date > first_date) %>% 
      dplyr::pull(touch_days) %>% sum(na.rm = TRUE)
  })
  df[["stay_days_5yrs"]] <- res
  df
}

res <- data %>% dplyr::arrange(entry_date) %>% dplyr::group_by(id) %>% dplyr::do(stay_days(.)) %>% 
  dplyr::ungroup()

all.equal(res, output)
#> [1] TRUE

Reprex is being a bit difficult on Ubuntu, so formatting is sort of messed up, but it should be clear what I did.