conditionally copy text from previous date for next 4 days

ramakant · July 24, 2021, 12:00am

hello R community,
This is a pretty straightforward scenario but I've been unable to crack the R code for this. I've got a series of dates with text names on certain dates. the requirement is to create another column that copies the same name for the next 4 days. Sometimes a new text might appear within the 4 day window so the new column should reflect the most recent .

reprex code with the expected column hardcoded shared below reference. Have tried to google for rolling functions but I'm unable to address the conditionality. Certain there must be a simple workaround this but currently struggling with even where to begin!

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.3
#> Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
#> when loading 'dplyr'
#> Warning: package 'tibble' was built under R version 4.0.2

t <- tibble(date = seq(as.Date("2021-01-01"), as.Date("2021-01-15"), by = "1 day"), 
            name = c(NA,NA,"a",NA,NA,NA,NA,"b",NA,NA,"a",NA,NA,NA,NA))

t$expected <- c(NA,NA,"a","a","a","a",NA,"b","b","b","a","a","a","a",NA)

^{Created on 2021-07-24 by the reprex package (v0.3.0)}

jrmuirhead · July 24, 2021, 2:16am

Hi @ramakant ,
One way to approach this problem is by creating groups based on a conditionality - in this case, whether a value exists in the column name. Once the groups are defined, you can assess whether the dates within that group fall within a 4-day window, and if so, fill in the missing values in the name column with the first value defining that group.

One caveat for this to work is that the data must first be sorted by date.

library("tidyverse")

t <- tibble(date = seq(as.Date("2021-01-01"), as.Date("2021-01-15"),
  by = "1 day"), 
  name = c(NA,NA,"a",NA,NA,NA,NA,"b",NA,NA,"a",NA,NA,NA,NA))

t$expected <- c(NA,NA,"a","a","a","a",NA,"b","b","b","a","a","a","a",NA)

t <- t %>%
  # Find rows where values are not missing and assign to group
  mutate(value_present = cumsum(!is.na(name))) %>%
  # Group data by groups starting with non-missing value
  group_by(value_present)


# For each group, fill in the first value of that group for cases
# where the date window is 4 days or less
t <- t %>%
  mutate(new_name = case_when(date - min(date) <= 3 ~ head(name, 1)))

t
#> # A tibble: 15 × 5
#> # Groups:   value_present [4]
#>    date       name  expected value_present new_name
#>    <date>     <chr> <chr>            <int> <chr>   
#>  1 2021-01-01 <NA>  <NA>                 0 <NA>    
#>  2 2021-01-02 <NA>  <NA>                 0 <NA>    
#>  3 2021-01-03 a     a                    1 a       
#>  4 2021-01-04 <NA>  a                    1 a       
#>  5 2021-01-05 <NA>  a                    1 a       
#>  6 2021-01-06 <NA>  a                    1 a       
#>  7 2021-01-07 <NA>  <NA>                 1 <NA>    
#>  8 2021-01-08 b     b                    2 b       
#>  9 2021-01-09 <NA>  b                    2 b       
#> 10 2021-01-10 <NA>  b                    2 b       
#> 11 2021-01-11 a     a                    3 a       
#> 12 2021-01-12 <NA>  a                    3 a       
#> 13 2021-01-13 <NA>  a                    3 a       
#> 14 2021-01-14 <NA>  a                    3 a       
#> 15 2021-01-15 <NA>  <NA>                 3 <NA>

^{Created on 2021-07-23 by the reprex package (v2.0.0)}

ramakant · July 24, 2021, 10:39am

jrmuirhead:

t <- tibble(date = seq(as.Date("2021-01-01"), as.Date("2021-01-15"),
  by = "1 day"), 
  name = c(NA,NA,"a",NA,NA,NA,NA,"b",NA,NA,"a",NA,NA,NA,NA))

t$expected <- c(NA,NA,"a","a","a","a",NA,"b","b","b","a","a","a","a",NA)

t <- t %>%
  # Find rows where values are not missing and assign to group
  mutate(value_present = cumsum(!is.na(name))) %>%
  # Group data by groups starting with non-missing value
  group_by(value_present)


# For each group, fill in the first value of that group for cases
# where the date window is 4 days or less
t <- t %>%
  mutate(new_name = case_when(date - min(date) <= 3 ~ head(name, 1)))

thanks /u/jrmuirhead ... this works pretty very elegantly. i've got the data in the manner that i've wanted. I'd never have been able to figure out the cumsum and head(name,1) method!

however, there's something different that I'm able to see if i were to run the same code twice. reprex is produced below. I ran the piped command two times on the same data set and there seems to be two different results that are produced. the first time is as per your instructions and is absolutely correct. However, with the data still grouped on value_present, would u be able to help why does it just get reset to 0 and 1 when it run for the second time? I'm intrigued why does a double grouping on the same column change the output? not that it affects my data, but i'd like to expand my understanding of the code. thanks once again!

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.3
#> Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
#> when loading 'dplyr'
#> Warning: package 'tibble' was built under R version 4.0.2

t <- tibble(date = seq(as.Date("2021-01-01"), as.Date("2021-01-15"),
                       by = "1 day"), 
            name = c(NA,NA,"a",NA,NA,NA,NA,"b",NA,NA,"a",NA,NA,NA,NA))

#t$expected <- c(NA,NA,"a","a","a","a",NA,"b","b","b","a","a","a","a",NA)

t
#> # A tibble: 15 x 2
#>    date       name 
#>    <date>     <chr>
#>  1 2021-01-01 <NA> 
#>  2 2021-01-02 <NA> 
#>  3 2021-01-03 a    
#>  4 2021-01-04 <NA> 
#>  5 2021-01-05 <NA> 
#>  6 2021-01-06 <NA> 
#>  7 2021-01-07 <NA> 
#>  8 2021-01-08 b    
#>  9 2021-01-09 <NA> 
#> 10 2021-01-10 <NA> 
#> 11 2021-01-11 a    
#> 12 2021-01-12 <NA> 
#> 13 2021-01-13 <NA> 
#> 14 2021-01-14 <NA> 
#> 15 2021-01-15 <NA>

t <- t %>%
  mutate(value_present = cumsum(!is.na(name))) %>%
  group_by(value_present) %>%
  mutate(new_name = case_when(date - min(date) <= 3 ~ head(name, 1)))
t
#> # A tibble: 15 x 4
#> # Groups:   value_present [4]
#>    date       name  value_present new_name
#>    <date>     <chr>         <int> <chr>   
#>  1 2021-01-01 <NA>              0 <NA>    
#>  2 2021-01-02 <NA>              0 <NA>    
#>  3 2021-01-03 a                 1 a       
#>  4 2021-01-04 <NA>              1 a       
#>  5 2021-01-05 <NA>              1 a       
#>  6 2021-01-06 <NA>              1 a       
#>  7 2021-01-07 <NA>              1 <NA>    
#>  8 2021-01-08 b                 2 b       
#>  9 2021-01-09 <NA>              2 b       
#> 10 2021-01-10 <NA>              2 b       
#> 11 2021-01-11 a                 3 a       
#> 12 2021-01-12 <NA>              3 a       
#> 13 2021-01-13 <NA>              3 a       
#> 14 2021-01-14 <NA>              3 a       
#> 15 2021-01-15 <NA>              3 <NA>

str(t)
#> tibble [15 x 4] (S3: grouped_df/tbl_df/tbl/data.frame)
#>  $ date         : Date[1:15], format: "2021-01-01" "2021-01-02" ...
#>  $ name         : chr [1:15] NA NA "a" NA ...
#>  $ value_present: int [1:15] 0 0 1 1 1 1 1 2 2 2 ...
#>  $ new_name     : chr [1:15] NA NA "a" "a" ...
#>  - attr(*, "groups")= tibble [4 x 2] (S3: tbl_df/tbl/data.frame)
#>   ..$ value_present: int [1:4] 0 1 2 3
#>   ..$ .rows        : list<int> [1:4] 
#>   .. ..$ : int [1:2] 1 2
#>   .. ..$ : int [1:5] 3 4 5 6 7
#>   .. ..$ : int [1:3] 8 9 10
#>   .. ..$ : int [1:5] 11 12 13 14 15
#>   .. ..@ ptype: int(0) 
#>   ..- attr(*, ".drop")= logi TRUE

#running the same code for the second time 
t <- t %>%
  mutate(value_present = cumsum(!is.na(name))) %>%
  group_by(value_present) %>%
  mutate(new_name = case_when(date - min(date) <= 3 ~ head(name, 1)))
t
#> # A tibble: 15 x 4
#> # Groups:   value_present [2]
#>    date       name  value_present new_name
#>    <date>     <chr>         <int> <chr>   
#>  1 2021-01-01 <NA>              0 <NA>    
#>  2 2021-01-02 <NA>              0 <NA>    
#>  3 2021-01-03 a                 1 a       
#>  4 2021-01-04 <NA>              1 a       
#>  5 2021-01-05 <NA>              1 a       
#>  6 2021-01-06 <NA>              1 a       
#>  7 2021-01-07 <NA>              1 <NA>    
#>  8 2021-01-08 b                 1 <NA>    
#>  9 2021-01-09 <NA>              1 <NA>    
#> 10 2021-01-10 <NA>              1 <NA>    
#> 11 2021-01-11 a                 1 <NA>    
#> 12 2021-01-12 <NA>              1 <NA>    
#> 13 2021-01-13 <NA>              1 <NA>    
#> 14 2021-01-14 <NA>              1 <NA>    
#> 15 2021-01-15 <NA>              1 <NA>

^{Created on 2021-07-24 by the reprex package (v0.3.0)}

jrmuirhead · July 24, 2021, 11:13am

Hi @ramakant ,

That's a good question, and I don't have a ready answer for this. It might be related to the issue raised at dplyr issue #5598. The only suggestion I can make is to armor the code against this by specifically arranging the data by date and ungrouping at the end.

library("tidyverse")

t <- tibble(date = seq(as.Date("2021-01-01"), as.Date("2021-01-15"),
  by = "1 day"), 
  name = c(NA,NA,"a",NA,NA,NA,NA,"b",NA,NA,"a",NA,NA,NA,NA))

t$expected <- c(NA,NA,"a","a","a","a",NA,"b","b","b","a","a","a","a",NA)

t <- t %>%
  arrange(date) %>%
  mutate(value_present = cumsum(!is.na(name))) %>%
  group_by(value_present) %>%
  mutate(new_name = case_when(date - min(date) <= 3 ~ head(name, 1))) %>%
  ungroup()
t
#> # A tibble: 15 × 5
#>    date       name  expected value_present new_name
#>    <date>     <chr> <chr>            <int> <chr>   
#>  1 2021-01-01 <NA>  <NA>                 0 <NA>    
#>  2 2021-01-02 <NA>  <NA>                 0 <NA>    
#>  3 2021-01-03 a     a                    1 a       
#>  4 2021-01-04 <NA>  a                    1 a       
#>  5 2021-01-05 <NA>  a                    1 a       
#>  6 2021-01-06 <NA>  a                    1 a       
#>  7 2021-01-07 <NA>  <NA>                 1 <NA>    
#>  8 2021-01-08 b     b                    2 b       
#>  9 2021-01-09 <NA>  b                    2 b       
#> 10 2021-01-10 <NA>  b                    2 b       
#> 11 2021-01-11 a     a                    3 a       
#> 12 2021-01-12 <NA>  a                    3 a       
#> 13 2021-01-13 <NA>  a                    3 a       
#> 14 2021-01-14 <NA>  a                    3 a       
#> 15 2021-01-15 <NA>  <NA>                 3 <NA>

t <- t %>%
  arrange(date) %>%
  mutate(value_present = cumsum(!is.na(name))) %>%
  group_by(value_present) %>%
  mutate(new_name = case_when(date - min(date) <= 3 ~ head(name, 1))) %>%
  ungroup()
t
#> # A tibble: 15 × 5
#>    date       name  expected value_present new_name
#>    <date>     <chr> <chr>            <int> <chr>   
#>  1 2021-01-01 <NA>  <NA>                 0 <NA>    
#>  2 2021-01-02 <NA>  <NA>                 0 <NA>    
#>  3 2021-01-03 a     a                    1 a       
#>  4 2021-01-04 <NA>  a                    1 a       
#>  5 2021-01-05 <NA>  a                    1 a       
#>  6 2021-01-06 <NA>  a                    1 a       
#>  7 2021-01-07 <NA>  <NA>                 1 <NA>    
#>  8 2021-01-08 b     b                    2 b       
#>  9 2021-01-09 <NA>  b                    2 b       
#> 10 2021-01-10 <NA>  b                    2 b       
#> 11 2021-01-11 a     a                    3 a       
#> 12 2021-01-12 <NA>  a                    3 a       
#> 13 2021-01-13 <NA>  a                    3 a       
#> 14 2021-01-14 <NA>  a                    3 a       
#> 15 2021-01-15 <NA>  <NA>                 3 <NA>

str(t)
#> tibble [15 × 5] (S3: tbl_df/tbl/data.frame)
#>  $ date         : Date[1:15], format: "2021-01-01" "2021-01-02" ...
#>  $ name         : chr [1:15] NA NA "a" NA ...
#>  $ expected     : chr [1:15] NA NA "a" "a" ...
#>  $ value_present: int [1:15] 0 0 1 1 1 1 1 2 2 2 ...
#>  $ new_name     : chr [1:15] NA NA "a" "a" ...

^{Created on 2021-07-24 by the reprex package (v2.0.0)}

system · July 31, 2021, 11:13am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.