Setting a variable value for the first instance when a condition is met *within a group*

A colleague was working with data describing time series within individuals and was looking for a way to set up a variable that would switch "on" at the first instance of a flag variable value within each individual.

Her initial question was "Is there a way to do loops with the tidyverse?". Well, yes, but... fortunately no loops are required for the solution...

Here's the reprex of the solution:

Creating a reprex dataset
Subject 1 discontinues at TIME=3
Subject 2 discontinues but has no further observations
Subject 3 does not discontinue

library(tidyverse)

myData <- tribble(~ID, ~TIME, ~DV, ~FLAG,
                  1, 1, 0.1, 194,
                  1, 2, 0.2, 194,
                  1, 3, 0.3, 5000,
                  1, 4, 0.1, 194,
                  2, 1, 0.1, 194,
                  2, 2, 0.2, 5000,
                  3, 1, 0.1, 194, 
                  3, 2, 0.2, 194,
                  3, 3, 0.3, 194,
                  3, 4, 0.4, 194)

Want to set the discontinuation (DISC) variable to
have value 0 until we see the first instance of FLAG == 5000
within each individual.

myData %>%
  group_by(ID) %>%
  mutate(DISC = case_when(row_number()==1 ~ 0,
                        FLAG == 5000 ~ 1)) %>%
  fill(DISC)
#> # A tibble: 10 x 5
#> # Groups:   ID [3]
#>       ID  TIME    DV  FLAG  DISC
#>    <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     1     1   0.1   194     0
#>  2     1     2   0.2   194     0
#>  3     1     3   0.3  5000     1
#>  4     1     4   0.1   194     1
#>  5     2     1   0.1   194     0
#>  6     2     2   0.2  5000     1
#>  7     3     1   0.1   194     0
#>  8     3     2   0.2   194     0
#>  9     3     3   0.3   194     0
#> 10     3     4   0.4   194     0

Created on 2019-06-27 by the reprex package (v0.2.1)

mydata <- mydata %>% mutate(DISC = 0) %>% mutate(DISC = ifelse(FLAG == 5000, 1, FLAG))

This assumes, of course that FLAG never exceeds 5,000. Row 4 & 5 in the reprex confuse me: I don't know how to interpret it FLAG == 1 at 5,000 and next record, but goes back to 0 after that?

Nice, a simple modification can make this way more powerful for counting and separating groups. Let's say case 1 has no flags, 2 has one and 3 has 3:

df <- tribble(~ID, ~TIME, ~DV, ~FLAG,
                  1, 1, 0.1, 194,
                  1, 2, 0.2, 194,
                  1, 3, 0.3, 194,
                  2, 1, 0.1, 194,
                  2, 2, 0.2, 5000,
                  2, 2, 0.3, 194,
                  2, 2, 0.3, 194,
                  3, 1, 0.1, 194, 
                  3, 2, 0.2, 5000,
                  3, 1, 0.2, 194, 
                  3, 3, 0.3, 5000,
                  3, 4, 0.4, 194)

Then we can use the Flag as a counter and do a cumsum, by ID group. This also fixes the 0 value issue without a fill:

df %>%
    group_by(ID) %>%
    mutate(GROUP = cumsum(case_when(FLAG == 5000 ~ 1,
                                    TRUE ~ 0))) 

This will allow to separate ID's into different groups, depending on the logic expression. It is super useful in time series to separate windows of events by a "mother" group :wink:

2 Likes

Thanks for the comments. I wonder if a little more description of the data collection would help with context: So imagine that we are making observations on subjects over time and we want to identify which observations are to be included in the final analysis. A subject can discontinue treatment (rather than discontinue from the study) which is given by FLAG == 5000, but they can then still have observations taken in follow-up. Our DISC flag needs to be set so that it switches "on" when the subject discontinues treatment, even if there are subsequent observations. We can then filter on that later to get observations when they were on treatment (where DISC == 0).

That helps (along with spotting my spurious question).

I think that's what my snippet does. You could initialize the whole DISC column to 0 (or better make it LGL and set to FALSE) and whenever treatment discontinues flip it to TRUE. In which case

mydata <- mydata %>%  mutate(DISC = ifelse(FLAG == 5000, TRUE, FLAG))

Your code snippet seems to just recode the FLAG value though. What I'm trying to achieve is that the DISC value is set to 1 on the first observation with FLAG == 5000 but then remains 1 for all remaining observations for that subject. I see now that the catch in my code is that I'm assuming that the first observation will NOT have FLAG == 5000. This is because case_when will not check the second criteria if the first is met.

Fixed that in this code below:

library(tidyverse)

myData <- tribble(~ID, ~TIME, ~DV, ~FLAG,
                  1, 1, 0.1, 194,
                  1, 2, 0.2, 194,
                  1, 3, 0.3, 5000,
                  1, 4, 0.1, 194,
                  2, 1, 0.1, 194,
                  2, 2, 0.2, 5000,
                  3, 1, 0.1, 194, 
                  3, 2, 0.2, 194,
                  3, 3, 0.3, 194,
                  3, 4, 0.4, 194,
                  4, 1, 0.1, 5000,
                  4, 2, 0.2, 194)

myData %>%
  group_by(ID) %>%
  mutate(DISC = ifelse(row_number() == 1, 0, NA),
    DISC = ifelse(FLAG == 5000, 1, DISC)) %>%
  fill(DISC)
#> # A tibble: 12 x 5
#> # Groups:   ID [4]
#>       ID  TIME    DV  FLAG  DISC
#>    <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     1     1   0.1   194     0
#>  2     1     2   0.2   194     0
#>  3     1     3   0.3  5000     1
#>  4     1     4   0.1   194     1
#>  5     2     1   0.1   194     0
#>  6     2     2   0.2  5000     1
#>  7     3     1   0.1   194     0
#>  8     3     2   0.2   194     0
#>  9     3     3   0.3   194     0
#> 10     3     4   0.4   194     0
#> 11     4     1   0.1  5000     1
#> 12     4     2   0.2   194     1

Created on 2019-07-02 by the reprex package (v0.2.1)

1 Like

You can account for discontinuation from the first row by checking FLAG == 5000 first in the case_when().

library(tidyverse)
library(magrittr)

myData %>%
  group_by(ID) %>%
  mutate(DISC = case_when(
    FLAG == 5000 ~ 1,  #<< check if discontinued from row 1
    row_number() == 1 ~ 0
  )) %>%
  fill(DISC)
#> # A tibble: 10 x 5
#> # Groups:   ID [3]
#>       ID  TIME    DV  FLAG  DISC
#>    <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     1     1   0.1   194     0
#>  2     1     2   0.2   194     0
#>  3     1     3   0.3  5000     1
#>  4     1     4   0.1   194     1
#>  5     2     1   0.1   194     0
#>  6     2     2   0.2  5000     1
#>  7     3     1   0.1   194     0
#>  8     3     2   0.2   194     0
#>  9     3     3   0.3   194     0
#> 10     3     4   0.4   194     0

@technocrat's cumsum() answer is generally how I approach this. It's a clever solution; here it is with output to illustrate the process.

myData %>% 
  group_by(ID) %>% 
  mutate(DISC = FLAG >= 5000) %T>% print() %>%
#> # A tibble: 10 x 5
#> # Groups:   ID [3]
#>       ID  TIME    DV  FLAG DISC 
#>    <dbl> <dbl> <dbl> <dbl> <lgl>
#>  1     1     1   0.1   194 FALSE
#>  2     1     2   0.2   194 FALSE
#>  3     1     3   0.3  5000 TRUE 
#>  4     1     4   0.1   194 FALSE
#>  5     2     1   0.1   194 FALSE
#>  6     2     2   0.2  5000 TRUE 
#>  7     3     1   0.1   194 FALSE
#>  8     3     2   0.2   194 FALSE
#>  9     3     3   0.3   194 FALSE
#> 10     3     4   0.4   194 FALSE
  mutate(DISC = cumsum(DISC)) %T>% print() %>%
#> # A tibble: 10 x 5
#> # Groups:   ID [3]
#>       ID  TIME    DV  FLAG  DISC
#>    <dbl> <dbl> <dbl> <dbl> <int>
#>  1     1     1   0.1   194     0
#>  2     1     2   0.2   194     0
#>  3     1     3   0.3  5000     1
#>  4     1     4   0.1   194     1
#>  5     2     1   0.1   194     0
#>  6     2     2   0.2  5000     1
#>  7     3     1   0.1   194     0
#>  8     3     2   0.2   194     0
#>  9     3     3   0.3   194     0
#> 10     3     4   0.4   194     0
  mutate(DISC = DISC > 0)
#> # A tibble: 10 x 5
#> # Groups:   ID [3]
#>       ID  TIME    DV  FLAG DISC 
#>    <dbl> <dbl> <dbl> <dbl> <lgl>
#>  1     1     1   0.1   194 FALSE
#>  2     1     2   0.2   194 FALSE
#>  3     1     3   0.3  5000 TRUE 
#>  4     1     4   0.1   194 TRUE 
#>  5     2     1   0.1   194 FALSE
#>  6     2     2   0.2  5000 TRUE 
#>  7     3     1   0.1   194 FALSE
#>  8     3     2   0.2   194 FALSE
#>  9     3     3   0.3   194 FALSE
#> 10     3     4   0.4   194 FALSE

More concisely...

myData %>% 
  group_by(ID) %>% 
  mutate(DISC = cumsum(FLAG >= 5000) > 0)

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.