How to delete rows prior to a certain condition?

Maike · July 26, 2019, 2:53pm

I want to use dplyr to first group events by ID. Then I would like to choose two possible start events and discard all events before the first of these start events within each ID group. Here is an example:

Data:

id <- c(1, 1, 1, 1, 2, 2, 2, 2)
timeorder <- c(1, 2, 3, 4, 1, 2, 3, 4)
events1 <- c("a", "b", "a", "b", "a", "a", "a", "b")
events2 <- c("x", "x", "x", "x", "x", "y", "x", "y")
testdata <- data.frame(id, timeorder, events1, events2)

What I am aiming for:

Let's say the rule is: start event b (in events1) or start event y (in events2)

Then, for ID 1 the results should be:
events1: b, a, b
events2: x, x, x
In other words: disregard all events prior to the first b.

And, for ID 2 the results should be:
events1: a, a, b
events2: y, x, y
In other words: since event y occurred before event b, the filtering is done on event y. All events prior to the first event y are deleted.

I have tried several filter options, among which:

library(dplyr)
filtertest <- testdata %>%
group_by(id, timeorder) %>%
filter(events1 != max("b") | events2 != max("y"))

Which, unfortunately, does not give me the result I am aiming for. Maybe I need a while statement somewhere? I cannot figure this out.

I hope I have made my question clear and would appreciate your help a lot! Thank you!

joels · July 26, 2019, 5:02pm

How about this:

testdata %>% 
  arrange(id, timeorder) %>%
  group_by(id) %>% 
  filter(cumsum(events1=="b" | events2=="y") > 0)

     id timeorder events1 events2
  <dbl>     <dbl> <fct>   <fct>  
1     1         2 b       x      
2     1         3 a       x      
3     1         4 b       x      
4     2         2 a       y      
5     2         3 a       x      
6     2         4 b       y

The code below may make it easier to see what's going on. events1=="b" | events2=="y" returns TRUE for any row that meets one or both conditions. Because the data is grouped by id, cumsum then keeps a running total within each id, adding 1 for each TRUE and 0 for each FALSE. The total stays at zero until the first start event, so filtering on a running total greater than zero (sum_test below) keeps that row and everything after it.

testdata %>% 
  arrange(id, timeorder) %>%
  group_by(id) %>% 
  mutate(test = events1=="b" | events2=="y",
         sum_test = cumsum(test))

     id timeorder events1 events2 test  sum_test
  <dbl>     <dbl> <fct>   <fct>   <lgl>    <int>
1     1         1 a       x       FALSE        0
2     1         2 b       x       TRUE         1
3     1         3 a       x       FALSE        1
4     1         4 b       x       TRUE         2
5     2         1 a       x       FALSE        0
6     2         2 a       y       TRUE         1
7     2         3 a       x       FALSE        1
8     2         4 b       y       TRUE         2
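
As a possible alternative, dplyr also has cumany(), which is TRUE from the first TRUE onward, so the running total does not need to be compared against zero. The sketch below rebuilds the example data from the question so it can be run on its own; the name result is only for illustration. It assumes there are no missing values in the event columns, and, like the cumsum version, it drops every row of an id that never has a start event.

```r
library(dplyr)

# Same example data as in the question
testdata <- data.frame(
  id        = c(1, 1, 1, 1, 2, 2, 2, 2),
  timeorder = c(1, 2, 3, 4, 1, 2, 3, 4),
  events1   = c("a", "b", "a", "b", "a", "a", "a", "b"),
  events2   = c("x", "x", "x", "x", "x", "y", "x", "y")
)

# Within each id, keep the first row that matches either start event
# and every row after it (cumany() stays TRUE once it has seen a TRUE)
result <- testdata %>%
  arrange(id, timeorder) %>%
  group_by(id) %>%
  filter(cumany(events1 == "b" | events2 == "y")) %>%
  ungroup()

print(result)
```

This should return the same six rows as the cumsum filter above. Which one to use is mostly a matter of taste: cumany states the intent more directly, while cumsum also gives a count of how many start events have been seen so far.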

Maike · July 27, 2019, 10:02am

Absolutely perfect! Thank you so much!
