Is it possible to filter n succeeding rows from a condition based on the preceding row?

dplyr

#1

Let's say we're dealing with mtcars, is it possible to filter rows where mpg == 22.8 and the next two rows?

So like mtcars %>% filter(mpg == 22.8 & next_n(2)) (if something like next_n() existed)

So kinda like the picture below. I feel like this is something stupidly simple, but I can't come up with it. Any help is greatly appreciated!


#2

Alright, so I have a reeeeeeeaaaaaally hacky solution here. I'm still trying to find something that's more "automated" than this is.

library(dplyr)

mtcars %>%
  as_tibble() %>%
  filter(between(row_number(), left = which(mtcars$mpg == 22.8)[1], right = (which(mtcars$mpg == 22.8)[1] + 2)) |
           between(row_number(), left = which(mtcars$mpg == 22.8)[2], right = (which(mtcars$mpg == 22.8)[2] + 2)))

#> # A tibble: 6 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  22.8  4.00   108  93.0  3.85  2.32  18.6  1.00  1.00  4.00  1.00
#> 2  21.4  6.00   258 110    3.08  3.22  19.4  1.00  0     3.00  1.00
#> 3  18.7  8.00   360 175    3.15  3.44  17.0  0     0     3.00  2.00
#> 4  22.8  4.00   141  95.0  3.92  3.15  22.9  1.00  0     4.00  2.00
#> 5  19.2  6.00   168 123    3.92  3.44  18.3  1.00  0     4.00  4.00
#> 6  17.8  6.00   168 123    3.92  3.44  18.9  1.00  0     4.00  4.00

Created on 2018-06-30 by the reprex package (v0.2.0).


#3

You can create three explicit filters for the three different lags, although this requires repetition of similar code. Note that the lag function below is the dplyr lag function rather than the base lag function.

library(dplyr)

mtcars %>% 
  filter(near(mpg, 22.8) |
         near(lag(mpg, 1), 22.8) |
         near(lag(mpg, 2), 22.8))

You can iterate over the lags using map, which saves some typing and increases flexibility if you want to create a function that can take arbitrary numbers of lags and lag values:

mtcars %>% 
  filter(map_df(0:2 %>% set_names,  ~ near(lag(mpg, .x), 22.8)) %>% 
           rowSums(., na.rm=TRUE) > 0)
   mpg cyl  disp  hp drat    wt  qsec vs am gear carb row
1 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1   3
2 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   4
3 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   5
4 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2   9
5 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4  10
6 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4  11

#4

Thanks so much for the answers! I really appreciate it. Just curious, what's the point of default = -1? I'm not familiar with lag()


#5

The default argument determines what to do with missing values. The default for default is NA. In my first attempt at answering your question, I needed to set default to a number, but it's not necessary in the code above and I've removed it.


#6

ah okay I see. Awesome, thanks again.


#7

If you're checking row numbers, might as well use slice:

library(magrittr)
mtcars %>% slice(
    which(near(mpg, 22.8)) %>% 
    mapply(FUN = add, list(0:2))
  )

One nice thing about slice is that you don't need to check for intersection with row_number; it will just ignore numbers outside the range. So, with mtcars having 32 rows, slice(mtcars, 30:35) gives rows 30-32.

If you don't want to load magrittr, then FUN = "+" or FUN = `+` should work.


If you have magrittr loaded, you can also use it to inspect the object passed to slice using %T>%, which I find useful for debugging:

> mtcars %>% slice(
+     which(near(mpg, 22.8)) %>% 
+     mapply(FUN = add, list(0:2)) %T>% print
+   )
     [,1] [,2]
[1,]    3    9
[2,]    4   10
[3,]    5   11
# A tibble: 6 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
2  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
3  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
4  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
5  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
6  17.8     6 167.6   123  3.92 3.440 18.90     1     0     4     4

#8

Be cautious when relying on implicit ordering of rows (or columns or elements in a vector). Usually, there's another way that clarifies the goal and survives minor changes to input.

What's the problem you're trying to solve?