# Find pattern in a numeric column

Hi,

I have problems wrapping my head around this problem without writing an excessive code with a ton of ifs.
Suppose you have a dataset:

``````library(dplyr)
t = tibble(Time = c(1:100), Value = sample(c(rep(1, 50), rep(0, 50))))
``````

Your goal is to find every sequence of 1s occuring on 4 or more subsequent rows for the Value column. So for example rows 1,1,1,1,1 would be a hit but 1,0,1,1,1 not. For every such sequence, you need to store the value of Time for the first occurring 1 as well as for the last occurring 1 of this sequence. For example in a tibble `result = tibble(mintime, maxtime)` . Note, there will be multiple such sequences and the number of subsequent 1 can vary widely, so the solution should be generalizable for sequences 4:n.

Thanks!

Run length encoding. See my post

1 Like

I could only come up with this embarrassingly complex solution but I have the feeling it must be a way simpler one.

``````library(tidyverse)
library(RcppRoll)

set.seed(1234)

t <- tibble(Time = c(1:100),
Value = sample(c(rep(1, 50), rep(0, 50))))
t %>%
mutate(roll_sum = roll_sum(Value, 4, align = "right", fill = NA)) %>%
filter((Value == 1 & roll_sum == 4) | (Value == 1 & lead(roll_sum, 3) == 4)) %>%
mutate(id = if_else(roll_sum == 4 & (lead(roll_sum) != 4 | is.na(lead(roll_sum))), row_number(), NA_integer_)) %>%
fill(id, .direction = "up") %>%
group_by(id) %>%
summarise(first_time = min(Time),
last_time = max(Time)) %>%
select(-id)
#> # A tibble: 5 x 2
#>   first_time last_time
#>        <int>     <int>
#> 1          3         8
#> 2         27        33
#> 3         56        61
#> 4         82        85
#> 5         88        91
``````

Created on 2020-01-17 by the reprex package (v0.3.0.9000)

1 Like

I think the following implementation using `rle` (as Richard suggested) is simpler, and it uses no additional package except base R.

``````set.seed(seed = 1234)

fake_data <- data.frame(Time = 1:100,
Value = sample(x = rep(x = c(1, 0),
each = 50)))

with(data = rle(x = fake_data\$Value),
expr =
{
valid_length_positions <- which(x = ((values == 1) & (lengths >= 4)))
cumulative_lengths <- cumsum(x = lengths)
data.frame(minimum_Time = fake_data\$Time[cumulative_lengths[valid_length_positions - 1] + 1],
maximum_Time = fake_data\$Time[cumulative_lengths[valid_length_positions]])

})
#>   minimum_Time maximum_Time
#> 1            3            8
#> 2           27           33
#> 3           56           61
#> 4           82           85
#> 5           88           91
``````

Created on 2020-01-17 by the reprex package (v0.3.0)

2 Likes

Thanks guys & sorry for late reply. Yarnabrina's solution looks cleaner. I was surprised myself that I could not solve it with dplyr without making mess. That being said, I encounter this kind of problems/task quiteoften (for example looking for specific trigger groups when working with experimental data, searching for changes in time-series data etc). Thanks again!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.