Find pattern in a numeric column

Enuma · January 17, 2020, 1:37am

Hi,

I have problems wrapping my head around this problem without writing an excessive code with a ton of ifs.
Suppose you have a dataset:

library(dplyr)
t = tibble(Time = c(1:100), Value = sample(c(rep(1, 50), rep(0, 50))))

Your goal is to find every sequence of 1s occuring on 4 or more subsequent rows for the Value column. So for example rows 1,1,1,1,1 would be a hit but 1,0,1,1,1 not. For every such sequence, you need to store the value of Time for the first occurring 1 as well as for the last occurring 1 of this sequence. For example in a tibble result = tibble(mintime, maxtime) . Note, there will be multiple such sequences and the number of subsequent 1 can vary widely, so the solution should be generalizable for sequences 4:n.

Thanks!

technocrat · January 17, 2020, 4:36am

Run length encoding. See my post

andresrcs · January 17, 2020, 4:05pm

I could only come up with this embarrassingly complex solution but I have the feeling it must be a way simpler one.

library(tidyverse)
library(RcppRoll)

set.seed(1234)

t <- tibble(Time = c(1:100),
            Value = sample(c(rep(1, 50), rep(0, 50))))
t %>% 
    mutate(roll_sum = roll_sum(Value, 4, align = "right", fill = NA)) %>%
    filter((Value == 1 & roll_sum == 4) | (Value == 1 & lead(roll_sum, 3) == 4)) %>% 
    mutate(id = if_else(roll_sum == 4 & (lead(roll_sum) != 4 | is.na(lead(roll_sum))), row_number(), NA_integer_)) %>% 
    fill(id, .direction = "up") %>%
    group_by(id) %>% 
    summarise(first_time = min(Time),
              last_time = max(Time)) %>% 
    select(-id)
#> # A tibble: 5 x 2
#>   first_time last_time
#>        <int>     <int>
#> 1          3         8
#> 2         27        33
#> 3         56        61
#> 4         82        85
#> 5         88        91

^{Created on 2020-01-17 by the reprex package (v0.3.0.9000)}

Yarnabrina · January 17, 2020, 4:34pm

I think the following implementation using rle (as Richard suggested) is simpler, and it uses no additional package except base R.

set.seed(seed = 1234)

fake_data <- data.frame(Time = 1:100,
                        Value = sample(x = rep(x = c(1, 0),
                                               each = 50)))

with(data = rle(x = fake_data$Value),
     expr =
       {
         valid_length_positions <- which(x = ((values == 1) & (lengths >= 4)))
         cumulative_lengths <- cumsum(x = lengths)
         data.frame(minimum_Time = fake_data$Time[cumulative_lengths[valid_length_positions - 1] + 1],
                    maximum_Time = fake_data$Time[cumulative_lengths[valid_length_positions]])
         
       })
#>   minimum_Time maximum_Time
#> 1            3            8
#> 2           27           33
#> 3           56           61
#> 4           82           85
#> 5           88           91

^{Created on 2020-01-17 by the reprex package (v0.3.0)}

Enuma · January 19, 2020, 1:56pm

Thanks guys & sorry for late reply. Yarnabrina's solution looks cleaner. I was surprised myself that I could not solve it with dplyr without making mess. That being said, I encounter this kind of problems/task quiteoften (for example looking for specific trigger groups when working with experimental data, searching for changes in time-series data etc). Thanks again!

system · January 26, 2020, 1:56pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.