Data manipulation, creating a variable that captures the number 1s that are consecutive in a vector of 1s and 0s

garridoalberto · August 28, 2018, 2:43pm

How can I create a variable that counts the number of consecutive observations with the same value?
Let's say my variable is [0,1,0,0,0, 1,1, 1,0, 0,1,1,0,0,1,1,1,1,1,0, 0, 1, 0, 1,1]

I want to know, with respect to that vector, that
the number of 1 consecutive 1s, is 2
the number of 2 consecutive 1s is 2
the number of 3 consecutive 1s is 1
the number of 4 consecutive 1s is 0
the number of 5 consecutive 1s us 1

Of course, my database is large

Thank you

mishabalyasin · August 28, 2018, 3:12pm

Take a look at rle function: https://stat.ethz.ch/R-manual/R-devel/library/base/html/rle.html.
I'm not sure if this will work on DB though, but you can at least search for something similar.

garridoalberto · August 28, 2018, 3:19pm

Looks promising
Thank you!

cderv · August 29, 2018, 6:56am

You can do something using dplyr workflow

var <- c(0,1,0,0,0, 1,1, 1,0, 0,1,1,0,0,1,1,1,1,1,0, 0, 1, 0, 1, 1)

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
res <- tibble(var = var) %>%
  # use lag to see if next value is equal to previous
  mutate(diff = lag(var, 1, default = 9)) %>%
  mutate(equal = var != diff) %>%
  # get an id for each group
  mutate(seq_id = cumsum(equal)) %>%
  #  we are interesting only by the 1s
  filter(var == 1) %>%
  # count how many 1s in each group
  count(var, seq_id) %>%
  # count how many times there is n 1s
  count(n)
res
#> # A tibble: 4 x 2
#>       n    nn
#>   <int> <int>
#> 1     1     2
#> 2     2     2
#> 3     3     1
#> 4     5     1

You can then see if there is a SQL translation that works for your database. Here is a generic translation.

library(dbplyr)
#> 
#> Attachement du package : 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
lazy_frame(var = var, src = simulate_dbi()) %>%
  # use lag to see if next value is equal to previous
  mutate(diff = lag(var, 1, default = 9)) %>%
  mutate(equal = var != diff) %>%
  # get an id for each group
  window_order(equal) %>% # needed. see warning without
  mutate(seq_id = cumsum(equal)) %>%
  #  we are interesting only by the 1s
  filter(var == 1) %>%
  # count how many 1s in each group
  count(var, seq_id) %>%
  # count how many times there is n 1s
  count(n) %>% show_query()
#> <SQL> SELECT "var", "n", COUNT() AS "nn"
#> FROM (SELECT "var", "seq_id", COUNT() AS "n"
#> FROM (SELECT *
#> FROM (SELECT "var", "diff", "equal", sum("equal") OVER (ORDER BY "equal" ROWS UNBOUNDED PRECEDING) AS "seq_id"
#> FROM (SELECT "var", "diff", "var" != "diff" AS "equal"
#> FROM (SELECT "var", LAG("var", 1, 9.0) OVER () AS "diff"
#> FROM "df") "ykyweekteo") "axbeluvfyo") "nhimatobvm"
#> WHERE ("var" = 1.0)) "qxojiberyj"
#> GROUP BY "var", "seq_id") "rrkfkuqlhj"
#> GROUP BY "var", "n"

otherwise, in memory, you have also the data.table framework that can help you to get the same results. It has included an efficient rleid function.

var <- c(0,1,0,0,0, 1,1, 1,0, 0,1,1,0,0,1,1,1,1,1,0, 0, 1, 0, 1, 1)

library(data.table)
tab <- data.table(var = var)
tab[, id := rleid(var)]
res <- tab[var == 1, .(nb1 = .N), keyby = .(id, var)][, .(nb2 = .N), keyby = nb1]
res
#>    nb1 nb2
#> 1:   1   2
#> 2:   2   2
#> 3:   3   1
#> 4:   5   1

the number of 1 consecutive 1s, is 2
the number of 2 consecutive 1s is 2
the number of 3 consecutive 1s is 1
the number of 4 consecutive 1s is 0
the number of 5 consecutive 1s us 1

garridoalberto · August 29, 2018, 1:17pm

Thank you. This seems promising too