Using preceding/following rows to code a new column

joewranglesdata · October 21, 2022, 2:22pm

I have some data that looks a bit like this:

df_start <- data.frame(time = c(1:10),
                  var1 = c(0, 0, 1, 1, 1, 1, 0, 0, 1, 0))

For my purposes, var1 contains 5 consecutive stretches of the same value: two 0s, four 1s, two 0s, one 1, one 0. I want to know the value of 'time' where each of these stretches starts and ends.

I want to create a new column, seq, that tells me whether each value in var1 is at the start or end of a stretch of 0s or 1s. If it is the first value of a stretch, it should have the value "start"; if it's in the middle of a stretch, "mid"; and if it's at the end of a stretch, "end". If it's the only 0 or only 1 in its stretch, it should have the value "solo".

Essentially, I'm trying to end up with a dataframe that looks like this:

df_aim <- data.frame(time = c(1:10),
                  var1 = c(0, 0, 1, 1, 1, 1, 0, 0, 1, 0),
                  seq = c("start", "end", "start", "mid", "mid", "end", "start", "end", "solo", "solo"))

I know I can use case_when() within mutate to create seq, and give it a value based on what's happening in var1. But is there a way to get it to look at what's happening in the row above or below to code the new column?

I'm imagining something like this:

#Not real code!!
df_start %>% mutate(seq = case_when(
      row_above != current_row & row_below == current_row ~ "start",
      row_above == current_row & row_below == current_row ~ "mid",
      row_above == current_row & row_below != current_row ~ "end",
      row_above != current_row & row_below != current_row ~ "solo"
    )
  )

But I'm not sure how to implement this in the tidyverse, because I don't know how to make it look at the row above or below each row. Any thoughts/suggestions?

Thanks in advance!

nirgrahamuk · October 21, 2022, 2:37pm

library(tidyverse)
df_aim <- data.frame(time = c(1:10),
                     var1 = c(0, 0, 1, 1, 1, 1, 0, 0, 1, 0),
                     seq = c("start", "end", "start", "mid", "mid", "end", "start", "end", "solo", "solo"))

df_start <- select(df_aim,
                   time,
                   var1)

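# lag()/lead() return the previous/next value of var1; they are NA on the
# first/last row, so a missing neighbour is treated as "different".
(df_fin <- df_start  %>% mutate(seq = case_when(
  (is.na(lag(var1)) | lag(var1) != var1) & lead(var1) == var1 ~ "start",
  lag(var1) == var1 & lead(var1) == var1 ~ "mid",
  lag(var1) == var1 & (is.na(lead(var1)) | lead(var1) != var1) ~ "end",
  # is.na(lag(var1)) also covers a one-row stretch on the very first row
  (is.na(lag(var1)) | lag(var1) != var1) & (is.na(lead(var1)) | lead(var1) != var1) ~ "solo"
)
))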

identical(df_aim,df_fin)
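
A possible alternative, offered only as a sketch: instead of comparing neighbours, number each stretch and look at a row's position inside it. This assumes var1 has no missing values. The names run_id, pos, len, df_runs and stretches are made up for illustration and do not come from the thread. The last step also summarises the 'time' at which each stretch starts and ends, which is what the original question asks for.

```r
library(dplyr)

df_start <- data.frame(time = 1:10,
                       var1 = c(0, 0, 1, 1, 1, 1, 0, 0, 1, 0))

# run_id (hypothetical name) goes up by 1 every time var1 changes value
df_ids <- df_start %>%
  mutate(run_id = cumsum(var1 != lag(var1, default = first(var1))))

# Label each row by its position within its stretch
df_runs <- df_ids %>%
  group_by(run_id) %>%
  mutate(pos = row_number(),   # position inside the stretch
         len = n(),            # length of the stretch
         seq = case_when(
           len == 1   ~ "solo",
           pos == 1   ~ "start",
           pos == len ~ "end",
           TRUE       ~ "mid"
         )) %>%
  ungroup() %>%
  select(time, var1, seq)

# One row per stretch, with the time where it starts and ends
stretches <- df_ids %>%
  group_by(run_id) %>%
  summarise(value = first(var1),
            start = min(time),
            end   = max(time))
```

Note that df_runs comes back as a tibble because of group_by(), so identical(df_aim, df_runs) would not be TRUE even though the seq values match; compare the columns instead, or wrap the result in as.data.frame().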

FJCC · October 21, 2022, 2:43pm

You can use the lag() and lead() functions to make the values of the preceding and following rows available in each row. You can then write the conditions of your case_when() against those columns, row by row.

library(dplyr)
df_start <- data.frame(time = c(1:10),
                       var1 = c(0, 0, 1, 1, 1, 1, 0, 0, 1, 0))
df_start <- df_start |> mutate(LagVal = lag(var1),
                               LeadVal = lead(var1))
df_start
#>    time var1 LagVal LeadVal
#> 1     1    0     NA       0
#> 2     2    0      0       1
#> 3     3    1      0       1
#> 4     4    1      1       1
#> 5     5    1      1       1
#> 6     6    1      1       0
#> 7     7    0      1       0
#> 8     8    0      0       1
#> 9     9    1      0       0
#> 10   10    0      1      NA

Created on 2022-10-21 with reprex v2.0.2
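
One way the LagVal and LeadVal columns could feed the case_when() is sketched below. It assumes var1 itself has no missing values, and same_above, same_below and df_seq are hypothetical helper names, not part of the reply above. A missing neighbour (the NA on the first and last rows) is treated as "different", so the edges of the data come out as "start", "end" or "solo" rather than NA.

```r
library(dplyr)

df_start <- data.frame(time = 1:10,
                       var1 = c(0, 0, 1, 1, 1, 1, 0, 0, 1, 0))

df_seq <- df_start |>
  mutate(LagVal  = lag(var1),
         LeadVal = lead(var1),
         # FALSE when the neighbour is missing or holds a different value
         same_above = !is.na(LagVal)  & LagVal  == var1,
         same_below = !is.na(LeadVal) & LeadVal == var1,
         seq = case_when(
           !same_above &  same_below ~ "start",
            same_above &  same_below ~ "mid",
            same_above & !same_below ~ "end",
           !same_above & !same_below ~ "solo"
         )) |>
  select(time, var1, seq)
```

Pulling the NA handling into the two helper columns keeps the four case_when() conditions in the same shape as the pseudocode in the question.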
