conditional manipulating of reoccuring variables

I have the following, simplified data frame.

obj <- data.frame (id = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2),
Date = c("1990-01", "1990-02", "1990-03", "1990-04", "1990-01", "1990-02", "1990-03", "1990-04", "1991-01", "1991-02", "1991-03", "1991-04", "1991-01", "1991-02", "1991-03", "1991-04"),
                   degree = c(1, 1, 1, 2, 1, 1, 0, 0, 1, 2, 2, 2, 1, 1, 2, 1)
)

Note that my real ID's are named something like 2.01811e+13

What I would like to do is the following:

  • I would like to create a new column which should be 1 for a date where degree turns to 2 and the previous observation is 1 for the same ID.
  • For every degree = 1 the new column should be zero.
  • For any degree = 0 the new column should take NA.
  • For any other degree = 2 (so if the previous degree of the same id observation was not 1) the new column should be NA as well.

Note that consecutive id observations might overlap across years in the real sample, e.g. an ID ranging from 1990-11, 1990-12, 1991-01, 1991-02.

it should look like this,

id date    new_col 
1   1990-01   0
1   1990-02   0 
1   1990-03   0
1   1990-04   1
2   1990-01   0
2   1990-02   0
2   1990-03   NA
2   1990-04   NA
1   1991-01   0
1   1991-02   1
1   1991-03   NA
1   1991-04   NA
2   1991-01   0
2   1991-02   0
2   1991-03   1
2   1991-04   0

Many thanks in advance!

suppressPackageStartupMessages({
  library(dplyr)
})

obj <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2),
  Date = c("1990-01", "1990-02", "1990-03", "1990-04", "1990-01", "1990-02", "1990-03", "1990-04", "1991-01", "1991-02", "1991-03", "1991-04", "1991-01", "1991-02", "1991-03", "1991-04"),
  degree = c(1, 1, 1, 2, 1, 1, 0, 0, 1, 2, 2, 2, 1, 1, 2, 1)
)


obj %>% mutate(new_col = ifelse(degree == 1,0,NA),
               new_col = ifelse(degree == 2 & 
                                lag(degree,1) == 1,1,new_col))
#>    id    Date degree new_col
#> 1   1 1990-01      1       0
#> 2   1 1990-02      1       0
#> 3   1 1990-03      1       0
#> 4   1 1990-04      2       1
#> 5   2 1990-01      1       0
#> 6   2 1990-02      1       0
#> 7   2 1990-03      0      NA
#> 8   2 1990-04      0      NA
#> 9   1 1991-01      1       0
#> 10  1 1991-02      2       1
#> 11  1 1991-03      2      NA
#> 12  1 1991-04      2      NA
#> 13  2 1991-01      1       0
#> 14  2 1991-02      1       0
#> 15  2 1991-03      2       1
#> 16  2 1991-04      1       0
1 Like

Many thanks for your answer! I had already a similar code. Unfortunately this code assigns 1 to new_col in any case when degree =2 and previous degree = 1 and assigns these values, therefore, independent of the id variable. As my real sample is large and the probability that for the last observation of a given id the degree = 1 and the following observation, e.g. a first observation of another id, the variable degree is = 2, new_col would turn to 1. This should be not the case, as new_col should be just 1 if degree is 2 and the previous degree is 0 FOR THE SAME ID.
Many thanks in advance!

See the FAQ: How to do a minimal reproducible example reprex for beginners. Without truly representative data, an answer should not be expected to cover any cases other those given.

 obj <- data.frame (id = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2),
    Date = c("1990-01", "1990-02", "1990-03", "1990-04", "1990-01", "1990-02", "1990-03", "1990-04", "1991-01", "1991-02", "1991-03", "1991-04", "1991-01", "1991-02", "1991-03", "1991-04"),
                       degree = c(1, 1, 1, 2, 1, 1, 0, 1, 2, 2, 2, 2, 1, 1, 2, 2)
    )

with the corresponding desired output.

   id date    degree new_col 
    1   1990-01   1     0
    1   1990-02   1     0 
    1   1990-03   1     0
    1   1990-04   2     1
    2   1990-01   1     0
    2   1990-02   1     0
    2   1990-03   0     NA
    2   1990-04   1     0
    1   1991-01   2     NA
    1   1991-02   2     NA
    1   1991-03   2     NA
    1   1991-04   2     NA
    2   1991-01   1     0
    2   1991-02   1     0
    2   1991-03   2     1
    2   1991-04   2     0

this reprex covers now the conditions dependent on the id

and

are inconsistent. I get

suppressPackageStartupMessages({
  library(dplyr)
})
obj <- data.frame (id = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2),
  Date = c("1990-01", "1990-02", "1990-03", "1990-04", "1990-01", "1990-02", "1990-03", "1990-04", "1991-01", "1991-02", "1991-03", "1991-04", "1991-01", "1991-02", "1991-03", "1991-04"),
  degree = c(1, 1, 1, 2, 1, 1, 0, 1, 2, 2, 2, 2, 1, 1, 2, 2)
)

obj %>% mutate(new_col = ifelse(degree == 1,0,NA),
               new_col = ifelse(degree == 2 & 
                                  lag(degree,1) == 1 &
                                  lag(id,1) == id,1,new_col))
#>    id    Date degree new_col
#> 1   1 1990-01      1       0
#> 2   1 1990-02      1       0
#> 3   1 1990-03      1       0
#> 4   1 1990-04      2       1
#> 5   2 1990-01      1       0
#> 6   2 1990-02      1       0
#> 7   2 1990-03      0      NA
#> 8   2 1990-04      1       0
#> 9   1 1991-01      2      NA
#> 10  1 1991-02      2      NA
#> 11  1 1991-03      2      NA
#> 12  1 1991-04      2      NA
#> 13  2 1991-01      1       0
#> 14  2 1991-02      1       0
#> 15  2 1991-03      2       1
#> 16  2 1991-04      2      NA

The last new_col entry is NA, rather than zero because the id and degree values are identical for the last two rows.

1 Like