conditional manipulating of reoccuring variables

freddywit · May 3, 2021, 8:11pm

I have the following, simplified data frame.

obj <- data.frame (id = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2),
Date = c("1990-01", "1990-02", "1990-03", "1990-04", "1990-01", "1990-02", "1990-03", "1990-04", "1991-01", "1991-02", "1991-03", "1991-04", "1991-01", "1991-02", "1991-03", "1991-04"),
                   degree = c(1, 1, 1, 2, 1, 1, 0, 0, 1, 2, 2, 2, 1, 1, 2, 1)
)

Note that my real ID's are named something like 2.01811e+13

What I would like to do is the following:

I would like to create a new column which should be 1 for a date where degree turns to 2 and the previous observation is 1 for the same ID.
For every degree = 1 the new column should be zero.
For any degree = 0 the new column should take NA.
For any other degree = 2 (so if the previous degree of the same id observation was not 1) the new column should be NA as well.

Note that consecutive id observations might overlap across years in the real sample, e.g. an ID ranging from 1990-11, 1990-12, 1991-01, 1991-02.

it should look like this,

id date    new_col 
1   1990-01   0
1   1990-02   0 
1   1990-03   0
1   1990-04   1
2   1990-01   0
2   1990-02   0
2   1990-03   NA
2   1990-04   NA
1   1991-01   0
1   1991-02   1
1   1991-03   NA
1   1991-04   NA
2   1991-01   0
2   1991-02   0
2   1991-03   1
2   1991-04   0

Many thanks in advance!

technocrat · May 3, 2021, 10:25pm

suppressPackageStartupMessages({
  library(dplyr)
})

obj <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2),
  Date = c("1990-01", "1990-02", "1990-03", "1990-04", "1990-01", "1990-02", "1990-03", "1990-04", "1991-01", "1991-02", "1991-03", "1991-04", "1991-01", "1991-02", "1991-03", "1991-04"),
  degree = c(1, 1, 1, 2, 1, 1, 0, 0, 1, 2, 2, 2, 1, 1, 2, 1)
)


obj %>% mutate(new_col = ifelse(degree == 1,0,NA),
               new_col = ifelse(degree == 2 & 
                                lag(degree,1) == 1,1,new_col))
#>    id    Date degree new_col
#> 1   1 1990-01      1       0
#> 2   1 1990-02      1       0
#> 3   1 1990-03      1       0
#> 4   1 1990-04      2       1
#> 5   2 1990-01      1       0
#> 6   2 1990-02      1       0
#> 7   2 1990-03      0      NA
#> 8   2 1990-04      0      NA
#> 9   1 1991-01      1       0
#> 10  1 1991-02      2       1
#> 11  1 1991-03      2      NA
#> 12  1 1991-04      2      NA
#> 13  2 1991-01      1       0
#> 14  2 1991-02      1       0
#> 15  2 1991-03      2       1
#> 16  2 1991-04      1       0

freddywit · May 3, 2021, 10:43pm

Many thanks for your answer! I had already a similar code. Unfortunately this code assigns 1 to new_col in any case when degree =2 and previous degree = 1 and assigns these values, therefore, independent of the id variable. As my real sample is large and the probability that for the last observation of a given id the degree = 1 and the following observation, e.g. a first observation of another id, the variable degree is = 2, new_col would turn to 1. This should be not the case, as new_col should be just 1 if degree is 2 and the previous degree is 0 FOR THE SAME ID.
Many thanks in advance!

technocrat · May 3, 2021, 10:56pm

See the FAQ: How to do a minimal reproducible example reprex for beginners. Without truly representative data, an answer should not be expected to cover any cases other those given.

freddywit · May 4, 2021, 12:47pm

 obj <- data.frame (id = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2),
    Date = c("1990-01", "1990-02", "1990-03", "1990-04", "1990-01", "1990-02", "1990-03", "1990-04", "1991-01", "1991-02", "1991-03", "1991-04", "1991-01", "1991-02", "1991-03", "1991-04"),
                       degree = c(1, 1, 1, 2, 1, 1, 0, 1, 2, 2, 2, 2, 1, 1, 2, 2)
    )

with the corresponding desired output.

   id date    degree new_col 
    1   1990-01   1     0
    1   1990-02   1     0 
    1   1990-03   1     0
    1   1990-04   2     1
    2   1990-01   1     0
    2   1990-02   1     0
    2   1990-03   0     NA
    2   1990-04   1     0
    1   1991-01   2     NA
    1   1991-02   2     NA
    1   1991-03   2     NA
    1   1991-04   2     NA
    2   1991-01   1     0
    2   1991-02   1     0
    2   1991-03   2     1
    2   1991-04   2     0

this reprex covers now the conditions dependent on the id

technocrat · May 4, 2021, 11:37pm

and

are inconsistent. I get

suppressPackageStartupMessages({
  library(dplyr)
})
obj <- data.frame (id = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2),
  Date = c("1990-01", "1990-02", "1990-03", "1990-04", "1990-01", "1990-02", "1990-03", "1990-04", "1991-01", "1991-02", "1991-03", "1991-04", "1991-01", "1991-02", "1991-03", "1991-04"),
  degree = c(1, 1, 1, 2, 1, 1, 0, 1, 2, 2, 2, 2, 1, 1, 2, 2)
)

obj %>% mutate(new_col = ifelse(degree == 1,0,NA),
               new_col = ifelse(degree == 2 & 
                                  lag(degree,1) == 1 &
                                  lag(id,1) == id,1,new_col))
#>    id    Date degree new_col
#> 1   1 1990-01      1       0
#> 2   1 1990-02      1       0
#> 3   1 1990-03      1       0
#> 4   1 1990-04      2       1
#> 5   2 1990-01      1       0
#> 6   2 1990-02      1       0
#> 7   2 1990-03      0      NA
#> 8   2 1990-04      1       0
#> 9   1 1991-01      2      NA
#> 10  1 1991-02      2      NA
#> 11  1 1991-03      2      NA
#> 12  1 1991-04      2      NA
#> 13  2 1991-01      1       0
#> 14  2 1991-02      1       0
#> 15  2 1991-03      2       1
#> 16  2 1991-04      2      NA

The last new_col entry is NA, rather than zero because the id and degree values are identical for the last two rows.

system · May 11, 2021, 11:38pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.