filter with nested condition

I need to combine overlapping years, if there is no more than 1 year gap between periods. If there is two or more years gap I need to keep a separate records for this periods. Example data look like that.

client<- c(1,1,1,1,1,2,2,3,3) 
year_from<-c(1993, 1998, 2000,2001, 2008,2007,2007,2006,2020)
year_to<-c(1995, 2020,2005, 2021, 2010, 2010, 2014,2017, 2020)
data<- tibble(client,year_from, year_to)  

A tibble: 9 x 3
  client year_from year_to
   <dbl>     <dbl>   <dbl>
1      1      1993    1995
2      1      1998    2020
3      1      2000    2005
4      1      2001    2021
5      1      2008    2010
6      2      2007    2010
7      2      2007    2014
8      3      2006    2017
9      3      2020    2020

What I need to get this output:

A tibble: 4 x 3
 client year_from year_to
   <dbl>     <dbl>   <dbl>
1      1      1993    1995
2      1      1998    2021
3      2      2007    2014
4      3      2006    2020

What I tried is to create a "dif "column as below.

data %>% group_by(client) %>%  mutate(dif=(year_from)-lag(year_to))

   A tibble: 11 x 4
   # Groups:   client [3]
   client year_from year_to   dif
    <dbl>     <dbl>   <dbl> <dbl>
 1      1      1993    1995    NA
 2      1      1998    2020     3
 3      1      2000    2005   -20
 4      1      2008    2021     3
 5      1      2008    2010   -13
 6      2      2007    2010    NA
 7      2      2007    2014    -3
 8      2      2007    2020    -7
 9      3      2006    2017    NA
10      3      2020    2020     3
11      3      2023    2025     3

After that I have to write a condition that:

  • if dif>=2 and year_from corresponding to that row is NOT BETWEEN (year_from , year_to) of any other rows for the same client then keep that record separately
  • otherwise collapse rows with min of "year_from" and max of "year_to."

I would appreciate any help. Thanks.

Hi NargizB ,

The following is not completely fulfilling your requirements, but I created the following approach so that you can see step-by-step how the result is achieved. Hopefully this brings you forward by tweaking the code such that it completely satisfies your requirements.

library(tidyverse)

df <- tibble(client = c(1, 1, 1, 1, 1, 2, 2, 3, 3),
             year_from = c(1993, 1998, 2000,2001, 2008, 2007, 2007, 2006, 2020),
             year_to = c(1995, 2020, 2005, 2021, 2010, 2010, 2014, 2017, 2020),
             )
df
#> # A tibble: 9 x 3
#>   client year_from year_to
#>    <dbl>     <dbl>   <dbl>
#> 1      1      1993    1995
#> 2      1      1998    2020
#> 3      1      2000    2005
#> 4      1      2001    2021
#> 5      1      2008    2010
#> 6      2      2007    2010
#> 7      2      2007    2014
#> 8      3      2006    2017
#> 9      3      2020    2020

df %>% 
  ## ensure that data is sorted correctly
  arrange(client, year_from, year_to) %>% 
  group_by(client) %>%
  mutate(## ensure that NA values are replaced with `FALSE`
         y_from_overlapping = if_else(lag(year_from) <= year_from & year_from <= lag(year_to), TRUE, FALSE, missing = FALSE),
         y_to_overlapping   = if_else(lag(year_from) <= year_to   & year_to   <= lag(year_to), TRUE, FALSE, missing = FALSE),
         ## notice also the explicit use of `ifelse()` to bypass the error thrown when using `if_else()`
         period_start = ifelse(y_from_overlapping == TRUE, NA, year_from),
         period_end   = ifelse(y_to_overlapping == TRUE, NA, year_to),
         ) %>% 
  ## replace NA values with values from previous row
  fill(period_start, period_end) %>% 
  ## group per client the different periods
  group_by(client, period_start) %>% 
  ## select the first and last year for a specific period
  mutate(period_min = min(period_start),
         period_max = max(period_end),
         ) %>% 
  ungroup() %>% 
  select(client, year_from = period_min, year_to = period_max) %>% 
  ## remove duplicate rows
  distinct()
#> # A tibble: 5 x 3
#>   client year_from year_to
#>    <dbl>     <dbl>   <dbl>
#> 1      1      1993    1995
#> 2      1      1998    2021
#> 3      2      2007    2014
#> 4      3      2006    2017
#> 5      3      2020    2020

Created on 2021-02-19 by the reprex package (v1.0.0)

HTH

You can change dif to indicate change in period, then use cumsum to code the periods, then group_by and summarise to get the start and end dates:

library(dplyr)

client <- c(1, 1, 1, 1, 1, 2, 2, 3, 3)
year_from <- c(1993, 1998, 2000, 2001, 2008, 2007, 2007, 2006, 2020)
year_to <- c(1995, 2020, 2005, 2021, 2010, 2010, 2014, 2017, 2020)
data <- tibble(client, year_from, year_to)

data %>%
  group_by(client) %>%
  arrange(year_from) %>% 
  mutate(
    dif = year_from - lag(year_to),
    dif = is.na(dif) | dif >= 2, # new period?
    period = cumsum(dif) # period id
  ) %>% 
  group_by(client, period) %>% 
  summarise(
    period_start = min(year_from),
    period_end = max(year_to)
  )
#> # A tibble: 5 x 4
#> # Groups:   client [3]
#>   client period period_start period_end
#>    <dbl>  <int>        <dbl>      <dbl>
#> 1      1      1         1993       1995
#> 2      1      2         1998       2021
#> 3      2      1         2007       2014
#> 4      3      1         2006       2017
#> 5      3      2         2020       2020

Created on 2021-02-19 by the reprex package (v0.3.0)

Thanks for your reply. This seems to be very close to the answer. But issue is that, when the year_from happens to be >=lag(year_to)+2 but period still overlaps with another period, code fails to combine them. I changed year_from in row #4 from 2001 to 2008 to represent that case.

client<- c(1,1,1,1,1, 2,2,2, 3,3,3) 
year_from<-c(1993, 1998, 2000,2008, 2008,2007,2007,2007,2006,2020, 2023)
year_to<-c(1995, 2020,2005, 2021, 2010, 2010, 2014, 2020, 2017, 2020, 2025)
data_3<- tibble(client,year_from, year_to)  

    A tibble: 11 x 3
   client year_from year_to
    <dbl>     <dbl>   <dbl>
 1      1      1993    1995
 2      1      1998    2020
 3      1      2000    2005
 4      1      2008    2021
 5      1      2008    2010
 6      2      2007    2010
 7      2      2007    2014
 8      2      2007    2020
 9      3      2006    2017
10      3      2020    2020
11      3      2023    2025

Please see result below.

A tibble: 7 x 4
# Groups:   client [3]
client period period_start period_end
<dbl>  <int>        <dbl>      <dbl>
1      1      1         1993       1995
2      1      2         1998       2020
3      1      3         2008       2021
4      2      1         2007       2020
5      3      1         2006       2017
6      3      2         2020       2020
7      3      3         2023       2025

Row # 2 and 3 are overlapping years in the output. Thanks.

It probably easiest to use a loop then. Here's a custom function to figure out the periods.

library(dplyr)

client <- c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
year_from <- c(1993, 1998, 2000, 2008, 2008, 2007, 2007, 2007, 2006, 2020, 2023)
year_to <- c(1995, 2020, 2005, 2021, 2010, 2010, 2014, 2020, 2017, 2020, 2025)
data <- tibble(client, year_from, year_to)

get_period <- function(starts, ends){
  periods <- rep(NA, length(starts))
  for (i in seq_along(starts)){
    if (i == 1){
      periods[i] <- 1
      period_start <- starts[1]
      period_end <- ends[1]
    } else {
      if (starts[i] >= period_end + 2){
        periods[i] <- periods[i - 1] + 1 # new period
        period_start <- starts[i]
        period_end <- ends[i]
      } else {
        periods[i] <- periods[i - 1] # same period
        period_start <- min(period_start, starts[i])
        period_end <- max(period_end, ends[i])
      }
    }
  }
  return(periods)
}

data %>%
  group_by(client) %>%
  arrange(client, year_from, year_to) %>%
  mutate(
    period = get_period(year_from, year_to) # period id
  ) %>%
  group_by(client, period) %>%
  summarise(
    period_start = min(year_from),
    period_end = max(year_to)
  )
#> # A tibble: 6 x 4
#> # Groups:   client [3]
#>   client period period_start period_end
#>    <dbl>  <dbl>        <dbl>      <dbl>
#> 1      1      1         1993       1995
#> 2      1      2         1998       2021
#> 3      2      1         2007       2020
#> 4      3      1         2006       2017
#> 5      3      2         2020       2020
#> 6      3      3         2023       2025

Created on 2021-02-20 by the reprex package (v0.3.0)

So sorry for delayed response. That solved my problem perfectly. I really appreciate that. I was stuck on this for a long time, that helped me to move forward. Thanks again.