Find similar rows and combine them by values of certain variables

Hi!

I have a dataset with a lot of similar observations. The similarity comes from the fact that observations are split up by time period.

Here is an example dataframe:

df <- data.frame(Name = c("Peter", "Peter", "Peter", "Peter", "Carol", "Carol"),
                 Group = c("A", "A", "B", "B", "A", "A"),
                 Committee = c("com1", "com1", "com1", "com1", "com1", "com1"),
                 StartDate_committee = c("2010", "2015", "2010", "2015", "2010", "2016"),
                 EndDate_committee = c("2014", "2016", "2014", "2016", "2013", "2019")
                 )
``
   Name Group Committee StartDate_committee EndDate_committee
1 Peter     A      com1                2010              2014
2 Peter     A      com1                2015              2016
3 Peter     B      com1                2010              2014
4 Peter     B      com1                2015              2016
5 Carol     A      com1                2010              2013
6 Carol     A      com1                2016              2019

Peter stays in com1 from 2010 until 2019. Instead of 4 observations, i want only 2, one for each Group Peter is in.

For Carol, the split into 2 observations makes sense. She enters Com1 in 2010, leaves it in 2013 before joining again from 2016-2019. But because she has two time periods in com1, i would like to add two columns and combine the two observations into one as well.

The final dataframe should look like that:

df_new <- data.frame(Name = c("Peter", "Peter", "Carol"),
                     Group = c("A", "B", "A"),
                     Committee = c("com1", "com1", "com1"),
                     StartDate_committee1 = c("2010", "2010", "2010"),
                     EndDate_committee1 = c("2016", "2016", "2013"),
                     StartDate_committee2 = c("NA", "NA", "2016"),
                     EndDate_committee2 = c("NA", "NA", "2019"))

   Name Group Committee StartDate_committee1 EndDate_committee1 StartDate_committee2 EndDate_committee2
1 Peter     A      com1                 2010               2016                   NA                 NA
2 Peter     B      com1                 2010               2016                   NA                 NA
3 Carol     A      com1                 2010               2013                 2016               2019

I am aware that these are basically two separate problems. The main issue for me is Peter's problem, so if anyone could give me a solution for that I would be extremely grateful. Solving Carol's problem would be a bonus.

Thank you.

Note that I converted the example dates into integers that can represent years usefully.

df <- data.frame(
  Name = c("Peter", "Peter", "Peter", "Peter", "Carol", "Carol"),
  Group = c("A", "A", "B", "B", "A", "A"),
  Committee = c("com1", "com1", "com1", "com1", "com1", "com1"),
  StartDate_committee = c(2010L, 2015L, 2010L, 2015L, 2010L, 2016L),
  EndDate_committee = c(2014L, 2016L, 2014L, 2016L, 2013L, 2019L)
)

library(tidyverse)

df2 <- df %>%
  rowwise() %>%
  mutate(betweendates = list(seq.int(
    from =StartDate_committee),
    to =EndDate_committee),
    by = 1
  )))
(df3 <- unnest(df2,
  cols = betweendates
) %>% arrange(Name, Group, Committee,StartDate_committee) %>% group_by(Name, Group, Committee) %>% 
    mutate(continued = case_when(
  row_number()==1 ~ 0,
  lag(betweendates) == betweendates - 1 ~ 0,
  TRUE ~ 1
),
counter = 1+cumsum(continued)))

(df4 <- group_by(df3,
                 Name, Group, Committee,counter) %>% summarise(st_dt = min(betweendates),
                                                                 e_dt = max(betweendates),
                                                               .groups="drop"))

(df5 <- pivot_wider(df4,
                    id_cols=c("Name","Group","Committee"),
                    names_from = "counter",
                    values_from=c("st_dt","e_dt")))
1 Like

Thank you a million times, it worked like a charm.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.