Help with concatenating row values if they have the same identity value.

auhelp · February 11, 2021, 12:41pm

Hi all,

I've been sitting with this problem for a few hours now and can't seem to come up with a good solution - I hope you can help me!

The data are structured like so:

Speaker StartTimeMS EndTimeMS
1 John 0 2009
2 Mother 4203 6316
3 John 6687 7221
4 Mother 6676 8092
5 Mother 9309 16623
6 John 17036 18220
7 Mother 17930 18940
8 John 19399 20026
9 Mother 20020 21192
10 Mother 22808 23458

What I want to do is combine the StartTimeMS from one row and the EndTimeMS from another row, if two of the same Speakers appear after each other; that is, the goal is to combine the time data in the 'Mother' cells appearing immediately after each other. The goal is the following:

Speaker StartTimeMS EndTimeMS
1 John 0 2009
2 Mother 4203 6316
3 John 6687 7221
4 Mother 6676 16623
5 John 17036 18220
7 Mother 17930 18940
8 John 19399 20026
9 Mother 20020 23458

I've been trying with the following code, but I can't seem to find the final step of combining the data from other columns:

data <- data %>%
mutate(identity = lag(Speaker, 1),
decide = if_else(Speaker != identity, 1, 0),
keep = lead(decide, 1, default = 1)) %>%
filter(keep == 1) %>%
select(-identity, -decide, -keep)

I'd really appreciate your help on this!!

Here is some code for a subset of the data:
Speaker <- c("John", "Mother", "John", "Mother", "Mother", "John", "Mother", "John", "Mother", "Mother")
StartTimeMS <- c(0, 4203, 6687, 6676, 9309, 17036, 17930, 19399, 20020, 22808)
EndTimeMS <- c(2009, 6316, 7221, 8092, 16623, 18220, 18940, 20026, 21192, 23458)
data <- data.frame(Speaker, StartTimeMS,EndTimeMS)

lars · February 11, 2021, 5:36pm

Hi,

Is this what you're looking for? Rows 4+5 and 9+10 have been combined.

library(tidyverse)

data <- tibble(speaker = c("John", "Mother", "John", "Mother", "Mother", "John", "Mother", "John", "Mother", "Mother"),
               start_time = c(0, 4203, 6687, 6676, 9309, 17036, 17930, 19399, 20020, 22808),
               end_time = c(2009, 6316, 7221, 8092, 16623, 18220, 18940, 20026, 21192, 23458),
               )

data
#> # A tibble: 10 x 3
#>    speaker start_time end_time
#>    <chr>        <dbl>    <dbl>
#>  1 John             0     2009
#>  2 Mother        4203     6316
#>  3 John          6687     7221
#>  4 Mother        6676     8092
#>  5 Mother        9309    16623
#>  6 John         17036    18220
#>  7 Mother       17930    18940
#>  8 John         19399    20026
#>  9 Mother       20020    21192
#> 10 Mother       22808    23458

data %>%
  ## add temporary variables
  mutate(identity = lag(speaker),
         keep_start = case_when(speaker != identity ~ 1, 
                                is.na(identity) ~ 1,
                                TRUE ~ 0),
         keep_end = lead(keep_start, 1, default = 1)) %>%
  ## remove start or end time when consecutive talks are done by same speaker 
  mutate(start_time = ifelse(keep_start == 1, start_time, NA),
         end_time = ifelse(keep_end == 1, end_time, NA)
         ) %>% 
  ## copy start_time to next talk
  fill(start_time) %>% 
  ## remove previous talk
  filter(!is.na(end_time)) %>%
  ## remove the temporary variables
  select(-(identity:keep_end))
#> # A tibble: 8 x 3
#>   speaker start_time end_time
#>   <chr>        <dbl>    <dbl>
#> 1 John             0     2009
#> 2 Mother        4203     6316
#> 3 John          6687     7221
#> 4 Mother        6676    16623
#> 5 John         17036    18220
#> 6 Mother       17930    18940
#> 7 John         19399    20026
#> 8 Mother       20020    23458

^{Created on 2021-02-11 by the reprex package (v1.0.0)}

I have not tested this for cases when a speaker has 3 consecutive talks, but I gladly leave that up to you

auhelp · February 12, 2021, 9:42am

Hi Lars,

Thanks a million for this - this works on the above subset of data!

In the larger dataset, however, I have up to 60 successive 'Speakers' occurring after each other. I'll try to adapt the code to work for these cases too!

system · March 5, 2021, 9:42am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.