Combine rows based on certain columns while concatenating cells from other columns

Hello all,

I am trying to combine rows based on certain columns while concatenating the cells from other columns.

My original data frame looks like this:

ID <- c(rep('1',2),rep('2',2))
Time <- c(rep('11/17/19',2), rep('11/18/19',2))
User <- c(rep('sys',2), rep('man',2))
Name <- c('start','finish','start','finish')
oldval <- c('11/18/19','11/19/19','11/20/19','11/21/19')
newval <- c('11/22/19','11/23/19','11/24/19','11/25/19')

d <- data.frame(ID, Time, User, Name, oldval, newval)
d

I want to combine the values in oldval and newval into one cell based on ID, Time, User, and Name.

So my end result should look like this:

ID_1 <- c('1','2')
Time_1 <- c('11/17/19','11/18/19')
User_1 <- c('sys','man')
Name_1 <- c('start, finish','start, finish')
oldval_1 <- c('11/18/19, 11/19/19','11/20/19, 11/21/19')
newval_1 <- c('11/22/19, 11/23/19','11/24/19, 11/25/19')

d_1 <- data.frame(ID_1, Time_1, User_1, Name_1, oldval_1, newval_1)

d
d_1

You can do it like this:

library(dplyr)

d %>% 
    group_by(ID, Time, User) %>% 
    summarise_all(~paste(first(.), last(.), sep = ", "))
#> # A tibble: 2 x 6
#> # Groups:   ID, Time [2]
#>   ID    Time     User  Name          oldval             newval            
#>   <fct> <fct>    <fct> <chr>         <chr>              <chr>             
#> 1 1     11/17/19 sys   start, finish 11/18/19, 11/19/19 11/22/19, 11/23/19
#> 2 2     11/18/19 man   start, finish 11/20/19, 11/21/19 11/24/19, 11/25/19

Created on 2020-01-08 by the reprex package (v0.3.0.9000)

Thanks @andresrcs, this works great! When I try to run this on my original data set of 1.2 million records, though, it takes forever (more than half an hour so far). I am wondering if you have any suggestions for that. Otherwise, thank you very much!

You could use dtplyr or data.table directly, or, if your data source is a database, you could perform the calculations in the database with dbplyr.
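
For example, here is a minimal dtplyr sketch, assuming the same d as above; lazy_dt() queues the verbs and translates them to data.table code before anything runs:

library(dplyr)
library(dtplyr)

d %>% 
    lazy_dt() %>%                 # translate the following verbs to data.table
    group_by(ID, Time, User) %>% 
    summarise(Name   = paste(Name,   collapse = ", "),
              oldval = paste(oldval, collapse = ", "),
              newval = paste(newval, collapse = ", ")) %>% 
    as_tibble()                   # force evaluation and collect the result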

I've no idea whether this will be quicker or not (I suspect not), but tweaking @andresrcs's answer might help:

d %>% 
  group_by(ID, Time, User) %>% 
  summarise_all(~paste(., collapse = ", ")) %>% 
  ungroup()

The sep approach worked because first(.) and last(.) each return a single value, so paste() joins exactly those two; for such cases I prefer collapse, which joins the whole vector however many rows the group has. That way all values are collapsed, not just the first and last. Finally, the ungroup() tends to be a good idea: after summarise() the result is otherwise still grouped by ID and Time, which would probably not be desired if you extend the analysis and may make later steps slow.
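
To see the difference in isolation (a standalone snippet, not from the thread):

library(dplyr)  # for first() and last()

x <- c('start', 'finish')
paste(first(x), last(x), sep = ", ")  # "start, finish" -- joins exactly two values
paste(x, collapse = ", ")             # "start, finish" -- joins all values, any length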

Alternatively, try this version:

library(data.table)

dt <- data.table(d)

# For each (ID, Time, User) group, collapse every column in .SDcols into one string
dt[, lapply(.SD, paste, collapse = ", "),
   keyby = .(ID, Time, User),
   .SDcols = c("Name", "oldval", "newval")]
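
If you want to check which version is fastest before committing, here is a rough timing sketch; the synthetic table below is an assumption standing in for your real 1.2 million rows:

library(dplyr)
library(data.table)

# Synthetic stand-in: 1.2 million rows, two rows per group
n <- 600000L
big <- data.frame(
    ID     = rep(seq_len(n), each = 2),
    Time   = '11/17/19',
    User   = 'sys',
    Name   = rep(c('start', 'finish'), n),
    oldval = '11/18/19',
    newval = '11/22/19',
    stringsAsFactors = FALSE
)

system.time(
    big %>% 
        group_by(ID, Time, User) %>% 
        summarise_all(~paste(., collapse = ", ")) %>% 
        ungroup()
)

bigdt <- as.data.table(big)
system.time(
    bigdt[, lapply(.SD, paste, collapse = ", "), keyby = .(ID, Time, User)]
)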
