How to merge (by rows) more than two datasets with different columns

eugenio.alladio · March 6, 2020, 8:05am

I've prepared the following simple Shiny app aiming to import multiple datasets with that contain different columns. I'd like to merge them using a loop since I could upload more than two datasets and I'd like to discard the unmatching columns.
The data I'd like to upload, as an example, are the following:

data_A <- data.frame(
      Sample = as.factor(c("x_1", "x_2", "x_3")),
      Pop = as.factor(c("FF", "FF", "FF")),
         aaa = c(31L, 27L, 17L),
       aaa.1 = c(3L, 7L, 19L),
         bbb = c(10L, 31L, 37L),
       bbb.1 = c(23L, 1L, 13L),
         ccc = c(20L, 18L, 37L),
       ccc.1 = c(9L, 12L, 10L),
         ddd = c(3L, 6L, 28L),
       ddd.1 = c(32L, 27L, 27L),
         eee = c(34L, 33L, 3L),
       eee.1 = c(38L, 13L, 17L),
         fff = c(21L, 26L, 40L),
       fff.1 = c(38L, 4L, 28L)
)

data_B <- data.frame(
      Sample = as.factor(c("x_4", "X_5", "X_6")),
      Pop = as.factor(c("GG", "GG", "GG")),
         aaa = c(21L, 24L, 24L),
       aaa.1 = c(9L, 1L, 36L),
         ccc = c(3L, 7L, 30L),
       ccc.1 = c(37L, 13L, 23L),
         fff = c(6L, 6L, 3L),
       fff.1 = c(11L, 3L, 30L)
)

data_C <- data.frame(
        Sample = as.factor(c("x_7", "x_8", "x_9")),
        Pop = as.factor(c("ZZ", "ZZ", "ZZ")),
         aaa = c(31L, 27L, 17L),
       aaa.1 = c(3L, 7L, 19L),
         bbb = c(10L, 31L, 37L),
       bbb.1 = c(23L, 1L, 13L),
         ccc = c(20L, 18L, 37L),
       ccc.1 = c(9L, 12L, 10L),
         fff = c(21L, 26L, 40L),
       fff.1 = c(38L, 4L, 28L)
)

and I'd like to obtain the following final dataset:

data_def <- data.frame(
        Sample = as.factor(c("x_1", "x_2", "x_3", "x_4", "X_5", "X_6", "x_7","x_8", "x_9")),
      Pop = as.factor(c("FF", "FF", "FF", "GG", "GG", "GG", "ZZ", "ZZ","ZZ")),
         aaa = c(31L, 27L, 17L, 21L, 24L, 24L, 31L, 27L, 17L),
       aaa.1 = c(3L, 7L, 19L, 9L, 1L, 36L, 3L, 7L, 19L),
         ccc = c(20L, 18L, 37L, 3L, 7L, 30L, 20L, 18L, 37L),
       ccc.1 = c(9L, 12L, 10L, 37L, 13L, 23L, 9L, 12L, 10L),
         fff = c(21L, 26L, 40L, 6L, 6L, 3L, 21L, 26L, 40L),
       fff.1 = c(38L, 4L, 28L, 11L, 3L, 30L, 38L, 4L, 28L)
)

So, I prepared the following function and I tried to use intersect function, but I noticed it works only in case of two datasets, so I'd like to use it in a loop since I'd need to merge event more than two datasets. Here I report an example with 3 datasets, but I'd like to extend it to a N number of uploaded datasets:

datasets <- list(data_A,data_B,data_C)
numfiles = length(datasets)
mydata = datasets
variables = list()
for (i in 1:numfiles)
{
  variables[[i]] = c(colnames(mydata[[i]]))
}
InAll = lapply(variables, FUN = intersect)
do.call(rbind, mydata[,inAll])

I tried different solutions but I could not achieve my goal. I'd be very grateful if anybody could help me with this matter!

nirgrahamuk · March 6, 2020, 8:16am

Hello. I understand you ultimately want to add this functionality to a shiny app. But there is a lot of 'overhead' in considering your code example with uneccesary shiny parts that are not a fundamental part of your issue.

I.e. if you can transform the datasets in a script then you can also in a shiny app.

Please consider revising your example so that it's more minimal. You are more likely to get useful help this way

eugenio.alladio · March 6, 2020, 8:31am

Dear @nirgrahamuk, thank you very much for your advice! I edited the post, I hope now it is more minimal and clear!

nirgrahamuk · March 6, 2020, 9:06am

Thanks
Here is a possible solution for you to consider. Hope it helps.
(it assumes your data_* from above)

library(tidyverse)
mydata <- list(data_A,data_B,data_C)
variables <- map(mydata,names)

##collates all the common names
InAll <-Reduce(f = intersect,x = variables)
#reform the data list to dataframes with only the column names above
my_selected_data <- map(mydata,~select(.,all_of(InAll)))
#stack them all vertically
final <-Reduce(f=union_all,x = my_selected_data)
#the target has strings as factors, so remake this
final_with_fact <- mutate_if(final,
                             is.character,as_factor)
#test achieved target
all.equal(data_def,final_with_fact)
# [1] TRUE

note: the above uses base::Reduce, but purrr::reduce is also available in tidyverse. That can use the pipes.


library(tidyverse)
mydata <- list(data_A,data_B,data_C)
variables <- map(mydata,names)

##collates all the common names
InAll <- variables %>% reduce(intersect)
#reform the data list to column with only the column names
my_selected_data <- map(mydata,~select(.,all_of(InAll)))
#stack them all vertically
final <- my_selected_data %>% reduce(union_all)
#the target has strings as factors, so remake this
final_with_fact <- mutate_if(final,
                             is.character,as_factor)
#test achieved target
all.equal(data_def,final_with_fact)
# [1] TRUE

eugenio.alladio · March 6, 2020, 9:17am

Dear @nirgrahamuk, thank you for you help and assistance.
I tested the code you prepared but I obtained the following error:

Error in all_of(InAll) : could not find function "all_of"

I obtained it even writing library(tidyselect), could you fix this issues, too?

nirgrahamuk · March 6, 2020, 9:18am

I have version 1.0.0, which do you have ?

packageVersion("tidyselect")
[1] ‘1.0.0’

p.s. the all_of() function can be removed and the code should work, it just removes a tidyverse warning about ambiguity... but I think if you install the latest version, you will have this function

eugenio.alladio · March 6, 2020, 9:38am

Dear @nirgrahamuk,
now it works, I updated the tidyselect version of the package I had.
Thank you very much for your constant help and sorry for my continue silly questions!

nirgrahamuk · March 6, 2020, 9:39am

I'm here to become a better R programmer, and your question was interesting to solve.
Thanks for asking it

system · March 13, 2020, 9:39am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.