How to combine two separate dataframes of different lengths

Pioneer82 · February 21, 2020, 7:56pm

I have three data frames, each with a single column and a distinct header, but of different lengths. How do I join them into a single data frame? Example input:

df1> Case 1    df2> Case 2    df3> Case 3
     ABCCD          TCGGS          ZCRTTS
     PCCGT          GCQZZ
     XGGTS

I'd like to combine them into a single table or data-frame, like:

Case 1   Case 2   Case 3
ABCCD    TCGGS    ZCRTTS
PCCGT    GCQZZ
XGGTS

I've tried using merge and join, but I don't get the result above.

technocrat · February 21, 2020, 8:06pm

I think the path of least resistance is to

Remove the headers with

colnames(df.x) <- NULL

Row bind

new_df <- rbind(df1,df2,df3)

Assign colnames

colnames(new_df) <- "Case 1 Case 2 Case 3"

BTW: length() of data frames refers to number of columns; to get the number of rows use nrow(). Tripped me up more than once.
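To illustrate the length() versus nrow() point, here is a minimal sketch on a made-up two-column data frame (the names `df`, `n_cols`, and `n_rows` are only for illustration):

```r
# A small data frame with 2 columns and 3 rows
df <- data.frame(a = 1:3, b = 4:6)

n_cols <- length(df)  # a data frame is a list of columns, so this counts columns
n_rows <- nrow(df)    # this counts rows

n_cols
n_rows
```

Here `length(df)` agrees with `ncol(df)`, not with `nrow(df)`.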

Pioneer82 · February 21, 2020, 8:21pm

Interesting. So the first step is to remove the headers, then combine the df and finally add the column names. I'll give it a shot and let you know. Thanks.

Pioneer82 · February 21, 2020, 8:38pm

Hmm, I'm getting the following error:
Error in if (facCol[jj]) { : missing value where TRUE/FALSE needed
It seems I'm missing a parameter in rbind?

dag.tanneberg · February 21, 2020, 8:50pm

That is hard to say without seeing your code.

Anyhow, have you considered merge's all.x/all.y arguments? dplyr's left_join/right_join offer the same functionality and are explained in detail here: https://r4ds.had.co.nz/relational-data.html.
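As a rough sketch of what the merge() route could look like: the three frames share no key column, so this assumes matching by row position is what is wanted and adds a row-number key first. The `id` column and the `add_id()` helper are hypothetical names introduced here, and the column names case1, case2, case3 stand in for the headers in the question.

```r
df1 <- data.frame(case1 = c("ABCCD", "PCCGT", "XGGTS"), stringsAsFactors = FALSE)
df2 <- data.frame(case2 = c("TCGGS", "GCQZZ"), stringsAsFactors = FALSE)
df3 <- data.frame(case3 = c("ZCRTTS"), stringsAsFactors = FALSE)

# Hypothetical helper: prepend a row-number key so the frames have something to merge on
add_id <- function(df) {
  data.frame(id = seq_len(nrow(df)), df, stringsAsFactors = FALSE)
}

# all.x = TRUE keeps every row of the left-hand frame and fills gaps with NA
merged <- merge(add_id(df1), add_id(df2), by = "id", all.x = TRUE)
merged <- merge(merged, add_id(df3), by = "id", all.x = TRUE)

# Drop the helper key once the columns sit side by side
merged$id <- NULL
merged
```

Using all.x = TRUE only keeps all rows here because df1 is the longest frame; all = TRUE would keep unmatched rows from both sides.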

HTH

Dag

technocrat · February 21, 2020, 9:38pm

Serves me right for not doing a reprex; I'll come back from errands and take another whack at the piñata.

technocrat · February 21, 2020, 11:51pm

UPDATE: this is for a single column data frame--not responsive, ignore it.

Ok, I've got my head screwed on right; all df must have same non-null column names

df1  <- as.data.frame(seq(1:10))
df2  <- as.data.frame(seq(1:20))
df3  <- as.data.frame(seq(1:30))
colnames(df1) <- "X"
colnames(df2) <- "X"
colnames(df3) <- "X"
rbind(df1,df2,df3)
#>     X
#> 1   1
#> 2   2
#> 3   3
#> 4   4
#> 5   5
#> 6   6
#> 7   7
#> 8   8
#> 9   9
#> 10 10
#> 11  1
#> 12  2
#> 13  3
#> 14  4
#> 15  5
#> 16  6
#> 17  7
#> 18  8
#> 19  9
#> 20 10
#> 21 11
#> 22 12
#> 23 13
#> 24 14
#> 25 15
#> 26 16
#> 27 17
#> 28 18
#> 29 19
#> 30 20
#> 31  1
#> 32  2
#> 33  3
#> 34  4
#> 35  5
#> 36  6
#> 37  7
#> 38  8
#> 39  9
#> 40 10
#> 41 11
#> 42 12
#> 43 13
#> 44 14
#> 45 15
#> 46 16
#> 47 17
#> 48 18
#> 49 19
#> 50 20
#> 51 21
#> 52 22
#> 53 23
#> 54 24
#> 55 25
#> 56 26
#> 57 27
#> 58 28
#> 59 29
#> 60 30

Created on 2020-02-21 by the reprex package (v0.3.0)

andresrcs · February 22, 2020, 12:52am

This is one option

library(tidyverse)

df1 <- data.frame(stringsAsFactors = FALSE,
                  case1 = c("ABCCD", "PCCGT", "XGGTS"))
df2 <- data.frame(stringsAsFactors = FALSE,
                  case2 = c("TCGGS", "GCQZZ"))
df3 <- data.frame(stringsAsFactors = FALSE,
                  case3 = c("ZCRTTS"))

df1 %>%
    rownames_to_column() %>% 
    left_join(df2 %>% rownames_to_column()) %>% 
    left_join(df3 %>% rownames_to_column()) %>% 
    select(-rowname)
#> Joining, by = "rowname"
#> Joining, by = "rowname"
#>   case1 case2  case3
#> 1 ABCCD TCGGS ZCRTTS
#> 2 PCCGT GCQZZ   <NA>
#> 3 XGGTS  <NA>   <NA>
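One caveat worth sketching: left_join() keeps only the rows of the left-hand frame, so the chain above returns all three rows because df1 happens to be the longest. If the longest frame is not known in advance, full_join() may be a safer choice. The following is a minimal sketch under that assumption, with the frames deliberately listed shortest first; `with_id` and `combined` are illustrative names.

```r
library(dplyr)
library(tibble)

df1 <- data.frame(case1 = c("ABCCD", "PCCGT", "XGGTS"), stringsAsFactors = FALSE)
df2 <- data.frame(case2 = c("TCGGS", "GCQZZ"), stringsAsFactors = FALSE)
df3 <- data.frame(case3 = c("ZCRTTS"), stringsAsFactors = FALSE)

# Turn the row names into a "rowname" key column in every frame
with_id <- lapply(list(df3, df2, df1), rownames_to_column)

# full_join keeps unmatched rows from both sides, so input order no longer matters
combined <- Reduce(function(x, y) full_join(x, y, by = "rowname"), with_id)
combined$rowname <- NULL
combined
```

The columns come out in the order the frames were listed (case3, case2, case1), so reorder with select() if that matters.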

technocrat · February 22, 2020, 1:08am

Disregard the man behind the wall. I screwed up reading the question.

Pioneer82 · February 25, 2020, 7:38pm

Interesting, but I needed the columns side by side, not combined into a single column.

Pioneer82 · February 25, 2020, 7:40pm

This seems to work well. Is there a way to have the code skip inserting NA and just leave the cell blank?

mara · February 25, 2020, 7:51pm

The rownames trick is basically a workaround for the fact that a data frame is fundamentally meant to have the same number of rows for each column.

From the data frame docs:

A data frame is a list of variables of the same number of rows with unique row names, given class "data.frame". If no variables are included, the row names determine the number of rows.

Further down

Objects passed to data.frame should have the same number of rows, but atomic vectors (see is.vector), factors and character vectors protected by I will be recycled a whole number of times if necessary (including as elements of list arguments).

You could replace NA with some other desired value (e.g. an empty string), but something has to be there to maintain the structure of the object.
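To make the "something has to be there" point concrete, here is a base R sketch that pads the shorter columns with NA before building the data frame. It assumes the data start out as plain character vectors; the names `cols`, `padded`, and `combined` are illustrative.

```r
# Columns of unequal length, held in a named list
cols <- list(
  case1 = c("ABCCD", "PCCGT", "XGGTS"),
  case2 = c("TCGGS", "GCQZZ"),
  case3 = c("ZCRTTS")
)

# Passing unequal-length columns straight to data.frame() fails
attempt <- tryCatch(
  data.frame(case1 = cols$case1, case2 = cols$case2),
  error = function(e) e
)
inherits(attempt, "error")

# Extending a vector with `length<-` fills the new positions with NA
n <- max(lengths(cols))
padded <- lapply(cols, function(x) {
  length(x) <- n
  x
})

combined <- as.data.frame(padded, stringsAsFactors = FALSE)
combined
```

The NA values are what keep every column at the same number of rows; they can be swapped for another placeholder afterwards.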

Pioneer82 · February 25, 2020, 7:53pm

Thanks for the explanation, Mara. How do I pass an empty string instead of the NA?

nirgrahamuk · February 25, 2020, 8:17pm

The sjmisc package has a convenient helper for this: sjmisc::replace_na()

library(tibble)  # for tribble()
library(sjmisc)  # for replace_na() with a `value` argument
testframe <- tribble(~a, ~b,
                     "X", NA,
                     NA,  NA)
testframe
# A tibble: 2 x 2
  a     b    
  <chr> <lgl>
1 X     NA   
2 NA    NA   

replace_na(testframe,value="")
# A tibble: 2 x 2
  a     b    
  <chr> <chr>
1 "X"   ""   
2 ""    ""

andresrcs · February 25, 2020, 8:37pm

tidyr has its own replace_na(), but it is not as straightforward.

library(tidyverse)

df1 <- data.frame(stringsAsFactors = FALSE,
                  case1 = c("ABCCD", "PCCGT", "XGGTS"))
df2 <- data.frame(stringsAsFactors = FALSE,
                  case2 = c("TCGGS", "GCQZZ"))
df3 <- data.frame(stringsAsFactors = FALSE,
                  case3 = c("ZCRTTS"))

df1 %>%
    rownames_to_column() %>% 
    left_join(df2 %>% rownames_to_column()) %>% 
    left_join(df3 %>% rownames_to_column()) %>% 
    select(-rowname) %>% 
    replace_na(list(case1 = "", case2 = "", case3 = ""))
#> Joining, by = "rowname"
#> Joining, by = "rowname"
#>   case1 case2  case3
#> 1 ABCCD TCGGS ZCRTTS
#> 2 PCCGT GCQZZ       
#> 3 XGGTS

mara · February 25, 2020, 8:40pm

Base alternative here, too. Though I think sjmisc is the easiest way to go!
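For readers without the link, one common base R idiom is logical indexing with is.na(). This is a minimal sketch and not necessarily the alternative referred to above; it assumes all columns are character, and `combined` is an illustrative name for the already-joined result.

```r
# The joined result from earlier in the thread, written out by hand
combined <- data.frame(
  case1 = c("ABCCD", "PCCGT", "XGGTS"),
  case2 = c("TCGGS", "GCQZZ", NA),
  case3 = c("ZCRTTS", NA, NA),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix, which can index the frame directly
combined[is.na(combined)] <- ""
combined
```

With factor or numeric columns this would either warn or coerce the column to character, so it is best kept to character data.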

dromano · February 25, 2020, 10:51pm

Another possibility is to use mutate_all() instead of replace_na():

df1 %>%
  rownames_to_column() %>% 
  left_join(df2 %>% rownames_to_column()) %>% 
  left_join(df3 %>% rownames_to_column()) %>% 
  select(-rowname) %>% 
  mutate_all(~ if_else(is.na(.), "",  .))
  # uses anonymous function syntax with '.' as variable: ~ f(.)

and if df1, df2, and df3 are already tibbles, the analogue would be:

df1 %>% mutate(rowname =  row_number()) %>%
  # since tibbles don't allow row names
  left_join(df2 %>% mutate(rowname =  row_number())) %>% 
  left_join(df3 %>% mutate(rowname =  row_number())) %>% 
  select(-rowname) %>% 
  mutate_if(is.character, ~ if_else(is.na(.), "",  .))
  # since tibbles require uniform column data types, "" can only appear 
  # in character columns
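In dplyr 1.0.0 and later, the scoped verbs mutate_all() and mutate_if() are superseded by across(). A sketch of the same replacement in that style, assuming a recent dplyr and tidyr are installed, with `joined` and `cleaned` as illustrative names:

```r
library(dplyr)
library(tidyr)

# An already-joined tibble with NA in the gaps
joined <- tibble(
  case1 = c("ABCCD", "PCCGT", "XGGTS"),
  case2 = c("TCGGS", "GCQZZ", NA),
  case3 = c("ZCRTTS", NA, NA)
)

# Apply tidyr's vector form of replace_na() to every character column
cleaned <- joined %>%
  mutate(across(where(is.character), ~ replace_na(.x, "")))
cleaned
```

Restricting to where(is.character) plays the same role as mutate_if(is.character, ...) above: non-character columns are left untouched.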
