Merging data frames with different numbers of rows

Maaike27 · May 9, 2021, 8:05pm

Hi, my actual datasets are way too large to share, so I made two example data frames:

df1 <- data.frame(w=c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yzA", "BCD"), a=rep(c(0:1), times =5), b=rep(0, times= 2), c=c(0, 5, 0, 9, 6, 7, 7, 0, 0, 3))
df2 <- data.frame(w=c("abc", "def", "ghi", "jkl", "mno"), f=c(1, 2, 5, 7, 3))

These are the two example data frames I want to merge based on the values in column w.

df_merged <- data.frame(w=c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yzA", "BCD"), a=rep(c(0:1), times =5), b=rep(0, times= 2), c=c(0, 5, 0, 9, 6, 7, 7, 0, 0, 3), f=c(1,2,5,7,3,0,0,0,0,0))

This would be my ideal output data frame. The df2 rows that match on column w are merged in, and the rows that have no match get a 0 in that place.

I already figured out that I should use the merge function, but I do not know how.
Can anyone help me?

andresrcs · May 9, 2021, 9:43pm

Use a left_join()

library(tidyverse)

df1 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yzA", "BCD"),
    a = rep(c(0:1), times =5),
    b = rep(0, times= 2),
    c = c(0, 5, 0, 9, 6, 7, 7, 0, 0, 3)
    )

df2 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno"),
    f = c(1, 2, 5, 7, 3)
    )

df1 %>% 
    left_join(df2) %>% 
    replace_na(list(f = 0))
#> Joining, by = "w"
#>      w a b c f
#> 1  abc 0 0 0 1
#> 2  def 1 0 5 2
#> 3  ghi 0 0 0 5
#> 4  jkl 1 0 9 7
#> 5  mno 0 0 6 3
#> 6  pqr 1 0 7 0
#> 7  stu 0 0 7 0
#> 8  vwx 1 0 0 0
#> 9  yzA 0 0 0 0
#> 10 BCD 1 0 3 0

Created on 2021-05-09 by the reprex package (v2.0.0)
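
Since the question mentions merge(), here is a minimal base R sketch of the same left join, using the example data from the question. all.x = TRUE keeps every row of df1, and the name df_merged is borrowed from the question's desired output. Note that merge() may reorder rows, so the result should not be assumed to come back in df1's original order.

```r
# Example data from the question
df1 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yzA", "BCD"),
    a = rep(0:1, times = 5),
    b = rep(0, times = 10),
    c = c(0, 5, 0, 9, 6, 7, 7, 0, 0, 3)
)

df2 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno"),
    f = c(1, 2, 5, 7, 3)
)

# all.x = TRUE keeps every row of df1 (a left join);
# rows of df1 without a match in df2 get NA in f
df_merged <- merge(df1, df2, by = "w", all.x = TRUE)

# Replace the NAs introduced by the join with 0
df_merged$f[is.na(df_merged$f)] <- 0

df_merged
```

If the row order matters, it is probably safer to sort the result explicitly afterwards than to rely on merge()'s ordering.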

Maaike27 · May 10, 2021, 7:58am

Thank you! I have one more question. How would I need to change the code if df2 had over 300 columns?

andresrcs · May 10, 2021, 12:47pm

I don't understand your question. A join works regardless of the number of columns. Can you clarify?

Maaike27 · May 10, 2021, 12:58pm

The code specifies replace_na(list(f = 0)), where f is a column of one of the data frames. I have tried to specify multiple columns here, but it does not work.

andresrcs · May 11, 2021, 12:56am

Manually specifying an NA replacement value for each column does work, but if you want to avoid the typing you can also use a different approach. See this example:

library(tidyverse)

df1 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yzA", "BCD"),
    a = rep(c(0:1), times =5),
    b = rep(0, times= 2),
    c = c(0, 5, 0, 9, 6, 7, 7, 0, 0, 3)
)

df2 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno"),
    f = c(1, 2, 5, 7, 3),
    g = c(3, 4, 5, 6, 7),
    h = c(6, 7, 8, 9, 10)
)

df1 %>% 
    left_join(df2) %>% 
    replace_na(list(f = 0, g = 0, h = 0))
#> Joining, by = "w"
#>      w a b c f g  h
#> 1  abc 0 0 0 1 3  6
#> 2  def 1 0 5 2 4  7
#> 3  ghi 0 0 0 5 5  8
#> 4  jkl 1 0 9 7 6  9
#> 5  mno 0 0 6 3 7 10
#> 6  pqr 1 0 7 0 0  0
#> 7  stu 0 0 7 0 0  0
#> 8  vwx 1 0 0 0 0  0
#> 9  yzA 0 0 0 0 0  0
#> 10 BCD 1 0 3 0 0  0

df1 %>% 
    left_join(df2) %>% 
    mutate(across(where(is.numeric), ~ ifelse(is.na(.), 0, .)))
#> Joining, by = "w"
#>      w a b c f g  h
#> 1  abc 0 0 0 1 3  6
#> 2  def 1 0 5 2 4  7
#> 3  ghi 0 0 0 5 5  8
#> 4  jkl 1 0 9 7 6  9
#> 5  mno 0 0 6 3 7 10
#> 6  pqr 1 0 7 0 0  0
#> 7  stu 0 0 7 0 0  0
#> 8  vwx 1 0 0 0 0  0
#> 9  yzA 0 0 0 0 0  0
#> 10 BCD 1 0 3 0 0  0

Created on 2021-05-11 by the reprex package (v2.0.0)
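
With several hundred columns in df2, another option is to build the list of columns from names(df2) instead of typing them, so that only the columns that came from df2 are touched. This is a sketch under the assumptions that the key column is called w and that 0 is a sensible fill value for every joined column. The names new_cols and result are introduced here for illustration only.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yzA", "BCD"),
    a = rep(0:1, times = 5),
    b = rep(0, times = 10),
    c = c(0, 5, 0, 9, 6, 7, 7, 0, 0, 3)
)

df2 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno"),
    f = c(1, 2, 5, 7, 3),
    g = c(3, 4, 5, 6, 7),
    h = c(6, 7, 8, 9, 10)
)

# Every column of df2 except the join key (illustrative helper name)
new_cols <- setdiff(names(df2), "w")

result <- df1 %>%
    left_join(df2, by = "w") %>%
    # Fill NAs only in the columns that were added by the join
    mutate(across(all_of(new_cols), ~ replace_na(.x, 0)))

result
```

Unlike where(is.numeric), this leaves any NA that was already present in df1's own columns untouched. Passing by = "w" explicitly also silences the "Joining, by" message.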

As a note of warning, be careful when replacing NA with 0: they are not the same thing, and the replacement might affect your calculations down the line.
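
To make that concrete, here is a small sketch using the f values from the example. It compares the mean of the column with the missing values kept as NA against the mean after they are replaced by 0. The names f_na and f_zero are illustrative.

```r
# Column f after the join, before replacing missing values
f_na <- c(1, 2, 5, 7, 3, NA, NA, NA, NA, NA)

# The same column with NA replaced by 0
f_zero <- ifelse(is.na(f_na), 0, f_na)

mean(f_na, na.rm = TRUE)  # 3.6: average over the 5 observed values
mean(f_zero)              # 1.8: the zeros count as real observations
```

Which of the two is right depends on whether a missing match really means "zero" in your data or just "unknown".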

Maaike27 · May 11, 2021, 7:51am

This helps a lot, thank you!
