Merging data frames with different number of rows

Hi, my actual datasets are way too large to share, so I made two example data frames:

df1 <- data.frame(w=c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yzA", "BCD"), a=rep(c(0:1), times =5), b=rep(0, times= 2), c=c(0, 5, 0, 9, 6, 7, 7, 0, 0, 3))
df2 <- data.frame(w=c("abc", "def", "ghi", "jkl", "mno"), f=c(1, 2, 5, 7, 3))

These are the two example data frames that I want to merge based on the values in column w.

df_merged <- data.frame(w=c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yzA", "BCD"), a=rep(c(0:1), times =5), b=rep(0, times= 2), c=c(0, 5, 0, 9, 6, 7, 7, 0, 0, 3), f=c(1,2,5,7,3,0,0,0,0,0)) 

This would be my ideal output data frame. The df2 rows whose w value matches a row in df1 should be joined onto that row, and the df1 rows that have no match in df2 should get a 0 in the new column.

I already figured out that I should use the merge function, but I do not know how.
Can anyone help me? :upside_down_face:

Use a left_join()

library(tidyverse)

df1 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yzA", "BCD"),
    a = rep(c(0:1), times =5),
    b = rep(0, times= 2),
    c = c(0, 5, 0, 9, 6, 7, 7, 0, 0, 3)
    )

df2 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno"),
    f = c(1, 2, 5, 7, 3)
    )

df1 %>% 
    left_join(df2) %>% 
    replace_na(list(f = 0))
#> Joining, by = "w"
#>      w a b c f
#> 1  abc 0 0 0 1
#> 2  def 1 0 5 2
#> 3  ghi 0 0 0 5
#> 4  jkl 1 0 9 7
#> 5  mno 0 0 6 3
#> 6  pqr 1 0 7 0
#> 7  stu 0 0 7 0
#> 8  vwx 1 0 0 0
#> 9  yzA 0 0 0 0
#> 10 BCD 1 0 3 0

Created on 2021-05-09 by the reprex package (v2.0.0)
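
Since the question mentions merge(), here is a rough base R equivalent of the join above, in case it helps to compare. This is only a sketch: the name df_merged is borrowed from the question, and with sort = FALSE the row order returned by merge() is not guaranteed to follow df1, so check the order if it matters to you.

```r
df1 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yzA", "BCD"),
    a = rep(c(0:1), times = 5),
    b = rep(0, times = 2),
    c = c(0, 5, 0, 9, 6, 7, 7, 0, 0, 3)
)

df2 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno"),
    f = c(1, 2, 5, 7, 3)
)

# all.x = TRUE keeps every row of df1 (a left join);
# df1 rows without a match in df2 get NA in column f
df_merged <- merge(df1, df2, by = "w", all.x = TRUE, sort = FALSE)

# Replace the NAs introduced by the join with 0
df_merged$f[is.na(df_merged$f)] <- 0
```

Naming the key with by = "w" also avoids relying on the automatic detection of common columns.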

Thank you! I have one more question. How would I need to change the code if df2 had over 300 columns? :blush:

I don't understand your question. A join works regardless of the number of columns. Can you clarify?

The code specifies replace_na(list(f = 0)), where f is a column of one of the data frames. I have tried to specify multiple columns there, but it does not work.

Manually specifying an NA replacement value for each column does work, but if you want to avoid the typing you can also use a different approach. See this example:

library(tidyverse)

df1 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yzA", "BCD"),
    a = rep(c(0:1), times =5),
    b = rep(0, times= 2),
    c = c(0, 5, 0, 9, 6, 7, 7, 0, 0, 3)
)

df2 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno"),
    f = c(1, 2, 5, 7, 3),
    g = c(3, 4, 5, 6, 7),
    h = c(6, 7, 8, 9, 10)
)

df1 %>% 
    left_join(df2) %>% 
    replace_na(list(f = 0, g = 0, h = 0))
#> Joining, by = "w"
#>      w a b c f g  h
#> 1  abc 0 0 0 1 3  6
#> 2  def 1 0 5 2 4  7
#> 3  ghi 0 0 0 5 5  8
#> 4  jkl 1 0 9 7 6  9
#> 5  mno 0 0 6 3 7 10
#> 6  pqr 1 0 7 0 0  0
#> 7  stu 0 0 7 0 0  0
#> 8  vwx 1 0 0 0 0  0
#> 9  yzA 0 0 0 0 0  0
#> 10 BCD 1 0 3 0 0  0

df1 %>% 
    left_join(df2) %>% 
    mutate(across(where(is.numeric), ~ ifelse(is.na(.), 0, .)))
#> Joining, by = "w"
#>      w a b c f g  h
#> 1  abc 0 0 0 1 3  6
#> 2  def 1 0 5 2 4  7
#> 3  ghi 0 0 0 5 5  8
#> 4  jkl 1 0 9 7 6  9
#> 5  mno 0 0 6 3 7 10
#> 6  pqr 1 0 7 0 0  0
#> 7  stu 0 0 7 0 0  0
#> 8  vwx 1 0 0 0 0  0
#> 9  yzA 0 0 0 0 0  0
#> 10 BCD 1 0 3 0 0  0

Created on 2021-05-11 by the reprex package (v2.0.0)
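
One thing to keep in mind with where(is.numeric) is that it also touches numeric columns that came from df1. If df1 could contain NAs of its own that should stay NA, a possible variation is to restrict the replacement to the columns that df2 contributes. The sketch below assumes the key column is called w as in the example; new_cols and result are just illustrative names, and the NA in df1$c was added here only to show the difference.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yzA", "BCD"),
    a = rep(c(0:1), times = 5),
    c = c(0, NA, 0, 9, 6, 7, 7, 0, 0, 3)  # NA added for illustration only
)

df2 <- data.frame(
    w = c("abc", "def", "ghi", "jkl", "mno"),
    f = c(1, 2, 5, 7, 3),
    g = c(3, 4, 5, 6, 7),
    h = c(6, 7, 8, 9, 10)
)

# Every df2 column except the join key; works the same for 3 or 300 columns
new_cols <- setdiff(names(df2), "w")

result <- df1 %>%
    left_join(df2, by = "w") %>%
    # Only fill NAs in the columns that came from df2
    mutate(across(all_of(new_cols), ~ replace_na(.x, 0)))
```

Here the NA in column c is left alone, while f, g and h are filled with 0 for the unmatched rows. This assumes the df2 columns are all numeric; for other types the replacement value would need to match the column type.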

As a note of warning, be careful when replacing NA with 0. They are not the same thing (NA means the value is missing, 0 is an actual value), and the replacement might affect your calculations down the line.
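
To make that concrete, here is a small sketch using the f values from the example. The vector names are only illustrative.

```r
# Column f after the join, before and after replacing NA with 0
f_with_na   <- c(1, 2, 5, 7, 3, NA, NA, NA, NA, NA)
f_with_zero <- c(1, 2, 5, 7, 3, 0, 0, 0, 0, 0)

# Mean over the five observed values only
mean_na <- mean(f_with_na, na.rm = TRUE)   # 3.6

# Mean over all ten rows, treating the missing values as real zeros
mean_zero <- mean(f_with_zero)             # 1.8
```

Which of the two is right depends on whether a missing row in df2 really means "zero" in your data.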

This helps a lot, thank you!
