Problems when using the merge() function

Hi there!

I am merging two data frames using this function:

merged <- merge(df1, df2[, c("var1", "var2", "var3")], all.x = TRUE, all.y = FALSE)

The final number of observations in merged > df1.

Note: df1 has about 2,500,000 observations and df2 about 500.

Does anyone know what the problem could be?

Thank you!

You would only expect the result of the merge to have the same number of rows as the largest table, if the key by which the merge is done is unique. Your result implies it has duplicates. (In either df1 , df2 , or both)

1 Like

The output of the merge will have more rows than df1 if there are multiple rows in df2 that meet the merge condition. You don't say what columns you are merging on. In this example, I merge on var1 and get one extra row because df2 has var1 equal to 11 in two rows.

df1 <- data.frame(A = 1:4, var1 = 11:14)
df2 <- data.frame(var1 = c(11,11,12,13), var2 = 21:24, var3 = 31:34, B =1:4)

merge(df1, df2[, c("var1", "var2", "var3")], all.x = TRUE, all.y = FALSE)
#>   var1 A var2 var3
#> 1   11 1   21   31
#> 2   11 1   22   32
#> 3   12 2   23   33
#> 4   13 3   24   34
#> 5   14 4   NA   NA

Created on 2023-03-10 with reprex v2.0.2

1 Like

Thanks, that is helpful. In the end, I had to aggregate some values in the second dataset to match the observations in the first one.

Thanks!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.