Problems when using the merge() function

hgarbino · March 10, 2023, 4:35pm

Hi there!

I am merging two data frames using this function:

merged <- merge(df1, df2[, c("var1", "var2", "var3")], all.x = TRUE, all.y = FALSE)

The final number of observations in merged > df1.

Note: df1 has about 2,500,000 observations and df2 about 500.

Does anyone know what the problem could be?

Thank you!

nirgrahamuk · March 10, 2023, 5:06pm

You would only expect the result of the merge to have the same number of rows as the largest table, if the key by which the merge is done is unique. Your result implies it has duplicates. (In either df1 , df2 , or both)

FJCC · March 10, 2023, 5:08pm

The output of the merge will have more rows than df1 if there are multiple rows in df2 that meet the merge condition. You don't say what columns you are merging on. In this example, I merge on var1 and get one extra row because df2 has var1 equal to 11 in two rows.

df1 <- data.frame(A = 1:4, var1 = 11:14)
df2 <- data.frame(var1 = c(11,11,12,13), var2 = 21:24, var3 = 31:34, B =1:4)

merge(df1, df2[, c("var1", "var2", "var3")], all.x = TRUE, all.y = FALSE)
#>   var1 A var2 var3
#> 1   11 1   21   31
#> 2   11 1   22   32
#> 3   12 2   23   33
#> 4   13 3   24   34
#> 5   14 4   NA   NA

^{Created on 2023-03-10 with reprex v2.0.2}

hgarbino · March 11, 2023, 10:32am

Thanks, that is helpful. In the end, I had to aggregate some values in the second dataset to match the observations in the first one.

Thanks!

system · March 18, 2023, 10:32am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.