Problem when I want to merge 2 dataframes

Crabe2 · November 5, 2022, 10:54am

Hello, first, I have 2 dataframes, ChicagoCrime2029 and ChicagoCrime2020.
The dataframes have respectively 258,993 and 207,989 observations.
The variable that I want to use to merge these to datasets is 'Primary.Type'. So I use the followwing function :

(crime20b and crime19b are just the name of my df's)

crime19_20 <- full_join(crime19b2, crime20b2, by = 'Primary.Type')

I could use "left_join" but here I think it doesn't matter because I have the same result.

So when I run the code, R tell me that it is impossible because I'm running out of memory, but for information I have 16 Gb of ram. Moreover The vector of parameter is "too high".

If you have any answers, if you have already encountered this problem I'll be happy if you can help me.

FJCC · November 5, 2022, 11:47am

It is easy to imagine that you are running out of memory. Let's look at a toy example with only 20 rows in each data frame and three different values of Type, the column by which we join.

library(dplyr)
set.seed(123)
DF1 <- data.frame(Type=sample(c('a','b','c'),20,replace = TRUE),
                   other_col=1:20)
DF2 <- data.frame(Type=sample(c('a','b','c'),20,replace = TRUE),
                   other_col=1:20)
DF1_2 <- left_join(DF1,DF2,by="Type")
nrow(DF1_2)
[1] 131

table(DF1$Type)

a b c 
5 7 8 
table(DF2$Type)

a b c 
8 5 7 
5*8 + 7*5 + 8*7
[1] 131

The 20 rows in each data frame results in 131 rows in the final data frame because each row in DF1 is matched with every row in DF2 that has the same type. In your data set of about 200000 rows in each DF, let's assume you have 20 values of Primary.Type and they are evenly distributed, so there are 10000 of each Primary.Type. That results in 2 billion rows, allowing only 8 bytes per row if all of your memory could go to this object.

10000 * 10000 * 20 = 2 E9

Why are you doing this join? What calculation or comparison are you trying to make?

M_AcostaCH · November 5, 2022, 3:55pm

You could use this when the DF are diferent length :

library(tidyverse)

dfull <- crime19b2%>% 
  merge(crime20b2, by= "Primary.Type")

Crabe2 · November 5, 2022, 5:04pm

Hi, thank you for your answer ! But finally I have cleaned more my data frames and it works with :

crime19_20 <- full_join(crime19, crime20)

But the thing I don't understand is why when I write :

crime19_20 <- full_join(crime19, crime20, by = "Primary.Type")

it doesn't work (because running out of memory, and when I don't writhe the condition on primary type it works.
Thanks you again.

FJCC · November 5, 2022, 5:32pm

I do not have your data ,but my guess is that the full_join, when you do not specify what columns to join by, is doing the same thing as simply row binding the two data frames. The help section on the by argument says:

by
A character vector of variables to join by.

If NULL, the default, ⁠*_join()⁠ will perform a natural join, using all variables in common across x and y. A message lists the variables so that you can check they're correct; suppress the message by supplying by explicitly.

So if you do not provide by, all of the columns present in both data sets must match. That probably means that there are no matches. The code below shows a simple example.

library(dplyr)

set.seed(123)
DF1 <- data.frame(Type=sample(c('a','b','c'),20,replace = TRUE),
                  other_col=1:20)
DF2 <- data.frame(Type=sample(c('a','b','c'),20,replace = TRUE),
                  other_col=21:40)

#Lots of matches on Type only
DF1_2 <- full_join(DF1,DF2,by="Type")
nrow(DF1_2)
#> [1] 131

#No matches for both Type and other_col, so each row is 
#individually returned
DF1_2_2 <- full_join(DF1,DF2)
#> Joining, by = c("Type", "other_col")
nrow(DF1_2_2)
#> [1] 40

#binding row-wise with rbind() does the same as the full_join
DF1_2_3 <- rbind(DF1,DF2)

identical(DF1_2_2,DF1_2_3)
#> [1] TRUE

^{Created on 2022-11-05 with reprex v2.0.2}

Crabe2 · November 5, 2022, 6:00pm

Oh, I see, that is more clear now, it is because none of my rows match between the 2 datasets.
Thanks again.

system · November 12, 2022, 6:01pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.