Hello, first, I have 2 dataframes, ChicagoCrime2029 and ChicagoCrime2020.
The dataframes have respectively 258,993 and 207,989 observations.
The variable that I want to use to merge these to datasets is 'Primary.Type'. So I use the followwing function :
(crime20b and crime19b are just the name of my df's)
crime19_20 <- full_join(crime19b2, crime20b2, by = 'Primary.Type')
I could use "left_join" but here I think it doesn't matter because I have the same result.
So when I run the code, R tell me that it is impossible because I'm running out of memory, but for information I have 16 Gb of ram. Moreover The vector of parameter is "too high".
If you have any answers, if you have already encountered this problem I'll be happy if you can help me.
It is easy to imagine that you are running out of memory. Let's look at a toy example with only 20 rows in each data frame and three different values of Type, the column by which we join.
library(dplyr)
set.seed(123)
DF1 <- data.frame(Type=sample(c('a','b','c'),20,replace = TRUE),
other_col=1:20)
DF2 <- data.frame(Type=sample(c('a','b','c'),20,replace = TRUE),
other_col=1:20)
DF1_2 <- left_join(DF1,DF2,by="Type")
nrow(DF1_2)
[1] 131
table(DF1$Type)
a b c
5 7 8
table(DF2$Type)
a b c
8 5 7
5*8 + 7*5 + 8*7
[1] 131
The 20 rows in each data frame results in 131 rows in the final data frame because each row in DF1 is matched with every row in DF2 that has the same type. In your data set of about 200000 rows in each DF, let's assume you have 20 values of Primary.Type and they are evenly distributed, so there are 10000 of each Primary.Type. That results in 2 billion rows, allowing only 8 bytes per row if all of your memory could go to this object.
10000 * 10000 * 20 = 2 E9
Why are you doing this join? What calculation or comparison are you trying to make?
I do not have your data ,but my guess is that the full_join, when you do not specify what columns to join by, is doing the same thing as simply row binding the two data frames. The help section on the by argument says:
by
A character vector of variables to join by.
If NULL, the default, *_join() will perform a natural join, using all variables in common across x and y. A message lists the variables so that you can check they're correct; suppress the message by supplying by explicitly.
So if you do not provide by, all of the columns present in both data sets must match. That probably means that there are no matches. The code below shows a simple example.
library(dplyr)
set.seed(123)
DF1 <- data.frame(Type=sample(c('a','b','c'),20,replace = TRUE),
other_col=1:20)
DF2 <- data.frame(Type=sample(c('a','b','c'),20,replace = TRUE),
other_col=21:40)
#Lots of matches on Type only
DF1_2 <- full_join(DF1,DF2,by="Type")
nrow(DF1_2)
#> [1] 131
#No matches for both Type and other_col, so each row is
#individually returned
DF1_2_2 <- full_join(DF1,DF2)
#> Joining, by = c("Type", "other_col")
nrow(DF1_2_2)
#> [1] 40
#binding row-wise with rbind() does the same as the full_join
DF1_2_3 <- rbind(DF1,DF2)
identical(DF1_2_2,DF1_2_3)
#> [1] TRUE