anti_join fails on a large dataset. How can I fix this?

mmarion · March 31, 2023, 3:00pm

df2 <- read.csv("df2.csv", header = TRUE, na.strings = c("NA", ""))
df2 <- as.data.frame(df2)
str(df2)

OUTPUT:

'data.frame': 6497 obs. of 17 variables:
 $ id : int 1 2 3 4 5 6 7 8 9 10 ...
 $ id2: int 0 1 2 3 4 5 6 7 8 9 ...
 $ z : int 0 0 0 0 0 0 0 0 0 0 ...
 $ x1 : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
 $ x2 : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
 $ x3 : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
 $ x4 : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
 $ x5 : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
 $ x6 : num 11 25 15 17 11 13 15 15 9 17 ...
 $ x7 : num 34 67 54 60 34 40 59 21 18 102 ...
 $ x8 : num 0.998 0.997 0.997 0.998 0.998 ...
 $ x9 : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
 $ x10: num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
 $ x11: num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
 $ y : int 5 5 5 6 5 5 5 7 7 5 ...
 $ y2 : int 0 0 0 1 0 0 0 1 1 0 ...
 $ y3 : int 1 1 1 2 1 1 1 2 2 1 ...

# set complement
excerpt1 <- subset(df2,subset=id<21)
excerpt1 <- as.data.frame(excerpt1)
str(excerpt1)

OUTPUT:

'data.frame': 20 obs. of 17 variables:
 $ id : int 1 2 3 4 5 6 7 8 9 10 ...
 $ id2: int 0 1 2 3 4 5 6 7 8 9 ...
 $ z : int 0 0 0 0 0 0 0 0 0 0 ...
 $ x1 : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
 $ x2 : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
 $ x3 : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
 $ x4 : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
 $ x5 : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
 $ x6 : num 11 25 15 17 11 13 15 15 9 17 ...
 $ x7 : num 34 67 54 60 34 40 59 21 18 102 ...
 $ x8 : num 0.998 0.997 0.997 0.998 0.998 ...
 $ x9 : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
 $ x10: num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
 $ x11: num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
 $ y : int 5 5 5 6 5 5 5 7 7 5 ...
 $ y2 : int 0 0 0 1 0 0 0 1 1 0 ...
 $ y3 : int 1 1 1 2 1 1 1 2 2 1 ...

library(dplyr)
one <- anti_join(df2,excerpt1, by = c(df2 = "excerpt1"))
head(one)

OUTPUT:

Error in anti_join():
! Join columns must be present in data.
Problem with df2.
Backtrace:

dplyr::anti_join(df2, excerpt1, by = c(df2 = "excerpt1"))
dplyr:::anti_join.data.frame(df2, excerpt1, by = c(df2 = "excerpt1"))
Error in anti_join(df2, excerpt1, by = c(df2 = "excerpt1")) :
Problem with df2.

nirgrahamuk · March 31, 2023, 3:19pm

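The by parameter tells anti_join which columns of each dataset to compare. You seem to be passing the names of the data frames there instead, and df2 has no column called df2, which is why the error says the join columns must be present in the data. In your case it would probably suffice to say by = 'id'.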
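To illustrate, here is a minimal sketch of that call. The small df2 below is a made-up stand-in (30 rows, one numeric column) rather than your actual file, so only the id column and the id < 21 subset rule are taken from your post.

```r
library(dplyr)

# Hypothetical stand-in for df2: 30 rows keyed by an integer id column
df2 <- data.frame(id = 1:30, x1 = (1:30) / 10)

# Same subset rule as in the question: rows with id below 21
excerpt1 <- subset(df2, id < 21)

# by names the key column shared by both data frames, not the data frames themselves.
# anti_join keeps the rows of df2 whose id has no match in excerpt1.
one <- anti_join(df2, excerpt1, by = "id")

nrow(one)    # 10
one$id       # 21 22 23 24 25 26 27 28 29 30
```

If the key column were named differently in the two data frames, the named form by = c("left_name" = "right_name") maps one to the other (both names here are placeholders).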
