Dataframe filter gives incorrect results with any + %in%

Using the construct any(X %in% Y) as a filter for a data frame containing vector elements does not give the correct results.

Reproducible example:

> df <- data.frame(K=c(1,2,3), L=rep(NA, 3))
> df$L<-list(c("a","b","c"), c("b","c","d"), c("c","d","e"))
> df
  K       L
1 1 a, b, c
2 2 b, c, d
3 3 c, d, e
> 
> M <- c("a", "e")
> 
> M %in% c("a","b","c")
[1]  TRUE FALSE
> any(M %in% c("a","b","c"))
[1] TRUE
> M %in% c("b","c","d")
[1] FALSE FALSE
> any(M %in% c("b","c","d"))
[1] FALSE
> M %in% c("c","d","e")
[1] FALSE  TRUE
> any(M %in% c("c","d","e"))
[1] TRUE
> 
> filtered2 <- df[any(M %in% df$L),]
> filtered2
[1] K L
<0 rows> (or 0-length row.names)
> nrow(filtered2)
[1] 0
> 
> # Dplyr gives the same results, as does mutate
> 
> library(dplyr)
> 
> filtered <- df %>% filter(any(M %in% L))
> nrow(filtered)
[1] 0
> 
> df %>% mutate(X=any(M %in% L))
  K       L     X
1 1 a, b, c FALSE
2 2 b, c, d FALSE
3 3 c, d, e FALSE

As demonstrated in the direct value testing, the first and third rows should have been returned from the filter and the values of X for mutate should have been TRUE, FALSE, TRUE. Not FALSE, FALSE, FALSE.

I found a workaround:

I found a work around:

```R
> FF_closure <- function(V) {
+   if (length(V[[1]]) > 1) {
+     RR = vector()
+     for (vv in V) {
+       RR[[length(RR)+1]] = FF_closure(vv)
+     }
+     return(RR)
+   }
+   for (m in M) {
+     for (v in V) {
+       if (v == m) {
+         return(TRUE)
+       }
+     }
+   }
+   return(FALSE)
+ }
> 
> filtered3 <- df[FF_closure(df$L),]
> filtered3
  K       L
1 1 a, b, c
3 3 c, d, e
> nrow(filtered3)
[1] 2

I am not sure what you are doing here but if we look at it like this

df <- data.frame(K=c(1,2,3), L=rep(NA, 3))
df$L<-list(c("a","b","c"), c("b","c","d"), c("c","d","e"))

M <- c("a", "e")

M %in% df$L[[1]]

we get what we expect for the first element.

More simply

d <- data.frame(K=c(1,2,3), L=rep(NA, 3))
d$L <- list(c("a","b","c"), c("b","c","d"), c("c","d","e"))
M <- c("a", "e")

(result <- sapply(d$L, function(x) M %in% x))
#>       [,1]  [,2]  [,3]
#> [1,]  TRUE FALSE FALSE
#> [2,] FALSE FALSE  TRUE

Created on 2023-11-07 with reprex v2.0.2

Right, but the point is that this DOES NOT WORK when trying to use any ... %in% in a filtering context.

You didn't use "any" in your filter function.

Neither does the workaround. Because of the order of evaluation any() takes as its argument not the vector but the return of the %in% operator, which is typeof logical.

see below ____________________

I think this is a fundamental issue with R unable to tell the difference between a collection and a scalar. The interpretation of filter should be "for each record do " . For each record I want to know if ANY of a given fixed set are found in a multi-valued array in the row.

My final workaround looks like:

Cannot test vectors for NA directly since cannot test vectors of length > 1

vector_is_na <- function(vector_element) {
if (test(length(vector_element) == 1)) {
if (is.na(vector_element)) {
return(TRUE)
}
}
return(FALSE)
}

TRUE if a vector_element from a dataframe (e.g. when a dataframe element itself is a vector)

contains at least one of the items in test_items

simple_any_in <- function(test_items, vector_element) {
if (!vector_is_na(vector_element)) {
for (ve in vector_element) {
for (it in test_items) {
if (test(ve == ti)) {
return(TRUE)
}
}
}
}
return(FALSE)
}

filter_dataframe_vector_by_any_in <- function(df_to_filter, df_col, test_items) {
match_closure <- function(column_vect) {
RR = rep(NA, nrow(df_to_filter))
i = 1
for (vector_str_element in column_vect) {
if (is.na(vector_str_element)) {
RR[[i]] = FALSE
} else {
vector_str_element = str_split_1(string=vector_str_element, pattern=", ")
RR[[i]] = simple_any_in(test_items, vector_element)
}
i = i + 1
}
# return a truthy-vector the length of nrow(df_to_filter) to filter the dataframe
return(RR)
}
truthy_vec = match_closure(df_col)
df_to_filter[truthy_vec, ]
}

R does not have "collection" or "scalar" types, it has vectors (equivalent of arrays in other languages), which can be of length 1 (thus be equivalent to scalars), and it has lists.

The operator x %in% table is defined such that it takes as input a vector x (optionally of length 1), and a vector table, and looks if each element of x is in table. In your code, you want to look in a list, which is not an appropriate data type for table.

What you want is a different operator, which can be defined as:

`%inlist%` <- function(x, lst){
  x %in% unlist(lst)
}

then you can use it as expected:

lst <- list(c("a", "b"),
            c("d", "e"))

c("a","f") %inlist% lst
#> [1]  TRUE FALSE

You could argue that where R is wrong is that %in% should give an error if you use a list as second argument. I guess the reason it's not the case is that %in% is agnostic to the type, so you can use it on list elements:

list(c("a","b"), c("a","f")) %in% lst
#> TRUE FALSE

and that would stay true if you define a custom class derived from list(), you wouldn't have to define a custom %in% for that class.

1 Like

As for this filtering, the problem is that you want to index the rows. So the vector on the left of %in% has to have one entry per row of df. You want a function that does "for each element of df$L, is any of them in M"?

So you have to have a separate loop:

sapply(df$L, \(el) any(el %in% M))

applied to the data frame:

df <- data.frame(K=c(1,2,3), L=rep(NA, 3))
df$L<-list(c("a","b","c"), c("b","c","d"), c("c","d","e"))

M <- c("a", "e")

filtered2 <- df[sapply(df$L, \(el) any(el %in% M)),]

filtered2
#>   K       L
#> 1 1 a, b, c
#> 3 3 c, d, e

Created on 2023-11-28 with reprex v2.0.2

1 Like

Oh and I was forgetting: this one has a problem for a different reason. Well, the basic reason is the same as above, when doing M %inlist% L you are asking backward, you are asking which element of M is in the data frame, whereas to filter you would need to ask with L on the left side (something like L %in% M).

So the result of M %in% L is of length 2, the same length as M. But then you add the any(), which collapses the vector of length 2 into a single value. And mutate() does something "clever": it recycles values. So the reason your column X has FALSE FALSE FALSE is not because you got three FALSE, it's because you ask if any element of M is in L, and get FALSE once (because L is a list, its elements are vectors), but then you recycle the FALSE to fill df$X.

This recycling "trick" is also something that is really useful in some situations, but can be dangerous as it looks like you correctly did three tests, even though you actually only did one.

2 Likes

Thanks for the helpful and informative responses.

1 Like