Remove rows if column contains object from another list

I have a dataframe with a column containing objects separated by a semicolon. I want to remove all rows where that column contains 1 or more objects from an outside list.

I think I know how this could be done by using ‘mutate’ to create a new column flagging whether anything in the column list matches the external list (which I think would need a couple loops), and then filtering out. However, the actual dataframe will have millions of rows, so I’m trying to figure out what is the simplest/fastest way to do this.

# Dataframe #
df <- data.frame(id = c(1,2,3,4,5,6),
                 species = c("species1", "species1", "species2", "species3", "species3", "species4"),
                 issue = c("a b;d", "e;f;g", "b;c d;f", NA, "d", "e;g"))

# List of objects that specify which rows should be removed (any row where the column “issue” contains any of these objects should be removed)  
issues_to_remove <- list("a b", "b", "c d")

# Desired output #
df.new <- data.frame(id = c(2, 4, 5, 6),
                 species = c("species1", "species3", "species3", "species4"),
                 issue = c("e;f;g", NA, "d", "e;g"))

Here is one approach. For some reason I cannot do a reprex at the moment, so I will just post the code.

# Dataframe #
df <- data.frame(id = c(1,2,3,4,5,6),
                 species = c("species1", "species1", "species2", "species3", "species3", "species4"),
                 issue = c("a b;d", "e;f;g", "b;c d;f", NA, "d", "e;g"))

# List of objects that specify which rows should be removed (any row where the column “issue” contains any of these objects should be removed)  
issues_to_remove <- list("a b", "b", "c d")
# Desired output #
df.new <- data.frame(id = c(2, 4, 5, 6),
                     species = c("species1", "species3", "species3", "species4"),
                     issue = c("e;f;g", NA, "d", "e;g"))
library(dplyr)
library(stringr)
SearchPattern <- paste(issues_to_remove, collapse = "|")

df <- df %>% filter(str_detect(issue, SearchPattern, negate = TRUE) |is.na(issue))
1 Like

I got Reprex untangled. This is similar to my last post but using reprex.

df <- data.frame(id = c(1,2,3,4,5,6),
                 species = c("species1", "species1", "species2", "species3", "species3", "species4"),
                 issue = c("a b;d", "e;f;g", "b;c d;f", NA, "d", "e;g"))

issues_to_remove <- list("a b", "b", "c d")
df.new <- data.frame(id = c(2, 4, 5, 6),
                     species = c("species1", "species3", "species3", "species4"),
                     issue = c("e;f;g", NA, "d", "e;g"))
library(dplyr)
library(stringr)
SearchPattern <- paste(issues_to_remove, collapse = "|")
SearchPattern
#> [1] "a b|b|c d"

df <- df %>% filter(str_detect(issue, SearchPattern, negate = TRUE) |is.na(issue))
df
#>   id  species issue
#> 1  2 species1 e;f;g
#> 2  4 species3  <NA>
#> 3  5 species3     d
#> 4  6 species4   e;g

df.new
#>   id  species issue
#> 1  2 species1 e;f;g
#> 2  4 species3  <NA>
#> 3  5 species3     d
#> 4  6 species4   e;g

Created on 2020-04-08 by the reprex package (v0.3.0)

1 Like

I knew there had to be a super simple solution, thanks so much!!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.