multiple str_detect or loop

Is there a more efficient way in dplyr or the tidyverse ecosystem to filter out multiple text items, as in the example below, or do I just need to compile a character vector and use a loop? The use case is filtering out political tweets that are not relevant to my analysis. Here, text is the column holding the tweet message text in a data frame created with the twitteR library.

df <- df %>% 
  filter(!str_detect(text, fixed("squat when #Putin annexed Crimea", ignore_case = TRUE)),
         !str_detect(text, fixed("environmental pol", ignore_case = TRUE)),
         !str_detect(text, fixed("Republicans have meeting with the Russians", ignore_case = TRUE)),
         !str_detect(text, fixed("impeach", ignore_case = TRUE)),
         !str_detect(text, fixed("AboutStrzok", ignore_case = TRUE)),
         !str_detect(text, fixed("Clinton", ignore_case = TRUE)),
         !str_detect(text, fixed("Obama", ignore_case = TRUE)),
         !str_detect(text, fixed("UraniumOne", ignore_case = TRUE)),
         !str_detect(text, fixed("Mueller", ignore_case = TRUE)),
         !str_detect(text, fixed("Hillary", ignore_case=TRUE)),
         !str_detect(text, fixed("Brennan", ignore_case=TRUE)),
         !str_detect(text, fixed("BUNDY", ignore_case=TRUE)),
         !str_detect(text, fixed("MAGA", ignore_case=TRUE)),
         !str_detect(text, fixed("realDonaldTrump", ignore_case=TRUE)),
         !str_detect(text, fixed("Obame", ignore_case=TRUE)),
         !str_detect(text, fixed("uranium 1", ignore_case=TRUE)),
         !str_detect(text, fixed("contaminate", ignore_case=TRUE)),
         !str_detect(text, fixed("munitions", ignore_case=TRUE)),
         !str_detect(text, fixed("AngelaMerk", ignore_case=TRUE)),
         !str_detect(text, fixed("TheMighty200", ignore_case=TRUE)),
         !str_detect(text, fixed("uranium-free water", ignore_case=TRUE)))

str_detect is vectorised over both the string and pattern arguments. For example:

political <- c("Clinton", "Obama")
str_detect("Clinton", fixed(political, ignore_case = TRUE))

The call above returns the vector c(TRUE, FALSE), so you could apply any() to the result to determine whether a given text value contains any of the patterns.
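As a minimal sketch of that suggestion for one string at a time (the example tweet text and the names hits and is_political are made up for illustration):

```r
library(stringr)

political <- c("Clinton", "Obama")

# One string checked against every pattern: one logical per pattern
hits <- str_detect("Clinton wins debate", fixed(political, ignore_case = TRUE))

# TRUE if at least one pattern matched
is_political <- any(hits)
```

This covers a single string only; a whole column needs more care.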

Yes, but it does not test every string against every pattern, so one needs to take care with that. It works like map2: the first text is matched with the first pattern, the second text with the second pattern, and so on, with the shorter vector being recycled.

political <- c("Clinton", "Obama")
# ok
stringr::str_detect("Clinton", stringr::fixed(political, ignore_case = TRUE))
#> [1]  TRUE FALSE

# the second string is matched against "Obama", and the third recycles "Clinton"
# so we get a wrong result
stringr::str_detect(c("Clinton", "Clinton", "Obama"), stringr::fixed(political, ignore_case = TRUE))
#> Warning in stri_detect_fixed(string, pattern, negate = negate, opts_fixed
#> = opts(pattern)): longer object length is not a multiple of shorter object
#> length
#> [1]  TRUE FALSE FALSE

# the result depends on the order of both vectors
stringr::str_detect(c("Obama", "Clinton", "Bush"), stringr::fixed(political, ignore_case = TRUE))
#> Warning in stri_detect_fixed(string, pattern, negate = negate, opts_fixed
#> = opts(pattern)): longer object length is not a multiple of shorter object
#> length
#> [1] FALSE FALSE FALSE

Created on 2019-03-06 by the reprex package (v0.2.1)
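To test every string against every pattern instead of pairwise, one option is to loop over the strings and call any() on each per-string result. This is only a sketch using base R's vapply; the names tweets and matches_any are hypothetical.

```r
library(stringr)

political <- c("Clinton", "Obama")
tweets <- c("Obama", "Clinton", "Bush")

# For each tweet, check all patterns and collapse to a single TRUE/FALSE
matches_any <- vapply(
  tweets,
  function(x) any(str_detect(x, fixed(political, ignore_case = TRUE))),
  logical(1),
  USE.NAMES = FALSE
)

matches_any
#> [1]  TRUE  TRUE FALSE
```

Here the order of the two vectors no longer matters, because each tweet sees the full pattern vector.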

Vectorisation over the string argument is already used here, since each pattern is applied to the whole text column. One solution is to iterate over the strings to ignore, using functional programming with purrr rather than a plain for loop.

library(magrittr)
# TRUE for each element of text that does not contain fixed_string (rows to keep)
to_keep <- function(fixed_string, text) {
  !stringr::str_detect(text, stringr::fixed(fixed_string, ignore_case = TRUE))
}

# create a vector of strings to detect and ignore
ignored_string <- c("squat when #Putin annexed Crimea", "environmental pol")

# dummy table
df <- tibble::tibble(
  text = c(ignored_string, "dummy")
)

df <- df %>% 
  dplyr::filter(
    ignored_string %>%
      # apply the filter to all the text rows for each pattern
      # you get a list with one logical vector per pattern in ignored_string
      purrr::map(~ to_keep(.x, text = text)) %>%
      # get a logical vector of rows to keep
      purrr::pmap_lgl(all)
  )

df
#> # A tibble: 1 x 1
#>   text 
#>   <chr>
#> 1 dummy

Created on 2019-03-06 by the reprex package (v0.2.1)


Wow, impressive! I got the same filtered number as my long version. Thanks.

I'll have to read up and try to understand those purrr functions.

Think of purrr::map as an advanced lapply: it iterates over the elements one by one, applies a function to each, and returns a list. One interesting thing about purrr is that the map_* variants give you type-stable functions: map_lgl will always return a logical vector or fail trying.
purrr::pmap is an advanced map() in that it can iterate over the elements of several lists at once, hence its use with the function all(), which can take several arguments.
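A small sketch of the difference between these variants, on made-up inputs (x, as_list, as_lgl and both_true are illustrative names only):

```r
library(purrr)

x <- c(1, 5, 10)

# map() always returns a list, one element per input
as_list <- map(x, ~ .x > 3)

# map_lgl() returns a logical vector, or errors if a result is not a single logical
as_lgl <- map_lgl(x, ~ .x > 3)
as_lgl
#> [1] FALSE  TRUE  TRUE

# pmap_lgl() walks several vectors in parallel: all(TRUE, TRUE), then all(TRUE, FALSE)
both_true <- pmap_lgl(list(c(TRUE, TRUE), c(TRUE, FALSE)), all)
both_true
#> [1]  TRUE FALSE
```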

ignored_string %>%
      purrr::map(~ to_keep(.x, text = text)) %>%

This applies the to_keep function to each element of ignored_string, returning a list with the same length as ignored_string. Each element contains the result of to_keep (a logical vector with one TRUE or FALSE per element of text).

purrr::pmap_lgl(all)

allows iterating over that result, applying the all() function across its elements. It applies all to the first elements of every vector in the previous result, then to all the second elements, then to all the third elements, and so on. pmap would return a list with the same length as text, each element a single TRUE or FALSE. pmap_lgl ensures that the list is flattened to a logical vector.

Hope this helps, but going through it step by step is the best way to understand.
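Going through it step by step might look like the sketch below, which reuses to_keep and the dummy data from the answer but works on a plain character vector instead of a data frame column. The intermediate names per_pattern and keep are hypothetical.

```r
library(purrr)
library(stringr)

ignored_string <- c("squat when #Putin annexed Crimea", "environmental pol")
text <- c(ignored_string, "dummy")

to_keep <- function(fixed_string, text) {
  !str_detect(text, fixed(fixed_string, ignore_case = TRUE))
}

# Step 1: one logical vector per pattern, each as long as text
per_pattern <- map(ignored_string, ~ to_keep(.x, text = text))
per_pattern[[1]]
#> [1] FALSE  TRUE  TRUE
per_pattern[[2]]
#> [1]  TRUE FALSE  TRUE

# Step 2: combine position by position; keep a row only if every pattern says keep
keep <- pmap_lgl(per_pattern, all)
keep
#> [1] FALSE FALSE  TRUE

text[keep]
#> [1] "dummy"
```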


Thanks for the great explanation.

