Is it required to collapse a vector of strings into a single regular expression for str_subset?

colours <- c("^red$", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")

has_colour <- str_subset(sentences, colour_match) # 1
has_colour_test <- str_subset(sentences, colours) # 2

Why do #1 and #2 return different results (why doesn't #2 work)?


Welcome @archishman!

The help for str_subset says it is Vectorised over string and pattern. Since the string vector sentences has more elements, colours is recycled along the length of sentences. So, for example, the first element of sentences is tested against the pattern "^red$", the second element of sentences is tested against the pattern "orange", etc., and the pattern vector gets recycled for every successive group of six elements of sentences.
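
To make the recycling concrete, here is a minimal sketch in base R on a small made-up vector. The names x, patterns, recycled and hits are hypothetical and not from the original post. It pairs each string with the one pattern it would be tested against when the pattern vector is recycled. I use base R here because, as far as I know, stringr 1.5.0 and later apply stricter recycling rules and may reject a pattern vector whose length is neither 1 nor the length of the string vector.

```r
# Toy data for illustration only (not stringr::sentences)
x <- c("a red car", "an orange cat", "red", "green tea",
       "blue sky", "purple rain", "orange juice", "yellow sun")
patterns <- c("^red$", "orange", "yellow", "green", "blue", "purple")

# Repeat the patterns until they are as long as x: element i of x
# is paired with pattern ((i - 1) %% 6) + 1
recycled <- rep_len(patterns, length(x))

# Test each string against its own paired pattern only
hits <- mapply(function(s, p) grepl(p, s), x, recycled, USE.NAMES = FALSE)

# Show the pairing, then keep only the strings that matched their pattern
print(data.frame(string = x, pattern = recycled, match = hits))
matched_elementwise <- x[hits]
print(matched_elementwise)
```

In this toy vector, "red", "orange juice" and "yellow sun" are dropped even though each contains a listed colour, because each sits at a position paired with a different pattern.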

Thus, for example, "green" is in the 4th out of 6 positions in colours so it will match any element of sentences that contains the word "green" and whose index leaves a remainder of 4 when divided by 6:

sentences[grepl("green", sentences) & 1:length(sentences) %% 6 == 4]

[1] "The spot on the blotter was made by green ink."

And this is exactly what your second example matches for "green":

has_colour_test
[1] "The spot on the blotter was made by green ink." "A man in a blue sweater sat at the desk."      
[3] "The sky in the west is tinged with orange red."

We can do something similar for the other colors. For example:

sentences[grepl("orange", sentences) & 1:length(sentences) %% 6 == 2]

[1] "The sky in the west is tinged with orange red."

On the other hand, your first example provides a single regular expression to str_subset:

colour_match

"^red$|orange|yellow|green|blue|purple"

therefore every element of sentences is tested against that single regular expression.
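
As a quick check of the single-pattern behaviour, the sketch below uses base R's paste and grepl on a small made-up vector (x, patterns and single_pattern are hypothetical names). Because | has the lowest precedence in a regular expression, the ^ and $ anchors apply only to red, so "red" has to be the whole string while the other colours may appear anywhere in it.

```r
# Toy data for illustration only (not stringr::sentences)
x <- c("a red car", "an orange cat", "red", "green tea",
       "blue sky", "purple rain", "orange juice", "yellow sun")
patterns <- c("^red$", "orange", "yellow", "green", "blue", "purple")

# Base R counterpart of str_c(colours, collapse = "|")
single_pattern <- paste(patterns, collapse = "|")
print(single_pattern)

# Every string is tested against the same combined pattern
matched_single <- x[grepl(single_pattern, x)]
print(matched_single)
```

Only "a red car" is left out here, since it neither equals "red" exactly nor contains any of the other colours.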

To use colours but have the result come out the way we want, we could use map to check each element of sentences separately against every element of colours, or each element of colours against every element of sentences. Neither of these is as fast as str_subset(sentences, colour_match), but there may be other approaches that are faster.

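library(tidyverse)

# Check each sentence in turn against all of the patterns.
# Takes 40 times as long as str_subset(sentences, colour_match)
sentences %>% 
  map(~str_subset(.x, colours)) %>% 
  compact() %>%   # drop the empty results from sentences with no match
  unlist()

# Check all sentences against each pattern in turn.
# Takes twice as long as str_subset(sentences, colour_match)
colours %>%
  map(~str_subset(sentences, .x)) %>% 
  unlist() %>% 
  unique()        # a sentence can match more than one colour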
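
One other approach, offered as an untimed sketch rather than a recommendation: test every pattern against every string in base R, then keep the strings that matched at least one pattern. Unlike the second pipeline above, this keeps the strings in their original order and does not drop genuine duplicates. The toy vector and the names hit_matrix and any_hit are hypothetical.

```r
# Toy data for illustration only (not stringr::sentences)
x <- c("a red car", "an orange cat", "red", "green tea",
       "blue sky", "purple rain", "orange juice", "yellow sun")
patterns <- c("^red$", "orange", "yellow", "green", "blue", "purple")

# Logical matrix: one row per string, one column per pattern
# (assumes x has more than one element, so vapply returns a matrix)
hit_matrix <- vapply(patterns, function(p) grepl(p, x), logical(length(x)))

# A string is kept if any pattern matched it
any_hit <- rowSums(hit_matrix) > 0
matched_any <- x[any_hit]
print(matched_any)
```

On the toy data this gives the same strings as matching against the collapsed pattern, which is what we would expect from an "any pattern matches" test.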