Regex and str_detect(): matching whole words

wouterrstudio · May 28, 2020, 10:32am

I use str_detect() to filter a column against keywords stored in another column that I imported from Excel.

I want to exclude strings in the column trigram that contain the whole word hub (with a space before and after, or at the beginning or end of the string). Therefore, I entered hub\s in the Excel sheet. This is returned as "hub\\\\s" in R. What is the right thing to do here?

hf_trigram_excl_ad <- hf_trigrams_sorted %>% 
  filter(!str_detect(trigram, paste(hf_bus_kw$excl_ad_gr, collapse = "|")))

hf_bus_kw$excl_ad_gr
 [1] "hub\\\\s"

siddharthprabhu · May 28, 2020, 11:05am

In future, please try and post a reproducible example. It really makes it easier for others to help you.

You can solve this by using the word boundary metacharacter \b. I would recommend not modifying your Excel sheet and instead just generating the regular expression on-the-fly as shown below.

library(stringr)

words_to_exclude <- c("hub", "spoke")
vector_to_check <- c("at the hub dock", "hub master", "bespoke", "he spoke to me")

my_regex <- regex(paste("\\b", words_to_exclude, "\\b", sep = "", collapse = "|"))

str_detect(vector_to_check, my_regex)
#> [1]  TRUE  TRUE FALSE  TRUE

Created on 2020-05-28 by the reprex package (v0.3.0)
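To connect this back to the original filter, here is a minimal sketch that keeps plain words in the keyword column and adds the \b boundaries in R. The contents of hf_trigrams_sorted and hf_bus_kw below are made-up stand-ins, since the real data was not posted. It also prints the pattern with writeLines(), because print() shows every backslash doubled, which can make an imported pattern look more escaped than it really is.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-ins for the poster's data (the real objects were not shared)
hf_trigrams_sorted <- data.frame(
  trigram = c("at the hub", "hubcap for sale", "hub and spoke", "bespoke suit maker"),
  stringsAsFactors = FALSE
)
hf_bus_kw <- data.frame(excl_ad_gr = c("hub", "spoke"), stringsAsFactors = FALSE)

# Wrap each plain keyword in word boundaries and join the alternatives with "|"
pattern <- str_c("\\b", hf_bus_kw$excl_ad_gr, "\\b", collapse = "|")

# writeLines() shows the characters the regex engine actually receives
writeLines(pattern)
#> \bhub\b|\bspoke\b

# Drop every trigram that contains one of the keywords as a whole word
hf_trigram_excl_ad <- hf_trigrams_sorted %>%
  filter(!str_detect(trigram, pattern))

hf_trigram_excl_ad$trigram
#> [1] "hubcap for sale"    "bespoke suit maker"
```

Unlike hub\s, the \b version also catches the keyword at the very start or end of a string, where no whitespace follows or precedes it.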

wouterrstudio · May 28, 2020, 11:14am

Sorry, next time I will make it reproducible! Thanks a lot.

wouterrstudio · May 28, 2020, 11:55am

In addition, I would like to join with fuzzyjoin, matching on whole words only.

library(stringr)
library(fuzzyjoin)

df_with_strings <- data.frame(matrix(ncol = 1, nrow = 2))
x <- "match_column"
colnames(df_with_strings) <- x
df_with_strings$match_column <- c("this example should be included", "this example shouldn't be included")

df_match_words <- data.frame(matrix(ncol = 3, nrow = 1))
x <- c("word1", "word2", "matching_info")
colnames(df_match_words) <- x
df_match_words$word1 <- "example"
df_match_words$word2 <- "should"
df_match_words$matching_info <- "success"

matched_df <- fuzzy_left_join(df_with_strings, df_match_words, by = c(match_column = "word1", match_column = "word2"), match_fun = str_detect)

Current outcome

matched_df
                        match_column   word1  word2 matching_info
1    this example should be included example should       success
2 this example shouldn't be included example should       success

Preferred outcome

matched_df
                        match_column   word1  word2 matching_info
1    this example should be included example should       success

siddharthprabhu · May 28, 2020, 1:40pm

Thank you for the reproducible example. I'm not sure why you're making the data frame creation so convoluted. I've simplified it below. Hope you find it an easier approach.

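As for the problem at hand, I think you need to use regex_inner_join() instead of fuzzy_left_join() with str_detect, after wrapping each word in \b word boundaries. An inner join also drops the rows that do not match, which is what your preferred outcome shows.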

library(stringr)
library(fuzzyjoin)

df_with_strings <- data.frame(match_column = c("this example should be included", 
                                               "this example shouldn't be included"))

df_match_words <- data.frame(word1 = "example", word2 = "should", matching_info = "success")

df_match_words <- dplyr::mutate(df_match_words, 
                                word1_regex = str_c("\\b", word1, "\\b"),
                                word2_regex = str_c("\\b", word2, "\\b"))

regex_inner_join(df_with_strings, df_match_words, 
                by = c(match_column = "word1_regex", match_column = "word2_regex"))
#>                      match_column   word1  word2 matching_info   word1_regex
#> 1 this example should be included example should       success \\bexample\\b
#>    word2_regex
#> 1 \\bshould\\b

Created on 2020-05-28 by the reprex package (v0.3.0)
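To see why the second row drops out, the per-row check can be reproduced with str_detect() alone. This is only an illustrative sketch of the matching condition, not of how fuzzyjoin works internally: a row survives only if both bounded patterns are found. The last line adds an extra string, not from the thread, to show a limitation: an apostrophe counts as a non-word character, so \b still sees a boundary in front of it.

```r
library(stringr)

strings <- c("this example should be included",
             "this example shouldn't be included")

# Same patterns as the word1_regex / word2_regex columns above
word1_regex <- str_c("\\b", "example", "\\b")
word2_regex <- str_c("\\b", "should", "\\b")

# A row matches only when both whole words are present
both_match <- str_detect(strings, word1_regex) & str_detect(strings, word2_regex)
both_match
#> [1]  TRUE FALSE

# Caveat: "should've" still matches, because the apostrophe ends the word
str_detect("you should've asked", word2_regex)
#> [1] TRUE
```

In "shouldn't" the letter n follows "should" directly, so there is no boundary and the row is rejected. If contractions such as "should've" must also be excluded, the pattern would need something stricter than \b.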
