I use str_detect() to filter a column for words listed in another column that I imported from Excel.
I want to exclude strings in the column trigram that contain the whole word hub (with a space before and after it, or at the beginning or end of the string). Therefore, I entered hub\s in the Excel sheet. This comes back as "hub\\\\s" in R. What is the right thing to do here?
In future, please try to post a reproducible example. It really makes it easier for others to help you.
You can solve this by using the word boundary metacharacter \b. I would recommend not modifying your Excel sheet and instead just generating the regular expression on-the-fly as shown below.
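A minimal sketch of that idea, assuming the Excel column holds plain words such as hub with no regex syntax. The names exclude_words and trigrams are hypothetical stand-ins for your imported data, and the words are assumed to contain no regex metacharacters.

```r
library(stringr)

# Hypothetical stand-ins for the imported Excel column and the trigram data
exclude_words <- c("hub")
trigrams <- data.frame(
  trigram = c("the hub is", "hub and spoke", "github pull request", "central hub"),
  stringsAsFactors = FALSE
)

# Wrap the words in \b so they only match as whole words
# (the backslash has to be doubled inside an R string literal)
pattern <- str_c("\\b(?:", str_c(exclude_words, collapse = "|"), ")\\b")

# Keep only the rows whose trigram contains none of the excluded whole words
kept <- trigrams[!str_detect(trigrams$trigram, pattern), , drop = FALSE]
kept$trigram
#> [1] "github pull request"
```

"github pull request" survives because the hub inside github is preceded by a letter, so there is no word boundary in front of it.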
In addition, I would like to match with fuzzyjoin based on whole words.
library(stringr)
library(fuzzyjoin)
df_with_strings <- data.frame(matrix(ncol = 1, nrow = 2))
x <- "match_column"
colnames(df_with_strings) <- x
df_with_strings$match_column <- c("this example should be included", "this example shouldn't be included")
df_match_words <- data.frame(matrix(ncol = 3, nrow = 1))
x <- c("word1", "word2", "matching_info")
colnames(df_match_words) <- x
df_match_words$word1 <- "example"
df_match_words$word2 <- "should"
df_match_words$matching_info <- "success"
matched_df <- fuzzy_left_join(df_with_strings, df_match_words, by = c(match_column = "word1", match_column = "word2"), match_fun = str_detect)
Current outcome
matched_df
match_column word1 word2 matching_info
1 this example should be included example should success
2 this example shouldn't be included example should success
Preferred outcome
matched_df
match_column word1 word2 matching_info
1 this example should be included example should success
Thank you for the reproducible example. I'm not sure why you're making the data frame creation so convoluted. I've simplified it below. Hope you find it an easier approach.
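As for the problem at hand, I think you need to use regex_join() (here its inner-join variant, regex_inner_join()) instead of fuzzy_join(), with each word wrapped in \b word boundaries so that it only matches as a whole word.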
library(stringr)
library(fuzzyjoin)
df_with_strings <- data.frame(match_column = c("this example should be included",
                                               "this example shouldn't be included"))
df_match_words <- data.frame(word1 = "example", word2 = "should", matching_info = "success")
# Wrap each word in \b (word boundary) so it only matches as a whole word
df_match_words <- dplyr::mutate(df_match_words,
                                word1_regex = str_c("\\b", word1, "\\b"),
                                word2_regex = str_c("\\b", word2, "\\b"))
# Keep only the rows where both boundary-wrapped patterns match
regex_inner_join(df_with_strings, df_match_words,
                 by = c(match_column = "word1_regex", match_column = "word2_regex"))
#> match_column word1 word2 matching_info word1_regex
#> 1 this example should be included example should success \\bexample\\b
#> word2_regex
#> 1 \\bshould\\b
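To see why the boundary-wrapped pattern drops the second row, it may help to run the check directly with str_detect(). This is only an illustrative sketch using the two strings from the example; the variable names are made up for this snippet.

```r
library(stringr)

strings <- c("this example should be included",
             "this example shouldn't be included")

# Plain substring match: "should" is also found inside "shouldn't"
substring_hits <- str_detect(strings, "should")
substring_hits
#> [1] TRUE TRUE

# Whole-word match: in "shouldn't" the letter "n" follows "should",
# so there is no word boundary after it and the pattern does not match
whole_word_hits <- str_detect(strings, "\\bshould\\b")
whole_word_hits
#> [1]  TRUE FALSE
```

The first result corresponds to the current outcome of the original join, the second to the preferred outcome.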