Word boundaries in str_detect function in R

Hi everyone, I need to search for some words in this data frame.

data <- data.frame(stringsAsFactors = FALSE,
Id_text = c("1", "2", "3"),
Text = c("What I really feel is necessary is that the black people in this country wil have to upset this apple cart. We can no longer ignore the fact that America is not the... land of the free and the home of the brave",
"This Article shall not apply to pineapples produced in the Azores.", "Particularly suitable for watering vegetables, pineapples, sugar cane and bananas")
)

dictionary <- c("apple", "pineapple", "pine", "pineapples")

So I run this:

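library(dplyr)   # bind_cols(), mutate_all(), as_tibble(), %>%
library(purrr)   # set_names(), map_dfc()
library(stringr) # str_detect()

# One 0/1 column per dictionary term: 1 if the term is found in Text
data %>%
  bind_cols(dictionary %>%
              set_names() %>%
              map_dfc(~str_detect(data$Text, .x)) %>%
              mutate_all(as.numeric)) %>%
  as_tibble()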

So I have two problems:

  • the word "apple" matches "pineapple"
  • the word "pine" matches "pineapple"

I could use "\b" or the boundary() function, but I have 20 different dictionaries that I will import from an Excel document, and I can't figure it out.
Thank you!

Does it work for you to use the paste0() function to add the word boundaries to your dictionary?

data <- data.frame(stringsAsFactors = FALSE,
                   Id_text = c("1", "2", "3"),
                   Text = c("What I really feel is necessary is that the black people in this country wil have to upset this apple cart. We can no longer ignore the fact that America is not the... land of the free and the home of the brave",
                            "This Article shall not apply to pineapples produced in the Azores.", 
                            "Particularly suitable for watering vegetables, pineapples, sugar cane and bananas")
)

dictionary <- c("apple", "pineapple", "pine", "pineapples")
dictionary <- paste0("\\b", dictionary, "\\b")

  
data %>%
  bind_cols(dictionary %>%
              set_names() %>%
              map_dfc(~str_detect(data$Text, .x)) %>%
              mutate_all(as.numeric)) %>%
  as_tibble()
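
As a quick check of what the word boundaries change, here is a minimal sketch that uses only stringr. The helper name add_boundaries and the list dictionaries are made up for illustration; they only stand in for the 20 dictionaries that would be imported from Excel, and the second dictionary is invented.

```r
library(stringr)

# Without boundaries, "apple" is found inside "pineapples"
str_detect("pineapples produced in the Azores", "apple")        # TRUE
# With boundaries, only the standalone word matches
str_detect("pineapples produced in the Azores", "\\bapple\\b")  # FALSE
str_detect("upset this apple cart", "\\bapple\\b")              # TRUE

# Hypothetical helper: wrap every term of a dictionary in \b ... \b
add_boundaries <- function(terms) paste0("\\b", terms, "\\b")

# Hypothetical stand-in for several dictionaries read from Excel
dictionaries <- list(
  fruit  = c("apple", "pineapple", "pine", "pineapples"),
  places = c("America", "Azores")
)

# Apply the same wrapping to every dictionary in one step
patterns <- lapply(dictionaries, add_boundaries)
patterns$fruit[1]  # the regex \bapple\b, stored as "\\bapple\\b"
```

Each element of patterns can then be passed through the same set_names() and map_dfc() pipeline as above. Note that this assumes the dictionary terms contain no regex metacharacters such as "." or "(".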

Yes, it runs! Thank you, FJCC! I'm wondering whether it is possible to handle this situation...

data <- data.frame(stringsAsFactors = FALSE,
                   Id_text = c("1", "2", "3"),
                   Text = c("What I really feel is necessary is that the black people in this country wil have to upset this apple cart. We can no longer ignore the fact that America is not the... land of the free and the home of the brave",
                            "This Article shall not apply to pineapples produced in the Azores.", 
                            "Particularly suitable for watering vegetables, pineapples, sugar cane and bananas")
)

dictionary <- c("apple", "pineapple", "pine", "people black")
dictionary <- paste0("\\b", dictionary, "\\b")

Created on 2019-12-24 by the reprex package (v0.3.0)

As you can see, I would like to search for two terms together, "people" and "black", but "people black" doesn't match "black people", which appears in the first Id_text.
What do you suggest I do?

I have no experience searching through text, so there are probably better answers than what I can provide. If you want to search for text that contains both the words "people" and "black", you could try

dictionary <- c("apple", "pineapple", "pine", "pineapples", "people\\b.+\\bblack\\b|\\bblack\\b.+\\bpeople")
dictionary <- paste0("\\b", dictionary, "\\b")

You might want to start a separate topic for this question. That will make it more likely that someone with experience handling text will see it.
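
For the two-word case, another possible approach is to test each word separately and combine the results with a logical AND, so the order of the words no longer matters. This is only a sketch: the helper name detect_all_words and the sample text vector are made up for illustration.

```r
library(stringr)

# Made-up sample strings for illustration
text <- c(
  "the black people in this country",
  "pineapples produced in the Azores",
  "people of every colour"
)

# Hypothetical helper: TRUE only when every word occurs as a whole word,
# in any order
detect_all_words <- function(string, words) {
  patterns <- paste0("\\b", words, "\\b")
  # One logical vector per word, each as long as `string`
  hits <- lapply(patterns, function(p) str_detect(string, p))
  # Element-wise AND across all words
  Reduce(`&`, hits)
}

detect_all_words(text, c("people", "black"))
# TRUE FALSE FALSE
```

Compared with a single regex using ".+", this scales to more than two words without listing every ordering, at the cost of handling multi-word entries separately from the single-word dictionary terms.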
