matching words in a string with regex

cereghetti · April 27, 2021, 5:31pm

Hi there! I have a df like this:

df<-data.frame(products=c('1 kg pears','appears to be a dog','a pear','apples red','red apple','1 kg anana','1 kg banana'))

and I have a vector of products:

vector<-c('pear','apple','banana','anana')

I need to classify each product in df, based on the words in the vector. I was thinking about something like

df$class<-NA

for(i in 1:length(vector)){

rows_product<-which(grepl(vector[[i]],df[[1]]))
df$class[rows_product]<-vector[[i]]
}

But I realized I need the match to occur only at the start of a word, so that 'pear' does not match 'appears' and 'anana' does not match 'banana'.
Is there any way I can do this? I think there might be a way to do it with regex, but I could not find how.

Jwvz001 · April 27, 2021, 8:35pm

Hi! Are you looking for something like this? Use '\s' to match a preceding whitespace character. Below I omitted "banana" from the vector to show that 'anana' does indeed not match 'banana', as you want.

JW

library(tidyverse)                                                            
df<-data.frame(products=c('1 kg pears',                                       
                          'appears to be a dog',                              
                          'a pear','apples red',                              
                          'red apple',                                        
                          '1 kg anana',                                       
                          '1 kg banana'))                                     
                                                                              
# add \s to match a preceding whitespace character (note: extra \ to escape...)
vector<-c('\\spear','apple','\\sanana')  # took out banana
                                                                              
df %>% mutate(match = str_detect(products , paste(vector, collapse = "|")))   
#              products match                                                 
# 1          1 kg pears  TRUE                                                 
# 2 appears to be a dog FALSE                                                 
# 3              a pear  TRUE                                                 
# 4          apples red  TRUE                                                 
# 5           red apple  TRUE                                                 
# 6          1 kg anana  TRUE                                                 
# 7         1 kg banana FALSE
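
To make the "extra \ to escape" remark concrete, here is a minimal sketch of what the combined pattern looks like once R has processed the string escapes. The names `patterns` and `combined` are mine, not from the post, and `perl = TRUE` is just one reasonable engine choice for the base-R check.

```r
# Same three patterns as in the reply above
patterns <- c('\\spear', 'apple', '\\sanana')

# Join them into one alternation, as paste(vector, collapse = "|") does
combined <- paste(patterns, collapse = '|')

# writeLines shows the string the regex engine actually receives:
# each '\\' in the R source is a single backslash in the pattern
writeLines(combined)   # prints \spear|apple|\sanana
nchar('\\s')           # 2 characters: a backslash and an s

# Base-R check of the combined pattern on two of the products
hits <- grepl(combined, c('a pear', 'appears to be a dog'), perl = TRUE)
hits                   # TRUE FALSE
```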

cereghetti · April 29, 2021, 1:09pm

Thank you for your answer, but is there any way to also match at the start of the string? For example, with something like this:

str_detect(c('pear','a pear'),'\\spear')

It would match in the second case but not in the first one. Is there any way to have both?

Thank you!

Jwvz001 · April 29, 2021, 1:43pm

Oh, sorry, I see. In that case try \b, which matches a word boundary (here, the beginning of the word)...

You can try regex-es out at regex101. Have a look here.

JW

> str_detect(c('pear','a pear'),'\\bpear')
[1] TRUE TRUE
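
For completeness, here is a rough sketch of how \b could be plugged back into the loop from the original question, using base R only. It assumes the entries of the vector are plain words with no regex metacharacters (otherwise they would need escaping first); the name `words` is mine, chosen to avoid masking the base function `vector`.

```r
# Data from the original question
df <- data.frame(products = c('1 kg pears', 'appears to be a dog', 'a pear',
                              'apples red', 'red apple', '1 kg anana',
                              '1 kg banana'))
words <- c('pear', 'apple', 'banana', 'anana')

df$class <- NA_character_
for (w in words) {
  # Prefix each word with \b so it only matches at the start of a word:
  # 'pear' no longer matches inside 'appears', nor 'anana' inside 'banana'
  hits <- grepl(paste0('\\b', w), df$products, perl = TRUE)
  df$class[hits] <- w
}

df$class
# [1] "pear"   NA       "pear"   "apple"  "apple"  "anana"  "banana"
```

Because only the start of the word is anchored, 'pear' still matches 'pears' and 'apple' still matches 'apples'. If a product matched more than one word, the later word in the vector would overwrite the earlier one.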
