Exclude keywords preceded by certain words

Hi R experts,

I would like to find a keyword (e.g., 'cat') in a list of sentences, but leave out the sentences in which the keyword is preceded by "no" or "not" within a three-word distance. For instance, I have five sentences:
s1: "I have a cat."
s2: "I have no cat"
s3: "I did not have any cat"
s4: "I did not have any dog but I have a cat"
s5: "I have a cat but no dogs"

What is the regular expression I can use with grepl() to find only s1, s4 and s5?

Thanks

Best,
Veda

The following works for your example cases:

x = c("I have a cat.",
      "I have no cat",
      "I did not have any cat",
      "I did not have any dog but I have a cat",
      "I have a cat but no dogs")

x[!grepl("\\bnot?\\W+(\\w+\\W+){0,3}cat\\b", x)]
[1] "I have a cat."                          
[2] "I did not have any dog but I have a cat"
[3] "I have a cat but no dogs"  
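
One caveat: the filter above only drops sentences with a negated cat, so a sentence that never mentions cat at all would be kept as well. Here is a minimal sketch that also requires the keyword, assuming that is what you want. The names has_cat and negated and the extra sixth sentence are my own illustrative additions, not part of the original example.

```r
x <- c("I have a cat.",
       "I have no cat",
       "I did not have any cat",
       "I did not have any dog but I have a cat",
       "I have a cat but no dogs",
       "I have a dog")  # extra sentence (not in the original post): no keyword at all

# TRUE where the keyword appears as a whole word
has_cat <- grepl("\\bcat\\b", x)

# TRUE where "no"/"not" precedes the keyword with at most three words in between
negated <- grepl("\\bnot?\\W+(\\w+\\W+){0,3}cat\\b", x)

# keep sentences that mention the keyword and are not negated
x[has_cat & !negated]
```

With these inputs, only the first, fourth and fifth sentences should remain.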

\\b is a word boundary
not? matches no or not (the t is optional)
\\w+ matches one or more word characters, i.e., a word
\\W+ matches one or more non-word characters (spaces, punctuation)
{0,3} means match the preceding group from zero to three times

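Putting these together, (\\w+\\W+){0,3} matches a word followed by one or more non-word characters, repeated zero to three times (i.e., zero to three words between no or not and cat). grepl() therefore returns TRUE for the negated sentences, and the leading ! inverts that so they are dropped.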
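
To see where the window ends, here is a small check on made-up phrases. The names pat and y are illustrative, and the phrases are not from the original post.

```r
pat <- "\\bnot?\\W+(\\w+\\W+){0,3}cat\\b"

y <- c("no cat",                   # 0 words between "no" and "cat"
       "not a very big cat",       # 3 words in between
       "not a very very big cat")  # 4 words in between

grepl(pat, y)
# expected: TRUE TRUE FALSE
```

If "within three-word distance" is meant to allow at most two words in between, changing the bound to {0,2} should do it.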

I adapted this tutorial to put together my answer.
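
Since the question asks for a single expression to use with grepl(), one possible way to fold both conditions into one pattern is a negative lookahead, which needs perl = TRUE. This is a sketch under that assumption rather than something taken from the tutorial, and single is an illustrative name.

```r
x <- c("I have a cat.",
       "I have no cat",
       "I did not have any cat",
       "I did not have any dog but I have a cat",
       "I have a cat but no dogs")

# (?! ... ) fails the match if a negated "cat" occurs anywhere in the sentence;
# the trailing .*\\bcat\\b then requires the keyword itself to be present
single <- "^(?!.*\\bnot?\\W+(\\w+\\W+){0,3}cat\\b).*\\bcat\\b"

grepl(single, x, perl = TRUE)
# expected: TRUE FALSE FALSE TRUE TRUE
```

Adding ignore.case = TRUE to the grepl() call should also catch "No" and "Not" at the start of a sentence, if that matters for your data.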
