I would like to find a keyword (e.g., 'cat') in a list of sentences, but I need to leave out the sentences in which the keyword is preceded by "no" or "not" within a three-word distance. For instance, I have five sentences:
s1: "I have a cat."
s2: "I have no cat"
s3: "I did not have any cat"
s4: "I did not have any dog but I have a cat"
s5: "I have a cat but no dogs"
What is the regular expression I can use with grepl() to find only s1, s4 and s5?
x = c("I have a cat.",
"I have no cat",
"I did not have any cat",
"I did not have any dog but I have a cat",
"I have a cat but no dogs")
# Keep the sentences where "cat" is NOT preceded by "no"/"not" with at most three words in between
x[!grepl("\\bnot?\\W+(\\w+\\W+){0,3}cat\\b", x)]
[1] "I have a cat."
[2] "I did not have any dog but I have a cat"
[3] "I have a cat but no dogs"
\\b is a word boundary
not? matches "no" or "not" (the t is optional)
\\w+ matches one or more word characters (i.e., a word)
\\W+ matches one or more non-word characters (spaces, punctuation)
{0,3} repeats the preceding group zero to three times
Putting these together, (\\w+\\W+){0,3} matches a word followed by one or more non-word characters, repeated zero to three times (i.e., zero to three words between "no" or "not" and "cat").
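If the keyword or the allowed gap needs to vary, the pattern can be built with sprintf(). The following is only a sketch: the helper name keep_unnegated and its arguments max_gap and ignore_case are made up for illustration, and it assumes the keyword is a plain word with no regex metacharacters. Unlike the one-liner above, it also requires the keyword to be present at all.

```r
# Hypothetical helper (not part of base R): TRUE for each sentence that
# contains `keyword` and has no "no"/"not" within `max_gap` intervening
# words before it. Assumes `keyword` contains no regex metacharacters.
keep_unnegated <- function(x, keyword = "cat", max_gap = 3, ignore_case = FALSE) {
  # Same pattern as above, with the gap and the keyword filled in
  negated <- sprintf("\\bnot?\\W+(\\w+\\W+){0,%d}%s\\b",
                     as.integer(max_gap), keyword)
  # Extra check: the keyword must occur as a whole word
  present <- sprintf("\\b%s\\b", keyword)
  grepl(present, x, ignore.case = ignore_case) &
    !grepl(negated, x, ignore.case = ignore_case)
}

x <- c("I have a cat.",
       "I have no cat",
       "I did not have any cat",
       "I did not have any dog but I have a cat",
       "I have a cat but no dogs")

x[keep_unnegated(x)]
```

Two caveats apply to the one-liner as well. The match is case-sensitive by default, so "Not a cat" is only excluded with ignore.case = TRUE. And the pattern looks at word distance only, so a sentence such as "I have no dog but a cat" is excluded too, because "no" sits within three words of "cat".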