ID Text
23 Patient has a probable chance of diabetes
78 He had to go to the doctor today
92 He mentioned possible diabetes
83 Patient likely has diabetes
45 She told us about her family history
I want to find the rows whose text mentions "probable" or "possible" diabetes, and return the IDs together with their matching text. In this case, I want to return this dataset:
ID Text
23 Patient has a probable chance of diabetes
92 He mentioned possible diabetes
library(tidyverse)
# d is the data frame from the question, with columns ID and Text
d2 = d %>%
  filter(grepl("probable|possible", Text, ignore.case=TRUE))
The above will return all rows where either word occurs, even if the word doesn't occur in the context of diabetes. The search pattern can be refined if your real situation is more complex.
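To check this on the question's data, here is a minimal self-contained sketch. The construction of `d` is an assumption (the question only shows the table, not how it was created), and the column names `ID` and `Text` are taken from the table header.

```r
library(tidyverse)

# Assumed reconstruction of the question's table
d <- tibble(
  ID = c(23, 78, 92, 83, 45),
  Text = c("Patient has a probable chance of diabetes",
           "He had to go to the doctor today",
           "He mentioned possible diabetes",
           "Patient likely has diabetes",
           "She told us about her family history")
)

# Keep rows whose Text contains either word, ignoring case
d2 <- d %>%
  filter(grepl("probable|possible", Text, ignore.case = TRUE))

d2$ID
#> [1] 23 92
```

Both `ID` and `Text` are kept in `d2`, because `filter()` only drops rows, not columns.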
Your example will match the case where "probable" or "possible" are followed by one space and then "diabetes". Here are a couple of additional examples. I'm not an expert with regular expressions and there may be better or more efficient ways to implement these.
Fake data
library(tidyverse)
dd = tibble(var=c('probable for diabetes',
                  'diabetes probable',
                  'Diabetes probable',
                  'diabetes is the most probable',
                  'possible for diabetes',
                  'diabetes is unlikely but insulin resistance is probable',
                  'probable',
                  'diabetes',
                  'something else'))
Diabetes must appear anywhere before or after "probable" or "possible". (?i) makes it case insensitive (which you can also do with the ignore.case=TRUE argument):
dd %>%
  filter(grepl("(?i)diabetes.*(probable|possible)", var) |
         grepl("(?i)(probable|possible).*diabetes", var))
#> # A tibble: 6 x 1
#> var
#> <chr>
#> 1 probable for diabetes
#> 2 diabetes probable
#> 3 Diabetes probable
#> 4 diabetes is the most probable
#> 5 possible for diabetes
#> 6 diabetes is unlikely but insulin resistance is probable
Diabetes must appear within four words before or after "probable" or "possible".
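\\w+ means one or more word characters (letters, digits, underscore)
\\W+ means one or more non-word characters (e.g., white space or punctuation)
{0,4} means zero to 4 repetitions of the preceding group; the trailing ? makes the quantifier lazy, so it matches as few repetitions as possible
(?:...) means a non-capturing group: it groups \\w+\\W+ (one word plus the separator after it) so that {0,4} applies to the whole unit, without saving the matched text as a capture group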
dd %>%
  filter(grepl("(?i)(probable|possible)\\W+(?:\\w+\\W+){0,4}?diabetes", var) |
         grepl("(?i)diabetes\\W+(?:\\w+\\W+){0,4}?(probable|possible)", var))
#> # A tibble: 5 x 1
#> var
#> <chr>
#> 1 probable for diabetes
#> 2 diabetes probable
#> 3 Diabetes probable
#> 4 diabetes is the most probable
#> 5 possible for diabetes
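If the window size needs to vary, the same pattern can be built from a parameter. The following is only a sketch: `is_near()` and its arguments are hypothetical names of mine, not functions from any package, and I pass `perl = TRUE` (my choice, not used above) so the inline `(?i)` flag and lazy quantifier are handled by the PCRE engine.

```r
# Hypothetical helper: TRUE where a match for `a` and a match for `b`
# occur with at most `max_words` words between them, in either order.
is_near <- function(x, a, b, max_words = 4) {
  # One separator, then up to max_words "word + separator" units (lazy)
  gap <- sprintf("\\W+(?:\\w+\\W+){0,%d}?", max_words)
  a_then_b <- paste0("(?i)(?:", a, ")", gap, "(?:", b, ")")
  b_then_a <- paste0("(?i)(?:", b, ")", gap, "(?:", a, ")")
  grepl(a_then_b, x, perl = TRUE) | grepl(b_then_a, x, perl = TRUE)
}

x <- c("probable for diabetes",
       "diabetes is the most probable",
       "diabetes is unlikely but insulin resistance is probable",
       "probable")

is_near(x, "probable|possible", "diabetes")
#> [1]  TRUE  TRUE FALSE FALSE

# Widening the window to six words also picks up the third string
is_near(x, "probable|possible", "diabetes", max_words = 6)
#> [1]  TRUE  TRUE  TRUE FALSE
```

With the fake data above it could be used as dd %>% filter(is_near(var, "probable|possible", "diabetes")). Note that the sketch adds no word boundaries, so "impossible" would also count as a match for "possible".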