Regular Expressions and IDs

Say I have an example of data like this:

ID      Text
23      Patient has a probable chance of diabetes
78      He had to go to the doctor today
92      He mentioned possible diabetes
83      Patient likely has diabetes
45      She told us about her family history

I want to find the rows whose text mentions "probable" or "possible" diabetes, and return the IDs together with their matching text. In this case, I want to return a dataset of:

ID      Text
23      Patient has a probable chance of diabetes
92      He mentioned possible diabetes

How about this (where d is your data frame):

library(tidyverse)

d2 = d %>% 
  filter(grepl("probable|possible", Text, ignore.case=TRUE))

The above will return all rows where either word occurs, even if the word doesn't occur in the context of diabetes. The search pattern can be refined if your real situation is more complex.
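For anyone who wants to try this without installing anything, here is a minimal sketch of the same idea. It rebuilds the example table from the question as a data frame named d (the construction is my assumption about how the data is stored) and uses plain base R subsetting in place of filter(); the grepl() call is the same.

```r
# Rebuild the example data from the question (assumed structure)
d <- data.frame(
  ID = c(23, 78, 92, 83, 45),
  Text = c("Patient has a probable chance of diabetes",
           "He had to go to the doctor today",
           "He mentioned possible diabetes",
           "Patient likely has diabetes",
           "She told us about her family history"),
  stringsAsFactors = FALSE
)

# grepl() returns one TRUE/FALSE per row; use it to keep the matching rows
hits <- grepl("probable|possible", d$Text, ignore.case = TRUE)
d2 <- d[hits, ]
d2
```

Both columns are kept, so d2 holds the IDs alongside their matching text.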

Can I ask what that would look like if I wanted to find "probable" or "possible" right next to "diabetes"?

Would it be

d2 = d %>%
  filter(grepl("probable|possible diabetes", Text, ignore.case=TRUE))

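Not quite. Alternation (|) has the lowest precedence in a regular expression, so "probable|possible diabetes" matches either "probable" on its own or "possible diabetes". To require either word followed by one space and then "diabetes", group the alternatives: "(probable|possible) diabetes". Here are a couple of additional examples. I'm not an expert with regular expressions and there may be better or more efficient ways to implement these.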
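As a quick check of how the alternation groups, the sketch below compares the pattern from the question with a parenthesised version. The test strings in x are made up for illustration. The | operator binds more loosely than the characters around it, so without parentheses the two alternatives are "probable" and "possible diabetes".

```r
# Made-up test strings
x <- c("probable diabetes", "possible diabetes", "probable cause", "possible")

# Ungrouped: the alternatives are "probable" and "possible diabetes"
ungrouped <- grepl("probable|possible diabetes", x, ignore.case = TRUE)

# Grouped: either word, then exactly one space, then "diabetes"
grouped <- grepl("(probable|possible) diabetes", x, ignore.case = TRUE)

data.frame(x, ungrouped, grouped)
```

Only the grouped version rejects "probable cause".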

Fake data

library(tidyverse)

dd = tibble(var=c('probable for diabetes',
                  'diabetes probable',
                  'Diabetes probable',
                  'diabetes is the most probable',
                  'possible for diabetes',
                  'diabetes is unlikely but insulin resistance is probable',
                  'probable',
                  'diabetes',
                  'something else'))

Diabetes must appear anywhere before or after "probable" or "possible". (?i) makes it case insensitive (which you can also do with the ignore.case=TRUE argument):

dd %>% 
  filter(grepl("(?i)diabetes.*(probable|possible)", var) |
           grepl("(?i)(probable|possible).*diabetes", var))

#> # A tibble: 6 x 1
#>   var                                                    
#>   <chr>                                                  
#> 1 probable for diabetes                                  
#> 2 diabetes probable                                      
#> 3 Diabetes probable                                      
#> 4 diabetes is the most probable                          
#> 5 possible for diabetes                                  
#> 6 diabetes is unlikely but insulin resistance is probable
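
The two grepl() calls can probably be folded into one pattern with lookaheads, if that is easier to maintain. This is only a sketch of an alternative, not what the code above does: it needs perl=TRUE, and it assumes each string is a single line, since . does not match a newline by default.

```r
# Same fake data as above, as a plain character vector
x <- c("probable for diabetes", "diabetes probable", "Diabetes probable",
       "diabetes is the most probable", "possible for diabetes",
       "diabetes is unlikely but insulin resistance is probable",
       "probable", "diabetes", "something else")

# Each (?=...) is a lookahead that must succeed from the start of the string:
# one requires "diabetes" somewhere, the other "probable" or "possible"
pattern <- "^(?=.*diabetes)(?=.*(probable|possible))"
both <- grepl(pattern, x, perl = TRUE, ignore.case = TRUE)
x[both]
```

Because the lookaheads do not care about order, one pattern covers both the "before" and "after" cases.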

Diabetes must appear with at most four words between it and "probable" or "possible", in either order.

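  • \\w+ means one or more word characters (letters, digits, underscore; a dash is not a word character)
  • \\W+ means one or more non-word characters (e.g., white space, punctuation)
  • {0,4} means zero to 4 repetitions of the preceding group; the trailing ? makes the repetition lazy, so it tries as few repetitions as possible first
  • (?:...) means a non-capturing group: it bundles "a word plus the separator after it" so that {0,4} applies to the whole unit, without saving it as a separate captured group. The skipped words are still part of the overall match.

dd %>% 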
  filter(grepl("(?i)(probable|possible)\\W+(?:\\w+\\W+){0,4}?diabetes", var) |
           grepl("(?i)diabetes\\W+(?:\\w+\\W+){0,4}?(probable|possible)", var))

#> # A tibble: 5 x 1
#>   var                          
#>   <chr>                        
#> 1 probable for diabetes        
#> 2 diabetes probable            
#> 3 Diabetes probable            
#> 4 diabetes is the most probable
#> 5 possible for diabetes
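
If the window size needs to change, the pattern can be generated instead of typed by hand. The helper below, near_pattern(), is a hypothetical function written for this post and is not part of any package. It is a sketch that assumes the two terms are already valid regex fragments, and it uses perl=TRUE.

```r
# Hypothetical helper: regex for "a within n words of b", in either order
near_pattern <- function(a, b, n) {
  # gap = separator, then up to n (word + separator) units, lazily
  gap <- sprintf("\\W+(?:\\w+\\W+){0,%d}?", n)
  paste0("(?:", a, ")", gap, "(?:", b, ")|",
         "(?:", b, ")", gap, "(?:", a, ")")
}

x <- c("probable for diabetes", "diabetes probable", "Diabetes probable",
       "diabetes is the most probable", "possible for diabetes",
       "diabetes is unlikely but insulin resistance is probable",
       "probable", "diabetes", "something else")

# Up to four words in between, then up to one word in between
within4 <- grepl(near_pattern("probable|possible", "diabetes", 4), x,
                 perl = TRUE, ignore.case = TRUE)
within1 <- grepl(near_pattern("probable|possible", "diabetes", 1), x,
                 perl = TRUE, ignore.case = TRUE)
x[within4]
x[within1]
```

With n = 4 this keeps the same five strings as the output above. With n = 1 it also drops "diabetes is the most probable", which has three words in between.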

Created on 2020-07-10 by the reprex package (v0.3.0)
