Trying to turn key phrases in a text column into a new column with yes/no

This post was suggested as a possible solution, but I couldn't follow what the OP was describing, so here is my own question, starting with a reprex. I also watched text mining videos (all 6, which were quite good), but they didn't help with this task. I basically copied StatSteph's solution to a similar issue from a previous post of mine.

The issue is that the phrases I put in wordlist weren't picked up, so some rows that should say yes currently say no.

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.6.2
#> Warning: package 'tibble' was built under R version 3.6.2
#> Warning: package 'tidyr' was built under R version 3.6.2
#> Warning: package 'dplyr' was built under R version 3.6.2
#> Warning: package 'stringr' was built under R version 3.6.2
reprex_mrn<-c("2449754", "2001748", "9023298", "2452107", "4174678", "6310355", "9338915", "2459626", "0423163")
reprex_us<-c("no significant pericardial effusion, hyperdynamic LV systolic function", "no pericardial effusion, enlarged akinetic RV, no organized cardiac function", "hyperdynamic with poor LV filling and large RV", "No Pericardial Effusion, no clot visualized", "very dilated LV with very poor function and inferoseptal WMA", "no cardiac activity", "Globally poor contractility", "Mildly enlarged left atria, Good Global Function", "poor global function progressing to cardiac arrest and then return of poor cardiac function, no pericardial effusion")
tee_reprex <- tibble(MRN=sample(reprex_mrn, 200, replace=TRUE), us_interpretation=sample(reprex_us, 200, replace=TRUE))
tee_reprex
#> # A tibble: 200 x 2
#>    MRN     us_interpretation                                                    
#>    <chr>   <chr>                                                                
#>  1 2459626 Globally poor contractility                                          
#>  2 2449754 no cardiac activity                                                  
#>  3 4174678 no cardiac activity                                                  
#>  4 2452107 no significant pericardial effusion, hyperdynamic LV systolic functi…
#>  5 2449754 No Pericardial Effusion, no clot visualized                          
#>  6 6310355 No Pericardial Effusion, no clot visualized                          
#>  7 2452107 Mildly enlarged left atria, Good Global Function                     
#>  8 6310355 hyperdynamic with poor LV filling and large RV                       
#>  9 2449754 very dilated LV with very poor function and inferoseptal WMA         
#> 10 2001748 Globally poor contractility                                          
#> # … with 190 more rows
wordlist<-c("poor", "agonal", "cardiac arrest", "rearrested", "ventricular fibrillation", "enlarged")
tee_reprex1<- tee_reprex %>%
  mutate(us_abnormal=if_else("us_interpretation" %in% wordlist, "yes", "no"))
tee_reprex1
#> # A tibble: 200 x 3
#>    MRN     us_interpretation                                         us_abnormal
#>    <chr>   <chr>                                                     <chr>      
#>  1 2459626 Globally poor contractility                               no         
#>  2 2449754 no cardiac activity                                       no         
#>  3 4174678 no cardiac activity                                       no         
#>  4 2452107 no significant pericardial effusion, hyperdynamic LV sys… no         
#>  5 2449754 No Pericardial Effusion, no clot visualized               no         
#>  6 6310355 No Pericardial Effusion, no clot visualized               no         
#>  7 2452107 Mildly enlarged left atria, Good Global Function          no         
#>  8 6310355 hyperdynamic with poor LV filling and large RV            no         
#>  9 2449754 very dilated LV with very poor function and inferoseptal… no         
#> 10 2001748 Globally poor contractility                               no         
#> # … with 190 more rows

Is this what you are after?

library(tidyverse)

reprex_mrn<-c("2449754", "2001748", "9023298", "2452107", "4174678", "6310355", "9338915", "2459626", "0423163")
reprex_us<-c("no significant pericardial effusion, hyperdynamic LV systolic function", "no pericardial effusion, enlarged akinetic RV, no organized cardiac function", "hyperdynamic with poor LV filling and large RV", "No Pericardial Effusion, no clot visualized", "very dilated LV with very poor function and inferoseptal WMA", "no cardiac activity", "Globally poor contractility", "Mildly enlarged left atria, Good Global Function", "poor global function progressing to cardiac arrest and then return of poor cardiac function, no pericardial effusion")
tee_reprex <- tibble(MRN=sample(reprex_mrn, 200, replace=TRUE), us_interpretation=sample(reprex_us, 200, replace=TRUE))
wordlist<-c("poor", "agonal", "cardiac arrest", "rearrested", "ventricular fibrillation", "enlarged")
wordlist2 <- paste(wordlist, collapse = "|")
wordlist2
#> [1] "poor|agonal|cardiac arrest|rearrested|ventricular fibrillation|enlarged"
tee_reprex1<- tee_reprex %>%
  mutate(us_abnormal=if_else(str_detect(us_interpretation,wordlist2), "yes", "no"))
tee_reprex1
#> # A tibble: 200 x 3
#>    MRN     us_interpretation                                         us_abnormal
#>    <chr>   <chr>                                                     <chr>      
#>  1 9338915 hyperdynamic with poor LV filling and large RV            yes        
#>  2 2459626 very dilated LV with very poor function and inferoseptal… yes        
#>  3 4174678 hyperdynamic with poor LV filling and large RV            yes        
#>  4 9338915 Mildly enlarged left atria, Good Global Function          yes        
#>  5 2459626 no cardiac activity                                       no         
#>  6 2449754 no significant pericardial effusion, hyperdynamic LV sys… no         
#>  7 0423163 no significant pericardial effusion, hyperdynamic LV sys… no         
#>  8 9023298 no pericardial effusion, enlarged akinetic RV, no organi… yes        
#>  9 0423163 No Pericardial Effusion, no clot visualized               no         
#> 10 2001748 poor global function progressing to cardiac arrest and t… yes        
#> # … with 190 more rows

Created on 2020-01-06 by the reprex package (v0.3.0)
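
For anyone wondering why the original attempt returned "no" on every row, here is a minimal sketch. The vector `x` and the shortened `wordlist` are made up for illustration and are not the data from the post. The quoted `"us_interpretation"` is a literal string rather than the column, and `%in%` only tests whether a whole element equals an entry of `wordlist`, so it cannot find a phrase inside a longer sentence. `str_detect()` searches within each string.

```r
library(stringr)

# illustrative inputs (assumed, shortened from the post)
wordlist <- c("poor", "agonal", "cardiac arrest")
x <- c("Globally poor contractility", "no cardiac activity")

# the quoted name is a plain string, and it is not an element of wordlist
literal_test <- "us_interpretation" %in% wordlist

# %in% compares whole elements, so full sentences never equal a keyword
whole_match <- x %in% wordlist

# str_detect() looks for the pattern anywhere inside each string
substring_match <- str_detect(x, paste(wordlist, collapse = "|"))

literal_test
whole_match
substring_match
```

Because `literal_test` is a single `FALSE`, `if_else()` recycles it and every row ends up as "no".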

@FJCC has a more straightforward answer.

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(purrr)) 
suppressPackageStartupMessages(library(stringr)) 

reprex_mrn<-c("2449754", "2001748", "9023298", "2452107", "4174678", "6310355", "9338915", "2459626", "0423163")
reprex_us<-c("no significant pericardial effusion, hyperdynamic LV systolic function", "no pericardial effusion, enlarged akinetic RV, no organized cardiac function", "hyperdynamic with poor LV filling and large RV", "No Pericardial Effusion, no clot visualized", "very dilated LV with very poor function and inferoseptal WMA", "no cardiac activity", "Globally poor contractility", "Mildly enlarged left atria, Good Global Function", "poor global function progressing to cardiac arrest and then return of poor cardiac function, no pericardial effusion")
tee_reprex <- tibble(MRN=sample(reprex_mrn, 200, replace=TRUE), us_interpretation=sample(reprex_us, 200, replace=TRUE))
wordlist<-c("poor", "agonal", "cardiac arrest", "rearrested", "ventricular fibrillation", "enlarged")
tee_reprex1<- tee_reprex %>%
  mutate(us_abnormal=if_else("us_interpretation" %in% wordlist, "yes", "no"))

# us_interpretation has 9 categories
tee_reprex %>% group_by(us_interpretation) %>% count() %>% ungroup()
#> # A tibble: 9 x 2
#>   us_interpretation                                                            n
#>   <chr>                                                                    <int>
#> 1 Globally poor contractility                                                 18
#> 2 hyperdynamic with poor LV filling and large RV                              21
#> 3 Mildly enlarged left atria, Good Global Function                            23
#> 4 no cardiac activity                                                         20
#> 5 no pericardial effusion, enlarged akinetic RV, no organized cardiac fun…    30
#> 6 No Pericardial Effusion, no clot visualized                                 16
#> 7 no significant pericardial effusion, hyperdynamic LV systolic function      20
#> 8 poor global function progressing to cardiac arrest and then return of p…    21
#> 9 very dilated LV with very poor function and inferoseptal WMA                31

# you are interested in only six phrases
length(wordlist)
#> [1] 6

# using the %in% operator is possible but tricky and hard to read; stringr::str_split(., " ") gives a list of character vectors (the words of each us_interpretation), and each vector then has to be checked against wordlist; note that multi-word phrases such as "cardiac arrest" cannot match once the text is split on spaces

# longer but easier to read is ifelse() with str_detect

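# keep only the rows whose us_interpretation contains the keyword z
find_keys <- function(z) {
    tee_reprex %>% filter(str_detect(us_interpretation, z))
    }

# map() returns one tibble per keyword; [[1]] keeps only the first one ("poor")
hits <- map(wordlist, find_keys)[[1]]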
hits
#> # A tibble: 91 x 2
#>    MRN     us_interpretation                                                    
#>    <chr>   <chr>                                                                
#>  1 6310355 hyperdynamic with poor LV filling and large RV                       
#>  2 2449754 very dilated LV with very poor function and inferoseptal WMA         
#>  3 4174678 poor global function progressing to cardiac arrest and then return o…
#>  4 9338915 very dilated LV with very poor function and inferoseptal WMA         
#>  5 4174678 Globally poor contractility                                          
#>  6 2449754 poor global function progressing to cardiac arrest and then return o…
#>  7 2459626 Globally poor contractility                                          
#>  8 2001748 very dilated LV with very poor function and inferoseptal WMA         
#>  9 9023298 very dilated LV with very poor function and inferoseptal WMA         
#> 10 9338915 Globally poor contractility                                          
#> # … with 81 more rows

# this assumes only those records with a word in wordlist are of interest
# the function can be reversed with FALSE and the two tibbles combined again with rbind()

Created on 2020-01-06 by the reprex package (v0.3.0)
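
As a rough sketch of how the per-keyword idea above could flag every row instead of filtering, the following checks each keyword separately and combines the results with OR. The tibble `demo` and the names `any_hit` and `demo_flagged` are hypothetical and not from the thread. `fixed()` is used on the assumption that the keywords are literal text rather than regular expressions.

```r
library(dplyr)
library(purrr)
library(stringr)

wordlist <- c("poor", "agonal", "cardiac arrest", "rearrested",
              "ventricular fibrillation", "enlarged")

# small illustrative data set (assumed)
demo <- tibble(us_interpretation = c(
  "Globally poor contractility",
  "no cardiac activity",
  "Mildly enlarged left atria, Good Global Function"
))

# one logical vector per keyword, then combined element-wise with OR
any_hit <- wordlist %>%
  map(~ str_detect(demo$us_interpretation, fixed(.x))) %>%
  reduce(`|`)

# all rows are kept; the new column records whether any keyword matched
demo_flagged <- demo %>%
  mutate(us_abnormal = if_else(any_hit, "yes", "no"))

demo_flagged
```

This keeps non-matching rows in place, so nothing has to be bound back together afterwards.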

This works! When I used my original code it didn't work, but when I just replaced what I wrote with your text it finally worked so I'm cool now. In the interest of me becoming a more independent learner, how could I have learned that on my own?

Great. Please mark the solution for the benefit of those to follow.

Two great resources are rseek, a search engine tailored to R, and R for Data Science. The key for me was learning to read and study the help() pages. Most of what you use in R is functional. Think school algebra, f(x) = y, writ large: a function takes arguments and returns results. Understanding those is key, and working through the examples helps.
