Using str_detect with two wild cards?

kmprioli · October 11, 2018, 2:33pm

Hi all,

I have some free text data that I'm trying to recategorize. The data arises from health care coordinator contacts with patients and caregivers, which can be phone calls, emails, or text messages. I'm trying to use str_detect() with two wildcards and am getting a syntax error. Here's a reprex containing dummy data (not actual patient data).

library(tidyverse)

commdf <- tribble(
  ~case, ~purpose,
  1,     "set up visit",
  2,     "left message with client",
  3,     "Texted about visit",
  4,     "left voicemail",
  5,     "communication about appointment",
  6,     "phone call",
  7,     "Emailed client",
  8,     "client called back",
  9,     "REPORTED CALL TO MANAGER",
  10,    "texted client"
)

commdf <- commdf %>% 
  mutate(commtype = case_when(
    str_detect(str_to_lower(purpose), "*call*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*spoke*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*message*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*phone*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*discuss*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*reported*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*set up*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*confirm*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*sched*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*communicat*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*voicemail*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*vm*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*text*") == TRUE ~ "text",
    str_detect(str_to_lower(purpose), "*txt*") == TRUE ~ "text",
    str_detect(str_to_lower(purpose), "*email*") == TRUE ~ "email",
    str_detect(str_to_lower(purpose), "*e-mail*") == TRUE ~ "email",
    TRUE ~ NA
  ))
#> Error in mutate_impl(.data, dots): Evaluation error: Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX).

Created on 2018-10-11 by the reprex package (v0.2.0).

I'm not sure what's causing the error, but I wonder if it's from using two wildcard asterisks. Is this use legitimate? Is there a better way to go about this?

Also, I'd like to condense this code further by using something like c("*call*", "*spoke*", "*message*", ...) within str_detect(), but first need to figure out the regex error I'm getting.

Any help you can provide would be appreciated!

martin.R · October 11, 2018, 3:12pm

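If you just want to match the text, e.g. "call", you don't need the *. str_detect() already finds the pattern anywhere in the string, and in a regular expression * is a quantifier meaning "zero or more of the preceding item", so a pattern that starts with * is a syntax error.

If you are also looking for literal asterisk characters, then you need to escape them, with the backslash doubled inside an R string, i.e. "\\*call\\*".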
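
A minimal sketch of both cases, using made-up strings rather than the data from the question. The fixed() variant is an alternative that was not mentioned above; it treats the pattern as literal text instead of a regular expression.

```r
library(stringr)

# Made-up strings for illustration; not taken from the original data.
purpose <- c("phonecalls", "client called back", "left voicemail", "*call* noted")

# A plain pattern matches anywhere in the string, so no wildcards are needed.
plain <- str_detect(purpose, "call")
plain
#> [1]  TRUE  TRUE FALSE  TRUE

# Literal asterisks: escape each one, doubling the backslash inside the R string.
escaped <- str_detect(purpose, "\\*call\\*")
escaped
#> [1] FALSE FALSE FALSE  TRUE

# fixed() treats the whole pattern as literal text, so no escaping is required.
literal <- str_detect(purpose, fixed("*call*"))
literal
#> [1] FALSE FALSE FALSE  TRUE
```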

kmprioli · October 11, 2018, 3:20pm

Oh wow - looks like I've been using completely unnecessary asterisks! I thought they were needed to capture buried substrings - for example, call within phonecalls, called, etc. Thanks, Martin!

joels · October 11, 2018, 3:29pm

You could shorten the code quite a bit by combining some of the regular expressions. For example:

commdf <- commdf %>% 
  mutate(commtype = case_when(
    str_detect(str_to_lower(purpose), "call|spoke|message|phone|discuss|reported|set up|confirm|sched|communicat|voicemail|vm") ~ "call",
    str_detect(str_to_lower(purpose), "te?xt") ~ "text",
    str_detect(str_to_lower(purpose), "e-?mail") ~ "email",
    TRUE ~ NA_character_
  ))
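
If you would rather keep the keywords in character vectors, as asked in the original post, one possible approach is to collapse each vector into a single alternation with str_c(). This is only a sketch: the helper as_pattern(), the shortened keyword vectors, and the sample rows are hypothetical, and it uses regex(ignore_case = TRUE) in place of str_to_lower().

```r
library(dplyr)
library(stringr)

# Made-up sample rows for illustration.
commdf <- tibble(
  case = 1:4,
  purpose = c("left voicemail", "Texted about visit", "E-mailed client", "mailed a letter")
)

# Hypothetical keyword vectors (shortened versions of the lists in the question).
call_terms  <- c("call", "spoke", "message", "phone", "voicemail", "vm")
text_terms  <- c("text", "txt")
email_terms <- c("email", "e-mail")

# Hypothetical helper: join the terms with "|" (regex "or") and ignore case.
# Terms containing regex metacharacters would need escaping first.
as_pattern <- function(terms) {
  regex(str_c(terms, collapse = "|"), ignore_case = TRUE)
}

# case_when() uses the first condition that is TRUE, so order matters.
commdf <- commdf %>%
  mutate(commtype = case_when(
    str_detect(purpose, as_pattern(call_terms))  ~ "call",
    str_detect(purpose, as_pattern(text_terms))  ~ "text",
    str_detect(purpose, as_pattern(email_terms)) ~ "email",
    TRUE ~ NA_character_
  ))

commdf$commtype
#> [1] "call"  "text"  "email" NA
```

The keyword lists stay easy to edit this way, at the cost of one small helper function.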

kmprioli · October 11, 2018, 3:32pm

Perfect, this is exactly what I needed. Thank you Joel!

jcblum · October 11, 2018, 4:11pm

Dropping in to plug regexplain again (and its inspiration, RegExr). I’ve found that being able to see what your regex is matching in some of your own sample data, reactively updating as you fiddle, is huge for flattening the learning curve.

And regexplain will add the extra escape characters necessary when using regex in R for you! (You know what’s less fun than debugging missing escapes in a regex? Debugging missing double escapes in a regex.)
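
For anyone unsure what "double escapes" means here, a small sketch with made-up strings: the regex engine needs to see \. to match a literal period, and to get that single backslash into an R string you have to type two.

```r
library(stringr)

# Made-up strings for illustration.
versions <- c("v1.2", "v1x2")

# Unescaped, "." matches any single character, so both strings match.
unescaped <- str_detect(versions, "1.2")
unescaped
#> [1] TRUE TRUE

# The regex \. is written "\\." in R source, and matches only a literal period.
escaped <- str_detect(versions, "1\\.2")
escaped
#> [1]  TRUE FALSE

# writeLines() prints the string as the regex engine receives it.
writeLines("1\\.2")
#> 1\.2
```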

kmprioli · October 11, 2018, 5:31pm

This will definitely make life easier - thank you!