Recode variable based on keywords within text

Hello!

I have 10114 (and counting) text messages in 4 different languages I would like to code for analysis.

I'm looking for a concise way to label/recode the text messages based on keywords/text within the message.

The data looks similar to this:

library(tidyverse)
library(knitr)

msg <- tibble::tribble(
               ~ID,                                                      ~text,
                 1,          "Please call me from Jane,  sent on: Mar 1, 2019",
                 2,           "Please call me from Dan,  sent on: Feb 5, 2018",
                 3,           "Please call me from Ben,  sent on: Mar 9, 2017",
                 4,     "Reminder to do something Jane,  sent on: Apr 1, 2016",
                 5,          "Reminder to do this Dan,  sent on: Jun 14, 2019",
                 6, "Reminder to do something else Ben,  sent on: Jan 1, 2018"
               )
msg %>% kable()

| ID|text                                                     |
|--:|:--------------------------------------------------------|
|  1|Please call me from Jane, sent on: Mar 1, 2019           |
|  2|Please call me from Dan, sent on: Feb 5, 2018            |
|  3|Please call me from Ben, sent on: Mar 9, 2017            |
|  4|Reminder to do something Jane, sent on: Apr 1, 2016      |
|  5|Reminder to do this Dan, sent on: Jun 14, 2019           |
|  6|Reminder to do something else Ben, sent on: Jan 1, 2018  |

I would like to add a label variable to classify each message based on its contents to use in further analysis. For example:

library(tidyverse)
library(knitr)
msg_lab <- tibble::tribble(
                   ~ID,                                                      ~text,     ~label,
                     1,          "Please call me from Jane,  sent on: Mar 1, 2019",  "Call me",
                     2,           "Please call me from Dan,  sent on: Feb 5, 2018",  "Call me",
                     3,           "Please call me from Ben,  sent on: Mar 9, 2017",  "Call me",
                     4,     "Reminder to do something Jane,  sent on: Apr 1, 2016", "Reminder",
                     5,          "Reminder to do this Dan,  sent on: Jun 14, 2019", "Reminder",
                     6, "Reminder to do something else Ben,  sent on: Jan 1, 2018", "Reminder"
                   )

msg_lab %>% kable()

| ID|text                                                     |label    |
|--:|:--------------------------------------------------------|:--------|
|  1|Please call me from Jane, sent on: Mar 1, 2019           |Call me  |
|  2|Please call me from Dan, sent on: Feb 5, 2018            |Call me  |
|  3|Please call me from Ben, sent on: Mar 9, 2017            |Call me  |
|  4|Reminder to do something Jane, sent on: Apr 1, 2016      |Reminder |
|  5|Reminder to do this Dan, sent on: Jun 14, 2019           |Reminder |
|  6|Reminder to do something else Ben, sent on: Jan 1, 2018  |Reminder |


table(msg_lab$label)
#> 
#>  Call me Reminder 
#>        3        3

I'm trying to use the fct_recode function from the forcats package. The solution below works, but it isn't feasible for my data set because every message has to be spelled out in full.

library(forcats)
msg_lab <- msg %>%
        mutate(label = fct_recode(text,
                                  "Call me" = "Please call me from Jane,  sent on: Mar 1, 2019",
                                  "Call me" = "Please call me from Dan,  sent on: Feb 5, 2018",
                                  "Call me" = "Please call me from Ben,  sent on: Mar 9, 2017",
                                  "Reminder" = "Reminder to do something Jane,  sent on: Apr 1, 2016",
                                  "Reminder" = "Reminder to do this Dan,  sent on: Jun 14, 2019",
                                  "Reminder" = "Reminder to do something else Ben,  sent on: Jan 1, 2018"
        ))

table(msg_lab$label)
#> 
#>  Call me Reminder 
#>        3        3

I also tried to use the str_detect function from the stringr package to detect the keywords used in my labels. Unfortunately, combining it with fct_recode results in an error.

library(stringr)
str_detect("Please call me from Jane,  sent on: Mar 1, 2019", "call me")
#> [1] TRUE

msg_lab <- msg %>%
        mutate(label = fct_recode(text,
                                  "Call me" = str_detect(text, "call me"),
                                  "Reminder" = str_detect(text, "Reminder")
        ))
#> Error: Each input to fct_recode must be a single named string. Problems at positions: 1, 2

I would appreciate some pointers on how to do this!

What about using case_when, if the rules are fairly straightforward?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)
library(stringr)
library(kableExtra)
#> 
#> Attaching package: 'kableExtra'
#> The following object is masked from 'package:dplyr':
#> 
#>     group_rows

msg <- tibble::tribble(
  ~ID,                                                      ~text,
  1,          "Please call me from Jane,  sent on: Mar 1, 2019",
  2,           "Please call me from Dan,  sent on: Feb 5, 2018",
  3,           "Please call me from Ben,  sent on: Mar 9, 2017",
  4,     "Reminder to do something Jane,  sent on: Apr 1, 2016",
  5,          "Reminder to do this Dan,  sent on: Jun 14, 2019",
  6, "Reminder to do something else Ben,  sent on: Jan 1, 2018"
)

msg_lab <- msg %>%
  mutate(
    label = case_when(
      str_detect(text, "call me") ~ "Call me",
      str_detect(text, "Reminder") ~ "Reminder"
    )
  )

print(msg_lab)
#> # A tibble: 6 x 3
#>      ID text                                                     label   
#>   <dbl> <chr>                                                    <chr>   
#> 1     1 Please call me from Jane,  sent on: Mar 1, 2019          Call me 
#> 2     2 Please call me from Dan,  sent on: Feb 5, 2018           Call me 
#> 3     3 Please call me from Ben,  sent on: Mar 9, 2017           Call me 
#> 4     4 Reminder to do something Jane,  sent on: Apr 1, 2016     Reminder
#> 5     5 Reminder to do this Dan,  sent on: Jun 14, 2019          Reminder
#> 6     6 Reminder to do something else Ben,  sent on: Jan 1, 2018 Reminder

Created on 2019-08-13 by the reprex package (v0.3.0)
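
With thousands of messages, some will probably match none of the rules or use different capitalisation. Here is a minimal sketch of the same case_when idea with case-insensitive matching and a catch-all label. Rows 7 and 8 and the "Other" label are invented for illustration and are not part of the original data. Keywords from other languages could be added to a pattern with `|`, for example `regex("call me|<other keyword>", ignore_case = TRUE)`.

```r
library(dplyr)
library(stringr)
library(tibble)

# Two rows from the example above plus two invented rows (IDs 7 and 8)
msg <- tibble::tribble(
  ~ID, ~text,
  1, "Please call me from Jane,  sent on: Mar 1, 2019",
  4, "Reminder to do something Jane,  sent on: Apr 1, 2016",
  7, "PLEASE CALL ME from Ann,  sent on: May 2, 2019",  # invented: different case
  8, "Thanks for your help,  sent on: May 3, 2019"      # invented: matches no rule
)

msg_lab <- msg %>%
  mutate(
    label = case_when(
      # regex(..., ignore_case = TRUE) matches "call me", "CALL ME", etc.
      str_detect(text, regex("call me", ignore_case = TRUE)) ~ "Call me",
      str_detect(text, regex("reminder", ignore_case = TRUE)) ~ "Reminder",
      # Fallback for messages that match none of the rules above;
      # without it, unmatched rows would get NA
      TRUE ~ "Other"
    )
  )

# Tally the labels to see how many messages fell through to "Other"
count(msg_lab, label)
```

case_when checks the rules in order and uses the first one that matches, so more specific keywords should come before more general ones. Looking at the "Other" rows from time to time is one way to find keywords that are still missing.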
