String recoding based on a library of predefined phrases

Hi,
I have this simple data file:

data.frame(stringsAsFactors = FALSE,
           Unique.respondent.number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
           comment = c("I have seen many various charges in my life, but I don’t like your saving rates",
                       "I like R Studio", "No comment",
                       "Main benefit is having low charges",
                       NA,
                       "Charge could be an issue",
                       "Issues with saving rates", "Good saving rates",
                       "Many benefits like reasonable charges", "N/A", "-",
                       "Nothing"))

Now, I would like to recode all invalid or blank responses into a new variable called "Blank".

I can recode blank fields and single/double characters using this:

source$Blank <- ifelse(is.na(source$comment), 1, ifelse(str_length(source$comment)<3, 1, 0))

All fine, but now, rather than guessing what other invalid responses could be ("No clue", "Nothing to say", "Don't know", "?", "No comment"), I would like to use an external source to let R know that if "comment" equals any of the "No comment" values from the list, "Blank" should be recoded into 1.
Also, I know some respondents leave repetitive characters such as "zzzzz" or "xxxx" if they have nothing to comment. Is there any way of recoding "Blank" into 1 if a "comment" contains this repetitive string type?

One solution is the following:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dataset <- tibble(Unique.respondent.number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                  comment = c("I have seen many various charges in my life,
                  but I don’t like your saving rates",
                              "I like R Studio", "No comment",
                              "Main benefit is having low charges",
                              NA,
                              "Charge could be an issue",
                              "Issues with saving rates", "Good saving rates",
                              "Many benefits like reasonable charges", "N/A", "-",
                              "Nothing"))

# modify it based on the other possible responses
list_of_no_comment_possibilities <- c(NA, "No comment", "N/A", "-", "Nothing")

dataset %>%
  mutate(Blank = if_else(condition = (comment %in% list_of_no_comment_possibilities),
                         true = 1,
                         false = 0))
#> # A tibble: 12 x 3
#>    Unique.respondent.nu~ comment                                      Blank
#>                    <dbl> <chr>                                        <dbl>
#>  1                     1 "I have seen many various charges in my lif~     0
#>  2                     2 I like R Studio                                  0
#>  3                     3 No comment                                       1
#>  4                     4 Main benefit is having low charges               0
#>  5                     5 <NA>                                             1
#>  6                     6 Charge could be an issue                         0
#>  7                     7 Issues with saving rates                         0
#>  8                     8 Good saving rates                                0
#>  9                     9 Many benefits like reasonable charges            0
#> 10                    10 N/A                                              1
#> 11                    11 -                                                1
#> 12                    12 Nothing                                          1

Created on 2019-07-08 by the reprex package (v0.3.0)

I'm not too sure about this, but probably you can try something like this:

if_else(str_detect(comment, "(.)\\1{5,}"), 1, 0)

It checks whether some character occurs more than 5 times in a row (one occurrence followed by at least 5 repeats). But I'm really weak in regular expressions, so it may be wrong or there may be better solutions.

Thank you but:

  1. Although I did not get any errors, my results stay the same before and after applying
source %>%
  mutate(Blank = if_else(condition = (comment %in% list_of_no_comment_possibilities),
                         true = 1,
                         false = 0))

and I don't really know why...

  2. I was thinking about using a library of negative words. I know it is possible as I can see something similar here: A Light Introduction to Text Analysis in R | by Brian Ward | Towards Data Science
    (chapter about sentiment analysis). Unfortunately, I cannot apply it to my little data file.

  3. Your second code is not working and I get the following error:

Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) : 
  argument `str` should be a character vector (or an object coercible to)

What do you mean by this? Did you mean that after you run the code you quoted, if you print source again, you get the previous source? If that is the case, it's because you haven't assigned the result. Try source <- source %>% mutate(...), where ... is the rest of the code; afterwards source should be updated.
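A minimal sketch of the assignment point, assuming dplyr is installed. The two-row `responses` data frame and the `no_comment` vector are hypothetical stand-ins for `source` and `list_of_no_comment_possibilities`:

```r
library(dplyr)

# Hypothetical stand-in for the poster's `source` data frame
responses <- data.frame(comment = c("No comment", "Good saving rates"),
                        stringsAsFactors = FALSE)
no_comment <- c(NA, "No comment", "N/A", "-", "Nothing")

# Without assignment the result is only printed; `responses` itself is unchanged
responses %>% mutate(Blank = if_else(comment %in% no_comment, 1, 0))
"Blank" %in% names(responses)   # FALSE

# Assigning the result back keeps the new column
responses <- responses %>% mutate(Blank = if_else(comment %in% no_comment, 1, 0))
responses$Blank                 # 1 0
```

mutate() returns a modified copy rather than changing its input in place, which is why the original object looks untouched until the result is assigned.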

I can't help you with this.

It seems to work. Almost certainly you just copy-pasted my code into the console, so R has no idea what the comment object is and treats it as the function of that name defined in the base package. That's not what I meant.
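The name clash can be reproduced without any packages. This sketch uses base `grepl()` in place of `str_detect()`, so the error text differs from the one quoted above, and `responses` is a hypothetical stand-in for the data:

```r
# In a fresh session, the bare name `comment` is the base R function
is.function(comment)

# Passing that function to a pattern matcher fails, because a function
# cannot be coerced to a character vector
result <- try(grepl("(.)\\1{4,}", comment, perl = TRUE), silent = TRUE)
inherits(result, "try-error")

# Evaluated inside the data frame, `comment` is the column instead
responses <- data.frame(comment = c("zzzzz", "Good saving rates"),
                        stringsAsFactors = FALSE)
with(responses, grepl("(.)\\1{4,}", comment, perl = TRUE))
```

Inside mutate() the column is found in the same way, which is why the expression has to be used there rather than on its own.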

What I meant was to use it inside mutate, like the previous if_else. Add a few rows where the comment is xxxxx or zzzzz, and then use ... %>% mutate(Blank = if_else(((comment %in% list_of_no_comment_possibilities)|(str_detect(comment, "(.)\\1{4,}"))), 1, 0)). It is supposed to work, and it does work on my device.

(Note that I changed 5 to 4 here, as using 5 will search for 1+5=6 identical characters. If you want to check for repetitions of specific characters only, i.e. you know for certain it'll be either x or z, you can replace the . inside the parentheses with a character class such as [xz].)
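The counting can be checked on a few test strings. This is a sketch in base R, using `grepl()` with `perl = TRUE` as a stand-in for `str_detect()`; the vector `x` and the name `no_comment` are made up for illustration. One difference to keep in mind: `grepl()` returns FALSE for NA, whereas `str_detect()` returns NA.

```r
x <- c("zzzzz", "xxxxxx", "aaaa", "Good saving rates", NA)

# {5,}: the captured character plus at least 5 repeats, i.e. 6 in a row
six_plus <- grepl("(.)\\1{5,}", x, perl = TRUE)
six_plus    # FALSE TRUE FALSE FALSE FALSE

# {4,}: the captured character plus at least 4 repeats, i.e. 5 in a row
five_plus <- grepl("(.)\\1{4,}", x, perl = TRUE)
five_plus   # TRUE TRUE FALSE FALSE FALSE

# Only runs of x or z count
xz_only <- grepl("([xz])\\1{4,}", x, perl = TRUE)
xz_only     # TRUE TRUE FALSE FALSE FALSE

# Combine the phrase list and the repetition check into one flag
no_comment <- c(NA, "No comment", "N/A", "-", "Nothing")
blank <- as.integer(x %in% no_comment | five_plus)
blank       # 1 1 0 0 1
```

The NA entry is flagged because NA is itself a member of `no_comment`, so `%in%` returns TRUE for it.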

If you can't do it, please provide a reproducible example showing your attempts at the problem and the errors you are facing.

Thank you very much. You are really helpful.
I managed to create something and I moved my issue to this post: Mutate issues plus external dictionary for string detect
(I did not include my zzzz or xxxxxx detection though, as I have only just seen your response).
I know you have already helped me a lot but would you be able to look at the post and modify my existing work (blank_statements are defined there separately as they might be used for other projects)?

One comment on your response: looking for elements from the list should not be case sensitive (so "No Comments" or "no comments" should be detected).
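One way to get a case-insensitive lookup against an external list is to lower-case both sides before comparing. The sketch below is base R only; the temporary file stands in for a real external dictionary (one phrase per line), and the names `dict_path`, `blank_phrases` and `comments` are hypothetical:

```r
# Hypothetical external dictionary: one phrase per line in a text file
dict_path <- tempfile(fileext = ".txt")
writeLines(c("No comment", "No comments", "N/A", "-", "Nothing", "Don't know"),
           dict_path)

# Read the phrases and normalise them: trim whitespace, lower-case
blank_phrases <- tolower(trimws(readLines(dict_path)))

comments <- c("No Comments", "no comments", "NOTHING", "Good saving rates", NA)

# Normalise the comments the same way, then flag NA or any listed phrase
normalised <- tolower(trimws(comments))
blank <- as.integer(is.na(comments) | normalised %in% blank_phrases)
blank   # 1 1 1 0 1
```

This matches whole comments only. A comment that merely contains one of the phrases is not flagged, which keeps longer, genuine responses from being recoded by accident.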
