Text Analysis - Coding text into variables/ categories

Dear R users,
I have found a few Text Analysis packages which might be used for sentiment analysis, word clouds or even looking for phrases using Rapid Automatic Keyword Extraction like that:

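## Using RAKE
library(udpipe)   # provides keywords_rake()
library(lattice)  # provides barchart()
# x is assumed to be a data frame of udpipe annotations (doc_id, lemma, upos, ...)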
stats <- keywords_rake(x = x, term = "lemma", group = "doc_id", 
                       relevant = x$upos %in% c("NOUN", "ADJ"))
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
barchart(key ~ rake, data = head(subset(stats, freq > 3), 20), col = "red", 
         main = "Keywords identified by RAKE", 
         xlab = "Rake")

Unfortunately, I cannot find anything that lets me specify rules for assigning comments to specific categories.
For example, we have a data frame of respondents' comments with just two variables: “Unique respondent number” and his/her “comment”.
Now I need to apply the following rules to the “comment” variable for each “Unique respondent number”:

  1. If a sentence contains the word "charge" or "charges" but does not contain "benefit" or "benefits", a new variable called “Charges/ Fees” should be created and given the value 1 for the record containing this sentence (otherwise 0).
  2. If a sentence has a negative sentiment (I believe I could specify the words that determine whether the sentiment is negative, positive or neutral) and it contains the phrase "savings rates", a new variable called “Poor Rates” should be created and given the value 1 for the record containing this sentence (otherwise 0).
  3. If a sentence does not meet the 2 criteria above, a new variable called “Other” should be created and given the value 1 for the record containing this sentence (otherwise 0).
  4. If there is no comment, i.e. the “comment” field is blank or contains an entry from a given list of non-answers (I guess I could set up this list with entries like “NA”, “No comment”, “Nothing to say”), a fourth variable called “Blank” should be created and given the value 1 for this respondent (otherwise 0).

In the end, I should end up having four new variables added to the existing data frame with value 1 when the criteria above are met (“Charges/ Fees”, “Poor Rates”, “Other”, “Blank”) with 1s or 0s for each Unique respondent number. Obviously, some comments can meet criteria 1-3 in many combinations for example:

  1. If the first respondent’s comment was “I have seen many various charges in my life, but I don’t like your saving rates”, Unique respondent number=1 should get:
  • Charges/ Fees=1,
  • Poor Rates=1,
  • Other=0,
  • Blank=0
  2. If the second respondent’s comment was “I like R Studio”, Unique respondent number=2 should get:
  • Charges/ Fees=0,
  • Poor Rates=0,
  • Other=1,
  • Blank=0
  3. If the third respondent’s comment was “No comment”, Unique respondent number=3 should get:
  • Charges/ Fees=0,
  • Poor Rates=0,
  • Other=0,
  • Blank=1
  4. If the fourth respondent’s comment was “Main benefit is having low charges”, Unique respondent number=4 should get:
  • Charges/ Fees=0,
  • Poor Rates=0,
  • Other=1,
  • Blank=0
  5. If the fifth respondent’s comment field was left blank, Unique respondent number=5 should get:
  • Charges/ Fees=0,
  • Poor Rates=0,
  • Other=0,
  • Blank=1

Finally, I need to calculate the proportion of 1s in “Charges/ Fees”, “Poor Rates”, “Other” and “Blank” out of the total number of df records.

If my df had only the 5 records mentioned above, the results should be as follows:

  • Charges/ Fees=20%
  • Poor Rates=20%
  • Other=40%
  • Blank=40%

Is it challenging enough?

This sounds relatively easy to do with simple data wrangling using dplyr or even base R (no need for a specialized package). As usual, if you need specific help, please provide a reproducible example of your issue, including sample data in a copy/paste-friendly format.

Also, you may want to change your title to something more descriptive of the task at hand; it is not very helpful as it is right now.

Oh wow, easy? really? That is my test data:

library(readxl)
source <- read_excel("C:/Users/sdanilowicz/Documents/TM test data.xlsx", sheet = "comments")

datapasta::df_paste(source)
data.frame(stringsAsFactors=FALSE,
   Unique.respondent.number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                    comment = c("I have seen many various charges in my life,
                                but I don’t like your saving rates",
                                "I like R Studio", "No comment",
                                "Main benefit is having low charges", NA, "Charge could be an issue",
                                "Issues with saving rates", "Good saving rates",
                                "Many benefits like reasonable charges", "NA")
)

Obviously we need to set up a list of words which specify the negative sentiment mentioned in point 2. In this df this would be "don't like" and "issue"

This example shows how to create the first two variables using dplyr and regular expressions; you should be able to complete the rest by doing something similar.

df <- data.frame(stringsAsFactors=FALSE,
                 Unique.respondent.number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                 comment = c("I have seen many various charges in my life,
                                but I don’t like your saving rates",
                             "I like R Studio", "No comment",
                             "Main benefit is having low charges", NA, "Charge could be an issue",
                             "Issues with saving rates", "Good saving rates",
                             "Many benefits like reasonable charges", "NA"))
library(dplyr)
library(stringr)

df %>% 
    mutate(Charges_Fees = if_else(str_detect(comment, "charges?") & !str_detect(comment, "benefits?"), 1, 0),
           Poor_Rates = if_else(str_detect(comment, "don.?t\\slike|issue") & str_detect(comment, "saving\\srates"), 1, 0)) %>% 
    select(-comment) # I have deselected this long variable just for printing purposes

#>    Unique.respondent.number Charges_Fees Poor_Rates
#> 1                         1            1          1
#> 2                         2            0          0
#> 3                         3            0          0
#> 4                         4            0          0
#> 5                         5           NA         NA
#> 6                         6            0          0
#> 7                         7            0          0
#> 8                         8            0          0
#> 9                         9            0          0
#> 10                       10            0          0

Created on 2019-07-04 by the reprex package (v0.3.0)

Thank you very much but:

  1. I get the following error:
Error in UseMethod("mutate_") : 
  no applicable method for 'mutate_' applied to an object of class "function"
  2. The results are incorrect (respondents 6 and 7); we should get the following (only integers 0 or 1 are allowed):
#>    Unique.respondent.number Charges_Fees Poor_Rates
#> 1                         1            1          1
#> 2                         2            0          0
#> 3                         3            0          0
#> 4                         4            0          0
#> 5                         5            0          0
#> 6                         6            1          0
#> 7                         7            0          1
#> 8                         8            0          0
#> 9                         9            0          0
#> 10                       10            0          0

Also, can we define a list of words which characterise negative sentiment (such as "don't like", "issue" etc.) and a list of words which describe an empty response (such as "no comment", "nothing to say", "NA" etc.) separately, and refer to them in the code?

I really appreciate your help and I can create more rules after “Charges/ Fees”, “Poor Rates” but "Other" is conditional (if a comment does not meet any previous requirement and is not blank then it should become "Other".). Is it a simple condition?

Thank you,
Slavek

For your first point, I can't tell why you are getting that error without a reproducible example (it works for me in a clean environment with the sample data provided).

For the second point, the results are incorrect because I forgot to make the matching case-insensitive; this fixes it:

df <- data.frame(stringsAsFactors=FALSE,
                 Unique.respondent.number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                 comment = c("I have seen many various charges in my life,
                                but I don’t like your saving rates",
                             "I like R Studio", "No comment",
                             "Main benefit is having low charges", NA, "Charge could be an issue",
                             "Issues with saving rates", "Good saving rates",
                             "Many benefits like reasonable charges", "NA"))
library(dplyr)
library(stringr)

df %>% 
    mutate(Charges_Fees = if_else(str_detect(comment, regex("charges?", ignore_case = TRUE)) &
                                      !str_detect(comment, regex("benefits?", ignore_case = TRUE)), 1, 0),
           Poor_Rates = if_else(str_detect(comment, regex("don.?t\\slike|issue", ignore_case = TRUE)) &
                                    str_detect(comment, regex("saving\\srates", ignore_case = TRUE)), 1, 0)) %>% 
    select(-comment) %>% 
    mutate_all(~if_else(is.na(.), 0, .))
#>    Unique.respondent.number Charges_Fees Poor_Rates
#> 1                         1            1          1
#> 2                         2            0          0
#> 3                         3            0          0
#> 4                         4            0          0
#> 5                         5            0          0
#> 6                         6            1          0
#> 7                         7            0          1
#> 8                         8            0          0
#> 9                         9            0          0
#> 10                       10            0          0

About the "only integers 0 or 1 are allowed" part: NA is not a character string; it is how R represents missing values (it stands for "Not Available"). You can replace it with 0 if you want, as shown in the example above.
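
To make the distinction concrete, here is a small base R sketch (the vector names `x` and `flag` are only for illustration). It compares a real missing value with the text "NA" and shows one way to swap missing values for 0:

```r
# A character vector holding the text "NA" and a real missing value
x <- c("NA", NA)
is_missing <- is.na(x)   # only the second element counts as missing

# A numeric flag column with a missing value, replaced by 0
flag <- c(1, NA, 0)
flag_filled <- ifelse(is.na(flag), 0, flag)
```

So the last record in the sample data, which holds the text "NA", is not treated as missing by is.na() and would have to be matched as a string instead.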

Yes, you can create the regular expression separately and reference it later:

negative_sentiments <- regex("don.?t\\slike|issue|other words", ignore_case = TRUE)
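
If you prefer to keep plain word lists and build the patterns from them, something along these lines should work. This is only a sketch: the vector names and the entries in the lists are my assumptions about what you would put there.

```r
library(stringr)

# Hypothetical word lists; extend them as needed
negative_words <- c("don.?t\\slike", "issue")
blank_words    <- c("no comment", "nothing to say", "n/?a")

# Matches if any negative word appears anywhere in the comment
negative_pattern <- regex(paste(negative_words, collapse = "|"), ignore_case = TRUE)

# Matches only if the whole comment is empty or is exactly one of the blank words
blank_pattern <- regex(
  paste0("^\\s*(", paste(blank_words, collapse = "|"), ")?\\s*$"),
  ignore_case = TRUE
)

negative_hits <- str_detect(c("Issues with saving rates", "Good saving rates"), negative_pattern)
blank_hits    <- str_detect(c("No comment", "N/A", "NA", "I like R Studio"), blank_pattern)
```

Anchoring the blank pattern with ^ and $ is a deliberate choice here, so that a longer comment which merely contains "no comment" somewhere is not counted as blank.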

You just have to find the right logical statement. As a hint: once you have created a variable with mutate, you can reference its value, so you could check whether any of the previous variables has the value 1.
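
In case it helps, here is one possible way to put all four variables and the proportions together. Treat it as a sketch under a few assumptions of mine: the rules are applied to the whole comment rather than sentence by sentence, the helper `has()` and the blank pattern are made up for this example, and only the first five records from the question are used.

```r
library(dplyr)
library(stringr)

# First five records from the original question
df <- data.frame(
  stringsAsFactors = FALSE,
  Unique.respondent.number = 1:5,
  comment = c("I have seen many various charges in my life, but I don't like your saving rates",
              "I like R Studio", "No comment",
              "Main benefit is having low charges", NA)
)

# Hypothetical helper: case-insensitive match that treats NA as "no match"
has <- function(x, pattern) {
  !is.na(x) & str_detect(x, regex(pattern, ignore_case = TRUE))
}

coded <- df %>%
  mutate(
    Blank        = as.integer(is.na(comment) |
                                has(comment, "^\\s*(no comment|nothing to say|n/?a)?\\s*$")),
    Charges_Fees = as.integer(has(comment, "charges?") & !has(comment, "benefits?")),
    Poor_Rates   = as.integer(has(comment, "don.?t\\slike|issue") &
                                has(comment, "saving\\srates")),
    # "Other" refers to the columns created just above
    Other        = as.integer(Charges_Fees == 0 & Poor_Rates == 0 & Blank == 0)
  )

# Share of 1s in each new column
proportions <- colMeans(coded[c("Charges_Fees", "Poor_Rates", "Other", "Blank")])
```

For these five records the proportions come out as 0.2, 0.2, 0.4 and 0.4, which matches the 20%, 20%, 40% and 40% expected in the question.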

I have a new version of R installed (3.6.0). I refreshed everything and reinstalled dplyr, but I get this error:

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

I don't really know why this code works for you but it does not for me...
Still the same error:

Error in UseMethod("mutate_") : 
  no applicable method for 'mutate_' applied to an object of class "function"

Do you know how I could fix that?

Slavek

Also, what should I include in the "Blank" code to indicate blank fields (if blank or "NA" or "No comments" then 1, otherwise 0)?

Slavek

Are you running the exact same code in a clean R session? Try restarting your R session with Ctrl+Shift+F10.

Yes, I restarted using your key combination, and source displays correctly:

datapasta::df_paste(source)
data.frame(stringsAsFactors=FALSE,
   Unique.respondent.number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                    comment = c("I have seen many various charges in my life,
                                but I don’t like your saving rates",
                                "I like R Studio", "No comment",
                                "Main benefit is having low charges", NA, "Charge could be an issue",
                                "Issues with saving rates", "Good saving rates",
                                "Many benefits like reasonable charges", "N/A"))

# A tibble: 10 x 2
   `Unique respondent number` comment                                                                        
                        <dbl> <chr>                                                                          
 1                          1 I have seen many various charges in my life, but I don’t like your saving rates
 2                          2 I like R Studio                                                                
 3                          3 No comment                                                                     
 4                          4 Main benefit is having low charges                                             
 5                          5 NA                                                                             
 6                          6 Charge could be an issue                                                       
 7                          7 Issues with saving rates                                                       
 8                          8 Good saving rates                                                              
 9                          9 Many benefits like reasonable charges                                          
10                         10 N/A            

but issues with dplyr (maybe a special installation is required?):

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

and with your code...:

Error in UseMethod("mutate_") : 
  no applicable method for 'mutate_' applied to an object of class "function"

:frowning_face:

Nope, I'm just using the CRAN version of dplyr, nothing special about it.

Ok, I have run the same code in the main R console (R x64 3.6.0) and it's working. What is the solution? Fresh R installation?

Actually, it worked once. Now I get an error in R (not R Studio):

Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) : 
  argument `str` should be a character vector (or an object coercible to)

Also, a silly question but I cannot fix it myself:

# I have deselected this long variable just for printing purposes

How can I select the entire df? URN, comment and new variables?

I don't know the solution to the problem with running the code. Maybe Andres or some RStudio people can help you with that. I can only say that it works perfectly for me.

select is a function in dplyr package, which can be used to select a subset of the columns. Here, Andres selected all but the comment column. If you want to have the whole data frame, just comment out that line.
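
A tiny illustration with a made-up data frame (the names `toy`, `id` and `flag` are mine, not from your data):

```r
library(dplyr)

toy <- data.frame(stringsAsFactors = FALSE,
                  id = c(1, 2), comment = c("a", "b"), flag = c(1, 0))

without_comment <- names(select(toy, -comment))  # drops only the comment column
all_columns     <- names(toy)                    # no select() step: every column is kept
```

Here `without_comment` holds "id" and "flag", while `all_columns` still includes "comment".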

If you are not familiar with these functions, you can check out Chapter 5 of R4DS, a free online book.

Thank you for your responses. I'm trying to find solutions with my limited R knowledge.
I cannot find any reference to the data source (called "source") in the code. Is it normal? We use reference to "comment" which is source$comment.
I have also tried running a part of the code without pipes:

result <- mutate(Charges_Fees = if_else(str_detect(comment, regex("charges?", ignore_case = TRUE)) &
                                          !str_detect(comment, regex("benefits?", ignore_case = TRUE)), 1, 0),
                 Poor_Rates = if_else(str_detect(comment, regex("don.?t\\slike|issue", ignore_case = TRUE)) &
                                        str_detect(comment, regex("saving\\srates", ignore_case = TRUE)), 1, 0))

and the error is as follows:

Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) : 
  argument `str` should be a character vector (or an object coercible to)

When I change 'comment' into 'source$comment':


result <- mutate(Charges_Fees = if_else(str_detect(source$comment, regex("charges?", ignore_case = TRUE)) &
                                          !str_detect(source$comment, regex("benefits?", ignore_case = TRUE)), 1, 0),
                 Poor_Rates = if_else(str_detect(source$comment, regex("don.?t\\slike|issue", ignore_case = TRUE)) &
                                        str_detect(source$comment, regex("saving\\srates", ignore_case = TRUE)), 1, 0))

the error is different:

Error in mutate_(.data, .dots = compat_as_lazy_dots(...)) : 
  argument ".data" is missing, with no default

I don't want to give up so quickly!

In my answer I'm using your sample data and I have called it "df"; you have to replace this with your own dataset, i.e. "source".
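
That would also explain the earlier mutate_ error, although this is a guess on my part rather than something I have verified on your machine: when no object called df exists in your session, R finds the function df() from the stats package (the density of the F distribution), and mutate() cannot work on a function. You can check this in a fresh session:

```r
# Run this in a session where you have NOT created your own `df`
df_is_function <- is.function(df)                   # R finds stats::df
df_home        <- environmentName(environment(df))  # the package it comes from
```

If `df_is_function` is TRUE and `df_home` is "stats", the pipe was being fed a function instead of your data.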

If you run mutate() without pipes then you have to provide the .data=source argument inside the function.
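
For example, a minimal sketch with a small made-up data frame (`comments_df` is a placeholder name; put your own object there):

```r
library(dplyr)
library(stringr)

comments_df <- data.frame(
  stringsAsFactors = FALSE,
  comment = c("I have seen many various charges",
              "Main benefit is having low charges",
              "Charge could be an issue")
)

# The data frame goes first (.data); column names are then looked up inside it
result <- mutate(
  comments_df,
  Charges_Fees = if_else(str_detect(comment, regex("charges?", ignore_case = TRUE)) &
                           !str_detect(comment, regex("benefits?", ignore_case = TRUE)), 1, 0)
)
```

With the data supplied this way, `comment` is found as a column, so there is no need to write `comments_df$comment` inside mutate().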

The solution for this is reading the book Yarnabrina pointed out in his response

Hurray!!!! Thank you for being so patient.
I knew it must have been a silly error! My little data example is called "source" so I used this name :slight_smile:

Final question please, please.

When I remove this bit

  select(-comment) %>% 

from the code (as suggested by Yarnabrina "If you want to have the whole data frame, just comment out that line.") and use this code:

source %>% 
  mutate(Charges_Fees = if_else(str_detect(comment, regex("charges?", ignore_case = TRUE)) &
                                  !str_detect(comment, regex("benefits?", ignore_case = TRUE)), 1, 0),
         Poor_Rates = if_else(str_detect(comment, regex("don.?t\\slike|issue", ignore_case = TRUE)) &
                                str_detect(comment, regex("saving\\srates", ignore_case = TRUE)), 1, 0)) %>% 
  mutate_all(~if_else(is.na(.), 0, .))

my error is:

Error: `false` must be a double vector, not a character vector

What am I doing wrong? It's just simply removing one condition from the chain of pipes...

Yarnabrina is right; try this:

source %>% 
    mutate(Charges_Fees = if_else(str_detect(comment, regex("charges?", ignore_case = TRUE)) &
                                      !str_detect(comment, regex("benefits?", ignore_case = TRUE)), 1, 0),
           Poor_Rates = if_else(str_detect(comment, regex("don.?t\\slike|issue", ignore_case = TRUE)) &
                                    str_detect(comment, regex("saving\\srates", ignore_case = TRUE)), 1, 0)) %>% 
    mutate_if(is.numeric, ~if_else(is.na(.), 0, .))
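
The reason, as far as I can tell: once comment stays in the data, mutate_all() also touches that character column, and if_else() refuses to mix the number 0 with text. mutate_if(is.numeric, ...) only touches the numeric columns. A small sketch with a made-up data frame (`toy` and `flag` are placeholder names):

```r
library(dplyr)

toy <- data.frame(stringsAsFactors = FALSE,
                  comment = c("a", NA), flag = c(1, NA))

# Fails on the character column: 0 is a number, the column is text
attempt <- tryCatch(
  mutate_all(toy, ~if_else(is.na(.), 0, .)),
  error = function(e) "type error"
)

# Only numeric columns are changed; comment keeps its NA
fixed <- mutate_if(toy, is.numeric, ~if_else(is.na(.), 0, .))
```

Here `attempt` ends up as the string "type error", while `fixed$flag` has its NA replaced by 0 and `fixed$comment` is left untouched.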

You are my Master!

Thank you!!!