Dear R users,
I have found a few text analysis packages that can be used for sentiment analysis, word clouds, or even phrase extraction with Rapid Automatic Keyword Extraction (RAKE), like this:
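## Using RAKE
library(udpipe)   # provides keywords_rake()
library(lattice)  # provides barchart()
# x is assumed to be a data frame of udpipe annotations (doc_id, lemma, upos columns)
stats <- keywords_rake(x = x, term = "lemma", group = "doc_id",
                       relevant = x$upos %in% c("NOUN", "ADJ"))
# Order the factor levels so the bars follow the RAKE ranking
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
barchart(key ~ rake, data = head(subset(stats, freq > 3), 20), col = "red",
         main = "Keywords identified by RAKE",
         xlab = "Rake")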
Unfortunately, I cannot find anything that lets me specify rules for assigning comments to specific categories.
For example, suppose we have a data frame of respondents and their comments, and therefore just two variables: “Unique respondent number” and his/her “comment”.
Now I need to apply the following rules to the “comment” variable for each “Unique respondent number”:
- If a sentence contains the word "charge" or "charges" but does not contain "benefit" or "benefits", a new variable called “Charges/ Fees” should be created and set to 1 for the record containing that sentence (otherwise 0).
- If a sentence has negative sentiment (I believe I could specify the words that determine whether the sentiment is negative, positive or neutral) and contains the phrase "savings rates", a new variable called “Poor Rates” should be created and set to 1 for the record containing that sentence (otherwise 0).
- If a sentence meets neither of the two criteria above, a new variable called “Other” should be created and set to 1 for the record containing that sentence (otherwise 0).
- If there is no comment, i.e. the “comment” field is blank or contains a value from a given list of NA strings (I guess I could set up this list with entries like “NA”, “No comment”, “Nothing to say”), a fourth variable called “Blank” should be created and set to 1 for this respondent (otherwise 0).
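To make the rules concrete, here is a rough base-R sketch of the kind of thing I imagine. It is only an illustration under several assumptions: the column names `id` and `comment`, the syntactic flag names (`charges_fees`, `poor_rates`, `other`, `blank`), the helper `classify_comment()`, and the short lists of blank values and negative expressions are all made up by me. Sentences are split naively on ".", "!" and "?", and the first example comment is written with "savings rates" so that it matches the phrase in the rule.

```r
# Hypothetical example data; the column names `id` and `comment` are assumptions.
df <- data.frame(
  id = 1:5,
  comment = c(
    "I have seen many various charges in my life, but I don't like your savings rates",
    "I like R Studio",
    "No comment",
    "Main benefit is having low charges",
    ""
  ),
  stringsAsFactors = FALSE
)

# Assumed lookup lists (lower case); both would need tuning for real data.
blank_values   <- c("", "na", "no comment", "nothing to say")
negative_words <- c("don't like", "dislike", "poor", "bad", "terrible")
negative_regex <- paste(negative_words, collapse = "|")

# Hypothetical helper: returns the four 0/1 flags for a single comment.
classify_comment <- function(comment) {
  # Normalise curly apostrophes, surrounding whitespace and case
  text <- tolower(trimws(gsub("\u2019", "'", comment)))
  if (is.na(comment) || text %in% blank_values) {
    return(c(charges_fees = 0, poor_rates = 0, other = 0, blank = 1))
  }
  # Naive sentence split: break after ., ! or ? followed by whitespace
  sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
  # Rule 1: "charge(s)" present and "benefit(s)" absent in the same sentence
  charges <- grepl("\\bcharges?\\b", sentences, perl = TRUE) &
    !grepl("\\bbenefits?\\b", sentences, perl = TRUE)
  # Rule 2: "savings rates" plus at least one negative expression
  poor <- grepl("savings rates", sentences, fixed = TRUE) &
    grepl(negative_regex, sentences, perl = TRUE)
  c(charges_fees = as.numeric(any(charges)),
    poor_rates   = as.numeric(any(poor)),
    other        = as.numeric(any(!charges & !poor)),  # Rule 3: neither rule fired
    blank        = 0)
}

# One row of flags per respondent, bound onto the original data frame
flags <- t(sapply(df$comment, classify_comment, USE.NAMES = FALSE))
df <- cbind(df, flags)
print(df[, c("id", "charges_fees", "poor_rates", "other", "blank")])
```

A real solution would probably replace the hand-made negative list with a proper sentiment lexicon and use a more robust sentence tokenizer, which is exactly the part I am hoping a package can handle.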
In the end, I should have four new variables added to the existing data frame (“Charges/ Fees”, “Poor Rates”, “Other”, “Blank”), each holding 1 or 0 for every Unique respondent number depending on whether the criteria above are met. Obviously, a comment can meet criteria 1-3 in various combinations, for example:
- If the first respondent’s comment was “I have seen many various charges in my life, but I don’t like your savings rates”, Unique respondent number=1 should get:
  - Charges/ Fees=1,
  - Poor Rates=1,
  - Other=0,
  - Blank=0
- If the second respondent’s comment was “I like R Studio”, Unique respondent number=2 should get:
  - Charges/ Fees=0,
  - Poor Rates=0,
  - Other=1,
  - Blank=0
- If the third respondent’s comment was “No comment”, Unique respondent number=3 should get:
  - Charges/ Fees=0,
  - Poor Rates=0,
  - Other=0,
  - Blank=1
- If the fourth respondent’s comment was “Main benefit is having low charges”, Unique respondent number=4 should get:
  - Charges/ Fees=0,
  - Poor Rates=0,
  - Other=1,
  - Blank=0
- If the fifth respondent’s comment field was left blank, Unique respondent number=5 should get:
  - Charges/ Fees=0,
  - Poor Rates=0,
  - Other=0,
  - Blank=1
Finally, I need to calculate the proportion of 1s in “Charges/ Fees”, “Poor Rates”, “Other” and “Blank” out of the total number of records in the df.
If my df had only the 5 records mentioned above, the results should be as follows:
- Charges/ Fees=20%
- Poor Rates=20%
- Other=40%
- Blank=40%
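For the proportions themselves, I assume something as simple as a column mean would do, since the mean of a 0/1 variable is the share of 1s. A small sketch with the five example respondents typed in by hand (the data frame `flags` and its column names are hypothetical):

```r
# Flags for the five example respondents, entered manually (hypothetical names)
flags <- data.frame(
  charges_fees = c(1, 0, 0, 0, 0),
  poor_rates   = c(1, 0, 0, 0, 0),
  other        = c(0, 1, 0, 1, 0),
  blank        = c(0, 0, 1, 0, 1)
)

# Share of records flagged 1 in each column, as a percentage of all records
shares <- round(100 * colMeans(flags), 1)
print(shares)
```

This gives 20, 20, 40 and 40 for the four columns. The shares add up to more than 100% because a single comment can fall into more than one category.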
Is it challenging enough?