Hi,
I have the following short data frame:
df <- data.frame(stringsAsFactors=FALSE,
  URN = c("aaa", "bbb", "ccc"),
  Q9a = c("Satisfied", "Contentsatifiedstressfree",
          "satisfied please d"),
  Q9b = c(" satisfying", "was not satisfied", "unhappy"),
  Q9c = c("happy", "pleased", NA)
)
df
and I want to recode the text into adjusted categories using regular expressions:
library(tidyverse)
library(stringr)
results <- df %>%
  mutate_at(vars(matches("Q\\d[a-c]")),
            .funs = list(Rec = ~ case_when(
              str_detect(., regex("Sati?|Satti?",
                                  ignore_case = TRUE, multiline = TRUE))
              & !str_detect(., regex("un?satis|dis?satis|no[nt][- ]?satis",
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Satisfied",
              str_detect(., regex("Unhappy",
                                  ignore_case = TRUE, multiline = TRUE))
              & !str_detect(., regex("Happy",
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Happy",
              str_detect(., regex("Happy",
                                  ignore_case = TRUE, multiline = TRUE))
              & !str_detect(., regex("Unhappy",
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Unhappy",
              str_detect(., regex("Please?", ignore_case = TRUE, multiline = TRUE))
              & !str_detect(., regex("Easy",
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Pleased"))
  )
results
My first question is: why were "unhappy" and "satisfied please d" not picked up as expected, even though ignore_case = TRUE was used?
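A small diagnostic may help narrow this down. The sketch below only checks which of the patterns from the code above match each string; `has()` is a hypothetical helper, not part of stringr. Two things seem worth checking: "unhappy" contains "happy", so a condition of the form "Unhappy and not Happy" can never be TRUE, and `case_when()` returns the first branch that matches, so a string that already matches the "Sati?" branch never reaches the "Please?" branch.

```r
library(stringr)

x <- c("unhappy", "satisfied please d", "happy")

# Hypothetical helper: case-insensitive detection, as in the code above
has <- function(text, pattern) {
  str_detect(text, regex(pattern, ignore_case = TRUE))
}

# One column per pattern, so it is visible which branches could fire
checks <- data.frame(
  text        = x,
  has_sati    = has(x, "Sati?|Satti?"),
  has_unhappy = has(x, "Unhappy"),
  has_happy   = has(x, "Happy"),
  has_please  = has(x, "Please?")
)
checks
```

In this table "unhappy" matches both the "Unhappy" and the "Happy" pattern, and "satisfied please d" matches both the "Sati?" and the "Please?" pattern, which suggests the issue is the order and logic of the branches rather than ignore_case.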
My second question is about simplifying pairs of opposite expressions such as "Happy"/"Unhappy" and "Satisfied"/"Dissatisfied". Is there any way of replacing these lines of code:
str_detect(., regex("Unhappy", ignore_case = TRUE, multiline = TRUE))
& !str_detect(., regex("Happy", ignore_case = TRUE, multiline = TRUE)) ~ "Happy",
str_detect(., regex("Happy", ignore_case = TRUE, multiline = TRUE))
& !str_detect(., regex("Unhappy", ignore_case = TRUE, multiline = TRUE)) ~ "Unhappy",
with something simpler that handles typical English negating prefixes such as "un", "dis", "not", etc.? Also, my "happy" response was incorrectly coded as "Unhappy" and I cannot figure out why.
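One possible direction, shown only as a sketch: test for the root word once, then test separately whether a negating prefix sits directly in front of it. `recode_root()` and its default `neg` pattern are hypothetical and would need extending for real survey text. Note that in the four lines quoted above the labels look swapped: the branch that detects "Unhappy" returns "Happy" and the other returns "Unhappy", which would explain "happy" being coded as "Unhappy".

```r
library(stringr)

# Hypothetical helper: classify `text` for one root word (e.g. "happy").
# `neg` is an assumed set of negating prefixes/words, not an exhaustive list.
recode_root <- function(text, root, pos_label, neg_label,
                        neg = "un|dis|non[- ]?|not\\s+") {
  has_root <- str_detect(text, regex(root, ignore_case = TRUE))
  # A negator immediately followed by the root, e.g. "unhappy" or "not happy"
  has_neg <- str_detect(text, regex(paste0("(", neg, ")", root),
                                    ignore_case = TRUE))
  ifelse(has_root & has_neg, neg_label,
         ifelse(has_root, pos_label, NA_character_))
}

happy_rec <- recode_root(c("happy", "unhappy", "Not happy", NA),
                         root = "happy",
                         pos_label = "Happy", neg_label = "Unhappy")
happy_rec
```

The same helper could be called with root = "satis" for the "Satisfied"/"Dissatisfied" pair, so each pair needs one call instead of two mirrored conditions.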
Lastly, I have an issue with vars(matches("Q\\d[a-c]")). This works for the df above, but another data frame I have uses the question names Q3_A, Q3_B, Q3_C instead of Q9a, Q9b, Q9c. How can I modify matches("Q\\d[a-c]") so that it has the same effect on the new variables? I have tried a few options, but none of them worked.
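A sketch of one pattern that might cover both naming schemes, assuming the only differences are the optional underscore and the letter case. The toy data frames `df_a` and `df_b` are made up for illustration. `matches()` ignores case by default, so the missing underscore in the pattern is the more likely reason "Q\\d[a-c]" does not pick up Q3_A.

```r
library(dplyr)

# Toy data frames with the two naming schemes (illustrative values only)
df_a <- data.frame(URN = "aaa", Q9a = "x", Q9b = "y", Q9c = "z")
df_b <- data.frame(URN = "aaa", Q3_A = "x", Q3_B = "y", Q3_C = "z")

# Assumed pattern: "Q", one or more digits, an optional underscore, then a to c.
# matches() has ignore.case = TRUE by default, so [a-c] also covers A to C.
pat <- "^Q\\d+_?[a-c]$"

cols_a <- names(select(df_a, matches(pat)))
cols_b <- names(select(df_b, matches(pat)))
cols_a
cols_b
```

The same pattern string can be dropped into vars(matches(...)) in the mutate_at() call above.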
Can you help?