vars(matches()) and regular expression issues

Hi,
I have the following short data frame:

df <- data.frame(stringsAsFactors=FALSE,
                 URN = c("aaa", "bbb", "ccc"),
                 Q9a = c("Satisfied", "Contentsatifiedstressfree",
                         "satisfied please d"),
                 Q9b = c(" satisfying", "was not satisfied", "unhappy"),
                 Q9c = c("happy", "pleased", NA)
)

df

and I want to recode the text into adjusted categories using regular expressions:

library(tidyverse)
library(stringr)

results <- df %>% 
  mutate_at(vars(matches("Q\\d[a-c]")),
            .funs = list(Rec = ~ case_when(
              str_detect(., regex("Sati?|Satti?", 
                                  ignore_case = TRUE, multiline = TRUE)) 
              & !str_detect(., regex("un?satis|dis?satis|no[nt][- ]?satis", 
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Satisfied",
              str_detect(., regex("Unhappy", 
                                  ignore_case = TRUE, multiline = TRUE)) 
              & !str_detect(., regex("Happy", 
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Happy",
              str_detect(., regex("Happy", 
                                  ignore_case = TRUE, multiline = TRUE)) 
              & !str_detect(., regex("Unhappy", 
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Unhappy",
              str_detect(., regex("Please?", ignore_case = TRUE, multiline = TRUE))
              & !str_detect(., regex("Easy", 
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Pleased"))
  )

  results

The first question I have is why "unhappy" and "satisfied please d" were not picked up when ignore_case had been used?

The second question is about simplifying opposite expressions like "Happy"/"Unhappy" and "Satisfied"/"Dissatisfied". Is there any way of replacing these lines of code:

str_detect(., regex("Unhappy", ignore_case = TRUE, multiline = TRUE)) 
& !str_detect(., regex("Happy", ignore_case = TRUE, multiline = TRUE)) ~ "Happy",
str_detect(., regex("Happy",  ignore_case = TRUE, multiline = TRUE)) 
& !str_detect(., regex("Unhappy",  ignore_case = TRUE, multiline = TRUE)) ~ "Unhappy",

with something simpler that covers the typical English negative prefixes such as "un", "dis", "not" etc.? Also, my "happy" response was incorrectly coded as "Unhappy" and I cannot figure out why.

Lastly, I have an issue with vars(matches("Q\\d[a-c]")). This code works for the df above, but another data frame I have has questions Q3_A, Q3_B, Q3_C instead of Q9a, Q9b, Q9c. How can I modify matches("Q\\d[a-c]") to have the same effect for the new variables? I have tried some options, but with no effect...

Can you help?

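The Happy/Unhappy branches have their logic reversed: a match on "unhappy" should map to "Unhappy", and a match on "happy" without "unhappy" should map to "Happy". That is why "happy" was coded as "Unhappy". Note also that every string containing "unhappy" also contains "happy", so the condition "Unhappy and NOT Happy" can never be true, which is why "unhappy" was not picked up at all.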
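The point about the Happy/Unhappy branches can be checked directly. The snippet below is only a minimal sketch; the vector x is made up for illustration and is not the survey data. It shows that the "unhappy and not happy" condition never fires, and that testing the more specific pattern first gives the intended labels.

```r
library(dplyr)
library(stringr)

x <- c("happy", "unhappy", "Very Unhappy", NA)  # illustrative input only

# "unhappy" contains "happy", so this condition is never TRUE
never <- str_detect(x, regex("unhappy", ignore_case = TRUE)) &
  !str_detect(x, regex("happy", ignore_case = TRUE))

# Test the more specific (negative) pattern first;
# case_when() uses the first branch whose condition is TRUE
rec <- case_when(
  str_detect(x, regex("unhappy", ignore_case = TRUE)) ~ "Unhappy",
  str_detect(x, regex("happy",   ignore_case = TRUE)) ~ "Happy"
)
rec
```

Because case_when() stops at the first matching branch, the order of the branches does the work that the "& !str_detect(...)" clauses were meant to do.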

multiline = TRUE only changes how ^ and $ behave when a string contains embedded newlines, so it is unnecessary here.

The expressions can be simplified by lowercasing the text beforehand.

Simplified example

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})

df <- data.frame(stringsAsFactors=FALSE,
                 URN = c("aaa", "bbb", "ccc"),
                 Q9a = c("Satisfied", "Contentsatifiedstressfree",
                         "satisfied please d"),
                 Q9b = c(" satisfying", "was not satisfied", "unhappy"),
                 Q9c = c("happy", "pleased", NA))

df %>% mutate(Q9c = tolower(Q9c)) %>%
       select(Q9c) %>%
       mutate(Q9c_Rec = case_when(
         str_detect(Q9c, "^happy") ~ "Happy",
         str_detect(Q9c, "unhappy") ~ "Unhappy",
         str_detect(Q9c, "please?") ~ "Happy"
       ))
#>       Q9c Q9c_Rec
#> 1   happy   Happy
#> 2 pleased   Happy
#> 3    <NA>    <NA>

df %>% mutate(Q9b = tolower(Q9b)) %>%
  select(Q9b) %>%
  mutate(Q9b_Rec = case_when(
    str_detect(Q9b, "^happy") ~ "Happy",
    str_detect(Q9b, "unhappy") ~ "Unhappy",
    str_detect(Q9b, "please?") ~ "Happy"
  ))
#>                 Q9b Q9b_Rec
#> 1        satisfying    <NA>
#> 2 was not satisfied    <NA>
#> 3           unhappy Unhappy

Created on 2020-09-07 by the reprex package (v0.3.0)

Thank you, but I can see separate sets of code: one for Q9c and one for Q9b. My solution should handle all these variables in one piece of code without specifying the variables' names. The code simply needs to find variables starting with Q and ending in a, b, c, d (or A, B, C, D). Also, I would still need to list all the options in your solution separately (unsatisfied, not satisfied, dissatisfied etc.). Is there any way of applying the prefixes to all adjectives? Almost every English adjective becomes negative after adding "un", "dis", "not" etc.

The principal problem presented was the pattern matching in the case_when block, which is what I addressed, rather than the wrapper that applies it to the different columns. For a more NLP-oriented approach, see packages such as tidytext, which tokenizes text and can be combined with a stemmer such as SnowballC to reduce words in a corpus to their roots. Whether that level of generality is appropriate for the data in this use case, I cannot say.

I can also see that adding ^ at the beginning of the pattern would exclude responses with a leading space, so the first response for Q9b (" satisfying") would come out as missing rather than Satisfied...

The example did not address that class of patterns, which deals with variants of satisfy. The principle, however, is identical. As with tolower(), begin with a pass to remove whitespace before moving on to distinguishing between positive and negative valences. That can be done with a pass to eliminate un, dis, non, etc. before testing for the presence of the satisf root.
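One way to read the passes described above is sketched below. The helper classify_root(), its arguments, and the "Not ..." label are hypothetical names introduced only for illustration, and the negation pattern is an assumption that will also catch unrelated words beginning with "un", "dis" or "non" (for example "understand").

```r
library(stringr)

# Hypothetical helper: classify one root after a cleaning pass
classify_root <- function(x, root = "satisf", label = "Satisfied") {
  clean    <- str_squish(str_to_lower(x))   # lower-case, trim, collapse spaces
  # negative prefix at the start of a word, or the standalone word "not"
  negated  <- str_detect(clean, "\\b(un|dis|non)|\\bnot\\b")
  has_root <- str_detect(clean, root)
  ifelse(has_root & !negated, label,
         ifelse(has_root & negated,
                paste("Not", str_to_lower(label)),
                NA_character_))
}

out <- classify_root(c(" satisfying", "was not satisfied",
                       "Dissatisfied", "unhappy"))
out
```

Here the leading space in " satisfying" no longer matters because it is removed before any pattern is tested, so no ^ anchor is needed.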

I still cannot find a final solution myself after looking at technocrat's solution.

I was expecting to see:

  1. A fix to mutate_at(vars(matches("Q\\d[a-c]"))) that takes the new variable names into account (Q3_A, Q3_B, Q3_C instead of Q9a, Q9b, Q9c) in one line, not in three separate sets of code as in technocrat's solution.
  2. Some sort of one-line code that takes into account all possible prefixes that change the meaning to a negative sentiment ("un", "dis", "not" etc.).
  3. Another version of "^happy" that takes into account spaces around the words searched for by str_detect. Perhaps there is a one-line solution that works like the TRIM function in Excel?

Are the above doable in R? I still believe in its power!

Q\d([a-c]|[A-C])
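The pattern above has no place for the underscore in Q3_A, and matches() already ignores case by default (ignore.case = TRUE), so the upper-case alternative is not needed. A minimal sketch, assuming the columns are named Q, digits, an optional underscore, then one letter a to d; the toy data frames old and new only mimic the two naming schemes:

```r
library(dplyr)

# Toy data frames that only mimic the two naming schemes from the question
old <- data.frame(URN = "aaa", Q9a = "x", Q9b = "y", Q9c = "z")
new <- data.frame(URN = "aaa", Q3_A = "x", Q3_B = "y", Q3_C = "z")

# Q, one or more digits, an optional underscore, then one letter a-d;
# matches() has ignore.case = TRUE by default, so A-D are covered as well
q_pattern <- "^Q\\d+_?[a-d]$"

old_cols <- names(select(old, matches(q_pattern)))
new_cols <- names(select(new, matches(q_pattern)))
old_cols
new_cols
```

The same q_pattern can be passed to vars(matches(...)) in mutate_at() or to across(matches(...)) in mutate().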

(\bun)|(\bdis)|(\bnot)/gi
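The trailing /gi in the pattern above is the flag notation used by JavaScript-style regex tools, not R syntax. A rough stringr equivalent is sketched below: backslashes are doubled inside an R string, the i flag becomes ignore_case = TRUE, and the g flag has no counterpart because str_detect() only needs one match. The vector responses is illustrative only.

```r
library(stringr)

# Same alternation, written for stringr; ignore_case replaces the /i flag
neg_prefix <- regex("\\b(un|dis|not)", ignore_case = TRUE)

responses <- c("Unhappy", "was NOT satisfied", "dissatisfied", "happy")  # illustrative
is_negative <- str_detect(responses, neg_prefix)
is_negative
```

Note that a prefix test like this is only a heuristic: it would also flag words such as "under" or "distance".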

I don't understand this...

With regex you can match multiple spaces and replace them with a single space (if that's what you mean?):
\s\s+
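In stringr the whitespace pattern above could be used as in the sketch below (the strings in x are illustrative). Collapsing runs of spaces does not remove a single leading or trailing space, so str_squish(), which behaves much like TRIM in Excel, is shown next to it.

```r
library(stringr)

x <- c(" satisfying", "was   not  satisfied ")   # illustrative strings

# Collapsing runs of whitespace leaves a single leading/trailing space in place
collapsed <- str_replace_all(x, "\\s\\s+", " ")

# str_squish() collapses internal runs and also trims both ends
squished <- str_squish(x)

collapsed
squished
```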

Thank you.
To answer your question: there are phrases with a space in front, and using "^satisf?" would ignore them. I would like to pick up words with a space in front like " satisfying" but ignore words with a negative prefix like "not satisfying" or "dissatisfied".

Also, can I apply your code

(\bun)|(\bdis)|(\bnot)/gi

to all lines of my code? Please look at my original question.

library(tidyverse)

(df <- data.frame(stringsAsFactors=FALSE,
                 URN = c("aaa", "bbb", "ccc"),
                 Q9a = c("Satisfied", "Contentsatifiedstressfree",
                         "satisfied please d"),
                 Q9b = c(" satisfying", "was not satisfied", "unhappy"),
                 Q9C = c("happy", "pleased", NA))) #uppercase C

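# TRUE when the text contains a word starting with "un" or "dis", or "not "
negation <- function(x){
  str_detect(tolower(x),"(\\bun)|(\\bdis)|(not )")
}


# Creates a function match_<x> in the global environment; it returns TRUE when
# the root x is found (its last character is optional) and the text is not negated
mmake <- function(x){
  assign(paste0("match_",x),function(a){
    str_detect(tolower(a),paste0(x,"?")) & !negation(a)
  },envir = .GlobalEnv)
}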

mmake("satis")
mmake("happy")
mmake("please")


df %>% mutate(across(matches("Q\\d([a-c]|[A-C])"),
                     list("satis"  = match_satis,
                          "happy"  = match_happy,
                          "please" = match_please),
                     .names = "{.col}_{.fn}"))
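The code above returns one TRUE/FALSE column per root. If recoded categories are wanted instead, as in the original question, the same ideas can be folded into a single function. This is only a sketch under assumptions: recode_sentiment() is a hypothetical helper, the "Dissatisfied" label and the negation pattern are illustrative choices, and the column pattern assumes names like Q9a or Q3_A.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  URN = c("aaa", "bbb", "ccc"),
  Q9a = c("Satisfied", "Contentsatifiedstressfree", "satisfied please d"),
  Q9b = c(" satisfying", "was not satisfied", "unhappy"),
  Q9C = c("happy", "pleased", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one recode function instead of one function per root
recode_sentiment <- function(x) {
  clean   <- str_squish(str_to_lower(x))   # lower-case and trim first
  negated <- str_detect(clean, "\\b(un|dis|non)|\\bnot\\b")
  case_when(
    str_detect(clean, "sati")  & !negated ~ "Satisfied",
    str_detect(clean, "sati")  &  negated ~ "Dissatisfied",
    str_detect(clean, "happy") & !negated ~ "Happy",
    str_detect(clean, "happy") &  negated ~ "Unhappy",
    str_detect(clean, "pleas") & !negated ~ "Pleased"
  )
}

# matches() ignores case by default, so Q9C is picked up as well
results <- df %>%
  mutate(across(matches("^Q\\d+_?[a-d]$"), recode_sentiment,
                .names = "{.col}_Rec"))
results
```

As in the original code, only the first matching branch is used, so "satisfied please d" is recoded as "Satisfied" rather than "Pleased".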