Vars matches and regular expressions issues

Hi,
I have the following short data frame:

df <- data.frame(stringsAsFactors=FALSE,
                 URN = c("aaa", "bbb", "ccc"),
                 Q9a = c("Satisfied", "Contentsatifiedstressfree",
                         "satisfied please d"),
                 Q9b = c(" satisfying", "was not satisfied", "unhappy"),
                 Q9c = c("happy", "pleased", NA)
)

df

and I want to recode the text into adjusted categories using regular expressions:

library(tidyverse)
library(stringr)

results <- df %>% 
  mutate_at(vars(matches("Q\\d[a-c]")),
            .funs = list(Rec = ~ case_when(
              str_detect(., regex("Sati?|Satti?", 
                                  ignore_case = TRUE, multiline = TRUE)) 
              & !str_detect(., regex("un?satis|dis?satis|no[nt][- ]?satis", 
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Satisfied",
              str_detect(., regex("Unhappy", 
                                  ignore_case = TRUE, multiline = TRUE)) 
              & !str_detect(., regex("Happy", 
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Happy",
              str_detect(., regex("Happy", 
                                  ignore_case = TRUE, multiline = TRUE)) 
              & !str_detect(., regex("Unhappy", 
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Unhappy",
              str_detect(., regex("Please?", ignore_case = TRUE, multiline = TRUE))
              & !str_detect(., regex("Easy", 
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Pleased"))
  )

  results

The first question I have is why "unhappy" and "satisfied please d" were not picked up even though ignore_case was used.

The second question is really about a way of simplifying opposite expressions like "Happy"/"Unhappy" and "Satisfied"/"Dissatisfied". Is there any way of replacing these lines of code:

str_detect(., regex("Unhappy", ignore_case = TRUE, multiline = TRUE)) 
& !str_detect(., regex("Happy", ignore_case = TRUE, multiline = TRUE)) ~ "Happy",
str_detect(., regex("Happy",  ignore_case = TRUE, multiline = TRUE)) 
& !str_detect(., regex("Unhappy",  ignore_case = TRUE, multiline = TRUE)) ~ "Unhappy",

by something easier that covers typical English negating prefixes such as "un", "dis", "not" etc.? Also, my "happy" statement was incorrectly coded as "Unhappy" and I cannot figure out why.

Lastly, I have issues with vars(matches("Q\\d[a-c]")). This code works for the df above, but another data frame I have has questions Q3_A, Q3_B, Q3_C instead of Q9a, Q9b, Q9c. How can I modify matches("Q\\d[a-c]") to have the same effect for the new variables? I have tried some options but with no effect...

Can you help?

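The last code block has its labels reversed: the branch that detects "Unhappy" should return "Unhappy", and the branch that detects "Happy" should return "Happy". Note also that "unhappy" contains "happy", so a test for Unhappy and NOT Happy can never be true; test for "unhappy" first instead.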

multiline is unnecessary unless there are embedded newlines.

The expressions can be simplified by lowercasing beforehand.
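A minimal sketch of why the original happy/unhappy branches misfire, assuming only dplyr and stringr; the vector `x` is made-up illustration data, not part of the original data frame.

```r
library(dplyr)
library(stringr)

x <- c("happy", "unhappy")

# "unhappy" contains "happy", so the "happy" pattern matches both values
has_happy   <- str_detect(x, regex("happy", ignore_case = TRUE))
has_unhappy <- str_detect(x, regex("unhappy", ignore_case = TRUE))

# The original first branch (Unhappy AND NOT Happy) is FALSE for every value
never_true <- has_unhappy & !has_happy

# case_when() returns the first branch that is TRUE, so testing the more
# specific pattern first gives the intended labels without any "& !" clause
labels <- case_when(
  has_unhappy ~ "Unhappy",
  has_happy   ~ "Happy"
)
```

Here `never_true` is FALSE for both values, and `labels` is "Happy" for "happy" and "Unhappy" for "unhappy".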

Simplified example

suppressPackageStartupMessages({library(dplyr)
library(stringr)})

df <- data.frame(stringsAsFactors=FALSE,
URN = c("aaa", "bbb", "ccc"),
Q9a = c("Satisfied", "Contentsatifiedstressfree",
"satisfied please d"),
Q9b = c(" satisfying", "was not satisfied", "unhappy"),
Q9c = c("happy", "pleased", NA))

df %>% mutate(Q9c = tolower(Q9c)) %>%
       select(Q9c) %>%
       mutate(Q9c_Rec = case_when(
         str_detect(Q9c, "^happy") ~ "Happy",
         str_detect(Q9c, "unhappy") ~ "Unhappy",
         str_detect(Q9c, "please?") ~ "Happy"
       ))
#>       Q9c Q9c_Rec
#> 1   happy   Happy
#> 2 pleased   Happy
#> 3    <NA>    <NA>

df %>% mutate(Q9b = tolower(Q9b)) %>%
  select(Q9b) %>%
  mutate(Q9b_Rec = case_when(
    str_detect(Q9b, "^happy") ~ "Happy",
    str_detect(Q9b, "unhappy") ~ "Unhappy",
    str_detect(Q9b, "please?") ~ "Happy"
  ))
#>                 Q9b Q9b_Rec
#> 1        satisfying    <NA>
#> 2 was not satisfied    <NA>
#> 3           unhappy Unhappy

Created on 2020-09-07 by the reprex package (v0.3.0)

Thank you, but I can see separate sets of code: one for Q9c, one for Q9b. My solution should handle all these variables in one piece of code without specifying the variables' names. The code simply needs to find variables starting with Q and ending in a, b, c, d (or A, B, C, D). Also, I would still need to list all the options in your solution separately (unsatisfied, not satisfied, dissatisfied etc.). Is there any way of applying prefixes to all adjectives? Almost every English adjective becomes negative after adding "un", "dis", "not" etc.

The principal problem presented was the pattern matching in the case_when block, which I addressed, rather than the wrapper that applies it to the different columns. For a more NLP-oriented approach, see packages such as tidytext, which help extract linguistic tokens from a corpus and can be paired with a stemmer. Whether that level of generality is appropriate to the data in your use case, I cannot say.

I can also see that adding ^ at the beginning of a pattern would exclude answers with a leading space, such as " satisfying", so the first response for Q9b would be missing rather than Satisfied...

The example did not address that class of patterns, the variants of satisfy. The principle, however, is identical. As with tolower(), begin with a pass to remove whitespace before moving on to distinguishing between positive and negative valences. That can be done with a pass that eliminates un, dis, non, etc. before testing for the presence of the satisf root.
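The passes described above could look roughly like this. It is a sketch under assumptions: the `answers` vector, the "Not satisfied" label and the negation pattern are illustrative choices, and the pattern is deliberately crude (it would also flag words such as "under" or "discuss").

```r
library(stringr)

answers <- c(" satisfying", "was not satisfied", "Dissatisfied", "UNHAPPY", NA)

# Pass 1: lower-case, trim both ends, collapse repeated whitespace
clean <- str_squish(str_to_lower(answers))

# Pass 2: record whether a negating prefix/word is present, then strip it
neg_pattern <- "\\b(un|dis|non)|\\bnot\\s+"
negated  <- str_detect(clean, neg_pattern)
stripped <- str_remove_all(clean, neg_pattern)

# Pass 3: test for the root only, then combine root and valence
has_satisf <- str_detect(stripped, "satisf")
recoded <- ifelse(has_satisf,
                  ifelse(negated, "Not satisfied", "Satisfied"),
                  NA_character_)
```

Because the leading space is removed in the first pass, " satisfying" is recoded as "Satisfied" even if the root pattern is later anchored with ^.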

I still cannot find a final solution myself after looking at technocrat's solution.

I was expecting to see:

  1. A fix to mutate_at(vars(matches("Q\\d[a-c]"))) that takes the new variable names into account (Q3_A, Q3_B, Q3_C instead of Q9a, Q9b, Q9c) in one line, not in three separate sets of code as in technocrat's solution.
  2. Some sort of one-line code that takes into account all possible prefixes that turn the meaning into negative sentiment ("un", "dis", "not" etc.).
  3. Another version of "^happy" that takes into account spaces around the words searched for by str_detect. Perhaps there is a one-line solution that works like the TRIM function in Excel?

Are the above doable in R? I still believe in its power!

Q\d([a-c]|[A-C])
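In R the backslash has to be doubled inside a string, and the pattern above would still miss names with an underscore such as Q3_A. A sketch of one pattern that covers both naming schemes, assuming the column names look like the ones in this thread (`df1`, `df2` and `q_cols` are illustrative names):

```r
library(dplyr)

df1 <- data.frame(URN = "aaa", Q9a = "x", Q9b = "y", Q9C = "z")
df2 <- data.frame(URN = "aaa", Q3_A = "x", Q3_B = "y", Q3_C = "z")

# Q, one or more digits, an optional underscore, then one letter a-d.
# matches() ignores case by default, so [a-d] also covers A-D.
q_cols <- "^Q\\d+_?[a-d]$"

cols1 <- names(select(df1, matches(q_cols)))
cols2 <- names(select(df2, matches(q_cols)))
```

The ^ and $ anchors keep the pattern from matching unrelated columns that merely contain something like "Q1a" in the middle of their name.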

(\bun)|(\bdis)|(\bnot)/gi

I don't understand this...

With regex you can match multiple spaces and replace them with a single space (if that's what you mean?):
\s\s+
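In stringr terms, the whitespace handling could be sketched as follows (the vector `x` is made up for illustration). str_trim() is the closest equivalent to trimming the ends, and str_squish() also collapses internal runs of whitespace.

```r
library(stringr)

x <- c(" satisfying", "was   not  satisfied ", "happy")

trimmed   <- str_trim(x)                          # strips leading/trailing whitespace only
collapsed <- str_replace_all(x, "\\s\\s+", " ")   # runs of 2+ whitespace become one space
squished  <- str_squish(x)                        # trims the ends and collapses runs

# After squishing, an anchored pattern finds the answer with the leading space
starts_satisf <- str_detect(squished, "^satisf")
```

With this input, `starts_satisf` is TRUE for the first value only, since "was not satisfied" does not start with the root.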

Thank you.
Answering your question. There are phrases with a space in front but using "^satisf?" would ignore these phrases. I would like to pick up words with a space in front like " satisfying" but ignore words with a negative prefix like "not satisfying" or "dissatisfied".

Also, can I apply your code

(\bun)|(\bdis)|(\bnot)/gi

to all lines of my code? Please look at my original question.
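For reference, the pattern quoted above is written in JavaScript style (/gi are flags). A rough stringr translation, as a sketch: case-insensitivity goes into regex(), and no "global" flag is needed because str_detect() only asks whether a match exists. The test vector is made up.

```r
library(stringr)

# Word boundary followed by "un", "dis" or "not", ignoring case
negation_pattern <- regex("\\b(un|dis|not)", ignore_case = TRUE)

x <- c("Unhappy", "not satisfied", "Dissatisfied", "happy", "fun")
is_negated <- str_detect(x, negation_pattern)
```

The word boundary keeps "fun" from counting as negated, but a prefix test like this would still flag ordinary words such as "understand" or "discuss", so it is only a heuristic.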

library(tidyverse)

(df <- data.frame(stringsAsFactors=FALSE,
                 URN = c("aaa", "bbb", "ccc"),
                 Q9a = c("Satisfied", "Contentsatifiedstressfree",
                         "satisfied please d"),
                 Q9b = c(" satisfying", "was not satisfied", "unhappy"),
                 Q9C = c("happy", "pleased", NA))) #uppercase C

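# TRUE if the text has a word starting with "un" or "dis", or contains "not "
negation <- function(x){
  str_detect(tolower(x),"(\\bun)|(\\bdis)|(not )")  
}


# Creates a function match_<x>() in the global environment.
# paste0(x,"?") makes the last character of x optional, e.g. "satis?" also matches "sati".
mmake <- function(x){
  
  assign(paste0("match_",x),function(a){
    str_detect(tolower(a),paste0(x,"?")) & !negation(tolower(a))
  },envir = .GlobalEnv)
}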

mmake("satis")
mmake("happy")
mmake("please")


df %>% mutate(across(matches("Q\\d([a-c]|[A-C])"),
                          list("satis"=match_satis,
                               "happy"=match_happy,
                               "please"=match_please
),.names = "{.col}_{.fn}"))

Looks amazing, but I get the following error:

Error in across(matches("Q\\d([a-c]|[A-C])"), list(satis = match_satis,  : 
  could not find function "across"

You need the latest version of dplyr. across() was added in June, in dplyr 1.0.0.
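A quick way to check whether the installed dplyr is recent enough, as a small sketch using base R only:

```r
# Installed version of dplyr
dplyr_version <- packageVersion("dplyr")

# across() is available from dplyr 1.0.0 onwards
is_recent  <- dplyr_version >= "1.0.0"
has_across <- "across" %in% getNamespaceExports("dplyr")
```

If `has_across` is FALSE, updating dplyr with install.packages("dplyr") and restarting the R session should make across() available.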

Thank you but the output is TRUE/FALSE and 9 categories instead of just 3 are created...

I think you are using the word category instead of variable, but ok.
What are the 3 variables that should be created, and what should they contain?

I still want my original df layout so I want recoded variables Q9a_Rec, Q9b_Rec and Q9C_Rec as a result but with recoded phrases:

library(tidyverse)
library(stringr)

results <- df %>% 
  mutate_at(vars(matches("Q\\d[a-c]|[A-C]")),
            .funs = list(Rec = ~ case_when(
              str_detect(., regex("Sati?|Satti?", 
                                  ignore_case = TRUE, multiline = TRUE)) 
              & !str_detect(., regex("un?satis|dis?satis|no[nt][- ]?satis", 
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Satisfied",
              str_detect(., regex("Unhappy", 
                                  ignore_case = TRUE, multiline = TRUE)) 
              & !str_detect(., regex("Happy", 
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Happy",
              str_detect(., regex("Happy", 
                                  ignore_case = TRUE, multiline = TRUE)) 
              & !str_detect(., regex("Unhappy", 
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Unhappy",
              str_detect(., regex("Please?", ignore_case = TRUE, multiline = TRUE))
              & !str_detect(., regex("Easy", 
                                     ignore_case = TRUE, multiline = TRUE)) ~ "Pleased"))
  )

  results

Unfortunately, "happy" and "unhappy" are handled in separate branches, each with an extra & !str_detect line to exclude the opposite.
Also, my code is not working properly: Q9c for aaa, Q9b for bbb and Q9b for ccc are incorrectly allocated after running it.
I want just 3 new variables. This example contains only "happy", "satisfied" and "easy" but, as you can imagine, the real data contains dozens of phrases like these.
Without a solution, my code would become huge after adding other records with phrases like "efficient", "helpful", "friendly" etc.

Can you help please?

I would do it like this

library(tidyverse)
library(stringr)


(df <- data.frame(stringsAsFactors=FALSE,
                  URN = c("aaa", "bbb", "ccc"),
                  Q9a = c("Satisfied", "Contentsatifiedstressfree",
                          "satisfied please d"),
                  Q9b = c(" satisfying", "was not satisfied", "unhappy"),
                  Q9C = c("happy", "pleased", NA))) #uppercase C


negation <- function(x){
  str_detect(tolower(x),"(\\bun)|(\\bdis)|(not )")  
}


mmake <- function(x){
  
  assign(paste0("match_",x),function(a){
   case_when(str_detect(tolower(a),paste0(x,"?")) & !negation(tolower(a)) ~ x,  # plain match
             str_detect(tolower(a),paste0(x,"?")) ~ paste0("!",x),               # negated match
             TRUE ~ NA_character_)                                                # no match
  },envir = .GlobalEnv)
}

mmake("satis")
mmake("happy")
mmake("please")

masterf <- function(x){
coalesce(match_satis(x),
         match_happy(x),
         match_please(x))
}

df %>% mutate(across(matches("Q\\d([a-c]|[A-C])"),
                     list("result"=masterf),.names = "{.col}_{.fn}"))

#  URN                        Q9a                Q9b      Q9C  Q9a_result  Q9b_result  Q9C_result
#1 aaa                  Satisfied         satisfying    happy       satis       satis       happy
#2 bbb  Contentsatifiedstressfree  was not satisfied  pleased       satis      !satis      please
#3 ccc         satisfied please d            unhappy     <NA>       satis      !happy        <NA>
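The same logic can also be written without assign() into the global environment. The following is a sketch of that design choice, not a replacement for the code above: `make_matcher`, `stems`, `matchers` and `recode_answer` are hypothetical names, and the matching rules are copied unchanged.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  URN = c("aaa", "bbb", "ccc"),
  Q9a = c("Satisfied", "Contentsatifiedstressfree", "satisfied please d"),
  Q9b = c(" satisfying", "was not satisfied", "unhappy"),
  Q9C = c("happy", "pleased", NA),
  stringsAsFactors = FALSE
)

negation <- function(x) str_detect(str_to_lower(x), "(\\bun)|(\\bdis)|(not )")

# Function factory: returns a matcher for one stem. As above, the trailing
# "?" makes the last character of the stem optional.
make_matcher <- function(stem) {
  force(stem)
  function(a) {
    hit <- str_detect(str_to_lower(a), paste0(stem, "?"))
    case_when(
      hit & !negation(a) ~ stem,
      hit                ~ paste0("!", stem),
      TRUE               ~ NA_character_
    )
  }
}

stems    <- c("satis", "happy", "please")
matchers <- lapply(stems, make_matcher)

# First non-missing result across all matchers, in the order of `stems`
recode_answer <- function(x) {
  do.call(coalesce, lapply(matchers, function(f) f(x)))
}

result <- df %>%
  mutate(across(matches("Q\\d([a-c]|[A-C])"), recode_answer,
                .names = "{.col}_result"))
```

Adding a new phrase then only requires extending `stems`; no new match_ function or coalesce() argument has to be written by hand.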

Absolutely brilliant, master!
I have some questions though:

  1. Using str_detect and regex allows picking up words with spelling mistakes ("satisfied" but also "sattisfied", "friendly" but also "fredly", etc.). How can I do that with the function? Do I need to specify just the part of the word that is always the same? What about excluding similar words with a different meaning? For example, I need to pick up "pleased" but not "pleasant" or "pleasent" (so the -ed suffix is fine but -ant or -ent are not). Is this possible?
  2. I'm new to this sophisticated function solution, so can you let me know whether the names passed to mmake must be identical to the ones used in coalesce (so "satisf" must become match_satisf)? I understand that this name ("satisf") will appear in the _result columns and that I cannot simply change it to "Satisfied". I understand I can recode it later?
    Please have a look at my expanded df and my results to see clearly what my challenges are:
df <- data.frame(stringsAsFactors=FALSE,
                 URN = c("aaa", "bbb", "ccc", "ddd", "eee"),
                 Q9a = c("Satisfied", "Contentsatifiedstressfree",
                         "satisfied please d", "not efficient", "pleasent"),
                 Q9b = c(" satisfying", "was not satisfied", "unhappy", "efficient", "satified"),
                 Q9C = c("happy", "pleasant", NA, "unfriendly", "unfredly")
)

df

library(tidyverse)

negation <- function(x){
  str_detect(tolower(x),"(\\bun)|(\\bdis)|(not )")  
}


mmake <- function(x){
  
  assign(paste0("match_",x),function(a){
    case_when(str_detect(tolower(a),paste0(x,"?")) & !negation(tolower(a)) ~ x,
              str_detect(tolower(a),paste0(x,"?")) ~ paste0("!",x),
              TRUE ~ NA_character_)
  },envir = .GlobalEnv)
}

mmake("satisf")
mmake("happy")
mmake("please")
mmake("efficient")
mmake("friendly")

masterf <- function(x){
  coalesce(match_satisf(x),
           match_happy(x),
           match_please(x),
           match_efficient(x),
           match_friendly(x))
}

results <- df %>% mutate(across(matches("Q\\d([a-c]|[A-C])"),
                     list("result"=masterf),.names = "{.col}_{.fn}"))

results
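One possible way to handle both points is a lookup of final labels and regular expressions instead of bare stems. This is only a sketch under assumptions: the `patterns` vector, the "Not ..." labels and `recode_answer` are hypothetical, and the patterns are guesses tuned to the misspellings in the example data rather than a general spelling corrector.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  URN = c("aaa", "bbb", "ccc", "ddd", "eee"),
  Q9a = c("Satisfied", "Contentsatifiedstressfree",
          "satisfied please d", "not efficient", "pleasent"),
  Q9b = c(" satisfying", "was not satisfied", "unhappy", "efficient", "satified"),
  Q9C = c("happy", "pleasant", NA, "unfriendly", "unfredly"),
  stringsAsFactors = FALSE
)

# Final label -> pattern; earlier entries win when several match
patterns <- c(
  Satisfied = "sat+is?f",             # satisf, sattisf, satif
  Happy     = "happy",
  Pleased   = "pleas(e|ed|ing)\\b",   # please(d), pleasing; not pleasant/pleasent
  Efficient = "ef+icient",
  Friendly  = "fr(ie|e|i)n?dly"       # friendly, fredly
)

recode_answer <- function(x) {
  clean   <- str_squish(str_to_lower(x))
  negated <- str_detect(clean, "\\b(un|dis|non)|\\bnot\\b")
  out <- rep(NA_character_, length(x))
  for (label in names(patterns)) {
    hit <- str_detect(clean, patterns[[label]]) & is.na(out)
    hit[is.na(hit)] <- FALSE            # missing answers never match
    out[hit] <- label
  }
  # Prefix the label when a negation was found in the same answer
  ifelse(negated & !is.na(out), paste("Not", str_to_lower(out)), out)
}

results <- df %>%
  mutate(across(matches("^Q\\d+_?[a-d]$"), recode_answer, .names = "{.col}_Rec"))
```

With the expanded data this leaves "pleasant" and "pleasent" as NA, recodes "satified" as "Satisfied", and turns "unfredly" into "Not friendly". The names in `patterns` are the labels that end up in the _Rec columns, so no later recoding step is needed.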
