str_detect without negatives

Slavek · January 20, 2021, 3:58pm

Hi,
I have this simple data frame:

df <- data.frame(stringsAsFactors=FALSE,
                 URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii", "jjj","kkk","lll"),
                 all_comment = c("I trust", "untrusting", NA, "not trusty", "trustworthy", "he is not honest", "dishonest person", "reliable guy", "he is unreliable", "like it","doesn't like","unlikely")
)

df

Now I am trying to flag all sentences with "trust" or its synonyms but I must exclude phrases with a negative meaning of trust (so with prefixes like "dis", "un", "not "). I have done this:

library(dplyr)
library(stringr)

TRUST.RESULT <- df %>% 
  mutate(
    TMC.TRUST = if_else(str_detect(all_comment, regex("trust|
trusting|
trustworthy|
trust-worthy|
trusty|
confident|
confidence|
honest|
honesty|
reliable|
reliability|
safe|
safety|
secure|
security|
assured|
care|
careful|
dependable|
sure|
integrity|
genuine|
professional|
profesional|
proffessional|
proffesional", ignore_case = TRUE, multiline = TRUE))
                        &!str_detect(all_comment, regex("untrust|
untrusting|
untrustworthy|
untrust-worthy|
untrusty|
unconfident|
unconfidence|
unhonest|
unhonesty|
unreliable|
unreliability|
unsafe|
unsafety|
unsecure|
unsecurity|
unassured|
uncare|
uncareful|
undependable|
unsure|
unintegrity|
ungenuine|
unprofessional|
unprofesional|
unproffessional|
unproffesional|
distrust|
distrusting|
distrustworthy|
distrust-worthy|
distrusty|
disconfident|
disconfidence|
dishonest|
dishonesty|
disreliable|
disreliability|
dissafe|
dissafety|
dissecure|
dissecurity|
disassured|
discare|
discareful|
disdependable|
dissure|
disintegrity|
disgenuine|
disprofessional|
disprofesional|
disproffessional|
disproffesional|
not//strust|
not//strusting|
not//strustworthy|
not//strust-worthy|
not//strusty|
not//sconfident|
not//sconfidence|
not//shonest|
not//shonesty|
not//sreliable|
not//sreliability|
not//ssafe|
not//ssafety|
not//ssecure|
not//ssecurity|
not//sassured|
not//scare|
not//scareful|
not//sdependable|
not//ssure|
not//sintegrity|
not//sgenuine|
not//sprofessional|
not//sprofesional|
not//sproffessional|
not//sproffesional", ignore_case = TRUE)),  1, 0),
TMC.LIKE = if_else(str_detect(all_comment, regex("Like", ignore_case = TRUE, multiline = TRUE))
                        &!str_detect(all_comment, regex("dislike|
unlikely", ignore_case = TRUE)),  1, 0)
  )

TRUST.RESULT

but I am sure there is a way of replacing

&!str_detect(all_comment, regex

by something else (not case sensitive).

Also, I don't know why "reliable guy" (respondent hhh) is not picked up but "unlikely" (respondent lll) is.

Can you help please?

technocrat · January 20, 2021, 6:36pm

As a first approximation

suppressPackageStartupMessages({
  library(dplyr)
})

all_comments <- c("I","am","a","trustworthy","person") %>% tolower()
syns <- c("trustworthy","trust-worthy","trusty","confident","confidence","honest","honesty","reliable","reliability","safe","safety","secure","security","assured","care","careful","dependable","sure","integrity","genuine","professional","profesional","proffessional","untrusting","untrustworthy","untrust-worthy","untrusty","unconfident","unconfidence","unhonest","unhonesty","unreliable","unreliability","unsafe","unsafety","unsecure","unsecurity","unassured","uncare","uncareful","undependable","unsure","unintegrity","ungenuine","unprofessional","unprofesional","unproffessional","unproffesional","distrust","distrusting","distrustworthy","distrust-worthy","distrusty","disconfident","disconfidence","dishonest","dishonesty","disreliable","disreliability","dissafe","dissafety","dissecure","dissecurity","disassured","discare","discareful","disdependable","dissure","disintegrity","disgenuine","disprofessional","disprofesional","disproffessional","disproffesional")

all_comments[which(all_comments %in% syns)] 
#> [1] "trustworthy"

Then move on to an NLP approach

Slavek · January 20, 2021, 9:01pm

Well, I am not as advanced as you think. I don't know how to apply that to my initial code so as to create a variable called TMC.TRUST that is 1 if a sentence contains any synonym of "trust" from the list, excluding all the words with prefixes "un", "dis", "not " etc., and a variable called TMC.LIKE for "like" or any of its listed synonyms (apart from the listed exclusions such as "dislike", "unlikely" etc.).

technocrat · January 21, 2021, 12:25am

is missing from the example code. Reverse engineering a problem is something that reduces the number and quality of answers. Not all the data is needed; it doesn't even have to be real, and it can be a suitable built-in dataset.

Slavek · January 21, 2021, 9:25am

Well, I think everything is here:

df <- data.frame(stringsAsFactors=FALSE,
                 URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii", "jjj","kkk","lll"),
                 all_comment = c("I trust", "untrusting", NA, "not trusty", "trustworthy", "he is not honest", "dishonest person", "reliable guy", "he is unreliable", "like it","doesn't like","unlikely")
)

df



library(dplyr)
library(stringr)

TRUST.RESULT <- df %>% 
  mutate(
    TMC.TRUST = if_else(str_detect(all_comment, regex("trust|
trusting|
trustworthy|
trust-worthy|
trusty|
confident|
confidence|
honest|
honesty|
reliable|
reliability|
proffesional", ignore_case = TRUE, multiline = TRUE))
                        &!str_detect(all_comment, regex("untrust|
untrusting|
untrustworthy|
untrust-worthy|
untrusty|
unconfident|
unconfidence|
unhonest|
unhonesty|
unreliable|
unreliability|
unproffesional|
distrust|
distrusting|
distrustworthy|
distrust-worthy|
distrusty|
disconfident|
disconfidence|
dishonest|
dishonesty|
disreliable|
disreliability|
disproffesional|
not//strust|
not//strusting|
not//strustworthy|
not//strust-worthy|
not//strusty|
not//sconfident|
not//sconfidence|
not//shonest|
not//shonesty|
not//sreliable|
not//sreliability|
not//sproffesional", ignore_case = TRUE)),  1, 0),
    TMC.LIKE = if_else(str_detect(all_comment, regex("Like", ignore_case = TRUE, multiline = TRUE))
                       &!str_detect(all_comment, regex("dislike|
                                                       unlikely", ignore_case = TRUE)),  1, 0)
  )

TRUST.RESULT

It is almost working, apart from "reliable guy" (respondent hhh) and "unlikely" (respondent lll).
I am sure there is a shorter way of programming negatives. I have started with a function:


library(tidyverse)

negation <- function(x){
  str_detect(tolower(x),"(\\bun)|(\\bdis)|(not )")  
}


mmake <- function(x){
  
  assign(paste0("match_",x),function(a){
    case_when(str_detect(tolower(a),paste0(x,"?")) & !negation(tolower(a)) ~ x,
              str_detect(tolower(a),paste0(x,"?")) ~ paste0("NOT ",x),
              TRUE ~ NA_character_)
  },envir = .GlobalEnv)
}

mmake("trust")
mmake("honest")

masterf <- function(x){
  coalesce(match_trust(x),
           match_honest(x))
}

results <- df %>% mutate(across(matches("Q\\d([a-c]|[A-C])"),
                                list("result"=masterf),.names = "{.col}_{.fn}"))

results

but it needs amendments and corrections!

technocrat · January 22, 2021, 12:52am

Thanks.

Regex is powerful. It's also difficult. It should be broken down into simpler pieces whenever possible.

A few notes:

I assumed that professional in all_comments was a typo
lower-casing and newline elimination were done first
\\b matches word boundaries
two passes were used: match the target words, then eliminate those prefixed by negations (a negation such as "doesn't" still needs to be provided for)
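
To illustrate the notes on newlines and word boundaries, here is a minimal sketch. It assumes base R `grepl()` as a stand-in for `str_detect()`, purely so that it runs without packages. It suggests why "reliable guy" was missed in the original post: a pattern typed across several lines keeps the line breaks inside the string, so every alternative after the first only matches when preceded by a newline. The same effect would explain why the "unlikely" exclusion never fired.

```r
# Base R grepl() is used here in place of stringr::str_detect() (an assumption
# made only to keep the sketch free of package dependencies).
comments <- c("I trust", "reliable guy", "he is unreliable", "unlikely")

# A pattern typed across two lines keeps the line break inside the string,
# so the second alternative is "\nreliable", not "reliable".
broken <- "trust|
reliable"
broken_hits <- grepl(broken, comments, ignore.case = TRUE, perl = TRUE)

# One-line pattern with a word boundary: "unreliable" no longer counts.
fixed <- "\\b(trust|reliable)"
fixed_hits <- grepl(fixed, comments, ignore.case = TRUE, perl = TRUE)

# The exclusion from the original post has the same problem.
exclusion_hits <- grepl("dislike|
unlikely", comments, ignore.case = TRUE, perl = TRUE)

data.frame(comments, broken_hits, fixed_hits, exclusion_hits)
```

With the one-line pattern, "reliable guy" is matched and "he is unreliable" is not, because `\\b` requires the match to start at the beginning of a word.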

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})

#  like `data`, `df` is the name of a function, and some operations give
#  precedence to the function over the defined object, leading to errors

DF <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii", "jjj", "kkk", "lll"),
  all_comment = c("I trust", "untrusting", NA, "not trusty", "trustworthy", "he is not honest", "dishonest person", "reliable guy", "he is unreliable", "like it", "doesn't like", "unlikely")
)

DF
#>    URN      all_comment
#> 1  aaa          I trust
#> 2  bbb       untrusting
#> 3  ccc             <NA>
#> 4  ddd       not trusty
#> 5  eee      trustworthy
#> 6  fff he is not honest
#> 7  ggg dishonest person
#> 8  hhh     reliable guy
#> 9  iii he is unreliable
#> 10 jjj          like it
#> 11 kkk     doesn't like
#> 12 lll         unlikely

pattern1 <- "\\btrust|\\btrusting|\\btrustworthy|\\btrust-worthy|\\bconfident|\\bconfidence|\\bhonest|\\breliabl|\\bprofessional|\\blike"

  
pattern2 <- "\\bun|\\bdis|\\bnot"

TRUST_RESULT <- DF %>%
  mutate(all_comment = str_to_lower(all_comment),
    all_comment      = str_squish(all_comment),
    TRUST_RESULT     = ifelse(str_detect(all_comment,pattern1),1,0),
    TRUST_RESULT     = ifelse(str_detect(all_comment,pattern2),0,1))

TRUST_RESULT
#>    URN      all_comment TRUST_RESULT
#> 1  aaa          i trust            1
#> 2  bbb       untrusting            0
#> 3  ccc             <NA>           NA
#> 4  ddd       not trusty            0
#> 5  eee      trustworthy            1
#> 6  fff he is not honest            0
#> 7  ggg dishonest person            0
#> 8  hhh     reliable guy            1
#> 9  iii he is unreliable            0
#> 10 jjj          like it            1
#> 11 kkk     doesn't like            1
#> 12 lll         unlikely            0

nirgrahamuk · January 22, 2021, 9:26am

Minor point: mutate won't layer assignments to the same variable; the later assignment overwrites the earlier one,
i.e.

 mutate(all_comment = str_to_lower(all_comment),
    all_comment      = str_squish(all_comment),
    TRUST_RESULT     = ifelse(str_detect(all_comment,pattern1),1,0),
    TRUST_RESULT     = ifelse(str_detect(all_comment,pattern2),0,1))

is identical to

 mutate(all_comment = str_to_lower(all_comment),
    all_comment      = str_squish(all_comment),
    TRUST_RESULT     = ifelse(str_detect(all_comment,pattern2),0,1))

In other words, this doesn't quite work: it relies on every comment being either a synonym of trust or its negation, and the mutate only catches whether a negation is present.
If you were to add a new entry to the test df, something like URN "000" with all_comment "monkey", the problem would show; two separate mutate steps avoid it:

DF <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("000","aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii", "jjj", "kkk", "lll"),
  all_comment = c("monkey","I trust", "untrusting", NA, "not trusty", "trustworthy", "he is not honest", "dishonest person", "reliable guy", "he is unreliable", "like it", "doesn't like", "unlikely")
)

TRUST_RESULT2 <- DF %>%
  mutate(all_comment  = str_to_lower(all_comment),
         all_comment  = str_squish(all_comment),
         TRUST_RESULT = ifelse(str_detect(all_comment, pattern1), 1, 0)) %>%
  mutate(TRUST_RESULT = ifelse(TRUST_RESULT == 1 & str_detect(all_comment, pattern2),
                               0, TRUST_RESULT))
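
The same two-pass logic can also be written as a single logical expression. The sketch below is one possible form, not the code from the thread: it assumes base R `grepl()` instead of `str_detect()`, a shortened `pattern1`, and a reduced test data frame that includes the "monkey" row.

```r
# Reduced test data (hypothetical subset of the thread's DF, plus "monkey").
DF <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("000", "aaa", "bbb", "ddd", "hhh", "ccc"),
  all_comment = c("monkey", "I trust", "untrusting", "not trusty", "reliable guy", NA)
)

pattern1 <- "\\btrust|\\bhonest|\\breliabl"  # shortened list of target words
pattern2 <- "\\bun|\\bdis|\\bnot"            # negations, as in the thread

txt <- tolower(DF$all_comment)
has_target   <- grepl(pattern1, txt, perl = TRUE)  # grepl() returns FALSE for NA
has_negation <- grepl(pattern2, txt, perl = TRUE)

# 1 only when a target word is present and no negation is; NA stays NA.
DF$TRUST_RESULT <- ifelse(is.na(txt), NA_integer_,
                          as.integer(has_target & !has_negation))
DF
```

Here "monkey" gets 0 because it matches no target word, which is the case the single overwriting `mutate` missed.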

Slavek · January 22, 2021, 9:29am

Excellent! Thank you.
Yes, there are many typos in the text-box answers given by respondents, so they should all be taken into account.
Is there any way of including "doesn't" in your patterns?

Slavek · January 22, 2021, 9:40am

Also:

Is it possible to split a pattern over multiple lines if the lists of phrases are longer?
Can I use more than single words, so phrases like "like it"?

technocrat · January 22, 2021, 8:14pm

I'm guilty of an implicit assumption—the data presented is representative, and if an answer in full generality is required, that should be specifically stated.

technocrat · January 22, 2021, 8:15pm

pattern <- ... |n\\'t|...

If a row/column combination has an embedded \n, str_squish takes care of it.
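
For longer lists, one option is to keep the words in a character vector, which can span as many lines as needed, and paste them into a single pattern. This is only a sketch under some assumptions: the vectors `targets` and `negators` are illustrative, `\\w+n't` is my guess at a general way to catch contractions such as "doesn't" (straight apostrophes only), and base R `grepl()` replaces `str_detect()`.

```r
# Word lists can be split over several lines without putting newlines
# into the regex itself. Multi-word phrases such as "like it" are allowed.
targets <- c(
  "trust", "confiden", "honest",
  "reliabl", "like it"
)
negators <- c("un", "dis", "not", "\\w+n't")  # "\\w+n't" covers doesn't, don't, isn't

target_pattern   <- paste0("\\b(", paste(targets, collapse = "|"), ")")
negation_pattern <- paste0("\\b(", paste(negators, collapse = "|"), ")")

comments <- c("doesn't like it", "like it", "I don't trust him", "I trust him")
txt <- tolower(comments)

flag <- grepl(target_pattern, txt, perl = TRUE) &
  !grepl(negation_pattern, txt, perl = TRUE)

data.frame(comments, flag)
```

Note that the bare prefixes "un" and "dis" will also hit unrelated words that merely start with those letters, so the negation list may need tightening for real data.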

technocrat · January 22, 2021, 8:18pm

So long as you don't need to distinguish between like and like it, the sample code captures both.

For typos, see hunspell
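
As a dependency-free alternative to a spell checker, misspellings can be mapped to a canonical word by edit distance. The sketch below does not use hunspell; it swaps in `adist()` from base R (package utils), and the threshold of 2 is an arbitrary assumption that would need tuning.

```r
# Misspellings taken from the original post, plus an unrelated word.
canonical <- "professional"
words <- c("profesional", "proffessional", "proffesional", "monkey")

# adist() gives the number of single-character edits between strings.
distance <- as.vector(adist(words, canonical))

# Treat anything within 2 edits as a variant of the canonical word.
is_variant <- distance <= 2

data.frame(words, distance, is_variant)
```

Short words need a smaller threshold, otherwise unrelated words start to look like variants.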

Slavek · January 22, 2021, 11:20pm

My example is short, but some people may say "I always trust" and some "a trust is a myth" (the first is positive about trust; the second is negative and should be excluded). That is why I am wondering if I could use phrases instead of single words...

technocrat · January 23, 2021, 12:10am

The sample data we've been working with already contains phrases and the code deals with a limited context issue—not. Expanding that to handle words that are not immediately adjacent requires a key word in context (KWIC) approach, available in NLP packages. Then, there are edge cases, like "trust but verify."
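
A very rough sketch of the context idea, without an NLP package: split each comment into words and look for a negator within a few words before the target. The function name `negated_in_window`, its arguments and the window size are all hypothetical; a real KWIC tool does considerably more.

```r
# Returns TRUE if a target word has a negator within `window` words before it,
# FALSE if a target is present but not negated, NA if there is no target.
negated_in_window <- function(comment, targets, negators, window = 2) {
  if (is.na(comment)) return(NA)
  tokens <- strsplit(tolower(comment), "[^a-z']+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  hits <- which(tokens %in% targets)
  if (length(hits) == 0) return(NA)
  any(vapply(hits, function(i) {
    # Words immediately before the target, at most `window` of them.
    before <- if (i > 1) tokens[max(1, i - window):(i - 1)] else character(0)
    any(before %in% negators)
  }, logical(1)))
}

targets  <- c("trust", "honest")
negators <- c("not", "never", "don't", "doesn't")

negated_in_window("he is not very honest", targets, negators)  # negator 2 words before
negated_in_window("I always trust", targets, negators)         # target, no negator
negated_in_window("a trust is a myth", targets, negators)      # negative, but not caught
negated_in_window("monkey", targets, negators)                 # no target word
```

"a trust is a myth" comes back as not negated, which shows the limit of this kind of window check: the negative meaning there does not come from a negator next to the target word.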
