Regular expressions help

Slavek · May 18, 2020, 11:21am

Dear R masters,
I have gotten through many text-related projects with your help, but once again I cannot solve this simple example myself:

library(tidyverse)
library(stringr)

sample_data <- data.frame(stringsAsFactors=FALSE,
           InterviewID = c(94, 59, 100, 86, 60, 101, 61, 7),
           AComm_1 = c("None", "neen", "xxxxx.",
                       "None of products", "geen speciale", "geen commentaren",
                       "Goood!!!!", "aa"),
           ModelLong = c("A", "A", "A",
                         "B", "B",
                         "B", "xxx", "xxx")
)

sample_data

# List of full sentences that should be excluded (exclude only if the comment consists solely of this element, not if the element is part of a longer sentence)
blank_statements <- regex("none|geen\\sspeciale\\scommentaar|neen",
                          ignore_case = TRUE)

results <- sample_data %>%
mutate(TMC.Blank = ifelse(test = (is.na(x = sample_data$AComm_1)),yes = 1,
                          no = ifelse((test = (str_length(string = sample_data$AComm_1) < 4) | # Remove sentences with less than 4 characters
                                         (str_detect(string = AComm_1,pattern = blank_statements))| # Remove sentences containing ONLY phrases listed in the blank_statements
                                         (str_length(string = sample_data$AComm_1) < 10) & (str_detect(AComm_1, "(.)\\1{3,}"))| # Remove sentences shorter than 10 characters containing repeated characters (like xxx, aaaaa)
                                         (str_length(string = sample_data$AComm_1) < 10) & (str_detect(AComm_1, regex("none|neen", ignore_case = TRUE)) # Remove sentences shorter than 10 characters containing specific words
                          ),yes = 1,no = 0))
       
results

First of all, I think my syntax is overcomplicated. I don't think I should need to reference my data source inside a pipe, but the code does not work without the references to sample_data.

Secondly, there are two ways of excluding "none" from my comments in my code and I think one is unnecessary: I simply want to replace sentences containing just "None" with a blank but keep other sentences containing "none" ("None of products").

Thirdly, I don't know why comments with repeated characters are not replaced by blanks in my results df.

Lastly, I don't know how to add exceptions to the repeated-characters rule. I want to replace short comments that include repeated characters such as xxx or aaaa with blanks, but keep comments with !!!! ("Good!!!!").

Can you help?

nirgrahamuk · May 18, 2020, 11:42am

Can you check your mutate code as shared?
There is one more opening bracket than there are closing brackets.

Slavek · May 18, 2020, 12:15pm

That is one of the reasons I think this code is overcomplicated. There are too many chances to make a mistake. I think the entire code should be rewritten and simplified...

ulfelder · May 18, 2020, 2:52pm

You should be able to address your problem with "None" by including anchors in that regex, i.e., ^none$ instead of none.
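
A small sketch of the difference anchors make, using a made-up three-element vector rather than the full sample_data (the object names here are only for illustration). Note that with several alternatives the anchors need a group around the alternation, otherwise ^ and $ each apply to only one alternative:

```r
library(stringr)

# Illustrative comments only, not the full data set from the question
comments <- c("None", "None of products", "neen")

# Unanchored: "none" may appear anywhere in the comment
unanchored <- regex("none", ignore_case = TRUE)

# Anchored: the whole comment must be exactly "none"
anchored <- regex("^none$", ignore_case = TRUE)

# Anchored with several alternatives: group them so both anchors apply to each
anchored_any <- regex("^(none|neen)$", ignore_case = TRUE)

loose_hits <- str_detect(comments, unanchored)   # TRUE  TRUE FALSE
exact_hits <- str_detect(comments, anchored)     # TRUE FALSE FALSE
group_hits <- str_detect(comments, anchored_any) # TRUE FALSE  TRUE
```

With the anchored version, "None of products" is left alone while a bare "None" is still caught.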

Slavek · May 18, 2020, 4:27pm

Thank you, but in which regex? There are two places with that pattern and I think one is not required.
I also still cannot fix the error...

siddharthprabhu · May 18, 2020, 5:01pm

This is a good opportunity to use case_when(). Here I've just provided the logic to match your stated conditions and am populating the new variable TMC.Blank with 1 or 0 accordingly.

library(tidyverse)

sample_data <- data.frame(stringsAsFactors=FALSE,
                          InterviewID = c(94, 59, 100, 86, 60, 101, 61, 7),
                          AComm_1 = c("None", "neen", "xxxxx.", "None of products",
                                      "geen speciale", "geen commentaren", "Goood!!!!", "aa"),
                          ModelLong = c("A", "A", "A", "B", "B", "B", "xxx", "xxx"))

blank_statements <- c("none", "geen", "speciale", "commentaar", "neen")

exact_match <- regex(str_c("^", blank_statements, "$", collapse = "|"), ignore_case = TRUE)
partial_match <- regex(str_c(blank_statements, collapse = "|"), ignore_case = TRUE)

sample_data %>%
  mutate(TMC.Blank = case_when(is.na(AComm_1) ~ 1L,
                               str_length(AComm_1) < 4 ~ 1L,
                               str_detect(AComm_1, exact_match) ~ 1L,
                               str_length(AComm_1) < 10 & str_detect(AComm_1, partial_match) ~ 1L,
                               str_length(AComm_1) < 10 & str_detect(AComm_1, "\\D{3,}") ~ 1L))
#>   InterviewID          AComm_1 ModelLong TMC.Blank
#> 1          94             None         A         1
#> 2          59             neen         A         1
#> 3         100           xxxxx.         A         1
#> 4          86 None of products         B        NA
#> 5          60    geen speciale         B        NA
#> 6         101 geen commentaren         B        NA
#> 7          61        Goood!!!!       xxx         1
#> 8           7               aa       xxx         1

Created on 2020-05-18 by the reprex package (v0.3.0)

Note: I didn't quite follow what you meant by your last point about exceptions for repeated characters, so for now I've just matched letters that repeat 3 or more times (you can change the value to suit).
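
The NA values in the output above come from rows that match none of the case_when() conditions. If 0 is wanted for those rows, as in the original ifelse() version, a catch-all branch can be added at the end. A minimal sketch on a cut-down, made-up data frame, with only three of the conditions kept for brevity:

```r
library(dplyr)
library(stringr)

# Cut-down illustrative data, including one missing comment
comments <- data.frame(AComm_1 = c("None", "None of products", "aa", NA),
                       stringsAsFactors = FALSE)

flagged <- comments %>%
  mutate(TMC.Blank = case_when(
    is.na(AComm_1) ~ 1L,                                           # missing comment
    str_length(AComm_1) < 4 ~ 1L,                                  # very short comment
    str_detect(AComm_1, regex("^none$", ignore_case = TRUE)) ~ 1L, # exactly "none"
    TRUE ~ 0L                                                      # everything else
  ))

flagged$TMC.Blank
```

The conditions are evaluated in order, so the TRUE ~ 0L line only applies to rows that no earlier condition has claimed.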

Slavek · May 19, 2020, 3:28pm

Thank you!
What I meant was that any expression with repeated '!' should not be coded as blank, as this is a specific type of repeated character. For example, xxxxx should be recoded into blank whereas !!!!! (for example Goood!!!!) should not.

siddharthprabhu · May 19, 2020, 3:40pm

Okay, then no changes should be required to the code snippet above since the metacharacter \D only matches non-digit characters.

Slavek · May 19, 2020, 3:49pm

But your code assigned 1 to TMC.Blank (the same way as for xxxxx.) but it shouldn't...

siddharthprabhu · May 19, 2020, 4:29pm

That's because 'Goood!!!!' has the character 'o' repeated 3 times. It didn't match the '!!!!'.

Slavek · May 19, 2020, 6:54pm

Yes, but changing 'Goood!!!' into 'Good!!!' or even 'God!!!' does not help. The code still treats that as a multicharacter string.
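
A quick check of what "\\D{3,}" actually matches may explain this. \D stands for any single non-digit character, so letters and '!' both count, and the quantifier asks for three or more non-digits in a row, not for the same character repeated. The test strings below are made up for illustration:

```r
library(stringr)

# Illustrative strings: the last two contain too few consecutive non-digits
samples <- c("Goood!!!!", "Good!!!", "God!!!", "12ab34", "123456")

# TRUE wherever there are 3+ consecutive non-digit characters of any kind
nondigit_run <- str_detect(samples, "\\D{3,}")

# Show the part of each string that the pattern consumed (NA if no match)
matched_part <- str_extract(samples, "\\D{3,}")

nondigit_run  # TRUE TRUE TRUE FALSE FALSE
matched_part  # "Goood!!!!" "Good!!!" "God!!!" NA NA
```

So any comment with three consecutive letters or punctuation marks is flagged, whether or not a character repeats.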

siddharthprabhu · May 19, 2020, 8:08pm

Sorry, my regex for the last pattern was completely wrong. This should do the trick.

library(tidyverse)

sample_data <- data.frame(stringsAsFactors=FALSE,
                          InterviewID = c(94, 59, 100, 86, 60, 101, 61, 7),
                          AComm_1 = c("None", "neen", "xxxxx.", "None of products",
                                      "geen speciale", "geen commentaren", "Goood!!!!", "aa"),
                          ModelLong = c("A", "A", "A", "B", "B", "B", "xxx", "xxx"))

blank_statements <- c("none", "geen", "speciale", "commentaar", "neen")

exact_match <- regex(str_c("^", blank_statements, "$", collapse = "|"), ignore_case = TRUE)
partial_match <- regex(str_c(blank_statements, collapse = "|"), ignore_case = TRUE)

sample_data %>%
  mutate(TMC.Blank = case_when(is.na(AComm_1) ~ 1L,
                               str_length(AComm_1) < 4 ~ 1L,
                               str_detect(AComm_1, exact_match) ~ 1L,
                               str_length(AComm_1) < 10 & str_detect(AComm_1, partial_match) ~ 1L,
                               str_length(AComm_1) < 10 & str_detect(AComm_1, "([a-zA-Z])\\1{2,}") ~ 1L))
#>   InterviewID          AComm_1 ModelLong TMC.Blank
#> 1          94             None         A         1
#> 2          59             neen         A         1
#> 3         100           xxxxx.         A         1
#> 4          86 None of products         B        NA
#> 5          60    geen speciale         B        NA
#> 6         101 geen commentaren         B        NA
#> 7          61        Goood!!!!       xxx         1
#> 8           7               aa       xxx         1

Created on 2020-05-20 by the reprex package (v0.3.0)
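
To see how the backreference pattern behaves on its own, here is a small sketch with made-up test strings. ([a-zA-Z]) captures one letter and \\1{2,} requires that same letter at least two more times, so punctuation runs such as "!!!!" are ignored. It also shows why "Goood!!!!" still gets 1 in the output above: it contains "ooo".

```r
library(stringr)

# Illustrative strings only
samples <- c("xxxxx.", "Goood!!!!", "Good!!!!", "aa", "!!!!")

# Same letter three or more times in a row
repeated_letter <- "([a-zA-Z])\\1{2,}"

# Same character of any kind three or more times in a row, for comparison
repeated_any <- "(.)\\1{2,}"

letter_hits <- str_detect(samples, repeated_letter) # TRUE TRUE FALSE FALSE FALSE
any_hits    <- str_detect(samples, repeated_any)    # TRUE TRUE  TRUE FALSE  TRUE
```

With the letter-only pattern, "Good!!!!" is kept while "xxxxx." is flagged.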

Slavek · May 21, 2020, 9:14am

Thank you very much Master!
