Regular expressions help

Dear R masters,
I have worked through many text-related projects with your help, but once again I cannot solve this simple example myself:

library(tidyverse)
library(stringr)

sample_data <- data.frame(stringsAsFactors=FALSE,
           InterviewID = c(94, 59, 100, 86, 60, 101, 61, 7),
           AComm_1 = c("None", "neen", "xxxxx.",
                       "None of products", "geen speciale", "geen commentaren",
                       "Goood!!!!", "aa"),
           ModelLong = c("A", "A", "A",
                         "B", "B",
                         "B", "xxx", "xxx")
)

sample_data

# List of full sentences that should be excluded (exclude a comment if it contains ONLY this element, not if the element is part of a longer sentence)
blank_statements <- regex("none|geen\\sspeciale\\scommentaar|neen",
                          ignore_case = TRUE)

results <- sample_data %>%
mutate(TMC.Blank = ifelse(test = (is.na(x = sample_data$AComm_1)),yes = 1,
                          no = ifelse((test = (str_length(string = sample_data$AComm_1) < 4) | # Remove sentences with less than 4 characters
                                         (str_detect(string = AComm_1,pattern = blank_statements))| # Remove sentences containing ONLY phrases listed in the blank_statements
                                         (str_length(string = sample_data$AComm_1) < 10) & (str_detect(AComm_1, "(.)\\1{3,}"))| # Remove sentences shorter than 10 characters containing repeated characters (like xxx, aaaaa)
                                         (str_length(string = sample_data$AComm_1) < 10) & (str_detect(AComm_1, regex("none|neen", ignore_case = TRUE)) # Remove sentences shorter than 10 characters containing specific words
                          ),yes = 1,no = 0))
       
results

First of all, I think my syntax is overcomplicated. I don't think I should need to refer to sample_data explicitly inside a pipe, but the code does not work without those references.

Secondly, there are two ways of excluding "none" from my comments in my code and I think one is unnecessary: I simply want to replace sentences containing just "None" with a blank but keep other sentences containing "none" ("None of products").

Thirdly, I don't know why comments with repeated characters are not replaced by blanks in my results df.

Lastly, I don't know how to add exceptions to the repeated-characters rule. I want to replace short comments containing repeated characters such as xxx or aaaa with blanks, but keep comments where the repeated character is "!" (for example "Goood!!!!").

Can you help?

Can you check your mutate code as shared?
There is 1 more opening bracket than there are closing brackets.

That is one of the reasons I think this code is overcomplicated: there are too many chances to make a mistake. I think the entire code should be rewritten and simplified...

You should be able to address your problem with "None" by including anchors in that regex, i.e., ^none$ instead of none.
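
To illustrate the anchoring suggestion, here is a minimal sketch using a few of the comments from the question. The vector name `comments` and the grouped pattern `^(none|neen)$` are illustrative assumptions, not part of the original code.

```r
library(stringr)

# Hypothetical subset of the comments from the question
comments <- c("None", "None of products", "neen", "geen commentaren")

# Unanchored: matches "none" anywhere inside the comment
unanchored <- regex("none", ignore_case = TRUE)

# Anchored: matches only when the whole comment is exactly "none"
anchored <- regex("^none$", ignore_case = TRUE)

# With several alternatives, group them so the anchors apply to each one
anchored_group <- regex("^(none|neen)$", ignore_case = TRUE)

loose_hits <- str_detect(comments, unanchored)
exact_hits <- str_detect(comments, anchored)
group_hits <- str_detect(comments, anchored_group)

print(data.frame(comments, loose_hits, exact_hits, group_hits))
```

With the anchored patterns, "None of products" is no longer flagged, while a comment consisting only of "None" still is.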

Thank you, but in which regex? There are two places that use "none" and I think one of them is not required.
I also still cannot fix the error...

This is a good opportunity to use case_when(). Here I've just provided the logic to match your stated conditions and am populating the new variable TMC.Blank with 1 or 0 accordingly.

library(tidyverse)

sample_data <- data.frame(stringsAsFactors=FALSE,
                          InterviewID = c(94, 59, 100, 86, 60, 101, 61, 7),
                          AComm_1 = c("None", "neen", "xxxxx.", "None of products",
                                      "geen speciale", "geen commentaren", "Goood!!!!", "aa"),
                          ModelLong = c("A", "A", "A", "B", "B", "B", "xxx", "xxx"))

blank_statements <- c("none", "geen", "speciale", "commentaar", "neen")

exact_match <- regex(str_c("^", blank_statements, "$", collapse = "|"), ignore_case = TRUE)
partial_match <- regex(str_c(blank_statements, collapse = "|"), ignore_case = TRUE)

sample_data %>%
  mutate(TMC.Blank = case_when(is.na(AComm_1) ~ 1L,
                               str_length(AComm_1) < 4 ~ 1L,
                               str_detect(AComm_1, exact_match) ~ 1L,
                               str_length(AComm_1) < 10 & str_detect(AComm_1, partial_match) ~ 1L,
                               str_length(AComm_1) < 10 & str_detect(AComm_1, "\\D{3,}") ~ 1L))
#>   InterviewID          AComm_1 ModelLong TMC.Blank
#> 1          94             None         A         1
#> 2          59             neen         A         1
#> 3         100           xxxxx.         A         1
#> 4          86 None of products         B        NA
#> 5          60    geen speciale         B        NA
#> 6         101 geen commentaren         B        NA
#> 7          61        Goood!!!!       xxx         1
#> 8           7               aa       xxx         1

Created on 2020-05-18 by the reprex package (v0.3.0)

Note: I didn't quite follow what you meant by exceptions to the repeated-characters rule, so for now I've just matched letters that repeat 3 or more times (you can change the value to suit).
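
Two details of the snippet above may be worth spelling out. Inside mutate(), columns can be referred to by bare name, so no `sample_data$` prefix is needed. And case_when() returns NA for rows that match no condition, which is why some rows show NA rather than 0. The sketch below uses a hypothetical data frame `toy` with a made-up `comment` column. The `TRUE ~ 0L` fallback is an assumption about the intended result, based on the `no = 0` in the original ifelse().

```r
library(dplyr)
library(stringr)

# Hypothetical toy data, not the data from the question
toy <- data.frame(comment = c("aa", "a longer comment", NA),
                  stringsAsFactors = FALSE)

flagged <- toy %>%
  mutate(
    # No fallback: rows matching no condition become NA
    no_default = case_when(is.na(comment) ~ 1L,
                           str_length(comment) < 4 ~ 1L),
    # Fallback condition TRUE catches every remaining row and assigns 0
    with_default = case_when(is.na(comment) ~ 1L,
                             str_length(comment) < 4 ~ 1L,
                             TRUE ~ 0L)
  )

print(flagged)
```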

Thank you!
What I meant was that any expression with repeated '!' should not be coded as blank, as this is a specific type of repeated character. For example, xxxxx should be recoded into blank whereas !!!!! (for example Goood!!!!) should not.

Okay, then no changes should be required to the code snippet above since the metacharacter \D only matches non-digit characters.

But your code assigned 1 to TMC.Blank (the same way as for xxxxx.) but it shouldn't...

That's because 'Goood!!!!' has the character 'o' repeated 3 times. It didn't match the '!!!!'.

Yes, but changing 'Goood!!!' into 'Good!!!' or even 'God!!!' does not help. The code still treats that as a repeated-character string.
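
The behaviour described here can be checked in isolation. As a rough sketch: `\\D{3,}` matches any run of three or more non-digit characters, which need not be the same character, and "!" counts as a non-digit. A backreference such as the `(.)\\1{3,}` from the original question only matches when one character is followed by copies of itself. The vector name `short_comments` is an illustrative assumption.

```r
library(stringr)

# Hypothetical test strings based on the examples discussed in the thread
short_comments <- c("Goood!!!!", "Good!!!", "God!!!", "xxxxx.")

# Any three or more consecutive non-digits: letters and "!" all qualify
any_non_digits <- str_detect(short_comments, "\\D{3,}")

# One character followed by at least three copies of itself (four in a row)
same_char_run <- str_detect(short_comments, "(.)\\1{3,}")

print(data.frame(short_comments, any_non_digits, same_char_run))
```

Every string is flagged by `\\D{3,}`, whereas the backreference pattern only flags "Goood!!!!" (because of "!!!!") and "xxxxx.".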

Sorry, my regex for the last pattern was completely wrong. This should do the trick.

library(tidyverse)

sample_data <- data.frame(stringsAsFactors=FALSE,
                          InterviewID = c(94, 59, 100, 86, 60, 101, 61, 7),
                          AComm_1 = c("None", "neen", "xxxxx.", "None of products",
                                      "geen speciale", "geen commentaren", "Goood!!!!", "aa"),
                          ModelLong = c("A", "A", "A", "B", "B", "B", "xxx", "xxx"))

blank_statements <- c("none", "geen", "speciale", "commentaar", "neen")

exact_match <- regex(str_c("^", blank_statements, "$", collapse = "|"), ignore_case = TRUE)
partial_match <- regex(str_c(blank_statements, collapse = "|"), ignore_case = TRUE)

sample_data %>%
  mutate(TMC.Blank = case_when(is.na(AComm_1) ~ 1L,
                               str_length(AComm_1) < 4 ~ 1L,
                               str_detect(AComm_1, exact_match) ~ 1L,
                               str_length(AComm_1) < 10 & str_detect(AComm_1, partial_match) ~ 1L,
                               str_length(AComm_1) < 10 & str_detect(AComm_1, "([a-zA-Z])\\1{2,}") ~ 1L))
#>   InterviewID          AComm_1 ModelLong TMC.Blank
#> 1          94             None         A         1
#> 2          59             neen         A         1
#> 3         100           xxxxx.         A         1
#> 4          86 None of products         B        NA
#> 5          60    geen speciale         B        NA
#> 6         101 geen commentaren         B        NA
#> 7          61        Goood!!!!       xxx         1
#> 8           7               aa       xxx         1

Created on 2020-05-20 by the reprex package (v0.3.0)
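
In the output above, 'Goood!!!!' is still coded 1 because of the three consecutive 'o' characters, and unmatched rows are NA. One possible variation, offered only as a sketch: it assumes a repeated character means the same letter four or more times in a row (the `{3,}` threshold from the original question) and that unmatched rows should be 0. The name `repeated_letters` and the `TRUE ~ 0L` fallback are additions for illustration.

```r
library(dplyr)
library(stringr)

sample_data <- data.frame(stringsAsFactors = FALSE,
                          InterviewID = c(94, 59, 100, 86, 60, 101, 61, 7),
                          AComm_1 = c("None", "neen", "xxxxx.", "None of products",
                                      "geen speciale", "geen commentaren", "Goood!!!!", "aa"),
                          ModelLong = c("A", "A", "A", "B", "B", "B", "xxx", "xxx"))

blank_statements <- c("none", "geen", "speciale", "commentaar", "neen")
exact_match <- regex(str_c("^", blank_statements, "$", collapse = "|"), ignore_case = TRUE)
partial_match <- regex(str_c(blank_statements, collapse = "|"), ignore_case = TRUE)

# Assumption: flag only the same LETTER repeated four or more times, so "!!!!" is ignored
repeated_letters <- "([a-zA-Z])\\1{3,}"

results <- sample_data %>%
  mutate(TMC.Blank = case_when(is.na(AComm_1) ~ 1L,
                               str_length(AComm_1) < 4 ~ 1L,
                               str_detect(AComm_1, exact_match) ~ 1L,
                               str_length(AComm_1) < 10 & str_detect(AComm_1, partial_match) ~ 1L,
                               str_length(AComm_1) < 10 & str_detect(AComm_1, repeated_letters) ~ 1L,
                               TRUE ~ 0L))  # everything else is kept (0 instead of NA)

print(results)
```

Under these assumptions, "xxxxx." is still flagged while "Goood!!!!" is not. Lowering the threshold back to `{2,}` would flag "Goood!!!!" again.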

Thank you very much Master!