Regex issue - phrases with a trailing space not picked up from the list

Hi,
My guru andresrcs helped me write this code, which merges some string variables:

source <- data.frame(
  stringsAsFactors = FALSE,
               URN = c("aaa", "bbb", "ccc", "ddd"),
              Name = c("xxx", "xxx", "yyy", "yyy"),
              Date = c("2019-04-29", "2019-11-04", "2019-06-18", "2019-06-16"),
                CommQ1 = c("nein", NA, "Kein Kommentar ", "keine Ahnung"),
                CommQ2 = c("xxx", "Kein Kommentar", "nein.",NA),
                CommQ3 = c("Reperaturkosten stimmten", "nein", "aa", "keine Ahnung "),
                Q4 = c(2019, 2020, 2020, 2021),
                Q5 = c(10, 9, 8, 5)
)

library(tidyverse)  # load first: regex() comes from stringr

blank_statements <- regex("^(Kein\\sKommentar|keine\\sAhnung|
                          nein)$", ignore_case = TRUE, multiline = TRUE)
comments <- source %>% 
  mutate(across(contains("Comm"), ~str_remove_all(.x, blank_statements)),
         across(contains("Comm"), ~str_remove_all(.x, "^.{1,7}$"))) %>%
  unite("all_comments", contains("Comm"), sep = "/", remove = FALSE, na.rm = FALSE) %>%
  mutate(across(contains("comments"), ~ str_remove_all(.x, "NA")),
         across(contains("comments"), ~ str_remove_all(.x, "[:cntrl:]")),
         across(contains("comments"), ~ str_replace_all(.x, "\\s\\s+", " ")),
         across(contains("comments"), ~ str_replace_all(.x, "//+", "/")),
         across(contains("comments"), ~ str_remove (.x, "/$")),
         across(contains("comments"), ~ str_remove (.x, "^/"))) 

comments

Most of the expressions in this list:

blank_statements <- regex("^(Kein\\sKommentar|keine\\sAhnung|
                          nein)$", ignore_case = TRUE, multiline = TRUE)

are picked up correctly but, for some reason, the phrases with spaces after them ("keine Ahnung ", "Kein Kommentar ") are not.
What am I doing wrong?

You can ditch the spaces more simply with str_trim():

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd"),
  Name = c("xxx", "xxx", "yyy", "yyy"),
  Date = c("2019-04-29", "2019-11-04", "2019-06-18", "2019-06-16"),
  CommQ1 = c("nein", NA, "Kein Kommentar ", "keine Ahnung"),
  CommQ2 = c("xxx", "Kein Kommentar", "nein.",NA),
  CommQ3 = c("Reperaturkosten stimmten", "nein", "aa", "keine Ahnung "),
  Q4 = c(2019, 2020, 2020, 2021),
  Q5 = c(10, 9, 8, 5)
)

source %>% mutate(CommQ1 = str_trim(CommQ1),
                  CommQ2 = str_trim(CommQ2),
                  CommQ3 = str_trim(CommQ3))
#>   URN Name       Date         CommQ1         CommQ2                   CommQ3
#> 1 aaa  xxx 2019-04-29           nein            xxx Reperaturkosten stimmten
#> 2 bbb  xxx 2019-11-04           <NA> Kein Kommentar                     nein
#> 3 ccc  yyy 2019-06-18 Kein Kommentar          nein.                       aa
#> 4 ddd  yyy 2019-06-16   keine Ahnung           <NA>             keine Ahnung
#>     Q4 Q5
#> 1 2019 10
#> 2 2020  9
#> 3 2020  8
#> 4 2021  5

Excellent!
Is it possible to apply str_trim to all variables with "Comm" in their names rather than listing them one by one?

Yes, just as you did earlier with across. I was just being lazy when I listed them one by one.
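
For instance, a minimal sketch of the across() version. The toy data frame below is a cut-down stand-in for source, made up for illustration:

```r
library(dplyr)
library(stringr)

# Cut-down stand-in for the `source` data frame (illustration only)
toy <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("ccc", "ddd"),
  CommQ1 = c("Kein Kommentar ", "keine Ahnung"),
  CommQ3 = c("aa", "keine Ahnung ")
)

# Trim leading/trailing whitespace in every column whose name contains "Comm"
trimmed <- toy %>%
  mutate(across(contains("Comm"), str_trim))

trimmed$CommQ1
trimmed$CommQ3
```

The same selector, contains("Comm"), is the one already used in the original pipeline, so the trimming step can sit directly in front of the existing mutate() calls.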


Thank you.
Would the above handle just a space or a single character after the comment?
"Kein Kommentar " should be treated as "no comment" but
"Kein Kommentar about this specific thing but something about something else" should stay.

Like this?

library(tidyverse)
comments <- source %>% 
  mutate(across(contains("Comm"), ~str_trim(.x))) %>%
  mutate(across(contains("Comm"), ~str_remove_all(.x, blank_statements)),
         across(contains("Comm"), ~str_remove_all(.x, "^.{1,7}$"))) %>%
  unite("all_comments", contains("Comm"), sep = "/", remove = FALSE, na.rm = FALSE) %>%
  mutate(across(contains("comments"), ~ str_remove_all(.x, "NA")),
         across(contains("comments"), ~ str_remove_all(.x, "[:cntrl:]")),
         across(contains("comments"), ~ str_replace_all(.x, "\\s\\s+", " ")),
         across(contains("comments"), ~ str_replace_all(.x, "//+", "/")),
         across(contains("comments"), ~ str_remove (.x, "/$")),
         across(contains("comments"), ~ str_remove (.x, "^/"))) 

No, because blank_statements is not defined anywhere in that code. One of the very valuable things about always using a reprex is that everything runs in a fresh session, so any missing objects are immediately obvious from the error at the foot of the output.

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
  library(tidyr)
})

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd"),
  Name = c("xxx", "xxx", "yyy", "yyy"),
  Date = c("2019-04-29", "2019-11-04", "2019-06-18", "2019-06-16"),
  CommQ1 = c("nein", NA, "Kein Kommentar ", "keine Ahnung"),
  CommQ2 = c("xxx", "Kein Kommentar", "nein.",NA),
  CommQ3 = c("Reperaturkosten stimmten", "nein", "aa", "keine Ahnung "),
  Q4 = c(2019, 2020, 2020, 2021),
  Q5 = c(10, 9, 8, 5)
)

comments <- source %>% 
  mutate(across(contains("Comm"), ~str_trim(.x))) %>%
  mutate(across(contains("Comm"), ~str_remove_all(.x, blank_statements)),
         across(contains("Comm"), ~str_remove_all(.x, "^.{1,7}$"))) %>%
  unite("all_comments", contains("Comm"), sep = "/", remove = FALSE, na.rm = FALSE) %>%
  mutate(across(contains("comments"), ~ str_remove_all(.x, "NA")),
         across(contains("comments"), ~ str_remove_all(.x, "[:cntrl:]")),
         across(contains("comments"), ~ str_replace_all(.x, "\\s\\s+", " ")),
         across(contains("comments"), ~ str_replace_all(.x, "//+", "/")),
         across(contains("comments"), ~ str_remove (.x, "/$")),
         across(contains("comments"), ~ str_remove (.x, "^/"))) 
#> Error: Problem with `mutate()` input `..1`.
#> ℹ `..1 = across(contains("Comm"), ~str_remove_all(.x, blank_statements))`.
#> x object 'blank_statements' not found

Oops, I can see that this line of code:

blank_statements <- regex("^(Kein\\sKommentar|keine\\sAhnung|
                          nein)$", ignore_case = TRUE, multiline = TRUE)

was missing, so:

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
  library(tidyr)
})

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd"),
  Name = c("xxx", "xxx", "yyy", "yyy"),
  Date = c("2019-04-29", "2019-11-04", "2019-06-18", "2019-06-16"),
  CommQ1 = c("nein", NA, "Kein Kommentar ", "keine Ahnung"),
  CommQ2 = c("xxx", "Kein Kommentar", "nein.",NA),
  CommQ3 = c("Reperaturkosten stimmten", "nein", "aa", "keine Ahnung "),
  Q4 = c(2019, 2020, 2020, 2021),
  Q5 = c(10, 9, 8, 5)
)

blank_statements <- regex("^(Kein\\sKommentar|keine\\sAhnung|
                          nein)$", ignore_case = TRUE, multiline = TRUE)
comments <- source %>% 
  mutate(across(contains("Comm"), ~str_trim(.x))) %>%
  mutate(across(contains("Comm"), ~str_remove_all(.x, blank_statements)),
         across(contains("Comm"), ~str_remove_all(.x, "^.{1,7}$"))) %>%
  unite("all_comments", contains("Comm"), sep = "/", remove = FALSE, na.rm = FALSE) %>%
  mutate(across(contains("comments"), ~ str_remove_all(.x, "NA")),
         across(contains("comments"), ~ str_remove_all(.x, "[:cntrl:]")),
         across(contains("comments"), ~ str_replace_all(.x, "\\s\\s+", " ")),
         across(contains("comments"), ~ str_replace_all(.x, "//+", "/")),
         across(contains("comments"), ~ str_remove (.x, "/$")),
         across(contains("comments"), ~ str_remove (.x, "^/"))) 

Please make your reprex "minimal". For example, this is more than enough to exemplify this specific problem:

library(tidyverse)

source <- data.frame(
    stringsAsFactors = FALSE,
    URN = c("aaa", "bbb", "ccc", "ddd"),
    CommQ1 = c("nein", NA, "Kein Kommentar ", "keine Ahnung"),
    CommQ2 = c("xxx", "Kein Kommentar", "nein.",NA),
    CommQ3 = c("Reperaturkosten stimmten", "nein", "aa", "keine Ahnung ")
)

blank_statements <- regex("^(Kein\\sKommentar|keine\\sAhnung|
                          nein)\\s*$", ignore_case = TRUE, multiline = TRUE)
source %>% 
    mutate(across(contains("Comm"), ~str_remove_all(.x, blank_statements)))
#>   URN CommQ1 CommQ2                   CommQ3
#> 1 aaa   nein    xxx Reperaturkosten stimmten
#> 2 bbb   <NA>                            nein
#> 3 ccc         nein.                       aa
#> 4 ddd          <NA>

Created on 2021-05-28 by the reprex package (v2.0.0)
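
Two details in the pattern are worth spelling out. The added \\s* lets the match absorb any trailing whitespace before the end of the string, which the original $ on its own could not do. Separately, the line break inside the pattern string is kept literally, so the third alternative is a newline plus the indentation followed by nein; that appears to be why the bare "nein" entries survive in the output above. Here is a sketch with the pattern written on one line (the vector x and the name blank_one_line are made up for illustration):

```r
library(stringr)

# Same alternatives as above, but written on one line so that no newline or
# indentation ends up inside the pattern
blank_one_line <- regex("^(Kein\\sKommentar|keine\\sAhnung|nein)\\s*$",
                        ignore_case = TRUE)

# Invented test values, including a longer comment that should be kept
x <- c("nein", "Kein Kommentar ", "keine Ahnung", "nein.",
       "Kein Kommentar about this specific thing", NA)

cleaned <- str_remove_all(x, blank_one_line)
cleaned
```

Because the pattern is anchored at both ends, a longer comment that merely starts with one of the phrases is left alone, and "nein." is kept because of its trailing period.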

Also, notice that most of your problems come from a lack of knowledge of regular expressions, so please consider investing some time in learning the topic.


Regular expressions provide a rich, expressive means of pattern matching, so there are a lot of options. Sometimes, however, it pays to take a lesson from an apocryphal Michelangelo story. When asked how he created David from a Carrara marble monolith, he replied:

I simply removed everything that wasn't David

That was my suggested approach: focusing on what needed to be taken away, rather than what needed to be kept.
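
As a rough sketch of that idea: once str_trim() has taken the stray whitespace away, a strictly anchored pattern no longer needs to know about trailing spaces at all. The vector x and the name blank are invented for illustration, and the pattern is written on one line here.

```r
library(stringr)

# Invented values with leading or trailing spaces
x <- c("Kein Kommentar ", " keine Ahnung", "Reperaturkosten stimmten ")

# Strict pattern with no allowance for surrounding whitespace
blank <- regex("^(Kein\\sKommentar|keine\\sAhnung|nein)$", ignore_case = TRUE)

# Untrimmed: the padded placeholders do not match, so nothing is removed
untrimmed <- str_remove_all(x, blank)

# Trimmed first: the placeholders now match exactly and are removed
trimmed <- str_remove_all(str_trim(x), blank)

untrimmed
trimmed
```

Whether to trim first or to widen the pattern with \\s* is largely a matter of taste; trimming has the side effect of also tidying the comments that are kept.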

Thank you. What is the best source of this knowledge you would recommend?

Here you can find the basics and suggested readings.

