Regex issue - phrases with a space at the end not picked up from the list

Slavek · May 25, 2021, 2:23pm

Hi,
My guru andresrcs helped me create code that merges some string variables:

source <- data.frame(
  stringsAsFactors = FALSE,
               URN = c("aaa", "bbb", "ccc", "ddd"),
              Name = c("xxx", "xxx", "yyy", "yyy"),
              Date = c("2019-04-29", "2019-11-04", "2019-06-18", "2019-06-16"),
                CommQ1 = c("nein", NA, "Kein Kommentar ", "keine Ahnung"),
                CommQ2 = c("xxx", "Kein Kommentar", "nein.",NA),
                CommQ3 = c("Reperaturkosten stimmten", "nein", "aa", "keine Ahnung "),
                Q4 = c(2019, 2020, 2020, 2021),
                Q5 = c(10, 9, 8, 5)
)

library(tidyverse)

# regex() comes from stringr, so the package must be loaded before this line
blank_statements <- regex("^(Kein\\sKommentar|keine\\sAhnung|
                          nein)$", ignore_case = TRUE, multiline = TRUE)
comments <- source %>% 
  mutate(across(contains("Comm"), ~str_remove_all(.x, blank_statements)),
         across(contains("Comm"), ~str_remove_all(.x, "^.{1,7}$"))) %>%
  unite("all_comments", contains("Comm"), sep = "/", remove = FALSE, na.rm = FALSE) %>%
  mutate(across(contains("comments"), ~ str_remove_all(.x, "NA")),
         across(contains("comments"), ~ str_remove_all(.x, "[:cntrl:]")),
         across(contains("comments"), ~ str_replace_all(.x, "\\s\\s+", " ")),
         across(contains("comments"), ~ str_replace_all(.x, "//+", "/")),
         across(contains("comments"), ~ str_remove(.x, "/$")),
         across(contains("comments"), ~ str_remove(.x, "^/")))

comments

Most of the expressions from this list:

blank_statements <- regex("^(Kein\\sKommentar|keine\\sAhnung|
                          nein)$", ignore_case = TRUE, multiline = TRUE)

are picked up correctly but, for some reason, the phrases with spaces after them ("keine Ahnung ", "Kein Kommentar ") are not.
What am I doing wrong?

technocrat · May 25, 2021, 5:40pm

You can ditch the trailing spaces more simply with str_trim():

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd"),
  Name = c("xxx", "xxx", "yyy", "yyy"),
  Date = c("2019-04-29", "2019-11-04", "2019-06-18", "2019-06-16"),
  CommQ1 = c("nein", NA, "Kein Kommentar ", "keine Ahnung"),
  CommQ2 = c("xxx", "Kein Kommentar", "nein.",NA),
  CommQ3 = c("Reperaturkosten stimmten", "nein", "aa", "keine Ahnung "),
  Q4 = c(2019, 2020, 2020, 2021),
  Q5 = c(10, 9, 8, 5)
)

source %>% mutate(CommQ1 = str_trim(CommQ1),
                  CommQ2 = str_trim(CommQ2),
                  CommQ3 = str_trim(CommQ3))
#>   URN Name       Date         CommQ1         CommQ2                   CommQ3
#> 1 aaa  xxx 2019-04-29           nein            xxx Reperaturkosten stimmten
#> 2 bbb  xxx 2019-11-04           <NA> Kein Kommentar                     nein
#> 3 ccc  yyy 2019-06-18 Kein Kommentar          nein.                       aa
#> 4 ddd  yyy 2019-06-16   keine Ahnung           <NA>             keine Ahnung
#>     Q4 Q5
#> 1 2019 10
#> 2 2020  9
#> 3 2020  8
#> 4 2021  5

Slavek · May 26, 2021, 8:46am

Excellent!
Is it possible to apply str_trim to all variables with "Comm" in their names rather than listing them one by one?

technocrat · May 26, 2021, 8:49am

Yes, just as you did earlier with across. I was just doing lazy evaluation.
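
A minimal sketch of that, assuming the same `CommQ*` columns as in the example data (the name `trimmed` is only illustrative):

```r
library(dplyr)
library(stringr)

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd"),
  CommQ1 = c("nein", NA, "Kein Kommentar ", "keine Ahnung"),
  CommQ2 = c("xxx", "Kein Kommentar", "nein.", NA),
  CommQ3 = c("Reperaturkosten stimmten", "nein", "aa", "keine Ahnung ")
)

# Trim leading and trailing whitespace in every column whose name contains "Comm";
# NA values are passed through unchanged by str_trim()
trimmed <- source %>%
  mutate(across(contains("Comm"), str_trim))
```

Because the columns are selected with contains("Comm"), any further comment columns added later would be trimmed as well, without listing them by name.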

Slavek · May 26, 2021, 8:50am

Thank you.
Would the above take into account just a space or a single character after the comment?
"Kein Kommentar " should be treated as "no comment" but
"Kein Kommentar about this specific thing but something about something else" should stay.

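As a quick way to check this, here is a small sketch (the vector `x` and the single-line `pattern` are made up for illustration and are not taken from the posts). str_trim() only strips whitespace at the two ends of a string, and the `^` and `$` anchors make the pattern match only when the whole string is one of the listed phrases:

```r
library(stringr)

x <- c("Kein Kommentar ", "Kein Kommentar about this specific thing")

# str_trim() removes whitespace at the start and end only;
# other trailing characters (such as a full stop) are left in place
trimmed <- str_trim(x)

# ^ and $ anchor the match to the entire string, so a longer comment
# that merely starts with one of the phrases is not touched
pattern <- regex("^(Kein\\sKommentar|keine\\sAhnung|nein)$", ignore_case = TRUE)
cleaned <- str_remove_all(trimmed, pattern)

# cleaned[1] is now an empty string; cleaned[2] is unchanged
cleaned
```
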
Slavek · May 26, 2021, 8:59am

Like this?

library(tidyverse)
comments <- source %>% 
  mutate(across(contains("Comm"), ~str_trim(.x))) %>%
  mutate(across(contains("Comm"), ~str_remove_all(.x, blank_statements)),
         across(contains("Comm"), ~str_remove_all(.x, "^.{1,7}$"))) %>%
  unite("all_comments", contains("Comm"), sep = "/", remove = FALSE, na.rm = FALSE) %>%
  mutate(across(contains("comments"), ~ str_remove_all(.x, "NA")),
         across(contains("comments"), ~ str_remove_all(.x, "[:cntrl:]")),
         across(contains("comments"), ~ str_replace_all(.x, "\\s\\s+", " ")),
         across(contains("comments"), ~ str_replace_all(.x, "//+", "/")),
         across(contains("comments"), ~ str_remove(.x, "/$")),
         across(contains("comments"), ~ str_remove(.x, "^/")))

technocrat · May 27, 2021, 6:18am

No, because blank_statements is not defined in your session. One of the very valuable things about always using a reprex is that everything runs in a fresh session, so any missing objects are immediately obvious from the error at the foot of the output.

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
  library(tidyr)
})

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd"),
  Name = c("xxx", "xxx", "yyy", "yyy"),
  Date = c("2019-04-29", "2019-11-04", "2019-06-18", "2019-06-16"),
  CommQ1 = c("nein", NA, "Kein Kommentar ", "keine Ahnung"),
  CommQ2 = c("xxx", "Kein Kommentar", "nein.",NA),
  CommQ3 = c("Reperaturkosten stimmten", "nein", "aa", "keine Ahnung "),
  Q4 = c(2019, 2020, 2020, 2021),
  Q5 = c(10, 9, 8, 5)
)

comments <- source %>% 
  mutate(across(contains("Comm"), ~str_trim(.x))) %>%
  mutate(across(contains("Comm"), ~str_remove_all(.x, blank_statements)),
         across(contains("Comm"), ~str_remove_all(.x, "^.{1,7}$"))) %>%
  unite("all_comments", contains("Comm"), sep = "/", remove = FALSE, na.rm = FALSE) %>%
  mutate(across(contains("comments"), ~ str_remove_all(.x, "NA")),
         across(contains("comments"), ~ str_remove_all(.x, "[:cntrl:]")),
         across(contains("comments"), ~ str_replace_all(.x, "\\s\\s+", " ")),
         across(contains("comments"), ~ str_replace_all(.x, "//+", "/")),
         across(contains("comments"), ~ str_remove(.x, "/$")),
         across(contains("comments"), ~ str_remove(.x, "^/")))
#> Error: Problem with `mutate()` input `..1`.
#> ℹ `..1 = across(contains("Comm"), ~str_remove_all(.x, blank_statements))`.
#> x object 'blank_statements' not found

Slavek · May 27, 2021, 2:22pm

Oops, I can see that this line of the code:

blank_statements <- regex("^(Kein\\sKommentar|keine\\sAhnung|
                          nein)$", ignore_case = TRUE, multiline = TRUE)

was missing, so here is the complete version:

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
  library(tidyr)
})

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd"),
  Name = c("xxx", "xxx", "yyy", "yyy"),
  Date = c("2019-04-29", "2019-11-04", "2019-06-18", "2019-06-16"),
  CommQ1 = c("nein", NA, "Kein Kommentar ", "keine Ahnung"),
  CommQ2 = c("xxx", "Kein Kommentar", "nein.",NA),
  CommQ3 = c("Reperaturkosten stimmten", "nein", "aa", "keine Ahnung "),
  Q4 = c(2019, 2020, 2020, 2021),
  Q5 = c(10, 9, 8, 5)
)

blank_statements <- regex("^(Kein\\sKommentar|keine\\sAhnung|
                          nein)$", ignore_case = TRUE, multiline = TRUE)
comments <- source %>% 
  mutate(across(contains("Comm"), ~str_trim(.x))) %>%
  mutate(across(contains("Comm"), ~str_remove_all(.x, blank_statements)),
         across(contains("Comm"), ~str_remove_all(.x, "^.{1,7}$"))) %>%
  unite("all_comments", contains("Comm"), sep = "/", remove = FALSE, na.rm = FALSE) %>%
  mutate(across(contains("comments"), ~ str_remove_all(.x, "NA")),
         across(contains("comments"), ~ str_remove_all(.x, "[:cntrl:]")),
         across(contains("comments"), ~ str_replace_all(.x, "\\s\\s+", " ")),
         across(contains("comments"), ~ str_replace_all(.x, "//+", "/")),
         across(contains("comments"), ~ str_remove(.x, "/$")),
         across(contains("comments"), ~ str_remove(.x, "^/")))

andresrcs · May 28, 2021, 1:52am

Please make your reprex "minimal". For example, this is more than enough to illustrate this specific problem:

library(tidyverse)

source <- data.frame(
    stringsAsFactors = FALSE,
    URN = c("aaa", "bbb", "ccc", "ddd"),
    CommQ1 = c("nein", NA, "Kein Kommentar ", "keine Ahnung"),
    CommQ2 = c("xxx", "Kein Kommentar", "nein.",NA),
    CommQ3 = c("Reperaturkosten stimmten", "nein", "aa", "keine Ahnung ")
)

blank_statements <- regex("^(Kein\\sKommentar|keine\\sAhnung|
                          nein)\\s*$", ignore_case = TRUE, multiline = TRUE)
source %>% 
    mutate(across(contains("Comm"), ~str_remove_all(.x, blank_statements)))
#>   URN CommQ1 CommQ2                   CommQ3
#> 1 aaa   nein    xxx Reperaturkosten stimmten
#> 2 bbb   <NA>                            nein
#> 3 ccc         nein.                       aa
#> 4 ddd          <NA>

Created on 2021-05-28 by the reprex package (v2.0.0)
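
One detail in the output above is worth noting: "nein" is still present in rows 1 and 2. A likely reason is that the pattern string is wrapped across two lines, so the line break and the indentation become a literal part of the last alternative, which then no longer matches a bare "nein". A sketch with the pattern written on a single line, keeping the `\\s*` before `$` (the names `blank_statements_one_line` and `x` are just for illustration):

```r
library(stringr)

# Same alternatives as above, but with no line break inside the string literal
blank_statements_one_line <- regex(
  "^(Kein\\sKommentar|keine\\sAhnung|nein)\\s*$",
  ignore_case = TRUE
)

x <- c("nein", "Kein Kommentar ", "keine Ahnung ", "nein.", "Reperaturkosten stimmten")

# The first three elements become empty strings;
# "nein." and the real comment are left unchanged
cleaned <- str_remove_all(x, blank_statements_one_line)
cleaned
```

If the pattern has to be split for readability, building it with paste0() or str_c() from separate pieces avoids embedding the line break in the pattern itself.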

Also, note that most of your problems stem from unfamiliarity with regular expressions, so please consider investing some time in learning the topic.

technocrat · May 28, 2021, 2:42am

Regular expressions provide a rich, expressive means of pattern matching, so there are a lot of options. Sometimes, however, it pays to take a lesson from an apocryphal Michelangelo story. When asked how he created David from a Carrara marble monolith, he replied:

I simply removed everything that wasn't David

That was my suggested approach—focusing on what needed to be taken away, rather than what needed to be kept.

Slavek · May 28, 2021, 11:46am

Thank you. What is the best source of this knowledge you would recommend?

andresrcs · May 28, 2021, 12:30pm

Here you can find the basics and suggested readings.
