Using regex to optimise str_detect in case_when

bragks · July 4, 2018, 12:24pm

I'm trying to wrap my head around regex but I can't figure out how to select both the beginning and end of a word but not what is in-between. I've started doing it the hard way (code below), but my brain already hurts from all the duplication.

To set the new value, e.g. to set "RAP", I need the first two letters "RA" and a "p", which can be either the last or the second-to-last letter. The same principle goes for the other alternatives, but when the output ends with "A", the last letter of the input will always be "a".

Any suggestions are greatly appreciated!

mutate(
  mr_reg_l1 = case_when(
    str_detect(L1_loc, "^RAPZpm") ~ "RAP",
    str_detect(L1_loc, "^RAPZpl") ~ "RAP",
    str_detect(L1_loc, "^RAPZa") ~ "RAA",
    str_detect(L1_loc, "^RATZa") ~ "RAA",
    str_detect(L1_loc, "^RAPTZp") ~ "RAP",
    str_detect(L1_loc, "^LAPZp") ~ "LAP",
    str_detect(L1_loc, "^LATZp") ~ "LAP",
    str_detect(L1_loc, "^LAPZa") ~ "LAA",
    str_detect(L1_loc, "^LAPTZa") ~ "LAP",
    str_detect(L1_loc, "^RAPZa") ~ "RAA",
    TRUE ~ L1_loc
  )
)

paul · July 4, 2018, 2:02pm

This should have the desired results and reduce a bit of the repetition:

mutate(
  mr_reg_l1 = case_when(
    str_detect(L1_loc, "^RA") & str_detect(str_extract(L1_loc, ".$"), "a") ~ "RAA",
    str_detect(L1_loc, "^RA") & str_detect(str_extract(L1_loc, "..$"), "p") ~ "RAP",
    str_detect(L1_loc, "^LA") & str_detect(str_extract(L1_loc, ".$"), "a") ~ "LAA",
    str_detect(L1_loc, "^LA") & str_detect(str_extract(L1_loc, "..$"), "p") ~ "LAP",
    TRUE ~ L1_loc
  )
)

The ".$" pattern extracts the last letter of each word, which is then checked for an "a", and the "..$" pattern extracts the last two letters, which are checked for a "p".

As you mentioned, I'm sure you could reduce this further with a more sophisticated regex matching the start and end of the string in one str_detect but my regex knowledge is pretty limited!
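For what it's worth, one pattern per output can anchor both ends of the string in a single str_detect. The sketch below is only one reading of the rules in the question (starts with "RA" or "LA"; "a" as the last letter, or "p" as the last or second-to-last letter). The vector x and its "other" entry are made up for illustration.

```r
library(stringr)
library(dplyr)

# Sample values taken from the question, plus one that should be left alone
x <- c("RAPZpm", "RAPZa", "RAPTZp", "LATZp", "LAPTZa", "other")

res <- case_when(
  str_detect(x, "^RA.*a$")   ~ "RAA",  # starts with RA, last letter is "a"
  str_detect(x, "^RA.*p.?$") ~ "RAP",  # starts with RA, "p" last or second to last
  str_detect(x, "^LA.*a$")   ~ "LAA",
  str_detect(x, "^LA.*p.?$") ~ "LAP",
  TRUE ~ x                             # anything else stays unchanged
)
res
#> [1] "RAP"   "RAA"   "RAP"   "LAP"   "LAA"   "other"
```

Here ".*" stands for "anything in between" and ".?" for "at most one more character", so the middle of the string never has to be spelled out.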

Stroehli · July 5, 2018, 3:22am

Assuming that there are no entries in L1_loc that you do not want to change this way, and that you always want to select the first of the lower-case letters at the end of your string, you can apply the same function to all rows using mutate(), str_sub() and str_extract(), applying the same regex for all cases:

df %>%
  mutate(
    mr_reg_l1 = paste(
      str_sub(L1_loc, 1, 2),
      toupper(str_sub(str_extract(L1_loc, "[:lower:]+$"), 1, 1)),
      sep = ""
    )
  )

This will select the first two letters from each value in L1_loc and join (paste) them with the first of the lower-case letters at the end of the string (which is transformed to uppercase using toupper()) and write the result to a new column mr_reg_l1.
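To make the intermediate pieces visible, the same steps can be run on a plain character vector. This is just a sketch; x holds a few values from the question, and the variable names are made up for illustration.

```r
library(stringr)

x <- c("RAPZpm", "RAPZa", "LAPTZa")

prefix     <- str_sub(x, 1, 2)                    # first two letters
tail_lower <- str_extract(x, "[:lower:]+$")       # lower-case run at the end
suffix     <- toupper(str_sub(tail_lower, 1, 1))  # its first letter, upper-cased

result <- paste0(prefix, suffix)
result
#> [1] "RAP" "RAA" "LAA"
```

Note that a string without lower-case letters at the end yields NA from str_extract(), which paste0() turns into the text "NA", so this only works under the assumption stated above that every entry follows the pattern.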

Also, in your second-last case_when, shouldn't it be

str_detect(L1_loc, "^LAPTZa") ~ "LAA",

and not

str_detect(L1_loc, "^LAPTZa") ~ "LAP",

or did I miss some of your logic here?

dpprdan · July 5, 2018, 7:37am

I’ll also give it a shot. First, are the strings in L1_loc longer than what you show us here? If not there are other solutions possible, which do not require regex at all (either your case_when without the str_detect like this L1_loc == "RAPZpm" ~ "RAP" or with a lookup table and a left_join).

Anyway, here are two alternatives, depending on whether the second-to-last case is indeed a typo or not.

library(stringr)
suppressPackageStartupMessages(library(dplyr))

xdf <- tibble(
  L1_loc = c(
    "RAPZpm",
    "RAPZpl",
    "RAPZa",
    "RATZa",
    "RAPTZp",
    "LAPZp",
    "LATZp",
    "LAPZa",
    "LAPTZa",
    "RAPZa"
  )
)

This one assumes that your second-to-last case_when is not a typo. It makes use of the different string lengths, i.e. that every string with 6 word characters maps to a “P” code. The first pattern, "^(\\w{2})\\w{4}" = "\\1P", keeps the first two characters of a 6-character string and appends “P”. The second, "^(\\w{2})\\w{2}(\\w)" = "\\1\\2", means capture the first two word characters (^(\\w{2})) and the fifth character ((\\w)) and put them in the replacement string (\\1 is the first and \\2 is the second capture group).

xdf %>%
  mutate(mr_reg_l1 =
           str_replace_all(
             L1_loc,
             c(
               "^(\\w{2})\\w{4}"  = "\\1P",
               "^(\\w{2})\\w{2}(\\w)" = "\\1\\2"
             )
           ) %>% str_to_upper())
#> # A tibble: 10 x 2
#>    L1_loc mr_reg_l1
#>    <chr>  <chr>    
#>  1 RAPZpm RAP      
#>  2 RAPZpl RAP      
#>  3 RAPZa  RAA      
#>  4 RATZa  RAA      
#>  5 RAPTZp RAP      
#>  6 LAPZp  LAP      
#>  7 LATZp  LAP      
#>  8 LAPZa  LAA      
#>  9 LAPTZa LAP      
#> 10 RAPZa  RAA

This one is similar to @Stroehli’s solution in that it makes use of the upper/lower case letters, but it does not assume that the strings end after the lower-case letters.

xdf %>%
  mutate(mr_reg_l1 =
           str_replace_all(L1_loc,
                           c("^(\\w{2})[:upper:]{2,3}(\\w{1}).*" = "\\1\\2")) %>% 
           str_to_upper()
         )
#> # A tibble: 10 x 2
#>    L1_loc mr_reg_l1
#>    <chr>  <chr>    
#>  1 RAPZpm RAP      
#>  2 RAPZpl RAP      
#>  3 RAPZa  RAA      
#>  4 RATZa  RAA      
#>  5 RAPTZp RAP      
#>  6 LAPZp  LAP      
#>  7 LATZp  LAP      
#>  8 LAPZa  LAA      
#>  9 LAPTZa LAA      
#> 10 RAPZa  RAA

Now the case_when without regex:

xdf %>%
  mutate(
  mr_reg_l1 = case_when(
    L1_loc == "RAPZpm" ~ "RAP",
    L1_loc == "RAPZpl" ~ "RAP",
    L1_loc == "RAPZa" ~ "RAA",
    L1_loc == "RATZa" ~ "RAA",
    L1_loc == "RAPTZp" ~ "RAP",
    L1_loc == "LAPZp" ~ "LAP",
    L1_loc == "LATZp" ~ "LAP",
    L1_loc == "LAPZa" ~ "LAA",
    L1_loc == "LAPTZa" ~ "LAP",
    L1_loc == "RAPZa" ~ "RAA",
    TRUE ~ L1_loc
  )
)
#> # A tibble: 10 x 2
#>    L1_loc mr_reg_l1
#>    <chr>  <chr>    
#>  1 RAPZpm RAP      
#>  2 RAPZpl RAP      
#>  3 RAPZa  RAA      
#>  4 RATZa  RAA      
#>  5 RAPTZp RAP      
#>  6 LAPZp  LAP      
#>  7 LATZp  LAP      
#>  8 LAPZa  LAA      
#>  9 LAPTZa LAP      
#> 10 RAPZa  RAA

And finally the lookup table - left_join

# Each key should appear only once in the lookup table; a duplicated key
# would duplicate the matching rows in the join result.
lookup_df <- tribble(
  ~L1_loc, ~mr_reg_l1,
  "RAPZpm", "RAP",
  "RAPZpl", "RAP",
  "RAPZa", "RAA",
  "RATZa", "RAA",
  "RAPTZp", "RAP",
  "LAPZp", "LAP",
  "LATZp", "LAP",
  "LAPZa", "LAA",
  "LAPTZa", "LAP"
)

left_join(xdf, lookup_df, by = "L1_loc")
#> # A tibble: 10 x 2
#>    L1_loc mr_reg_l1
#>    <chr>  <chr>    
#>  1 RAPZpm RAP      
#>  2 RAPZpl RAP      
#>  3 RAPZa  RAA      
#>  4 RATZa  RAA      
#>  5 RAPTZp RAP      
#>  6 LAPZp  LAP      
#>  7 LATZp  LAP      
#>  8 LAPZa  LAA      
#>  9 LAPTZa LAP      
#> 10 RAPZa  RAA

nwerth · July 5, 2018, 3:31pm

This can be done with regex replacement.

library(stringi)
library(magrittr)

xx <- c(
  "RAPZpm", "RAPZpl", "RAPZa", "RATZa", "RAPTZp", "LAPZp", "LATZp", "LAPZa",
  "LAPTZa", "RAPZa",
  # Should match nothing
  "RAPZxm", "RAPpxm"
)

p_pattern <- stri_join(
  "\\b",     # Word boundary
  "([A-Z])", # Capture uppercase letter
  "A\\w*",   # "A", followed by any number of word characters
  "p\\w?\\b" # "p", followed by 0 or 1 word character before the boundary
)
p_pattern
# [1] "\\b([A-Z])A\\w*p\\w?\\b"

a_pattern <- "\\b([A-Z])A\\w*a\\b" # Similar to above

yy <- xx %>%
  stri_replace_all_regex(p_pattern, "$1AP") %>%
  stri_replace_all_regex(a_pattern, "$1AA")
cbind(xx, yy)
#       xx       yy      
#  [1,] "RAPZpm" "RAP"   
#  [2,] "RAPZpl" "RAP"   
#  [3,] "RAPZa"  "RAA"   
#  [4,] "RATZa"  "RAA"   
#  [5,] "RAPTZp" "RAP"   
#  [6,] "LAPZp"  "LAP"   
#  [7,] "LATZp"  "LAP"   
#  [8,] "LAPZa"  "LAA"   
#  [9,] "LAPTZa" "LAA"   
# [10,] "RAPZa"  "RAA"   
# [11,] "RAPZxm" "RAPZxm"
# [12,] "RAPpxm" "RAPpxm"

bragks · July 6, 2018, 6:17am

Thank you! I have to say, this is absurdly complicated for someone new to programming (and R). Especially the syntax of regex.

I'm kind of surprised there isn't a regexnoob-package that allows for some simple matching with more readable code; at least I haven't been able to find one. Something along the lines of the helper functions from select() in combination with case_when() would have been awesome.
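For this particular problem, something close to that can be sketched without any regex, using base R's startsWith() and endsWith(). The helper starts_ends_near() below is hypothetical (it is not part of any package), and the sample vector x is made up from the values in the question.

```r
library(dplyr)

# Hypothetical helper: TRUE where `x` starts with `start` and has `letter`
# as its last or second-to-last character.
starts_ends_near <- function(x, start, letter) {
  n <- nchar(x)
  last_two <- substr(x, pmax(n - 1, 1), n)  # last two characters (or fewer)
  startsWith(x, start) & grepl(letter, last_two, fixed = TRUE)
}

x <- c("RAPZpm", "RAPZa", "LATZp", "LAPTZa", "other")

res <- case_when(
  startsWith(x, "RA") & endsWith(x, "a") ~ "RAA",
  starts_ends_near(x, "RA", "p")         ~ "RAP",
  startsWith(x, "LA") & endsWith(x, "a") ~ "LAA",
  starts_ends_near(x, "LA", "p")         ~ "LAP",
  TRUE ~ x
)
res
#> [1] "RAP"   "RAA"   "LAP"   "LAA"   "other"
```

The "a" rules come first so that a string ending in "a" is never caught by a "p" rule.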

bragks · July 6, 2018, 6:22am

Excellent, I haven't had the time to try it yet, but based on your explanation this should do the trick. And yeah, the second-to-last one is a typo; I guess that proves the point of avoiding duplication.

Stroehli · July 6, 2018, 6:41am

Have a look at this stringr cheat sheet. There are a range of helper functions that visually show you what has been matched:

# View HTML rendering of first regex match in each string.
str_view(string, pattern, match = NA)
# View HTML rendering of all regex matches.
str_view_all(string, pattern, match = NA)
# Wrap strings into nicely formatted paragraphs.
str_wrap(string, width = 80, indent = 0, exdent = 0)

jcblum · July 6, 2018, 2:13pm

It’s not just you (and it’s not just R)! Regular expressions are awfully powerful and useful, but they’re so frequently confounding that there’s an entire running gag about it in programmer culture (e.g. 1171: Perl Problems - explain xkcd). There’s a core of wisdom in the joke, too: regex is complex and can be hard to maintain/debug, so don’t always reach for it first. If you have a manageable number of values to convert, you may spend more time writing and debugging your regex than if you had simply written out a list of direct translations (in your code or in a lookup table), which would be easier for others or future-you to understand, as well.

But it’s definitely worth it to get better at regex, if only for the thrill of acquiring a new superpower, and tools can help! Besides the great stringr helpers that @Stroehli pointed out, there’s this fantastic RStudio Addin:

It’s heavily inspired by RegExr.com, which is also a great resource (but not R-specific).

It’s good to know, by the way, that since regular expressions have been around for a long time, there are slightly different implementations in different languages. Within R, base R functions use a slightly idiosyncratic syntax. Meanwhile, stringr (building on the stringi package) uses the ICU regex engine. There’s an overview of that syntax, with some examples, in the stringr docs.
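As a small illustration of those differences, here is a sketch comparing the same match and the same replacement across base R, stringr and stringi. The vector x and the variable names are made up for the example.

```r
library(stringr)

x <- c("RAPZpm", "RAPZ")

# Base R (default engine): POSIX classes are written inside a bracket list.
base_hits <- grepl("[[:lower:]]+$", x)

# stringr (ICU engine): the class can be used without the outer brackets,
# as in the patterns earlier in this thread.
icu_hits <- str_detect(x, "[:lower:]+$")

# Back-references in the replacement also differ:
base_first2    <- sub("^(..).*$", "\\1", x)                               # base R: \\1
stringr_first2 <- str_replace(x, "^(..).*$", "\\1")                       # stringr: \\1
stringi_first2 <- stringi::stri_replace_first_regex(x, "^(..).*$", "$1")  # stringi: $1
```

Both detection calls flag only "RAPZpm" as ending in lower-case letters, and all three replacement calls return "RA" for both strings.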

nwerth · July 6, 2018, 4:15pm

This is great advice. The evolution of writing repetitive code (in my experience):

Stage 1: Write exhaustive code and data files by hand.
Stage 2: Discover regular expressions and use them in the code. When it comes time to update, get frustrated and start from scratch.
Stage 3: Use regular expressions to write the code, and copy-paste that into the script or data file.
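A rough sketch of what Stage 3 could look like for this thread: a regex derives the codes once, and the generated lines are pasted into the script. The pattern is an assumption based on the values in the question (it treats the second-to-last case as the typo it was confirmed to be), and the variable names are made up.

```r
# Values copied from the question
values <- c("RAPZpm", "RAPZpl", "RAPZa", "RATZa", "RAPTZp",
            "LAPZp", "LATZp", "LAPZa", "LAPTZa")

# First two letters + first lower-case letter after the upper-case run
codes <- toupper(sub("^(\\w{2})[A-Z]{2,3}([a-z]).*$", "\\1\\2", values))

# One case_when() line per value, ready to copy into a script
generated <- sprintf('L1_loc == "%s" ~ "%s",', values, codes)
cat(generated, sep = "\n")
```

The regex runs only once, while writing the code; what ends up in the script is the plain, readable list of translations.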