Non-greedy regular expression matching.


#1

I want to delete the part of each string in a character vector up to and including the first "=". But sometimes there is another "=" later in the string. Since regex quantifiers are "greedy" by default, my code deletes too much. For example:

teststr <- c("ragecut=Child, cohort=Ideal",
             "ragecut=Child, cohort=PedEnbloc",
             "ragecut=Child, cohort=PSeparate",
             "ragecut=Teen, cohort=Ideal",
             "ragecut=Teen, cohort=PedEnbloc",
             "ragecut=Teen, cohort=PSeparate")
sub(".*=+","", teststr)
[1] "Ideal" "PedEnbloc" "PSeparate" "Ideal" "PedEnbloc" "PSeparate"

This creates duplicate values. What I want is:

"Child, cohort=Ideal", "Child, cohort=PedEnbloc", "Child, cohort=PSeparate",
"Teen, cohort=Ideal", "Teen, cohort=PedEnbloc", "Teen, cohort=PSeparate"

Is there a straightforward way to convince sub() to stop at the shortest string that satisfies the match rather than the longest?
Thanks in advance for any help with this.
Larry Hunsicker


#2

OK. I found a way:

gsub("^[^=]*=", "", teststr)
[1] "Child, cohort=Ideal" "Child, cohort=PedEnbloc"
[3] "Child, cohort=PSeparate" "Teen, cohort=Ideal"
[5] "Teen, cohort=PedEnbloc" "Teen, cohort=PSeparate"

I don't understand it yet. But it works. Thanks to all.
Larry Hunsicker


#3

OK. I got it figured out. I was confused because the caret (^) is used in two different ways. At the beginning of the expression, it anchors the match to the start of the string. But as the first character inside brackets ([^ ]) it negates the class: match any character EXCEPT those that follow the caret. So gsub("^[^=]*=", "", teststr) means: starting at the beginning of the string, match any number of characters that are not "=", followed by one "=", and replace that match with an empty string (""). Because [^=] can never run past an "=", the match has to stop at the first one. Turns out to be completely logical, if initially confusing because of the two uses of the caret.
Larry Hunsicker
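
For anyone else who trips over the same thing, here is a small sketch of the two roles of the caret, using one of the strings from the question. The variable names (x, anchored, negated, prefix, rest) are only illustrative.

```r
x <- "ragecut=Child, cohort=Ideal"

# Caret outside brackets: anchors the pattern to the start of the string.
anchored <- c(grepl("^rage", x), grepl("^Child", x))
anchored   # TRUE FALSE ("Child" is present, but not at the start)

# Caret as the first character inside brackets: negated character class.
# "[^=]*" matches a run of characters that are not "=".
negated <- regmatches(x, regexpr("[^=]*", x))
negated    # "ragecut"

# Both together: from the start, everything that is not "=", then one "=".
prefix <- regmatches(x, regexpr("^[^=]*=", x))
prefix     # "ragecut="

rest <- sub("^[^=]*=", "", x)
rest       # "Child, cohort=Ideal"
```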


#4

Hey @lhunsicker, glad you got it worked out. Another way of handling this with regular expressions would be to make the pattern non-greedy by adding a ? directly after the * quantifier. The ? has to follow the quantifier you want to make lazy: putting it after the =+ instead would leave the .* greedy, and it would still run to the last "=".

teststr <- c("ragecut=Child, cohort=Ideal",
             "ragecut=Child, cohort=PedEnbloc",
             "ragecut=Child, cohort=PSeparate",
             "ragecut=Teen, cohort=Ideal",
             "ragecut=Teen, cohort=PedEnbloc",
             "ragecut=Teen, cohort=PSeparate")

sub(".*?=", "", teststr)

Produces:

[1] "Child, cohort=Ideal"     "Child, cohort=PedEnbloc" "Child, cohort=PSeparate"
[4] "Teen, cohort=Ideal"      "Teen, cohort=PedEnbloc"  "Teen, cohort=PSeparate" 
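
To see the greedy/lazy difference directly, here is a minimal sketch that extracts what each pattern actually matches instead of deleting it. Passing perl = TRUE is my own choice so that the quantifiers follow PCRE semantics; the names x, greedy and lazy are just for illustration.

```r
x <- "ragecut=Child, cohort=Ideal"

# Greedy: ".*" grabs as much as it can, so the match ends at the LAST "=".
greedy <- regmatches(x, regexpr(".*=", x, perl = TRUE))
greedy   # "ragecut=Child, cohort="

# Lazy: ".*?" grabs as little as it can, so the match ends at the FIRST "=".
lazy <- regmatches(x, regexpr(".*?=", x, perl = TRUE))
lazy     # "ragecut="

# Removing each match shows why the greedy version deleted too much.
after_greedy <- sub(".*=", "", x, perl = TRUE)    # "Ideal"
after_lazy   <- sub(".*?=", "", x, perl = TRUE)   # "Child, cohort=Ideal"
```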

If your inputs are always this regular, another good way to handle this would have been to split the strings on the delimiter, but the regex approach is pretty tidy.
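
A rough sketch of what the splitting approach could look like in base R, assuming every string contains at least one "=" and none ends in "=" (strsplit() drops a trailing empty piece). The helper name drop_first_field is made up for this example.

```r
teststr <- c("ragecut=Child, cohort=Ideal",
             "ragecut=Teen, cohort=PSeparate")

# Hypothetical helper: split on a literal "=", drop the first piece,
# and glue the remaining pieces back together with "=".
drop_first_field <- function(s, delim = "=") {
  parts <- strsplit(s, delim, fixed = TRUE)
  vapply(parts,
         function(p) paste(p[-1], collapse = delim),
         character(1))
}

result <- drop_first_field(teststr)
result
# [1] "Child, cohort=Ideal"    "Teen, cohort=PSeparate"
```

This avoids regular expressions entirely (fixed = TRUE treats the delimiter as plain text), at the cost of a little more code than the one-line sub() call.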


#5

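You can also use stringr and a regex. str_remove() only removes the first match in each string, so the second "=" is left alone. Note that the pattern below assumes the key before the first "=" consists of lowercase letters only.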

library(stringr)
teststr <- c("ragecut=Child, cohort=Ideal",
             "ragecut=Child, cohort=PedEnbloc",
             "ragecut=Child, cohort=PSeparate",
             "ragecut=Teen, cohort=Ideal",
             "ragecut=Teen, cohort=PedEnbloc",
             "ragecut=Teen, cohort=PSeparate")
str_remove(teststr, "[a-z]*=")
#> [1] "Child, cohort=Ideal"     "Child, cohort=PedEnbloc"
#> [3] "Child, cohort=PSeparate" "Teen, cohort=Ideal"     
#> [5] "Teen, cohort=PedEnbloc"  "Teen, cohort=PSeparate"

Created on 2018-10-15 by the reprex package (v0.2.1)
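
The same first-match versus all-matches distinction exists in base R, where sub() replaces only the first match and gsub() replaces every match. A small sketch with the unanchored pattern from above (stringr's counterpart for removing every match is str_remove_all()); the variable names are illustrative only.

```r
x <- "ragecut=Child, cohort=Ideal"

# sub(): only the first match ("ragecut=") is removed.
first_only <- sub("[a-z]*=", "", x)
first_only   # "Child, cohort=Ideal"

# gsub(): every match is removed, including "cohort=".
all_matches <- gsub("[a-z]*=", "", x)
all_matches  # "Child, Ideal"
```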


#6

Thanks to both brycemecum and cderv.
Larry Hunsicker