Regex Pattern Matching - multiple patterns to text

I need to extract variable names from some logic written in another language that I'm going to convert to R. The `have` data generated below is a reasonable reproduction of my actual data. I have worked out most of the regex patterns with the help of https://regexr.com, but I can't get the OR portion to work properly. I've tried adding parentheses, but I likely didn't place them correctly or I'm missing something else. Essentially, I want to match anything that starts with some letters followed by numbers and take it through to the end of the name, but I couldn't figure out how to match on just that leading letters-then-numbers pattern.

TLDR:
Extract variable names from text, where each name is some letters followed by numbers, optionally with a suffix (e.g. gh312732, gh1782878_b, ab300c_7787).

library(tibble)   # enframe()
library(dplyr)    # %>%, mutate()
library(stringr)  # str_extract_all(), regex()

#create sample data for testing
x1 = "3.07 -8.32 * (((not ab300c_7787) or (not ab300c_7038)) and (not ab300c_7312))"
x2 = "-0.135 +1.11 * ((not gh312732 or gh1782878_b) and gh10211811)"
x3 = "-0.111 +1.87 * (gh18213180 and (gh2210213_b or gh2288775))"
x4 = "-0.0172 +1.33 * ((ab100k_133380) and (lc100t_78371 and ab300c_102130))"
x5 = "-0.885 +0.732 * (((not gh3117380) and gh13872288) and (not gh11111181))"
x6 = "-0.885 +0.783 * ((ab300c_78781 and ab300c_81521) and (not ab300c_101881))"

#enframe() expects a named vector; rbind() would hand it a matrix instead
have = enframe(c(x1 = x1, x2 = x2, x3 = x3, x4 = x4, x5 = x5, x6 = x6))

#filter out names for analysis
want <- have %>%
  mutate(x = 
           str_extract_all(value, regex("([:alpha:]+[:digit:]+[:alpha:]+[_][:digit:]+)| 
                                         ([:alpha:]+[:digit:]+[_][:alpha:]+) | 
                                         ([:alpha:]+[:digit:]+)"
                                          )))

Expected output is:

ab300c_7787, ab300c_7038, ab300c_7312
gh312732, gh1782878_b, gh10211811
gh18213180, gh2210213_b, gh2288775
ab100k_133380, lc100t_78371, ab300c_102130
gh3117380, gh13872288, gh11111181
ab300c_78781, ab300c_81521, ab300c_101881

This produces your desired output:

regex("[:alpha:]+[:digit:]+[:alpha:]+[_][:digit:]+|[:alpha:]+[:digit:]+[_][:alpha:]+|[:alpha:]+[:digit:]+")
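For reference, here is a minimal sketch of that single-line pattern applied to two of the sample strings from the question. The object names `x`, `pattern` and `found` are just illustrative, and it assumes the stringr package is installed.

```r
library(stringr)

# Two of the sample strings from the question
x <- c(
  "3.07 -8.32 * (((not ab300c_7787) or (not ab300c_7038)) and (not ab300c_7312))",
  "-0.135 +1.11 * ((not gh312732 or gh1782878_b) and gh10211811)"
)

# Three alternatives separated by |, with no spaces or line breaks around them:
#   letters digits letters _ digits   (ab300c_7787)
#   letters digits _ letters          (gh1782878_b)
#   letters digits                    (gh312732)
pattern <- regex("[:alpha:]+[:digit:]+[:alpha:]+[_][:digit:]+|[:alpha:]+[:digit:]+[_][:alpha:]+|[:alpha:]+[:digit:]+")

# str_extract_all() returns a list with one character vector per input string
found <- str_extract_all(x, pattern)
found
```

The longer alternatives come first because the regex engine takes the first alternative that matches at a given position, so putting the plain letters-then-digits pattern first would cut `ab300c_7787` down to `ab300`.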

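Note: Do not put line breaks in your regex unless you actually want to match one. In your attempt, the line breaks and spaces around each | were treated as literal characters, so the second and third alternatives could only match text surrounded by that exact whitespace. If you want to spread the pattern over several lines without matching the line breaks, use the comments = TRUE option of regex().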
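As a rough sketch of the comments = TRUE approach, the same three alternatives can be laid out one per line. With this option, whitespace in the pattern is ignored and anything after # is treated as a comment. The names `pattern`, `test_string` and `result` are illustrative only.

```r
library(stringr)

# comments = TRUE lets the pattern span several lines: spaces and line
# breaks are ignored, and text after # is a comment inside the regex
pattern <- regex("
    [:alpha:]+ [:digit:]+ [:alpha:]+ [_] [:digit:]+   # e.g. ab300c_7787
  | [:alpha:]+ [:digit:]+ [_] [:alpha:]+              # e.g. gh1782878_b
  | [:alpha:]+ [:digit:]+                             # e.g. gh312732
", comments = TRUE)

test_string <- "((not gh312732 or gh1782878_b) and ab300c_7787)"

# Take the first (and only) element of the list returned for one input string
result <- str_extract_all(test_string, pattern)[[1]]
result
```

If you ever need to match a literal space or # in this mode, it has to be escaped (for example with a backslash), since unescaped ones are dropped from the pattern.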

Awesome, thank you so much!
