I'm an admitted beginner with stringr and regular expressions. I've been struggling with this problem. Any help or advice will be appreciated.
My objective is to normalize different sets of Manufacturer Part Numbers for improved matching.
I want to remove all characters except letters and numbers and convert letters to lowercase.
I think I've found a way to keep either all the letters or all the numbers, but not both at the same time.
library(stringr)
library(reprex)
#> Warning: package 'reprex' was built under R version 3.4.2
part_num <- c("X17-L", "36-110pc_BL/S", "#008 5")
str_replace_all(part_num, "[^a-zA-Z]", "") %>% str_to_lower()
#> [1] "xl" "pcbls" ""
str_replace_all(part_num, "[^0-9]", "")
#> [1] "17" "36110" "0085"
# I want to combine the two lines above
# Line below does not work because of "OR" condition
# Is there an "AND" condition?
str_replace_all(part_num, "[^a-zA-Z]|[^0-9]", "")
#> [1] "" "" ""
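The [] character class operator already "or"s everything inside it, which is why [a-zA-Z] matches any lowercase or uppercase letter. Your last attempt removes everything because every character is either a non-letter or a non-digit, so one side of the | always matches. Put both ranges in a single negated class instead: [^a-zA-Z0-9] would work. You could also use [^[:alnum:]], which may work better if you need to handle characters outside the ASCII set.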
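As a minimal sketch of that suggestion, reusing the part_num vector from the question (the variable name normalized is only for illustration):

```r
library(stringr)

part_num <- c("X17-L", "36-110pc_BL/S", "#008 5")

# One negated class: remove every character that is not an ASCII letter or digit,
# then lowercase whatever is left
normalized <- str_to_lower(str_replace_all(part_num, "[^a-zA-Z0-9]", ""))
normalized
#> [1] "x17l" "36110pcbls" "0085"
```

Lowercasing after the replacement means the pattern has to list both a-z and A-Z; lowercasing first would let you shorten it to [^a-z0-9].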
Character classes in regex are exactly what you need to solve your problem, and I think that is what you were aiming for in the script above. A named class covers a whole group of characters at once, which makes tasks like this easy without writing a complicated regular expression.
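Here is one way this could look with the [:alnum:] class, wrapped in a small helper so the same rule can be applied to both sets of part numbers before matching. The function name normalize_part_num and the two catalog vectors are made up for this example:

```r
library(stringr)

# Hypothetical helper: keep only letters and digits, then lowercase
normalize_part_num <- function(x) {
  str_to_lower(str_replace_all(x, "[^[:alnum:]]", ""))
}

part_num <- c("X17-L", "36-110pc_BL/S", "#008 5")
normalize_part_num(part_num)
#> [1] "x17l" "36110pcbls" "0085"

# Illustrative data: the same parts written two different ways
catalog_a <- c("X17-L", "36-110pc_BL/S")
catalog_b <- c("x17l", "36110 PC-BLS", "#008 5")

# After normalizing both sides, the differently formatted numbers match
normalize_part_num(catalog_a) %in% normalize_part_num(catalog_b)
#> [1] TRUE TRUE
```

Note that the underscore is not part of [:alnum:], so it is removed here, whereas a pattern built on \w would keep it.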
You can look at this help page to see what characters are in each class or look at this help page for more details: