Normalizing text part numbers with str_replace_all


I’m an admitted beginner with stringr and regular expressions. I’ve been struggling with this problem. Any help or advice will be appreciated.

My objective is to normalize different sets of Manufacturer Part Numbers for improved matching.
I want to remove all characters except letters and numbers and convert letters to lowercase.

I think I’ve found a way to keep either all the letters or all the numbers, but not both at the same time.

#> Warning: package 'reprex' was built under R version 3.4.2
part_num <- c("X17-L", "36-110pc_BL/S", "#008 5")
str_replace_all(part_num, "[^a-zA-Z]", "") %>% str_to_lower()
#> [1] "xl"    "pcbls" ""
str_replace_all(part_num, "[^0-9]", "")
#> [1] "17"    "36110" "0085"
# I want to combine the two lines above
# Line below does not work because of "OR" condition
# Is there an "AND" condition?
str_replace_all(part_num, "[^a-zA-Z]|[^0-9]", "")
#> [1] "" "" ""

The desired result is:
“x17l” “36110pcbls” “0085”


The [] character class operator already “or”s everything inside it, which is why [a-zA-Z] matches all lowercase and uppercase letters. So [^a-zA-Z0-9] would work. You could also use [^[:alnum:]], which may work better if you need to handle characters outside the ASCII set.
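To make that concrete, here is a minimal sketch using the part numbers from the question. The variable names ascii_clean and posix_clean are just illustrative. The original attempt with "[^a-zA-Z]|[^0-9]" removes everything because every character is either not a letter or not a digit; putting both ranges inside one negated class asks for "neither a letter nor a digit" instead.

```r
library(stringr)

part_num <- c("X17-L", "36-110pc_BL/S", "#008 5")

# One negated class: remove anything that is neither a letter nor a digit,
# then convert what is left to lowercase
ascii_clean <- str_to_lower(str_replace_all(part_num, "[^a-zA-Z0-9]", ""))
ascii_clean
#> [1] "x17l"       "36110pcbls" "0085"

# Same idea with the alnum class, which is not limited to ASCII letters
posix_clean <- str_to_lower(str_replace_all(part_num, "[^[:alnum:]]", ""))
posix_clean
#> [1] "x17l"       "36110pcbls" "0085"
```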


Yup, @nick beat me to it, but this should look something like

part_num <- c("X17-L", "36-110pc_BL/S", "#008 5")

lower_part_num <- str_to_lower(part_num)

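# str_replace_all() removes every match (str_replace() stops after the first),
# and there is no space inside the class, so spaces are removed too
clean_part_num <- str_replace_all(lower_part_num, "[^[:alnum:]]", "")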

Storing the intermediate results isn’t necessary; I only did it so you can see the changes at each step.
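If you would rather skip the intermediate variables, the same steps can be chained with the pipe, as in the sketch below. The helper name normalize_part_num is made up for illustration. It uses str_replace_all so that every unwanted character is removed, not just the first one in each string.

```r
library(stringr)  # stringr also re-exports the %>% pipe

# Hypothetical helper: lowercase, then drop everything that is not a letter or digit
normalize_part_num <- function(x) {
  x %>%
    str_to_lower() %>%
    str_replace_all("[^[:alnum:]]", "")
}

part_num <- c("X17-L", "36-110pc_BL/S", "#008 5")
clean_part_num <- normalize_part_num(part_num)
clean_part_num
#> [1] "x17l"       "36110pcbls" "0085"
```

Wrapping the steps in a function means the same normalization can be applied to both sets of part numbers before matching them.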


Character classes in regex are exactly what you need to solve your problem, which I think is what you were aiming for in the script above. A character class matches a whole group of characters at once, which makes tasks like this easy without writing a complicated regular expression.
You can look at this help page to see which characters are in each class, or open the built-in R help for more details:

?`regular expression`

You can also use the str_view function to help debug your regex, and this cheat sheet is also helpful. Putting it together:

part_num <- c("X17-L", "36-110pc_BL/S", "#008 5")
str_replace_all(part_num, "[:punct:]|[:space:]", "") %>% # replace punctuation or whitespace with nothing
  str_to_lower() # make lowercase
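
One way to check a pattern like this before relying on it is to list what it would remove. The sketch below uses str_extract_all for that and writes the two classes as a single bracketed class; the variable name removed is illustrative. It is worth running on your real data, since some characters you might think of as punctuation (for example + or $) may be classified as symbols by the regex engine and left in place, in which case the negated [^[:alnum:]] form is the safer choice.

```r
library(stringr)

part_num <- c("X17-L", "36-110pc_BL/S", "#008 5")

# List the characters the pattern matches in each part number,
# i.e. the characters that would be deleted
removed <- str_extract_all(part_num, "[[:punct:][:space:]]")
removed
```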

Let me know if you need more help.


Thanks for the beautiful and helpful replies! They were a tremendous help and I learned a lot. I now see firsthand why a reproducible example is so valuable.