how to strip special characters from a string?

stringr

#1

I ran into some difficulties replacing all the special characters in a very simple string.
Consider this string:

mystring <- "[[\"The Tidy\",\"Verse\"]]"

I was trying to use stringr to get rid of all the [, ], \ and " inside it, but I was not able to do so. What is the right syntax here?

str_replace(mystring, '[\[\]"\\]', '')
Error: '\[' is an unrecognized escape in character string starting "'[\["

Thanks!


#2

Thanks for the concrete example!

Special characters are a curse in any language, not just R. Fortunately, your problem has a simple solution using regular expressions. I assume you want to end up with "The Tidy Verse".

library(stringr)
mystring <- "[[\"The Tidy\",\"Verse\"]]"
str_replace_all(mystring, "[^[:alnum:]]", " ") %>% str_replace_all(., "[ ]+", " ")
[1] " The Tidy Verse "

Regex is a language all its own, but it will repay close study for as long as you work with text.


#3

Hi @technocrat, thanks! Actually, I would like to end up with "The Tidy, Verse" :slight_smile:
How could we modify your code? Also, I would like to stick to plain regex without using the [:alnum:] shorthand.

What do you think?
Thanks!!


#4

You can actually see that I was trying to use regex by escaping all of these special characters, but for some reason it didn't work... :roll_eyes:


#5

Well, stringr is actually part of the tidyverse :grin:

[:alnum:] is just sugar for, roughly, [A-Za-z0-9]; there are no digits in your string, so [A-Za-z] is enough here. The leading ^ negates the class, so the pattern matches every character that is not a member of the bracketed class, and each match is replaced with a single space. Then, to avoid too convoluted a regex for the space separating the two parts of the inner list, I just piped the result on to replace any run of blanks with a single blank. So, let's refactor:

> library(stringr)
> mystring <- "[[\"The Tidy\",\"Verse\"]]"
> str_replace_all(mystring, "[^[A-Za-z]]", " ") %>% str_replace_all(.,"[ ]+", " ")
[1] " The Tidy Verse "
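
One caveat with the refactor, as a small sketch: [:alnum:] also covers digits, so the two patterns only agree on strings without numbers. The input below is a made-up example (it is not from the thread), chosen just to show the difference.

```r
library(stringr)

# Hypothetical input that contains digits (not from the thread)
x <- "[[\"Tidy\",\"2024\"]]"

# POSIX class: letters and digits survive
keep_alnum <- str_squish(str_replace_all(x, "[^[:alnum:]]", " "))

# Letters-only class: the digits are replaced as well
keep_alpha <- str_squish(str_replace_all(x, "[^A-Za-z]", " "))

keep_alnum
keep_alpha
```

If the real data may contain numbers, "[^A-Za-z0-9]" is probably the closer plain-regex equivalent.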

#6

Thanks @technocrat, but as I said, I need to keep the commas, so the output would be "The Tidy, Verse".


#7

Missed that.

str_replace_all(mystring, "[^[A-Za-z,]]", " ") %>% str_squish(.) %>% str_replace_all(., " , ", ", ")

#8

It works, but I am a bit puzzled because I thought that in R we needed to escape the special characters...


#9

We often do, but it depends on the context. The error you got came from R's string parser rather than from the regex engine: inside an R string literal, a regex backslash has to be written twice, e.g. "\\[".

The trick here was to do the opposite of identifying the special characters: match everything that isn't an ordinary character (a letter or a comma) and replace that instead.
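
For completeness, the approach from the original question (listing the characters to remove) can be made to work too. This is only a sketch, assuming the goal is still "The Tidy, Verse". Note that the string contains no literal backslashes, since \" in the source is just an escaped quote, and that str_replace() only touches the first match, so an _all variant is needed.

```r
library(stringr)

mystring <- "[[\"The Tidy\",\"Verse\"]]"

# Regex we want the engine to see: [\[\]"]
# In an R string literal each backslash is doubled and the quote is escaped.
stripped <- str_remove_all(mystring, "[\\[\\]\"]")

# Put a space after the comma that separated the two elements
result <- str_replace_all(stripped, ",", ", ")
result
```

Which version is easier to read is a matter of taste; the negated class avoids the backslashes altogether.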


#10

You can shorten the regex to "[^A-Za-z,]", instead of "[^[A-Za-z,]]" for the initial removal of the special characters.
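
A quick way to check that the shorter pattern behaves the same on this input (a sketch using the string from the first post; the variable names are just for illustration):

```r
library(stringr)

mystring <- "[[\"The Tidy\",\"Verse\"]]"

# Nested class, as in the earlier reply
nested <- str_replace_all(mystring, "[^[A-Za-z,]]", " ")

# Flat class, as suggested here
flat <- str_replace_all(mystring, "[^A-Za-z,]", " ")

identical(nested, flat)

# Same clean-up steps as before, applied to the flat version
cleaned <- str_replace_all(str_squish(flat), " , ", ", ")
cleaned
```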


#11

You're right. Old habits from Python

