Manipulate string columns

I have a data.frame with 2 columns of strings. In the first column, I want to make sure that for each observation, the first character of string is a letter or a number. For example, if it is "* BNY Mellon Cash Reserve", I need to change it into "BNY Mellon Cash Reserve", no space is allowed in the first character.

The second column contains strings of numbers, but some are with parenthesis and other non-number stuff, and I want to only keep the numbers, so I can convert it to numeric values in the future.

Strings are tricky. You can use stringr in tidyverse.

df <- df %>%
    col1=str_trim(str_replace(col1, "[:punct:]", " ")), # replaces punctuation and trims leading and trailing spaces
    col2=str_trim(str_replace(col2, "[[:punct:][:alpha]]", " ")) # replaces punctuation and letters and trims leading and trailing spaces

Second column first. Try parse_number()

That might run aground on parentheses. This (from stringr) won't:

mutate(Your_Second_Column=str_replace(Your_Second_Column, "\(", ""))

The double backslash is the signal that you're not nesting code with a paren, but rather identifying a character -- ( -- in the string that you wish to remove and replace with nothing -- "".

With respect to your first column, a good solution might be (without a reprex we can't know):
Your_First_Column=str_sub(Your_First_Column, 4, 18) ... where 4 and 18 mark the beginning and end of what you want to keep from the string. Obviously, this won't work if the pattern of the beginning and end of the strings are different.

I didn't see this before responding. Excellent.

1 Like

Like use of str_trim in col1! I think col2 can be simplified

col2 = str_replace_all(col2, "[^0-9]","") # drop everything that's not a digit

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.