Regex and stringr: detect non alphanumeric character plus -

budugulo · November 4, 2022, 8:10pm

library(tidyverse)

# toy data
df <- tibble(
  text = c("abcdefgh", "abcd-efg", "123d*-e", "567xyz", "'!abc")
)

df
#> # A tibble: 5 × 1
#>   text    
#>   <chr>   
#> 1 abcdefgh
#> 2 abcd-efg
#> 3 123d*-e 
#> 4 567xyz  
#> 5 '!abc

How can I mutate a new column, say issue, that identifies whether the text column contains non-alphanumeric characters other than -? In other words, issue should be NA if text contains only alphanumeric characters or -, and should otherwise hold the offending characters.

df_wanted
#> # A tibble: 5 × 2
#>   text     issue
#>   <chr>    <chr>
#> 1 abcdefgh <NA> 
#> 2 abcd-efg <NA> 
#> 3 123d*-e  *    
#> 4 567xyz   <NA> 
#> 5 '!abc    '!

FactOREO · November 4, 2022, 9:04pm

Hey,

I think this serves your needs:

Data <- data.frame(text = c('abcdef','abcde-fgh','123d*-e','567xyz',"'!abc"))
library(stringr); library(dplyr)

Data |>
  rowwise() |>
  mutate(
    issue = case_when(
      str_detect(text, '[^a-zA-Z\\d\\-]') ~
        str_extract_all(string = text, pattern = '[^a-zA-Z\\d\\-]', simplify = TRUE) |>
          paste(collapse = ''),
      TRUE ~ NA_character_
    )
  )
#> # A tibble: 5 × 2
#> # Rowwise: 
#>   text      issue
#>   <chr>     <chr>
#> 1 abcdef    <NA> 
#> 2 abcde-fgh <NA> 
#> 3 123d*-e   *    
#> 4 567xyz    <NA> 
#> 5 '!abc     '!

Created on 2022-11-04 with reprex v2.0.2

The regex uses a negated character class (opened with [^), so it matches any character that is neither alphanumeric nor the minus sign -.
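To see the negated class on its own, here is a minimal sketch that applies the same pattern to the toy vector from the question, outside of any data frame. The variable names text and pattern are only for illustration.

```r
library(stringr)

# Toy strings from the question
text <- c("abcdefgh", "abcd-efg", "123d*-e", "567xyz", "'!abc")

# Negated class: anything that is NOT a letter, a digit, or a hyphen
pattern <- "[^a-zA-Z\\d\\-]"

# TRUE where at least one "problem" character is present
has_issue <- str_detect(text, pattern)
has_issue
#> [1] FALSE FALSE  TRUE FALSE  TRUE

# The offending characters per string, as a list of character vectors
matches <- str_extract_all(text, pattern)
matches
```

Only the third and fifth strings contain a character outside the allowed set, which lines up with the rows that get a non-NA issue in the result above.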

Kind regards

budugulo · November 4, 2022, 9:48pm

@FactOREO Thanks a lot. It does serve my needs!

I need a mini-lesson
Could you please explain the reason for using rowwise() and what the str_detect() part of the code is doing?

FactOREO · November 5, 2022, 4:59am

Sure. rowwise() is necessary because str_extract_all() is vectorized: inside mutate() it receives the whole text column at once, and paste(collapse = '') then collapses the matches from every row into a single string. Without rowwise(), the result would be a combination of all rows' matches, so in this case *'! would appear in both rows where the condition is true. rowwise() makes sure str_extract_all() only extracts from the row currently being processed.
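The effect described above can be reproduced outside of mutate(). The following is only a sketch of what happens when the whole column is passed in at once; the names m, all_rows and per_row are made up for illustration.

```r
library(stringr)

text <- c("abcdef", "abcde-fgh", "123d*-e", "567xyz", "'!abc")
pattern <- "[^a-zA-Z\\d\\-]"

# With simplify = TRUE the matches for ALL strings come back as one
# character matrix (one row per string, padded with "")
m <- str_extract_all(text, pattern, simplify = TRUE)

# Collapsing that matrix merges the matches of every row into one string
all_rows <- paste(m, collapse = "")
all_rows
#> [1] "*'!"

# Collapsing each string's matches separately keeps them apart,
# which is what rowwise() achieves in the pipeline
per_row <- vapply(str_extract_all(text, pattern), paste, character(1), collapse = "")
per_row
#> [1] ""   ""   "*"  ""   "'!"
```

The single string in all_rows is what would end up in every matching row without rowwise().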

str_detect() checks whether the text column contains any non-alphanumeric characters. You can read the chain as follows:

Take Data and add a column issue by checking, row by row, whether text contains non-alphanumeric characters (excluding the minus sign). If so, extract all of those characters and paste them together into one character scalar (hence the paste(collapse = '')). Otherwise, insert NA_character_.
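For comparison, here is a sketch of a different, fully vectorized way to get the same column without rowwise(). It is not the approach used above: instead of extracting the unwanted characters, it removes the allowed ones and turns empty results into NA. It assumes dplyr's na_if() and R 4.1 or later for the |> pipe; the name result is only for illustration.

```r
library(dplyr)
library(stringr)

Data <- tibble(text = c("abcdef", "abcde-fgh", "123d*-e", "567xyz", "'!abc"))

result <- Data |>
  mutate(
    # Drop every letter, digit and hyphen; whatever is left is the "issue".
    # Strings with nothing left become "" and are then converted to NA.
    issue = str_remove_all(text, "[a-zA-Z\\d\\-]") |> na_if("")
  )

result$issue
#> [1] NA   NA   "*"  NA   "'!"
```

Because str_remove_all() works element by element, each row only ever sees its own string, so no grouping by row is needed here.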

I hope this helps you understand the code above a bit better.

budugulo · November 5, 2022, 9:53am

@FactOREO Many thanks for the excellent explanation! I now understand the code fully.
