Splitting strings help

trent · February 20, 2020, 1:56am

Hi all.

I've got a large dataset with a field of drug names and ATC categories, like so:

> head(unique(state$drug))
[1] "Amphotericin B (A01AB04)" "Nystatin (A07AA02)"       "Clotrimazole (G01AF02)"   "Doxycycline (J01AA02)"    "Ampicillin (J01CA01)"    
[6] "Amoxicillin (J01CA04)"

The pattern is always the same: a name (which may or may not include spaces, see "Amphotericin B" above), followed by a space, followed by seven characters in parentheses.

What I don't want:

> head(str_trunc(state$drug, 10, side = "right"))
[1] "Amphote..." "Nystati..." "Clotrim..." "Doxycyc..." "Ampicil..." "Amoxici..."

> head(str_trunc(state$drug, 13, side = "left"))
[1] "... (A01AB04)" "... (A07AA02)" "... (G01AF02)" "... (J01AA02)" "... (J01CA01)" "... (J01CA04)"

This is the inverse of what I want (without the ellipsis)

> head(str_split_fixed(state$drug, " \\(", n=2))
     [,1]             [,2]      
[1,] "Amphotericin B" "A01AB04)"
[2,] "Nystatin"       "A07AA02)"
[3,] "Clotrimazole"   "G01AF02)"
[4,] "Doxycycline"    "J01AA02)"
[5,] "Ampicillin"     "J01CA01)"
[6,] "Amoxicillin"    "J01CA04)"

This would also do, if the second string still included the opening parenthesis, or omitted the final one.

What am I missing?

Thanks in advance.

technocrat · February 20, 2020, 3:06am

Hi, @trent, please see the FAQ "What's a reproducible example (`reprex`) and how do I do one?" Reprexes are very helpful.

I may have misunderstood what is wanted and what is unwanted. Assuming what is wanted is "(A01AB04)":

library(stringr)
vec <- c("Amphotericin B (A01AB04)", "Nystatin (A07AA02)", "Clotrimazole (G01AF02)", "Doxycycline (J01AA02)", "Ampicillin (J01CA01)")
pattern <- "\\(.*\\)$"
str_extract(vec,pattern)
#> [1] "(A01AB04)" "(A07AA02)" "(G01AF02)" "(J01AA02)" "(J01CA01)"

Created on 2020-02-19 by the reprex package (v0.3.0)
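
If both pieces are wanted rather than only the code, the same pattern can be reused to strip the code off the name. This is a minimal sketch, assuming every value ends with a single parenthesised code as described in the question. The variable names `drug`, `code` and `name` are only illustrative.

```r
library(stringr)

# Sample values copied from the question
drug <- c("Amphotericin B (A01AB04)", "Nystatin (A07AA02)", "Clotrimazole (G01AF02)")

# The code: the parenthesised text anchored at the end of the string
code <- str_extract(drug, "\\(.*\\)$")

# The name: what is left after removing that text and the whitespace before it
name <- str_remove(drug, "\\s*\\(.*\\)$")

data.frame(name, code)
```

Because the pattern is anchored with `$`, spaces inside the name (as in "Amphotericin B") are left alone.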

andresrcs · February 20, 2020, 3:39am

Is this what you want?

library(stringr)

sample_text <- c("Amphotericin B (A01AB04)", "Nystatin (A07AA02)", "Clotrimazole (G01AF02)",
                 "Doxycycline (J01AA02)", "Ampicillin (J01CA01)", "Amoxicillin (J01CA04)")

str_match(sample_text, "(.+)\\s+(\\(.+\\))")
#>      [,1]                       [,2]             [,3]       
#> [1,] "Amphotericin B (A01AB04)" "Amphotericin B" "(A01AB04)"
#> [2,] "Nystatin (A07AA02)"       "Nystatin"       "(A07AA02)"
#> [3,] "Clotrimazole (G01AF02)"   "Clotrimazole"   "(G01AF02)"
#> [4,] "Doxycycline (J01AA02)"    "Doxycycline"    "(J01AA02)"
#> [5,] "Ampicillin (J01CA01)"     "Ampicillin"     "(J01CA01)"
#> [6,] "Amoxicillin (J01CA04)"    "Amoxicillin"    "(J01CA04)"
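
The question says the code is always seven characters in parentheses, so the pattern could be tightened to reject values that do not fit. The sketch below assumes the codes contain only upper-case letters and digits, which holds for the sample shown but is not stated in the question. The value "not a drug entry" is a made-up example of a non-matching row.

```r
library(stringr)

# Two values from the question plus one hypothetical malformed entry
drug <- c("Amphotericin B (A01AB04)", "Nystatin (A07AA02)", "not a drug entry")

# ^(.+)                : the name, which may contain spaces
# " "                  : the single separating space
# (\\([A-Z0-9]{7}\\))$ : exactly seven upper-case letters or digits in parentheses
m <- str_match(drug, "^(.+) (\\([A-Z0-9]{7}\\))$")

# Drop the full-match column and label the two capture groups
parts <- m[, 2:3, drop = FALSE]
colnames(parts) <- c("name", "code")
parts
```

Rows that do not match come back as `NA` in both columns, which makes unexpected values in a large dataset easy to find with `is.na()`.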

dromano · February 20, 2020, 4:21am

Here's a possible solution that uses `separate()` from tidyr, loaded here as part of the tidyverse:

library(tidyverse)
vec <- 
  c("Amphotericin B (A01AB04)",
    "Nystatin (A07AA02)",
    "Clotrimazole (G01AF02)",
    "Doxycycline (J01AA02)",
    "Ampicillin (J01CA01)"
    )

tibble(vec) %>% 
  separate(vec, into = c('name', 'code'), sep = " \\(" )  %>% 
  mutate(code = paste0('(', code))
#> # A tibble: 5 x 2
#>   name           code     
#>   <chr>          <chr>    
#> 1 Amphotericin B (A01AB04)
#> 2 Nystatin       (A07AA02)
#> 3 Clotrimazole   (G01AF02)
#> 4 Doxycycline    (J01AA02)
#> 5 Ampicillin     (J01CA01)

Created on 2020-02-19 by the reprex package (v0.3.0)
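
The `paste0()` step above is needed because the separator `" \\("` consumes the opening parenthesis. One possible alternative, sketched here with the `str_split_fixed()` call from the original question, is to split on a space that is merely followed by a parenthesis, using a lookahead. The variable names `drug` and `pieces` are only illustrative.

```r
library(stringr)

# Sample values copied from the question
drug <- c("Amphotericin B (A01AB04)", "Nystatin (A07AA02)", "Clotrimazole (G01AF02)")

# " (?=\\()" matches a space only when the next character is "(";
# the lookahead checks for the parenthesis without consuming it,
# so the second piece keeps both parentheses
pieces <- str_split_fixed(drug, " (?=\\()", n = 2)
pieces
```

The space inside "Amphotericin B" is not followed by a parenthesis, so it is not treated as a split point.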
