How to recognize a regular expression and its sub-variations in R

I need to match a regular expression and its sub-variations, then paste all matching expressions on one line, separated by a pipe symbol.

##  I have this vector:

teste<- as.vector(c("MG_00001_01", "MG_00001_02", 
           "MG_00002_01","MG_00002_02", "MG_00002_03",
           "MG_00003_01","MG_00003_02"))
> teste
[1] "MG_00001_01" "MG_00001_02" "MG_00002_01" "MG_00002_02" "MG_00002_03" "MG_00003_01" "MG_00003_02"

## and I need a data frame with every sub-variation of a regular expression in
##   a single row, separated by a pipe symbol, like this:
  
> result
1    "MG_00001_01"|"MG_00001_02"
2    "MG_00002_01"|"MG_00002_02"|"MG_00002_03"
3    "MG_00003_01"|"MG_00003_02"

That depends a lot on what assumptions you can make about the format. Something like this could work, assuming every element looks like "MG_", five digits, an underscore, then two digits:

library(tidyverse)

teste<- as.vector(c("MG_00001_01", "MG_00001_02", 
                    "MG_00002_01","MG_00002_02", "MG_00002_03",
                    "MG_00003_01","MG_00003_02"))


str_match(teste, "^MG_([[:digit:]]{5})_([[:digit:]]{2})$") |>
  as.data.frame() |>
  setNames(c("original", "first_group", "second_group")) |>
  group_by(first_group) |>
  summarize(result = paste(original, collapse = "|"))
#> # A tibble: 3 × 2
#>   first_group result                             
#>   <chr>       <chr>                              
#> 1 00001       MG_00001_01|MG_00001_02            
#> 2 00002       MG_00002_01|MG_00002_02|MG_00002_03
#> 3 00003       MG_00003_01|MG_00003_02

Created on 2022-11-08 by the reprex package (v2.0.1)
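
If you would rather avoid the tidyverse dependency, the same grouping idea can be sketched in base R. This is only a sketch under an assumption that is mine, not the original poster's: the grouping key is everything before the final two-digit suffix. The names `key`, `groups`, and `collapsed` are just illustrative.

```r
teste <- c("MG_00001_01", "MG_00001_02",
           "MG_00002_01", "MG_00002_02", "MG_00002_03",
           "MG_00003_01", "MG_00003_02")

# Assumption: the group is identified by the text before the trailing "_NN".
key <- sub("_[0-9]{2}$", "", teste)

# Split the original strings by group, then collapse each group with "|".
groups <- split(teste, key)
collapsed <- vapply(groups, paste, character(1), collapse = "|")

# One row per group; the pipe-separated members go in the `result` column.
result <- data.frame(group = names(collapsed),
                     result = unname(collapsed),
                     stringsAsFactors = FALSE)
result
```

If the literal double quotes from the question are needed in the output, each element could be wrapped with `sprintf('"%s"', x)` before collapsing.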
