Regular expression to extract a specific string

M_AcostaCH · January 16, 2023, 4:28pm

Hi community

I want to extract specific strings from this web-scraped data.
I only need the strings that start with G followed by digits; some of them may end with a letter.

library(tidyverse)
datos_pi <- structure(list(num_pi = c("PI 093817", "PI 113367", "PI 131426", 
                                       "PI 151393", "PI 299387", "PI 416424"), 
                           accession_name1 = c("G3 | Type: CGIAR International Center Identifier | Group: CIAT | Centro Internacional de Agricultura Tropical | International Center for Tropical Agriculture |  | 65-033-00223 | Type: Other or unclassified name", 
                                              "G19 | Type: CGIAR International Center Identifier | Group: CIAT | Centro Internacional de Agricultura Tropical | International Center for Tropical Agriculture |  | No. 305 | Type: Donor identifier", 
                                               "GRANDA OHNE FADEN | Type: Local name | translates: \"LARGE ONE WITHOUT STRING\" |  | No. 2756 | Type: Developer identifier", 
                                                "Guarzo de Arbol | Type: Local name | translates: \"BIRD OF THE TREE\" (a climbing bean) |  | No. 9 | Type: Donor identifier |  | G18717 | Type: CGIAR International Center Identifier | Group: CIAT | Centro Internacional de Agricultura Tropical | International Center for Tropical Agriculture", 
                                                "Preta Rajada | Type: Local name | translates: \"BLACK GUST(of wind)\"", 
                                                "G14095 | Type: CGIAR International Center Identifier | Group: CIAT | Centro Internacional de Agricultura Tropical | International Center for Tropical Agriculture |  | 65-153-01735 | Type: Donor identifier | Evans, K.H. USDA Regional Pulse Improvement Project"
                                       ), accession_name2 = c("V-2223 | Type: Other or unclassified name |  | G19957 | Type: CGIAR International Center Identifier | Group: CIAT | possibly a selection from PI 93817. | International Center for Tropical Agriculture", 
                                                              "ASIATIC EXPEDITION NO.305 | Type: Duplicate accession name", 
                                                              "G2938 | Type: CGIAR International Center Identifier | Group: CIAT | Centro Internacional de Agricultura Tropical | International Center for Tropical Agriculture", 
                                                              "G18717A | Type: CGIAR International Center Identifier | Group: CIAT | a CIAT selection from PI 151393 | International Center for Tropical Agriculture |  | G18717B | Type: CGIAR International Center Identifier | Group: CIAT | a CIAT selection from PI 151393 | International Center for Tropical Agriculture", 
                                                              "G25187 | Type: CGIAR International Center Identifier | Group: CIAT | Centro Internacional de Agricultura Tropical | International Center for Tropical Agriculture", 
                                                              "G14095A | Type: CGIAR International Center Identifier | Group: CIAT | a CIAT selection from PI 416424 | International Center for Tropical Agriculture |  | Turkey Adapazari 1735 | Type: CGIAR International Center Identifier | Group: CIAT | Name came from the CIAT database"
                                       )), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
                                       ))

datos_pi$name1 <- str_extract(datos_pi$accession_name1, "G[:digit:]");datos_pi
datos_pi$name2 <- str_extract(datos_pi$accession_name2, "G[:digit:]");datos_pi # does not capture all the digits

#  name1 name2
# <chr> <chr>
#1 G3    G1   
#2 G1    NA   
#3 NA    G2   
#4 G1    G1   
#5 NA    G2   
#6 G1    G1  

# Desired output
# name1  name2    name3   # put a column for each string 
# G3      G19957  
# G19     NA
# NA      G2938 
# G18717  G18717A   G18717B
# NA      G25187  
# G14095  G14095A

The idea is to capture any combination that starts with G.
These are the possible forms:

G1
G12
G123
G1234
G12345
G12345A # or any letter.

Thanks!

nirgrahamuk · January 16, 2023, 4:37pm

Try this regex:

"G[:digit:]+"

The plus makes it match one or more digits instead of exactly one.

M_AcostaCH · January 16, 2023, 4:48pm

It is so close.

Check row 4: it has 3 strings that start with G.
G18717, G18717A, and G18717B.

With "G[:digit:]+" I only get G18717.

nirgrahamuk · January 16, 2023, 4:51pm

Do you also want to match an optional letter at the end?

 "G[:digit:]+[:alpha:]?"

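To make the difference between the three patterns from this thread concrete, here is a minimal sketch on a few shortened strings. The vector `x` and the result names are made up for illustration; only the patterns come from the posts above.

```r
library(stringr)

# Shortened, illustrative strings (not the full scraped values)
x <- c("G3 | Type: CGIAR", "G18717A | a CIAT selection", "GRANDA OHNE FADEN")

# G followed by exactly one digit
one_digit <- str_extract(x, "G[:digit:]")

# G followed by one or more digits
many_digits <- str_extract(x, "G[:digit:]+")

# G, one or more digits, then an optional single letter
with_letter <- str_extract(x, "G[:digit:]+[:alpha:]?")

one_digit    # "G3" "G1"      NA
many_digits  # "G3" "G18717"  NA
with_letter  # "G3" "G18717A" NA
```

Note that `str_extract()` returns only the first match in each string, so a value holding several identifiers still yields a single result.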
M_AcostaCH · January 16, 2023, 6:18pm

The expression works well. But, for example, in row 4 of the accession_name2 column there are two strings that match, and only one is shown.

Is there a way to create a new column for each match?

# G18717A   G18717B

nirgrahamuk · January 16, 2023, 6:28pm

mutate(datos_pi2,
  name2 = str_extract_all(
    datos_pi$accession_name2,
    "G[:digit:]+[:alpha:]?"
  )) |>
  relocate(name2) |>
  unnest_wider(col = "name2", 
               names_sep = "_")

M_AcostaCH · January 16, 2023, 6:44pm

@nirgrahamuk Amazing help!

Andrzej · January 16, 2023, 8:55pm

How and when was datos_pi2 created?

nirgrahamuk · January 16, 2023, 9:12pm

datos_pi2 is just a copy of datos_pi

M_AcostaCH · January 16, 2023, 9:27pm

Yes! The example runs well.
Just replace datos_pi2 with datos_pi.
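For anyone landing here later, this is a self-contained sketch of the same approach with a single data frame name. `datos_demo` and `result` are hypothetical names, and the two rows are shortened versions of the original values, so treat it as an illustration rather than the exact code that was run.

```r
library(tidyverse)

# Small stand-in for datos_pi, using the same column names as the thread
datos_demo <- tibble(
  num_pi = c("PI 151393", "PI 113367"),
  accession_name2 = c(
    "G18717A | Type: CGIAR International Center Identifier |  | G18717B | Type: CGIAR International Center Identifier",
    "ASIATIC EXPEDITION NO.305 | Type: Duplicate accession name"
  )
)

result <- datos_demo |>
  # str_extract_all() returns a list-column with every match per row
  mutate(name2 = str_extract_all(accession_name2, "G[:digit:]+[:alpha:]?")) |>
  # spread each match into its own column: name2_1, name2_2, ...
  unnest_wider(name2, names_sep = "_")

result
```

Rows with fewer matches than the widest row should be padded with NA, so the second row here is expected to be NA in both new columns.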
