I have a data frame with a text column. The length of the text is not fixed. I want to extract two words from each cell and store them in two columns.
Each word starts with DB followed by five digits, for example DB00001 and DB01569.
The full text looks like this:
DB00614 may increase the anticholinergic activities of DB06153.
The risk or severity of adverse effects can be increased when DB00802 is combined with DB00310.
I am using this code and it works fine, but I think it could be optimized further.
I'd do the matching only slightly differently, but I would need to think further about how parent_key should be handled in cases the original example doesn't cover.
suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})

# Pattern: "DB" followed by five digits
to_find <- "DB\\d{5}"

DeepDDI <- data.frame(des = c(
  "DB00614 may increase the anticholinergic activities of DB06153.",
  "The risk or severity of adverse effects can be increased when DB00802 is combined with DB00310.",
  "The risk or severity of adverse effects can be increased when DB00802 is combined with DB00310, except in the presence of DB04532.",
  "DB00616 and DB06151 show interactions, but no interactions were found between DB00616 and DB04532"
))

# str_extract_all() returns a list column holding every match found in each row
DeepDDI %>% mutate(parent_keys = str_extract_all(des, to_find))
#> des
#> 1 DB00614 may increase the anticholinergic activities of DB06153.
#> 2 The risk or severity of adverse effects can be increased when DB00802 is combined with DB00310.
#> 3 The risk or severity of adverse effects can be increased when DB00802 is combined with DB00310, except in the presence of DB04532.
#> 4 DB00616 and DB06151 show interactions, but no interactions were found between DB00616 and DB04532
#> parent_keys
#> 1 DB00614, DB06153
#> 2 DB00802, DB00310
#> 3 DB00802, DB00310, DB04532
#> 4 DB00616, DB06151, DB00616, DB04532
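If only the first two matches are wanted as two separate columns, as the question asks, one possible sketch is below. It relies on the simplify = TRUE argument of str_extract_all(), which returns a character matrix padded with empty strings instead of a list. The column names drug1 and drug2 are made up for illustration, and ignoring any third or later match is an assumption about the desired behaviour, not something the question settles.

```r
library(stringr)

to_find <- "DB\\d{5}"

DeepDDI <- data.frame(
  des = c(
    "DB00614 may increase the anticholinergic activities of DB06153.",
    "The risk or severity of adverse effects can be increased when DB00802 is combined with DB00310.",
    "The risk or severity of adverse effects can be increased when DB00802 is combined with DB00310, except in the presence of DB04532."
  ),
  stringsAsFactors = FALSE
)

# One row per input string, one column per match position;
# rows with fewer matches are padded with "".
m <- str_extract_all(DeepDDI$des, to_find, simplify = TRUE)

# Keep only the first two matches (hypothetical column names).
# Assumes at least one row has two or more matches, otherwise m[, 2] does not exist.
DeepDDI$drug1 <- m[, 1]
DeepDDI$drug2 <- m[, 2]
```

Row 3 has a third identifier (DB04532) that this sketch silently drops, so whether that is acceptable depends on how such rows should be treated.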