How to efficiently extract two specific words from text in a data frame

akib62 · April 19, 2021, 5:09pm

I have a data frame with a text column. The length of the text varies. I want to extract two words from each text and store them in two columns.

Both words start with DB followed by five digits, for example DB00001 and DB01569.

The full text looks like this:

DB00614 may increase the anticholinergic activities of DB06153.
The risk or severity of adverse effects can be increased when DB00802 is combined with DB00310.

I am using the code below and it works fine, but I think it could be optimized further.

test_df <- deepDDI %>% mutate(
  parent_key = str_extract_all(deepDDI$des, "(DB)(\\d+)")
) %>%
  select(parent_key) %>%
  separate(parent_key, into=c("nodeA", "nodeB"), sep=",")

test_df[]  <- lapply(test_df, gsub, pattern = '"', replacement= '')
test_df[]  <- lapply(test_df, gsub, pattern = ')', replacement= '')
test_df[]  <- lapply(test_df, gsub, pattern = '\\(', replacement= '')
test_df[]  <- lapply(test_df, gsub, pattern = 'c', replacement= '')

How can I optimize the code?

technocrat · April 20, 2021, 2:29am

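I'd do the matching only slightly differently: a stricter pattern (DB followed by exactly five digits) and a reference to the des column by name inside mutate. I would have to think further about how to handle parent_keys for cases the original example does not cover, such as descriptions with more than two IDs.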

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})

to_find <- "DB\\d{5}"

DeepDDI <- data.frame(des = 
  c("DB00614 may increase the anticholinergic activities of DB06153.",
    "The risk or severity of adverse effects can be increased when DB00802 is combined with DB00310.",
    "The risk or severity of adverse effects can be increased when DB00802 is combined with DB00310, except in the presence of DB04532.",
    "DB00616 and DB06151 show interactions, but no interactions were found between DB00616 and DB04532"))

DeepDDI %>% mutate(parent_keys = str_extract_all(des,to_find))
#>                                                                                                                                  des
#> 1                                                                    DB00614 may increase the anticholinergic activities of DB06153.
#> 2                                    The risk or severity of adverse effects can be increased when DB00802 is combined with DB00310.
#> 3 The risk or severity of adverse effects can be increased when DB00802 is combined with DB00310, except in the presence of DB04532.
#> 4                                  DB00616 and DB06151 show interactions, but no interactions were found between DB00616 and DB04532
#>                          parent_keys
#> 1                   DB00614, DB06153
#> 2                   DB00802, DB00310
#> 3          DB00802, DB00310, DB04532
#> 4 DB00616, DB06151, DB00616, DB04532
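
For the two-column result the original question asks for, one option is to let str_extract_all return a matrix and take its first two columns, which avoids the separate and gsub cleanup steps. This is a minimal sketch, not a tested drop-in: it assumes at least one description contains two or more IDs (otherwise the matrix has fewer than two columns and the indexing fails), it pads rows with fewer matches using empty strings, and it silently drops any third or later ID. The sample data below is limited to the two sentences from the question.

```r
library(stringr)

# Sample data: the two sentences from the original question
deepDDI <- data.frame(
  des = c(
    "DB00614 may increase the anticholinergic activities of DB06153.",
    "The risk or severity of adverse effects can be increased when DB00802 is combined with DB00310."
  ),
  stringsAsFactors = FALSE
)

# simplify = TRUE returns a character matrix: one row per input string,
# one column per match, padded with "" where a row has fewer matches
ids <- str_extract_all(deepDDI$des, "DB\\d{5}", simplify = TRUE)

# Assumption: only the first two IDs per row are wanted
test_df <- data.frame(
  nodeA = ids[, 1],
  nodeB = ids[, 2],
  stringsAsFactors = FALSE
)

test_df
#>     nodeA   nodeB
#> 1 DB00614 DB06153
#> 2 DB00802 DB00310
```

Whether dropping extra IDs is acceptable depends on the data; rows like 3 and 4 above would need a different shape, for example one row per ID.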
