How to skip some words for pattern matching

I have a string and need to match only two kinds of words: metabolism and increase/decrease. All other words should be skipped. I will then pass this pattern to str_detect to split my dataframe.

Sample string:
The metabolism of Drug b can be decreased when combined with Drug a.

My RE: \b(?!Drug|when|combined|with|a|b|of|can|The)\b\S+

I can capture metabolism and decrease for the sample string

But the real dataframe does not contain Drug a or Drug b. It contains the real drug names, and the number of words in a drug name varies from drug to drug. For example, Cyclosporine Brexpiprazole and Ivabradine are 2 drug names.

Then I will apply this pattern in my code like this:

demo %>% filter(str_detect(description,pattern)) -> new_df

Any suggestions would be appreciated.

So are you also trying to extract the drug names? I'm confused by your question. You ask about metabolism and increase/decrease, but then go on to talk about drug names.

Maybe you should type out a row as close to real as you can and then write the output you desire.

I guess I'm also confused about why you aren't doing inclusive extraction instead of exclusive. Why not search for the terms themselves instead of searching for everything that is not the terms?
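To illustrate the inclusive approach, here is a minimal sketch. It assumes the descriptions follow the "The metabolism of X can be increased/decreased when combined with Y" shape from the question. The rows in `demo` are made up for the example, and the drug names are just the placeholders mentioned above.

```r
library(dplyr)
library(stringr)

# Hypothetical data: drug names with different word counts, plus one
# row that is not about metabolism at all
demo <- data.frame(
  description = c(
    "The metabolism of Cyclosporine Brexpiprazole can be decreased when combined with Ivabradine.",
    "The metabolism of Ivabradine can be increased when combined with Cyclosporine Brexpiprazole.",
    "The serum concentration of Ivabradine can be increased when combined with Cyclosporine Brexpiprazole."
  ),
  stringsAsFactors = FALSE
)

# Search for the keywords themselves; ".*" skips whatever sits between
# them, so the drug names never need to be listed
pattern <- "metabolism.*(increase|decrease)"

new_df <- demo %>% filter(str_detect(description, pattern))
nrow(new_df)
```

Because the pattern only names the words to keep, it does not need updating when new drug names show up in the data.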

To extract drug names: they are all proper names, right? So you could do something like...

library(data.table)
library(dplyr)
library(stringr)

df <-
  data.table(
    text = c(
      "During experiments we found that Cipid Duotyllyl increases metabolism when combined with Heptaichreelgynthraene Kaspliorhite",
      "During experiments we found that Sipthyde Frustrur decreases metabolism when combined with Isopheduacceite",
      "During experiments we found that Philphin increases metabolism when combined with Monoapuphyodeptin Heptawonthitharhycin",
      "During experiments we found that Glolfide Diifludran decreases metabolism when combined with Monoichuxyrlumphein",
      "During experiments we found that Octacliusplodein increases metabolism when combined with Fonhesgechlid Diizirdolfygor"
    )
  )

# A drug name is one capitalized word, optionally followed by a second one.
# The leading \\s keeps the sentence-initial "During" from matching.
drug_pattern <- "\\s[A-Z][a-z]*(\\s[A-Z][a-z]*)?"

df %>%
  mutate(
    # str_trim() drops the leading space that the pattern captures
    drug1 = str_trim(str_extract_all(text, drug_pattern, simplify = TRUE)[, 1]),
    drug2 = str_trim(str_extract_all(text, drug_pattern, simplify = TRUE)[, 2])
  )

So why not just extract them both separately?

library(data.table)
library(dplyr)
library(stringr)

df <-
  data.table(
    text = c(
      "During experiments we found that Cipid Duotyllyl increases metabolism when combined with Heptaichreelgynthraene Kaspliorhite",
      "During experiments we found that Sipthyde Frustrur decreases metabolism when combined with Isopheduacceite",
      "During experiments we found that Philphin increases metabolism when combined with Monoapuphyodeptin Heptawonthitharhycin",
      "During experiments we found that Glolfide Diifludran decreases metabolism when combined with Monoichuxyrlumphein",
      "During experiments we found that Octacliusplodein increases metabolism when combined with Fonhesgechlid Diizirdolfygor"
    )
  )

# Each column is NA when its keyword is absent from the text
df %>%
  mutate(
    metabolism = str_extract(text, "metabolism"),
    inc_dec    = str_extract(text, "increase|decrease")
  )

Hi @ZykeZero,

Thank you for your reply. Yes, your regex works fine for detecting the drug names.

No, I do not need the drug names. I only need to extract metabolism and then the word increase or decrease.

Because some strings contain
The metabolism of Drug b can be decreased when combined with Drug a
and sometimes
The metabolism of Drug b can be increased when combined with Drug a

So I want to split my dataframe based on whether the string says metabolism increased or metabolism decreased, using specifically these 2 words.

Let me know if you have any questions.
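For the split described above, one possible sketch: label each row with the direction word and let base R's split() build one data frame per label. The `demo` rows and the `direction` column name are assumptions made for the example, not part of the real data.

```r
library(stringr)

# Hypothetical rows in the same shape as the sample strings
demo <- data.frame(
  description = c(
    "The metabolism of Drug b can be decreased when combined with Drug a.",
    "The metabolism of Drug b can be increased when combined with Drug a.",
    "The metabolism of Drug c can be decreased when combined with Drug d."
  ),
  stringsAsFactors = FALSE
)

# Keep only rows that mention metabolism, then label each with its direction
demo <- demo[str_detect(demo$description, "metabolism"), , drop = FALSE]
demo$direction <- str_extract(demo$description, "increased|decreased")

# split() returns a named list with one data frame per direction;
# rows where direction is NA are left out
by_direction <- split(demo, demo$direction)
names(by_direction)
```

The two pieces are then available as by_direction$increased and by_direction$decreased.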