Remove one word when it appears in a sentence with another word, regardless of their order or how many words are between them

I have a list of strings like this:

string <- c("tasty apple", "tasty orange", "yellow banana", "red tasty peach", "tasty banana apple", "tasty apple yellow banana", "yellow orange banana", "peach tasty apple", "yellow banana tasty peach")

When there is just one type of fruit in the string, it is fine. However, when there are two or more, I have a list of co-occurring words and replacements (like a dictionary):

pattern <- c("banana apple", "banana orange", "peach apple", "banana peach")
replacement <- c("apple", "banana", "peach", "banana")

I can remove one of the fruits when they are next to each other in the string. However, in my data there can be other words between them, and I do not know how to remove the unnecessary word. The order of the words in the string might differ as well.

I want it to be like this:

Before | After
tasty apple | tasty apple
tasty orange | tasty orange
yellow banana | yellow banana
red tasty peach | red tasty peach
tasty banana apple | tasty apple
tasty apple yellow banana | tasty apple yellow
yellow orange banana | yellow banana
peach tasty apple | peach tasty
yellow banana tasty peach | yellow banana tasty

Maybe I can use some kind of regular expression to identify the words in between? But I need to keep them and delete only the unnecessary word.
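One way to approach the regular-expression idea is a pair of lookaheads, which test whether both words occur anywhere in the string without caring about their order or what sits between them. The following is a minimal base R sketch for a single pair (banana and apple); the variable names `x`, `both` and `result` are illustrative only, and it assumes the fruit names appear as whole words.

```r
x <- c("tasty banana apple", "tasty apple yellow banana", "yellow banana")

# Each lookahead scans the whole string, so the two words may appear
# in either order with any number of words between them.
both <- grepl("^(?=.*\\bbanana\\b)(?=.*\\bapple\\b)", x, perl = TRUE)

# Remove only the unwanted word; every other word is left in place.
x[both] <- gsub("\\bbanana\\b", "", x[both], perl = TRUE)

# Tidy up the spacing left behind by the removal.
result <- trimws(gsub(" +", " ", x))
result
```

The third string is left untouched because it contains banana without apple.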

If an element of string has more than one fruit name, what is the decision rule? Does the first fruit win, or the second?

These are inconsistent.

Hello @technocrat! The order here does not matter. What matters is which types of fruit are present in the string. For example, when banana and apple are in the same string, only apple should be kept, no matter what.

However, I can modify my dictionary to cover two scenarios: when banana is first and when apple is first, with apple as the replacement in both cases. But that does not solve the problem of other words between them.

OK, the rules then are

  • apple knocks out banana
  • banana knocks out orange
  • peach knocks out apple
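These rules can be applied pair by pair rather than through a single ranking. The base R sketch below is one possible way to do that, not a definitive solution. It assumes the fruit names appear as whole words, and it also includes a fourth rule (banana knocks out peach) that is not in the list above but is implied by the "banana peach" entry in the original dictionary. The names `rules`, `has_word` and `apply_rules` are hypothetical.

```r
string <- c("tasty apple", "tasty orange", "yellow banana", "red tasty peach",
            "tasty banana apple", "tasty apple yellow banana",
            "yellow orange banana", "peach tasty apple",
            "yellow banana tasty peach")

# One row per rule: when both words are present, `keep` stays and `drop` goes.
rules <- data.frame(
  keep = c("apple", "banana", "peach", "banana"),
  drop = c("banana", "orange", "apple", "peach"),
  stringsAsFactors = FALSE
)

# TRUE where `word` occurs as a whole word in `x`.
has_word <- function(x, word) {
  grepl(paste0("\\b", word, "\\b"), x, perl = TRUE)
}

apply_rules <- function(x, rules) {
  original <- x  # presence is always tested against the untouched input
  for (i in seq_len(nrow(rules))) {
    both <- has_word(original, rules$keep[i]) & has_word(original, rules$drop[i])
    x[both] <- gsub(paste0("\\b", rules$drop[i], "\\b"), "", x[both], perl = TRUE)
  }
  # Collapse the gaps left by removed words.
  trimws(gsub(" +", " ", x))
}

result <- apply_rules(string, rules)
result
```

On the nine example strings this reproduces the "After" column from the question. Strings containing three or more fruits are not covered by the examples, so their handling here is an assumption.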

I'll work on that

I came up with some code that follows...
I initially got 'yellow orange banana' for item 7, because orange is apparently meant as a fruit rather than a colour and so should be a candidate for being dropped. I added it as the 4th priority to resolve that.
One discrepancy remains on item 8: 'peach tasty apple' becomes 'tasty apple' rather than 'peach tasty', because apple is prioritised above peach ...

string <- c("tasty apple", 
            "tasty orange", 
            "yellow banana", 
            "red tasty peach", 
            "tasty banana apple", 
            "tasty apple yellow banana", 
            "yellow orange banana", 
            "peach tasty apple", 
            "yellow banana tasty peach")

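# Fruits in priority order: the first one found in a string is kept,
# and every other fruit in that string is dropped.
priority <- c(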
  "apple",
  "banana",
  "peach",
  "orange"
)


library(tidyverse)
# For each fruit, collect all the other fruits it causes to be dropped.
(pr_df <- expand_grid(
  p1 = priority,
  drop = priority
) |>
  filter(p1 != drop) |>
  group_by(p1) |>
  mutate(rn = row_number()) |>
  pivot_wider(
    values_from = "drop",
    names_from = "rn"
  ) |>
  mutate(drops = list(str_c(pick(everything())))) |>
  select(p1, drops))


map_chr(string, \(x){
  priority_keep <- pr_df$p1[head(which(
    stringi::stri_detect_fixed(x, pr_df$p1)),
    n = 1)]
  if (length(priority_keep) == 0) {
    return(x)
  }
  drops_to_drop <- filter(
    pr_df,
    p1 == !!priority_keep
  ) |>
    pull(drops) |>
    unlist()
  for (d in drops_to_drop) {
    x <- str_replace_all(x,
      pattern = d,
      replacement = ""
    )
  }
  # str_squish() trims the ends and collapses any run of spaces,
  # including the longer runs left when several words are removed.
  str_squish(x)
})
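The remaining discrepancy on item 8 looks structural rather than a matter of picking a better order. Taking the four pairs in the original dictionary as rules, apple beats banana, banana beats peach and peach beats apple, which forms a cycle. The sketch below checks this by brute force over all 24 orderings; `rules`, `permutations` and `consistent` are illustrative names, and the rule table is my reading of the question's `pattern` and `replacement` vectors.

```r
rules <- data.frame(
  keep = c("apple", "banana", "peach", "banana"),
  drop = c("banana", "orange", "apple", "peach"),
  stringsAsFactors = FALSE
)

# All orderings of a vector (fine for four items: 24 permutations).
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (rest in permutations(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], rest)
    }
  }
  out
}

fruits <- c("apple", "banana", "peach", "orange")

# An ordering is consistent if every `keep` ranks ahead of its `drop`.
consistent <- vapply(permutations(fruits), function(ord) {
  all(match(rules$keep, ord) < match(rules$drop, ord))
}, logical(1))

any(consistent)
```

This returns `FALSE`, so no single priority vector can reproduce every row of the expected output. Applying the rules pair by pair avoids the problem.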
