How to correct the misspelled words in a data frame from the corrected spelling words in another dataframe

Hi All,
I found that a similar question was asked in previous posts, but I feel my requirement is different.
I have a data frame containing terms reported by reps, some of which are misspelled.

Reported terms.

abdmen pain
abdomane pain

I have another data frame where I have manually entered the correct spellings.

Wrong spell Correct spell
abdmen abdomen
abdomane abdomen
abdome abdomen
abdumen abdomen
abodmen abdomen
adnomen abdomen
aabdominal abdominal
abddominal abdominal

Now I need to correct the spellings in the reported terms as follows:

Reported terms
abdomen pain
abdomen pain

Could someone please let me know the best approach for this task?
Thanks in advance for any feedback.

There are two ways to approach this problem. One is the way you're thinking of, and the other is using partial matching functions.

For your solution, the stringi package lets you perform a series of replacements on each element in a vector.

# Your example data
reported <- c("abdmen pain", "abdomane pain")
corrections <- rbind(
  c("abdmen", "abdomen"),
  c("abdomane", "abdomen"),
  c("abdome", "abdomen"),
  c("abdumen", "abdomen"),
  c("abodmen", "abdomen"),
  c("adnomen", "abdomen"),
  c("aabdominal", "abdominal"),
  c("abddominal", "abdominal")
)
colnames(corrections) <- c("wrong", "right")
corrections[1:3, ]
#      wrong      right    
# [1,] "abdmen"   "abdomen"
# [2,] "abdomane" "abdomen"
# [3,] "abdome"   "abdomen"

To avoid replacing the insides of words, we'll use a regular expression to require the "wrong" word be a whole word (bounded on both ends). After that, replacement is straightforward with stringi.

library(stringi)
corrections[, "wrong"] <- paste0("\\b", corrections[, "wrong"], "\\b")
stri_replace_all_regex(
  reported,
  corrections[, "wrong"],
  corrections[, "right"],
  vectorize_all = FALSE
)
# [1] "abdomen pain" "abdomen pain"

When vectorize_all is FALSE, the function will apply each pattern/replacement pair to every element of reported.
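
If you'd rather not depend on stringi, roughly the same pair-by-pair replacement can be sketched in base R with a loop over gsub. This is only an illustration of the idea, not what stringi does internally. The vectors wrong, right, fixed and unbounded are names I made up for the example, and I added an already-correct "abdomen pain" to show why the word boundaries matter.

```r
# Same example data as above, reduced to plain vectors (names are illustrative)
reported <- c("abdmen pain", "abdomane pain", "abdomen pain")
wrong <- c("abdmen", "abdomane", "abdome")
right <- c("abdomen", "abdomen", "abdomen")

# Apply each wrong/right pair in turn to every reported term
fixed <- reported
for (i in seq_along(wrong)) {
  pattern <- paste0("\\b", wrong[i], "\\b")
  fixed <- gsub(pattern, right[i], fixed, perl = TRUE)
}
fixed
# [1] "abdomen pain" "abdomen pain" "abdomen pain"

# Without the word boundaries, "abdome" also matches inside a correct "abdomen"
unbounded <- gsub("abdome", "abdomen", "abdomen pain")
unbounded
# [1] "abdomenn pain"
```

The second call shows the kind of damage the "\\b" anchors prevent: the unbounded pattern rewrites the first six letters of a word that was already right.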


The second way to fix misspellings is to use approximate (partial) matching functions, such as agrepl from the base package and aregexec from the utils package. Here's an example using aregexec:

typos <- aregexec("abdomen", reported)
regmatches(reported, typos)
# [[1]]
# [1] "abdmen"
# 
# [[2]]
# [1] "abdoman"

Notice how the matched part for the second element is missing the "e" at the end. This is because the pattern best matches a substring and doesn't care about words. Trying to force it to respect word boundaries doesn't help much:

typos <- aregexec("\\babdomen\\b", reported)
regmatches(reported, typos)
# [[1]]
# [1] "abdmen"
# 
# [[2]]
# character(0)

A more reliable way is to split sentences into words, find which words have any part with a partial match, replace those, and then recombine the sentences.

library(magrittr)
library(stringi)

reported %>%
  stri_split_regex("\\b") %>%
  lapply(function(x) {
    matches <- agrepl("abdomen", x)
    replace(x, matches, "abdomen")
  }) %>%
  vapply(
    FUN = paste0,
    FUN.VALUE = character(1),
    collapse = ""
  )
# [1] "abdomen pain" "abdomen pain"

Remember that using an automated process to edit data into what it "should be" is bound to have a few unintended consequences. For example, the chain above will change "abdominal pain" to "abdomen pain".
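
One possible way to reduce that risk, as a rough sketch only: compare whole words against the target with adist from the utils package, and replace a word only when its edit distance is small. The helper fix_term and the cutoff of 2 are my own assumptions for this example. The cutoff happens to separate these three terms, but it would need checking against real data.

```r
reported <- c("abdmen pain", "abdomane pain", "abdominal pain")

# Hypothetical helper: replace whole words within `limit` edits of `target`
fix_term <- function(term, target = "abdomen", limit = 2) {
  words <- strsplit(term, " ", fixed = TRUE)[[1]]
  # Edit distance from each whole word to the target
  d <- drop(adist(words, target))
  words[d <= limit] <- target
  paste(words, collapse = " ")
}

result <- vapply(reported, fix_term, character(1), USE.NAMES = FALSE)
result
# [1] "abdomen pain"   "abdomen pain"   "abdominal pain"
```

Here "abdmen" is 1 edit from "abdomen" and "abdomane" is 2, while "abdominal" is 3, so it is left alone. A correction table like yours is still the safer option when you already know the misspellings.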

Just to add an option, this can also be done with stringr from the tidyverse:

library(stringr)

corrected <- data.frame(stringsAsFactors=FALSE,
     Wrong_spell = c("abdmen", "abdomane", "abdome", "abdumen", "abodmen",
                     "adnomen", "aabdominal", "abddominal"),
   Correct_spell = c("abdomen", "abdomen", "abdomen", "abdomen", "abdomen",
                     "abdomen", "abdominal", "abdominal")
)

reported <- c("abdmen pain", "abdomane pain", "abdominal pain")

regex_pattern <- setNames(corrected$Correct_spell, paste0("\\b", corrected$Wrong_spell, "\\b")) 

str_replace_all(reported, regex_pattern)
#> [1] "abdomen pain"   "abdomen pain"   "abdominal pain"
