There are two ways to approach this problem. One is the lookup table of corrections you're thinking of, and the other is approximate (fuzzy) matching.
For your approach, the stringi package lets you apply a series of replacements to each element of a vector.
# Your example data
reported <- c("abdmen pain", "abdomane pain")
corrections <- rbind(
  c("abdmen", "abdomen"),
  c("abdomane", "abdomen"),
  c("abdome", "abdomen"),
  c("abdumen", "abdomen"),
  c("abodmen", "abdomen"),
  c("adnomen", "abdomen"),
  c("aabdominal", "abdominal"),
  c("abddominal", "abdominal")
)
colnames(corrections) <- c("wrong", "right")
corrections[1:3, ]
# wrong right
# [1,] "abdmen" "abdomen"
# [2,] "abdomane" "abdomen"
# [3,] "abdome" "abdomen"
To avoid replacing text inside longer words, we'll wrap each "wrong" word in the regular expression anchor \b so that it only matches as a whole word (bounded on both ends). After that, replacement is straightforward with stringi.
library(stringi)
corrections[, "wrong"] <- paste0("\\b", corrections[, "wrong"], "\\b")
stri_replace_all_regex(
  reported,
  corrections[, "wrong"],
  corrections[, "right"],
  vectorize_all = FALSE
)
# [1] "abdomen pain" "abdomen pain"
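
To see why the word boundaries matter, here is a small base R sketch using the "abdome" correction from the table. It uses gsub rather than stringi only to keep the example dependency-free; the variable names are just for illustration.

```r
# Without anchors, the pattern "abdome" also matches inside the
# correctly spelled "abdomen" and corrupts it.
unanchored <- gsub("abdome", "abdomen", "abdomen pain")
unanchored
# [1] "abdomenn pain"

# With \\b on both sides, only the standalone word "abdome" is replaced.
anchored <- gsub("\\babdome\\b", "abdomen",
                 c("abdomen pain", "abdome pain"), perl = TRUE)
anchored
# [1] "abdomen pain" "abdomen pain"
```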
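When vectorize_all is FALSE, stri_replace_all_regex applies every pattern/replacement pair to every element of reported. With the default, TRUE, patterns are instead paired with the elements of reported one-to-one (with recycling).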
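
A toy comparison may make the vectorize_all behaviour clearer. The input strings, patterns, and variable names below are made up for illustration; stri_replace_all_fixed is used so that no regex syntax gets in the way.

```r
library(stringi)

x <- c("aa bb", "bb aa")
patterns <- c("aa", "bb")
replacements <- c("X", "Y")

# Default (vectorize_all = TRUE): element i is only processed with
# pattern i, so each string gets a single pattern/replacement pair.
paired <- stri_replace_all_fixed(x, patterns, replacements)
paired
# [1] "X bb" "Y aa"

# vectorize_all = FALSE: every pattern/replacement pair is applied
# to every string.
all_pairs <- stri_replace_all_fixed(x, patterns, replacements,
                                    vectorize_all = FALSE)
all_pairs
# [1] "X Y" "Y X"
```

This is why the corrections table needs vectorize_all = FALSE: it has more rows than reported has elements, and any row may apply to any report.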
The second way to fix misspellings is to use the approximate matching functions in the base package, such as aregexec and agrepl. Here's an example with aregexec:
typos <- aregexec("abdomen", reported)
regmatches(reported, typos)
# [[1]]
# [1] "abdmen"
#
# [[2]]
# [1] "abdoman"
Notice how the matched part for the second element is missing the "e" at the end. This is because the pattern best matches a substring and doesn't care about words. Trying to force it to respect word boundaries doesn't help much:
typos <- aregexec("\\babdomen\\b", reported)
regmatches(reported, typos)
# [[1]]
# [1] "abdmen"
#
# [[2]]
# character(0)
A more reliable way is to split the sentences into words, use agrepl to find which words contain an approximate match, replace those words, and then recombine the sentences.
library(magrittr)
library(stringi)
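reported %>%
  # Split at word boundaries; the spaces are kept as their own pieces
  stri_split_regex("\\b") %>%
  lapply(function(x) {
    # Flag every piece that approximately contains "abdomen"
    matches <- agrepl("abdomen", x)
    replace(x, matches, "abdomen")
  }) %>%
  # Glue the pieces of each sentence back together
  vapply(
    FUN = paste0,
    FUN.VALUE = character(1),
    collapse = ""
  )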
# [1] "abdomen pain" "abdomen pain"
Remember that using an automated process to edit data into what it "should be" is bound to have a few unintended consequences. For example, the chain above will change "abdominal pain" to "abdomen pain".
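
One possible way to reduce that kind of false positive is to compare whole words with adist from base R instead of looking for an approximate substring. The sketch below is not part of the approach above: fix_word is a hypothetical helper, the cutoff of 2 edits is an assumed value you would need to tune, and it splits on single spaces only, so punctuation is not handled.

```r
# Hypothetical helper: replace a word only when its whole-word edit
# distance to the target is small (max_edits = 2 is an assumed cutoff).
fix_word <- function(sentences, target, max_edits = 2) {
  vapply(
    strsplit(sentences, " ", fixed = TRUE),
    function(words) {
      # adist() compares entire strings, so "abdominal" is 3 edits
      # away from "abdomen" and is left alone.
      d <- adist(words, target)[, 1]
      paste(replace(words, d <= max_edits, target), collapse = " ")
    },
    FUN.VALUE = character(1)
  )
}

fixed <- fix_word(
  c("abdmen pain", "abdomane pain", "abdominal pain"),
  "abdomen"
)
fixed
# [1] "abdomen pain"   "abdomen pain"   "abdominal pain"
```

A cutoff like this trades one kind of error for another: a tighter limit misses heavier typos, and a looser one starts rewriting legitimate words again, so the results are still worth spot-checking.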