I am trying to clean a dataset as part of pre-processing. The dataset has, among others, a text variable in which respondents wrote free text, ranging from a single sentence to full paragraphs. I want to tokenize this text into words, go through all the unique words, and fix the typing errors.
My plan was to use:
library(stringr)  # "text" below stands in for the name of the free-text column
df$text <- df$text %>% str_replace_all("typo1", "fix1") %>% str_replace_all("typo2", "fix2")
However, is there a way to feed in a data frame containing all the typos and their fixes instead? That way I can keep the list of typos and fixes separately as well, for example with Var1 = Typo and Var2 = Fix.
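To make the idea concrete, here is a minimal sketch of what such a lookup table could look like, assuming stringr is available. The names typos, replacements, text and text_clean, the sample rows, and the "wrok" typo are all hypothetical and only for illustration; the "24h" pattern is taken from the example in this question. The sketch relies on str_replace_all() accepting a named character vector, where the names are the patterns and the values are the replacements.

```r
library(stringr)

# Hypothetical lookup table: Var1 = typo (a regular expression), Var2 = fix
typos <- data.frame(
  Var1 = c("(\\d+)h\\b", "\\bwrok\\b"),
  Var2 = c("\\1 hour", "work"),
  stringsAsFactors = FALSE
)

# Hypothetical data; "text" stands in for the real free-text column
df <- data.frame(
  text = c("I work 24h per week", "I wrok from home"),
  stringsAsFactors = FALSE
)

# Named character vector: names are the patterns, values are the replacements
replacements <- setNames(typos$Var2, typos$Var1)

# Each pattern/replacement pair is applied in turn to every element of the column
df$text_clean <- str_replace_all(df$text, replacements)
```

With this sample data, text_clean should contain "I work 24 hour per week" and "I work from home". Because Var1 is interpreted as a regular expression, plain-word typos are probably best wrapped in \\b word boundaries so that they do not match inside longer words.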
Example of an instance in the text variable:
"I work 24h per week" and I would like this to turn into "I work 24 hour per week"