Code region names spelled differently

Hi all:
I have a dataset with 1600 observations. Respondents were asked to write down the region where they live. They spelt the region names in all sorts of creative ways: capitalized, lowercase, ALL CAPS, misspelt, etc.

I would like to map each of these spellings to the correct level of a single factor (there are 20 regions in total). Is there a way to do this in R? The dataset is in Italian, if that is relevant.

Thank you for your help.

Case differences can be handled using functions from stringr. Here's an example of how to convert everything to sentence case. Once the case has been standardized, you can convert it to a factor.

regions <- c("Abruzzo", "CALABRIA", "lombardy", "TUScany")

stringr::str_to_sentence(regions)
#> [1] "Abruzzo"  "Calabria" "Lombardy" "Tuscany"

Created on 2020-05-14 by the reprex package (v0.3.0)
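
To make the factor step concrete, here is a minimal sketch, assuming you have a list of the valid region names to check against. The `valid_regions` vector and the extra sample responses are made up for illustration; in practice the vector would hold all 20 regions. Anything that does not match a valid level becomes NA, which makes the leftover misspellings easy to find.

```r
library(stringr)

# Sample responses; the last two are hypothetical additions for illustration
regions <- c("Abruzzo", "CALABRIA", "lombardy", "TUScany", " calabria ", "Calabira")

# Hypothetical list of valid levels; in practice this would hold all 20 regions
valid_regions <- c("Abruzzo", "Calabria", "Lombardy", "Tuscany")

# Trim stray whitespace, then standardize the case
cleaned <- str_to_sentence(str_squish(regions))

# Values that do not match a valid level become NA
region_factor <- factor(cleaned, levels = valid_regions)

# List the spellings that still need to be fixed by hand
unmatched <- unique(cleaned[is.na(region_factor)])
unmatched
```

One thing to watch for: sentence case lowercases everything after the first letter, so a multi-word name such as "Emilia-Romagna" should come out as "Emilia-romagna". The valid levels need to be written the same way, or passed through the same function, for the match to work.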

Misspellings are harder to deal with. You'll probably have to use something like forcats::fct_collapse() to combine the misspelled regions into the correct factor level.
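
As a rough sketch of what that could look like, assuming the case has already been standardized: each argument to `fct_collapse()` names the level to keep and lists the spellings to fold into it. The misspellings below are invented for illustration, so you would replace them with whatever actually shows up in your data.

```r
library(forcats)

# Hypothetical responses after case standardization; the misspellings are made up
cleaned <- c("Calabria", "Calabira", "Lombardy", "Lombardia", "Lombardi", "Abruzzo")

# Each argument is: correct_level = c(spellings to merge into it)
region_factor <- fct_collapse(
  factor(cleaned),
  Calabria = c("Calabria", "Calabira"),
  Lombardy = c("Lombardy", "Lombardia", "Lombardi")
)

# Check the result: only the correct levels should remain
levels(region_factor)
table(region_factor)
```

With 1600 observations the list of variants has to be built by hand, so it helps to look at `unique()` or `table()` of the cleaned values first to see which spellings exist.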

Thank you very much - this is really useful and gives me somewhere to start.

I'll keep this open a bit just in case there is anyone else who knows any additional tricks.
