Is there a dictionary of commonly mis-encoded (English) strings/words?

I have some string data.

In a column, correctly encoded 'Café' and wrongly encoded 'CafÃ©' are both present.

I was wondering if there is some dictionary I can use to correct these wrongly encoded words.

Hi @pathos
I'm late to the party on this, but can you build your own dictionary like so?

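# Sample data: row 3 is mis-encoded (UTF-8 bytes that were read as Windows-1252)
dat <- read.table(header=TRUE, text="
index word_1
1 garçon
2 café
3 cafÃ©
4 voilà
5 déjà
6 hôpital
7 Noël
8 bäcker
9 vögel
10 frühling
")

# Undo the mis-decoding: turn each string back into Windows-1252 bytes
dat$word_2 <- iconv(dat$word_1, to="Windows-1252", from="UTF-8")
# Keep the converted string only when it has fewer characters than the original
# (a repaired word is shorter; an invalid conversion gives NA in a UTF-8 locale)
dat$result <- ifelse(nchar(dat$word_2, allowNA=TRUE) < nchar(dat$word_1, allowNA=TRUE),
                     dat$word_2, dat$word_1)
# Where the comparison was NA, fall back to the original string
dat$result <- ifelse(is.na(dat$result), dat$word_1, dat$result)

dat
#>    index   word_1      word_2   result
#> 1      1   garçon   gar\xe7on   garçon
#> 2      2     café     caf\xe9     café
#> 3      3    cafÃ©        café     café
#> 4      4    voilà    voil\xe0    voilà
#> 5      5     déjà  d\xe9j\xe0     déjà
#> 6      6  hôpital  h\xf4pital  hôpital
#> 7      7     Noël     No\xebl     Noël
#> 8      8   bäcker   b\xe4cker   bäcker
#> 9      9    vögel    v\xf6gel    vögel
#> 10    10 frühling fr\xfchling frühling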

Created on 2023-10-16 with reprex v2.0.2
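
If you would rather have a literal lookup table, here is a rough sketch of one way to generate it instead of typing it by hand. It assumes a UTF-8 R session and that the damage came from UTF-8 bytes being read as Windows-1252. The names good, bad, mojibake_dict and fix_with_dict are made up for this example, and the list of characters is only a starting point.

```r
# Correctly encoded characters expected in the data (an assumption; extend as needed).
good <- c("\u00e7", "\u00e9", "\u00e0", "\u00f4", "\u00eb", "\u00e4", "\u00f6", "\u00fc")

# Simulate the error: read each character's UTF-8 bytes as if they were Windows-1252.
bad <- iconv(good, from = "Windows-1252", to = "UTF-8")

# Named vector used as the dictionary: names are the garbled forms, values the fixes.
mojibake_dict <- setNames(good, bad)

# Hypothetical helper: replace every garbled sequence found in x.
fix_with_dict <- function(x, dict = mojibake_dict) {
  for (garbled in names(dict)) {
    x <- gsub(garbled, dict[[garbled]], x, fixed = TRUE)
  }
  x
}

fix_with_dict(c("caf\u00c3\u00a9", "caf\u00e9", "gar\u00c3\u00a7on"))
```

Strings that are already correct pass through unchanged because they contain none of the garbled sequences. The trade-off is that only the characters listed in good get repaired.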

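You may have to try other encoding pairs in iconv (for example ISO-8859-1 instead of Windows-1252) to get the desired result, depending on how the text was mis-decoded in the first place.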
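
A minimal sketch of how trying several encodings could look, again assuming a UTF-8 session. fix_mojibake and its candidates argument are hypothetical names for this example, not part of any package. A conversion is accepted only if the resulting bytes are valid UTF-8 and the string got shorter, the same idea as the nchar comparison above.

```r
# Hypothetical helper: try to undo one round of mis-decoding per candidate encoding.
fix_mojibake <- function(x, candidates = c("Windows-1252", "ISO-8859-1")) {
  out <- x
  for (enc in candidates) {
    # Turn the text back into the candidate's bytes; NA when a character has no mapping.
    bytes <- iconv(out, from = "UTF-8", to = enc)
    # Declare those bytes to be UTF-8, which is what they were before the damage.
    Encoding(bytes) <- "UTF-8"
    # Accept a conversion only if it is valid UTF-8 and shorter than the input.
    ok <- !is.na(bytes) & validUTF8(bytes) &
      nchar(bytes, allowNA = TRUE) < nchar(out, allowNA = TRUE)
    out[ok] <- bytes[ok]
  }
  out
}

fix_mojibake(c("caf\u00c3\u00a9", "caf\u00e9", "gar\u00c3\u00a7on"))
```

This is a heuristic: a legitimately spelled string that happens to look like mojibake would also be converted, so it is worth checking the changed rows by eye.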
