I have some string data.
In one column, both the correctly encoded 'Café' and the wrongly encoded 'CafÃ©' are present.
I was wondering if there is some dictionary I can use to correct these wrongly encoded words.
Hi @pathos
I'm late to the party on this, but can you build your own dictionary like so?
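dat <- read.table(header=TRUE, text="
index word_1
1 garçon
2 café
3 cafÃ©
4 voilà
5 déjà
6 hôpital
7 Noël
8 bäcker
9 vögel
10 frühling
")

# Convert the characters back to Windows-1252 bytes. For a garbled word such as
# 'cafÃ©' those bytes read as valid UTF-8 again ('café').
dat$word_2 <- iconv(dat$word_1, to="Windows-1252", from="UTF-8")

# Keep the converted word only when it is shorter than the original, i.e. when
# two garbled characters collapsed back into one.
dat$result <- ifelse(nchar(dat$word_2, allowNA=TRUE) < nchar(dat$word_1, allowNA=TRUE),
                     dat$word_2, dat$word_1)

# Correctly encoded accented words become invalid UTF-8 after the conversion,
# so nchar() and ifelse() return NA for them; fall back to the original word.
dat$result <- ifelse(is.na(dat$result), dat$word_1, dat$result)
dat
#>    index   word_1      word_2   result
#> 1      1   garçon   gar\xe7on   garçon
#> 2      2     café     caf\xe9     café
#> 3      3    cafÃ©        café     café
#> 4      4    voilà    voil\xe0    voilà
#> 5      5     déjà  d\xe9j\xe0     déjà
#> 6      6  hôpital  h\xf4pital  hôpital
#> 7      7     Noël     No\xebl     Noël
#> 8      8   bäcker   b\xe4cker   bäcker
#> 9      9    vögel    v\xf6gel    vögel
#> 10    10 frühling fr\xfchling frühling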
Created on 2023-10-16 with reprex v2.0.2
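In case it helps to see why the round trip repairs only the garbled row, here is a minimal sketch of how this kind of mojibake arises and is undone. It assumes the damage came from UTF-8 bytes being misread as Windows-1252, which is a guess about your data rather than something the thread confirms. The variable names are just for illustration.

```r
# "café", written with a \u escape so the example does not depend on file encoding.
good <- "caf\u00e9"

# UTF-8 stores é as two bytes (c3 a9).
utf8_bytes <- charToRaw(good)

# Misreading those bytes as Windows-1252 turns each byte into its own character,
# which produces the garbled form.
bad <- iconv(rawToChar(utf8_bytes), from = "Windows-1252", to = "UTF-8")

# Converting back to Windows-1252 recovers the original bytes, which are valid UTF-8.
repaired <- iconv(bad, from = "UTF-8", to = "Windows-1252")
Encoding(repaired) <- "UTF-8"

nchar(bad)       # 5 characters: c, a, f, Ã, ©
nchar(repaired)  # 4 characters again
identical(repaired, good)
```

A correctly encoded word has no such byte pair to collapse, which is why the length comparison in the reprex leaves it alone.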
You may have to try other encoding pairs in the iconv() call (the from and to arguments) if your data was garbled by a different mismatch.
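If you do need to test more than one mismatch, something along these lines might save some typing. It is only a sketch: `fix_mojibake` is a hypothetical helper name, the default candidate list is an assumption, and it uses `validUTF8()` instead of the `nchar()`/`NA` check from the reprex to decide whether a conversion produced sensible text.

```r
# Hypothetical helper: try several candidate "misread as" encodings and keep the
# first round trip that yields valid, shorter UTF-8 text.
fix_mojibake <- function(x, candidates = c("Windows-1252", "ISO-8859-1")) {
  x <- enc2utf8(x)
  out <- x
  for (i in seq_along(x)) {
    if (is.na(x[i])) next
    for (enc in candidates) {
      # Map the characters back to the bytes they were (possibly) misread from.
      bytes <- iconv(x[i], from = "UTF-8", to = enc, toRaw = TRUE)[[1]]
      if (is.null(bytes)) next          # not representable in this encoding
      cand <- rawToChar(bytes)
      if (!validUTF8(cand)) next        # bytes are not UTF-8, so x[i] was fine
      Encoding(cand) <- "UTF-8"
      if (nchar(cand) < nchar(x[i])) {  # a real repair collapses 2+ chars into 1
        out[i] <- cand
        break
      }
    }
  }
  out
}

words <- c("caf\u00c3\u00a9", "caf\u00e9", "gar\u00e7on", "hello")
fixed <- fix_mojibake(words)
fixed
```

With the four test words above, only the first one should change. Text that was garbled twice, or by an encoding not in `candidates`, would be left as it is.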