Character encoding issue - tokenized data

I am running into an issue with character encoding while doing text mining with the tidyverse. I am looking at an Italian dataset, and after tokenizing my data I am noticing that some characters are not translating properly. For example, a word like "un'altra" will sometimes end up as "un<U+0092>altra" (the issue with the apostrophe is not consistent, either), or a word will end up as "communit".

I have tried to fix the tokenized data set by converting to UTF-8 with the utf8 package and to Latin-1 with stringi, but without success, even when the reported encoding changes.
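For reference, the conversions I attempted looked roughly like this (a sketch, not my exact code):

library(utf8)
library(stringi)

tokens <- "un<U+0092>altra"                      # one of the corrupted tokens
as_utf8(tokens)                                  # utf8 package: convert/validate as UTF-8
stri_encode(tokens, from = "", to = "latin1")    # stringi: re-encode to Latin-1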

Is there a solution to this, either in the way the data is tokenized or by changing the encoding of the tokenized data?

I am using a Windows laptop with R 3.6.2.

Hi, and welcome!

Try the tau package:

library(tau)
txt <- "The quick br\xfcn f\xf6x j\xfbmps \xf5ver the lazy d\xf6\xf8g."
Encoding(txt) <- "latin1"   # declare the bytes as Latin-1 (Encoding() itself is base R)
txt
#> [1] "The quick brün föx jûmps õver the lazy döøg."

Created on 2020-01-08 by the reprex package (v0.3.0)
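A note on the apostrophe case specifically: byte 0x92 is a control character in Latin-1 but the right single quotation mark in Windows-1252 (a common encoding for files produced on Windows), which is why it surfaces as <U+0092>. If that matches your data, converting with Windows-1252 as the source encoding may work better; a minimal sketch with base R's iconv():

txt <- "un\x92altra"
iconv(txt, from = "windows-1252", to = "UTF-8")
#> [1] "un’altra"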


Thanks so much for this, it solved the problem once I went back to an earlier data set where the characters were represented with \x escapes rather than <U+...> codes.
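In case it is useful to others: for tokens that already contain the literal <U+0092> text, a plain substitution is another fallback (a sketch, assuming the marker appears verbatim in the strings):

x <- "un<U+0092>altra"
gsub("<U+0092>", "\u2019", x, fixed = TRUE)   # swap in a real right single quote
#> [1] "un’altra"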


Great. Please mark the solution for the benefit of those who follow.

What tokenizer are you using? Most text mining tools are optimized for English, with things like non-ASCII characters and complicated inflections causing some degree of pain.

I had good results with the udpipe package; the lemmatization was particularly helpful.
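For example, a minimal udpipe sketch for Italian (the model file is downloaded once; the token/lemma/upos columns follow the CoNLL-U format):

library(udpipe)
m <- udpipe_download_model(language = "italian")        # one-time download
model <- udpipe_load_model(file = m$file_model)
ann <- udpipe_annotate(model, x = "Questa è un'altra comunità.")
as.data.frame(ann)[, c("token", "lemma", "upos")]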

