I am running into a character-encoding issue while doing text mining with the tidyverse. I am working with an Italian dataset, and after tokenizing it I noticed that some characters are not rendered properly. For example, a word like "un'altra" sometimes ends up as "un<U+0092>altra" (the apostrophe problem is not consistent either), or a word ends up as "communit".
I have tried to fix this in the tokenized dataset by converting it to UTF-8 with the utf8 package and to Latin-1 with stringi, but without success: the encoding changes, yet the garbled characters remain.
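For reference, the attempts were roughly along these lines. This is only a minimal sketch: the `tokens` vector is a made-up stand-in for my data, and `utf8::as_utf8()` and `stringi::stri_encode()` are an approximation of the calls, not an exact copy of my script.

```r
# Hypothetical token standing in for the real tokenized data;
# "\u0092" is the stray character that shows up as <U+0092>.
tokens <- c("un\u0092altra")

# Attempt 1: coerce to UTF-8 with the utf8 package
tokens_utf8 <- utf8::as_utf8(tokens)

# Attempt 2: convert from UTF-8 to Latin-1 with stringi
tokens_latin1 <- stringi::stri_encode(tokens, from = "UTF-8", to = "latin1")

# Inspect the declared encodings after each attempt
Encoding(tokens_utf8)
Encoding(tokens_latin1)

# Check whether the stray character is still present after attempt 1
still_garbled <- grepl("\u0092", tokens_utf8, fixed = TRUE)
still_garbled
```

In both cases the declared encoding changes, but the token itself does not come back as "un'altra".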
Is there a solution to this, either in the way the data is tokenized or by changing the encoding of the tokenized data?
I am using R 3.6.2 on a Windows laptop.