Encoding solution for Linux and Windows 10

Sinval · October 16, 2017, 10:47pm

Hello dear Community!

I use RStudio on Windows 10 and on Linux. Sometimes I work with data (most often from *.sav files) containing characters from non-English languages (usually Portuguese), and I end up with a lot of encoding errors. On Linux everything works fine (en_US.UTF-8), but on Windows it doesn't. Data that I have imported and saved in a *.Rdata file also appears with encoding errors on Windows. Is there a solution for this issue?

Thanks in advance for any tips or solutions.

jjwkdl · October 17, 2017, 12:04pm

I can't tell what the exact problem is, so here are several things to try.

Go to Global Options and set the default text encoding to UTF-8.
When loading data, use the reader's encoding option, e.g. read.table(file, encoding = "UTF-8").
Sys.getlocale() shows your session's locale, and Encoding(x) shows the encoding declared for a character vector x (the mark on the strings, not the encoding of a file on disk).
iconv(data, "CP949", "UTF-8") converts data from the CP949 encoding to UTF-8; replace "CP949" with the encoding your data is actually in.
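
As a rough illustration of the last two items, here is a minimal sketch. It assumes the source encoding is Latin-1, which is only a guess for Portuguese text (the CP949 above is a Korean code page), and the variable names are made up for the example. The string is built from raw bytes so the result does not depend on how the script file itself is saved.

```r
# "coração" as Latin-1 bytes (assumed source encoding for this sketch).
x_latin1 <- rawToChar(as.raw(c(0x63, 0x6f, 0x72, 0x61, 0xe7, 0xe3, 0x6f)))
Encoding(x_latin1) <- "latin1"   # declare what the bytes are

Sys.getlocale("LC_CTYPE")        # character-set locale of the current session
Encoding(x_latin1)               # the declared mark: "latin1"

# iconv() rewrites the bytes from one encoding into another.
x_utf8 <- iconv(x_latin1, from = "latin1", to = "UTF-8")
charToRaw(x_utf8)                # "ç" and "ã" are now two bytes each
```

If the `from` encoding is guessed wrong, iconv() will happily produce different strange characters, so it is worth checking the raw bytes first. iconvlist() shows the encoding names available on your system.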

I hope these methods help.

Sinval · October 18, 2017, 8:53am

Thanks for your suggestions. The iconv() function didn't work in my case: I tried to convert the "problematic" data frame column and it just shows other strange characters. In this case, I'm using a data frame imported from an online query.

kevinushey · October 18, 2017, 4:34pm

You might need to re-mark the encoding on some character vectors, e.g. (assuming they truly are UTF-8)

Encoding(cv) <- "UTF-8"

However, there are a number of assorted issues re: the handling of UTF-8 encoding (and printing of UTF-8 characters) on Windows, so it's possible that everything is indeed encoded and 'working' correctly; it's simply not printing correctly. Unfortunately, many of these issues are R issues rather than RStudio issues and so any associated fixes would need to be implemented upstream.
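
To make the difference from iconv() concrete, here is a small sketch. The bytes below are an assumed example (the UTF-8 bytes of "ação"), not the data from this thread. Re-marking changes only the declared encoding, while the bytes stay exactly as they were.

```r
# UTF-8 bytes for "ação"; rawToChar() leaves the string unmarked.
cv <- rawToChar(as.raw(c(0x61, 0xc3, 0xa7, 0xc3, 0xa3, 0x6f)))
Encoding(cv)                       # "unknown": R assumes the native encoding
bytes_before <- charToRaw(cv)

Encoding(cv) <- "UTF-8"            # re-mark; no conversion happens
Encoding(cv)                       # "UTF-8"
identical(bytes_before, charToRaw(cv))   # TRUE: the bytes are untouched
```

This may be why iconv() produced other strange characters: if the bytes are already valid UTF-8 and only the mark is missing, converting them again from some other encoding mangles them further.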

Sinval · October 19, 2017, 9:45am

Thanks, it worked with a character vector. Can I do it for a selection of data frame columns?

kevinushey · October 19, 2017, 5:35pm

I would just loop through the columns, e.g.

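for (column in names(df)) {
  # Encoding<- only accepts character vectors, so skip other column types
  if (is.character(df[[column]])) Encoding(df[[column]]) <- "UTF-8"
}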

Replace names(df) with the vector of variables that need their encoding re-attached.
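
If you end up doing this after every import, the loop could be wrapped in a small helper. The sketch below is one way to do it. The name mark_utf8() and the toy survey data frame are made up for illustration, and it assumes the character columns really do hold UTF-8 bytes.

```r
# Hypothetical helper: re-mark the chosen character columns as UTF-8.
mark_utf8 <- function(df, columns = names(df)) {
  for (column in columns) {
    # Encoding<- only accepts character vectors; leave other columns alone.
    if (is.character(df[[column]])) {
      Encoding(df[[column]]) <- "UTF-8"
    }
  }
  df
}

# Toy data: "São" given as UTF-8 bytes, plus a plain ASCII value.
survey <- data.frame(
  id = 1:2,
  city = c(rawToChar(as.raw(c(0x53, 0xc3, 0xa3, 0x6f))), "Porto"),
  stringsAsFactors = FALSE
)
survey <- mark_utf8(survey)
Encoding(survey$city)   # "UTF-8" "unknown" (pure ASCII strings are never marked)
```

Factor columns are skipped by the is.character() check. For those you would need to re-mark levels(df[[column]]) instead.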

Sinval · October 23, 2017, 11:21am

Thank you, @kevinushey!
Nevertheless, can you suggest a default solution? In other words, will I have to do this every time I import a UTF-8 data frame?

kevinushey · October 23, 2017, 5:19pm

I'm not sure if there's an automatic solution. Some APIs for reading files (e.g. read.table()) have arguments for assuming an encoding (e.g. the fileEncoding argument); without more information it's hard to say. This is somewhat outside the bounds of IDE-specific questions, so you might want to follow up in a separate category.
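
For file-based imports, a sketch of the fileEncoding approach might look like the following. The temporary file and its Latin-1 contents are invented for the example, and it assumes the session's native encoding can represent the characters (e.g. a UTF-8 locale), because fileEncoding re-encodes the input into the native encoding as it is read.

```r
# Write a tiny Latin-1 file byte by byte: a header line and the value "João".
path <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("name\n"), as.raw(c(0x4a, 0x6f, 0xe3, 0x6f)), charToRaw("\n")),
  path
)

# Declare the file's encoding at read time instead of fixing strings afterwards.
dat <- read.table(path, header = TRUE, fileEncoding = "latin1",
                  stringsAsFactors = FALSE)
charToRaw(enc2utf8(dat$name))   # inspect the bytes after conversion to UTF-8
unlink(path)
```

This only helps when the data comes from a file or connection you read yourself. For data returned by a package function, the encoding handling is up to that package.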

Sinval · October 23, 2017, 5:33pm

Thanks, @kevinushey. The data comes from a query hosted in LimeSurvey, and I import it through the limer package. I will ask the package's creator for a solution.