Encoding solution for Linux and Windows 10

Hello dear Community!

I use RStudio on both Windows 10 and Linux. I sometimes work with data (mostly from *.sav files) containing characters from non-English languages (usually Portuguese), and I end up with a lot of encoding errors. On Linux everything works fine (en_US.UTF-8), but on Windows it does not. Data that I have imported and saved in a *.Rdata file also appears with encoding errors on Windows. Is there any solution for this issue?

Thanks in advance for any tips or solutions.


I can't tell exactly what the problem is, so here are several possible solutions.

  1. Go to Global Options and set the default text encoding to UTF-8.

  2. When loading data, there are encoding options, e.g. 'read.table(file, encoding = "UTF-8")'.

  3. Sys.getlocale() shows your system locale, and Encoding(x) shows the encoding declared for a character vector x (not for a file on disk).
    iconv(data, "CP949", "UTF-8") converts data from CP949 to UTF-8; replace "CP949" with whatever encoding your data is actually in.
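As a minimal sketch of the third suggestion, the snippet below inspects the locale and converts a string with iconv(). It assumes the data is in Latin-1, which is only a guess for Portuguese text and is not stated in the thread. The string is built from raw bytes so the example does not depend on how the script file itself is saved.

```r
# "João" in Latin-1 is the byte sequence 4a 6f e3 6f.
x <- rawToChar(as.raw(c(0x4a, 0x6f, 0xe3, 0x6f)))
Encoding(x) <- "latin1"       # declare what the bytes are (assumption: Latin-1)

Sys.getlocale("LC_CTYPE")     # character-type locale of the current session
Encoding(x)                   # declared encoding mark of the string

# Convert the bytes themselves from Latin-1 to UTF-8.
y <- iconv(x, from = "latin1", to = "UTF-8")
Encoding(y)                   # "UTF-8"
charToRaw(y)                  # 4a 6f c3 a3 6f
```

If the `from` encoding is wrong, iconv() still returns something, but the result is a different set of wrong characters, so it is worth checking the raw bytes first.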

I hope these methods help. :slight_smile:

Thanks for your suggestions. The iconv() function didn't work in my case: I tried converting the "problematic" data frame column and it just showed other strange characters. In this case, I'm using a data frame imported from an online query.

You might need to re-mark the encoding on some character vectors, e.g. (assuming they truly are UTF-8)

Encoding(cv) <- "UTF-8"

However, there are a number of assorted issues re: the handling of UTF-8 encoding (and printing of UTF-8 characters) on Windows, so it's possible that everything is indeed encoded and 'working' correctly; it's simply not printing correctly. Unfortunately, many of these issues are R issues rather than RStudio issues and so any associated fixes would need to be implemented upstream.
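To illustrate the difference from iconv(), here is a small sketch showing that re-marking changes only the declared encoding and leaves the bytes untouched. The sample string is made up for the example and is built from raw UTF-8 bytes.

```r
# UTF-8 bytes for "São" (53 c3 a3 6f), created without any encoding mark.
cv <- rawToChar(as.raw(c(0x53, 0xc3, 0xa3, 0x6f)))
Encoding(cv)                              # "unknown": read as native encoding

bytes_before <- charToRaw(cv)
Encoding(cv) <- "UTF-8"                   # re-mark only, no conversion

identical(charToRaw(cv), bytes_before)    # TRUE: the bytes did not change
nchar(cv, type = "bytes")                 # 4
nchar(cv, type = "chars")                 # 3, now that R knows it is UTF-8
```

This is why re-marking only helps when the bytes really are UTF-8; if they are in another encoding, a conversion with iconv() is needed instead.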


Thanks, it worked with a character vector. Can I do the same for a selection of data frame columns?

I would just loop through the columns, e.g.

for (column in names(df))
  if (is.character(df[[column]]))  # Encoding<- fails on non-character columns
    Encoding(df[[column]]) <- "UTF-8"

Replace names(df) with the vector of variables that need their encoding re-attached.
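If this has to be repeated after every import, the loop could be wrapped in a small helper. The sketch below is only an illustration: mark_utf8() and the survey data frame are hypothetical names, and it assumes the character columns truly contain UTF-8 bytes.

```r
# Hypothetical helper: mark the chosen character columns as UTF-8.
# Non-character columns are skipped because Encoding<- only accepts
# character vectors.
mark_utf8 <- function(df, columns = names(df)) {
  for (column in columns) {
    if (is.character(df[[column]])) {
      Encoding(df[[column]]) <- "UTF-8"
    }
  }
  df
}

# Made-up data: "São" built from raw UTF-8 bytes, plus a plain ASCII value.
survey <- data.frame(
  id = 1:2,
  city = c(rawToChar(as.raw(c(0x53, 0xc3, 0xa3, 0x6f))), "Porto"),
  stringsAsFactors = FALSE
)

survey <- mark_utf8(survey)
Encoding(survey$city)   # "UTF-8" "unknown" (pure ASCII strings are never marked)
```

Calling mark_utf8(survey, columns = "city") restricts the change to selected columns.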

Thank you, @kevinushey!
Nevertheless, can you suggest a default solution? In other words, will I have to do this every time I import a UTF-8 data frame?

I'm not sure if there's an automatic solution. Some APIs for reading files (e.g. read.table()) have arguments for assuming an encoding (e.g. the fileEncoding argument); without more information it's hard to say. This is somewhat outside the bounds of IDE-specific questions, so you might want to follow up in a separate category.
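As a rough sketch of assuming an encoding at import time, the example below writes a tiny UTF-8 file to a temporary path and reads it back with read.table(). It uses the encoding argument, which marks the strings as UTF-8 without converting them. The fileEncoding argument is the alternative when the file is in some other encoding and has to be converted on input. The file contents are invented for the example.

```r
# Write a two-line UTF-8 file: a header "name" and the value "João"
# (4a 6f c3 a3 6f), using raw bytes so the script's own encoding is irrelevant.
path <- tempfile(fileext = ".txt")
bytes <- c(charToRaw("name\n"),
           as.raw(c(0x4a, 0x6f, 0xc3, 0xa3, 0x6f)),
           charToRaw("\n"))
writeBin(bytes, path)

# encoding = "UTF-8" declares the strings read from the file to be UTF-8.
dat <- read.table(path, header = TRUE, encoding = "UTF-8",
                  stringsAsFactors = FALSE)

Encoding(dat$name)    # "UTF-8"
charToRaw(dat$name)   # 4a 6f c3 a3 6f

unlink(path)          # remove the temporary file
```

Whether something similar is possible for data fetched through a package depends on what that package exposes, so this only covers files read directly.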

Thanks, @kevinushey. The data comes from a query hosted in LimeSurvey, and I import it through the limer package. I will ask the creator of the package for a solution.