RStudio IDE inconsistently displaying some characters

encoding

#1

I have been trying for some time to find out about this problem and make it reproducible. I am using various encodings on Windows 10, and I have the following problem.

library(tidyverse)
foo = data.frame (
  gender = c( "nő", "férfi"),
  name = c("Ági", "Jenő")
)

Here I create variables containing Hungarian characters. If I set the encoding in my Project Options to UTF-8 (or any Central European encoding), I run into a problem.

which (foo$gender == "nő") should give 1, and correctly gives 1.
which (foo$gender == "no") should give integer(0), but gives 1.
which (foo$gender == "mo") should give integer(0), and gives integer(0).

This is a problem because, even though I typed ő (and it is correctly displayed in the RStudio IDE), the values look different when I review them with View(foo): in the tabular display, nő is changed to no.

When I test for the value of foo$gender, I get an ambiguous result.

which (foo$gender == "nő") should yield 1 and correctly returns 1
which (foo$gender  == "no")  should return integer(0) and incorrectly returns 1
which (foo$gender == "mo") should return integer(0) and correctly returns integer(0)
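
A comparison like the ones above depends on what is actually stored in the string, not on what the editor or View() displays. As a minimal sketch (the variable names `typed` and `plain` are illustrative and not from the original post), the stored bytes and code points can be inspected directly. The strings are built from Unicode escapes here so the check does not depend on how the typed characters were encoded:

```r
# Build the strings from Unicode escapes so the check does not depend on
# how the editor or console encoded what was typed.
typed <- "n\u0151"   # intended value: n followed by U+0151 (o with double acute)
plain <- "no"

Encoding(typed)      # encoding mark R attached to the string
charToRaw(typed)     # bytes actually stored
utf8ToInt(typed)     # Unicode code points of the intended value
utf8ToInt(plain)     # Unicode code points of the plain ASCII value

# TRUE here would mean both strings hold the same stored value.
same_value <- identical(utf8ToInt(typed), utf8ToInt(plain))
same_value
```

Running `utf8ToInt(enc2utf8(as.character(foo$gender[1])))` on the real data would show whether the ő typed in the editor survived as U+0151 or was already stored as a plain o.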

I save it with two functions.

write.csv(foo, "foo.csv", row.names = FALSE)
write_csv(foo, "foo2.csv")

Outside RStudio, I manually create a corrected version of foo.csv, foo_man.csv (saved with my Windows locale), because nő is changed to no and Jenő is changed to Jeno. I also create foo_man_utf8.csv, saved in Windows with UTF-8 encoding.

I re-read the files.

foo_r <- read.csv("foo.csv", stringsAsFactors = FALSE)
which (foo_r$gender == "nő")
which (foo_r$gender  == "no")
which (foo_r$gender == "mo")

Unchanged, nő == no, although it should be a separate character

foo_r2 <- read_csv("foo2.csv", col_names = TRUE)
which (foo_r2$gender == "nő")
which (foo_r2$gender  == "no")
which (foo_r2$gender == "mo")

Unchanged, nő == no, although it should be a separate character

foo_r3 <- read_csv("foo_man.csv", col_names = TRUE)
which (foo_r3$gender == "nő")
which (foo_r3$gender  == "no")
which (foo_r3$gender == "mo")

Same problem, regardless of whether I use read.csv or read_csv.

foo_r4 <- read_csv("foo_man_utf8.csv")
which (foo_r4$gender == "nő") # integer(0)
which (foo_r4$gender == "no") # integer(0)
which (foo_r4$gender == "mo") # integer(0)

Even though the encoding is set to UTF-8 in Project Options, I cannot read the UTF-8 file properly.
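
As far as I understand, the Project Options encoding governs how source files are read and saved, not how data files are imported, so the encoding of a data file may need to be declared at read time. A minimal base R sketch, assuming a hypothetical temporary file in place of foo_man_utf8.csv: the file is written with known UTF-8 bytes, then read with `encoding = "UTF-8"`, which marks the imported strings as UTF-8 instead of letting R assume the native encoding.

```r
# Write a small UTF-8 file with known bytes (hypothetical temp file that
# stands in for foo_man_utf8.csv).
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeLines(
  enc2utf8(c("gender,name", "n\u0151,\u00c1gi", "f\u00e9rfi,Jen\u0151")),
  con,
  useBytes = TRUE   # write the UTF-8 bytes unchanged, no re-encoding
)
close(con)

# encoding = "UTF-8" declares the strings in the file to be UTF-8.
df <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)

hits_accented <- which(df$gender == "n\u0151")  # compare against U+0151
hits_plain    <- which(df$gender == "no")       # compare against plain o
hits_accented
hits_plain
```

Whether this resolves the behaviour described above on a particular Windows setup is not confirmed in this thread; it only shows where the encoding can be stated explicitly.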

What I find very odd is that if I open “foo_man_utf8.csv” in the RStudio IDE as a text file, it is displayed correctly in the code editor. However, View(foo_r4) displays the values incorrectly, changing nő to no and Jenő to Jeno.

Generally, I believe that there are at least two problems here. First, if you use a non-English locale, the code editor reads in information from the keyboard but probably records it in a wrong way.

Second, if the input comes from a file, View(foo_r4) and “foo_man_utf8.csv” are displayed differently. This makes working with real-life data very difficult.

I know that there are many other character encoding issues that may confuse this problem, but I am quite certain that the RStudio IDE does not handle character encodings consistently, and displays the same variables differently. This problem persists regardless of whether I use UTF-8 or a Central European encoding in the RStudio Project Options.


#2

Can you also provide us with the output of utils::sessionInfo()?

It seems like part of what you’re seeing could be a bug in how R prints multibyte characters within data.frames. Compare:

> foo = data.frame(
+   gender = c("nő", "férfi"),
+   name = c("Ági", "Jenő")
+ )
> foo
  gender name
1     no  Ági
2  férfi Jeno
> foo$gender
[1] nő    férfi
Levels: férfi nő

However, I cannot replicate the incorrect comparisons you’re seeing:

> which (foo$gender == "nő")
[1] 1
> which (foo$gender  == "no")
integer(0)
> which (foo$gender == "mo")
integer(0)

For what it’s worth, View() does also drop the accents when using an English locale, but if I switch to a Hungarian locale the accents are preserved, e.g.

Sys.setlocale(locale = "Hungarian")
View(foo)
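
If the locale is switched only for display purposes, it may be worth saving and restoring the previous setting. This is a rough sketch: the name "Hungarian" is the Windows-style locale name used above, and on other systems a different name (for example "hu_HU.UTF-8") would probably be needed, so the code checks whether the request succeeded.

```r
# Remember the current character-type locale so it can be restored.
old_ctype <- Sys.getlocale("LC_CTYPE")

# Sys.setlocale() returns "" (with a warning) when the locale is unavailable.
res <- suppressWarnings(Sys.setlocale("LC_CTYPE", "Hungarian"))
if (identical(res, "")) {
  message("Requested locale is not available on this system")
} else {
  message("LC_CTYPE is now: ", res)
}

# ... View(foo) or print(foo) would go here ...

# Restore the original setting.
restored <- Sys.setlocale("LC_CTYPE", old_ctype)
```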

#3

Sorry for the late reply. I am a fairly advanced R user, but not an IT specialist. I was trying to figure out how to present the problem. It has several layers, so this is the first part of my answer.

Here I create foo in RStudio under two different locale settings. In both cases, ő and Á are shown correctly in the code editor.

## Case 1 --------

Sys.setenv(LANG = "hu")
Sys.setlocale("LC_ALL","English")
foo = data.frame(
  gender = c("nő", "férfi"),
  name = c("Ági", "Jenő"), 
  values = c(100,200))
View (foo) #incorrect
print(paste(foo$name, collapse = ",")) #incorrect
ggplot(foo, aes (x = name, y = values)) +
  geom_bar (stat="identity") #incorrect
grepl("ő", foo$gender) #correct
write.csv ( foo, "foo.csv")  #incorrectly exported

In this case, the ő is correctly entered from the keyboard into memory, but it is not printed correctly.

## Case 2 --------

Sys.setlocale("LC_ALL","Hungarian")
foo2 = data.frame(
  gender = c("nő", "férfi"),
  name = c("Ági", "Jenő"), 
  values = c(100,200)) #correctly displayed
View ( foo2) #correct
ggplot(foo2, aes (x = name, y = values)) + geom_bar (stat="identity") #correct
grepl("ő", foo$gender) #correct
write.csv( foo2, "foo2.csv") #correctly exported

In this case, everything is correct. However, this is not my preferred scenario, because I use various data sources in several languages, and RStudio / Windows 10 does not allow setting a UTF-8 encoded locale. If I choose Hungarian, I cannot read Slovak.
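
To see which situation a given session is in, the active locale and its encoding can be queried, and a string can be round-tripped through the native encoding to check whether a character such as ő is representable there. This is only a sketch of one possible check; the variable names are illustrative.

```r
# Current character-type locale and what R knows about its encoding.
Sys.getlocale("LC_CTYPE")
info <- l10n_info()
is_utf8_locale <- info[["UTF-8"]]   # TRUE when the session locale is UTF-8
is_utf8_locale

# Round-trip a UTF-8 string through the native encoding.
s <- "n\u0151"
native <- enc2native(s)             # what R uses when it falls back to the locale
round_trip <- enc2utf8(native)

# FALSE suggests the character is not representable in the native encoding,
# so printing or exporting through that encoding may alter it.
survives <- identical(utf8ToInt(round_trip), utf8ToInt(s))
survives
```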

## Case 3 -------- (created HU, re-read EN)

Sys.setlocale("LC_ALL","English")
foo_from_2 <- read.csv( "foo2.csv")
View ( foo_from_2) #correctly displayed 
print(paste(foo_from_2$name, collapse = ",")) #correctly printed
ggplot(foo_from_2, aes (x = name, y = values)) + 
 geom_bar (stat="identity") #correctly printed
grepl("ő", foo$gender) #incorrect
grepl("\u0151", foo$gender)  #incorrect

Note the difference from Case 1.

## Case 4 -------- (created HU, re-read HU)

Sys.setlocale("LC_ALL","Hungarian")
foo_from_2 <- read.csv( "foo2.csv")
View ( foo_from_2) #correctly displayed 
print(paste(foo_from_2$name, collapse = ",")) #correctly printed
ggplot(foo_from_2, aes (x = name, y = values)) +
  geom_bar (stat="identity") #correctly printed
grepl("ő", foo$gender) #incorrectly recognized!
grepl("\u0151", foo$gender)  #incorrect

This is actually the worst case: everything is displayed correctly but, in reality, is not stored correctly. If gender prints as "nő" on a plot, in the console, or in the Viewer, but is not recognized by regex functions, both programming and finding bugs become almost impossible.
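
One possible explanation for text that looks right but does not match, not confirmed in this thread, is that the strings hold bytes in a single-byte code page (such as CP1250) with no encoding mark, while the pattern is UTF-8. The sketch below simulates that situation with hand-built bytes (0xF5 is ő in CP1250) and converts them explicitly with iconv() before matching; the variable names are illustrative.

```r
# Simulate "nő" as it would be stored in the CP1250 code page:
# 0x6E = n, 0xF5 = o with double acute in CP1250.
cp1250_bytes <- rawToChar(as.raw(c(0x6e, 0xf5)))
Encoding(cp1250_bytes)   # no encoding mark is attached to these bytes

# Convert explicitly from the assumed source encoding to UTF-8.
as_utf8 <- iconv(cp1250_bytes, from = "CP1250", to = "UTF-8")
utf8ToInt(as_utf8)       # code points after conversion

# Match against the Unicode escape, as a fixed (non-regex) pattern.
found <- grepl("\u0151", as_utf8, fixed = TRUE)
found
```

The source encoding passed to iconv() is an assumption here; for real data it has to be known or determined from the file.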

## Case 5 -------- (created EN, re-read HU)

Sys.setlocale("LC_ALL","Hungarian")
foo_from_1 <- read.csv( "foo.csv")
View ( foo_from_1) #incorrectly displayed 
print(paste(foo_from_1$name, collapse = ",")) #incorrectly printed
ggplot(foo_from_1, aes (x = name, y = values)) +
  geom_bar (stat="identity") #incorrectly printed
grepl("ő", foo$gender) #incorrect

  1. Case 1 - the program works, but the output is displayed incorrectly
  2. Case 2 - everything works
  3. Case 3 - everything works
  4. Case 4 - everything looks good but does not work
  5. Case 5 - everything looks good but does not work

Obviously, in real life the problem is more subtle. You read in a file in Hungarian, Slovak or German, and it is either displayed properly or not, and either prints properly or not.

Honestly, I would like to find a setup in which I can control exactly how the imported data is encoded (not an option, for example, with readr) and how it behaves. Am I correct that if I ditched Windows and used only UTF-8 on a different operating system, I would have a better chance of importing, viewing, handling, and printing such data correctly?


#4

This is just a general issue for R on Windows. In many cases, R will attempt to round-trip characters through the system encoding, so UTF-8 characters that aren't representable in the native locale will be mis-printed.

You will definitely have a better overall experience on Linux or macOS (where the locale is almost always by default a UTF-8 locale) compared to Windows. That said, it is possible to correctly manipulate and process UTF-8 text on Windows; however, there are a lot of places where silent conversion to the system encoding will bite you.
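
As a rough illustration of keeping text in UTF-8 and avoiding a silent conversion on export, the sketch below converts every character column with enc2utf8(), checks the result with validUTF8(), and writes the bytes unchanged through a binary connection. The helper `to_utf8_df` and the temporary file are hypothetical, and this is only one of several possible approaches.

```r
# Hypothetical helper: convert every character column to UTF-8.
to_utf8_df <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], enc2utf8)
  df
}

foo <- data.frame(
  gender = c("n\u0151", "f\u00e9rfi"),
  name = c("\u00c1gi", "Jen\u0151"),
  stringsAsFactors = FALSE
)
foo <- to_utf8_df(foo)

# Check that every character column holds valid UTF-8.
all_valid <- all(vapply(foo, function(col) all(validUTF8(col)), logical(1)))

# Write the rows as raw UTF-8 bytes, bypassing the native encoding.
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeLines(
  c(paste(names(foo), collapse = ","), paste(foo$gender, foo$name, sep = ",")),
  con,
  useBytes = TRUE
)
close(con)

# Confirm the file contains the UTF-8 byte pair for U+0151 (0xC5 0x91).
bytes <- readBin(path, what = "raw", n = file.size(path))
has_o_double_acute <- any(
  bytes[-length(bytes)] == as.raw(0xc5) & bytes[-1] == as.raw(0x91)
)
```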