I have been trying for some time to understand this problem and make it reproducible. I am using various encodings on Windows 10, and I run into the following problem.
library(tidyverse)
foo = data.frame (
gender = c( "nő", "férfi"),
name = c("Ági", "Jenő")
)
I am creating variables here with Hungarian characters. If I set my Project Options encoding to UTF-8 (or any Central European encoding), I run into a problem.
which (foo$gender == "nő") should give 1, and correctly gives 1.
which (foo$gender == "no") should give integer(0), but gives 1.
which (foo$gender == "mo") should give integer(0), and correctly gives integer(0).
This is a problem, because I typed ő, and it is displayed correctly in the RStudio IDE. However, when I review the values with View(foo), they are displayed differently in the tabular view: nő is changed to no.
When I test for the value of foo$gender, I get an ambiguous result.
which (foo$gender == "nő") should yield 1 and correctly returns 1.
which (foo$gender == "no") should return integer(0) and incorrectly returns 1.
which (foo$gender == "mo") should return integer(0) and correctly returns integer(0).
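To separate what R actually stores from what the editor or View() displays, the strings can be built from Unicode escapes and inspected by code point and by byte. This is a minimal sketch in base R, not a confirmed fix for the RStudio behaviour; the variable name gender is only illustrative, and it assumes the problem arises when the typed ő is converted on its way into the session.

```r
# Build the values from Unicode escapes, so the result does not depend on
# how the script file or the console input is encoded.
gender <- c("n\u0151", "f\u00e9rfi")   # "nő", "férfi"

Encoding(gender)                        # declared encoding of each element
utf8ToInt(gender[1])                    # code points: 110 337 (337 is U+0151, ő)
charToRaw(enc2utf8(gender[1]))          # UTF-8 bytes: 6e c5 91

# If ő is really stored, a comparison with plain "no" must not match.
match_plain <- which(gender == "no")    # integer(0)
match_real  <- which(gender == "n\u0151")  # 1
```

If the same checks on the typed version of foo$gender show the code point 111 (plain o) instead of 337, the character was already lost before the comparison, which would explain why "no" matches.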
I save it with two functions.
write.csv(foo, "foo.csv", row.names = F)
write_csv(foo, "foo2.csv")
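To check what actually reaches the disk, the encoding can be requested explicitly when writing and the file's raw bytes inspected afterwards. The following is a sketch in base R under stated assumptions: the temporary file path is hypothetical, the data frame is rebuilt from Unicode escapes, and on R for Windows before version 4.2 the text may still pass through the native code page, so the outcome there can differ.

```r
# Rebuild the example data from Unicode escapes (illustrative copy of foo).
foo <- data.frame(
  gender = c("n\u0151", "f\u00e9rfi"),
  name   = c("\u00c1gi", "Jen\u0151"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")      # hypothetical output location
write.csv(foo, path, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back as raw bytes, bypassing any decoding.
bytes <- readBin(path, what = "raw", n = file.size(path))

# ő is c5 91 in UTF-8. Look for that byte pair anywhere in the file.
has_o_dacute <- any(bytes[-length(bytes)] == as.raw(0xc5) &
                    bytes[-1] == as.raw(0x91))
has_o_dacute   # TRUE means the UTF-8 bytes for ő were written
```

A result of FALSE would suggest that ő was replaced before or during writing, rather than when the file was read back.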
Outside RStudio, I manually create a corrected version of foo.csv, called foo_man.csv (saved with my Windows locale), because nő was changed to no and Jenő was changed to Jeno. I also created foo_man_utf8.csv, saved in Windows with UTF-8 encoding.
I re-read the files.
foo_r <- read.csv("foo.csv", stringsAsFactors = F)
which (foo_r$gender == "nő")
which (foo_r$gender == "no")
which (foo_r$gender == "mo")
Unchanged: nő == no, although ő should be a separate character.
foo_r2 <- read_csv("foo2.csv", col_names = T)
which (foo_r2$gender == "nő")
which (foo_r2$gender == "no")
which (foo_r2$gender == "mo")
Unchanged: nő == no, although ő should be a separate character.
foo_r3 <- read_csv("foo_man.csv", col_names = T)
which (foo_r3$gender == "nő")
which (foo_r3$gender == "no")
which (foo_r3$gender == "mo")
Same problem, regardless of whether I use read.csv or read_csv.
foo_r4 <- read_csv("foo_man_utf8.csv")
which (foo_r4$gender == "nő") returns integer(0)
which (foo_r4$gender == "no") returns integer(0)
which (foo_r4$gender == "mo") returns integer(0)
Even though the encoding is set to UTF-8 in Project Options, I cannot read the UTF-8 file properly.
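One thing worth ruling out when every comparison returns integer(0) is a byte order mark (BOM): some Windows editors prepend the bytes ef bb bf when saving as UTF-8, and in base R that can end up in the first column name, so that the gender column is no longer found under that name. The sketch below is an assumption about a possible cause, not a diagnosis of the file above; it writes its own hypothetical test file with a BOM and reads it back with base R's fileEncoding = "UTF-8-BOM".

```r
# Create a small UTF-8 file that starts with a BOM (hypothetical stand-in
# for a file saved as UTF-8 by a Windows editor).
path <- tempfile(fileext = ".csv")
csv_text <- enc2utf8("gender,name\nn\u0151,\u00c1gi\nf\u00e9rfi,Jen\u0151\n")
con <- file(path, open = "wb")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw(csv_text)), con)
close(con)

# Check the first three bytes of the file for the BOM.
has_bom <- identical(readBin(path, what = "raw", n = 3),
                     as.raw(c(0xef, 0xbb, 0xbf)))

# Read with an encoding that declares UTF-8 and strips the BOM.
foo_bom <- read.csv(path, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
names(foo_bom)                        # column names after BOM removal
which(foo_bom$gender == "n\u0151")    # rows where gender is "nő"
```

For the readr version, the encoding can be stated explicitly with read_csv("foo_man_utf8.csv", locale = locale(encoding = "UTF-8")). Checking names(foo_r4) first shows whether the gender column exists under the expected name at all.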
What I find very odd is that if I open "foo_man_utf8.csv" in the RStudio IDE as a text file, it is displayed correctly in the code editor. However, View(foo_r4) displays the values incorrectly, changing nő to no and Jenő to Jeno.
Generally, I believe that there are at least two problems here. First, if you use a non-English locale, the code editor reads input from the keyboard but probably records it incorrectly.
Second, if the input comes from a file, View(foo_r4) and the editor view of "foo_man_utf8.csv" display the same data differently. This makes working with real-life data very difficult.
I know that there are many other character encoding issues that may confound this problem, but I am quite certain that the RStudio IDE does not handle character encodings consistently, and displays the same variables differently. This problem persists regardless of whether I use UTF-8 or a Central European encoding in the RStudio Project Options.
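Because the Project Options encoding only governs how source files are read and saved, the session's native encoding is a separate thing to check. The following base R sketch reports it and tests whether ő survives a conversion to the native encoding. The variable names are illustrative, and the interpretation is an assumption: if the native code page cannot represent ő, a substitution such as ő to o could happen outside the editor.

```r
# Report the session's locale and whether it is a UTF-8 session.
info <- l10n_info()
cat("R version:    ", R.version.string, "\n")
cat("LC_CTYPE:     ", Sys.getlocale("LC_CTYPE"), "\n")
cat("UTF-8 session:", info[["UTF-8"]], "\n")

# Convert a test string to the native encoding and back, then compare
# code points to see whether ő (U+0151) survived the trip.
x <- "n\u0151"
native <- enc2native(x)
survives <- identical(utf8ToInt(enc2utf8(native)), utf8ToInt(x))
survives   # FALSE suggests the native encoding cannot represent ő
```

Including this output together with the RStudio version in a bug report would help others reproduce the behaviour on the same setup.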