Sorry for the late reply. I am a fairly advanced R user, but not an IT specialist. I was trying to figure out how to present the problem. It has several layers, so this is the first part of my answer.
Here I create foo in RStudio under two different locale settings. In both cases, the code editor shows ő and Á correctly.
##Case 1 --------
library(ggplot2)
Sys.setenv(LANG = "hu")
Sys.setlocale("LC_ALL", "English")
foo = data.frame(
  gender = c("nő", "férfi"),
  name = c("Ági", "Jenő"),
  values = c(100, 200))
View(foo) #incorrect
print(paste(foo$name, collapse = ",")) #incorrect
ggplot(foo, aes(x = name, y = values)) +
  geom_bar(stat = "identity") #incorrect
grepl("ő", foo$gender) #correct
write.csv(foo, "foo.csv") #incorrectly exported
In this case, the ő is correctly entered from the keyboard into memory, but it is not printed correctly.
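To check what is actually stored rather than what is displayed, the bytes and code points of a string can be inspected directly. This is a minimal sketch using base R only. The vector `x` is made up for illustration, and the `\u` escapes are used so that the example does not depend on the editor's or the session's encoding.

```r
# "\u0151" is the Unicode escape for ő, "\u00e9" for é.
x <- c("n\u0151", "f\u00e9rfi")

Sys.getlocale("LC_CTYPE")  # the locale that controls how text is interpreted
l10n_info()                # reports, among other things, whether the session is UTF-8

Encoding(x)                # declared encoding mark of each element
bytes  <- charToRaw(x[1])  # raw bytes actually held in memory
points <- utf8ToInt(x[1])  # Unicode code points of the UTF-8 string

print(bytes)
print(points)
```

If the bytes for ő are `c5 91`, the string is stored as UTF-8 and any problem is on the display side.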
##Case 2 -------
Sys.setlocale("LC_ALL", "Hungarian")
foo2 = data.frame(
  gender = c("nő", "férfi"),
  name = c("Ági", "Jenő"),
  values = c(100, 200)) #correctly displayed
View(foo2) #correct
ggplot(foo2, aes(x = name, y = values)) +
  geom_bar(stat = "identity") #correct
grepl("ő", foo2$gender) #correct
write.csv(foo2, "foo2.csv") #correctly exported
In this case, everything is correct. However, this is not a desirable setup, because I use data sources in several languages, and RStudio on Windows 10 does not allow me to set a UTF-8 locale. If I go Hungarian, I cannot read Slovak.
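One possible way to reduce the dependence on the session locale is to declare the encoding per file instead. The sketch below assumes the file is known to be UTF-8. It writes the bytes with `useBytes = TRUE` so that no locale conversion happens, and reads them back with `encoding = "UTF-8"`, which marks the strings as UTF-8 rather than re-encoding them. The file content and the name `foo_utf8` are made up for illustration, and behavior may differ between R versions on Windows.

```r
# Build a small UTF-8 CSV without relying on the current locale.
csv_lines <- c(
  "gender,name,values",
  "n\u0151,\u00c1gi,100",
  "f\u00e9rfi,Jen\u0151,200"
)
path <- tempfile(fileext = ".csv")
writeLines(enc2utf8(csv_lines), path, useBytes = TRUE)

# Declare the input strings to be UTF-8 when reading.
foo_utf8 <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)

Encoding(foo_utf8$gender)
# fixed = TRUE does a literal match instead of a regular expression match.
has_o <- grepl("\u0151", foo_utf8$gender, fixed = TRUE)
print(has_o)
```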
##Case 3 -------- (created HU, re-read EN)
Sys.setlocale("LC_ALL", "English")
foo_from_2 <- read.csv("foo2.csv")
View(foo_from_2) #correctly displayed
print(paste(foo_from_2$name, collapse = ",")) #correctly printed
ggplot(foo_from_2, aes(x = name, y = values)) +
  geom_bar(stat = "identity") #correctly printed
grepl("ő", foo_from_2$gender) #incorrect
grepl("\u0151", foo_from_2$gender) #incorrect
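Notice the difference from Case 1: there the text was displayed incorrectly but grepl matched, while here the text is displayed correctly but grepl does not match.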
##Case 4 -------- (created HU, re-read HU)
Sys.setlocale("LC_ALL", "Hungarian")
foo_from_2 <- read.csv("foo2.csv")
View(foo_from_2) #correctly displayed
print(paste(foo_from_2$name, collapse = ",")) #correctly printed
ggplot(foo_from_2, aes(x = name, y = values)) +
  geom_bar(stat = "identity") #correctly printed
grepl("ő", foo_from_2$gender) #incorrectly recognized!
grepl("\u0151", foo_from_2$gender) #incorrect
This is actually the worst case: everything is displayed correctly, but in reality it is not stored correctly. If gender shows as "nő" on the plot, in the console, and in the Viewer, but regex functions do not recognize it, then both programming and debugging become almost impossible.
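When a string looks right but does not match, comparing its raw bytes with those of the pattern can show whether the two are stored in different encodings. This is a minimal sketch, assuming for illustration that the data arrived as Windows-1250 bytes (where ő is the single byte `f5`). The actual encoding of a given file may be different, and the variable names are made up.

```r
# "nő" as it would be stored in Windows-1250 (assumed source encoding).
cp1250 <- rawToChar(as.raw(c(0x6e, 0xf5)))
# "nő" as UTF-8, written with a Unicode escape.
utf8 <- "n\u0151"

# The two strings can render identically in a matching locale,
# yet their bytes differ.
same_before <- identical(charToRaw(cp1250), charToRaw(utf8))

# Convert from the assumed source encoding to UTF-8, then compare again.
converted  <- iconv(cp1250, from = "CP1250", to = "UTF-8")
same_after <- identical(charToRaw(converted), charToRaw(utf8))
matched    <- grepl("\u0151", converted, fixed = TRUE)

print(c(same_before, same_after, matched))
```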
##Case 5 -------- (created EN, re-read HU)
Sys.setlocale("LC_ALL", "Hungarian")
foo_from_1 <- read.csv("foo.csv")
View(foo_from_1) #incorrectly displayed
print(paste(foo_from_1$name, collapse = ",")) #incorrectly printed
ggplot(foo_from_1, aes(x = name, y = values)) +
  geom_bar(stat = "identity") #incorrectly printed
grepl("ő", foo_from_1$gender) #incorrect
- Case 1 - the program works, but what you see is wrong
- Case 2 - everything works
- Case 3 - everything looks good, but matching does not work
- Case 4 - everything looks good but does not work
- Case 5 - nothing looks good and matching does not work
Obviously, in real life the problem is more subtle. You read in a file in Hungarian, Slovak or German, and it is either displayed properly or not, and either prints properly or not.
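For files from several sources, one option is to state each file's encoding explicitly and convert everything to UTF-8 on import. The helper `read_lines_utf8` below is hypothetical, not part of any package, and the two sample files are generated from raw bytes just to keep the sketch self-contained. It assumes the source encoding of each file is known in advance.

```r
# Hypothetical helper: read a text file in a known encoding, return UTF-8 strings.
read_lines_utf8 <- function(path, from) {
  lines <- readLines(path, warn = FALSE)   # bytes are read without conversion
  iconv(lines, from = from, to = "UTF-8")  # NA marks input that is invalid in `from`
}

hu_path <- tempfile()
de_path <- tempfile()
writeBin(as.raw(c(0x6e, 0xf5, 0x0a)), hu_path)        # "nő" in Windows-1250
writeBin(as.raw(c(0x66, 0xfc, 0x72, 0x0a)), de_path)  # "für" in Latin-1

hu <- read_lines_utf8(hu_path, from = "CP1250")
de <- read_lines_utf8(de_path, from = "latin1")

print(utf8ToInt(hu))
print(utf8ToInt(de))
```

After this step every string is UTF-8 regardless of which locale the session runs in, so matching with `\u` escapes compares like with like.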
Honestly, I would like to find a setup in which I can control exactly how the imported data is encoded (not an option, for example, with readr) and how it behaves. Am I correct that if I ditched Windows and used only UTF-8 on a different operating system, I would have a better chance of importing, seeing, handling, and printing