UTF-8 as system.default encoding in R - windows x64

Hi all,

I am trying to change the system default encoding in R to UTF-8 so that I can read a .txt file containing a gamma symbol. I have searched online but did not find a solution.

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_Anguilla.1252  LC_CTYPE=English_Anguilla.1252    LC_MONETARY=English_Anguilla.1252 LC_NUMERIC=C                     
[5] LC_TIME=English_Anguilla.1252    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base   

I already changed the encoding under Tools > Global Options > Code > Saving, but this does not change the system default, which is what I need. Is there a solution?

thanks,
Laura

It is not clear to me what your actual problem is (but I hope you have solved it anyway).

Reading a UTF-8 encoded text file with a gamma in it can be done (I think) simply with readLines.
I copied the first few lines of https://en.wikipedia.org/wiki/Greek_alphabet into a new text file, greek_letters.txt, and saved it with (my standard) encoding UTF-8.
Then I could display it with

gl <- readLines("greek_letters.txt",encoding = "UTF-8")
cat(gl)

On my system, which is the same as yours but with a different locale (see my session info below), I do not have to specify the encoding.
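For anyone who wants to reproduce this without the Wikipedia text, here is a minimal sketch. The temporary file and its three letter pairs are my own stand-ins for greek_letters.txt, not the file from the question.

```r
# Hypothetical stand-in for greek_letters.txt: a temp file with a few Greek letters.
path <- tempfile(fileext = ".txt")
greek <- "\u0391 \u03b1, \u0392 \u03b2, \u0393 \u03b3"  # Alpha, Beta, Gamma pairs

# Write the raw UTF-8 bytes, so the file is UTF-8 whatever the session locale is.
con <- file(path, open = "wb")
writeLines(enc2utf8(greek), con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" marks the strings as UTF-8; it does not re-encode them.
gl <- readLines(path, encoding = "UTF-8")
cat(gl, "\n")
```

In a session that already runs in a UTF-8 locale the `encoding` argument should make no difference, which matches what I see on my system.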

In my RStudio version (RStudio 2022.11.0-daily+178), under Tools > Global Options > Code > Saving,
I see 'Default text encoding [Ask]'. Clicking on 'Change', I also see UTF-8 with the indication that this is the system default.

However, I found a Stack Overflow item suggesting that R 4.2.1 would solve some encoding issues.
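As far as I know, R 4.2.0 and later can use UTF-8 as the native encoding on recent Windows 10 builds, which may be what that item refers to. A quick way to check what a given session is actually using is sketched below; the messages are my own wording, not R output.

```r
# l10n_info() reports properties of the current locale, including whether it is UTF-8.
info <- l10n_info()
ctype <- Sys.getlocale("LC_CTYPE")

if (isTRUE(info[["UTF-8"]])) {
  message("This session runs in a UTF-8 locale: ", ctype)
} else {
  message("This session does NOT run in a UTF-8 locale: ", ctype)
}
```

If the second message appears, non-ASCII text in UTF-8 files will probably need explicit encoding arguments when it is read.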

My session info:

sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> attached base packages:
#>   other output here deleted 

Created on 2022-10-10 with reprex v2.0.2

Hello,
Thank you for your answer. Maybe this is more specific:

When I do the same as you using readLines, I get a correct result:

> gl <- readLines("C:/Vuilbak/greek_letters.txt",encoding = "UTF-8")
> cat(gl)
Α α, Β β, Γ γ, Δ δ, Ε ε, Ζ ζ, Η η, Θ θ, Ι ι, Κ κ, Λ λ, Μ μ, Ν ν, Ξ ξ

But when I use read.table without specifying an encoding, I get “wrong” characters because the file is not read as UTF-8.
When I use fileEncoding, I get an error in R (see below).

> read.table(file="C:/Vuilbak/greek_letters.txt",sep=",",header=FALSE)
     V1     V2     V3     V4     V5     V6     V7     V8     V9    V10    V11    V12    V13    V14
1 Α α  Î’ β  Γ γ  Δ δ  Ε ε  Ζ ζ  Η η  Θ θ  Ι ι  Κ κ  Λ λ  Îœ μ  Î\u009d ν  Ξ ξ
> read.table(file="C:/Vuilbak/greek_letters.txt",sep=",",header=FALSE,fileEncoding="UTF-8")
Error in read.table(file = "C:/Vuilbak/greek_letters.txt", sep = ",",  : 
  no lines available in input
In addition: Warning message:
In read.table(file = "C:/Vuilbak/greek_letters.txt", sep = ",",  :
  invalid input found on input connection 'C:/Vuilbak/greek_letters.txt'

To find a solution, I reinstalled R at the newest version, to rule out the encoding issues from the Stack Overflow item you mentioned. This did not solve the problem. I looked further on the internet and finally found the line that solved it:

Sys.setlocale(locale = 'en_BE.UTF-8')

> Sys.setlocale(locale = 'en_BE.UTF-8')
[1] "LC_COLLATE=en_BE.UTF-8;LC_CTYPE=en_BE.UTF-8;LC_MONETARY=en_BE.UTF-8;LC_NUMERIC=C;LC_TIME=en_BE.UTF-8"
> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=en_BE.UTF-8  LC_CTYPE=en_BE.UTF-8    LC_MONETARY=en_BE.UTF-8 LC_NUMERIC=C            LC_TIME=en_BE.UTF-8    
system code page: 1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] …

loaded via a namespace (and not attached):
[1] …    
> gl <- readLines("C:/Vuilbak/greek_letters.txt",encoding = "UTF-8")
> cat(gl)
Α α, Β β, Γ γ, Δ δ, Ε ε, Ζ ζ, Η η, Θ θ, Ι ι, Κ κ, Λ λ, Μ μ, Ν ν, Ξ ξ
> read.table(file="C:/Vuilbak/greek_letters.txt",sep=",",header=FALSE,fileEncoding="UTF-8")
   V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11  V12  V13  V14
1 Α α  Β β  Γ γ  Δ δ  Ε ε  Ζ ζ  Η η  Θ θ  Ι ι  Κ κ  Λ λ  Μ μ  Ν ν  Ξ ξ

Thanks again for your reply!

Laura
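A note for later readers: if changing the locale is not an option, two routes that do not depend on the locale may also work. This is only a sketch on a small temporary file of my own (a stand-in for greek_letters.txt), and it is untested on the R 4.1.0 setup from the question. The `encoding` argument of read.table only marks the strings as UTF-8 rather than re-encoding them, and `text =` lets read.table parse lines that readLines has already read.

```r
# Hypothetical stand-in for greek_letters.txt, written as raw UTF-8 bytes.
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeLines("\u0391 \u03b1,\u0392 \u03b2,\u0393 \u03b3", con, useBytes = TRUE)
close(con)

# Route 1: mark the fields as UTF-8 (encoding =) instead of converting the
# file on input (fileEncoding =).
df1 <- read.table(path, sep = ",", header = FALSE, encoding = "UTF-8")

# Route 2: read the lines first, then let read.table parse the text.
gl <- readLines(path, encoding = "UTF-8")
df2 <- read.table(text = gl, sep = ",", header = FALSE)

print(df1)
print(df2)
```

Both data frames should hold the same three columns, with the gamma pair in V3.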
