Issues with French text showing up incorrectly in R script in RStudio

A colleague of mine sent me an R script that contains quite a bit of French text (i.e., with accented letters). She wrote the script in RStudio (version 1.2.5001). When I open the script on my end, a ? shows up everywhere an accented letter appeared on her end. I suspect this has to do with character encoding, but I don't know enough about that area, and I'm not sure whether the information below helps in diagnosing it. Any idea how to recover the proper character encoding on my end?

This is from her version of R...

> Sys.getlocale()
[1] "LC_COLLATE=English_Canada.1252;LC_CTYPE=English_Canada.1252;LC_MONETARY=English_Canada.1252;LC_NUMERIC=C;LC_TIME=English_Canada.1252"

I am using RStudio version 1.2.5033. This is from my version of R...

> Sys.getlocale()
[1] "en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8"

Any advice? This probably applies to many languages and not just French. So any help from non-English R/RStudio users would be great!

@andresrcs Do you have any insights? I figure you may know more about character encoding and accented letters in R scripts than most...

What is the encoding of the file? It may not be UTF-8, in which case you can reopen it with the correct encoding.
If she is using Windows on a French PC, the default is definitely not UTF-8, unless she configured RStudio to use UTF-8.

On my French Windows machine, the default encoding is ISO-8859-1 and sometimes Windows-1252.

Could that be the cause?

RStudio lets you open the file in a different encoding and convert it if necessary.
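
To illustrate why a mismatch produces unreadable characters, here is a minimal sketch in base R. The word and variable names are made up for the example; it only shows that the same accented letter is stored as different bytes in Latin-1 and in UTF-8.

```r
# The same French word, encoded two ways.
word_utf8 <- "pens\u00e9es"   # "pensées"; \u escapes are stored as UTF-8
word_latin1 <- iconv(word_utf8, from = "UTF-8", to = "latin1")

# The accented letter is a single byte (e9) in Latin-1
# but two bytes (c3 a9) in UTF-8.
bytes_latin1 <- charToRaw(word_latin1)
bytes_utf8 <- charToRaw(word_utf8)
print(bytes_latin1)
print(bytes_utf8)

# Treat the Latin-1 bytes as if they were UTF-8: a lone e9 byte is not a
# valid UTF-8 sequence, so an editor assuming UTF-8 cannot display it.
misread <- rawToChar(bytes_latin1)
print(validUTF8(misread))
print(validUTF8(word_utf8))
```

For the accented letters common in French, ISO-8859-1 and Windows-1252 use the same byte values, which would explain why either choice can display such a file correctly.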


You're right, a reprex isn't necessary to illuminate the problem.

This came up recently with Greek, which traced back to an RStudio 1.1.x installation. Since your colleague is on 1.2.5xxx, that's not likely to be the problem here.

Here's how we want it working, cut and pasted from RStudio

text <- "Mes pensées accompagnent les victimes de l'attaque de Romans-sur-Isère, les blessés, leurs familles. Toute la lumière sera faite sur cet acte odieux qui vient endeuiller notre pays déjà durement éprouvé ces dernières semaines."

text
#> [1] "Mes pensées accompagnent les victimes de l'attaque de Romans-sur-Isère, les blessés, leurs familles. Toute la lumière sera faite sur cet acte odieux qui vient endeuiller notre pays déjà durement éprouvé ces dernières semaines."

Created on 2020-04-04 by the reprex package (v0.3.0)

with

sessionInfo() 
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Catalina 10.15.3
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.3  magrittr_1.5    tools_3.6.3     htmltools_0.4.0
#>  [5] yaml_2.2.1      Rcpp_1.0.4      stringi_1.4.6   rmarkdown_2.1  
#>  [9] highr_0.8       knitr_1.28      stringr_1.4.0   xfun_0.12      
#> [13] digest_0.6.25   rlang_0.4.5     evaluate_0.14

Created on 2020-04-04 by the reprex package (v0.3.0)

Anything Windows can do macOS can do better, and anything macOS can do Linux can do better still. But if Windows itself were the problem, it would surface there more often, and I haven't seen much of that.

So, it's an encoding issue, then, as you thought. Looking at Sys.getlocale(), my sense is that the difference between en_CA.UTF-8 and en_US.UTF-8 shouldn't matter.

If the source document renders correctly in apps other than RStudio (source meaning the last file through which the text passed), then it's either a version-specific RStudio bug (again, unlikely) or a configuration issue within RStudio's project or global preferences.

In the global options, under Code | Saving, there is a default text encoding option. Mine is set to UTF-8. If your colleague's is too, and I'm right that 1.2.5x makes no never mind, and the source file renders correctly, then I'm plumb out of suggestions.

Let us know how it goes, and mark your result as the solution if it works.


Thanks both for your responses. Yes, I reopened with ISO-8859-1 encoding and then saved with UTF-8 encoding and it seems to be all proper now.

Just so I'm clear on what is happening here... It seems like she wrote the script using her system default encoding (ISO-8859-1, probably), while my default on macOS is UTF-8, so RStudio opens the file assuming UTF-8 and the accented text shows up incorrectly. Is this generally correct?

Yes, I think if your RStudio is set to UTF-8 by default, it will open every file with that encoding, and you need to either convert the file manually or reopen it with the correct encoding.

It would be good advice for your colleague to configure her RStudio IDE to use UTF-8 by default, so that you both work in the same encoding.
I have configured my RStudio to use UTF-8 only; I find it less error-prone.
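
The same idea applies outside the IDE: when reading a file from R code, you can state the encoding instead of relying on a default. The sketch below is only an illustration, assuming a Latin-1 file; the temporary file and its one-line contents are invented stand-ins for the real script.

```r
# Stand-in for the colleague's script: one line of French saved as Latin-1.
# (File and contents are made up for illustration.)
original <- "titre <- \"d\u00e9j\u00e0 \u00e9prouv\u00e9\"\n"
path <- tempfile(fileext = ".R")
writeBin(iconv(original, "UTF-8", "latin1", toRaw = TRUE)[[1]], path)

# Tell readLines() how the bytes are encoded, then convert to UTF-8.
# encoding = "latin1" only labels the strings; enc2utf8() does the conversion.
right <- enc2utf8(readLines(path, encoding = "latin1"))
print(Encoding(right))
print(right)
```

source() also accepts an encoding argument, which may be enough if the goal is just to run the script without converting it.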


Thanks for the clarification. I will pass this information on to her.


Yes, I've never set UTF8, and it always appears as the default choice in every new version of RStudio.


On Windows too?
A non-UTF-8 encoding is usually the default for text editors on Windows in France, so I have definitely run into this when scripts are created in a simple text editor.
A simple thing to check: look at the global options in RStudio to be sure. The setting could be the system default, and the system default is not always UTF-8 on a Windows machine.


Thanks for the response @cderv. Forgive my ignorance, but how does one check the encoding of a file? She is indeed using Windows 10. When I asked her to run Sys.getlocale() it showed:

> Sys.getlocale()
[1] "LC_COLLATE=English_Canada.1252;LC_CTYPE=English_Canada.1252;LC_MONETARY=English_Canada.1252;LC_NUMERIC=C;LC_TIME=English_Canada.1252"

Does this indicate Windows 1252?

Look at readr::guess_encoding(), which wraps stringi::stri_enc_detect().

Also, in RStudio, "File > Reopen with Encoding" can give a hint: try other encodings and see which one displays correctly.

Other editors (like Notepad++) usually show the encoding when you open the file.
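
If readr is not installed, a rougher check is possible in base R. This is a sketch only: is_utf8_file() is a hypothetical helper, not a function from any package, and the two temporary files are made up for the demonstration. It can tell you that a file is not valid UTF-8, but not which other encoding it uses.

```r
# Hypothetical helper: TRUE if every line of the file is valid UTF-8.
# Note that a pure-ASCII file also counts as valid UTF-8.
is_utf8_file <- function(path) {
  lines <- readLines(path, warn = FALSE)
  all(validUTF8(lines))
}

# Two small demo files with the same text in different encodings.
text <- "# donn\u00e9es fran\u00e7aises\n"
utf8_file <- tempfile(fileext = ".R")
latin1_file <- tempfile(fileext = ".R")
writeBin(charToRaw(text), utf8_file)
writeBin(iconv(text, "UTF-8", "latin1", toRaw = TRUE)[[1]], latin1_file)

utf8_ok <- is_utf8_file(utf8_file)
latin1_ok <- is_utf8_file(latin1_file)
print(c(utf8 = utf8_ok, latin1 = latin1_ok))
```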


I ran file -I <filename> in the terminal and the charset was "unknown-8bit", meanwhile my own R scripts are "utf-8" as expected.

readr::guess_encoding('file.R')

# A tibble: 2 x 2
  encoding   confidence
  <chr>           <dbl>
1 ISO-8859-1       0.8 
2 ISO-8859-2       0.37

I used the RStudio >> File >> Reopen with encoding and when I selected WINDOWS-1252 or ISO-8859-1 the file looks proper. So how do I convert it to UTF-8 permanently?

There is "Save with Encoding" in the RStudio File menu. It should work well for a one-time use case.
Is it working well?
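
If there are many files to convert, the same conversion can be scripted. Below is a minimal sketch, not an official method: convert_to_utf8() is a hypothetical helper, and it assumes you already know the source encoding (Latin-1 in this example). It works on raw bytes so that the result does not depend on the session locale.

```r
# Hypothetical helper: rewrite a text file as UTF-8, given its current encoding.
# By default it overwrites the input, so keep a backup of the original.
convert_to_utf8 <- function(path, from = "latin1", out = path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  # iconv() accepts a list of raw vectors and returns NULL for an element
  # it could not convert.
  converted <- iconv(list(bytes), from = from, to = "UTF-8", toRaw = TRUE)[[1]]
  if (is.null(converted)) stop("conversion from ", from, " failed")
  writeBin(converted, out)
  invisible(out)
}

# Demo on a made-up Latin-1 file.
src <- tempfile(fileext = ".R")
dst <- tempfile(fileext = ".R")
writeBin(iconv("# d\u00e9j\u00e0 vu\n", "UTF-8", "latin1", toRaw = TRUE)[[1]], src)
convert_to_utf8(src, from = "latin1", out = dst)

result <- readLines(dst, encoding = "UTF-8")
print(result)
```

For a Windows-1252 file, the name to pass as from depends on the iconv build; iconvlist() shows the names available on your system.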
