how to handle sepcial characters in R?

I am trying to handle Latin Small Letter O with Circumflex in R?

library(tibble)
library(dplyr)
sessionInfo()

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] dplyr_1.0.8 tibble_3.1.6 stringr_1.4.0 rstudioapi_0.13
[5] startup_0.17.0

loaded via a namespace (and not attached):
[1] cellranger_1.1.0 pillar_1.7.0 compiler_4.2.0
[4] sysfonts_0.8.8 base64enc_0.1-3 tools_4.2.0
[7] extrafont_0.18 digest_0.6.29 googledrive_2.0.0
[10] evaluate_0.15 lifecycle_1.0.1 gargle_1.2.0
[13] pkgconfig_2.0.3 rlang_1.0.2 cli_3.3.0
[16] DBI_1.1.2 yaml_2.3.5 xfun_0.30
[19] fastmap_1.1.0 Rttf2pt1_1.3.10 comtradr_0.3.0
[22] gmailr_1.0.1 httr_1.4.2 knitr_1.38
[25] generics_0.1.2 vctrs_0.4.1 fs_1.5.2
[28] tidyselect_1.1.2 glue_1.6.2 R6_2.5.1
[31] fansi_1.0.3 readxl_1.4.0 rmarkdown_2.14
[34] extrafontdb_1.0 rdrop2_0.8.2.1 purrr_0.3.4
[37] magrittr_2.0.3 ellipsis_0.3.2 htmltools_0.5.2
[40] assertthat_0.2.1 utf8_1.2.2 stringi_1.7.6
[43] crayon_1.5.1

df <- tibble(economy = c("Afghanistan", "Albania", "C\xf4te d'Ivoire")) %>% 
  arrange(economy)
write.csv(df, "test1.csv", fileEncoding = "latin1")

image

df <- tibble(economy = c("Afghanistan", "Albania", "C\uf4te d'Ivoire")) %>% 
  arrange(economy)
write.csv(df, "test2.csv", fileEncoding = "UTF-8")

image

I think the problem here is not R, but Excel.

Running your code (on Windows 11), then using the WSL command line:

 hexdump -C test1.csv
00000000  22 22 2c 22 65 63 6f 6e  6f 6d 79 22 0d 0a 22 31  |"","economy".."1|
00000010  22 2c 22 41 66 67 68 61  6e 69 73 74 61 6e 22 0d  |","Afghanistan".|
00000020  0a 22 32 22 2c 22 41 6c  62 61 6e 69 61 22 0d 0a  |."2","Albania"..|
00000030  22 33 22 2c 22 43 f4 74  65 20 64 27 49 76 6f 69  |"3","C.te d'Ivoi|
00000040  72 65 22 0d 0a                                    |re"..|

hexdump -C test2.csv
00000000  22 22 2c 22 65 63 6f 6e  6f 6d 79 22 0d 0a 22 31  |"","economy".."1|
00000010  22 2c 22 41 66 67 68 61  6e 69 73 74 61 6e 22 0d  |","Afghanistan".|
00000020  0a 22 32 22 2c 22 41 6c  62 61 6e 69 61 22 0d 0a  |."2","Albania"..|
00000030  22 33 22 2c 22 43 c3 b4  74 65 20 64 27 49 76 6f  |"3","C..te d'Ivo|
00000040  69 72 65 22 0d 0a                                 |ire"..|

So in test1 (latin1), the ô is encoded as f4. In test2 (UTF8), it is encoded as c3 b4. You can check the corresponding latin1 and UTF8 to ensure that this is correct.

On Windows, things also work if I open the files with Notepad (not always, it has an autodetection of encoding which doesn't always work, but you can also set the encoding manually).

Now, if I open test1.csv in Excel, I directly get the correct display. But this may depend on the computer and operating system you're using. So to get something more reproducible, we have to tell Excel what encoding to use.

In Data, select "From Text/CSV", then select test2.csv. You will get a popup window that gives you a choice of encoding, a choice of separator, and a data type selection method. The default separator is comma, that's great, and for me the default encoding is 1252: Western European (Windows). This is why my version of Excel opens test1.csv correctly (and what you should select to open that file). For test2.csv, choose 65001: Unicode (UTF-8).

And what happens if we try to open these files with R? Again, I need to explicitly set the encoding, but then it works:

library(tidyverse)

df <- tibble(economy = c("Afghanistan", "Albania", "C\xf4te d'Ivoire")) %>% 
  arrange(economy)
write.csv(df, "test1.csv", fileEncoding = "latin1")

df <- tibble(economy = c("Afghanistan", "Albania", "C\uf4te d'Ivoire")) %>% 
  arrange(economy)
write.csv(df, "test2.csv", fileEncoding = "UTF-8")



read.csv("test1.csv", encoding = "latin1")
#>   X       economy
#> 1 1   Afghanistan
#> 2 2       Albania
#> 3 3 Côte d'Ivoire
read.csv("test2.csv", encoding = "UTF-8")
#>   X       economy
#> 1 1   Afghanistan
#> 2 2       Albania
#> 3 3 Côte d'Ivoire

Created on 2022-04-27 by the reprex package (v2.0.1)

Or using {readr}:

read_csv("test1.csv", locale = locale(encoding = "latin1"))
read_csv("test2.csv", locale = locale(encoding = "UTF-8"))
1 Like

In addition i would advise thay if your target for this is excel then rather than use write.csv use readr package's write_excel_csv() which can be used to write a csv file with a UTF-8 Byte order
mark included, which allows Excel to detect that it is UTF-8 encoded

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.