Text encoding for Swedish characters goes wrong in YAML

ChristianL · May 15, 2018, 9:41am

(I have posted this question on Stack Overflow, but I didn't get any answer. Sorry for the cross-posting!)

I want to use parameterized reports in RStudio. But when I use params containing Swedish characters with diacritics (like å, ä, ö), something goes wrong with the encoding. I'm running Windows 10 on my computer.

Example:

---
title: "test_yaml_encoding"
output: html_document
params:
  swe_chars_param: "åäöÅÄÖ"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
print(params$swe_chars_param)
```

[1] "Ã¥Ã¤Ã¶Ã…Ã„Ã–"
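The garbled output is consistent with the UTF-8 bytes of the parameter being read as Windows-1252 and then converted to UTF-8 a second time. The sketch below reproduces that effect in base R under that assumption. It is not taken from the rmarkdown source, and the variable names are only illustrative.

```r
# "åäöÅÄÖ" written with \u escapes so the script does not depend on
# the encoding of the source file
original <- "\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"

# Assumption: the UTF-8 bytes are misread as Windows-1252 (CP1252)
# and then converted to UTF-8 once more
misread <- iconv(original, from = "CP1252", to = "UTF-8")

# Compare the byte sequences before and after the second conversion
print(charToRaw(original))
print(charToRaw(misread))
```

Each two-byte character in `original` becomes four or five bytes in `misread`, which is why a single "å" is displayed as "Ã¥".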

It seems to be a known issue:

github.com/rstudio/rmarkdown: "Encoding of special characters in the YAML header on Windows" (crsh · bug · opened 08:18AM - 28 Apr 15 UTC · closed 04:29PM - 29 Apr 15 UTC)

I'm experiencing issues with the encoding of information in the YAML header on Windows. This is the MWE of the .Rmd-file:

``````
---
title: "ÄÜÖäüöß€"
output: html_document
---

```{r}
rmarkdown::metadata$title
sessionInfo()
```
``````

The resulting document contains the following text.

```
ÖÄÜöäüß€

rmarkdown::metadata$title
## [1] "Ã–Ã„ÃœÃ¶Ã¤Ã¼ÃŸâ‚¬"

sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] tools_3.2.0 htmltools_0.2.6 yaml_2.1.13 rmarkdown_0.5.1
## [5] knitr_1.10 stringr_0.6.2 digest_0.6.8 evaluate_0.7
```

The title is printed correctly but when I try to access this information in the metadata-list the text is scrambled. I've tried the same thing with PDF and Word output and experience the same issue. It works like a charm on my Linux machine with the following setup:

```
sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
##
## locale:
## [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] tools_3.2.0 htmltools_0.2.6 yaml_2.1.13 rmarkdown_0.5.1
## [5] knitr_1.10 stringr_0.6.2 digest_0.6.8 evaluate_0.7
```

Is this a bug in rmarkdown?

but I have not managed to find a solution.

One workaround is to fix the encoding with a function. I have tried two different approaches.

First try:

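ffix_swedish_chars <- function(txt) {
  # Replace each garbled two-character sequence with the intended letter
  txt <- gsub("Ã¥", "å", txt)
  txt <- gsub("Ã¤", "ä", txt)
  txt <- gsub("Ã¶", "ö", txt)
  txt <- gsub("Ã…", "Å", txt)
  txt <- gsub("Ã„", "Ä", txt)
  txt <- gsub("Ã–", "Ö", txt)
  # Return the result visibly instead of relying on the last assignment
  txt
}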

print(ffix_swedish_chars(params$swe_chars_param))

Result:

[1] "åäöÃ…Ã„Ã–"

It worked, but only for lower case letters.

Then I tried brute force: looping through every encoding that iconvlist() reports and converting the string from that encoding to UTF-8.

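library(purrr)  # provides possibly()

# All encoding names known to iconv on this system
koder <- iconvlist()

# Try to convert `str` from the encoding `kod` to UTF-8
ftest_kodning <- function(str, kod) {
  iconv(str, from = kod, to = "UTF-8")
}

# Return NA instead of stopping when iconv rejects an encoding name
ftest_kodning_safe <- possibly(ftest_kodning, otherwise = NA_character_)

for (i in seq_along(koder)) {
  print(paste(koder[i], ftest_kodning_safe(params$swe_chars_param, koder[i])))
}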

I couldn't find any encoding that worked.

Now I'm stuck. Does anyone have a solution?

Edit:

I don't know if this is of any help, but these are the raw bytes of the string:

print(charToRaw(params$swe_chars_param))
[1] c3 83 c2 a5 c3 83 c2 a4 c3 83 c2 b6 c3 83 e2 80 a6 c3 83 e2 80 9e c3 83 e2 80 93
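These bytes look like text that was encoded to UTF-8 twice. For example, c3 83 c2 a5 is the UTF-8 encoding of "Ã¥", and "Ã¥" is what the UTF-8 bytes of "å" (c3 a5) look like when they are read as Windows-1252. Under that assumption, one possible repair is to convert in the opposite direction from the brute-force loop, that is from UTF-8 to CP1252, and then mark the resulting bytes as UTF-8. A minimal base R sketch follows; `undo_double_encoding` is a hypothetical helper name, not an rmarkdown or knitr function.

```r
# The bytes reported by charToRaw(params$swe_chars_param) above
garbled_bytes <- as.raw(c(
  0xc3, 0x83, 0xc2, 0xa5, 0xc3, 0x83, 0xc2, 0xa4, 0xc3, 0x83, 0xc2, 0xb6,
  0xc3, 0x83, 0xe2, 0x80, 0xa6, 0xc3, 0x83, 0xe2, 0x80, 0x9e,
  0xc3, 0x83, 0xe2, 0x80, 0x93
))
garbled <- rawToChar(garbled_bytes)
Encoding(garbled) <- "UTF-8"

# Hypothetical helper: assumes `txt` is a single string whose UTF-8 bytes
# were misread as Windows-1252 and then encoded to UTF-8 again
undo_double_encoding <- function(txt) {
  # Step 1: undo the second encoding; the result is the original UTF-8 bytes
  bytes <- iconv(txt, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]
  if (is.null(bytes)) {
    return(NA_character_)  # conversion failed, e.g. txt was not double-encoded
  }
  # Step 2: reinterpret those bytes as UTF-8 text
  fixed <- rawToChar(bytes)
  Encoding(fixed) <- "UTF-8"
  fixed
}

fixed <- undo_double_encoding(garbled)
print(charToRaw(fixed))
# [1] c3 a5 c3 a4 c3 b6 c3 85 c3 84 c3 96
```

In the report this would be called as `undo_double_encoding(params$swe_chars_param)`. It only helps if the string really is double-encoded; applying it to text that is already correct would corrupt it or return NA.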

ChristianL · May 16, 2018, 9:57am

Edit: This code works when I knit the document, but not when I run the chunk directly.

---
title: "test_yaml_encoding"
output: html_document
params:
  swe_chars_param: "Swedish municipalities: Åre, Östersund, Älmhult, Mölndal"
  sv_umlauts: "åäöÅÄÖ"
---

{r}

ffix_swedish_chars <- function(txt) {

  # convert the string to a character string of hexvalues that can be
  # used as input to gsub
  hex_str <-
    paste(as.character(charToRaw(txt)), collapse = "")
  
  # Replace the hex codes of the wrongly encoded characters with the right hex codes
  hex_str <- gsub("c383c2a5", "e5", hex_str) # å
  hex_str <- gsub("c383c2a4", "e4", hex_str) # ä
  hex_str <- gsub("c383c2b6", "f6", hex_str) # ö
  hex_str <- gsub("c383e280a6", "c5", hex_str) # Å
  hex_str <- gsub("c383e2809e", "c4", hex_str) # Ä
  hex_str <- gsub("c383e28093", "d6", hex_str) # Ö
  
  # Split the hexcode string to a vector of chars where every element is 2 chars long
  hex_vec <-
    substring(hex_str, seq(1, nchar(hex_str) - 1, 2), seq(2, nchar(hex_str), 2))

  # Transform the character vector of hexcodes to integers and then to a
  # vector of type raw. Then transfer the raw vector to a character vector.
  txt_correct <-
    rawToChar(as.raw(strtoi(hex_vec, base = 16L)))
  Encoding(txt_correct) <- "UTF-8"
  return(txt_correct)
}

swe_chars_param <- unname(ffix_swedish_chars(txt = params$swe_chars_param))
sv_umlauts <- unname(ffix_swedish_chars(txt = params$sv_umlauts))

# Check whether the result really is identical to a string defined in the chunk

identical(swe_chars_param, "Swedish municipalities: Åre, Östersund, Älmhult, Mölndal")

# [1] TRUE