Text encoding for Swedish characters goes wrong in YAML

(I have posted this question on Stack Overflow, but I didn't get any answer. Sorry for the cross-posting!)

I want to use parameterized reports in RStudio. But when I use params containing Swedish characters (like å, ä, ö), something goes wrong with the encoding. I'm running Windows 10 on my computer.

Example:

---
title: "test_yaml_encoding"
output: html_document
params:
  swe_chars_param: "åäöÅÄÖ"
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
print(params$swe_chars_param)
```

[1] "Ã¥Ã¤Ã¶Ã…Ã„Ã–"

It seems to be a known issue:

but I have not managed to find a solution.

One workaround is to fix the encoding with a function. I have tried two different approaches.

First try:

ffix_swedish_chars <- function(txt) {
  # Replace each wrongly encoded two-character sequence with the intended letter
  txt <- gsub("Ã¥", "å", txt)
  txt <- gsub("Ã¤", "ä", txt)
  txt <- gsub("Ã¶", "ö", txt)
  txt <- gsub("Ã…", "Å", txt)
  txt <- gsub("Ã„", "Ä", txt)
  txt <- gsub("Ã–", "Ö", txt)
  txt
}

print(ffix_swedish_chars(params$swe_chars_param))

Result:

[1] "åäöÃ…Ã„Ã–"

It worked, but only for the lowercase letters.

Then I tried brute force: looping through every encoding in iconvlist() and converting from it to UTF-8.

library(utf8)
library(purrr)
library(readr)


koder <- iconvlist()


ftest_kodning <- function(str, kod) {
  iconv(str, from = kod, to = "UTF-8")
}

ftest_kodning_safe <- possibly(ftest_kodning, NA)

for (i in seq_along(koder)) {
  print(paste(koder[i], ftest_kodning_safe(params$swe_chars_param, koder[i])))
}

I couldn't find any encoding that worked.
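One thing that might be worth trying is running the loop in the opposite direction. If the text was read as some single-byte encoding and then re-encoded to UTF-8, the way back would be to convert from UTF-8 to that encoding and then treat the resulting bytes as UTF-8. Below is a minimal sketch of that idea. The helper `try_reverse()` and the short candidate list are made up for illustration, and the mangled string is written with `\u` escapes so the example does not depend on the encoding of the source file.

```r
# The mangled value, as it is displayed ("Ã¥Ã¤Ã¶Ã…Ã„Ã–"), written with escapes
bad <- "\u00c3\u00a5\u00c3\u00a4\u00c3\u00b6\u00c3\u2026\u00c3\u201e\u00c3\u2013"
# The value we hoped to get ("åäöÅÄÖ")
expected <- "\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"

# Hypothetical helper: convert UTF-8 -> candidate encoding, then
# reinterpret the resulting bytes as UTF-8.
try_reverse <- function(x, enc) {
  out <- tryCatch(
    iconv(x, from = "UTF-8", to = enc),
    error = function(e) NA_character_
  )
  if (is.na(out)) return(NA_character_)
  Encoding(out) <- "UTF-8"
  # Discard results whose bytes are not valid UTF-8
  if (!validUTF8(out)) return(NA_character_)
  out
}

# A few candidates only; iconvlist() could be used instead
candidates <- c("latin1", "CP1252", "CP850")
results <- vapply(candidates, function(enc) try_reverse(bad, enc), character(1))

# Names of the encodings that reproduce the expected string
hits <- names(results)[!is.na(results) & results == expected]
print(hits)
```

On the systems I would expect this to run on, "CP1252" (Windows-1252) should be among the hits, which would fit with the problem showing up on Windows.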

Now I'm stuck. Does anyone have a solution?

Edit:

I don't know if this is of any help, but these are the raw bytes of the characters:

print(charToRaw(params$swe_chars_param))
[1] c3 83 c2 a5 c3 83 c2 a4 c3 83 c2 b6 c3 83 e2 80 a6 c3 83 e2 80 9e c3 83 e2 80 93
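To make these bytes easier to read, they can be decoded into Unicode code points. The sketch below only uses the bytes printed above (the variable names are made up for illustration). Each intended letter turns out to be stored as two characters: U+00C3 (Ã) followed by a second character. That is the pattern you get when the two UTF-8 bytes of a letter such as å (c3 a5) are each read as a separate Windows-1252 character and then encoded to UTF-8 again.

```r
# The bytes from the charToRaw() output above
raw_bytes <- as.raw(c(
  0xc3, 0x83, 0xc2, 0xa5,        # intended: å
  0xc3, 0x83, 0xc2, 0xa4,        # intended: ä
  0xc3, 0x83, 0xc2, 0xb6,        # intended: ö
  0xc3, 0x83, 0xe2, 0x80, 0xa6,  # intended: Å
  0xc3, 0x83, 0xe2, 0x80, 0x9e,  # intended: Ä
  0xc3, 0x83, 0xe2, 0x80, 0x93   # intended: Ö
))

# Interpret the bytes as one UTF-8 string
s <- rawToChar(raw_bytes)
Encoding(s) <- "UTF-8"

# Decode to code points: 12 characters for 6 intended letters
code_points <- utf8ToInt(s)
print(sprintf("U+%04X", code_points))
#  [1] "U+00C3" "U+00A5" "U+00C3" "U+00A4" "U+00C3" "U+00B6"
#  [7] "U+00C3" "U+2026" "U+00C3" "U+201E" "U+00C3" "U+2013"
```

U+2026, U+201E and U+2013 are the characters that Windows-1252 assigns to the bytes 85, 84 and 96, and c3 85, c3 84 and c3 96 are the UTF-8 encodings of Å, Ä and Ö.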


Edit: This code works when I knit the document, but not when I run the chunk interactively.

---
title: "test_yaml_encoding"
output: html_document
params:
  swe_chars_param: "Swedish municipalities: Åre, Östersund, Älmhult, Mölndal"
  sv_umlauts: "åäöÅÄÖ"
---

{r}

ffix_swedish_chars <- function(txt) {

  # convert the string to a character string of hexvalues that can be
  # used as input to gsub
  hex_str <-
    paste(as.character(charToRaw(txt)), collapse = "")
  
  # Replace the hex codes of the wrongly encoded chars with the right hex codes
  hex_str <- gsub("c383c2a5", "e5", hex_str) # å
  hex_str <- gsub("c383c2a4", "e4", hex_str) # ä
  hex_str <- gsub("c383c2b6", "f6", hex_str) # ö
  hex_str <- gsub("c383e280a6", "c5", hex_str) # Å
  hex_str <- gsub("c383e2809e", "c4", hex_str) # Ä
  hex_str <- gsub("c383e28093", "d6", hex_str) # Ö
  
  # Split the hexcode string to a vector of chars where every element is 2 chars long
  hex_vec <-
    substring(hex_str, seq(1, nchar(hex_str) - 1, 2), seq(2, nchar(hex_str), 2))

  # Transform the character vector of hexcodes to integers and then to a
  # vector of type raw. Then transfer the raw vector to a character vector.
  txt_correct <-
    rawToChar(as.raw(strtoi(hex_vec, base = 16L)))
  Encoding(txt_correct) <- "UTF-8"
  return(txt_correct)
}

swe_chars_param <- unname(ffix_swedish_chars(txt = params$swe_chars_param))
sv_umlauts <- unname(ffix_swedish_chars(txt = params$sv_umlauts))

# Check if the result really is identical to a string defined in the chunk

identical(swe_chars_param, "Swedish municipalities: Åre, Östersund, Älmhult, Mölndal")

# [1] TRUE
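
A possibly simpler alternative to the hex replacement, sketched here under the assumption that the parameter was UTF-8 text that got read as Windows-1252 and re-encoded (which is what the raw bytes suggest): convert the string from UTF-8 to CP1252 with `iconv()` and mark the resulting bytes as UTF-8. The function name `fix_double_utf8()` is made up for illustration, and I have only reasoned about it for strings like the ones in this post. Strings that do not survive the round trip are returned unchanged.

```r
# Hypothetical helper: undo one round of "UTF-8 read as Windows-1252".
fix_double_utf8 <- function(txt) {
  # Map every character back to its single Windows-1252 byte
  out <- iconv(enc2utf8(txt), from = "UTF-8", to = "CP1252")
  # Those bytes are assumed to be the original UTF-8 sequence
  Encoding(out) <- "UTF-8"
  # Keep the input where the conversion failed or gave invalid UTF-8,
  # for example when the text was correctly encoded to begin with
  ok <- !is.na(out) & validUTF8(out)
  out[!ok] <- txt[!ok]
  unname(out)
}

# The mangled parameter value, written with escapes so the example does
# not depend on the encoding of the source file
mangled <- "Swedish municipalities: \u00c3\u2026re, \u00c3\u2013stersund, \u00c3\u201elmhult, M\u00c3\u00b6lndal"
expected <- "Swedish municipalities: \u00c5re, \u00d6stersund, \u00c4lmhult, M\u00f6lndal"

print(identical(fix_double_utf8(mangled), expected))
```

In the report this would be called as `fix_double_utf8(params$swe_chars_param)`. One caveat: a string that really is meant to contain a sequence like "Ã¥" would also be changed, so this is a workaround rather than a general fix.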