Why are UTF-8 characters printed to the console as Latin-1 code?

readr
encoding

#1

Edit: Because apparently they never were UTF-8-encoded in the first place.


I'm trying to read in a CSV file that was saved from Excel with UTF-8 encoding. The file contains only strings of German words. When I print the tibble, cells containing a word with, say, an umlaut ("ä" etc.) appear in quotes, and the umlaut is replaced by its ISO-8859-1 byte code.

For instance, store a file Test.csv with the content

name;city
Bärbel;Berlin

and read it with dat <- read_csv2("Test.csv"). When printing it, this is the output I see:

> dat
# A tibble: 1 x 2
  name        city  
  <chr>       <chr> 
1 "B\xe4rbel" Berlin

The encoding is apparently right:

> Encoding(dat$name)
[1] "UTF-8"

Then again:

> guess_encoding(charToRaw(dat$name))
# A tibble: 2 x 2
  encoding   confidence
  <chr>           <dbl>
1 ISO-8859-1       0.42
2 ISO-8859-2       0.42

So, what is printed to the console is encoded in Latin-1? At least, when I read in via dat <- read_csv2("Test.csv", locale = locale(encoding = "ISO-8859-1")), I get

> dat
# A tibble: 1 x 2
  name   city  
  <chr>  <chr> 
1 Bärbel Berlin

My question ultimately is this: Do I have to specify separately in what format characters are stored and in what way they are printed to the console?

(Apologies if this is a bad question, but I'm quite new to R and I have thoroughly tried to find an answer elsewhere, to no avail... Any help would be much appreciated!)


#2

The problem seems to be that the file you are reading is not UTF-8, as you think; it is actually ISO-8859-1 / Latin-1. E4 is the byte code for ä in Latin-1 (https://www.fileformat.info/info/unicode/char/00e4/index.htm). I can reproduce your symptoms exactly by saving the file with an ISO-8859-1 encoding in RStudio (File->Save with Encoding...).
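To make the byte-level difference concrete, here is a small base-R sketch (not part of the original answer) that builds "Bärbel" from hand-written Latin-1 bytes and converts it to UTF-8. The variable names are only illustrative.

```r
# "Bärbel" as Latin-1 stores it: ä is the single byte E4.
latin1_bytes <- as.raw(c(0x42, 0xe4, 0x72, 0x62, 0x65, 0x6c))

# Convert those bytes from Latin-1 to UTF-8.
name_utf8 <- iconv(rawToChar(latin1_bytes), from = "latin1", to = "UTF-8")

# In UTF-8, ä takes two bytes (C3 A4), so the byte sequences differ.
utf8_bytes <- charToRaw(name_utf8)
print(latin1_bytes)
print(utf8_bytes)
```

A lone E4 byte followed by an ASCII letter is not a well-formed UTF-8 sequence, which is consistent with the console falling back to the "\xe4" escape.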

The default readr locale encoding is UTF-8, so to read in data stored in any other encoding you need to specify it explicitly, like you did in the last example.
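As a rough sketch of that workflow, the snippet below writes a tiny Latin-1 file to a temporary path (the path and the hand-assembled bytes are assumptions made for the example), checks the file with guess_encoding() before parsing, and then passes the encoding to read_csv2().

```r
library(readr)

# Write a small Latin-1 encoded file (ä stored as the single byte E4).
path <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("name;city\nB"), as.raw(0xe4), charToRaw("rbel;Berlin\n")),
  path
)

# Inspect the file's bytes before parsing; the result is only a guess.
print(guess_encoding(path))

# Tell readr the actual input encoding.
dat <- read_csv2(path, locale = locale(encoding = "ISO-8859-1"))
print(dat)
```

Running guess_encoding() on the file itself, rather than on a column that has already been read, avoids confusing the input encoding with readr's output encoding.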

Note that regardless of the input file encoding readr always converts the data to UTF-8.


#3

Thank you! I saved the original file myself in Excel several times, every time painstakingly making sure to choose UTF-8 encoding... Looks like either I or Excel am not doing what we're supposed to. :grimacing:

Is that why Encoding(dat$name) returns "UTF-8"? That may have led me to believe that the (input) encoding as such was indeed UTF-8.


#4

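Yes. At that point, you've already read in the file, and readr has converted the data to UTF-8, so Encoding() reflects readr's output rather than the encoding of the input file.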
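A minimal base-R sketch of why the encoding mark alone can mislead: it assumes a string that holds Latin-1 bytes but is declared UTF-8, which resembles the "B\xe4rbel" value from the first read. Encoding() reports only the declared mark, while validUTF8() checks the bytes themselves.

```r
# A string whose bytes are Latin-1 (E4 for ä) ...
x <- rawToChar(as.raw(c(0x42, 0xe4, 0x72, 0x62, 0x65, 0x6c)))

# ... but which is declared to be UTF-8.
Encoding(x) <- "UTF-8"

mark <- Encoding(x)      # the declared mark only
is_valid <- validUTF8(x) # whether the bytes are well-formed UTF-8

print(mark)
print(is_valid)
```

Checking validUTF8() on a freshly read column is one way to spot a file that was parsed with the wrong input encoding.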