Edit: Because apparently the strings never were UTF-8-encoded in the first place.
I'm trying to read in a CSV file that was saved from Excel, supposedly with UTF-8 encoding. The file contains only strings of German words. When I print the tibble, cells that contain a word with an umlaut ("ä" etc.) appear in quotes, and instead of the umlaut its ISO-8859-1 byte escape is shown.
For instance, store a file Test.csv with the content

    name;city
    Bärbel;Berlin

and read it with

    dat <- read_csv2("Test.csv")

When printing it, this is the output I see:
    > dat
    # A tibble: 1 x 2
      name        city  
      <chr>       <chr> 
    1 "B\xe4rbel" Berlin
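Presumably one could also run guess_encoding on the file itself rather than on an already-read string (a sketch; it assumes Test.csv sits in the current working directory):

```r
library(readr)

# Guess the encoding directly from the raw bytes of the file,
# before any locale is applied by read_csv2().
guess_encoding("Test.csv")
```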
The declared encoding apparently is right:

    > Encoding(dat$name)
    [1] "UTF-8"

but guessing from the raw bytes tells a different story:

    > guess_encoding(charToRaw(dat$name))
    # A tibble: 2 x 2
      encoding   confidence
      <chr>           <dbl>
    1 ISO-8859-1       0.42
    2 ISO-8859-2       0.42
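Inspecting the raw bytes directly seems to confirm this (a quick sketch; the string literal below just reproduces the value as it was read from the file):

```r
# The value as read from the file, with the raw byte 0xe4 embedded.
x <- "B\xe4rbel"

# 0xe4 is "ä" in ISO-8859-1, but it is not a valid byte sequence
# in UTF-8 -- which is why printing falls back to the escape form.
charToRaw(x)
#> [1] 42 e4 72 62 65 6c
```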
So, what is printed to the console is encoded in Latin-1? At least, when I read the file in via

    dat <- read_csv2("Test.csv", locale = locale(encoding = "ISO-8859-1"))

I get
    > dat
    # A tibble: 1 x 2
      name   city  
      <chr>  <chr> 
    1 Bärbel Berlin
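If the file has already been read with the wrong locale, I suppose the column could also be converted after the fact with iconv (a sketch, not necessarily the idiomatic fix):

```r
# Reinterpret the Latin-1 bytes of an already-read column as UTF-8.
dat$name <- iconv(dat$name, from = "ISO-8859-1", to = "UTF-8")
```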
My question, ultimately, is this: do I have to specify separately in what encoding the characters are stored and in what encoding they are printed to the console?
(Apologies if this is a bad question, but I'm quite new to R and I have thoroughly tried to find an answer elsewhere, to no avail. Any help would be much appreciated!)