Conversion of MARC-8 to UTF-8 in R

Hi there.
I have a .csv file with some Cyrillic text. When I used readr::read_delim() to load it into R, I ran into an encoding issue: the text is shown as <c0><e1>..... When loaded into Google Sheets, it works just fine and the bytes are converted into Cyrillic characters.
After some googling, I found that these hex codes match neither ASCII nor MARC-8. Furthermore, R already thinks the text is UTF-8.
Interestingly, readr::read_delim() and read.csv() read this file differently.

suppressWarnings(library(tidyverse))
rates <- suppressMessages(read_delim("rates.csv", skip = 2, delim = ";"))
x <- rates$SHORTNAME[[2]]
# x should be "АСКО" (Cyrillic)
x

#> [1] "<c0><d1><ca><ce>"
Encoding(x)
#> [1] "UTF-8"

rates2 <- read.csv("rates.csv", skip = 2, sep = ";")
y <- rates2$SHORTNAME[[2]]
# y also should be "АСКО" (Cyrillic)
y
#> [1] "ÀÑÊÎ"
Encoding(y)
#> [1] "unknown"

So my question is: how do I convert this text into something readable?
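
A quick way to confirm that the "UTF-8" mark is misleading is sketched below. It rebuilds the string from the four bytes printed above instead of reading rates.csv, so the byte values and the manual Encoding() assignment are assumptions standing in for what read_delim() returned. Base R's validUTF8() checks the bytes themselves rather than the mark.

```r
# Rebuild the string from the bytes shown in the output: <c0><d1><ca><ce>
x <- rawToChar(as.raw(c(0xc0, 0xd1, 0xca, 0xce)))

# Mimic the situation in the question: the string is marked as UTF-8
Encoding(x) <- "UTF-8"
Encoding(x)
#> [1] "UTF-8"

# The bytes themselves are not valid UTF-8 (0xc0 can never start a sequence)
validUTF8(x)
#> [1] FALSE
```

If this returns FALSE for the real column as well, the file is most likely stored in a single-byte encoding and only labelled as UTF-8 on import.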

Hello Oleg,

Does it help when you call the functions with a file connection instead of a file name?
E.g.

file1 <- file('rates.csv',encoding="whatever you think it should be")
rates <- read.csv(file1,skip=2,sep=";")
close(file1)

@HanOostdijk, thank you for the suggestion. I don't know why, but it returns a data.frame with just one row, and only the first column has a non-NA value (which was in English in the original file). Without encoding = ... it loads everything.

file1 <- file("rates.csv", encoding="windows-1251")
rates <- read.csv(file1, skip=2, sep=";")
#> Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
#> invalid input found on input connection 'rates.csv'
#> Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
#> incomplete final line found by readTableHeader on 'rates.csv'
close(file1)
#> Error in close.connection(file1): invalid connection

I found a different solution: essentially, load the file as is and then convert the values with iconv():

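library(tidyverse)

# Load the file without declaring an encoding
rates <- read_delim("rates.csv", skip = 2, delim = ";")
# Names of the columns that contain Cyrillic text
rus_cols <- c(...)
# Reinterpret the bytes in those columns as windows-1251 and re-encode them as UTF-8
rates <- rates %>%
  mutate(across(all_of(rus_cols), ~ iconv(.x, from = "windows-1251", to = "UTF-8")))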
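
To see why windows-1251 is the right source encoding here, the four bytes from the question can be converted by hand. This is only a sketch: it builds the string from the byte values printed in the first post instead of reading rates.csv, and it prints code points rather than the letters so that the result does not depend on the console locale.

```r
# The four bytes printed in the question: <c0><d1><ca><ce>
x <- rawToChar(as.raw(c(0xc0, 0xd1, 0xca, 0xce)))

# Read as windows-1251, the bytes are the Cyrillic letters of "АСКО"
cyr <- iconv(x, from = "windows-1251", to = "UTF-8")
utf8ToInt(cyr)
#> [1] 1040 1057 1050 1054
# i.e. U+0410, U+0421, U+041A, U+041E

# Read as latin1, the same bytes are the accented letters that read.csv() displayed
lat <- iconv(x, from = "latin1", to = "UTF-8")
utf8ToInt(lat)
#> [1] 192 209 202 206
```

So both functions read the same bytes; they only differ in how the result was marked and displayed.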