Trouble Preserving Diacritic Marks in Names

Using readr::read_csv(), I find that some names with diacritical marks come through as escaped bytes, and I'm having trouble figuring out how to get them back into their proper form for output.

For example, the name Renato Núñez in the file is stored in my data.frame as Renato N\xfa\xf1ez.

Is there a way to read them without the escaping or, alternatively, a way to convert the encoding back so the names are presented properly?

Appreciate any guidance.

Renato Núñez is not being UTF-8 encoded, either in the source file or in the R environment.

library(readr)
# external csv
# nombre
#Renato Núñez
#
#csv must have trailing blank line
read_csv("~/Desktop/nombre.csv")
#> Parsed with column specification:
#> cols(
#>   nombre = col_character()
#> )
#> # A tibble: 1 x 1
#>   nombre      
#>   <chr>       
#> 1 Renato Núñez

Created on 2020-08-07 by the reprex package (v0.3.0)

Thanks for your reply, @technocrat. I see what you're saying.

The source of the data is an online CSV. When I copy and paste that data into my editor I correctly see Renato Núñez. But when I read it with read_csv() and inspect it, the same data is shown as Renato N˙Òez.

Copy/paste may not be reliable. Check the source encoding by downloading the csv and checking in the terminal with head, preferably. If the source encoding is correct, then the R client encoding needs to be adjusted.
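
If a terminal is not handy, roughly the same check can be sketched from within R. This assumes the bytes shown in the question (\xfa and \xf1) are what the file really contains; the variable name x is only for illustration.

```r
# The value as it appeared in the data frame, written with explicit byte escapes
x <- "Renato N\xfa\xf1ez"

# FALSE here means the bytes do not form valid UTF-8
validUTF8(x)

# Show the raw bytes; fa and f1 are the Latin-1 (ISO-8859-1) codes for ú and ñ
charToRaw(x)
```

Single bytes above 0x7f, like these, point to a one-byte encoding such as Latin-1 rather than UTF-8, where ú and ñ would each take two bytes.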

Wow, interesting.

So, at the terminal,
cat data.csv | grep "Renato"

Yields
Renato N��ez

That could be a LOCALE environment setting. UTF-8 is preferable.

Was thinking about locales so I tried:

data <- readr::read_delim("http://crunchtimebaseball.com/master.csv", 
      delim = ",", locale = readr::locale(encoding = "UTF-8"))

But then

data %>% 
  filter(mlb_name == "Renato Nunez") %>% 
  select(mlb_name, yahoo_name)

Still yields,

  mlb_name     yahoo_name          
  <chr>        <chr>               
1 Renato Nunez "Renato N\xfa\xf1ez"

I see that the readr documentation notes that the encoding parameter "only affects how the file is read - readr always converts the output to UTF-8."
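
One way to act on that note, as a sketch only: if the file is assumed to be Latin-1 (ISO-8859-1), which the bytes \xfa and \xf1 suggest but nothing here confirms, the input encoding can be declared so that readr does the conversion to UTF-8. The temporary file below is a made-up stand-in for the real CSV.

```r
library(readr)

# Build a tiny stand-in file whose name column holds Latin-1 bytes (assumption)
path <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("yahoo_name\nRenato N"), as.raw(c(0xfa, 0xf1)), charToRaw("ez\n")),
  path
)

# Declare the input encoding; readr converts the result to UTF-8
dat <- read_csv(
  path,
  col_types = cols(yahoo_name = col_character()),
  locale = locale(encoding = "latin1")
)

dat$yahoo_name
validUTF8(dat$yahoo_name)
```

Declaring encoding = "UTF-8" on a file that is not UTF-8 changes nothing, since UTF-8 is already the default, which would explain why the attempt above made no difference.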

yahoo_name in the source data is encoded inconsistently, probably due to cut-and-paste provenance.
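
With mixed encodings in one column, a single file-level encoding would garble whichever rows are already UTF-8. A minimal sketch for repairing such a column after reading, assuming the stray values are Latin-1; fix_mixed is a hypothetical helper, not part of readr.

```r
# Hypothetical helper: convert only the elements that are not valid UTF-8
fix_mixed <- function(x, from = "latin1") {
  bad <- !is.na(x) & !validUTF8(x)
  # Reinterpret the invalid elements as `from` and re-encode them as UTF-8
  x[bad] <- iconv(x[bad], from = from, to = "UTF-8")
  x
}

# One Latin-1 value, one value that is already UTF-8, one plain ASCII value
yahoo_name <- c("Renato N\xfa\xf1ez", "Renato N\u00fa\u00f1ez", "Renato Nunez")

fixed <- fix_mixed(yahoo_name)
fixed
all(validUTF8(fixed))
```

Values that are already valid UTF-8, including plain ASCII, pass through untouched, so the helper can be run over the whole column.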

Ha! OK. I'll hunt down a better source at some point.

Thank you so much for all your help.

— Robert
