How to get the UTF-8 codes from a text string?

In some cases I need to know which UTF-8 code a particular character has, because there are encoding errors in the document. Is there any function that takes a text string as an argument and returns the UTF-8 codes? I'm thinking of something like this:

utf8_to_code ("abc")

Output: "\U0061\U0062\U0063"
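
A minimal sketch of roughly that behaviour in base R, for reference: utf8ToInt() returns the integer code points of a string, and sprintf() can format them as \UXXXX escapes (utf8_to_code is just the name from the question):

utf8_to_code <- function(x) {
  # utf8ToInt() returns one integer code point per character
  paste0(sprintf("\\U%04X", utf8ToInt(x)), collapse = "")
}

cat(utf8_to_code("abc"))
#> \U0061\U0062\U0063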

I think stri_enc_toutf8() in the stringi package might do what you're looking for:
https://jangorecki.gitlab.io/data.cube/library/stringi/html/stri_enc_toutf8.html

Kevin Ushey also wrote up his own function for this in his post on String Encoding and R, though note that (I think) you need to know or declare the locale for these to work properly:
https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
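
(For what it's worth, you can check the character-handling locale from R with base R's Sys.getlocale(); the value shown below is just one possibility and varies by OS and setup:)

Sys.getlocale("LC_CTYPE")
#> [1] "en_US.UTF-8"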

Thanks!

But I think that R (or RStudio) is probably too smart:

stringi::stri_enc_toutf8("abc")
[1] "abc"

I guess that the output from stri_enc_toutf8() is "\U0061\U0062\U0063" but that R translates it back to characters.
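
That guess is easy to check, by the way: \U escapes are resolved by the R parser when a literal is read, so the two spellings are the same string:

identical("\U0061\U0062\U0063", "abc")
#> [1] TRUE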

You could try pryr::bits() (from Kevin's post) and then see if there's something to convert from binary to UTF-8 codes…
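
For single-byte (ASCII) characters the bits are the code point itself, so converting back is just strtoi(); a sketch (multi-byte UTF-8 characters would need real decoding first, since their bytes are not the code point):

bits <- "01100001"  # e.g. what pryr::bits() shows for "a"
sprintf("U+%04X", strtoi(bits, base = 2))
#> [1] "U+0061"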

OK, so I was able to scrape a data frame for you that has the binary and the UTF-8 codes (I'm just showing you a subset below, because the first several entries are <control> characters and blanks).

Because string encoding is, well, unpredictably weird, your results may vary, or you might want a different set of characters, etc., but the method I used should work for the various combinations available on the site:
https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=bin

library(tidyverse)
library(janitor)
library(rvest)
#> Loading required package: xml2
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#> 
#>     pluck
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding

url <- "https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=bin"

# Read the page and pull out the encoding tables
utf8_enc <- url %>%
  read_html() %>%
  html_nodes(css = 'body > table.codetable') %>%
  html_table()

# First table on the page; clean up the column names
utf8_enc_tab <- utf8_enc[[1]] %>%
  janitor::clean_names()

utf8_enc_tab %>%
  slice(70:80)
#>    unicodecode_point character utf_8_bin                   name
#> 1             U+0045         E  01000101 LATIN CAPITAL LETTER E
#> 2             U+0046         F  01000110 LATIN CAPITAL LETTER F
#> 3             U+0047         G  01000111 LATIN CAPITAL LETTER G
#> 4             U+0048         H  01001000 LATIN CAPITAL LETTER H
#> 5             U+0049         I  01001001 LATIN CAPITAL LETTER I
#> 6             U+004A         J  01001010 LATIN CAPITAL LETTER J
#> 7             U+004B         K  01001011 LATIN CAPITAL LETTER K
#> 8             U+004C         L  01001100 LATIN CAPITAL LETTER L
#> 9             U+004D         M  01001101 LATIN CAPITAL LETTER M
#> 10            U+004E         N  01001110 LATIN CAPITAL LETTER N
#> 11            U+004F         O  01001111 LATIN CAPITAL LETTER O

Created on 2019-02-28 by the reprex package (v0.2.1)

I did write it out to a CSV, but I suggest you do the scraping on your own machine, since these things vary from OS to OS, etc.

You can then basically use this to do a lookup:

utf8_enc_tab <- as_tibble(utf8_enc[[1]]) %>%
  janitor::clean_names()


x <- "abc"
characters <- strsplit(x, "")[[1]]

char_frame <- tibble(chars = characters)

char_frame <- char_frame %>%
  mutate(bits = pryr::bits(chars)) %>%
  left_join(utf8_enc_tab, by = c("chars" = "character"))

char_frame
#> # A tibble: 3 x 5
#>   chars bits     unicodecode_point utf_8_bin name                
#>   <chr> <chr>    <chr>             <chr>     <chr>               
#> 1 a     01100001 U+0061            01100001  LATIN SMALL LETTER A
#> 2 b     01100010 U+0062            01100010  LATIN SMALL LETTER B
#> 3 c     01100011 U+0063            01100011  LATIN SMALL LETTER C

Thanks!

The solution works fine (though I think I'll have to find a more comprehensive table linking byte codes to Unicode).

Regards

Christian

Definitely. That's just one alphabet in there (Latin characters). I imagine the Unicode Consortium has a dataset somewhere.
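
For what it's worth, the CRAN Unicode package wraps the Unicode Character Database, so (if I'm reading its docs right) the name lookup can be done without scraping. An untested sketch:

library(Unicode)
# as.u_char() coerces an integer code point; u_char_name() looks up
# the official character name in the Unicode Character Database
u_char_name(as.u_char(utf8ToInt("å")))
#> [1] "LATIN SMALL LETTER A WITH RING ABOVE"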
