HTML characters (e.g. &#179;) to Unicode (e.g. \u00B3)?


#1

I'm looking for a way to convert HTML characters (e.g. &#179;; entity numbers, I think they are called) to Unicode (e.g. \u00B3). My use case is that I get labels like "Streamflow, ft&#179;/s" that I want to put on a ggplot2 graph. Here is what I've come up with, but it's (a) not working and (b) not ideal:

library(xml2)
unescape_html <- function(str){
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}


ugly_string <- "Streamflow, ft&#179;/s"
fancy_chars <- regmatches(ugly_string, gregexpr("&#\\d{3};", ugly_string)) 

replacement <- unescape_html(fancy_chars)

formatted_chars <- gsub(pattern = "&#\\d{3};", 
                        replacement = replacement, 
                        x = ugly_string)
formatted_chars 
[1] "Streamflow, ftÂ³/s"

So, it's close, but there's still a funky Â. The end goal is to get ugly_string as a not-so-ugly axis label on a ggplot2 plot.
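
A note on the stray Â: one plausible cause (an assumption, since the post doesn't state the platform or locale) is that the UTF-8 bytes for ³ are being read as Latin-1 somewhere along the way. A minimal base-R sketch of that effect follows; the variable names are only illustrative.

```r
# U+00B3 (superscript three) encoded as UTF-8 is the two bytes c2 b3.
utf8_bytes <- charToRaw(intToUtf8(179L))
print(utf8_bytes)

# Reinterpret those same two bytes as Latin-1 and convert to UTF-8.
# Each byte now becomes its own character: 0xC2 is "Â" and 0xB3 is "³".
misread <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")
print(utf8ToInt(misread))  # code points 194 (Â) and 179 (³)
```

If that is the cause here, fixing the encoding is probably safer than deleting the Â after the fact.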


#2

Have you looked at bquote?


#3

You can add to your ggplot:

+ labs(x = bquote("Streamflow, " ~ ft^3/s), y = "blablabla")

(Or the other way around depending on whether streamflow is the x or y axis.)


#4

Here is an example with a silly graph:

library(tidyverse)

my_dat <- tibble(
  streamflow = letters[1:10],
  y = 1:10
)

my_dat %>% ggplot(aes(streamflow, y)) +
  geom_point() + 
  labs(x = bquote("Streamflow, " ~ ft^3/s), y = "y")


#5

Thanks! I can use bquote when I know the expression in advance. The issue is that I have a web service that spits out THOUSANDS of parameters with these HTML codes (for instance, maybe it's cubic meters, or degrees, or who-knows-what). I'm trying to write a function (or use an existing one) that makes the labels pretty without my having to write them by hand.

I have a functional one now:

unescape_html <- function(str){
   
  fancy_chars <- regmatches(str, gregexpr("&#\\d{3};",str)) 

  unescaped <- xml2::xml_text(xml2::read_html(paste0("<x>", fancy_chars, "</x>")))

  fancy_chars <- gsub(pattern = "&#\\d{3};", 
                      replacement = unescaped, x = str)

  fancy_chars <- gsub("Â","", fancy_chars)
  return(fancy_chars)
}
unescape_html("Streamflow, ft&#179;/s")
[1] "Streamflow, ft³/s"

I'm just not confident how robust it is.
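
On robustness: the version above appears to assume a single three-digit entity per string, and it removes every Â, including any that legitimately belong to the label. A minimal base-R sketch that decodes each numeric entity on its own is shown below. The name decode_numeric_entities is hypothetical (not from any package), and it handles only numeric entities (decimal or hex), not named ones such as &deg;.

```r
# Hypothetical helper: decode numeric HTML entities such as "&#179;" or
# "&#xB3;" in a character vector, using base R only.
decode_numeric_entities <- function(x) {
  pattern <- "&#([0-9]+|[xX][0-9A-Fa-f]+);"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(entities) {
    # Strip the leading "&#" and trailing ";" to get the code itself.
    codes <- sub("^&#(.*);$", "\\1", entities)
    is_hex <- grepl("^[xX]", codes)
    code_points <- integer(length(codes))
    code_points[is_hex]  <- strtoi(substring(codes[is_hex], 2), base = 16L)
    code_points[!is_hex] <- strtoi(codes[!is_hex], base = 10L)
    # One UTF-8 character per entity, in the order the entities appeared.
    vapply(code_points, intToUtf8, character(1))
  })
  x
}

decode_numeric_entities("Streamflow, ft&#179;/s")
#> [1] "Streamflow, ft³/s"
```

Because each match is replaced separately, strings with several different entities (or none) go through the same path, and the result can be passed straight to labs().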


#6

How about:

html_to_unicode <- function(x) {
  tmp <- tempfile(fileext = ".html")
  on.exit(file.remove(tmp))
  tmp_out <- tempfile(fileext = ".md")
  # add = TRUE keeps the first handler; without it this call would replace it
  on.exit(file.remove(tmp_out), add = TRUE)
  
  write(x, tmp)
  rmarkdown::pandoc_convert(tmp, output = tmp_out)
  readLines(tmp_out)
}

ugly_string <- "Streamflow, ft&#179;/s"
html_to_unicode(ugly_string)
#> [1] "Streamflow, ft³/s"

Created on 2018-04-11 by the reprex package (v0.2.0).