fi displaying as <U+FB01>

Isaiah · January 4, 2019, 12:13pm

In RStudio version 1.1.463, running on Windows 10, the following tibble is not displaying consistently.

# A tibble: 1 x 1
  word 
  <chr>
1 ﬁeld

The ligature "ﬁ" (U+FB01) displays correctly in the console pane
but as "<U+FB01>" in the viewer and environment panes.
The same issue occurs for "ﬂ" (U+FB02).

dput(x)
structure(list(word = "field"), row.names = c(NA, -1L), class = c("tbl_df", 
"tbl", "data.frame"))
#>          word
#> 1 field

The editor font is set to Lucida Console, and the theme is Modern.
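
One way to narrow this down is to check how R has the string marked and what the session treats as its native encoding. The following is a minimal sketch, not a diagnosis. It assumes the "<U+FB01>" form appears when something converts the string to a native encoding that cannot represent the ligature, as can happen on Windows with a non-UTF-8 code page. The data frame `x` is a stand-in built from an escape sequence, not the original data.

```r
# Stand-in for the problem value: "\ufb01" is LATIN SMALL LIGATURE FI
x <- data.frame(word = "\ufb01eld", stringsAsFactors = FALSE)

# How R has the string marked internally
enc <- Encoding(x$word)
print(enc)

# Whether the session's native encoding is UTF-8
loc <- l10n_info()
print(loc[["UTF-8"]])

# Convert to the native encoding; where the ligature cannot be represented,
# it may come back in an escaped form such as <U+FB01>
native <- enc2native(x$word)
print(native)
```

If `loc[["UTF-8"]]` is FALSE and `native` prints with the escape, the data itself is intact and only the conversion for display is lossy.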

mara · January 4, 2019, 12:47pm

Hmm, what is the original character data? I'm on a Mac, so I'm not sure I can reproduce your results. Using your dput() from above I got regular old "field" in all views:

Can you include a reprex with the code you're using to create the tibble?

# Characters f-i
pryr::bits('fi')
#> [1] "01100110 01101001"
# Manually entered latin small ligature fi
pryr::bits("ﬁ")
#> [1] "11101111 10101100 10000001"

# from your top code chunk, which contains U+FB01
pryr::bits("ﬁeld")
#> [1] "11101111 10101100 10000001 01100101 01101100 01100100"
# from your dput, which does not
pryr::bits("field")
#> [1] "01100110 01101001 01100101 01101100 01100100"

Created on 2019-01-04 by the reprex package (v0.2.1.9000)
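
The same comparison can be made with base R alone, without pryr. This is a small sketch using `utf8ToInt()` and `charToRaw()`. The ligature is written as the escape `\ufb01` so the script does not depend on how the source file is encoded.

```r
lig   <- "\ufb01eld"   # starts with U+FB01, LATIN SMALL LIGATURE FI
plain <- "field"       # plain ASCII letters f, i, e, l, d

# Unicode code points of each string
lig_points   <- sprintf("U+%04X", utf8ToInt(lig))
plain_points <- sprintf("U+%04X", utf8ToInt(plain))
print(lig_points)
print(plain_points)

# UTF-8 bytes of the ligature alone: three bytes (ef ac 81),
# the same bytes that pryr::bits() shows in binary
lig_bytes <- charToRaw(enc2utf8("\ufb01"))
print(lig_bytes)

# One ligature character stands in for two letters, so the lengths differ
print(nchar(lig))
print(nchar(plain))
```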

jcblum · January 4, 2019, 9:35pm

Unless I’ve missed something, I don’t think this question involves R Markdown? The OP’s question appears to be about aspects of the RStudio IDE interface (viewer, environment pane).

Isaiah · January 4, 2019, 11:34pm

Thanks Mara, below is some code to make a tibble where every row shows the issue.

library(tidytext)
library(tidyverse)
library(rvest)
library(pdftools)

report <- 
  "http://www.bhp.com/~/media/bhp/documents/investors/reports/2011/bhpbillitonannualreport2011.pdf?la=en" %>%
  pdf_text %>%
  paste(collapse = " ") %>%
  tibble(text = .) %>%
  unnest_tokens(word, text) %>% 
  arrange(word) %>%
  rowid_to_column("ID") %>%
  filter(between(ID, 23, 673))

yonicd · January 5, 2019, 1:33am

My mistake, I misunderstood where the encoding was being misread.

andresrcs · January 5, 2019, 2:56am

I don't think this is an IDE issue. I think it has to do with the internal OCR library that the function pdf_text() uses, because I can reproduce it on Windows but get this completely different result on Ubuntu (pdftools uses a different library for OCR on Linux, libpoppler-cpp-dev).

library(tidytext)
library(tidyverse)
library(rvest)
library(pdftools)

report <- 
    "http://www.bhp.com/~/media/bhp/documents/investors/reports/2011/bhpbillitonannualreport2011.pdf?la=en" %>%
    pdf_text %>%
    paste(collapse = " ") %>%
    tibble(text = .) %>%
    unnest_tokens(word, text) %>% 
    arrange(word) %>%
    rowid_to_column("ID") %>%
    filter(between(ID, 23, 673))
report
#> # A tibble: 651 x 2
#>       ID word 
#>    <int> <chr>
#>  1    23 0.0  
#>  2    24 0.0  
#>  3    25 0.0  
#>  4    26 0.0  
#>  5    27 0.0  
#>  6    28 0.0  
#>  7    29 0.0  
#>  8    30 0.0  
#>  9    31 0.0  
#> 10    32 0.0  
#> # ... with 641 more rows

Created on 2019-01-04 by the reprex package (v0.2.1)

Isaiah · January 5, 2019, 5:20am

report %>% purrr::pluck("word") %>% str_detect(pattern = "fi") %>% sum() # 0
report %>% purrr::pluck("word") %>% str_detect(pattern = "ﬁ") %>% sum() # 572
report %>% purrr::pluck("word") %>% str_detect(pattern = "ﬂ") %>% sum() # 79
report %>% purrr::pluck("word") %>% length() # 651 = 572 + 79
report %>% purrr::pluck("word") %>% str_detect(pattern = "FB") %>% sum() # 0 (the data holds no literal "<U+FB01>" text)

So some panes in the IDE show <U+FB01> and some show "ﬁ",
the LATIN SMALL LIGATURE FI character. The same goes for <U+FB02> (LATIN SMALL LIGATURE FL).
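
U+FB01 and U+FB02 are two of the seven Latin ligatures in the range U+FB00 to U+FB06, so a PDF may contain others such as "ﬀ" or "ﬃ". As a sketch, a single character class can count all of them at once. `words` below is a small made-up vector standing in for `report$word`.

```r
library(stringr)

# Made-up sample standing in for report$word
words <- c("\ufb01eld", "\ufb02ow", "o\ufb03ce", "field", "0.0")

# Character class covering U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, long s-t, st)
has_ligature <- str_detect(words, "[\ufb00-\ufb06]")
print(has_ligature)
print(sum(has_ligature))

# A plain "fi" pattern does not match any of the ligature forms
plain_fi <- str_detect(words, "fi")
print(plain_fi)
```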

Isaiah · January 5, 2019, 5:32am

And the repair is:

report$word <- report %>% purrr::pluck("word") %>% str_replace_all(pattern = "ﬁ", replacement = "fi")
report$word <- report %>% purrr::pluck("word") %>% str_replace_all(pattern = "ﬂ", replacement = "fl")
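
An alternative to replacing each ligature by hand is Unicode compatibility normalization (NFKC), which maps compatibility characters such as U+FB01 to their plain equivalents. The sketch below uses `stringi::stri_trans_nfkc()`. Note that NFKC also rewrites other compatibility characters (for example superscript digits and full-width forms), so it is worth checking that this is acceptable for the data. `words` is a made-up vector, not the original report.

```r
library(stringi)

# Made-up sample: three words with ligatures and one already plain
words <- c("\ufb01eld", "\ufb02ow", "o\ufb03ce", "field")

# NFKC normalization expands each ligature into its component letters
fixed <- stri_trans_nfkc(words)
print(fixed)

# After normalization every element of this sample is pure ASCII
print(stri_enc_isascii(fixed))
```

Applied to the tibble from this thread, the equivalent call would be `report$word <- stringi::stri_trans_nfkc(report$word)`.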
