read_lines_raw is not handling the CRLF in UTF-16LE files

kbzsl · May 7, 2019, 5:20am

Hi,

I am trying to import UTF-16LE files (and later convert/process them). After some troubleshooting I found that read_lines_raw() does not handle the CRLF in UTF-16LE files.

> iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
 [1] 61 00 62 00 0d 00 0a 00 31 00 32 00
> readr::read_lines_raw(iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]])
[[1]]
[1] 61 00 62 00

[[2]]
[1] 00

[[3]]
[1] 00 31 00 32 00

Unfortunately, read_lines_raw() has no argument for specifying the line separator.

Is this by design or a fault?
Do you have any idea for a workaround?

Thank you.

mara · May 7, 2019, 10:32am

What is your expected output?

From the readr read_lines_raw() docs:

read_lines_raw() produces a list of raw vectors, and is useful for handling data with unknown encoding.

Also, if you wouldn't mind running examples through reprex in the future, it makes it a bit easier for others to work with your code (since you can just copy and paste it directly)!

iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
#>  [1] 61 00 62 00 0d 00 0a 00 31 00 32 00
readr::read_lines_raw(iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]])
#> [[1]]
#> [1] 61 00 62 00
#> 
#> [[2]]
#> [1] 00
#> 
#> [[3]]
#> [1] 00 31 00 32 00

Created on 2019-05-07 by the reprex package (v0.2.1.9000)

kbzsl · May 7, 2019, 10:57am

Sorry, I am not familiar with reprex. It’s on my backlog to learn it. I tried to compile an easily reproducible example instead.

The example contains two lines, "ab" (61 00 62 00) and "12" (31 00 32 00), in raw UTF-16LE form, separated by a CRLF (0d 00 0a 00) rather than a lone CR.

[1] 61 00 62 00 0d 00 0a 00 31 00 32 00

The expected result is a list of 2 raw vectors (one per line present), not 3 raw vectors.

[[1]]
[1] 61 00 62 00

[[2]]
[1] 31 00 32 00

I used read_lines_raw() to avoid any encoding issues with UTF-16LE (and I read through the documentation before I posted this topic).
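
The expected behaviour described above can be sketched in base R by treating the input as 16-bit little-endian code units rather than single bytes. This is only an illustration under stated assumptions: `split_utf16le_lines()` is a hypothetical helper, not a readr function, and it assumes the input has no byte order mark and an even number of bytes.

```r
# Hypothetical helper (not part of readr): split a raw vector holding
# UTF-16LE text into one raw vector per line, accepting CRLF, CR or LF.
split_utf16le_lines <- function(x) {
  stopifnot(is.raw(x), length(x) %% 2 == 0)
  if (length(x) == 0) return(list())

  # Combine each little-endian byte pair into one 16-bit code unit.
  lo <- as.integer(x[c(TRUE, FALSE)])
  hi <- as.integer(x[c(FALSE, TRUE)])
  units <- lo + 256L * hi
  n <- length(units)

  is_cr <- units == 0x0D
  is_lf <- units == 0x0A
  is_term <- is_cr | is_lf

  # A line ends at every LF, and at every CR that is not followed by an LF.
  line_end <- is_lf | (is_cr & !c(is_lf[-1], FALSE))

  # Line number of each code unit: count of line ends strictly before it.
  line_id <- cumsum(c(0L, head(line_end, -1))) + 1L
  n_lines <- sum(line_end) + as.integer(!line_end[n])

  lapply(seq_len(n_lines), function(i) {
    idx <- which(line_id == i & !is_term)
    # Map code unit positions back to their two byte positions.
    x[as.vector(rbind(2L * idx - 1L, 2L * idx))]
  })
}

# The bytes from the example above: "ab", CRLF, "12" in UTF-16LE.
raw_crlf <- as.raw(c(0x61, 0x00, 0x62, 0x00, 0x0d, 0x00, 0x0a, 0x00,
                     0x31, 0x00, 0x32, 0x00))
split_utf16le_lines(raw_crlf)
```

With the example bytes this should return two raw vectors, 61 00 62 00 and 31 00 32 00. Working on code units means a 0d or 0a byte that is half of some other character (for example U+0A0D) is not mistaken for a terminator.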

mara · May 7, 2019, 11:37am

Possibly related issues (though none seem exactly on, so you might consider filing a new one):

github.com/tidyverse/readr

Support multibyte encodings (e.g. UTF-16LE)

opened 10:45AM - 03 Nov 15 UTC

closed 04:40PM - 20 May 21 UTC

bilydr

feature multibyte

Hi, I am trying to read in a file with UTF-16LE encoding which can be done with base package codes

``` R
df <- read.delim(file1, stringsAsFactors = FALSE, fileEncoding = 'UTF-16LE')
```

but when I try to use readr to do the same

``` R
df <- read_tsv(file1, locale = locale(encoding = 'UTF-16LE'))
```

I got the error **Error: Incomplete multibyte sequence**

Can you please help fix it? Thanks for your advice!

github.com/tidyverse/readr

CRLF is treated as two newlines when skip_empty_rows = FALSE

opened 07:45PM - 28 Feb 19 UTC

closed 03:09PM - 06 May 21 UTC

nacnudus

bug

Windows newlines `\r\n` are treated as two new lines when `skip_empty_rows = FALSE`.

``` r
library(readr)

# This is the output I would expect.
read_csv("foo\n\nbar", col_names = FALSE, skip_empty_rows = FALSE)
#> # A tibble: 3 x 1
#>   X1
#>   <chr>
#> 1 foo
#> 2 <NA>
#> 3 bar

# I would expect the output to be the same as above.
read_csv("foo\r\n\r\nbar", col_names = FALSE, skip_empty_rows = FALSE)
#> # A tibble: 4 x 1
#>   X1
#>   <chr>
#> 1 foo
#> 2 <NA>
#> 3 <NA>
#> 4 bar
```

Created on 2019-02-28 by the [reprex package](https://reprex.tidyverse.org) (v0.2.0.9000).

Session info (collapsed in the issue): R version 3.5.2 (2018-12-20), Arch Linux, x86_64, linux-gnu; readr 1.3.1.9000 from Github (tidyverse/readr@b7e0b99).

kbzsl · May 7, 2019, 12:30pm

Thank you for your answer.

I had checked that ticket (and some others) beforehand. Initially I was not sure that they were connected, because I was reading in raw (= hex) format and did not expect the format/encoding to be parsed during the reading process (read_lines_raw vs read_lines).
But after checking the different end-of-line separators (CRLF, CR and LF), it is clearly visible that the separators are not parsed as 2-byte units: the last raw vector in each case starts with 00.

> raw_crlf = iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
> raw_cr = iconv("ab\r12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
> raw_lf = iconv("ab\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
> 
> readr::read_lines_raw(raw_crlf)
[[1]]
[1] 61 00 62 00

[[2]]
[1] 00

[[3]]
[1] 00 31 00 32 00

> readr::read_lines_raw(raw_cr)
[[1]]
[1] 61 00 62 00

[[2]]
[1] 00 31 00 32 00

> readr::read_lines_raw(raw_lf)
[[1]]
[1] 61 00 62 00

[[2]]
[1] 00 31 00 32 00

I assume that this is connected to the multi-byte issue.
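
One way to check that assumption is to model a splitter that looks at single bytes and compare its result with the output above. The sketch below is only a model: `split_bytewise()` is a hypothetical function written for this illustration, and it has not been checked against readr's actual source.

```r
# Hypothetical model (not readr's code): treat every single 0d or 0a byte
# as a line terminator, ignoring that UTF-16LE characters are 2 bytes wide.
split_bytewise <- function(x) {
  stopifnot(is.raw(x))
  if (length(x) == 0) return(list())
  is_term <- x == as.raw(0x0d) | x == as.raw(0x0a)
  # Line number of each byte: count of terminator bytes strictly before it.
  line_id <- cumsum(c(0L, head(is_term, -1))) + 1L
  lapply(seq_len(max(line_id)), function(i) x[line_id == i & !is_term])
}

# "ab", CRLF, "12" in UTF-16LE, as in the examples above.
raw_crlf <- as.raw(c(0x61, 0x00, 0x62, 0x00, 0x0d, 0x00, 0x0a, 0x00,
                     0x31, 0x00, 0x32, 0x00))
split_bytewise(raw_crlf)
```

For the CRLF input this model gives three pieces, 61 00 62 00, then 00, then 00 31 00 32 00, which matches the read_lines_raw() output shown above. The stray 00 bytes are the high bytes of the 2-byte CR and LF code units.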

In the meantime (until multi-byte support is implemented), do you have any idea for a workaround?
Thank you.
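
One possible workaround, if decoded text is acceptable instead of raw vectors, is to let a base R connection do the re-encoding before lines are split. This is a minimal sketch, assuming the platform's iconv supports "UTF-16LE"; the temporary file only stands in for the real input file.

```r
# Write the example bytes ("ab", CRLF, "12" in UTF-16LE) to a temporary
# file that stands in for the real input.
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x61, 0x00, 0x62, 0x00, 0x0d, 0x00, 0x0a, 0x00,
                  0x31, 0x00, 0x32, 0x00)), path)

# Open the file with an explicit encoding so the bytes are converted
# before readLines() looks for line terminators.
con <- file(path, open = "r", encoding = "UTF-16LE")
lines <- readLines(con, warn = FALSE)
close(con)

lines
unlink(path)
```

Here `lines` should be a character vector with the two elements "ab" and "12". readLines() accepts LF, CRLF or CR as the end-of-line marker, so the same approach covers all three cases tested above.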

mara · May 7, 2019, 2:33pm

I don't, but hopefully someone else will! You might take a look at the iotools package, though I'm not sure if readAsRaw() will fit your use case.
https://CRAN.R-project.org/package=iotools

system · May 28, 2019, 2:33pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.