read_lines_raw is not handling the CRLF in UTF-16LE files

Hi,

I am trying to import UTF-16LE encoded files (and later convert/process them). After some troubleshooting I found that read_lines_raw() does not handle the CRLF line ending in UTF-16LE files correctly.

> iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
 [1] 61 00 62 00 0d 00 0a 00 31 00 32 00
> readr::read_lines_raw(iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]])
[[1]]
[1] 61 00 62 00

[[2]]
[1] 00

[[3]]
[1] 00 31 00 32 00

Unfortunately, a separator argument cannot be used with read_lines_raw().

Is this by design or a fault?
Do you have any idea for a workaround?

Thank you.

What is your expected output?

From the readr read_lines_raw() docs:

read_lines_raw() produces a list of raw vectors, and is useful for handling data with unknown encoding.

Also, if you wouldn't mind running examples through reprex in the future, it makes it a bit easier for others to work with your code (since you can just copy and paste it directly)!

iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
#>  [1] 61 00 62 00 0d 00 0a 00 31 00 32 00
readr::read_lines_raw(iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]])
#> [[1]]
#> [1] 61 00 62 00
#> 
#> [[2]]
#> [1] 00
#> 
#> [[3]]
#> [1] 00 31 00 32 00

Created on 2019-05-07 by the reprex package (v0.2.1.9000)

Sorry, I am not familiar with reprex. It’s on my backlog to learn it. I tried to compile an easily reproducible example instead.

The example contains two lines, "ab" (61 00 62 00) and "12" (31 00 32 00), separated by a CRLF (0d 00 0a 00), not a bare CR, in UTF-16LE encoding (shown in raw format).

[1] 61 00 62 00 0d 00 0a 00 31 00 32 00

The expected result is a list of 2 raw vectors (one per line present), not 3 raw vectors.

[[1]]
[1] 61 00 62 00

[[2]]
[1] 31 00 32 00

I used read_lines_raw() to avoid any issues with the UTF-16LE encoding (and I read through the documentation before I wrote this topic).
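
For illustration, the expected splitting described above can be sketched in base R. This is only a sketch under stated assumptions: `split_utf16le_lines()` is a hypothetical helper (not part of readr), the input is assumed to be UTF-16LE with an even number of bytes, and a leading byte order mark, if present, is not stripped. It looks for CR and LF as whole 16-bit code units rather than as single bytes.

```r
# Hypothetical helper (not a readr function): split UTF-16LE bytes into lines
# by examining 16-bit code units instead of single bytes.
split_utf16le_lines <- function(x) {
  stopifnot(is.raw(x), length(x) %% 2 == 0)
  # Combine each low/high byte pair into one little-endian code unit.
  units <- as.integer(x[c(TRUE, FALSE)]) + 256L * as.integer(x[c(FALSE, TRUE)])
  n <- length(units)
  lines <- list()
  start <- 1L  # index (in code units) where the current line begins
  i <- 1L
  while (i <= n) {
    if (units[i] == 13L || units[i] == 10L) {  # CR or LF code unit
      # Bytes of the current line, terminator excluded (may be empty).
      lines[[length(lines) + 1L]] <- x[seq_len(2L * (i - start)) + 2L * (start - 1L)]
      # Treat CR immediately followed by LF as a single terminator.
      if (units[i] == 13L && i < n && units[i + 1L] == 10L) i <- i + 1L
      start <- i + 1L
    }
    i <- i + 1L
  }
  # Keep a final line that has no trailing terminator.
  if (start <= n) lines[[length(lines) + 1L]] <- x[(2L * start - 1L):(2L * n)]
  lines
}

raw_crlf <- as.raw(c(0x61, 0x00, 0x62, 0x00, 0x0d, 0x00, 0x0a, 0x00,
                     0x31, 0x00, 0x32, 0x00))
res <- split_utf16le_lines(raw_crlf)
# res is a list of two raw vectors: 61 00 62 00 and 31 00 32 00
```

For a file on disk, the raw input could be obtained with something like `readBin(path, what = "raw", n = file.size(path))`, where `path` is a placeholder for the file name.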

Possibly related issues (though none seem exactly on point, so you might consider filing a new one):


Thank you for your answer.

I had already checked that ticket (and some others). Initially I was not sure that they were connected, because I was reading in raw (= hex) format and did not expect the format/encoding to be parsed during the reading process (read_lines_raw vs read_lines).
But after checking the different end-of-line separators (CRLF, CR and LF), it is clearly visible that they are not parsed as 2-byte units: the last raw vector in each case starts with 00.

> raw_crlf = iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
> raw_cr = iconv("ab\r12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
> raw_lf = iconv("ab\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
> 
> readr::read_lines_raw(raw_crlf)
[[1]]
[1] 61 00 62 00

[[2]]
[1] 00

[[3]]
[1] 00 31 00 32 00

> readr::read_lines_raw(raw_cr)
[[1]]
[1] 61 00 62 00

[[2]]
[1] 00 31 00 32 00

> readr::read_lines_raw(raw_lf)
[[1]]
[1] 61 00 62 00

[[2]]
[1] 00 31 00 32 00

I assume that this is connected to the multi-byte issue.
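
To make that assumption concrete, here is a small sketch that splits on single bytes. It is not readr's actual implementation, only a hypothetical emulation (`split_bytewise()` is an invented name) of a splitter that treats each 0d or 0a byte as a line terminator and ignores the 00 byte that follows it in UTF-16LE. Under that assumption it yields the same three pieces as the CRLF output above.

```r
# Hypothetical emulation (NOT readr's code): split at every single 0d or 0a
# byte, treating 0d immediately followed by 0a as one terminator.
split_bytewise <- function(x) {
  out <- list()
  cur <- raw(0)
  n <- length(x)
  i <- 1
  while (i <= n) {
    b <- x[i]
    if (b == as.raw(0x0d) || b == as.raw(0x0a)) {
      out[[length(out) + 1]] <- cur
      cur <- raw(0)
      # In UTF-16LE the byte after 0d is 00, so this CRLF check never matches.
      if (b == as.raw(0x0d) && i < n && x[i + 1] == as.raw(0x0a)) i <- i + 1
    } else {
      cur <- c(cur, b)
    }
    i <- i + 1
  }
  out[[length(out) + 1]] <- cur
  out
}

raw_crlf_bytes <- as.raw(c(0x61, 0x00, 0x62, 0x00, 0x0d, 0x00, 0x0a, 0x00,
                           0x31, 0x00, 0x32, 0x00))
pieces <- split_bytewise(raw_crlf_bytes)
# pieces: 61 00 62 00 | 00 | 00 31 00 32 00
```

The stray 00 pieces are the high bytes of the CR and LF code units, which a byte-oriented splitter leaves behind.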

In the meantime (until multi-byte support is implemented), do you have any idea for a workaround?
Thank you.

I don't, but hopefully someone else will! You might take a look at the iotools package, though I'm not sure if readAsRaw() will fit your use case.
https://CRAN.R-project.org/package=iotools
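
One possible route, assuming decoded text (rather than raw bytes) is acceptable: base R can re-encode while reading if the connection is opened with an `encoding` argument. The sketch below writes the example bytes to a temporary file so that it is self-contained; the temporary path is only a stand-in for a real UTF-16LE file.

```r
# Write the example bytes ("ab", CRLF, "12" in UTF-16LE) to a temporary file.
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x61, 0x00, 0x62, 0x00, 0x0d, 0x00, 0x0a, 0x00,
                  0x31, 0x00, 0x32, 0x00)), path)

# Open a text-mode connection that converts from UTF-16LE while reading.
con <- file(path, open = "r", encoding = "UTF-16LE")
lines <- readLines(con, warn = FALSE)  # warn = FALSE: no final newline in the file
close(con)
unlink(path)
```

Here `lines` holds the two lines as ordinary character strings. Note that the text is converted to the session's native encoding, so characters that the native encoding cannot represent may not survive in a non-UTF-8 locale.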

