Reading and writing in Binary mode: replace \r with \n

# Create a toy text file with CR line endings
write.table(mtcars, file = "toy_text.TXT",
            col.names = FALSE, row.names = FALSE,
            quote = FALSE, eol = "\r")

When I execute the above code, it generates a text file with "CR" because of the eol = "\r".

I am trying to convert CR to LF on a Windows machine. Tweaking the solution given in the Stack Overflow question "How to convert CRLF to LF on a Windows machine in Python", I got the Python code shown below to work. If I understand correctly, the code simply replaces \r with \n in binary mode.

How do I achieve the same result using R?

# replacement strings
WINDOWS_LINE_ENDING = b'\r' # CR
UNIX_LINE_ENDING = b'\n' # LF

# relative or absolute file path, e.g.:
file_path = "toy_text.txt"

with open(file_path, 'rb') as open_file:
    content = open_file.read()
    
# Windows ➡ Unix
content = content.replace(WINDOWS_LINE_ENDING, UNIX_LINE_ENDING)

with open(file_path, 'wb') as open_file:
    open_file.write(content)

I would just use the basic R functions:

mylines <- readLines("toy_text.TXT")
writeLines(mylines, con = "toy_text_n.TXT", sep = "\n")

No need to handle binary.

EDIT:

Actually, it seems I was wrong: on my Windows machine the previous code saves a file with \r\n as the separator. This code does more literally what you ask for:

xx <- readBin("toy_text.TXT", what = "raw",
              n = 32*11*10)

# change 0d (= \r) to 0a (= \n)
xx[xx == 0x0d] <- as.raw(0x0a)

writeBin(xx,
         "toy_text_n2.TXT")

Using the WSL command line to check the content of the file:

$ hexdump -C toy_text.TXT | head
00000000  32 31 20 36 20 31 36 30  20 31 31 30 20 33 2e 39  |21 6 160 110 3.9|
00000010  20 32 2e 36 32 20 31 36  2e 34 36 20 30 20 31 20  | 2.62 16.46 0 1 |
00000020  34 20 34 0d 32 31 20 36  20 31 36 30 20 31 31 30  |4 4.21 6 160 110|
00000030  20 33 2e 39 20 32 2e 38  37 35 20 31 37 2e 30 32  | 3.9 2.875 17.02|
00000040  20 30 20 31 20 34 20 34  0d 32 32 2e 38 20 34 20  | 0 1 4 4.22.8 4 |
00000050  31 30 38 20 39 33 20 33  2e 38 35 20 32 2e 33 32  |108 93 3.85 2.32|
00000060  20 31 38 2e 36 31 20 31  20 31 20 34 20 31 0d 32  | 18.61 1 1 4 1.2|
00000070  31 2e 34 20 36 20 32 35  38 20 31 31 30 20 33 2e  |1.4 6 258 110 3.|
00000080  30 38 20 33 2e 32 31 35  20 31 39 2e 34 34 20 31  |08 3.215 19.44 1|
00000090  20 30 20 33 20 31 0d 31  38 2e 37 20 38 20 33 36  | 0 3 1.18.7 8 36|
$ hexdump -C toy_text_n.TXT | head
00000000  32 31 20 36 20 31 36 30  20 31 31 30 20 33 2e 39  |21 6 160 110 3.9|
00000010  20 32 2e 36 32 20 31 36  2e 34 36 20 30 20 31 20  | 2.62 16.46 0 1 |
00000020  34 20 34 0d 0a 32 31 20  36 20 31 36 30 20 31 31  |4 4..21 6 160 11|
00000030  30 20 33 2e 39 20 32 2e  38 37 35 20 31 37 2e 30  |0 3.9 2.875 17.0|
00000040  32 20 30 20 31 20 34 20  34 0d 0a 32 32 2e 38 20  |2 0 1 4 4..22.8 |
00000050  34 20 31 30 38 20 39 33  20 33 2e 38 35 20 32 2e  |4 108 93 3.85 2.|
00000060  33 32 20 31 38 2e 36 31  20 31 20 31 20 34 20 31  |32 18.61 1 1 4 1|
00000070  0d 0a 32 31 2e 34 20 36  20 32 35 38 20 31 31 30  |..21.4 6 258 110|
00000080  20 33 2e 30 38 20 33 2e  32 31 35 20 31 39 2e 34  | 3.08 3.215 19.4|
00000090  34 20 31 20 30 20 33 20  31 0d 0a 31 38 2e 37 20  |4 1 0 3 1..18.7 |
$ hexdump -C toy_text_n2.TXT | head
00000000  32 31 20 36 20 31 36 30  20 31 31 30 20 33 2e 39  |21 6 160 110 3.9|
00000010  20 32 2e 36 32 20 31 36  2e 34 36 20 30 20 31 20  | 2.62 16.46 0 1 |
00000020  34 20 34 0a 32 31 20 36  20 31 36 30 20 31 31 30  |4 4.21 6 160 110|
00000030  20 33 2e 39 20 32 2e 38  37 35 20 31 37 2e 30 32  | 3.9 2.875 17.02|
00000040  20 30 20 31 20 34 20 34  0a 32 32 2e 38 20 34 20  | 0 1 4 4.22.8 4 |
00000050  31 30 38 20 39 33 20 33  2e 38 35 20 32 2e 33 32  |108 93 3.85 2.32|
00000060  20 31 38 2e 36 31 20 31  20 31 20 34 20 31 0a 32  | 18.61 1 1 4 1.2|
00000070  31 2e 34 20 36 20 32 35  38 20 31 31 30 20 33 2e  |1.4 6 258 110 3.|
00000080  30 38 20 33 2e 32 31 35  20 31 39 2e 34 34 20 31  |08 3.215 19.44 1|
00000090  20 30 20 33 20 31 0a 31  38 2e 37 20 38 20 33 36  | 0 3 1.18.7 8 36|

Note how the 4th byte of the 3rd line is 0d in the original file, 0a in the last file, but 0d 0a in the middle one.
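
A likely explanation for the 0d 0a in the middle file: when writeLines() is given a file name, it opens a text-mode connection, and on Windows text-mode output translates "\n" into "\r\n". The sketch below assumes that is the cause and opens the output connection in binary mode ("wb") instead, so that sep is written byte for byte. The temp-file paths are placeholders for illustration only.

```r
# Hypothetical paths, for illustration only
in_path  <- tempfile(fileext = ".TXT")
out_path <- tempfile(fileext = ".TXT")

# Toy input with CR-only line endings, as in the question
write.table(mtcars, file = in_path,
            col.names = FALSE, row.names = FALSE,
            quote = FALSE, eol = "\r")

# readLines() accepts LF, CRLF or CR as the end-of-line marker
mylines <- readLines(in_path)

# Open the output in binary mode so "\n" is not translated to "\r\n"
con <- file(out_path, open = "wb")
writeLines(mylines, con = con, sep = "\n")
close(con)

# Count the CR (0d) and LF (0a) bytes that ended up in the output
bytes <- readBin(out_path, what = "raw", n = file.size(out_path))
sum(bytes == as.raw(0x0d))
sum(bytes == as.raw(0x0a))
```

If the assumption holds, the first count should be zero and the second should equal the number of lines, on Windows as well as on Unix-like systems.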


@AlexisW Thanks a lot! Your code works perfectly for small datasets. But when I apply it to a larger dataset, e.g. ggplot2::diamonds, the output file is only 4 KB.

Please consider the following toy data.

write.table(ggplot2::diamonds, file = "toy_text.TXT",
            col.names = FALSE,row.names = FALSE,
            quote=FALSE, eol = "\r")

Answer to your question

Yes, that's the meaning of the n argument in

xx <- readBin("toy_text.TXT", what = "raw",
              n = 32*11*10)

The way I understand it, when you call readBin(), R first reserves memory for n items. It then reads the content of the file into that memory and stops either when it reaches the end of the file (EOF) or when it has read n items, whichever comes first.

So you need to guesstimate the size of the file before you start reading, overestimating the real size. That's what I did with n=32*11*10, because I knew the file should contain a 32 x 11 data frame, with typically less than 10 bytes per field.
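
To make the role of n concrete, here is a small sketch (the temp file and its contents are made up for illustration). readBin() returns at most n items, and fewer if the file ends first, so an underestimate silently truncates the data while an overestimate is harmless.

```r
# Hypothetical 100-byte file, for illustration only
path <- tempfile()
writeBin(as.raw(rep(0x41, 100)), path)

too_small <- readBin(path, what = "raw", n = 10)    # underestimate
too_large <- readBin(path, what = "raw", n = 1000)  # overestimate

length(too_small)  # capped at n
length(too_large)  # stops at the end of the file
```

With n = 32*11*10 the read stops after 3520 bytes, which is consistent with the roughly 4 KB output reported for the diamonds file.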

Now if you really don't know anything about the file beforehand, you could try using file.size().

Recommended

Anyway, working with binary being a bit of a pain, I strongly recommend you stick with string functions. I still don't know why my previous writeLines() didn't work, but you can get it with:

xx <- readr::read_lines("toy_text.TXT")

xx2 <- stringr::str_replace_all(xx, "\r", "\n")
readr::write_lines(xx2, "toy_text2.TXT")

And in that case:

$ hexdump -C toy_text.TXT | head
00000000  32 31 20 36 20 31 36 30  20 31 31 30 20 33 2e 39  |21 6 160 110 3.9|
00000010  20 32 2e 36 32 20 31 36  2e 34 36 20 30 20 31 20  | 2.62 16.46 0 1 |
00000020  34 20 34 0d 32 31 20 36  20 31 36 30 20 31 31 30  |4 4.21 6 160 110|
                    ^
$ hexdump -C toy_text2.TXT | head
00000000  32 31 20 36 20 31 36 30  20 31 31 30 20 33 2e 39  |21 6 160 110 3.9|
00000010  20 32 2e 36 32 20 31 36  2e 34 36 20 30 20 31 20  | 2.62 16.46 0 1 |
00000020  34 20 34 0a 32 31 20 36  20 31 36 30 20 31 31 30  |4 4.21 6 160 110|
                    ^
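
If WSL or hexdump is not at hand, a similar check can be sketched in R itself by counting the CR (0x0d) and LF (0x0a) bytes in a file. count_eol_bytes() is a helper name made up for this example, not part of any package.

```r
# Hypothetical helper: count CR and LF bytes in a file
count_eol_bytes <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  c(CR = sum(bytes == as.raw(0x0d)),
    LF = sum(bytes == as.raw(0x0a)))
}

# Example on a throwaway file with CR-only line endings
demo_path <- tempfile(fileext = ".TXT")
writeBin(charToRaw("a b\rc d\r"), demo_path)
counts <- count_eol_bytes(demo_path)
counts
```

A correctly converted file should show zero CR bytes and one LF byte per line.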

@AlexisW many thanks for the suggestions. I find them really useful.

When I assign a large number to n, the binary solution works as desired.
Could you please show me how I can use file.size() to set n?

The size of the file on disk should correspond to the number of bytes in that file (and thus the number of bytes that need to be read). See the examples at the end.

But I would still add:

  • It might be a good idea to overestimate a bit more in case something weird happens, e.g. n = 10*file.size("myfile.TXT").
  • In any case, working directly with binary is more error-prone; the solution above with read_lines() and write_lines() is probably always preferable.

write.table(ggplot2::diamonds, file = "toy_text_long.TXT",
            col.names = FALSE,row.names = FALSE,
            quote=FALSE, eol = "\r")


write.table(mtcars, file = "toy_text_short.TXT",
            col.names = FALSE,row.names = FALSE,
            quote=FALSE, eol = "\r")


# read the short one correctly
read_as_bin_to_text <- readBin("toy_text_short.TXT",
                               what = "raw",
                               n = file.size("toy_text_short.TXT")) |>
  rawToChar() |>
  strsplit("\r") |>
  (\(.x) .x[[1]])()

read_as_text <- readLines("toy_text_short.TXT")

all.equal(read_as_bin_to_text, read_as_text)
#> [1] TRUE


# read the long one, but with the size of the short one (wrong)
read_as_bin_to_text <- readBin("toy_text_long.TXT",
                               what = "raw",
                               n = file.size("toy_text_short.TXT")) |>
  rawToChar() |>
  strsplit("\r") |>
  (\(.x) .x[[1]])()

read_as_text <- readLines("toy_text_long.TXT")

all.equal(read_as_bin_to_text, read_as_text)
#> [1] "Lengths (28, 53940) differ (string compare on first 28)"
#> [2] "1 string mismatch"


# read the long one correctly
read_as_bin_to_text <- readBin("toy_text_long.TXT",
                               what = "raw",
                               n = file.size("toy_text_long.TXT")) |>
  rawToChar() |>
  strsplit("\r") |>
  (\(.x) .x[[1]])()

read_as_text <- readLines("toy_text_long.TXT")

all.equal(read_as_bin_to_text, read_as_text)
#> [1] TRUE

Created on 2022-05-11 by the reprex package (v2.0.1)
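
Putting the pieces of this thread together, the whole CR to LF conversion in binary mode could be sketched as below. cr_to_lf() is a hypothetical helper name. It sizes the read with file.size() as suggested above and, like the Python code in the question, replaces every 0x0d byte, so it assumes the file has CR-only line endings (a CRLF file would end up with doubled line feeds).

```r
# Hypothetical helper: rewrite a CR-terminated file with LF endings
cr_to_lf <- function(in_path, out_path) {
  # file.size() gives the exact number of bytes to read
  bytes <- readBin(in_path, what = "raw", n = file.size(in_path))
  # Replace every CR (0x0d) with LF (0x0a)
  bytes[bytes == as.raw(0x0d)] <- as.raw(0x0a)
  # writeBin() writes the bytes as-is, with no end-of-line translation
  writeBin(bytes, out_path)
  invisible(out_path)
}

# Usage on a throwaway copy of the toy data
in_path  <- tempfile(fileext = ".TXT")
out_path <- tempfile(fileext = ".TXT")
write.table(mtcars, file = in_path,
            col.names = FALSE, row.names = FALSE,
            quote = FALSE, eol = "\r")
cr_to_lf(in_path, out_path)

# Check that no CR byte survived the conversion
converted <- readBin(out_path, what = "raw", n = file.size(out_path))
any(converted == as.raw(0x0d))
```

Because the replacement is byte for byte, the output has exactly the same size as the input, which gives a cheap sanity check after converting a large file.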


@AlexisW Thank you again for the detailed explanations! I have learned many new things. Thank you!
