Awk-like processing of large text file

Hi

Does anyone know of an existing package that lets you process large tabular files:

  1. Line by line / chunk by chunk, to avoid loading the full data into memory
  2. In a manner that leverages multiple cores (or could be parallelized)

I'd love to replace some of what I typically do using GNU parallel and awk with a pure R solution (Rcpp- and Rust-powered R packages included).

Example
Say we have a tsv file far too large to fit into memory.

[screenshot of an example TSV]

I'm looking for a package/approach that would let me create a new file containing the contents of column 1
when the string in column 2 is one of c("red", "green", "orange").

I know that dvd's answer on the following thread goes through the base R implementation, but I'd love to know if there's a more out-of-the-box solution supporting chunking/multithreading.
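For concreteness, the kind of chunked base R loop I have in mind looks roughly like this (untested sketch; file names, chunk size, and the assumption of a headerless, tab-separated file with at least two columns are all placeholders):

```r
# Rough sketch of chunked base R processing (untested).
# Assumes a headerless, tab-separated file with at least two columns.
con_in  <- file("big_file.tsv", open = "r")
con_out <- file("filtered.txt", open = "w")
keep <- c("red", "green", "orange")

repeat {
  lines <- readLines(con_in, n = 100000)       # read one chunk of lines
  if (length(lines) == 0) break                # stop at end of file
  fields <- strsplit(lines, "\t", fixed = TRUE)
  col1 <- vapply(fields, `[[`, character(1), 1)
  col2 <- vapply(fields, `[[`, character(1), 2)
  writeLines(col1[col2 %in% keep], con_out)    # keep column 1 where column 2 matches
}

close(con_in)
close(con_out)
```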

I would start here

I think arrow could do it, although I might be misunderstanding your goal

data.table::fread() allows the use of grep.
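Something along these lines (untested sketch; grep is only a coarse pre-filter here, so the exact column-2 check still happens in R, and V1/V2 are just fread's default names for a headerless file):

```r
library(data.table)

# Untested sketch: grep pre-filters the lines before they ever reach R,
# then the exact column-2 test is applied to what's left.
dt <- fread(cmd = "grep -E 'red|green|orange' big_file.tsv",
            sep = "\t", header = FALSE)
fwrite(dt[V2 %chin% c("red", "green", "orange"), .(V1)],
       "filtered.tsv", sep = "\t", col.names = FALSE)
```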

Thanks for your response @martin.R

data.table::fread() allows the use of grep.

Clever and pragmatic solution for many in my position. However, the result of grep must still be small enough to fit into memory, which might not be the case. It would still require downstream chunking in some cases, depending on the scale we work at.

@nirgrahamuk - thanks for pointing me to readr's read_delim_chunked() function. I was not aware it existed. It definitely looks like a reasonable place to start; I could operate on the data chunk by chunk.
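For my own reference, a sketch of what that might look like (untested; col1/col2 stand in for the real header, and the chunk size is arbitrary):

```r
library(readr)
library(dplyr)

out <- "filtered.tsv"
if (file.exists(out)) file.remove(out)

# Untested sketch: each chunk is filtered and appended to the output file,
# so only chunk_size rows are ever in memory at once.
read_delim_chunked(
  "big_file.tsv",
  delim = "\t",
  chunk_size = 100000,
  callback = function(chunk, pos) {
    chunk |>
      filter(col2 %in% c("red", "green", "orange")) |>
      select(col1) |>
      write_tsv(out, append = pos != 1, col_names = pos == 1)
  }
)
```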

I think arrow could do it, although I might be misunderstanding your goal (@andresrcs)

Thanks andre. I think arrow is definitely part of the way there. If I understand correctly (and it's very possible I don't), depending on the file type your data is stored in, it won't be immediately read into R and can be operated on in batches with map_batches. More info here. Not sure whether this work can be done in parallel.
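Jotting down roughly what I think that looks like (untested; assumes a reasonably recent arrow where open_tsv_dataset() and CSV output for write_dataset() are available, and col1/col2 are placeholder column names):

```r
library(arrow)
library(dplyr)

# Untested sketch: the dataset is only scanned, not loaded into memory, and the
# filtered result is streamed batch by batch to the output directory.
ds <- open_tsv_dataset("big_file.tsv")

ds |>
  filter(col2 %in% c("red", "green", "orange")) |>
  select(col1) |>
  write_dataset("filtered_output", format = "csv")
```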

The sense I'm getting from all of these answers is that chunking of inputs is a solved problem in R (with arrow/readr). This solves our memory issues.

I'm not yet confident we've got an obvious solution for the parallelization part of this question. I'll have to look more into whether arrow's parallelization applies to simple TSV inputs and, if so, how well my operations scale across the host's CPUs. It remains possible that arrow is the solution.

This might be a way to go also: the disk.frame package.

Currently, custom R code can't be executed in parallel over user-defined chunks, but there is work under way on that. However, dplyr-like operations are multi-threaded, as they are implemented in C++. So it depends on the specific processing you want to apply.
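For what it's worth, you can inspect and adjust the size of the CPU thread pool those C++ kernels use; as far as I know, cpu_count() and set_cpu_count() are the relevant helpers:

```r
library(arrow)

# Inspect or cap the number of threads arrow's C++ compute layer will use.
cpu_count()
set_cpu_count(8)  # e.g. limit to 8 threads
```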


Thanks for clarifying @andresrcs!

It's great to know the dplyr-like operations are already multi-threaded. That's really the main functionality I'm after right now.

Basically, right now I'm just trying to see how close I can get (in terms of speed and scalability) to a really simple command-line parallel awk filter on a basic TSV with native R solutions. Might be time to actually do some testing. I've been meaning to play around with arrow for a while now.

Thanks @nirgrahamuk for suggesting the disk.frame package. It certainly solves the memory issue. It's currently soft-deprecated in favor of arrow, so I'll probably start there.
