Thanks for your response @martin.R
data.table::fread() can take a shell command via its cmd argument, so the input can be pre-filtered with grep.
Clever and pragmatic solution for many in my position. However, the result of grep must still be small enough to fit into memory, which might not be the case. It would still require downstream chunking in some cases, depending on the scale we work at.
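To make the grep idea concrete, here is a minimal sketch, assuming a Unix-like shell with grep on the PATH. The file and the pattern "chr7" are hypothetical placeholders; note that grep drops the header row, so column names are supplied manually:

```r
library(data.table)

# Build a small throwaway TSV so the example is self-contained
tmp <- tempfile(fileext = ".tsv")
writeLines(c("chrom\tpos", "chr7\t100", "chr8\t200", "chr7\t300"), tmp)

# fread()'s cmd argument runs the command and reads its stdout,
# so non-matching lines never enter R -- but the matches themselves
# must still fit in memory
dt <- fread(cmd = paste("grep 'chr7'", tmp),
            sep = "\t", col.names = c("chrom", "pos"))
```

Here `dt` holds only the two chr7 rows; the chr8 line is filtered out before R ever sees it.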
@nirgrahamuk - thanks for pointing me to readr's read_delim_chunked() function. I was not aware it existed. It definitely looks like a reasonable place to start, since it can operate on chunks.
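A minimal sketch of the chunked approach, using a hypothetical throwaway file: each chunk arrives as an ordinary data frame, you reduce it (filter, summarise, write out), and only the reduced results accumulate in memory. `DataFrameCallback` row-binds the per-chunk results at the end.

```r
library(readr)

# Self-contained example data
tmp <- tempfile(fileext = ".tsv")
write_tsv(data.frame(id = 1:10, value = (1:10) * 2), tmp)

# Called once per chunk; pos is the starting row of the chunk
per_chunk <- function(chunk, pos) {
  chunk[chunk$value > 10, ]  # keep only rows passing a filter
}

res <- read_delim_chunked(tmp,
                          callback = DataFrameCallback$new(per_chunk),
                          delim = "\t", chunk_size = 3)
```

With these toy values, `res` contains the five rows whose value exceeds 10, even though no more than three rows were ever held as raw input at once. `SideEffectChunkCallback` is the alternative when you only want to write results out rather than collect them.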
I think arrow could do it, although I might be misunderstanding your goal (@andresrcs)
Thanks andre. I think arrow is definitely part of the way there. If I understand correctly (and it's very possible I don't), depending on the file type your data is stored in, it won't be immediately read into R, and it can be operated on in batches with map_batches(). More info here. Not sure whether this work can be done in parallel.
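For what it's worth, here is a sketch of that batch-wise pattern, assuming a TSV on disk (the file here is a throwaway example). `open_dataset()` only scans metadata, so nothing is pulled into R up front; `map_batches()` then applies a function to each RecordBatch as it streams past. Exact return types have shifted across arrow versions, so treat this as a sketch rather than version-proof code:

```r
library(arrow)
library(dplyr)

# Self-contained example data
tmp <- tempfile(fileext = ".tsv")
write.table(data.frame(id = 1:10, value = (1:10) * 2), tmp,
            sep = "\t", row.names = FALSE, quote = FALSE)

# Lazily opens the file; no rows are read into R yet
ds <- open_dataset(tmp, format = "tsv")

# Reduce each batch to a small per-batch summary, then collect
summaries <- ds %>%
  map_batches(function(batch) {
    df <- as.data.frame(batch)
    data.frame(n = nrow(df), total = sum(df$value))
  }) %>%
  collect()
```

Each row of `summaries` describes one batch, so the full data never needs to be resident in R at once; the batch boundaries themselves are chosen by arrow.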
The sense I'm getting from all of these answers is that chunking of inputs is a solved problem in R (with arrow/readr). This solves our memory issues.
I'm not yet confident we have an obvious solution for the parallelization part of this question. I'll have to look more into whether arrow's parallelization applies to plain tsv inputs and, if so, how well my operations scale with host CPUs. It remains possible that arrow is the solution.
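One small thing I did find: arrow's C++ engine threads its scans and compute internally, and the R package exposes that knob via `cpu_count()` / `set_cpu_count()`. Whether a plain TSV scan actually parallelizes well is something to benchmark on real data, but at least the thread count is controllable:

```r
library(arrow)

cpu_count()       # number of threads arrow is currently willing to use
set_cpu_count(4)  # cap (or raise) it, e.g. on a shared host
```

This doesn't settle the scalability question, but it means any speedup arrow does provide can be tuned to the machine without restructuring the pipeline.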