Heuristic for `chunk_size` value in `readr::read_csv_chunked`

espinielli · September 28, 2018, 8:23am

I am happily using `read_csv_chunked` to filter, on the fly, a subset of data from a 9 GiB file.
The callback filters rows on matching values in a couple of columns (one of which is created by a `mutate`).

I have "randomly" set `chunk_size` to 20000.
Is there a heuristic or rationale for choosing a `chunk_size` that minimizes the time needed to read the data?

My guess is that it partially depends on the "size" of each row and on the complexity of the applied filter (memory requirements and CPU computation), but I am curious to hear from other users...
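
For reference, here is a minimal sketch of the kind of call I mean. The file, the column names (`id`, `x`, `y`, `y2`) and the filter condition are made up for illustration and are not my real data; the small `chunk_size` is only there so the toy file gets split into several chunks.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for the big file: a small CSV written to a temp path.
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:1000, x = rep(c("a", "b"), 500), y = 1:1000), tmp)

# Callback run on every chunk: derive a column, then keep only matching rows.
# `pos` is the row position of the chunk's first row; it is unused here.
keep_rows <- function(chunk, pos) {
  chunk %>%
    mutate(y2 = y * 2) %>%
    filter(x == "a", y2 > 1000)
}

# DataFrameCallback row-binds whatever each callback call returns.
result <- read_csv_chunked(
  tmp,
  callback = DataFrameCallback$new(keep_rows),
  chunk_size = 200,
  col_types = cols(id = col_integer(), x = col_character(), y = col_integer())
)
```

Only the filtered rows of each chunk are kept, so memory use is driven by `chunk_size` plus the size of the accumulated result rather than by the whole file.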

mara · September 28, 2018, 12:41pm

I think you might be missing a word here, and I'm not totally sure what it is.

You could run a few different options and benchmark them to see which is fastest.
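
Something along these lines might work as a rough sketch. The toy file, the `value` column, the `keep_rows` callback, and the candidate sizes are all placeholders I made up; you'd swap in your real file and callback, and timings on a tiny file like this won't tell you much about a 9 GiB one.

```r
library(readr)

# Hypothetical stand-in file so the sketch runs as-is.
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:50000, value = rep(c(0, 1), 25000)), tmp)

# Placeholder callback: keep rows where `value` equals 1.
keep_rows <- function(chunk, pos) chunk[chunk$value == 1, ]

# Time one full chunked read for a given chunk size.
time_chunk_size <- function(size) {
  elapsed <- system.time(
    out <- read_csv_chunked(
      tmp,
      callback = DataFrameCallback$new(keep_rows),
      chunk_size = size,
      col_types = cols(id = col_integer(), value = col_double())
    )
  )[["elapsed"]]
  data.frame(chunk_size = size, elapsed_sec = elapsed, rows_kept = nrow(out))
}

# Candidate sizes are arbitrary; pick a range that makes sense for your rows.
candidate_sizes <- c(1000, 10000, 50000)
timings <- do.call(rbind, lapply(candidate_sizes, time_chunk_size))
print(timings)
```

Checking that `rows_kept` is the same for every size is a cheap sanity check that the chunk size changes only the speed, not the result. Repeating each run a few times would give steadier numbers.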

There's also a section in Efficient R Programming, "Fast data reading", that you might take a look at:

https://csgillespie.github.io/efficientR/5-3-importing-data.html#fast-data-reading

espinielli · September 28, 2018, 1:56pm

Mara,
thank you for the feedback: I edited and completed the questions in my initial post.
You guessed correctly what I was asking for, so thank you for the link to "Efficient R Programming"... I will have a look.

espinielli · September 28, 2018, 2:29pm

The book does not really deal with chunked reading of data à la `read_csv_chunked`; rather, it suggests solutions for handling big files.

The nice thing about `read_csv_chunked` is that it can filter on the fly and retain only the small-ish part you need of a much-too-big-for-your-machine initial CSV file.

I'll patiently wait and see whether some other tidyverse users have had any experience with this.