I am happily using read_csv_chunked to filter, on the fly, a subset of data from a 9GiB file.
The relevant callback filters rows on matching values in a couple of columns (one of which comes from a …).
I have "randomly" chosen the value for
chunk_size to 20000.
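For reference, this is roughly what I am doing (the file name, column names, and filter condition are placeholders, not my real data):

    library(readr)
    library(dplyr)

    keep_values <- c("A", "B")  # hypothetical set of values to match against

    filtered <- read_csv_chunked(
      "big_file.csv",  # hypothetical ~9GiB input file
      callback = DataFrameCallback$new(function(chunk, pos) {
        # keep only rows whose col1/col2 match the values of interest
        chunk %>% filter(col1 %in% keep_values, col2 == "some_value")
      }),
      chunk_size = 20000  # the "randomly" chosen value
    )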
Are there any heuristics or a rationale for setting
chunk_size to a value that minimizes the time needed to read the data?
My guess is that it partly depends on the "size" of each row and on the complexity of the applied filter (memory requirements and CPU cost), but I am curious to hear from other users...
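Failing a general rule, the only approach I can think of is to time the same read with a few candidate chunk sizes, along these lines (again, file and column names are placeholders):

    library(readr)

    sizes <- c(10000, 20000, 50000, 100000, 250000)

    # elapsed seconds for each candidate chunk_size
    timings <- sapply(sizes, function(n) {
      system.time(
        read_csv_chunked(
          "big_file.csv",
          callback = DataFrameCallback$new(function(chunk, pos) {
            chunk[chunk$col1 %in% c("A", "B"), ]
          }),
          chunk_size = n
        )
      )["elapsed"]
    })

    data.frame(chunk_size = sizes, elapsed_sec = timings)

But each run on a 9GiB file is slow, so I would rather start from a sensible value than brute-force it.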