Speed Up Data Analytics and Wrangling With Parquet Files

This is a companion discussion topic for the original entry at https://www.rstudio.com/blog/speed-up-data-analytics-with-parquet-files


Nice work!

I just have a naive comment, since I am not familiar with those packages, so forgive me if I missed something.

I see that the Parquet data was loaded first, and then only a few columns were selected out of the hundreds that are available, as in Task 1 for the X-Small data:


However, I do not see the CSV data being reduced in the same way, so those reads really do keep hundreds of columns. I think the col_types or col_select argument is missing from the read_csv() call.

This could explain in part why the CSV code is slower: it is reading around 50 times more columns.

Am I missing something?

I agree that things seem suspect.
They do identify the needed columns by building the claims_columns object from a cols_only() call, and that could be passed to read_csv() to limit the values read in. But after being defined, claims_columns appears to go unused; ditto provider_columns, etc.

This is very interesting. I'm using the qs format, which is also far faster than the (admittedly low baseline of) CSV files. The upside is that it can store R objects natively, such as lists, which is very useful; the downside is that you have to load the whole file. Parquet might be better for loading selected columns from a larger file. I ran a quick, off-the-cuff test of parquet vs. qs on a file with 500 million rows and 20 columns, and qs (with the slowest compression setting) produced a file one third the size of write_parquet (with default settings). I haven't tested speed yet.
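To illustrate the point about native R objects, here is a minimal sketch of a qs round trip. The object contents, file name, and the choice of the "archive" preset (the slowest, highest-compression setting mentioned above) are all illustrative:

```r
library(qs)

# qs serializes arbitrary R objects natively (nested lists included),
# which Parquet's columnar model cannot represent directly.
obj <- list(x = rnorm(1e4),
            meta = list(source = "demo", version = 1L))

path <- tempfile(fileext = ".qs")
qsave(obj, path, preset = "archive")  # slowest preset, best compression

restored <- qread(path)
identical(obj$meta, restored$meta)  # TRUE
```

Note that qread() always loads the entire object; there is no column-level access as with arrow::read_parquet().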

Have you thought about using data.table's fread() for the CSVs?

This would be a better baseline, I'd say. You can supply fread() with shell commands, such as sed and awk, which means you can select columns and filter rows before the data ever enters R. That makes CSV files suddenly much more interesting again for large data. I have yet to see people use fread's command-line capabilities to their full potential; I should probably write a blog post where I do exactly that.
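As a small self-contained sketch of that idea, the example below uses fread's cmd argument to filter rows with awk before R sees them. The file and column names are made up for the demo, and awk must be available on the PATH (Linux/macOS):

```r
library(data.table)

# Write a small demo CSV, then pre-filter it with awk via fread(cmd = ...).
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id     = 1:4,
                  status = c("PC", "XX", "PC", "XX"),
                  amount = c(10, 20, 30, 40)), tmp)

# NR == 1 keeps the header; $2 == "PC" filters on the second field,
# so only the two "PC" rows ever enter R.
dt <- fread(cmd = sprintf("awk -F',' 'NR == 1 || $2 == \"PC\"' %s", tmp))
```

On large files this avoids materializing unwanted rows in R at all, which is where the approach pays off.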

Also, there is the h5 format. I think a comparison with more possibilities would be beneficial.

Also, use the bench package for benchmarking, so you can see the RAM usage.
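For instance, bench::mark() reports a mem_alloc column alongside the timings. A minimal sketch (the data and the two expressions compared are just placeholders):

```r
library(bench)

# Compare two ways of summing a column; mem_alloc shows how much
# memory each expression allocated, not just how long it took.
df <- data.frame(x = runif(1e5))

results <- bench::mark(
  colSums_way = colSums(df["x"])[["x"]],
  sum_way     = sum(df$x),
  check = FALSE  # focus on timing/memory rather than result comparison
)
results[, c("expression", "median", "mem_alloc")]
```

The same pattern applied to the CSV vs. Parquet reads would show whether the CSV path is also paying a memory penalty for those extra columns.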

Thanks for the feedback; I can give some clarification. The published version is missing a piece of the readr::read_csv() call. To reduce the number of columns on read, the desired columns are identified with cols_only(), and the result is then passed to readr::read_csv() via col_types. That second part was omitted in the final publication (I believe this is being fixed). So the example for Task 1 should be:

claims_columns <-
  cols_only(CLAIM_STATUS_TYPE_CD = col_character(),
            CH_SUBM_PROVIDER_ID = col_double(),
            BNFT_TYPE_CD = col_character(),
            CH_REND_AMT = col_double())

provider_columns <-
  cols_only(provider_id = col_double(),
            provider_type = col_character(),
            benefit_description = col_character())

inner_join(
  read_csv("/home/data/CLAIM_HISTORY_DAY.csv", col_types = claims_columns) %>%
    filter(CLAIM_STATUS_TYPE_CD == "PC"),
  read_csv("/home/data/PROVIDER.csv", col_types = provider_columns),
  by = c("CH_SUBM_PROVIDER_ID" = "provider_id"))

I'd like to see a performance and storage comparison between bigmemory (file-backed) and Parquet.