I'm looking for an efficient way to serialize R nested dataframes (created with tidyr in this case) to a binary file format like Parquet, Avro, or ORC.
The use-case for this is BigQuery ingestion, where nested/repeated fields are helpful data structures. BigQuery requires one of Avro, Parquet, ORC or newline-delimited JSON (NDJSON) when using nested data. I'm working with some rather large raw datasets (in sum >20 TB) and NDJSON representations inflate the size even further.
So far I've tried the R
arrow package (details below) and investigated using
sparklyr.nested, but couldn't get an environment running. (In any case, my goal would be to write these files directly from R, without needing to pass it all through a Spark instance.)
I've included some rough example data below on what these nested data structures could look like. The R
arrow package has
write_parquet with a schema autodetect that works perfectly on unnested data, but not on the list-column after nesting. I've showed the example data in three steps: unnested long form, a 'level1' nested form, and a 'level2' nested form with a more complex schema. I have these questions in mind:
- Is it possible to write nested dataframes with
arrow::write_parquet, e.g. using a custom schema? (I've tried a few approaches to this but I'm not really comprehending the documentation for
arrow::schemaand its data types.) An example of a schema for a nested dataframe would be most helpful.
- Assuming that #1 is possible at all, are there good practices for creating schemas on more complex objects such as multiply-nested list-columns? If I can get parquet-writing working, I'll use this approach over a wide range of schemas of varying complexity, so being able to construct a schema somewhat programmatically would be helpful.
(I've added the tag
sparklyr here even though my goal is to avoid going through Spark, because that community seems to deal with these formats and associated read/write issues more than in other corners of R...)
Thank you for taking a look!
library(tidyverse) library(arrow) library(jsonify) set.seed(1234) long_df <- tidyr::expand_grid( ABC = LETTERS, xyz = letters, num = 1:100 ) %>% dplyr::mutate(ftr1 = runif(n()), ftr2 = rpois(n(), 100)) nested_frame_level1 <- long_df %>% tidyr::nest(nest_level1 = c(num, ftr1, ftr2)) nested_frame_level2 <- nested_frame_level1 %>% tidyr::nest(nest_level2 = c(xyz, nest_level1)) arrow::write_parquet(long_df, sink = "long_df.parquet") # No error, writes (and reads back in) just fine arrow::write_parquet(nested_frame_level1, sink = "nested_frame_level1.parquet") # Result: Error in Table__from_dots(dots, schema) : cannot infer type from data arrow::write_parquet(nested_frame_level2, sink = "nested_frame_level2.parquet") # Same result: Error in Table__from_dots(dots, schema) : cannot infer type from data # the NDJSON option is pretty fast but not storage-efficient nested_frame_level2_ndjson_list <- nested_frame_level2 %>% # Split by row split(f = seq_along(.[])) %>% # Create individual JSON objects per df row in a list purrr::map(~jsonify::to_json(.x, unbox = TRUE)) %>% # fwrite can interpret this list and write it a bit faster than writeLines data.table::fwrite("nested_frame_level2.ndjson", sep = "\n") # Another attempt: use arrow's JSON parser to learn a schema on samples # of the data records we want to write to Parquet nested_frame_level1_json <- nested_frame_level1 %>% jsonify::to_json(unbox = TRUE) %>% writeLines("nested_frame_level1.json") nested_frame_level2_json <- nested_frame_level2 %>% jsonify::to_json(unbox = TRUE) %>% writeLines("nested_frame_level2.json") # Reading the JSONs into Arrow also errors out, although with distinct error messages: arrow::read_json_arrow("nested_frame_level1.json") # Result: Error in json___TableReader__Read(self) : # Invalid: JSON parse error: A column changed from object to number arrow::read_json_arrow("nested_frame_level2.json") # Result: Error in json___TableReader__Read(self) : OK