compress_fst/decompress_fst fails with very large files

First, some exposition on how I got here, to head off questions about why I'm not using write_fst and read_fst. I have a very large data.frame that changes daily and is used by a Shiny app. The source data is stored in a private S3 bucket that the app has read-only access to. I was using the very clever aws.s3::s3readRDS function to pull this data.frame directly from the S3 bucket into memory for the app, without having to write to disk. The problem was that initialization was taking a very long time (~2 minutes), so I went looking for faster read methodologies, which led me to fst.
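
For reference, the baseline read looked roughly like this (the bucket and object names here are placeholders, not the real ones):

library(aws.s3)

# baseline approach: pull the serialized data.frame straight from S3 into memory
test_df <- s3readRDS(object = "test_df", bucket = "my-bucket")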

One of fst's core features is writing the file to disk in a format that supports random access, but I'm simply interested in its speed advantage when compressing/decompressing and serializing/unserializing in memory. I came up with a solution that cut the read time in half (yay), but it has been failing on very large data.frames (> 3.8 GB).

I've tested this in a variety of environments (Windows 10 machine, 32 GB RAM; Ubuntu 18.04, 64 GB RAM; R 3.6.2 in both cases) with the same result:

 library(fst)
 library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
 
 # big dataframe creation
 test_df <- data.frame(replicate(26, sample(letters, 20000000, replace = TRUE)), stringsAsFactors = FALSE)
 
 # data.frame memory footprint
 format(object.size(test_df), units = "auto")
#> [1] "3.9 Gb"
 
 # create compressed binary in memory
 df_compressed <- serialize(test_df, NULL) %>%
     compress_fst()
 
 # decompress and unserialize to restore
 df_decompressed <- decompress_fst(df_compressed) %>%
     unserialize()
#> Error in unserialize(.): ReadItem: unknown type 0, perhaps written by later version of R
 
 ## WORKS WITH SLIGHTLY SMALLER DATAFRAME
 smaller_df <- sample_n(test_df, nrow(test_df)*0.9)
 
 # create compressed binary in memory
 df_compressed <- serialize(smaller_df, NULL) %>%
     compress_fst()
 
 # decompress and unserialize to restore
 df_decompressed <- decompress_fst(df_compressed) %>%
     unserialize()
 
 # ensure dataframe has not changed in the process
 all_equal(smaller_df, df_decompressed)
#> [1] TRUE

Created on 2020-02-26 by the reprex package (v0.3.0)
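
In case it helps with debugging, here is a rough diagnostic sketch (nothing I've confirmed, just a way to check whether the corruption happens in the compress/decompress step rather than in serialize/unserialize):

# round-trip the raw vector and compare it to the original to see whether
# compress_fst()/decompress_fst() is where the data gets mangled
raw_original  <- serialize(test_df, NULL)
raw_roundtrip <- decompress_fst(compress_fst(raw_original))

length(raw_original)                    # bytes going in
length(raw_roundtrip)                   # bytes coming back
identical(raw_original, raw_roundtrip)  # FALSE would implicate the compression step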

For those who are curious, here are my S3 write/read functions:

library(aws.s3)

# S3 write binary function
s3_fst_bin_write <- function(data, bucket, compress_lvl = 50) {
    compressed <- serialize(data, NULL) %>% 
        compress_fst(compressor = "ZSTD", compression = compress_lvl)
    
    s3HTTP(
        verb = "PUT", 
        bucket = bucket,
        path = paste0('/', deparse(substitute(data))),
        request_body = compressed,
        verbose = FALSE,
        show_progress = TRUE)
    
    invisible(gc())
}

# S3 read binary function
s3_fst_bin_read <- function(data, bucket) {
    get_object(data, bucket = bucket) %>% 
        decompress_fst() %>% 
        unserialize()
}
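
Usage looks roughly like this ("my-bucket" is a placeholder bucket name). Note that the write function keys the object by the name of the variable passed in, so it can be read back under that same name:

# write test_df to the bucket under the key "test_df", then read it back
s3_fst_bin_write(test_df, bucket = "my-bucket")
restored <- s3_fst_bin_read("test_df", bucket = "my-bucket")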

Ultimately, I'm open to alternative methodologies that would allow me to read from S3 directly into memory faster than aws.s3::s3readRDS.

You could try data.table's fread() & fwrite():
https://h2oai.github.io/db-benchmark/

I don't know how well this would work with AWS.


Using the new(ish) aws.s3::s3write_using/s3read_using methodology with data.table::fwrite/fread yields about a 2x speed advantage over s3saveRDS/s3readRDS, roughly in line with the performance improvement I was seeing with fst. For my purposes, this solution provides the outcome I was looking for. Thanks for the suggestion. Here is some sample code:

# data.table fwrite/fread solution
library(data.table)
library(aws.s3)
library(dplyr)

# big dataframe creation
test_df <- data.frame(replicate(26, sample(letters, 20000000, replace = TRUE)), stringsAsFactors = FALSE)

# data.table methodology
s3write_using(
    test_df,
    FUN = fwrite,
    bucket = "bucket_name",
    object = "test_df"
)

retrieved <- s3read_using(
    FUN = fread, 
    bucket = "bucket_name",
    object = "test_df"
)
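
One caveat worth checking: fwrite()/fread() round-trips through CSV, and fread() returns a data.table rather than a plain data.frame, so it's worth confirming the values survived (a quick sketch):

# sanity check that the CSV round trip preserved the data
# (coerce the data.table back to a data.frame before comparing)
all_equal(test_df, as.data.frame(retrieved))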
