Anyone able to help with disk.frames?

Goal

I am currently running a semi-complicated workflow that quickly eats up all of my RAM. I have numerous time-series-type models, and each model outputs a unique dataframe for each timestep. As a result, each model produces at least 20-70 dataframes by the time it completes (not my design, so I am stuck with it). To perform a model-to-model comparison, I take each of those dataframes from their respective model, select what I need, and join them into a single dataframe per model. My workflow works great until I am comparing 20+ models.

This is why I am trying to find a way around my RAM issue. Below I have provided a single model's worth of data and the chunk of code that converts the multiple dataframes per timestep into a single dataframe. If I change the listed dataframes to disk.frames, I think I can drastically reduce the strain on my computer, but I am struggling to figure out how to do this. If anyone has experience using disk.frames and is willing to help out, I would be much obliged.

Data

I haven't found a better way to share rds files, so for now here are the links to download mine. I know it's not clean, sorry, but short of posting massive chunks of code to reproduce the data, this is the best I've got.

pfl_data.rds

col_ids.rds

Current Workflow

Each rds here is a list of lists. pfl_data is a list of models, and each model is a list of the dataframes from each timestep within that model. col_ids is a list of listed values, corresponding to the timestep dataframes, that determine which columns I want to pull from each dataframe. So essentially I boil the dataframes down to two columns each, left join them by the first column, and rename the columns at the end.

# insert the quoted path (with the filename) to where you downloaded the rds file into the here() function
col_ids <- readr::read_rds(file = here::here())

# insert the quoted path (with the filename) to where you downloaded the rds file into the here() function
pfl_data <- readr::read_rds(file = here::here())

# for each model, keep only the key column (column 2) and the column of
# interest (given by col_ids) from every timestep dataframe, dropping the first
new_pfl_data <- furrr::future_map2(pfl_data, col_ids, function(pfl, col){
  purrr::map2(pfl[-1], col[-1],  ~ .x[c(2, .y)])
}) %>%
  # then left join the two-column dataframes on the key and rename the result
  furrr::future_map(function(x){
    purrr::reduce(x, dplyr::left_join, by = "V2") %>%
      stats::setNames(c("Y", paste0("Z", 2:ncol(.))))
  })

I was thinking about going upstream and changing pfl_data from a list of listed dataframes into a list of listed disk.frames with as.disk.frame(), but I don't think that would work because I don't believe I could isolate unique columns from each disk.frame if they are listed like this. Thanks for taking the time to look this over!

As an alternative to disk.frame, with which I have no experience, I offer some general observations and an alternative toolchain.

Let's start with the data. The pfl_data object is a list in which the data of real interest are tibbles embedded several indices deep.

> head(pfl_data[[1]][2][1][[1]],1)
# A tibble: 1 x 6
     V1    V2    V3    V4    V5    V6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   250  1000     0  49.7  154.  173.
> head(pfl_data[[1]][3][1][[1]],1)
# A tibble: 1 x 9
     V1    V2    V3     V4    V5    V6    V7    V8    V9
  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   250  1000     0 -3797.  190.  294.  313.  323.  323.
> head(pfl_data[[1]][4][1][[1]],1)
# A tibble: 1 x 11
     V1    V2    V3     V4    V5    V6    V7    V8    V9   V10   V11
  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   250  1000     0 -3673.   314   417  436.  446.  446.  453.  454.

Each embedded tibble appears to have 1,001 rows and a variable number of columns. Per the issue description, only two columns are needed from each data frame. The list itself fits comfortably in memory

> object.size(pfl_data)
1553736 bytes

so the final object, which is a subset, should also fit. However, presumably because of a proliferation of intermediate objects, it does not.

Although we come to appreciate R's philosophy of lazy evaluation, lazily bringing data into available memory is not at the top of our minds until we bump up against the constraints: the amount of dynamic memory available, operating-system limits on per-process access to it, failure of the OS to release it, or some combination.

Accordingly, I would leave pfl_data out of RAM and extract only the pieces needed to accomplish the boiling down, since most of it is pure surplusage. I would also do the join out of memory. The obvious tool is an SQL database.

For this data set, SQLite is probably adequate, and MySQL/MariaDB, Postgres, or another relational database manager is definitely adequate by a very large margin.
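
To make that concrete, here is a minimal sketch, assuming the rds files have already been read in as above and looking only at the first model; the file name pfl_model1.sqlite and the step_ table names are mine, not anything from the original workflow. Each timestep tibble is written once to an on-disk SQLite file, after which pfl_data can be dropped from RAM.

# write each timestep tibble of model 1 into an SQLite file on disk,
# skipping the first element just as the original workflow does
library(DBI)
con <- DBI::dbConnect(RSQLite::SQLite(), "pfl_model1.sqlite")
purrr::iwalk(pfl_data[[1]][-1], function(df, i) {
  DBI::dbWriteTable(con, paste0("step_", i), df, overwrite = TRUE)
})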

The {dbplyr} package allows you to work with data stored out of RAM as if it were in memory, using the same {dplyr} commands for these basic operations of select and join.
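
Continuing the sketch under the same assumptions (the step_ tables created above, and column positions that line up with the V1, V2, ... names), the select and join can be expressed lazily so that the database does the work and only the small final result is pulled into R:

# build lazy references, keep only the key column V2 plus the column of
# interest for each timestep, and let SQLite perform the left joins
cols <- col_ids[[1]][-1]
lazy_tbls <- purrr::imap(cols, function(col, i) {
  dplyr::tbl(con, paste0("step_", i)) %>%
    dplyr::select(dplyr::all_of(c("V2", paste0("V", col))))
})
new_model1 <- purrr::reduce(lazy_tbls, dplyr::left_join, by = "V2") %>%
  dplyr::collect() %>%   # only the final, already-small result enters RAM
  stats::setNames(c("Y", paste0("Z", 2:ncol(.))))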

This approach exemplifies a helpful way of approaching R: the interaction of three objects, an existing object x, a desired object y, and a function f that will return a value of y given x as an argument. In other words, school algebra: f(x) = y. Any of the objects can be composites.

For this case, the objects are readily identifiable. pfl_data is x, a list holding 20-70 tibbles per model, and y is the object desired to assume the role of x for further analysis. f will be a composite function, as follows

g: g(x, y, z), to query tibbles x and y and join them by key z
h: h(g(x, y, z)), to perform g and save the result back to some object in or out of memory; a sketch of both follows.
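
Here is how g and h might look, built on the same assumptions as the SQLite sketch above (the step_ table-name scheme and con are carried over, and dplyr::compute() is used to materialise the joined result as a new table in the database rather than in RAM):

# g: build the lazy query that selects the key column plus one column of
#    interest from each table and left joins them all by the key
g <- function(con, table_names, cols, key = "V2") {
  purrr::map2(table_names, cols, function(nm, col) {
    dplyr::tbl(con, nm) %>%
      dplyr::select(dplyr::all_of(c(key, paste0("V", col))))
  }) %>%
    purrr::reduce(dplyr::left_join, by = key)
}

# h: execute g's query and save the result back out of memory, as a table
#    in the same database
h <- function(lazy_query, out_name) {
  dplyr::compute(lazy_query, name = out_name, temporary = FALSE)
}

# usage, for model 1 (cols as defined in the earlier sketch):
h(g(con, paste0("step_", seq_along(cols)), cols), "model1_joined")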
