List Columns and Memory

davidhen · September 29, 2017, 10:06pm

I'm going to be conducting an analysis with a pretty large dataset (~134 million observations of 8 variables) and am expecting some memory difficulties with data wrangling (I'll have 64GB RAM available).

Ideally I'd like to use a list column for a lot of the data which would drop the number of rows down to about 1.1 million.

My question is whether this gives any memory advantage - or do the list columns take up more memory?

nick · September 29, 2017, 10:35pm

So, will just one column be a list column, or will every (or almost every) column be a list column? The first case would be more likely to save memory. I would try it out with a subset of your data, and check the results using object.size or pryr::object_size.

As a toy example, using a tibble with one integer column A and one numeric list column B with 100 entries per row of A, unnesting increases the data size by about 40%:

suppressPackageStartupMessages(library(tidyverse))
df_list <- tibble(A = 1:1000L,
                  B = rerun(1000, rnorm(100)))
df_list
#> # A tibble: 1,000 x 2
#>        A           B
#>    <int>      <list>
#>  1     1 <dbl [100]>
#>  2     2 <dbl [100]>
#>  3     3 <dbl [100]>
#>  4     4 <dbl [100]>
#>  5     5 <dbl [100]>
#>  6     6 <dbl [100]>
#>  7     7 <dbl [100]>
#>  8     8 <dbl [100]>
#>  9     9 <dbl [100]>
#> 10    10 <dbl [100]>
#> # ... with 990 more rows
object.size(df_list)
#> 852896 bytes
df_long <- unnest(df_list)
df_long
#> # A tibble: 100,000 x 2
#>        A          B
#>    <int>      <dbl>
#>  1     1  1.7545358
#>  2     1 -0.2732362
#>  3     1  0.9484000
#>  4     1 -0.8999221
#>  5     1  1.3951232
#>  6     1  0.9915580
#>  7     1 -0.3650540
#>  8     1 -0.1489101
#>  9     1  1.4596137
#> 10     1  1.5815404
#> # ... with 99,990 more rows
object.size(df_long)
#> 1200896 bytes

So, it might help, but as I said, you should test it out to see if it's worthwhile in your case.

martin.R · September 30, 2017, 12:20am

If you are pushed to the limit on memory you could try using data.table. Its speed is mentioned most, but I think its memory efficiency is its strongest point.

I have managed over 630 million records (3 columns, I think) with 32 GB RAM using it without any crashes.

cderv · September 30, 2017, 8:10am

You may be interested in this ongoing discussion here

data.table was already mentioned by @martin.R - I can confirm that its memory efficiency is its strongest point. data.table has a special syntax and mechanism for working on data by reference, which limits the number of copies made in memory. That said, its syntax is quite different from dplyr's.
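A minimal sketch of what working by reference looks like, on made-up data (the table and column names here are hypothetical): `:=` adds a column in place, and `address()` can be used to check that the table was not copied to a new location.

```r
library(data.table)

# Toy table; `id` and `value` are hypothetical column names
dt <- data.table(id = rep(1:3, each = 2),
                 value = c(1, 2, 3, 4, 5, 6))

addr_before <- address(dt)

# `:=` adds the new column to the existing table instead of building a copy
dt[, value_sq := value^2]

addr_after <- address(dt)

# Compare the two addresses to check whether the table was reallocated
identical(addr_before, addr_after)

# Grouped summaries use the same bracket syntax
group_means <- dt[, .(mean_value = mean(value)), by = id]
group_means
```

The same operation with `dplyr::mutate()` returns a new data frame, which is where the difference in memory use can come from on large tables.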

For memory efficiency in the tidyverse, I think you could try putting your data in a database. dplyr works very well with database connections; see the RStudio website about databases. Using a SQLite database and dplyr verbs can help you deal with big datasets.
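As a rough sketch of the database route, assuming the DBI, RSQLite and dbplyr packages are installed (the table name `obs` and its columns are made up for illustration): the summary is computed inside SQLite and only the small result is pulled into R with `collect()`.

```r
library(DBI)
library(dplyr)

# In-memory SQLite database for the demo; a file path would be used for real data
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table standing in for the large dataset
dbWriteTable(con, "obs",
             data.frame(id = rep(1:3, each = 4),
                        value = c(1, 2, 3, 4, 10, 20, 30, 40, 5, 5, 5, 5)))

# dplyr verbs are translated to SQL and run in the database
summary_tbl <- tbl(con, "obs") %>%
  group_by(id) %>%
  summarise(mean_value = mean(value, na.rm = TRUE)) %>%
  arrange(id) %>%
  collect()   # only the summarised rows come back into R

dbDisconnect(con)
summary_tbl
```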

davidhen · September 30, 2017, 12:19pm

Hi,

This is great - thanks!

Looking back at the description of the data I'll be working with, it'll be 7 columns total. I'd like to make this into 1 column and 1 list-column (at a push 2 and 1), so potentially a big saving in memory given your example - but yes, I'll need to test it.

I need a (fairly) simple summary from this data and am keen to use purrr functions on the list-column to obtain these so hopefully this will work (and keep everything in-memory and in-tidyverse).
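One way this plan could be tried out on a small scale, as a sketch only: the data below is simulated, the column names (`id`, `x`, `y`) are hypothetical, and `nest(data = c(...))` assumes a recent tidyr. It nests the non-key columns into a list-column, compares sizes, and then summarises each group with purrr.

```r
library(tibble)
library(tidyr)
library(dplyr)
library(purrr)

set.seed(1)
n_ids <- 100      # number of groups (rows after nesting)
n_per_id <- 50    # observations per group

# Simulated long data: one key column plus two value columns
df_long <- tibble(id = rep(seq_len(n_ids), each = n_per_id),
                  x = rnorm(n_ids * n_per_id),
                  y = runif(n_ids * n_per_id))

# One row per id, with the remaining columns held in a list-column of tibbles
df_nested <- nest(df_long, data = c(x, y))

# Compare sizes; pryr::object_size() could be used here instead
print(object.size(df_long), units = "Kb")
print(object.size(df_nested), units = "Kb")

# Per-group summaries computed from the list-column
df_summary <- df_nested %>%
  mutate(n_obs = map_int(data, nrow),
         mean_x = map_dbl(data, ~ mean(.x$x))) %>%
  select(-data)
df_summary
```

Each list element here is a small tibble with its own overhead, so whether nesting saves memory is something to measure on a subset rather than assume.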

If not, as @martin.R and @cderv point out - there are alternatives that will work.

Thanks all!

raybuhr · October 1, 2017, 5:30am

Options so far are good. One thing that hasn't been brought up yet is testing out your code on just a sample of the total data. Usually that works pretty well for me, though with less structured list columns something unexpected could potentially happen after you expand to the full dataset. Much easier to make sure your code does what it is supposed to on small data first then add MOAR RAM.
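A small sketch of sampling before scaling up, on simulated data with hypothetical column names. Sampling whole ids rather than individual rows keeps each group complete, which matters if the groups are later nested into a list column.

```r
library(dplyr)

set.seed(42)

# Simulated stand-in for the full dataset: 1000 ids with 10 rows each
full_data <- tibble(id = rep(1:1000, each = 10),
                    value = rnorm(10000))

# Pick a subset of ids, then keep every row belonging to those ids
sampled_ids <- sample(unique(full_data$id), size = 50)
small_data <- filter(full_data, id %in% sampled_ids)

# Develop and time the pipeline on small_data before running it on full_data
small_data %>%
  group_by(id) %>%
  summarise(mean_value = mean(value))
```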

edgararuiz · October 1, 2017, 1:34pm

Hi @davidhen, another option I like is using sparklyr locally on your machine. You can create multiple tables and then relate them via dplyr joins; that way you won't need list columns. There's a bit more detail in this reply: Limitations of R

hadley · October 1, 2017, 4:59pm

Be aware that object.size() does not correctly account for objects that share memory:

library(pryr)

x <- 1:1e6
object.size(x)
#> 4000040 bytes
object_size(x)
#> 4 MB

y <- list(x, x, x, x)
object.size(y)
#> 16000232 bytes
object_size(y)
#> 4 MB