Best practice to handle out-of-memory data

Hello, I am a data analyst using mainly R and SAS.

While I generally prefer to use R and tidyverse tools for my data science and programming tasks, I miss SAS datasets whenever R data frames consume all the memory.

I could use a variety of R packages to handle large data (bigmemory, ff, the dplyr interface to databases, etc.), but a unified binary on-disk data format, as in SAS, has several advantages: ease of data management, no need to learn additional syntax, and the ability to create many intermediate datasets without worrying about RAM (which makes debugging easy).

So, I want to ask the R community:

  • What are your ideas, techniques, workflows, and best practices for handling out-of-memory data in R? (I assume the data is several gigabytes or bigger, but not so big that it requires large-scale infrastructure for distributed computing.)

  • Is there any new technology or project addressing this problem that is worth watching?

10 Likes

I've been connecting to a local instance of Apache Drill via Bob Rudis's sergeant, which defines a dplyr backend. It can operate directly on on-disk CSVs (or JSON, or Parquet, or...) with ANSI SQL as if it were a remote SQL server, which in practice means I can keep big (tens of GB so far) rectangular data files on an external hard drive and run anything I like on them via dplyr without ever reading them into R's memory. It's reasonably fast considering the size (though a well-maintained SQL server would be faster), and requires very little setup and maintenance.
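
To give a flavour of the workflow, here is a minimal sketch, assuming an embedded Drill instance is already running locally (the Parquet path and column names are hypothetical):

library(sergeant)  # dplyr backend for Apache Drill
library(dplyr)

# connect to the local Drill instance; nothing is loaded into R yet
db <- src_drill("localhost")

# point a lazy tbl at an on-disk file via the dfs storage plugin
flights <- tbl(db, "dfs.`/data/flights.parquet`")

# dplyr verbs are translated to SQL and executed by Drill;
# only the small aggregated result is pulled into R by collect()
flights %>%
  filter(dep_delay > 60) %>%
  count(carrier) %>%
  collect()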

20 Likes

Thanks!

It is a very interesting idea to use technologies for distributed computing to deal with data on a single machine.

While I have been interested in distributed computing technologies, I am intimidated by the technical expertise they require, because our team is small and lacks the resources to set up and maintain large-scale infrastructure.

In addition to my original questions: do you have general advice for small teams on how to take advantage of distributed computing tech?

1 Like

There are people who can answer that question more thoroughly, but a few thoughts. Firstly, don't be too intimidated; in the age of clouds and Docker, the bar to entry for a lot of these technologies is really low, and they don't require a dedicated sysadmin, e.g.:

  • Apache Spark via RStudio/sparklyr on AWS:
  • or even just RStudio on AWS with as much memory as you need:

The opposite—trying to build systems from scratch—can waste a lot of time and money, and leave you with a really poor implementation.

Maybe the most important point, though, is to find the right tool for the job. Neural networks are cool, but sometimes you really just need a linear model. The question determines the algorithm, which determines the tech stack.

7 Likes

Same boat. Welcome aboard.

From what I've seen, every solution is a workaround. No package will allow R to 'just work' with larger-than-RAM data the way SAS does.

The most frustrating part about it is that the responses to this question are usually a smattering of the following:

  1. Buy more RAM - (If I had a dollar for every time I heard this I could buy all the RAM)

  2. Sample data - which doesn't answer the question

  3. Or something like "Just spin up a spark cluster AWS big data machine in the cloud with 256gb of ram" - which again doesn't answer the question and doesn't seem to grasp that’s not how things work in an insurance company / bank etc

I started off using RSQLite to do data munging for large datasets; however, once your data is larger than a few GB the speed becomes frustrating, at which point I switched to MonetDBLite. It's a huge speed-up. Don't ask me why - they claim it's to do with columnar storage, but I'm sure it's more to do with witchcraft.

But even so, I could never fit any models on data stored in a DB, so using a database really only solves a small part of the problem. Switching to ff and the big* family of packages gives you a bit of runway, but still nothing like the compatibility of SAS.
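
For reference, a rough sketch of what that bigmemory route looks like (the file, backing-file names, and formula here are hypothetical):

library(bigmemory)     # file-backed matrices that are memory-mapped from disk
library(biganalytics)  # model-fitting functions for big.matrix objects

# parse a large all-numeric csv into a file-backed big.matrix;
# the data stays on disk and is memory-mapped rather than loaded into RAM
x <- read.big.matrix("/data/claims.csv", header = TRUE, type = "double",
                     backingfile = "claims.bin", descriptorfile = "claims.desc")

# fit a linear model on the file-backed matrix
fit <- biglm.big.matrix(claim_amount ~ age + exposure, data = x)
summary(fit)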

I eventually gave up and we ordered Microsoft R Server to handle out-of-memory datasets. It's still not perfectly compatible with base R, but it's very close. If you're coming from SAS, this is your best bet in my opinion.

That said, I don't claim to be an expert in R - it was cheaper to buy MRS than to spend time messing around with workarounds.

10 Likes

“Just spin up a spark cluster AWS big data machine in the cloud with 256gb of ram” ... doesn’t answer the question and doesn’t seem to grasp that’s not how things work in an insurance company / bank etc.

This is exactly the pain point for those of us who work in regulated environments and cannot easily access modern cloud infrastructure such as AWS, though I do benefit from packages created by more web-oriented developers and scientists.

I would be glad to hear more opinions from those who work in similar environments, though comments from more tech-savvy people are also welcome!

2 Likes

Consider Matter:
https://bioconductor.org/packages/release/bioc/html/matter.html
https://academic.oup.com/bioinformatics/article/3868724

4 Likes

Check out the book The R Inferno, which covers the efficiency of R data structures.

2 Likes

+1 for using Drill, whether locally or on a remote server. I've used it not only for querying flat files, but also JSON logs and MySQL databases, and sometimes all three in a single SQL query! I've been amazed at the simplicity of the setup and the performance.
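
As a flavour of that, a single drill_query() call can join across sources, something like the sketch below (the storage plugin names, paths, and columns are hypothetical and depend on your Drill configuration):

library(sergeant)

con <- drill_connection("localhost")

# join an on-disk parquet file against a table exposed through a MySQL storage plugin;
# `dfs` and `mysql_prod` are plugin names from my Drill config -- yours will differ
sql <- "
  SELECT c.customer_id, c.region, o.total
  FROM dfs.`/data/customers.parquet` c
  JOIN mysql_prod.shop.orders o
    ON o.customer_id = c.customer_id
"
drill_query(con, sql)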

2 Likes

Great comments. I had the same struggle with SQLite and switched to using PostgreSQL through Docker after reading http://www.win-vector.com/blog/2016/02/databases-in-containers/ (a great blog for R, btw).

I've heard really good things about MonetDB because of the columnar storage, but haven't tried it. Instead I used a columnar storage implementation in Postgres from Citus Data, cstore_fdw. This has the advantage of letting you choose row-based or column-based storage per table, as appropriate for your needs. And since it all runs in Docker, it is fairly easy to keep the setup reproducible.
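
In case it helps anyone, connecting to the containerised Postgres from R is the usual DBI/dplyr dance; a minimal sketch, with hypothetical host, credentials, and table names (a cstore_fdw foreign table is queried like any other table once created):

library(DBI)
library(dplyr)

# connect to the Postgres instance running in a local Docker container
con <- dbConnect(
  RPostgres::Postgres(),
  host = "localhost", port = 5432,
  dbname = "analysis", user = "postgres", password = "mysecretpassword"
)

# a lazy reference to an on-disk (possibly cstore_fdw-backed) table
sales <- tbl(con, "sales")

# the work is pushed down to Postgres; only the summary comes back to R
sales %>%
  filter(year == 2016) %>%
  count(region) %>%
  collect()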

For fitting models on data larger than memory, h2o can be run in a distributed manner. This requires access to multiple computers and a shared network drive, but the idea is subsetting the data and using stochastic gradient descent -- see this stackshare answer for a more thorough explanation and a link to the research.
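
A minimal single-machine sketch of that h2o workflow (the file path and column names are hypothetical):

library(h2o)

# start a local H2O cluster; the data lives in the JVM, not in R's memory
h2o.init(max_mem_size = "4G")

# H2O parses the file itself and keeps it as an H2OFrame
claims <- h2o.importFile("/data/claims.csv")

# fit a GLM directly on the H2OFrame
fit <- h2o.glm(x = c("age", "region", "exposure"),
               y = "claim_amount",
               training_frame = claims)
summary(fit)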

Lastly, the CRAN Task Views have some great references for this type of thing.

3 Likes

Thank you for the useful comments.

I have just found articles on the ALTREP project from the R core team.

This seems to address important performance issues in the current version of R, though I lack the technical knowledge to fully understand the implications of the project.

In particular, native support for memory-mapped vectors seems like it would let us deal with out-of-memory data easily. Can we expect to use base R functions on these out-of-memory vectors in future versions of R, without specialized package APIs like bigmemory and biganalytics?

3 Likes

I had a nice experience using MonetDBLite (https://cran.r-project.org/web/packages/MonetDBLite/index.html). It was very easy to set up, was very fast, and worked well with dplyr.

3 Likes

I've spent the last month keeping a very close eye on election Twitter for the New Zealand election. This has involved building a dataset outside of RAM.

I've deliberately based this around what can be done on a home computer, because I want a project at the end that people can replicate on their own computer.

The main trick has been that as I have downloaded things, I have saved them in individual CSV files (approximately 50000 files of data). Then I have been answering questions with a first stage that processes the data as it is read in. So rather than read.csv() it is read_and_process.csv(), so to speak: for example selecting, filtering, grouping, and summarising, then returning the reduced data.

It wouldn't work for analyses where you need to hold every atomised piece of data in memory at once, but where the process reduces the data needed at intermediate stages it can be effective.
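
A minimal sketch of that pattern, with a hypothetical folder of per-download csv files and a made-up summary:

library(tidyverse)

# read one file, immediately reduce it, and return only the summary
read_and_process <- function(path) {
  read_csv(path, col_types = cols()) %>%
    filter(!is.na(screen_name)) %>%
    group_by(screen_name) %>%
    summarise(tweets = n())
}

# map over the ~50000 small files; at any moment only one raw file
# plus the accumulated (much smaller) summaries are held in memory
files <- list.files("tweets/", pattern = "\\.csv$", full.names = TRUE)
result <- map_dfr(files, read_and_process) %>%
  group_by(screen_name) %>%
  summarise(tweets = sum(tweets))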

2 Likes

Thanks for that idea!
I tried to use Drill, but for some reason it treats the csv as a single column of data, even though the query works from the Drill console.

I get output like:

# Source:   table<dfs.`/tmp/data.csv`> [?? x 1]
# Database: DrillConnection
                            columns
                              <chr>
1 "[\"fname\",\"lname\",\"score\"]"
2      "[\"mark\",\"smith\",\"1\"]"
3    "[\"betty\",\"wilson\",\"2\"]"
4       "[\"jim\",\"mccoy\",\"3\"]"

Here is the example code

library(tidyverse)  # tibble(), write_csv()
library(sergeant)   # src_drill(), dplyr backend for Drill

#write out example csv file
df1 <- tibble(
  'fname' = c('mark', 'betty', 'jim'),
  'lname' = c('smith', 'wilson', 'mccoy'),
  'score' = 1:3
)
write_csv(df1, '/tmp/data.csv')

#startup drill
#check out: https://drill.apache.org/docs/drill-in-10-minutes/
#in another shell run
#bin/drill-embedded --verbose
#works from drill
#0: jdbc:drill:zk=local> select columns[1], columns[2] from dfs.`/tmp/data.csv`;

db <- src_drill('localhost')

#sees the file but does not split into columns?

drill_df <- tbl(db, "dfs.`/tmp/data.csv`") 
drill_df
Source:   table<dfs.`/tmp/data.csv`> [?? x 1]
Database: DrillConnection
                            columns
                              <chr>
1 "[\"fname\",\"lname\",\"score\"]"
2      "[\"mark\",\"smith\",\"1\"]"
3    "[\"betty\",\"wilson\",\"2\"]"
4       "[\"jim\",\"mccoy\",\"3\"]"

Thanks for any input,

John

Can you put ``` around the code in your post so it's easier to read?

Also I (unrelatedly) managed to screw up my Drill installation, so I can't give you a full working version at the moment, but I can point you in the right direction.

By default, Drill reads files ending in .csv as if they don't have column headings. You can separate out the columns and add names in the SQL query, but it can be a lot of work.

Happily, there's a better option. Drill reads in files ending in .csvh as if they do have column headings, and splits them all out for you. If your files tend to have headings and you don't want to rename them all, you can change the settings of the storage plugin to swap the default behavior.

Thanks for your help. I am sorry about not using ```.

If I treat it as a query it works. For future reference:

  1. Follow the embedded-mode instructions at https://drill.apache.org/docs/drill-in-10-minutes/

  2. At a shell prompt, run: drill/bin/drill-embedded --verbose

  3. To query the file at /tmp/data.csv and have it parse the headers:

library(tidyverse)
library(sergeant)

drill_con <- drill_connection('localhost')
query <- "select * from table(dfs.`/tmp/data.csv`(type => 'text',fieldDelimiter => ',', extractHeader => true))"
drill_df <- drill_query(drill_con,query)
drill_df
A tibble: 3 x 3
fname score  lname
* <chr> <int>  <chr>
1  mark     1  smith
2 betty     2 wilson
3   jim     3  mccoy

An option you can try is Apache Spark. It will process the data in a separate local JVM and manage the memory for large files. I ran a simple test with a csv that has 10M rows in it and it seemed to work fine.

library(tidyverse)
library(sparklyr)

#follow instructions to install with spark_install(version = "2.2.0") at https://spark.rstudio.com/
#or download spark-2.2.0-bin-hadoop2.7.tgz from https://spark.apache.org/downloads.html
#extract, set SPARK_HOME, and point R to it

#connect to spark instance
sc <- spark_connect(master = "local")
#load a csv with 10M rows into spark
sp_big <- spark_read_csv(sc, 'big', '/tmp/a.csv')
#look for the single row that has A == '2'
sp_small <- sp_big %>% filter(A == '2')
#extract the result into a tibble
df_small <- collect(sp_small)

2 Likes

If you use a Mac and go this route, benefit from the endless hours I wasted troubleshooting error messages, only to learn that Java 9 is not compatible with Spark.

Thankfully someone had solved the issue on SO:

I will also point out that the RStudio walk-through on R + Spark is painless and easy to follow:

https://spark.rstudio.com/

5 Likes

As far as I can tell, the same is true of Apache Drill, which I mentioned above.

2 Likes

In a similar spirit to @alistaire, I have been using MonetDBLite for some time now for my need to store loads of data away, and it has been a really nice experience.
The 'Lite' version is just the original MonetDB repackaged as an R library, so you don't have to bother with the setup.
It is compatible with dplyr and it is blazing fast ("the cheetah with a jetpack" :slight_smile:).
One nice feature that I use quite often is their version of read csv, which dumps big CSVs into an on-disk MonetDBLite instance pretty quickly. Also worth mentioning: feedback and support are always fast and friendly on GitHub and the mailing lists.
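
For anyone who wants to try it, the workflow is roughly the sketch below (the database directory, csv path, and table/column names are hypothetical, and if I remember right the bulk loader is monetdb.read.csv()):

library(DBI)
library(dplyr)
library(MonetDBLite)

# an embedded, on-disk MonetDB instance -- nothing to install or administer
con <- dbConnect(MonetDBLite::MonetDBLite(), "/data/monetdb")

# bulk-load a big csv straight into the database, bypassing R's memory
monetdb.read.csv(con, "/data/events.csv", "events")

# then query it lazily through dplyr and collect only the summary
events <- tbl(con, "events")
events %>%
  count(event_type) %>%
  collect()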

3 Likes