One of the limitations of R that I have read about is that it needs to load the entire dataset into memory, so it's not suitable for analyzing big data. Has anything been done to address this problem? Are there any workarounds?
There are tons of great big data solutions in the R ecosystem.
- sparklyr - the R interface to Apache Spark. Spark has data manipulation and machine learning modules that you can control from R or Python.
- Database connectors and dplyr - use R to manipulate data in the database on the server, and only pull it into R when you need it. I would recommend the new odbc package together with dbplyr.
- Memory-optimization packages such as bigmemory, which can give you a small boost if you are only a few extra gigabytes over your machine's RAM.
- Rent an AWS or DigitalOcean VM for an hour or two and brute-force it.
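To make the database-plus-dplyr idea concrete, here is a minimal sketch using an in-memory SQLite database as a stand-in for a real remote server; with the odbc package you would swap in `dbConnect(odbc::odbc(), ...)`. The table and column names are made up for illustration.

```r
# Sketch: push computation to the database with dplyr/dbplyr,
# then collect() only the reduced result into R's memory.
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flights", data.frame(
  carrier = c("AA", "AA", "UA"),
  delay   = c(10, 30, 5)
))

flights_db <- tbl(con, "flights")   # lazy reference; nothing is loaded into R yet

summary_db <- flights_db %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(delay, na.rm = TRUE))

show_query(summary_db)              # the pipeline is translated to SQL
result <- collect(summary_db)       # only the small summary comes back to R
print(result)

dbDisconnect(con)
```

The key point is that `tbl()` and the dplyr verbs are lazy: the heavy lifting happens on the database side, and `collect()` is the only step that moves data into R.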
I’m sure other people will have more!
There is a pretty good discussion of this topic here:
As @Dripdrop12 alluded to, part of the trick is that R can provide an excellent interface to the “standard” big data tools. Beyond that, recommendations on specific solutions generally depend on the specific problem.
Unclear how this is a limitation of R (especially on a 64-bit system) … nearly every programming system relies on external packages/modules/libraries/etc. to handle 'virtual memory' mappings for specific tasks (e.g. large matrix operations or binary/text file scanning). BTW, readr has chunked reading to process/reduce large files in chunks and prevent out-of-memory issues.
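For example, readr's chunked reading lets you reduce a large CSV piece by piece so the whole file never sits in memory at once. A small self-contained sketch (the file here is a throwaway temp file):

```r
# Sketch: process a large CSV in chunks with readr.
# Each chunk is reduced (here, to a row count) and then discarded.
library(readr)

path <- tempfile(fileext = ".csv")
write_csv(data.frame(x = 1:10000), path)

total <- 0
read_csv_chunked(
  path,
  callback = SideEffectChunkCallback$new(function(chunk, pos) {
    total <<- total + nrow(chunk)   # keep only the running summary
  }),
  chunk_size = 2500                 # 4 chunks of 2500 rows each
)
total
```

In a real workflow the callback would filter or aggregate each chunk, leaving you with a result far smaller than the source file.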
(In contrast, R’s lack of out-of-the-box 64-bit integer support does strike me as a limitation compared to other popular computing systems. bit64 is helpful, but IMHO native int64 support should be prioritized by R-core.)
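A quick illustration of the gap bit64 fills: base R integers top out at 2^31 - 1, and doubles lose exactness past 2^53, whereas an integer64 stays exact.

```r
# Sketch: base R's integer ceiling, and bit64 as a workaround.
library(bit64)

.Machine$integer.max                    # 2147483647, the base-R integer ceiling

x <- as.integer64("9007199254740993")   # beyond exact double precision (2^53 + 1)
x + 1L                                  # exact 64-bit arithmetic
```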
Another cool thing about sparklyr is that you can use it without a cluster or external server. The “local” mode (https://spark.rstudio.com/articles/deployment-overview.html#deployment) will create a Spark context on your laptop (Windows, Mac, or Linux). I’ve done experiments on my laptop where I “map” the large files using Spark, so when I run dplyr commands they are actually performed on disk and not in memory. I then import only the data I want into the Spark cache. Because of how Spark works, another nice thing is that I can map multiple files that have the same layout as if they were one table, so I can query across files without bringing anything into memory. Here are a couple of links that may be of help: https://spark.rstudio.com/articles/guides-caching.html and https://github.com/rstudio/webinars/blob/master/42-Introduction%20to%20sparklyr/sparklyr-webinar1.Rmd
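The workflow described above might look roughly like this sketch, assuming Spark has been installed locally via `spark_install()`; the file path, table name, and columns are hypothetical.

```r
# Sketch of sparklyr's local mode: map files on disk, compute with dplyr,
# and cache only the reduced result.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")    # Spark runs on the laptop itself

# Map CSVs on disk without pulling them into memory; memory = FALSE leaves
# the data uncached, so dplyr verbs run against the files. The glob pattern
# treats multiple same-layout files as one table.
logs <- spark_read_csv(sc, name = "logs", path = "logs/*.csv",
                       memory = FALSE)

daily <- logs %>%
  group_by(date) %>%
  summarise(n = n())

# Cache only the small summary, then bring it into R if needed.
daily_cached <- compute(daily, "daily_counts")
daily_local  <- collect(daily_cached)

spark_disconnect(sc)
```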
Thanks for letting me know about that; it looks like a very nice way to handle large data sets.