I need to handle data as large as 20 GB on a daily basis. Currently I am using sparklyr to deal with out-of-memory data, but it still does not feel as smooth as SAS. I would like to know whether there would be any benefit in switching to Microsoft R Open for big data analysis. I have heard that MRO uses MKL, which can speed up matrix operations, but I am not sure whether that advantage translates to faster data.frame operations and statistical work. Also, does MRO ship with any unique libraries that excel at big data?
You should probably ask on MRO's forum too.
MRO will not help memory consumption. Some matrix operations are faster and some explicit parallelization is easier.
I am guessing Microsoft R Server might be the solution, but it is not free. Any recommendations for big data tools in R?
Sounds like you need to take a look at Databases using dplyr.
Briefly, you connect an SQL database (or one of the other supported backends) to your R session and then set up your script using R syntax. Under the hood this syntax is converted to SQL, which then interacts with your DB. This way you needn't load 20 GB into memory.
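A minimal sketch of that workflow, assuming the DBI, dplyr/dbplyr, and RSQLite packages are installed (RSQLite stands in here for a real PostgreSQL or other backend; the table and column names are made up):

```r
library(DBI)
library(dplyr)   # dbplyr is used automatically for database-backed tables

# Connect; in production you'd use RPostgres::Postgres() or similar
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "sales",
             data.frame(region = c("east", "west", "east"),
                        amount = c(100, 250, 75)))

# tbl() creates a lazy reference -- nothing is loaded into R yet
sales <- tbl(con, "sales")

# Ordinary dplyr verbs are translated to SQL and executed in the database
summary_q <- sales %>%
  group_by(region) %>%
  summarise(total = sum(amount, na.rm = TRUE))

show_query(summary_q)        # inspect the generated SQL
res <- collect(summary_q)    # only the small result is pulled into memory
res

dbDisconnect(con)
```

The key point is that `collect()` is the only step that moves data into R, so you only ever pay memory for the aggregated result, not the raw table.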
Hope it helps
What do you mean by the above? Specifically, what's not smooth? What does "not smooth" mean? From your question it's not obvious what you're looking for.
20 GB isn't really "big data", to be completely honest, and you have to take into consideration the spin-up time when using something like Spark.
That said, I routinely process roughly 900 million to 1.6 billion rows in R (no sparklyr) in 2-4 hours. I can't compare it to our SAS runs, since before we converted we were not able to do the same type of processing, so it's not a fair apples-to-apples comparison. That said, we are processing more columns, more rows, and more accurately (we fitted a GLM in our data processing), which we could not do before in SAS.
We tried Microsoft R Open, and whether you will see a performance increase, well, it depends. I found the performance of OpenBLAS comparable to Microsoft R Open, and our memory footprint was significantly lower. The way we got around R hogging all the memory was using the dreaded for loop, and we ran concurrent Rscript sessions. We honestly didn't think the performance was that bad, and it meets our needs.
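The chunked for-loop idea can be sketched in base R like this (the file name, chunk size, and per-chunk work are all made up for illustration; the point is that only one chunk is ever in memory):

```r
# Write a tiny sample file so the sketch runs end-to-end;
# in practice this would be the existing multi-gigabyte CSV
write.csv(data.frame(id = 1:10, amount = rep(5, 10)),
          "big_input.csv", row.names = FALSE, quote = FALSE)

process_chunk <- function(df) sum(df$amount)  # stand-in for the real work

chunk_size <- 4L
con <- file("big_input.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]

total <- 0
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = header, nrows = chunk_size),
    error = function(e) NULL)  # read.csv errors once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  total <- total + process_chunk(chunk)
}
close(con)
total  # 50: all ten rows processed, four at a time
```

Because the connection stays open, each `read.csv()` call picks up where the previous one stopped, so peak memory is bounded by `chunk_size` rather than the file size.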
Yeah, I do use PostgreSQL with R, but it is not as fast as sparklyr.
Sorry for the ambiguity; by "smooth" I mean fast.
Are there any packages for doing that? Or could you kindly provide some example code? Thanks.
You might consider Amazon Redshift. My experience has been that it's very fast. It depends on your data, of course, and also on the load time, the structure, and all sorts of details that you have not provided.
But to answer your initial question: there's nothing about your problem that would be made materially faster by switching to Microsoft R Open.
Here's what I will say: if you're inserting into a database, whether it's Redshift, Postgres, etc., it's going to be slow. I messed with this for weeks and determined that not only was it slow, it was dropping rows (this is obviously bad).
Our setup looks like the following.
A person puts a .csv file (usually about 400k-900k rows) in an S3 bucket. An AWS Lambda written in Python fires on the object-creation event, which calls our Python code.
The Python code SSHs into another EC2 instance and nohups a bash script which calls specific Rscripts based on the file. It looks something like this:
```bash
#!/bin/bash
Rscript script1 &
Rscript script2 &
Rscript script3 &
Rscript script4
```
Each script uses aws.s3 and reads the file from S3 into an object, where each code base does the EXACT same processing. We join each row against our lead table (28 million rows) row by row, then parse, calculate, and fit a model over the top. We write the results to independent locations, where our "combinator" is running in the background. This is a Python program with almost no overhead; what it does is basically loop over and over looking for files named a certain way in a certain location. The loop breaks when the counter hits its setting based on the files being created. This way it doesn't matter whether Rscript4 finishes first or last: once it's done and creates the .zip, the background job will pick it up and will automatically exit once it grabs all the files.
We predominantly use aws.s3, data.table, and just base R. The trick is the splitting of the data. The Lambda handler basically does a record count on the number of rows in the input file and, depending on the file, splits it X ways. This varies because our lead sample differs per file. So we chunk into smaller pieces and run them 3x, 4x, 8x, etc. concurrently using the bash script.
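The splitting step can be sketched with data.table (assumed installed; the input data, file names, and chunk count here are hypothetical):

```r
library(data.table)

# Stand-in for the input file pulled from S3
input <- data.table(id = 1:9, value = runif(9))

n_chunks <- 3L
# Deal rows round-robin into n_chunks pieces and write one file each;
# every concurrent Rscript then reads exactly one piece
input[, chunk := rep(seq_len(n_chunks), length.out = .N)]
for (i in seq_len(n_chunks)) {
  fwrite(input[chunk == i, !"chunk"], sprintf("piece_%d.csv", i))
}
```

Each `piece_N.csv` then maps to one line of the bash script, which is what lets the work run 3x, 4x, 8x concurrently.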
Once we have the full result sets, that's when we insert it into MSSQL.
Some food for thought: I ran a test today. Microsoft R Open in a Windows EC2 environment took 2 hours 41 minutes. The machine we actually run this stuff on, R 3.5.0 on a CentOS build, finished in 39 minutes. However, Linux has less overhead in general, so this isn't a completely fair comparison.
I would start with data.table, OpenBLAS, and aws.s3, and maybe mix in some Python.
I don't know if you can PM me or anything but maybe I can help some more.
Microsoft R Server does allow you to work with files larger than RAM, but XDF files are messy to work with (duplication of files), and the RevoScaleR syntax is both alien and lacking in community support on Stack Overflow. I don't believe it's as seamless as SAS, but it's not bad. I'd still recommend paying for more RAM (a rounding error compared to a SAS license), as it takes out the admin of working with XDF files.
Even without the XDF files, R Server/R Client is good to use as your default R version, as RevoScaleR functions are much faster than base R's equivalents (e.g. rxGlm models are tiny and run way faster than glm in base R). I'd recommend R Client as THE version of R everyone should use.
sparklyr is the closest free thing you'll get to SAS in R, which is to say a database that allows you to fit models. The database part is easily covered in R with MonetDBLite, Postgres, etc., but you won't get the model fitting.
If you're into masochism you could always round-trip the data by exporting a munged dataset from the database to a CSV file and then re-importing the CSV into h2o. There is a (clunky) Java driver that allows h2o to import data from PostgreSQL, so you can avoid that middle bit, but it's still a bit of a Rube Goldberg solution.
(P.S. Check if your company has SQL Server 2016+, as then you already have an installer for full-fat R Server.)
Desktop Microsoft R Open helps by speeding up matrix operations but doesn't help with the memory issue.
In general, 20 GiB for analysis is usually feasible through code optimization, caching, vectorisation, data.table for I/O, and parallelisation across multiple cores. And yes, buying a more powerful machine can sometimes be the fastest and 'cheapest' option. But if the problem has the potential to grow, then I'd recommend looking towards server/cloud solutions, be it Microsoft R or not.
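As a small illustration of the vectorisation point, here is the same computation written as a row-by-row loop and as a single vectorised expression; the vectorised form does the element-wise work in compiled code and is typically orders of magnitude faster on long vectors:

```r
x <- runif(1e6)

# Row-by-row: one R-level iteration per element
f_loop <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- x[i] * 2 + 1
  out
}

# Vectorised: one call, the loop happens in C
f_vec <- function(x) x * 2 + 1

system.time(r1 <- f_loop(x))
system.time(r2 <- f_vec(x))
all.equal(r1, r2)  # TRUE -- same result, very different cost
```

The same principle is why data.table's `fread`/`fwrite` beat row-wise reading and writing: the per-element work happens below the R interpreter.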
Hi, when you say you're using sparklyr, does that mean that you're also using a Spark cluster? Or are you using sparklyr as a way to run Spark locally on your laptop?
I use sparklyr locally. Several people here recommended MonetDB, so I tried it. It is faster than using SAS; I am a happy man now.
That's great to know. And yes, I'd use sparklyr for a quick analysis of data that are in files I don't want to structure into a database; it's not necessarily fast, but I get the ability to run models out of memory.
It would be awesome if you wrote up an article on your experience going from SAS to MonetDB; I'm sure lots of folks would be interested.
Can we have an archive like this?
@mine - Do you have any thoughts on this?
I don't know of such a repository, @Peter_Griffin, and I think it's a good idea to have one. It will be challenging to keep up to date, but that might not be such a big deal. And the list would be pretty long, but that's a good thing! It's in the spirit of https://www.tidyverse.org/learn/ but for R in general.
I've started a list at https://github.com/rstudio-education/rstats-ed and will look into where might be a better venue for the information later.
Also, see https://www.class-central.com/tag/r%20programming for MOOCs on R (I haven't browsed it in detail, but I think it has complete-ish information on MOOCs).