When should I use sparklyr?


#1

I am pretty new to Spark and have only heard about it from our professor. Despite Googling and scratching my head several times, I still don't really understand it.

My basic question is: dplyr is awesome, so why do we need sparklyr?

Things I understand:

  1. Apache Spark is not a database - so no package is/was necessary.
  2. It just increases the computation speed.

So unless our IT department has Spark on their machines, we should not use it.
Please help me to increase my knowledge.


#2

sparklyr is an R interface to Spark. Spark itself is a system that lets you work with big data (think terabytes or even petabytes), which is simply impossible on a single machine. So you would use sparklyr if you don't want to work with Spark directly (through Scala, for example) but want to stay in the R ecosystem.
Also, dplyr and sparklyr are not alternatives to each other. sparklyr provides a dplyr backend for Spark, so you can (and probably will) use them together.
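
To make "using them together" a bit more concrete, here is a minimal sketch. It assumes Spark is installed locally (for example via `spark_install()`) and connects with `master = "local"`; on a real cluster the connection details would be different. The built-in `mtcars` data and the table name `mtcars_spark` are just illustrative choices.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available (e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Copy a small built-in data frame into Spark, purely for illustration.
# "mtcars_spark" is a made-up table name.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# Ordinary dplyr verbs; sparklyr translates them to Spark SQL,
# so the aggregation runs inside Spark rather than in R.
avg_mpg <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()  # bring the small result back into R as a tibble

spark_disconnect(sc)
```

The dplyr code is the same as you would write for a local data frame; only the table it points at lives in Spark. With data this small there is no benefit over plain dplyr, of course.
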
As for Spark increasing computation speed: strictly speaking, this is not the case. For example, if you want to build a model and the data fits into memory on your machine (say, it's only 100k rows), it is faster to do it locally and not bother with Spark. As I've said, Spark is normally a good solution once you move to amounts of data that no longer fit into memory.
I would suggest reading the official documentation for sparklyr at http://spark.rstudio.com/. It goes into a lot of detail about how to use sparklyr with Spark.