What is the difference between working in R directly and working with sparklyr?

Hi, I was just reading the article https://spark.rstudio.com/, but I am not sure what the difference is between working in R directly and working through the sparklyr package (install.packages("sparklyr")).

Could you let me know? I am confused.

When you work with sparklyr your data gets copied into a Spark instance (think of it as a database engine), and you can then manipulate it using dplyr-like commands, but under the hood all the processing is done by Spark, much faster than is possible in R alone.
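
For example (just a sketch, assuming a local Spark install and using the built-in mtcars data, none of which comes from your question), a dplyr pipeline on a Spark table is translated into Spark SQL and executed by Spark:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# copy_to() ships an R data frame into Spark; what comes back is a remote table
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and run inside Spark, not in R
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  show_query()   # prints the SQL that Spark will actually execute

spark_disconnect(sc)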

Cool. Here is the sample I got.

install.packages("sparklyr")
library(sparklyr)
spark_install(version = "2.1.0")        # install a local copy of Spark
sc <- spark_connect(master = "local")   # connect to the local Spark instance
library(dplyr)
iris_tbl <- copy_to(sc, iris)           # copy the iris data frame from R into Spark

Here the iris dataset is getting copied to Spark, right? So in order to copy the data into Spark, we first need to load it into R, right? If that is the case, there is a problem: my data is very large, and I cannot even import it into R, so I cannot copy it into Spark.

Please correct me if my understanding is wrong.

You can read tabular data directly into Spark with spark_read_csv().
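
For example (a sketch only; the file path and table name here are made up, and it assumes a connection sc like the one above):

library(sparklyr)

sc <- spark_connect(master = "local")

# Spark reads the file itself; only a reference to the remote table comes back to R,
# so the data never has to fit into R's memory
flights_tbl <- spark_read_csv(
  sc,
  name = "flights",               # name the table will have inside Spark
  path = "D:/data/flights.csv",   # hypothetical path -- point this at your file
  header = TRUE,
  infer_schema = TRUE
)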

Thanks, let me try...

Hi, I tried to load the data using spark_read_csv() as below:

spark_read_csv(sc, name = "as", path = "D:/New folder/Copy.csv", header = TRUE)

It is working, but not all of the rows are being loaded. There are 2 lakh (200,000) rows, and only 1,000 rows are extracted. May I know why?

Sorry, I can't reproduce your issue. Could you give any other details that might be relevant?
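
One thing that might help narrow it down (a sketch, assuming the connection sc and the table name "as" from your call above): printing or previewing a Spark table usually shows only the first rows, so it's worth checking the row count inside Spark itself:

library(sparklyr)
library(dplyr)

tbl(sc, "as") %>% count()   # row count computed by Spark
sdf_nrow(tbl(sc, "as"))     # same check via sparklyr's helper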

Take a look at this book to learn how to work with sparklyr.
