Reading data into Spark with sparklyr

Hi all,

I am a little confused about the different functions used to import data into Spark through the sparklyr interface:

  • spark_read_csv (or json, parquet, etc.): I understand that we just need to provide the directory path and set memory=FALSE (or TRUE if we want the data cached in memory), and voilà, the data is imported. What I don't understand is how to point this function at a particular database, if that is possible at all. Is this set up when the cluster is configured? My understanding is that with memory=FALSE, Spark only creates a mapping to the data source, so that when we run a dplyr query, the query goes all the way back to the database. But how do we tell Spark where to find the database?

  • The second question relates to the first. If Spark is integrated with Hive, I understand we can use dplyr::tbl() to reference a Hive table, as in tbl(sc, "table_name"), and then chain further dplyr pipes onto it. Isn't this the same as calling spark_read_csv() with memory=FALSE, which also just creates a mapping to the data? If so, can we then cache the result of such a query in memory, the way spark_read_csv() does with memory=TRUE?
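
To make the comparison concrete, here is a rough sketch of the two patterns I mean. The local master, the file path, and the names "my_csv_data" and "table_name" are placeholders I made up for illustration, and the tbl_cache() call at the end is only my guess at what the caching step for the Hive case might look like.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark install; a real cluster would use a different master.
sc <- spark_connect(master = "local")

# Pattern 1: read files from a path without caching them in memory.
# "path/to/csv_dir" and "my_csv_data" are hypothetical.
csv_tbl <- spark_read_csv(
  sc,
  name   = "my_csv_data",
  path   = "path/to/csv_dir",
  memory = FALSE
)

# Pattern 2: reference an existing Hive table by name.
# "table_name" is a placeholder for a table that already exists.
hive_tbl <- tbl(sc, "table_name")

# Both objects can be piped into dplyr verbs in the same way.
csv_tbl %>% count()
hive_tbl %>% count()

# My guess at caching the Hive table in memory (not confirmed).
tbl_cache(sc, "table_name")

spark_disconnect(sc)
```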

I think I have these concepts mixed up, so it would be useful if somebody could help me untangle them.

