Hi all,
I am a little confused about the different functions used to import data into Spark through the sparklyr interface:
- spark_read_csv() (or _json, _parquet, etc.): I understand that we just need to provide the directory path and set memory = FALSE (or TRUE if we want the data cached in memory), and voila, data imported. What I don't understand is how we can point this function towards a desired database, if that is possible at all. Is this set up during the configuration of the cluster? My understanding is that when memory = FALSE, a mapping is created to the data source, so that when we run a query using dplyr, the query goes all the way back to the source. But how do we tell Spark where to find that source?
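For context, this is roughly the pattern I mean (the connection settings and the path are just placeholders, not my actual setup):

```r
library(sparklyr)
library(dplyr)

# connect to a local Spark instance (placeholder master)
sc <- spark_connect(master = "local")

# map the CSV files without loading them into memory;
# with memory = FALSE, dplyr queries are pushed back to the files on disk
flights <- spark_read_csv(
  sc,
  name   = "flights",               # name registered in Spark's catalog
  path   = "path/to/flights_csv/",  # hypothetical directory
  memory = FALSE
)

# a lazy dplyr query against the mapped data
flights %>%
  group_by(carrier) %>%
  summarise(n = n())
```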
- The second question relates to the first one. When using the dplyr::tbl() function, if Spark is integrated with Hive, I understand we can reference a Hive table with tbl(sc, "table_name") and then chain further dplyr verbs onto it. Isn't this the same as mapping the data with spark_read_csv() and memory = FALSE? If that is correct, can we then cache the result of such a query into memory, the same way spark_read_csv() does when memory = TRUE?
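In code, this is what I am comparing (the table name is hypothetical, and this assumes a Hive-enabled Spark connection):

```r
library(sparklyr)
library(dplyr)

# placeholder connection; assumes Spark was built with Hive support
sc <- spark_connect(master = "yarn-client")

# lazily reference an existing Hive table -- nothing is read yet
orders <- tbl(sc, "orders")   # hypothetical Hive table name

# cache the whole table into memory, analogous to memory = TRUE?
tbl_cache(sc, "orders")

# or materialize a query result as a cached temporary table
recent <- orders %>%
  filter(year == 2017) %>%
  compute("recent_orders")    # registers and caches the result in Spark
```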
I think I have these concepts mixed up; it would be great if somebody could help me untangle them.
Regards,