First time trying to implement a clean production workflow using R, sparklyr, and AWS technologies (mainly EC2, S3, and Redshift). For simplicity's sake, I'm hoping to keep everything inside R. I would like some advice on which architecture makes the most sense.
Essentially, the modeling workflow looks something like:
- execute Redshift query to clean up/manipulate some raw data
- unload result set into s3
- load data into either HDFS or directly into memory
- execute spark job fitting model using sparklyr
- write results back to s3
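For concreteness, here's a rough sketch in R of what I have in mind for those steps. All of the names (bucket, IAM role, table, columns, connection details) are placeholders, and I'm assuming a Redshift `UNLOAD` plus sparklyr's CSV reader, which may not be the best choice:

```r
library(DBI)
library(sparklyr)

# 1. Run the cleanup query on Redshift and UNLOAD the result set to S3.
#    Host, credentials, role ARN, and bucket are all placeholders.
con <- dbConnect(RPostgres::Postgres(),
                 host = "my-cluster.redshift.amazonaws.com",
                 port = 5439, dbname = "analytics",
                 user = "user", password = "pass")
dbExecute(con, "
  UNLOAD ('SELECT * FROM cleaned_training_data')
  TO 's3://my-bucket/model-input/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
  DELIMITER ',' ALLOWOVERWRITE;
")
dbDisconnect(con)

# 2. Connect to Spark (YARN on EMR, or spark://... for a standalone cluster)
sc <- spark_connect(master = "yarn")

# 3. Read the unloaded files from S3 into a Spark DataFrame
training <- spark_read_csv(sc, name = "training",
                           path = "s3a://my-bucket/model-input/")

# 4. Fit a model with sparklyr's ML interface (example: linear regression
#    on made-up columns y, x1, x2)
fit <- ml_linear_regression(training, response = "y",
                            features = c("x1", "x2"))

# 5. Score and write results back to S3
ml_predict(fit, training) %>%
  spark_write_csv("s3a://my-bucket/model-output/")
```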
What I'm struggling with is whether to go with
- a standalone Spark ec2 cluster, where I think I'm reading data directly off of s3, OR
- an EMR cluster with Spark installed, where I'm loading data into hadoop
From an automated production workflow perspective, it seems much cleaner to go with the former, since I can avoid writing a Hadoop load job via an R package just to stage the data. However, I'm wondering if there are performance benefits I'm losing out on by not storing data in Hadoop.

Additionally, I'm not really sure what is actually happening in option 1: is data somehow being loaded straight from S3 into memory?

Lastly, if I go with the first option, is it overkill to still use an EMR cluster just to avoid configuring Spark manually on a set of EC2 instances?
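For what it's worth, this is roughly how I'd point a standalone cluster at S3 in option 1. The `fs.s3a.*` keys are taken from the Hadoop S3A connector docs; the master URL, bucket, and environment variables are placeholders:

```r
library(sparklyr)

# Sketch of S3 access config for a standalone Spark cluster (option 1).
# Credentials are read from the environment; names are placeholders.
conf <- spark_config()
conf$spark.hadoop.fs.s3a.access.key <- Sys.getenv("AWS_ACCESS_KEY_ID")
conf$spark.hadoop.fs.s3a.secret.key <- Sys.getenv("AWS_SECRET_ACCESS_KEY")

sc <- spark_connect(master = "spark://master-node:7077", config = conf)

# As I understand it, executors stream the S3 objects over the network
# and materialize partitions in memory (spilling to disk if needed);
# nothing is persisted in HDFS along the way.
df <- spark_read_csv(sc, name = "input",
                     path = "s3a://my-bucket/model-input/")
```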