I run a small organization doing machine learning on a large streaming data set that grows in size and complexity by the day. Our MacBooks are increasingly inadequate for the throughput, and we want to structure the app in a mature way, with an eye toward a production environment that would withstand scrutiny.
We aren't engineers or even great programmers. I know R much better than anything else, mostly due to the outstanding community around the language. Of necessity, the code base is R-centric, and we want to avoid the cycle of prototyping in R and then translating to another language.
I'm therefore considering investing in a more powerful local machine to do the heavy lifting, using Spark/sparklyr to handle the data and modeling, along with a local database.
A local desktop can be arbitrarily powerful (and expensive). Thinking about the hardware has generated a few questions, and I'd be grateful for any opinionated replies:
Q1. Given a single machine with one or more GPUs (e.g. powerful NVIDIA cards) and one or more powerful multithreaded CPUs, will Spark (in local mode) efficiently find and use those resources?
Q2. Presumably peripherals (e.g. the monitors and the OS desktop) will consume a certain fraction of compute. Should this be taken into account up front? Or is it insignificant and/or automatically handled by Spark or the OS anyway?
Q3. Is using Spark in local mode a good match for the described hardware setup? My understanding is that worker nodes are typically individual machines, each with its own CPU, GPU, memory and disk; here we'd have 1-2 multicore CPUs, a single shared pool of memory, one disk, the GPUs and so on.
Q4. Are there good benchmarks on the performance tradeoffs of more RAM vs. more cores vs. GPUs, etc.? (This is surprisingly hard to find.)
Q5. I'd probably run Linux. Is there a downside to this?
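To make Q1-Q3 concrete, here is a minimal sketch of the setup I have in mind, written with sparklyr. The two-core headroom and the "32G" driver memory are placeholder assumptions on my part, not recommendations, and nothing in it mentions the GPUs, which is part of what Q1 is asking.

```r
library(sparklyr)

# Count the logical cores on the machine; detectCores() can return NA,
# so fall back to a single core in that case.
total_cores <- parallel::detectCores()
if (is.na(total_cores)) total_cores <- 1L

# Hypothetical headroom for the desktop/OS (the Q2 concern): hold back
# two cores, but never ask Spark for fewer than one.
spark_cores <- max(1L, total_cores - 2L)

config <- spark_config()
# In local mode everything runs inside the driver JVM, so the driver
# memory is the setting that matters. "32G" is a placeholder figure.
config$`sparklyr.shell.driver-memory` <- "32G"

# "local[N]" asks Spark to run with N worker threads on this machine.
sc <- spark_connect(
  master = paste0("local[", spark_cores, "]"),
  config = config
)

# ... dplyr and ml_* calls against sc would go here ...

spark_disconnect(sc)
```

If holding back cores or memory by hand like this is unnecessary, or if there is a better way to size these values, that is exactly the kind of correction I'm hoping for.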
Anticipating recommendations for, say, AWS or Google Compute as an alternative: adding those to the stack carries a significant knowledge overhead, and they are something of a black box in terms of cost. They therefore aren't my first preference, but I'm happy to be talked out of this view.
Thank you in advance, anyone.