How to improve and sample the data sets from HDFS into R
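I think working with Spark may be one of the best options. You should consult the RStudio website on the subject.
Spark can serve as the analytics engine on top of a Hadoop cluster, so the heavy work (filtering, sampling) happens on the cluster and only the result is brought into R.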
This explanation about data science with a data lake can help.
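
To give an idea of what this can look like, here is a minimal sketch using the sparklyr package. It assumes a YARN-managed cluster with Spark already configured on the machine running R, and the HDFS path and table name are hypothetical placeholders, so adapt them to your setup.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark runs on YARN and the client configuration is available locally.
sc <- spark_connect(master = "yarn")

# Hypothetical HDFS path and table name. The data stays in Spark;
# R only gets a reference to the remote table.
events <- spark_read_parquet(sc, name = "events", path = "hdfs:///data/events")

# Sample roughly 1% of the rows on the cluster, then bring only the sample into R.
events_sample <- events %>%
  sdf_sample(fraction = 0.01, replacement = FALSE, seed = 42) %>%
  collect()

spark_disconnect(sc)
```

After `collect()`, `events_sample` is a regular local data frame. The fraction is approximate, so the number of rows returned will vary slightly around 1% of the table.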
You could also ask a data engineer to expose the data (as a Hive table or something else) so that you can access it remotely (with the Impala ODBC driver, for example).
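
If you go the ODBC route, a rough sketch could look like the following. It assumes the Impala ODBC driver is installed, the DBI, odbc, dplyr and dbplyr packages are available, and a DSN has been configured; the DSN name "Impala", the table "events" and the column "year" are hypothetical.

```r
library(DBI)
library(dplyr)

# Assumption: a DSN named "Impala" is configured for the Impala ODBC driver.
con <- dbConnect(odbc::odbc(), dsn = "Impala")

# Reference the remote table without downloading it (needs dbplyr installed).
events <- tbl(con, "events")

# The filter and row limit are translated to SQL and run on the cluster;
# only the resulting rows are transferred to R.
events_subset <- events %>%
  filter(year == 2018) %>%
  head(10000) %>%
  collect()

dbDisconnect(con)
```

Note that `head()` only limits the number of rows, it is not a random sample. It is just a cheap way to get a manageable subset into R.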
There are other solutions, I think, but I'll let someone else in the community talk about them.