This took a tremendous amount of work, but I finally cracked the code to get this working.
# build the connection config: pull in the hadoop-aws package and
# enable V4 request signing on the driver JVM
conf <- spark_config()
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"
conf$sparklyr.shell.conf <- "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4"
sc <- spark_connect(master = "local", config = conf, version = "2.4.4")

# reach the Hadoop configuration through the JavaSparkContext
ctx <- spark_context(sc)
jsc <- invoke_static(
  sc,
  "org.apache.spark.api.java.JavaSparkContext",
  "fromSparkContext",
  ctx
)
hconf <- jsc %>% invoke("hadoopConfiguration")

# we always want the s3a file system with V4 signatures
hconf %>% invoke("set", "fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hconf %>% invoke("set", "com.amazonaws.services.s3a.enableV4", "true")

# connect to the us-east-2 endpoint
hconf %>% invoke("set", "fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")

# always use the bucket-owner-full-control ACL in case of cross-account access
hconf %>% invoke("set", "fs.s3a.acl.default", "BucketOwnerFullControl")

# authenticate via the EC2 instance metadata service
hconf %>% invoke("set", "fs.s3a.aws.credentials.provider",
                 "com.amazonaws.auth.InstanceProfileCredentialsProvider")
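With the Hadoop configuration in place, a quick way to sanity-check the setup is to read something over an s3a:// URL. This is just a sketch: the bucket and key below are placeholders, and it assumes `sc` is the connection created above.

```r
library(sparklyr)
library(dplyr)

# hypothetical bucket and key -- substitute a path your instance role can read
df <- spark_read_csv(
  sc,
  name = "s3_check",
  path = "s3a://my-bucket/path/to/file.csv"
)

# if the endpoint, signing, and credentials provider are all wired up
# correctly, this returns rows instead of a 403/400 from S3
head(df)
```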
I have to say, the documentation on this (particularly the distinction between what belongs in spark_config() and what belongs in the Hadoop configuration) is a bit rough around the edges.