spark_read_avro not working in R

I'm using:
R version 4.1.1
sparklyr version ‘1.7.2’

I'm connected to my Databricks cluster with databricks-connect and am trying to read an Avro file using the following code:

library(sparklyr)
library(dplyr)

sc <- spark_connect(
  method = "databricks",
  spark_home = "Users/my_spark_home_path",  # path must be a quoted string
  version = "3.1.1",
  packages = c("avro")
)

df_path = "s3a://my_s3_path"
df = spark_read_avro(sc, path = df_path, memory = FALSE)

I also tried adding the package explicitly:

library(sparklyr)
library(dplyr)

sc <- spark_connect(
  method = "databricks",
  spark_home = "Users/my_spark_home_path",  # path must be a quoted string
  version = "3.1.1",
  packages = "org.apache.spark:spark-avro_2.12:3.1.1"
)

df_path = "s3a://my_s3_path"
df = spark_read_avro(sc, path = df_path, memory = FALSE)

The Spark connection works and I can read Parquet files normally, but reading the Avro file always fails with:

Error in validate_spark_avro_pkg_version(sc) : 
  Avro support must be enabled with `spark_connect(..., version = <version>, packages = c("avro", <other package(s)>), ...)`  or by explicitly including 'org.apache.spark:spark-avro_2.12:3.1.1-SNAPSHOT' for Spark version 3.1.1-SNAPSHOT in list of packages
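
Note that the error asks for a coordinate ending in 3.1.1-SNAPSHOT, while the second attempt above passes one ending in 3.1.1. The sketch below builds the coordinate from the full version string quoted in the error. The helper name avro_coordinate and the default Scala version "2.12" are assumptions for illustration, not part of sparklyr. This only targets sparklyr's package check; whether a SNAPSHOT artifact can actually be resolved is a separate question and has not been tested here.

```r
# Hypothetical helper: build the Maven coordinate for spark-avro.
# `spark_version` is the full version string Spark reports, including any
# suffix such as "-SNAPSHOT"; `scala_version` is assumed to be "2.12".
avro_coordinate <- function(spark_version, scala_version = "2.12") {
  paste0("org.apache.spark:spark-avro_", scala_version, ":", spark_version)
}

# The version string is copied from the error message above.
avro_pkg <- avro_coordinate("3.1.1-SNAPSHOT")
print(avro_pkg)

# Assumed usage (not run here, needs a working databricks-connect setup):
# sc <- spark_connect(
#   method = "databricks",
#   spark_home = "Users/my_spark_home_path",
#   version = "3.1.1",
#   packages = avro_pkg
# )
```

If the connection still fails the check, the version sparklyr detects may differ from the one in the error, so it is worth comparing the two strings character by character.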

I found a workaround with the sparkavro package (loaded after sparklyr, so its spark_read_avro is the one that gets called):

library(sparklyr)
library(dplyr)
library(sparkavro)

sc <- spark_connect(
  method = "databricks", 
  spark_home = "my_spark_home_path") 

df_path = "s3a://my_s3_path"
df = spark_read_avro(sc, path = df_path, name = "some_name",  memory = FALSE)
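
A possible alternative that avoids an extra package is sparklyr's generic reader, spark_read_source, which does not go through the Avro package check. This is a minimal sketch, not something verified against databricks-connect: it assumes the cluster recognizes the short format name "avro", and the wrapper name read_avro_generic is made up for illustration.

```r
# Hypothetical wrapper around sparklyr's generic reader.
# Assumes the cluster resolves the short format name "avro".
read_avro_generic <- function(sc, path, name) {
  sparklyr::spark_read_source(
    sc,
    name = name,
    path = path,
    source = "avro",
    memory = FALSE  # do not cache the table in memory, as in the post
  )
}

# Assumed usage with the connection and path from the post (not run here):
# df <- read_avro_generic(sc, path = "s3a://my_s3_path", name = "some_name")
```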
