Sparklyr, ML Pipelines, and Collaborative Filtering



I am trying to learn the how to do collaborative filtering in Spark. There are several new concepts which exist outside of the tidyverse that I am still trying to get my head around.

These include spark connection, ml pipelines, tbl_spark, transformers, and estimators.

I am using a data set of product reviews. Selecting on the ProductID, UserID, and Score columns for my Alternating Least Squares (ALS) model.

I've tried to follow Kevin Kuo's RStudio::conf ML Pipelines presentation, but I am not getting all the components together in the proper order.

Spark's requirement for all columns to be numeric and ft_string_indexer is difficult to conceptualize from my tidyverse experience.

My end goal is to get a data frame of five products to recommend to each user.


Create the Spark context connnection

sc <- spark_connect(master="yarn-client", 
              version = '2.3.0',
              config = conf,
              spark_home = "/usr/lib/spark")

Get the reviews data. It is fine food reviews from Amazon.

reviews <- spark_read_csv(sc,"reviews","s3a://reviews.txt",delim="|")  %>%
           sdf_partition(train = 4/5, validation = 1/5, seed=2018)

View the Spark DataFrame


Split data into train and test

train <- reviews$train 
validation <- reviews$validation


Convert strings to integers using string indexer

product <- ft_string_indexer(sc,input_col="ProductID",output_col="product_index")
user <- ft_string_indexer(sc,input_col="UserID",output_col="user_index")


Create an ML Pipeline to feed the ALS Collaborative Filter model

pipeline <- ml_pipeline(sc) %>%
            ft_dplyr_transformer(train) %>%
            ft_string_indexer(input_col="ProductID",output_col="product_index") %>%

Set up model on training data

model <- ml_als(pipeline,rating_col="Score",user_col="user_index",item_col="product_index",max_iter=10) 



recommendations <-  ml_recommend(model, type = c("items"), n = 5)

I don't know how to create a ml pipelines for ALS. Does anyone in the R community have any experience with collaborative filtering? Thanks


Just to close the loop for benefit of future viewers: The ALS stage needs to be part of your pipeline, and you'll need to extract the fitted ALSModel in order to call the ml_recommend() function on it. See for further discussion on this.