I am trying to learn how to do collaborative filtering in Spark. There are several new concepts outside the tidyverse that I am still trying to get my head around.
These include Spark connections, ML Pipelines, tbl_spark objects, transformers, and estimators.
I am using a data set of product reviews, selecting the ProductID, UserID, and Score columns for my Alternating Least Squares (ALS) model.
I've tried to follow Kevin Kuo's RStudio::conf ML Pipelines presentation, but I am not getting all the components together in the proper order.
Spark's requirement that the columns fed to the model be numeric, and the role of ft_string_indexer in meeting it, are difficult to conceptualize coming from my tidyverse experience.
My end goal is to get a data frame of five products to recommend to each user.
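To make the target concrete, here is the shape of result I am after, sketched in base R on made-up predicted scores rather than real model output. The column names and the `top_n_per_user()` helper are hypothetical and only for illustration.

```r
# Toy sketch of the desired output: the n highest-scoring products per user.
# top_n_per_user() and all the data below are made up for illustration.
top_n_per_user <- function(scores, n = 5) {
  # Sort by user, then by descending predicted score
  ordered <- scores[order(scores$user, -scores$predicted), ]
  # Rank rows within each user (1 = best) and keep the first n
  rank_in_user <- ave(seq_len(nrow(ordered)), ordered$user, FUN = seq_along)
  top <- ordered[rank_in_user <= n, ]
  rownames(top) <- NULL
  top
}

set.seed(2018)
scores <- expand.grid(user = c("u1", "u2"), product = paste0("p", 1:8),
                      stringsAsFactors = FALSE)
scores$predicted <- round(runif(nrow(scores), 1, 5), 2)

recommended <- top_n_per_user(scores, n = 5)
print(recommended)
```

In the real problem the predicted scores would come from the fitted ALS model instead of `runif()`.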
library("dplyr")
library("sparklyr")
Create the Spark connection
sc <- spark_connect(master = "yarn-client",
                    version = "2.3.0",
                    config = conf,
                    spark_home = "/usr/lib/spark")
sc
Get the reviews data, which are fine food reviews from Amazon.
reviews <- spark_read_csv(sc, "reviews", "s3a://reviews.txt", delimiter = "|") %>%
  sdf_partition(train = 4/5, validation = 1/5, seed = 2018)
View the Spark DataFrame
glimpse(reviews)
head(reviews)
Extract the train and validation partitions
train <- reviews$train
validation <- reviews$validation
glimpse(train)
head(train)
Convert strings to integers using string indexer
product <- ft_string_indexer(sc, input_col = "ProductID", output_col = "product_index")
user <- ft_string_indexer(sc, input_col = "UserID", output_col = "user_index")
is_ml_estimator(product)
product
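For anyone else coming from the tidyverse, my current understanding of what the string indexer does is sketched below in base R, with no Spark involved. As far as I can tell from the Spark documentation, StringIndexer gives the most frequent label index 0, the next most frequent index 1, and so on. The IDs here are made up, and `string_index()` is a hypothetical helper, not part of sparklyr.

```r
# Toy illustration of frequency-based string indexing in base R.
# string_index() is a hypothetical helper, not a sparklyr function.
string_index <- function(x) {
  # Count each label and order by descending frequency
  freq <- sort(table(x), decreasing = TRUE)
  # Most frequent label gets 0, the next gets 1, and so on
  lookup <- setNames(seq_along(freq) - 1, names(freq))
  unname(lookup[x])
}

# Made-up product IDs
product_id <- c("B001", "B002", "B001", "B003", "B001", "B002")
product_index <- string_index(product_id)
print(data.frame(product_id, product_index))
```

The point is that the index is only a numeric stand-in for the original ID, so any recommendations expressed as indices would presumably need to be mapped back to ProductID and UserID afterwards.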
Create an ML Pipeline to feed the ALS Collaborative Filter model
pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(train) %>%
  ft_string_indexer(input_col = "ProductID", output_col = "product_index") %>%
  ft_string_indexer(input_col = "UserID", output_col = "user_index")
pipeline
Set up model on training data
model <- ml_als(pipeline, rating_col = "Score", user_col = "user_index",
                item_col = "product_index", max_iter = 10)
model
Recommendations
recommendations <- ml_recommend(model, type = "items", n = 5)
I don't know how to create an ML pipeline for ALS. Does anyone in the R community have experience with collaborative filtering? Thanks.