I am exploring the textrecipes package within the tidymodels framework.
I wish to tune several options within the tokenizing steps but am becoming a bit stuck.
Let's say I have a data frame with two columns of reviews from two different newspapers for burger places.
It looks like the below.
I am trying to predict if a customer will go to the burger places after reading the reviews.
This is a made-up nonsense dataset just for illustration.

    burger_id  newspaper_1_review  newspaper_2_review       cust_go
    1          This is a review    This is a second review  'Y'
    2          This is a review    This is a second review  'N'
    3          This is a review    This is a second review  'Y'

I have set up my recipe and tidymodels workflow like below and was wondering how I can tune the tokenization of both newspaper reviews separately.
I have made up an example of pseudocode below, which doesn't work in the slightest.

    library(tidymodels)
    library(tidyverse)
    library(textrecipes)

    xgb_rec <- recipe(cust_go ~ newspaper_1_review + newspaper_2_review, data = mydf) %>%
      # First newspaper to tune
      step_tokenize(newspaper_1_review) %>%
      step_ngram(newspaper_1_review, newspaper_1_review_num_tokens = tune(num_tokens), min_num_tokens = 1) %>%
      step_tokenfilter(newspaper_1_review, newspaper_1_review_max_tokens = tune(max_tokens), min_times = 5) %>%
      step_tf(newspaper_1_review) %>%
      # Second newspaper to tune
      step_tokenize(newspaper_2_review) %>%
      step_ngram(newspaper_2_review, newspaper_2_review_num_tokens = tune(num_tokens), min_num_tokens = 1) %>%
      step_tokenfilter(newspaper_2_review, newspaper_2_review_max_tokens = tune(max_tokens), min_times = 5) %>%
      step_tf(newspaper_2_review)

    # boilerplate
    xgb_spec <- boost_tree(trees = 1300, min_n = 6, mtry = 15, learn_rate = 0.01) %>%
      set_engine("xgboost") %>%
      set_mode("classification")

    # This is the part I'm a bit all over the place with
    xgb_grid <- grid_max_entropy(
      newspaper_1_review_num_tokens(),
      newspaper_1_review_max_tokens(),
      newspaper_2_review_num_tokens(),
      newspaper_2_review_max_tokens(),
      size = 10
    )

    xgb_wf <- workflow() %>%
      add_recipe(xgb_rec) %>%
      add_model(xgb_spec)

    ctrl <- control_grid(verbose = FALSE, save_pred = TRUE)

    set.seed(345)
    xgb_rs <- tune_grid(
      xgb_wf,
      resamples = train_fold,
      grid = xgb_grid,
      metrics = mset,
      control = ctrl
    )

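One direction that might get closer to the intent, as a minimal sketch rather than a confirmed solution: keep the real argument names (`num_tokens`, `max_tokens`) and give each `tune()` placeholder its own id string, so the two columns are tuned independently. The ids `n1_*` and `n2_*`, the parameter ranges, and the toy `mydf` are all made up for illustration, and the sketch assumes reasonably recent versions of recipes and dials where `extract_parameter_set_dials()` is available.

```r
library(tidymodels)
library(textrecipes)

# Toy stand-in for mydf, mirroring the made-up table in the question
mydf <- tibble(
  burger_id = 1:3,
  newspaper_1_review = rep("This is a review", 3),
  newspaper_2_review = rep("This is a second review", 3),
  cust_go = factor(c("Y", "N", "Y"))
)

# tune("<id>") labels each placeholder, so the same argument can be tuned
# separately per column. The n1_*/n2_* ids are hypothetical names.
xgb_rec <- recipe(cust_go ~ newspaper_1_review + newspaper_2_review, data = mydf) %>%
  # First newspaper
  step_tokenize(newspaper_1_review) %>%
  step_ngram(newspaper_1_review, num_tokens = tune("n1_num_tokens"), min_num_tokens = 1) %>%
  step_tokenfilter(newspaper_1_review, max_tokens = tune("n1_max_tokens"), min_times = 5) %>%
  step_tf(newspaper_1_review) %>%
  # Second newspaper
  step_tokenize(newspaper_2_review) %>%
  step_ngram(newspaper_2_review, num_tokens = tune("n2_num_tokens"), min_num_tokens = 1) %>%
  step_tokenfilter(newspaper_2_review, max_tokens = tune("n2_max_tokens"), min_times = 5) %>%
  step_tf(newspaper_2_review)

# Pull the four tunable parameters out of the recipe by id and set ranges.
# The ranges here are illustrative guesses, not recommendations.
xgb_params <- extract_parameter_set_dials(xgb_rec) %>%
  update(
    n1_num_tokens = num_tokens(c(1L, 3L)),
    n1_max_tokens = max_tokens(c(50L, 500L)),
    n2_num_tokens = num_tokens(c(1L, 3L)),
    n2_max_tokens = max_tokens(c(50L, 500L))
  )

# Build the grid from the parameter set; its columns are named by the ids above
set.seed(345)
xgb_grid <- grid_max_entropy(xgb_params, size = 10)
```

If this holds up, `xgb_grid` could be passed to `tune_grid()` as `grid = xgb_grid` with the rest of the workflow unchanged, since the grid column names match the ids used inside the recipe.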
Thank you for your time