I am exploring the package textrecipes within the tidymodels ecosystem.

I want to tune several of the tokenization options but am getting a bit stuck.

Let's say I have a data frame with two columns of reviews of burger places, one from each of two different newspapers.

It looks like the table below.

I am trying to predict if a customer will go to the burger places after reading the reviews.

This is a made-up, nonsense dataset just for illustration.

burger_id   newspaper_1_review      newspaper_2_review              cust_go
1           This is a review        This is a second review         'Y'
2           This is a review        This is a second review         'N'
3           This is a review        This is a second review         'Y'

I have set up my recipe and tidymodels objects as shown below, and I was wondering how I can tune the tokenization of the two newspaper reviews separately.

I have made up an example in pseudocode below, which doesn't work in the slightest :)


xgb_rec <- recipe(cust_go ~ newspaper_1_review + newspaper_2_review, data = mydf) %>%
  # First newspaper to tune
  step_tokenize(newspaper_1_review) %>%
  step_ngram(newspaper_1_review, newspaper_1_review_num_tokens = tune(num_tokens), min_num_tokens = 1) %>%
  step_tokenfilter(newspaper_1_review, newspaper_1_review_max_tokens = tune(max_tokens), min_times = 5) %>%

  # Second newspaper to tune
  step_tokenize(newspaper_2_review) %>%
  step_ngram(newspaper_2_review, newspaper_2_review_num_tokens = tune(num_tokens), min_num_tokens = 1) %>%
  step_tokenfilter(newspaper_2_review, newspaper_2_review_max_tokens = tune(max_tokens), min_times = 5) %>%
  step_tf(newspaper_2_review)

# boilerplate
xgb_spec <-
  boost_tree(trees = 1300, min_n = 6, mtry = 15, learn_rate = 0.01) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# This is the part I'm a bit all over the place with
xgb_grid <- grid_max_entropy(
  size = 10
)

xgb_wf <- workflow() %>%
  add_recipe(xgb_rec) %>%
  add_model(xgb_spec)

ctrl <- control_grid(verbose = FALSE, save_pred = TRUE)

xgb_rs <- tune_grid(
  xgb_wf,
  resamples = train_fold,
  grid = xgb_grid,
  metrics = mset,
  control = ctrl
)
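
To make the intent concrete, here is a minimal sketch of how per-column tuning might be expressed. It assumes that giving each `tune()` placeholder its own id is enough to keep the two newspapers' parameters apart. The ids (`n1_num_tokens`, `n2_max_tokens`, and so on), the toy `mydf`, and the `max_tokens` ranges are hypothetical and only there for illustration.

```r
library(tidymodels)
library(textrecipes)

# Hypothetical toy data standing in for `mydf`
mydf <- tibble(
  newspaper_1_review = c("This is a review", "This is a review", "This is a review"),
  newspaper_2_review = c("This is a second review", "This is a second review", "This is a second review"),
  cust_go = factor(c("Y", "N", "Y"))
)

# The argument names stay the real ones (num_tokens, max_tokens);
# the id passed to tune() is what distinguishes the two columns.
text_rec <- recipe(cust_go ~ newspaper_1_review + newspaper_2_review, data = mydf) %>%
  step_tokenize(newspaper_1_review, newspaper_2_review) %>%
  # First newspaper
  step_ngram(newspaper_1_review, num_tokens = tune("n1_num_tokens"), min_num_tokens = 1) %>%
  step_tokenfilter(newspaper_1_review, max_tokens = tune("n1_max_tokens"), min_times = 5) %>%
  # Second newspaper
  step_ngram(newspaper_2_review, num_tokens = tune("n2_num_tokens"), min_num_tokens = 1) %>%
  step_tokenfilter(newspaper_2_review, max_tokens = tune("n2_max_tokens"), min_times = 5) %>%
  step_tf(newspaper_1_review, newspaper_2_review)

# Collect the four tunable parameters; the ranges here are illustrative assumptions
text_params <- extract_parameter_set_dials(text_rec) %>%
  update(
    n1_max_tokens = max_tokens(range = c(50L, 500L)),
    n2_max_tokens = max_tokens(range = c(50L, 500L))
  )

# Space-filling grid with one column per tune() id
text_grid <- grid_max_entropy(text_params, size = 10)
```

If this works as assumed, `text_grid` could be passed as `grid =` to `tune_grid()` for a workflow built on `text_rec`. Newer dials releases may print a deprecation message for `grid_max_entropy()` and suggest `grid_space_filling()` instead.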

Thank you for your time

