Sorry about that, that was an oversight on my end. You will need to make sure that your text column (in this case
medium) is passed in as a character variable, not a factor. Then it should work:
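library(tidymodels)
library(textrecipes)

data("tate_text", package = "modeldata")

# Convert the text column from factor to character before building the recipe
tate_text <- tate_text |>
  mutate(medium = as.character(medium)) |>
  select(medium, year) |>
  mutate(year = factor(if_else(year > 2000, "2000s", "1900s")))

tate_split <- initial_split(tate_text)
tate_train <- training(tate_split)
tate_test <- testing(tate_split)

# Tokenize, keep the 20 most frequent training tokens, then count them
rec <- recipe(year ~ medium, data = tate_train) |>
  step_tokenize(medium) |>
  step_tokenfilter(medium, max_tokens = 20) |>
  step_tf(medium)

lr_spec <- logistic_reg()

wf_spec <- workflow() |>
  add_recipe(rec) |>
  add_model(lr_spec)

wf_fit <- fit(wf_spec, data = tate_train)

# A new observation to predict on
tate_new <- tibble(medium = "Finger paint on sofa", year = "2000s")
predict(wf_fit, new_data = tate_new)
#> # A tibble: 1 × 1
#>   .pred_class
#>   <fct>
#> 1 1900s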
- If I understand you correctly, a one-token-per-feature approach, as this is, precludes presenting the model with data outside of the original data set. So, again talking about tweets, once I train my model, a new tweet that comes in containing any new words would not be predict-able? Expanding on your example:
The error above was my mistake with the character/factor issue. But to expand: these models work by looking at how often different words appear, then using that information to set the weights for the model. If the model encounters a new word in the testing data set, that word will simply be ignored because the model has zero information about it.
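To make the "new words are ignored" point concrete, here is a minimal base R sketch. It is a toy analogue, not how textrecipes is implemented; the training texts and the helpers tokenize() and count_tokens() are made up for illustration.

```r
# Toy training texts (made up for illustration)
train_text <- c("oil paint on canvas", "oil paint on paper", "ink on paper")

# Hypothetical helper: lowercase and split on whitespace
tokenize <- function(x) strsplit(tolower(x), "\\s+")

# The vocabulary is fixed from the training data only
vocab <- sort(unique(unlist(tokenize(train_text))))

# Count how often each vocabulary word occurs in one text.
# Words outside the vocabulary have no matching level, so they
# become NA and are left out of the counts.
count_tokens <- function(text, vocab) {
  tokens <- tokenize(text)[[1]]
  table(factor(tokens, levels = vocab))
}

new_counts <- count_tokens("Finger paint on sofa", vocab)
print(new_counts)
```

Only on and paint are counted for the new text. finger and sofa never appeared in training, so there is no feature for them and they contribute nothing to the prediction.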
- The code you show obviously works but since the recipe specifies modeling on tate_train it has no awareness of tate_test. Why doesn't that run into the data leakage problem you cite?
Data leakage is what happens if you include information about the testing data in model training. For this example, if you were to let the model know that sofa was a word it could encounter later, it might change how the model would be fit, hence "leaking". (This specific recipe doesn't have many leakage opportunities, but the general principle still applies.)
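As a rough illustration of what leaking would look like, the base R sketch below compares a vocabulary built from the training texts alone with one built from training and testing texts together. The texts and the tokenize() helper are hypothetical and only stand in for the real preprocessing.

```r
# Toy data (made up for illustration)
train_text <- c("oil paint on canvas", "ink on paper")
test_text <- c("finger paint on sofa")

# Hypothetical helper: lowercase and split on whitespace
tokenize <- function(x) strsplit(tolower(x), "\\s+")

# Correct: the vocabulary is learned from the training data only
vocab_train_only <- unique(unlist(tokenize(train_text)))

# Leaky: the testing data is allowed to influence the vocabulary
vocab_leaky <- unique(unlist(tokenize(c(train_text, test_text))))

# Words the model only "knows" because it peeked at the test set
leaked_words <- setdiff(vocab_leaky, vocab_train_only)
leaked_words
#> [1] "finger" "sofa"
```

Estimating the recipe on tate_train only is what keeps words like these out of the preprocessing.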
- Along those lines, how does step_tokenfilter() work? If it selects the top 20 tokens in the training set, by frequency, all of those tokens are not guaranteed to appear in the test set, yet predict on the test set doesn't throw an error. Why?
step_tokenfilter() works by counting the tokens in the training data set passed to it. For this set of arguments, it finds the 20 most common tokens, and then filters the tokens so that only those 20 pass through. The tokens are not guaranteed to be in the test set, and that is okay because it is a filter.
You will have a hard time trying to create a model that works well on data that is drastically different from the data it was trained on.
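The filtering idea behind step_tokenfilter() can be mimicked in a few lines of base R. This is only a sketch of the concept under simplified assumptions (whitespace tokenization, keeping the top 2 tokens instead of 20), and tokenize() and filter_tokens() are hypothetical helpers, not textrecipes functions.

```r
# Toy training texts (made up for illustration)
train_text <- c("oil paint on canvas", "oil paint on paper",
                "oil on canvas", "ink on paper")

# Hypothetical helper: lowercase and split on whitespace
tokenize <- function(x) strsplit(tolower(x), "\\s+")

# "Training" the filter: count tokens and keep the 2 most frequent
token_counts <- sort(table(unlist(tokenize(train_text))), decreasing = TRUE)
keep <- head(names(token_counts), 2)

# Applying the filter: drop every token that is not in `keep`.
# A text with no kept tokens simply ends up empty; nothing errors.
filter_tokens <- function(text, keep) {
  lapply(tokenize(text), function(tokens) tokens[tokens %in% keep])
}

filtered <- filter_tokens(c("oil on board", "finger painting"), keep)
print(filtered)
```

The second text contains none of the kept tokens, so it comes back as an empty character vector. In the real workflow that corresponds to a row whose token counts are all zero, which the model can still predict on.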