Hi all, just want to say a big thanks first before I ask this question. I also want to acknowledge the incredible work by Emil and Julia, who have taught me much about ML. Speaking of which, I am following along with the still-in-press SMLTAR ebook. I am working to engineer some features of my corpus and put them into the tidy recipes workflow. However, I am a bit stuck with the step_textfeature function that is part of the textrecipes package. Hopefully my ignorance will be easy to remedy!
So in my feature engineering step I am making a handful of simple functions that should help improve the model's feature space. For example, here is a simple function that counts the number of characters in the given text record:
# simply count the length of the text
response_length <- function(x) {
  str_count(x)
}
And here is another function I wrote to analyze the total sentiment value of all the words in the text:
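afinn <- get_sentiments("afinn")
derive_sentiment <- function(response) {
  # tokenize into words and keep only those found in the AFINN lexicon
  df <- tibble(response) %>%
    unnest_tokens(word, response) %>%
    inner_join(afinn, by = "word")
  # add up the AFINN values of all matched words
  summed <- sum(df$value)
  return(summed)
}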
When I call both of these functions outside of the recipe flow, they both return a number, as expected. But when I put them into the recipe using the step_textfeature function from the textrecipes package, I get errors at model training. Here is the remainder of the code, showing how the two functions above are included in the workflow:
#define a list of these custom functions to be put into the recipe later
custom_functions <- list(
  response_length = response_length,
  derive_sentiment = derive_sentiment
)
#make a 'recipe' that pre-processes the text
#here we also add in our custom functions that help build the feature space
preprocessing_recipe <-
  recipe(label ~ response,
    data = train
  ) %>%
  step_mutate(response_copy = response) %>%
  step_textfeature(response_copy, extract_functions = custom_functions) %>%
  step_tokenize(response) %>%
  step_stopwords(response) %>%
  step_tokenfilter(response, max_tokens = 500, min_times = 50) %>%
  step_tfidf(response) %>%
  step_downsample(label)
#cross-validation object
folds <- vfold_cv(train)
#declare a SVM classification model
svm_spec <- svm_rbf() %>%
  set_mode("classification") %>%
  set_engine("kernlab")
svm_spec
#build a SVM 'workflow' by passing the model and the recipe
svm_wf <- workflow() %>%
  add_recipe(preprocessing_recipe) %>%
  add_model(svm_spec)
svm_wf
#fit the models
#warning - takes a long time!
svm_rs <- fit_resamples(
  svm_wf,
  folds,
  metrics = metric_set(recall, precision, sensitivity, specificity, accuracy),
  control = control_resamples(save_pred = TRUE)
)
svm_rs
So the error in question is simply this: the step_textfeature function throws:
Or at least I am assuming this comes from step_textfeature, because the docs state:
All the functions passed to extract_functions must take a character vector as input and return a numeric vector of the same length, otherwise an error will be thrown.
(https://www.rdocumentation.org/packages/textrecipes/versions/0.3.0/topics/step_textfeature)
So, as I mentioned, I am under the impression that both of my functions are passed a single character vector and return a single real number. In fact, the first function (response_length) makes it through the recipe just fine. The error shows itself when I include the derive_sentiment function. I think I'm missing something super basic, but if anyone can shed some light that would be awesome!
Cheers.