Need help troubleshooting the step_textfeature function in the textrecipes package

swansswansswans · November 24, 2020, 6:23pm

Hi all - just want to say big thanks first before I ask this question. Also want to acknowledge the incredible work by Emil and Julia who have taught me much about ML. Speaking of which, I am following along with the still-in-press SMLTAR ebook. I am working to engineer some features of my corpus and put them into the tidy recipes workflow. However, I am a bit stuck with the step_textfeature function that is a part of the textrecipes package. Hopefully my ignorance will be easy to remedy!

So in my feature engineering step I am making a handful of simple functions that should help improve the model's feature space. For example, here is a simple function that counts the number of characters in the given text record:

library(stringr)

#count the characters in each text (str_count's default pattern matches every character)
response_length <- function(x) {
  str_count(x)
}

And here is another function I wrote to analyze the total sentiment value of all the words in the text:

library(dplyr)     # tibble(), inner_join(), %>%
library(tidytext)  # get_sentiments(), unnest_tokens()

afinn <- get_sentiments("afinn")

derive_sentiment <- function(response) {
  df <- tibble(response) %>%
    unnest_tokens(word, response) %>%
    inner_join(afinn, by = "word")

  summed <- sum(df$value)
  return(summed)
}

When I call both of these functions outside of the recipe flow, they both return an integer (number) as expected. But, when I put them into the recipe using the step_textfeature function from the textrecipes package, I get some errors at model training. Here is the remainder of the code showing the appropriate inclusion of the two above functions in the workflow:

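library(tidymodels)   # recipe(), workflow(), svm_rbf(), fit_resamples(), ...
library(textrecipes)  # step_textfeature(), step_tokenize(), step_tfidf(), ...
library(themis)       # step_downsample()

#define a list of these custom functions to be put into the recipe later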
custom_functions <- list(
  response_length = response_length,
  derive_sentiment = derive_sentiment
)

#make a 'recipe' that pre-processes the text 
#here we also add in our custom functions that help build the feature space
preprocessing_recipe <-
  recipe(label ~ response,
         data = train
  ) %>%
  step_mutate(response_copy = response) %>%
  step_textfeature(response_copy, extract_functions = custom_functions) %>%
  step_tokenize(response) %>%
  step_stopwords(response) %>%
  step_tokenfilter(response, max_tokens = 500, min_times = 50) %>%
  step_tfidf(response) %>%
  step_downsample(label)


#cross-validation object
folds <- vfold_cv(train)

#declare a SVM classification model
svm_spec <- svm_rbf() %>%
  set_mode("classification") %>%
  set_engine("kernlab")
svm_spec

#build a SVM 'workflow' by passing the model and the recipe
svm_wf <- workflow() %>%
  add_recipe(preprocessing_recipe) %>%
  add_model(svm_spec)
svm_wf


#fit the models
#warning - takes a long time!
svm_rs <- fit_resamples(
  svm_wf,
  folds,
  metrics = metric_set(recall, precision, sensitivity, specificity, accuracy),
  control = control_resamples(save_pred = TRUE)
)
svm_rs

So the errors in question are simply that: the step_textfeature function throws:

Or at least I am assuming this comes from step_textfeature because in the docs it states:

All the functions passed to extract_functions must take a character vector as input and return a numeric vector of the same length, otherwise an error will be thrown.
(https://www.rdocumentation.org/packages/textrecipes/versions/0.3.0/topics/step_textfeature)
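To make that requirement concrete, here is a minimal base R sketch of a check that a function returns one number per input string. The helper name `returns_one_value_per_element` and the two toy functions are hypothetical and only for illustration; they are not part of textrecipes.

```r
# Hypothetical helper: TRUE if f returns a numeric vector with
# exactly one value per element of the character vector x
returns_one_value_per_element <- function(f, x) {
  out <- f(x)
  is.numeric(out) && length(out) == length(x)
}

sample_text <- c("good movie", "bad", "no opinion at all")

# Vectorized: one character count per string
per_element <- function(x) nchar(x)

# Not vectorized: collapses the whole vector into a single number
collapsed <- function(x) sum(nchar(x))

returns_one_value_per_element(per_element, sample_text)
returns_one_value_per_element(collapsed, sample_text)
```

Testing with a vector of several strings matters here: with a single string, both toy functions return one number and look equally valid.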

So, as I mentioned, I am under the impression that both of my functions are passed a single character vector and return a single real number. In fact, the first function (response_length) makes it through the recipe just fine. The error shows itself when I include the derive_sentiment function. I think I'm missing something super basic, but if anyone can shed some light that would be awesome!
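One possible cause, offered as a guess since the error text is not shown above: derive_sentiment sums over every row it is given, so a vector of several responses collapses into one number instead of one number per response. Below is a rough sketch of a per-response version. The name `derive_sentiment_vec` and the `toy_lexicon` table (a made-up stand-in for the AFINN lexicon, so that nothing has to be downloaded) are assumptions for illustration only.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for get_sentiments("afinn"); words and values are made up
toy_lexicon <- tibble(
  word  = c("good", "great", "bad", "awful"),
  value = c(3, 3, -3, -3)
)

# Hypothetical vectorized variant: returns one score per element of `response`
derive_sentiment_vec <- function(response, lexicon = toy_lexicon) {
  scores <- tibble(id = seq_along(response), response = response) %>%
    unnest_tokens(word, response) %>%          # one row per word, id is kept
    inner_join(lexicon, by = "word") %>%       # keep only words in the lexicon
    group_by(id) %>%
    summarise(value = sum(value), .groups = "drop")

  # Responses with no matching words drop out of `scores`, so start from zeros
  out <- numeric(length(response))
  out[scores$id] <- scores$value
  out
}

sample_text <- c("good and great", "awful", "nothing matched here")
derive_sentiment_vec(sample_text)
```

Carrying an id column through the tokenizing and joining steps is what lets the scores be mapped back to their original positions, including responses where no word matched.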

Cheers.
