I've had real success using tidy principles in my text classification project. Following some guides, I've been able to build a classification model with fairly strong performance (> 0.8 on sensitivity, specificity, recall, and precision). So far I've created my predictors / features, put them into a recipe, and juiced it. I've been using these resources:
Here is a truncated version of the R code that shows the model definition:
    library(tidymodels)

    # `train` and `preprocessing_recipe` are created earlier (omitted here)

    # cross-validation object
    folds <- vfold_cv(train)

    # declare a RF classification model
    rf_spec <- rand_forest(trees = 500) %>%
      set_mode("classification") %>%
      set_engine("ranger")
    rf_spec

    # build a 'workflow' by passing the model and the recipe
    rf_wf <- workflow() %>%
      add_recipe(preprocessing_recipe) %>%
      add_model(rf_spec)
    rf_wf

    # fit the model!
    rf_rs <- fit_resamples(
      rf_wf,
      folds,
      metrics = metric_set(recall, precision, sensitivity, specificity, accuracy),
      control = control_resamples(save_pred = TRUE)
    )
    rf_rs
After defining this model and fitting it, I am able to use it to classify my text! I feel great about the performance metrics so far and am working on tuning my model. But here's what I really want to know:
How can I report the 'goodness of fit' for each record? In other words, is there a way to know how well a record matches the classification it was given?
For example, if the model labels a text record as "positive" based on its features / predictors, how can I describe that particular record's fit to the "positive" class? In conventional statistics there are confidence values, confidence intervals, p-values, and so on. Any advice or resources would be helpful, thank you.
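To make the question concrete, here is a minimal sketch of one per-record measure that might qualify: the predicted class probability. It is not taken from my project. It assumes the `two_class_dat` example data from the modeldata package (columns `A`, `B`, and `Class`) in place of my text features, and a plain formula in place of my recipe. The names `rf_fit`, `scored`, and `.pred_max` are made up for illustration.

```r
library(tidymodels)

set.seed(123)

# Stand-in data (assumption): two numeric predictors and a two-level outcome
data(two_class_dat, package = "modeldata")
split <- initial_split(two_class_dat, strata = Class)
train <- training(split)
test  <- testing(split)

# Same kind of model specification as in the question
rf_spec <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger")

# Fit the workflow once on the training data
rf_fit <- workflow() %>%
  add_formula(Class ~ A + B) %>%
  add_model(rf_spec) %>%
  fit(data = train)

# Hard class labels and per-class probabilities for every record
scored <- bind_cols(
  test,
  predict(rf_fit, new_data = test),                # .pred_class
  predict(rf_fit, new_data = test, type = "prob")  # .pred_Class1, .pred_Class2
) %>%
  # probability the model assigns to whichever class it considers most likely
  mutate(.pred_max = pmax(.pred_Class1, .pred_Class2))

head(scored)
```

Each row then carries a probability for every class, so a record labelled "positive" with a probability near 1 would be a stronger match than one near 0.5. These values are model-based scores rather than confidence intervals or p-values, and whether they are well calibrated is a separate question.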