So, what I've found with multi-class models (neural nets in particular) is that if there are some easy classes and some hard classes, the model optimizes for the easy ones quickly, but in the time it takes to fit the difficult classes, the easy ones tend to overfit.
Obviously there are deployment benefits to having a single multiclass model; however, for overall training time and to minimise the amount of faffing about, it's often easier to train multiple smaller models than one big multiclass one.
You could possibly do something in Keras, though. I'm thinking you could fit a multiclass model, take the hidden layers, transfer them into fine-tuned individual models, then stitch those back together after training into a kind of softmax Frankenstein: one input vector goes through some shared hidden layer(s), which then feed into several pre-trained heads and outputs. It's a bit janky, but it should result in a multiclass model that has the power of a bunch of dedicated models.
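Here's a rough sketch of that stitching idea in `tf.keras`. Everything here is a placeholder I've made up for illustration (layer sizes, the `shared`/`head_*` names, the random data) — the point is just the mechanics: train a multiclass model, freeze its shared layer, fine-tune a one-vs-rest head per class against that frozen layer, then concatenate the per-class logits behind a single softmax.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 3   # placeholder sizes, not a recommendation
INPUT_DIM = 20

# Fake data just so the script runs end to end.
X = np.random.rand(64, INPUT_DIM).astype("float32")
y = np.random.randint(0, NUM_CLASSES, 64)

# Step 1: ordinary multiclass model with a shared hidden layer.
inputs = keras.Input(shape=(INPUT_DIM,))
shared = layers.Dense(32, activation="relu", name="shared")(inputs)
softmax_out = layers.Dense(NUM_CLASSES, activation="softmax")(shared)
base = keras.Model(inputs, softmax_out)
base.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
base.fit(X, y, epochs=1, verbose=0)

# Step 2: freeze the shared layer, then fine-tune one binary
# (one-vs-rest) head per class on top of it. Each head model only
# contains the shared layer plus its own head, so training one head
# doesn't touch the others.
base.get_layer("shared").trainable = False
head_logits = []
for c in range(NUM_CLASSES):
    h = layers.Dense(16, activation="relu", name=f"head_{c}")(shared)
    logit = layers.Dense(1, name=f"logit_{c}")(h)  # raw score for class c
    head_model = keras.Model(inputs, layers.Activation("sigmoid")(logit))
    head_model.compile(optimizer="adam", loss="binary_crossentropy")
    head_model.fit(X, (y == c).astype("float32"), epochs=1, verbose=0)
    head_logits.append(logit)

# Step 3: the Frankenstein — concatenate the trained per-class logits
# and normalise them with a single softmax, so the stitched model
# behaves like one multiclass net.
combined = layers.Softmax()(layers.Concatenate()(head_logits))
franken = keras.Model(inputs, combined)

preds = franken.predict(X, verbose=0)  # shape (64, NUM_CLASSES)
```

The sigmoid is deliberately left out of the stitched model — the softmax at the end wants raw logits, and each class's logit already carries its head's fine-tuned weights. Whether the final softmax is well-calibrated after this is another question; you might want a final end-to-end fine-tune with a small learning rate.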