So, what I've found with multi-class models (neural nets in particular) is that if there are some easy classes and some hard classes, the model optimizes for the easy ones quickly, but in the time it takes to fit the difficult classes, the easy ones tend to overfit.
Obviously there are deployment benefits to having a single multiclass model; however, for overall training time and to minimise the amount of faffing about, it's often easier to train multiple smaller models than one big multiclass one.
You could possibly do something in Keras, though. I'm thinking you could fit a multiclass model, take the hidden layers, transfer them into fine-tuned individual models, then stitch those back together after training into a kind of softmax Frankenstein: one input vector goes through some shared hidden layer(s), which then feed into several pre-trained heads and outputs. It's a bit janky, but it should result in a multiclass model that has the power of a bunch of dedicated models.
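Here's a rough sketch of that stitching idea in `tf.keras`. Everything here is a placeholder I've made up for illustration (layer sizes, the `shared`/`head_*` names, the random data) — the point is just the mechanics: train a multiclass model, freeze its shared layer, fine-tune a one-vs-rest head per class against that frozen layer, then concatenate the per-class logits behind a single softmax.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 3   # placeholder sizes, not a recommendation
INPUT_DIM = 20

# Fake data just so the script runs end to end.
X = np.random.rand(64, INPUT_DIM).astype("float32")
y = np.random.randint(0, NUM_CLASSES, 64)

# Step 1: ordinary multiclass model with a shared hidden layer.
inputs = keras.Input(shape=(INPUT_DIM,))
shared = layers.Dense(32, activation="relu", name="shared")(inputs)
softmax_out = layers.Dense(NUM_CLASSES, activation="softmax")(shared)
base = keras.Model(inputs, softmax_out)
base.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
base.fit(X, y, epochs=1, verbose=0)

# Step 2: freeze the shared layer, then fine-tune one binary
# (one-vs-rest) head per class on top of it. Each head model only
# contains the shared layer plus its own head, so training one head
# doesn't touch the others.
base.get_layer("shared").trainable = False
head_logits = []
for c in range(NUM_CLASSES):
    h = layers.Dense(16, activation="relu", name=f"head_{c}")(shared)
    logit = layers.Dense(1, name=f"logit_{c}")(h)  # raw score for class c
    head_model = keras.Model(inputs, layers.Activation("sigmoid")(logit))
    head_model.compile(optimizer="adam", loss="binary_crossentropy")
    head_model.fit(X, (y == c).astype("float32"), epochs=1, verbose=0)
    head_logits.append(logit)

# Step 3: the Frankenstein — concatenate the trained per-class logits
# and normalise them with a single softmax, so the stitched model
# behaves like one multiclass net.
combined = layers.Softmax()(layers.Concatenate()(head_logits))
franken = keras.Model(inputs, combined)

preds = franken.predict(X, verbose=0)  # shape (64, NUM_CLASSES)
```

The sigmoid is deliberately left out of the stitched model — the softmax at the end wants raw logits, and each class's logit already carries its head's fine-tuned weights. Whether the final softmax is well-calibrated after this is another question; you might want a final end-to-end fine-tune with a small learning rate.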