Choosing between models specified with and without a single attribute given future data structure

This is perhaps a bit of a hypothetical question, but I came across it recently at work and don't have a clear answer in mind.

The dataset is small and highly imbalanced. We have many numerical variables and one categorical variable that specifies the type of business, with four distinct categories. Only one of them (let's call it category X) is relevant for future predictions, since the product is tailored only for category X. It is worth noting that there are significant differences in predictive power across the levels of that categorical variable. The best approach would be to include only category X in the training sample. However, due to the high imbalance, there would not be enough top-class observations of the target variable, and the entire set would get really small. We eventually decided to train two models on the entire set (including all levels of the categorical variable): one including the categorical variable and one excluding it.

At the moment we have two models:

  1. With the categorical variable (called A)
  2. Without the categorical variable (called B)

Model A has slightly better performance than model B because the categorical variable was generally relevant to the problem. However, we should also consider its generalization power with regard to the specific type of clients it's built for.

The main question is: which model is more suited for this given problem?

  1. Is it model A because of including the categorical variable and accounting for differences in performance across levels of the categorical variable?

  2. Is it model B because of not including the categorical variable and making it more suitable for future predictions (categorical variable will always have one level only) which could lead to better generalisation?

I realize that my question is a bit hypothetical, but I'm not able to disclose more information. I would be very grateful if you could share your thoughts or articles that would help us make a decision. I'm not giving my preference at the moment so as not to bias your answers :wink: Thank you!

What kind of model are you running? You mention that model A has better performance, but did you determine this solely from predictive performance, or from statistical tests showing that the full model (model A) is significantly better than the restricted model (model B)? If your model is a simple linear model, you could perform a likelihood ratio test, since model B is simply a restricted version of model A. If you are using a more complicated model, this may not be as straightforward, but it may be worth looking into. This type of test will tell you whether you are really gaining enough improvement in model performance to warrant the additional parameters.
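As a rough illustration of the likelihood ratio test mentioned above, here is a minimal sketch for the simple case of an ordinary least-squares model with Gaussian errors. Everything in it is an assumption made for illustration: the data are synthetic, the names (`x_num`, `cat`, `gaussian_loglik`) are hypothetical, and a penalised or non-Gaussian model would need a different treatment.

```python
import numpy as np

def gaussian_loglik(X, y):
    """Maximised Gaussian log-likelihood of an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    sigma2 = resid @ resid / n  # maximum-likelihood estimate of the error variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

# Synthetic stand-in data (assumption): two numerical predictors and a
# four-level categorical variable, one level of which shifts the response.
rng = np.random.default_rng(0)
n = 200
x_num = rng.normal(size=(n, 2))
cat = rng.integers(0, 4, size=n)
dummies = np.eye(4)[cat][:, 1:]  # three dummy columns, level 0 is the reference
y = 1.0 + x_num @ np.array([0.5, -0.3]) + 0.8 * (cat == 2) + rng.normal(size=n)

intercept = np.ones((n, 1))
X_b = np.hstack([intercept, x_num])           # restricted model B
X_a = np.hstack([intercept, x_num, dummies])  # full model A

# LR statistic: twice the gain in maximised log-likelihood from adding the dummies.
lr_stat = 2 * (gaussian_loglik(X_a, y) - gaussian_loglik(X_b, y))
df = X_a.shape[1] - X_b.shape[1]  # number of extra parameters in A (3 here)

# 95th percentile of the chi-square distribution with 3 degrees of freedom.
CRIT_5PCT_DF3 = 7.815
reject = lr_stat > CRIT_5PCT_DF3
print(f"LR statistic = {lr_stat:.2f} on {df} df; reject B at 5%: {reject}")
```

If SciPy is available, `scipy.stats.chi2.sf(lr_stat, df)` gives the p-value directly instead of comparing against a hard-coded critical value.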

We're running Elastic-Net. There's really no need for running statistical tests here because it's evident from the resampled CV profile and test set performance. Model A is definitely a better performing model than B so that's not really the essence of the question.

I'm rather trying to tackle this from a sample building standpoint while keeping in mind to what sample the future model will be applied. That's why I believe answering these questions is crucial:

  1. Is it model A because of including the categorical variable and accounting for differences in performance across levels of the categorical variable?

  2. Is it model B because of not including the categorical variable and making it more suitable for future predictions (categorical variable will always have one level only) which could lead to better generalisation?
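One way to make the two options above comparable is to train both models on all categories but score them only on held-out category X rows, since that is the population the model will be applied to. The sketch below is only an illustration under stated assumptions: the data are synthetic, category X is arbitrarily coded as level 0, a closed-form ridge fit stands in for Elastic-Net to keep the example dependency-free, and the helper names (`ridge_fit`, `oof_scores`, `auc`) are hypothetical.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney); assumes both classes present and no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge coefficients (the intercept is penalised too, for brevity)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def oof_scores(X, y, k=5, seed=0):
    """Out-of-fold linear scores: each row is scored by a model that never saw it."""
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = np.empty(len(y))
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)  # train on ALL categories
        scores[fold] = X[fold] @ ridge_fit(X[train], y[train])
    return scores

# Synthetic stand-in data (assumption): imbalanced binary target,
# two numerical predictors, four business types.
rng = np.random.default_rng(1)
n = 600
x_num = rng.normal(size=(n, 2))
cat = rng.integers(0, 4, size=n)
logit = -2.0 + 1.0 * x_num[:, 0] - 0.5 * x_num[:, 1] + np.array([0.5, -0.5, 0.0, 1.0])[cat]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

intercept = np.ones((n, 1))
X_b = np.hstack([intercept, x_num])                        # model B: no category
X_a = np.hstack([intercept, x_num, np.eye(4)[cat][:, 1:]])  # model A: with category dummies

# Same folds for both models (same seed); evaluate on category X rows only.
is_x = cat == 0
auc_a = auc(oof_scores(X_a, y)[is_x], y[is_x])
auc_b = auc(oof_scores(X_b, y)[is_x], y[is_x])
print(f"Category X only: AUC(A) = {auc_a:.3f}, AUC(B) = {auc_b:.3f}")
```

The out-of-fold scores are pooled before computing the metric, so a fold that happens to contain no positive category X rows does not break the evaluation. Any metric suited to the imbalance could replace AUC here.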