I am currently building a model on data per "loan" (let's just say it's a very special type of loan) for a number of large customers. Each customer can have anywhere from 100 to 30,000 of these "loans".
The goal is to predict whether each loan will default (yes/no). The model cannot be built at the customer level; it has to be per loan. Note that it also has to work on loans from new customers. The plan is to use both loan-specific variables (size of the loan, etc.) and customer-specific variables in the model.
The problem: since some customers have a large number of loans, they dominate the data. In fact, one customer accounts for 30% of all the loans in the data set. If I just pool the loans together and throw them into a standard model (e.g. a logistic regression), the data from these large customers will surely dominate the predictions.
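One thing I considered for the pooled logistic regression is downweighting loans from large customers so that every customer contributes the same total weight. A base-R sketch on simulated data (the column names `cust_id`, `loan_size`, `default` are placeholders for my real variables):

```r
set.seed(42)
# simulated data: customer A has 6x as many loans as customer C
d <- data.frame(
  cust_id   = rep(c("A", "B", "C"), times = c(600, 300, 100)),
  loan_size = rlnorm(1000)
)
d$default <- rbinom(nrow(d), 1, plogis(-2 + 0.3 * log(d$loan_size)))

# weight each loan by 1 / (number of loans its customer has),
# so every customer sums to the same total weight ...
w <- 1 / as.numeric(table(d$cust_id)[d$cust_id])
# ... then rescale so the weights sum to n, keeping the usual scale
w <- w * nrow(d) / sum(w)

# quasibinomial gives the same point estimates as binomial here,
# but avoids glm's warning about non-integer weights
fit <- glm(default ~ log(loan_size), family = quasibinomial(),
           data = d, weights = w)
```

But I'm unsure whether this kind of reweighting is a sound substitute for actually modelling the grouping structure.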
My first thought was to use some sort of mixed-effects model (e.g. a random intercept per customer) via lme4, but I struggle to see how this works here, given that the model should also be able to predict loans for new customers (without retraining).
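As I understand it, lme4's `predict()` can produce population-level predictions for unseen groups via `re.form = NA` (fixed effects only). To make sure I understand the mechanics, here is the same idea spelled out in base R on simulated data, with customer as a fixed effect for illustration (no lme4 needed):

```r
set.seed(1)
# simulate 5 known customers, each with its own baseline default rate
n_cust  <- 5; n_loans <- 200
cust <- factor(rep(seq_len(n_cust), each = n_loans))
u    <- rnorm(n_cust, sd = 1)            # customer-level intercept shifts
x    <- rnorm(n_cust * n_loans)          # a loan-level covariate
y    <- rbinom(length(x), 1, plogis(-1 + u[cust] + 0.8 * x))

fit  <- glm(y ~ x + cust, family = binomial)
beta <- coef(fit)

# for a NEW customer there is no estimated customer intercept, so we fall
# back on the shared part of the model: intercept + loan covariates only
# (with treatment contrasts the intercept here is customer 1's baseline;
# glmer's fixed intercept would be the population average -- this is only
# to illustrate the mechanics of predicting without a group effect)
p_new <- plogis(beta[["(Intercept)"]] + beta[["x"]] * 0.5)
```

Is this fixed-effects-only prediction for new customers considered acceptable practice, or does it defeat the purpose of the mixed model?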
I also initially wanted to use GBM or random forest, but I'm not sure whether these algorithms have any suitable way of dealing with grouped/multilevel data.
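Regardless of the model class (logistic regression, GBM, random forest), I assume the honest way to check whether the model generalizes to new customers is to cross-validate by holding out whole customers, not random loans. A base-R sketch with simulated data and placeholder column names:

```r
set.seed(7)
# simulated data: 10 customers with varying numbers of loans
cust_id <- rep(paste0("c", 1:10), times = sample(50:150, 10, replace = TRUE))
n <- length(cust_id)
d <- data.frame(cust_id, loan_size = rlnorm(n))
d$default <- rbinom(n, 1, plogis(-2 + 0.3 * log(d$loan_size)))

# leave-one-customer-out: each test fold mimics a brand-new customer
aucs <- sapply(unique(d$cust_id), function(cid) {
  train <- d[d$cust_id != cid, ]
  test  <- d[d$cust_id == cid, ]
  fit   <- glm(default ~ log(loan_size), family = binomial, data = train)
  p     <- predict(fit, newdata = test, type = "response")
  # simple rank-based AUC, to avoid pulling in extra packages
  mean(outer(p[test$default == 1], p[test$default == 0], ">"))
})
```

Is grouped cross-validation like this the right evaluation scheme for this setup?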
Does anyone have advice on how to approach this problem, and which R packages could be used? Note that the model is going to be used in production.