Modelling with multilevel data

taran · August 2, 2018, 9:41am

Hello,

I am currently building a model on data recorded per "loan" (let's just say it's a very special type of loan) for a number of large customers. Each customer can have between 100 and 30,000 of these "loans".

The goal is to predict whether these loans will default (yes/no). The model cannot be built at the customer level; it has to be per loan. Note that the model also has to work on loans for new customers. The plan is to use both loan-specific variables (size of the loan, etc.) and customer-specific variables in the model.

The problem: Since some customers have a large number of loans, these customers dominate the data. In fact, one customer has 30% of all the loans in the data set. If I just pool these loans together and throw them in a standard model (e.g. a logistic regression), surely the data from these large customers will completely dominate the predictions.
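One possible way to limit this dominance, sketched here as an assumption rather than a settled answer, is to weight each loan by the inverse of its customer's loan count, so that every customer carries the same total weight in the fit. The function name `customer_balanced_weights` and the toy customer IDs are hypothetical, and the sketch is in plain Python only to show the arithmetic; the same calculation carries over to R.

```python
from collections import Counter

def customer_balanced_weights(customer_ids):
    """Return one weight per loan so that every customer has the same total weight."""
    counts = Counter(customer_ids)   # number of loans per customer
    n_customers = len(counts)
    n_loans = len(customer_ids)
    # Each customer gets a total weight of n_loans / n_customers, split evenly
    # across its loans, so the weights still sum to n_loans overall.
    return [n_loans / (n_customers * counts[c]) for c in customer_ids]

# Hypothetical toy data: customer "A" holds 6 of the 8 loans.
customer_ids = ["A"] * 6 + ["B", "C"]
weights = customer_balanced_weights(customer_ids)
```

Weights like these can be passed as per-observation weights to most model-fitting routines. Whether every customer should count equally is itself a modelling decision, so a milder scheme (for example, capping the number of loans per customer) may suit the data better.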

My first thought was to use some sort of mixed-effects model, e.g. with lme4, but I struggle to understand how it can be used here, given that the model should also be able to predict loans for new customers (without retraining the model).

Also, I initially wanted to use GBM or Random Forest, but I'm not sure if these algorithms have any suitable methods for dealing with multilevel data.

Does anyone have any advice on how to solve this problem and which R packages could be used? Note that the model is going to be used in production.

konradino · August 2, 2018, 2:35pm

You shouldn't just plug the data into the algorithm like this. First you should create aggregations of your data and most probably roll it up to the user level. You would then have a number of aggregated variables that describe the user and that user's history: e.g. number of loans, max/min/avg DPD (days past due), latest DPD, repayment history parameters and so on.

On top of that, once you have a proper user-level structure, you can join that user's most recent loan-level predictors as well. The most recent history would play the most crucial role here. You should also take into account that you most probably have both new and renewed users, which means that loans may have been granted to renewed users not only on financial/credit parameters but also as a business decision. You should account for that when building your aggregated dataset.
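The roll-up described above could look roughly like the following. This is a minimal sketch in plain Python; the field names `customer_id`, `opened` and `dpd`, the function name and the toy rows are all hypothetical, since the real data layout is not given in this thread.

```python
from collections import defaultdict

def aggregate_by_customer(loans):
    """Roll loan-level rows up to one summary row per customer."""
    grouped = defaultdict(list)
    for loan in loans:
        grouped[loan["customer_id"]].append(loan)

    summary = {}
    for customer, rows in grouped.items():
        # ISO-formatted dates sort chronologically when compared as strings.
        rows = sorted(rows, key=lambda r: r["opened"])
        dpd = [r["dpd"] for r in rows]
        summary[customer] = {
            "n_loans": len(rows),
            "max_dpd": max(dpd),
            "min_dpd": min(dpd),
            "avg_dpd": sum(dpd) / len(dpd),
            "latest_dpd": rows[-1]["dpd"],  # DPD of the most recently opened loan
        }
    return summary

# Hypothetical toy data (field names are assumptions, not from the thread).
loans = [
    {"customer_id": "A", "opened": "2018-01-10", "dpd": 0},
    {"customer_id": "A", "opened": "2018-03-05", "dpd": 30},
    {"customer_id": "A", "opened": "2018-02-01", "dpd": 5},
    {"customer_id": "B", "opened": "2018-04-20", "dpd": 12},
]
features = aggregate_by_customer(loans)
```

If aggregates like these are joined back onto individual loans, they should be computed only from loans that predate the loan being scored; otherwise the outcome being predicted leaks into its own predictors.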

Max · August 2, 2018, 7:06pm

You might also filter your data to the people who are relevant to include in your training set. For example, if the model were being applied to first-time loan applications, you might want your training set to reflect that too. Are those customers with repeat loans "like" the other people that the model would make predictions on?

taran · August 3, 2018, 6:36am

Thank you for your replies!

Sorry if I wasn't clear, but it's an absolute requirement that the model is built per loan, not per customer (a customer can have two loans with completely different risk profiles). I can't say too much due to confidentiality issues, but these are not exactly normal loans (I hope I didn't confuse you too much with the terminology).

And yes, new customers can also get these repeat loans with many observations, so filtering them out would be unfortunate in my opinion. The model is initially going to be applied to all loans, both new and existing.

konradino · August 3, 2018, 8:06am

If that's the case, then I would definitely consider building two models: one for new customers and a separate one for renewed customers. The renewed-customer model could also include additional variables about the past performance of that client's loans, which would make it more powerful. Keep in mind that combining both client groups could bias your model, because they are not scored/underwritten in exactly the same way.

What do you mean by saying they are not exactly normal loans? Are they credit lines? More of an overdraft, or what exactly? Is a new loan granted to a given client only after the previous one was fully repaid?
Konrad Semsch