Multiple logistic regression model + risk scores to calculate employee attrition


I work with human resources data and my goal is to build a logistic regression model to predict employee attrition (employees have a status of active = 0 or left the business = 1). With that, I want to calculate a risk score for each individual employee that shows whether employees with different characteristics have a high, medium or low risk of leaving the business.

I have about 40 variables for that, e.g. overtime, sick leaves taken, compensation data, department etc., numerical and categorical data alike.

Using numerical variables to build the model is sort of ok (I have been studying R for a while but I am far from being an expert). However, I can hardly understand how exactly to use and interpret categorical predictors in a multiple logistic regression model. Explanations I have found so far seem to be quite rough or vague about this, so I do not get how to apply this in practice. Can you please suggest any good source of information for someone who is learning R and is moderately comfortable with statistics in general?

My other question is about the risk score calculation. Could you give me any hint on what method I should use to come up with risk scores once my model is accurate enough?
What I am thinking is e.g. employee001 works as a developer (+60 risk score), has been promoted in the last 12 months (-15 risk score), but has a commute time of more than 45 minutes to work (+25 risk score), so has a risk score of 70, while employee002 has only 20, so employee001 has a high chance to leave while employee002 has a low one. What would be the appropriate steps to come up with something like this?
Appreciate any help on the above.

Hi, and welcome!

It's worth getting a copy of Applied Logistic Regression, 3rd Edition, by David W. Hosmer Jr., Stanley Lemeshow and Rodney X. Sturdivant (2013), especially since you have a stats background. Unfortunately, it's code agnostic, with no examples in any language. I'm working to remedy that in R, but it's slow going.

The rule of thumb for a categorical variable is to treat it as continuous if it has more than a dozen or so levels, and to create dummy binary variables if it has fewer.

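For example, if a variable can take on one of three values, say, red, yellow, blue, you would create substitute binary variables of those names. With an intercept in the model only two of the three go into the fit, because the third acts as the reference level; R does this automatically when the variable is a factor.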
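
Here is a minimal sketch of that recoding in base R. The `colour` vector is made up for illustration; `model.matrix()` shows the indicator columns that a model formula would generate from a factor.

```r
# Hypothetical three-level categorical predictor (illustration only)
colour <- factor(c("red", "yellow", "blue", "red", "blue"))

# model.matrix() builds the 0/1 indicator columns that glm() would use.
# Levels are sorted alphabetically, so "blue" becomes the reference level
# and is absorbed into the intercept.
dummies <- model.matrix(~ colour)
print(dummies)
```

Declaring the column as a factor and letting the formula interface do the expansion is usually less error-prone than creating the 0/1 columns by hand.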

The risk metric that comes out of a logistic regression is the odds ratio, which is just what it sounds like: the odds of the outcome Y in one group divided by the odds in another. An odds ratio of 1 means that the odds of Y are the same with or without the independent variable in question (one of X_1 ... X_n). OR > 1 means higher odds and OR < 1 means lower odds: 1\frac 1 2 means the odds are one and a half times as high, \frac 1 2 means half as high, etc. An odds ratio is never negative.
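
To make the arithmetic concrete, here is a small sketch with made-up attrition probabilities for two groups; the numbers and variable names are assumptions for illustration, not results from any data.

```r
# Hypothetical probabilities of leaving, for illustration only
p_overtime    <- 0.30   # employees who work overtime
p_no_overtime <- 0.20   # employees who do not

# Odds are p / (1 - p); the odds ratio compares the odds of two groups
odds <- function(p) p / (1 - p)
or <- odds(p_overtime) / odds(p_no_overtime)

# An OR of 1 would mean equal odds in both groups.
# In a fitted model the same quantity is exp() of a coefficient,
# so the coefficient itself is on the log-odds scale.
log_or <- log(or)
print(c(OR = or, log_OR = log_or))
```

Note that the odds ratio describes odds, not probabilities: the probability here rises from 0.20 to 0.30, which is not the same multiple as the OR.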

If you have enough historical data, you'll want to partition it into a training set and validation set and use the goodness of fit tests to see how well the model does in practice.
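
A minimal sketch of such a partition, using simulated data; the data frame `hr`, its columns and the 70/30 split are assumptions for illustration.

```r
set.seed(42)

# Simulated stand-in for the HR data (illustration only)
n  <- 200
hr <- data.frame(
  left     = rbinom(n, 1, 0.3),   # 1 = left the business, 0 = active
  overtime = rnorm(n, 10, 3)
)

# Randomly assign 70% of rows to training, the rest to validation
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- hr[train_idx, ]
valid <- hr[-train_idx, ]
```

The model is then fitted on `train` only, and its predictions are checked against the known outcomes in `valid`.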

Come on back when you have specific questions, and please see the FAQ: What's a reproducible example (`reprex`) and how do I do one? on how to attract good answers.


Many thanks for the response, very informative.
What do you mean by treating categorical variables as continuous? Can you give an example, please?

I did the training and test sets with the continuous variables already, and will come back with more specific questions and a reprex soon.

Good question.

Assume three of the variables classify some attribute.

  1. Gender, coded as female/male
  2. Age cohort, coded as under 18, 18-35, 36-65, 66 and older
  3. State of residence, coded as AL, AK, AR, ... WV

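In the first case, we would create two binary variables, female and male, each coded 1/0. Only one of them goes into a model that has an intercept; the other is the reference level.

In the second case, we would create four binary variables, "minor", "adult", "middle_aged", "senior", again leaving one out of the model as the reference level.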

In the third case, we would not recode.

Based on your kind answers, I am working on reshaping the data for this and recoding all the categorical variables as binary variables.
I have 2 additional questions:

  1. How do I determine which variables to keep and which to leave behind? I read about information value (IV) calculation and weight of evidence (WOE) here, but I hardly understand the concept, e.g. what can be considered a good or a bad predictor?

  2. How will I get the odds ratio for an individual? Do I have to calculate it somehow from the coefficients of the model?

A more descriptive explanation of WOE/IV may help you better understand those alternatives to the logistic regression approach. I haven't used them, so can only comment on logistic regression diagnostics.

To build a model for the binary outcome (the response or dependent variable), conventionally designated Y, as a function of the covariates (the independent variable or variables), conventionally designated X_1 ... X_n, there are alternative approaches:

  1. Purposeful selection of covariates
  2. Stepwise additive
  3. Stepwise subtractive

Depending on the number of covariates, I prefer to begin with the full model (every candidate covariate included) and use the stepwise subtractive approach:

fit <- glm(Y ~ x1 + x2 + x3 ..., family = binomial)

summary(fit) identifies candidate covariates for deletion:

  1. Those that have a p-value greater than the pre-selected $\alpha$
  2. Those that have NA p-values due to missingness
  3. Those that are collinear
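
As a rough sketch of that screening step, the following fits a logistic model to simulated data and lists the covariates with NA coefficients and those with p-values above α (the usual reading of "candidate for deletion"). The data, the variable names and `alpha = 0.05` are all assumptions for illustration.

```r
set.seed(1)

# Simulated data: only x1 truly affects Y; x4 is an exact multiple of x1
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$x4 <- 2 * d$x1
d$Y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * d$x1))

# family = binomial is what makes glm() a logistic regression
fit <- glm(Y ~ x1 + x2 + x3 + x4, family = binomial, data = d)

# Covariates glm() could not estimate (perfectly collinear) get NA coefficients
aliased <- names(coef(fit))[is.na(coef(fit))]

# Covariates whose Wald p-value is above the chosen alpha
alpha <- 0.05
pvals <- summary(fit)$coefficients[, "Pr(>|z|)"]
weak  <- setdiff(names(pvals)[pvals > alpha], "(Intercept)")

print(list(aliased = aliased, weak = weak))
```

Removing covariates one at a time and refitting is safer than dropping the whole list at once, because p-values shift as the model changes.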

For the surviving covariates, I'd usually then check for interactions

fit2 <- glm(Y ~ x1 + x2 + x1:x2 ..., family = binomial)

and make the same assessment. The end result is the main effects model.
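
One way to make that assessment for an interaction is a likelihood-ratio test between the model with and without it. This is a sketch on simulated data, with made-up variable names; it is one reasonable choice rather than the only one.

```r
set.seed(3)

# Simulated data with a built-in x1 by x2 interaction (illustration only)
n <- 600
d <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.5))
d$Y <- rbinom(n, 1, plogis(-0.3 + 0.5 * d$x1 + 0.4 * d$x2 + 1.0 * d$x1 * d$x2))

main  <- glm(Y ~ x1 + x2,         family = binomial, data = d)
inter <- glm(Y ~ x1 + x2 + x1:x2, family = binomial, data = d)

# Likelihood-ratio (deviance) test: does the interaction improve the fit?
lrt     <- anova(main, inter, test = "Chisq")
p_inter <- lrt[2, "Pr(>Chi)"]
print(lrt)
```

A small `p_inter` suggests keeping the interaction; otherwise the simpler model is usually preferable.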

To calculate the odds ratio I use this function

odr <- function(x) {
    # exponentiate the coefficients and their confidence limits
    exp(cbind(OR = coef(x), confint(x)))
}

where x is a fitted glm model.
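
A short usage sketch, on simulated data with made-up variable names. The helper is repeated so the example runs on its own.

```r
set.seed(2)

# Odds ratios with confidence intervals from a fitted glm
odr <- function(x) {
  exp(cbind(OR = coef(x), confint(x)))
}

# Simulated data (illustration only)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$Y <- rbinom(n, 1, plogis(-0.2 + 0.8 * d$x1 - 0.5 * d$x2))

fit <- glm(Y ~ x1 + x2, family = binomial, data = d)
tab <- odr(fit)
print(round(tab, 2))
```

Each row is one coefficient; an interval that includes 1 means the data are compatible with no effect for that covariate.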

The next step is to assess goodness of fit. The Hosmer-Lemeshow goodness of fit test has a null hypothesis H_0 that the model fits the data, so a low p-value is evidence of a poor fit, and a high p-value means the test found no evidence of lack of fit. The test sorts the observations into deciles of predicted probability for Y and compares the observed number of events in each decile with the expected number.
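
To show what the test does under the hood, here is a hand-rolled sketch in base R. The function name `hl_test` and the simulated data are assumptions for illustration; packaged versions exist (for example `hoslem.test()` in the ResourceSelection package), and this sketch is not meant to replace them.

```r
set.seed(4)

# Hosmer-Lemeshow statistic: y = observed 0/1 outcome, p = fitted probability
hl_test <- function(y, p, g = 10) {
  # Group observations by quantiles of the fitted probability
  breaks <- unique(quantile(p, probs = seq(0, 1, length.out = g + 1)))
  grp    <- cut(p, breaks = breaks, include.lowest = TRUE)

  obs  <- tapply(y, grp, sum)      # observed events per group
  expd <- tapply(p, grp, sum)      # expected events per group
  n    <- tapply(y, grp, length)   # group sizes

  stat <- sum((obs - expd)^2 / (expd * (1 - expd / n)))
  df   <- length(obs) - 2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated data and a correctly specified model (illustration only)
n <- 1000
d <- data.frame(x1 = rnorm(n))
d$Y <- rbinom(n, 1, plogis(-0.5 + 1.0 * d$x1))
fit <- glm(Y ~ x1, family = binomial, data = d)

res <- hl_test(d$Y, fitted(fit))
print(res)
```

A small p-value would point to lack of fit; a large one only says the test did not detect any.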

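There is no odds ratio for an individual. An individual's outcome is simply 1 or 0, while the OR is calculated from a population and compares groups within it. What the model does give you for each individual is a predicted probability of Y, from predict() with type = "response", and that is the natural basis for a risk score.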
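
A sketch of how per-employee risk scores could be derived from a fitted model. The simulated data, the variable names and the cut-offs 0.2 and 0.5 for the low/medium/high bands are all assumptions for illustration; sensible cut-offs depend on the business context.

```r
set.seed(11)

# Simulated HR data (illustration only)
n <- 300
d <- data.frame(overtime = rnorm(n), promoted = rbinom(n, 1, 0.3))
d$left <- rbinom(n, 1, plogis(-1 + 0.9 * d$overtime - 0.7 * d$promoted))

fit <- glm(left ~ overtime + promoted, family = binomial, data = d)

# Predicted probability of leaving for each employee
d$risk <- predict(fit, type = "response")

# Hypothetical banding of the probabilities into three risk levels
d$band <- cut(d$risk, breaks = c(0, 0.2, 0.5, 1),
              labels = c("low", "medium", "high"), include.lowest = TRUE)

# Additive contributions on the log-odds scale: coefficient times value.
# Their row sum is the linear predictor, and plogis() of it is the risk.
contrib <- sweep(model.matrix(fit), 2, coef(fit), "*")

print(head(cbind(round(contrib, 2), risk = round(d$risk, 2))))
print(table(d$band))
```

The `contrib` columns are the closest analogue to "points" per characteristic: they add up on the log-odds scale, not on the probability scale.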

I will use the stepwise subtractive method as you suggested, but I ran into a problem at the very beginning.
After recoding all my categorical variables to binary, I have about 1400 variables. If I run the fully saturated model, I get either NAs or p-values with the highest significance (***).
Should this change after I remove all the variables that had NA p-values?
Regarding checking the interactions for the surviving covariates, does this mean that I will have to check all the possible combinations of the remaining variables, or is there any method to shorten the list (e.g. would it make sense to only check the interactions that I think make sense)?

That many is surely daunting. The NAs likely indicate either all 1 or all 0 cases or collinearity with other covariates. The olsrr package has a function that will generate all possible combinations of covariates, but it is built around lm() fits and can be very computationally expensive.
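
Before fitting, it can help to screen the design matrix for columns that never vary and for columns that are linear combinations of others. This is a rough sketch with simulated data and made-up column names; it reads "all 1 or all 0" as a constant column, which is only one possible interpretation.

```r
set.seed(5)

# Simulated predictors, including a constant column and a redundant one
d <- data.frame(
  dept     = factor(sample(c("dev", "hr", "sales"), 100, replace = TRUE)),
  all_ones = 1L,
  tenure   = rnorm(100)
)
d$tenure_months <- d$tenure * 12   # exact multiple of tenure

X <- model.matrix(~ dept + all_ones + tenure + tenure_months, data = d)

# Columns with a single distinct value carry no information
constant_cols <- colnames(X)[apply(X, 2, function(col) length(unique(col)) == 1)]
constant_cols <- setdiff(constant_cols, "(Intercept)")

# Pivoted QR decomposition: columns pushed past the rank are redundant
q <- qr(X)
redundant <- colnames(X)[q$pivot[seq_len(ncol(X)) > q$rank]]

print(list(constant = constant_cols, redundant = redundant))
```

Dropping the flagged columns before calling glm() should remove the NA coefficients that come from this kind of redundancy.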

Subtractive doesn't seem promising in this case. Additive isn't much better, considering that you'd need some principled rationale for choosing the order, which leaves purposeful, based on domain knowledge: choose, test, discard and add until the total deviance reaches a stable low. The question should always be with you: what does adding this covariate tell me that I didn't already know?
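
A sketch of that choose, test, keep-or-discard loop, tracking the deviance as covariates are tried in an order set by domain knowledge. The data, the candidate order and the 0.05 threshold are assumptions for illustration.

```r
set.seed(7)

# Simulated data: overtime and commute matter, noise does not
n <- 400
d <- data.frame(overtime = rnorm(n), commute = rnorm(n), noise = rnorm(n))
d$left <- rbinom(n, 1, plogis(-1 + 1.0 * d$overtime + 0.8 * d$commute))

# Hypothetical order in which domain knowledge suggests trying covariates
candidates <- c("overtime", "commute", "noise")

fit     <- glm(left ~ 1, family = binomial, data = d)   # intercept only
history <- data.frame(term = "(none)", deviance = deviance(fit))

for (term in candidates) {
  bigger <- update(fit, as.formula(paste(". ~ . +", term)))

  # Drop in deviance, compared against a chi-squared distribution
  drop  <- deviance(fit) - deviance(bigger)
  df    <- df.residual(fit) - df.residual(bigger)
  p_val <- pchisq(drop, df, lower.tail = FALSE)

  if (p_val < 0.05) {            # keep the covariate only if it adds something
    fit     <- bigger
    history <- rbind(history, data.frame(term = term, deviance = deviance(fit)))
  }
}

print(history)
```

When the deviance stops falling by a meaningful amount, further covariates are not telling you much you did not already know.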

Makes sense. Domain knowledge is no problem, I am better at that than at stats :wink:
I will continue to work on this and return here when I encounter the next obstacle. Thank you so much for your help so far, really appreciate it.

This is solved. Thanks very much for your guidance, it was very helpful!
