Employee Churn Prediction "Outlier Decision"

Hi, I'm working on an employee churn prediction model and I have a question about the salary outliers.
I have a dataset of 960 observations, and 110 of them are salary outliers. I don't want to remove them, because they make up a large share of the data, but if I leave them in they could hurt my prediction model. And if I modify them, for example by replacing them with the mean, that would also affect my final results, because I would no longer be working with the original data.
What would you recommend in this case? Thank you very much in advance.


What makes you think those 110 observations are "outliers"? Maybe you have a salary distribution with a long tail. Is there a missing variable that would help explain why more than 10% of the population has such different salaries? I would try not to discard or substitute anything just yet.


Gotta tell ya, if over 10% of your data is "outliers", then they ain't outliers. That's just your data.

Try transforming the data, for example using a log transform, and they will probably not look like "outliers" anymore.
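To make that concrete, here is a rough sketch of what I mean. The salaries are synthetic (drawn from a log-normal distribution as a stand-in for your real column, which I haven't seen), and `iqr_outlier_count` is just a helper name I made up. It counts the points a boxplot would flag, before and after taking logs:

```python
import numpy as np

def iqr_outlier_count(values):
    """Count points outside the usual 1.5 * IQR boxplot fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((values < lower) | (values > upper)))

# Assumption: synthetic right-skewed salaries, NOT the poster's data.
rng = np.random.default_rng(0)
salary = rng.lognormal(mean=10.5, sigma=0.6, size=960)

raw_count = iqr_outlier_count(salary)          # flagged on the raw scale
log_count = iqr_outlier_count(np.log(salary))  # flagged after a log transform

print("flagged on raw scale:", raw_count)
print("flagged on log scale:", log_count)
```

On skewed data like this, the log scale should flag far fewer points. If your real salaries behave the same way, you can feed `log(salary)` to the model as a feature and keep all 960 rows.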


It might be that you need one model for the majority population and another for your top performers.


Yeah, maybe it's because of the job they're working in. But I did look at the boxplot:

Thanks mate, I will try that.

But just because they are marked as outliers in a boxplot (meaning they lie more than 1.5 times the IQR below the first quartile or above the third quartile) doesn't mean you should remove them from your data set. That is important information, and as @phil_hummel already pointed out, there might be a good and maybe even interesting explanation for it that is worth investigating further instead of ignoring those observations.
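One way to investigate is to apply the same 1.5 * IQR rule within each group of a candidate explanatory variable instead of over the whole column. The sketch below uses a tiny made-up data set with a hypothetical `job_level` variable (your real column names and values will differ), and flags points rather than dropping them:

```python
import numpy as np

def iqr_outlier_mask(values):
    """Boolean mask of points outside the 1.5 * IQR boxplot fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Assumption: toy data, salaries in thousands, with a hypothetical job_level.
job_level = np.array(["staff"] * 16 + ["manager"] * 3)
salary = np.array(list(range(30, 46)) + [80, 85, 90], dtype=float)

# Rule applied to the whole column: the managers get flagged.
flag_overall = iqr_outlier_mask(salary)

# Rule applied separately within each job level.
flag_in_group = np.zeros(salary.shape, dtype=bool)
for level in np.unique(job_level):
    in_level = job_level == level
    flag_in_group[in_level] = iqr_outlier_mask(salary[in_level])

print(int(flag_overall.sum()), int(flag_in_group.sum()))  # prints: 3 0
```

In this toy example the three "outliers" are simply the managers, and none of them is unusual within their own group. If something similar holds in your data, the variable that explains the salaries belongs in the model, and the rows should stay.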
