Hi, I'm working on an employee churn prediction project and I have a question about salary outliers:
I have a dataset of 960 rows, and 110 of the salary values are outliers. My question is what to do with them. I don't want to remove them, because they are a large part of the total, but if I leave them in they could negatively affect my prediction model. And if I modify them, for example by replacing them with the mean, that should also affect my final results, because I would no longer be working with the original data.
What would you recommend in this case? Thank you very much in advance.
What makes you think those 110 observations are "outliers"? Maybe you have a salary distribution with a long tail (or tails). Is there a missing variable that would help explain why more than 10% of the population has such different salaries? I would try not to discard or substitute anything just yet.
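One quick way to check the long-tail idea is to count how many points the boxplot rule flags on the raw scale and again on a log scale. This is only a sketch: the salaries below are synthetic (drawn from a log-normal distribution, which is an assumption, not your data), and `count_iqr_flags` is a hypothetical helper name.

```python
import math
import random
import statistics

def count_iqr_flags(values, k=1.5):
    """Count values outside the boxplot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return sum(v < q1 - k * iqr or v > q3 + k * iqr for v in values)

# Synthetic, made-up salaries with a long right tail (assumption for illustration).
random.seed(0)
salaries = [random.lognormvariate(10.5, 0.8) for _ in range(960)]

flags_raw = count_iqr_flags(salaries)
flags_log = count_iqr_flags([math.log(s) for s in salaries])
print(f"flagged on raw scale: {flags_raw}, on log scale: {flags_log}")
```

If far fewer points are flagged after the log transform, the "outliers" are probably just the tail of a skewed distribution rather than bad records.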
Just because they are marked as outliers in a boxplot (meaning they lie more than 1.5 times the IQR below the first quartile or above the third quartile) doesn't mean you should remove them from your data set. They are important information, and as @phil_hummel already pointed out, there might be a good and maybe even interesting explanation for them that is worth investigating further instead of ignoring them.
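For reference, the boxplot rule can be reproduced by hand so that the points are flagged for inspection rather than dropped. A minimal sketch, assuming made-up salary values and a hypothetical helper `iqr_fences`:

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) boxplot fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up salaries for illustration only.
salaries = [30000, 32000, 35000, 36000, 38000, 40000, 41000, 45000, 120000, 150000]

lower, upper = iqr_fences(salaries)
# Keep every row; just record which ones the rule would flag.
flagged = [s for s in salaries if s < lower or s > upper]
print(f"fences: ({lower}, {upper}), flagged: {flagged}")
```

The flag can then be kept as an extra column and compared against other variables (role, seniority, department) to look for the explanation before deciding anything.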