Extremely Large Outliers - How to explain it?

This is for my assignment: I have been given a dataset (Survey Methodology for Enterprise Surveys - World Bank Group) and asked to predict which businesses have growth potential. However, the data contain some extreme numerical values. The "employee" variable refers to the number of full-time employees.

Below is the box plot of the employee variable:
[Picture 1: box plot]

The question given by the lecturer is: "Check the dataset for outliers and replace or delete those values as appropriate. You must provide justifications for your cleaning strategies and discuss the potential issues associated with your chosen strategies."

Min: 1
1st quartile: 9
Median: 21
Mean: 87.95
3rd quartile: 72
Max: 64000

Please advise whether I should delete the outlier or not, and how to justify that decision.

Thank you! Your help is much appreciated!

This is where the domain-knowledge circle of the classic three-circle Venn diagram of data science comes in. Where does the data come from? What businesses does the data represent? If this is a sample of 50 businesses in your local shopping area, then that 64000 point is likely an error. If it is a sample of all possible businesses, then it may not be an error, although for reference the giant insurer Prudential has only about 42000 employees. At any rate, the 64000 point is probably unrepresentative of the data and unhelpful to your study. I would delete it.
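If you do decide to delete, you can justify the cutoff with the same rule a box plot uses. Here is a minimal sketch (the `employees` array is made up for illustration; the real column would come from the survey dataset) that drops points beyond the standard 1.5 × IQR whiskers:

```python
import numpy as np

# Toy values for illustration; the real column comes from the survey dataset.
employees = np.array([1, 5, 9, 15, 21, 30, 72, 150, 400, 64000])

q1, q3 = np.percentile(employees, [25, 75])
iqr = q3 - q1  # interquartile range

# A box plot flags points beyond Q1 - 1.5*IQR and Q3 + 1.5*IQR (Tukey's inner fences).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = employees[(employees >= lower) & (employees <= upper)]
print(cleaned.tolist())  # the 400 and 64000 points are dropped
```

The potential issue to discuss in your write-up: deleting rows this way discards genuinely large firms along with data-entry errors, which can bias any model of growth potential toward smaller businesses.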


Hi! The data is from an Enterprise Survey - Survey Methodology for Enterprise Surveys - World Bank Group

If that's the case, is it possible that the 64000 is a data entry error where it should be 6400?

Ah, then 6400 is possible, and would not even be the maximum, although the mean of 88 and third quartile of 72 are small relative to 6400.

It would not be unreasonable to replace it with 6400. Another possibility is to cap it at Q3 + 3 × (Q3 − Q1), i.e., the third quartile plus three times the interquartile range, a threshold known as Tukey's outer fence.


Thank you for the help! Have a great day ahead!