Handling bad data in modeling preprocessing

I am wondering what the best practice is for handling bad data (more specifically, illogical values resulting from incorrect hand entry) in a predictive modeling workflow.

As a motivating example, say you have a predictor in your data set that must be non-negative, such as a person's age in years, and you come across a value of -9 caused by a simple typo when the data was hand-entered. Say you can't simply remove that record because data is scarce, and you have no reason to believe any of the other values in that record were entered incorrectly. And, of course, say the error was found in the training set.

What would be the appropriate response in data preprocessing? In the context of tidymodels, my idea would be to add a recipe step for that variable that sets any negative values to NA, and then tack on an imputation step, whether regression, k-NN, or just a simple mean impute.
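For reference, here is a rough sketch of what I mean in recipes code. The data frame `train_data`, the outcome `outcome`, and the predictor `age` are all made up for illustration:

```r
library(recipes)

# Hypothetical recipe: `age` must be non-negative, so flag impossible
# values as missing and then impute them.
rec <- recipe(outcome ~ ., data = train_data) |>
  # Set illogical (negative) ages to NA
  step_mutate(age = ifelse(age < 0, NA_real_, age)) |>
  # Fill in the NAs; step_impute_knn() or step_impute_linear()
  # would be drop-in alternatives to a simple mean impute
  step_impute_mean(age)
```

Because this runs as a recipe step, the imputation statistics are estimated from the training set only and then applied to new data, which avoids leakage.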

Any thoughts on this would be appreciated, thanks!

I think so. I always do outlier identification / screening first, then imputation (with the median) to fill in missing values, including, but not limited to, those created by the outlier screening.
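As a sketch of that ordering in a recipe (again with a hypothetical `age` predictor and `train_data`, and an assumed plausible range for screening):

```r
library(recipes)

rec <- recipe(outcome ~ ., data = train_data) |>
  # 1) Outlier screening first: values outside a plausible range become NA
  step_mutate(age = ifelse(age < 0 | age > 120, NA_real_, age)) |>
  # 2) Median imputation next, covering both pre-existing NAs and
  #    those created by the screening step above
  step_impute_median(all_numeric_predictors())
```

The key point is simply that the screening step comes before the imputation step in the recipe, so the imputer sees the screened values as missing.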
