I am wondering what the best practice is for handling bad data (more specifically, illogical values resulting from hand-entry errors) in a predictive modeling workflow.
As a motivating example, say you have a predictor that must be non-negative, such as a person's age in years, and you come across a value of
-9, a simple typo made when the value was hand-entered during data collection. Say you can't simply drop that record because data is scarce, and you have no reason to believe any of the other values in the record are wrong. And, of course, say this error was found in the training set.
What would be the appropriate response in data preprocessing? In the context of tidymodels, my idea would be to add a recipe step for that variable that sets any negative values to
NA and then tack on an imputation step, whether it be regression, k-NN, or just a simple mean impute.
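For concreteness, here is a minimal sketch of what I have in mind using the recipes package. The data, the cutoff of zero, and the choice of mean imputation are all just placeholders for illustration:

```r
library(recipes)
library(tibble)

# Toy training set with one illogical hand-entered age
train <- tibble(
  age     = c(34, 51, -9, 28, 45),
  outcome = c(1.2, 3.4, 2.1, 0.8, 2.9)
)

rec <- recipe(outcome ~ ., data = train) %>%
  # Flag illogical (negative) ages as missing
  step_mutate(age = ifelse(age < 0, NA_real_, age)) %>%
  # Then impute; step_impute_knn() or step_impute_linear()
  # could be swapped in here instead of the mean
  step_impute_mean(age)

prep(rec) %>% bake(new_data = NULL)
```

One thing I'm unsure about is whether doing this inside the recipe (so the imputation model is learned per resample) is preferable to cleaning the raw data once before splitting.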
Any thoughts on this would be appreciated, thanks!