Data prediction for missing data

mara · August 13, 2018, 12:28pm

I'm a big fan of this article on Missing Data Imputation (by Andrew Gelman, I believe), which also includes R code for various methods:
http://www.stat.columbia.edu/~gelman/arm/missing.pdf

Packages worth checking out (mentioned in this RViews post by Joseph Rickert, Missing Values, Data Science and R):

Amelia implements the Amelia II algorithm, which assumes that the complete data set (missing and observed data) is multivariate normal. Imputations are done via the EMB (expectation-maximization with bootstrapping) algorithm. The JSS paper describes a strategy for combining the models resulting from each imputed data set. The Amelia vignette contains examples.
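To make the "combining the models" step a bit more concrete: the usual way to combine results across multiply imputed data sets is Rubin's rules, and I'm assuming that is the strategy the paper refers to. Below is a minimal sketch in plain Python (not Amelia's own code or API). The function name pool_rubin and the numbers are made up for illustration.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed data sets with Rubin's rules.

    estimates: the coefficient estimated on each imputed data set
    variances: the squared standard error of that coefficient on each data set
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (n - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical results from fitting the same model on three imputed data sets.
estimates = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]
pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var)
```

The between-imputation term is what stops the pooled standard error from pretending the imputed values were actually observed.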

BaBooN provides two variants of Bayesian Bootstrap predictive mean matching to impute multiple missing values. Originally developed for survey data, the imputation algorithms are described as being robust with respect to imputation model misspecification. The best description and rationale for the algorithms seems to be the PhD thesis of one of the package authors.
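If predictive mean matching is new to you, here is a rough sketch of the basic idea in plain Python. It is ordinary single-predictor PMM with point estimates, not the Bayesian Bootstrap variants the package implements, and the helper names (fit_line, pmm_impute) and data are hypothetical. Each missing value is replaced by an actually observed value borrowed from a "donor" whose predicted mean is close.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def pmm_impute(xs, ys, k=3, rng=None):
    """Fill None entries in ys by predictive mean matching on xs."""
    rng = rng or random.Random(0)
    observed = [i for i, y in enumerate(ys) if y is not None]
    intercept, slope = fit_line([xs[i] for i in observed],
                                [ys[i] for i in observed])
    predicted = [intercept + slope * x for x in xs]
    completed = list(ys)
    for i, y in enumerate(ys):
        if y is None:
            # The k observed cases whose predicted means are closest to this one.
            donors = sorted(observed,
                            key=lambda j: abs(predicted[j] - predicted[i]))[:k]
            # Borrow the observed value of one randomly chosen donor.
            completed[i] = ys[rng.choice(donors)]
    return completed


# Made-up data: two values of y are missing.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, None, 2.9, 4.2, None, 6.1]
print(pmm_impute(xs, ys, k=2, rng=random.Random(42)))
```

Because imputed values are always drawn from the observed ones, they stay within the range of the real data, which is part of why the method tolerates a somewhat misspecified model.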

Hmisc contains several functions that are helpful for missing value imputation, including aregImpute(), impute() and transcan(). Documentation on Hmisc can be found here.

mi takes a Bayesian approach to imputing missing values. The imputation algorithm runs multiple MCMC chains to iteratively draw imputed values from conditional distributions of observed and imputed data. In addition to the imputation algorithm, the package contains functions for visualizing the pattern of missing values in a data set and assessing the convergence of the MCMC chains. A vignette shows a worked example, and the associated JSS paper delves deeper into the theory and the mechanics of using the method.

mice, which is an acronym for multivariate imputation by chained equations, formalizes the multiple imputation process outlined in that post and is probably the gold standard for FCS (fully conditional specification) multiple imputation. Package features include:

Columnwise specification of the imputation model

Support for arbitrary patterns of missing data

Passive imputation techniques that maintain consistency among data transformations

Subset selection of predictors

Support of arbitrary complete-data methods

Support for pooling various types of statistics

Diagnostics for imputations

Callable user-written imputation functions
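For anyone wondering what "chained equations" means in practice, here is a toy sketch of the idea in plain Python. It is a simplified illustration under my own assumptions (two variables, a linear model per column, normal noise, a single completed data set), not how mice is implemented, and all names and numbers are hypothetical. Each incomplete column is imputed from the other in turn, and the cycle is repeated.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares line plus residual standard deviation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, math.sqrt(rss / (n - 2))


def regress_impute(target, predictor, missing, rng):
    """Refit target ~ predictor on observed rows, then redraw the missing rows."""
    rows = [i for i in range(len(target)) if i not in missing]
    a, b, sd = fit_line([predictor[i] for i in rows],
                        [target[i] for i in rows])
    for i in missing:
        target[i] = a + b * predictor[i] + rng.gauss(0, sd)


def mean_fill(values):
    """Starting point: replace None with the mean of the observed values."""
    obs = [v for v in values if v is not None]
    m = sum(obs) / len(obs)
    return [m if v is None else v for v in values]


def chained_imputation(x, y, n_iter=10, seed=0):
    rng = random.Random(seed)
    miss_x = {i for i, v in enumerate(x) if v is None}
    miss_y = {i for i, v in enumerate(y) if v is None}
    x_c, y_c = mean_fill(x), mean_fill(y)
    for _ in range(n_iter):
        regress_impute(x_c, y_c, miss_x, rng)  # one model per incomplete column
        regress_impute(y_c, x_c, miss_y, rng)
    return x_c, y_c


# Made-up data with holes in both columns.
x = [1.0, 2.0, None, 4.0, 5.0, 6.0, None, 8.0]
y = [2.1, None, 6.2, 8.1, None, 12.3, 13.8, 16.0]
x_done, y_done = chained_imputation(x, y, n_iter=10, seed=1)
print(x_done)
print(y_done)
```

Running this several times with different seeds would give several completed data sets, which is the "multiple" part of multiple imputation. The "columnwise specification" feature in the list corresponds to choosing a different model for each column instead of the single linear model used here.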