Data prediction for missing data


#1

Hi all,

I am new to R. I am analyzing export data for my project. In this project, there are 1,584,823 total records, with 157 meter DataID(s).

However, data are missing for some minutes.

I need to write an algorithm to fill in the missing data for the following meter IDs:
35
77
94
252
483
484
739
871
1086
1185
1283
1507
1589
1714
1718
1790
1791
1801
2034
2072
2094
2129
2461
3310
3367
3527
3778
3893
4029
4031
4514
4998
5131
5193
5403
5785
5810
5814
5892
6412
6673
6910
7017
7030
7117
7287
7429
7674
7989
8156
8829
8890
9134
9295
9639
9729

What algorithm should I use for predicting missing data?

I have attached a graph plotting the existing data for all 56 meters:
meterIDs-for-prediction-missingdata2.pdf (2.1 MB)

This plot shows the existing data for the 56 meters to which I am going to apply the missing-data prediction algorithm. For most of the meters, the existing data increase linearly.

Some of these 56 meters show spikes in December. These spikes (which I assume are noisy data) will be another issue. How should I predict the missing data for the meter IDs that have spikes in their existing data?

The original CSV file (org_data.csv) that I am analyzing:
https://drive.google.com/open?id=12a3EfbSKKuPRAYUC-c58tbnBaiVlweVI


#2

There are loads of methods for imputing missing values. This article may be helpful to you.


#3

I feel this might be too big of a question without additional info.

For example, is this a standard problem for which you could point to the standard methods you're looking to implement? Could you point to those? (There's a good amount of research on this.)

Could you offer a minimal REPRoducible EXample (reprex)? (A reprex makes it much easier for others to understand your issue and figure out how to help.) Perhaps with just a couple of meters, timestamps, and examples of missing data.

You might also give more background. For example, I see a ton of obs for meter id 35 (assuming dataid is your stand-in for meter id), so it's not clear what's missing. Are these supposed to be synchronous observations?


Having said all that, assuming you're looking for synchronous observations (an obs for every meter at every minute), you might just use the last obs available.
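To make the "last obs available" idea concrete, here is a minimal base R sketch. It assumes a data frame with columns named dataid, time and value (hypothetical names, not necessarily the layout of org_data.csv) and uses made-up readings for two meters. It builds a one-minute grid per meter and fills the gaps by carrying the last observation forward. Since the original post says most meters increase linearly, a linear interpolation column is shown alongside for comparison; that part is my addition, not something suggested above.

```r
# Made-up readings for two meters; the column names and the dates are
# assumptions for illustration, not taken from org_data.csv.
readings <- data.frame(
  dataid = c(35, 35, 35, 77, 77),
  time   = as.POSIXct(
    c("2017-12-01 00:00:00", "2017-12-01 00:01:00", "2017-12-01 00:04:00",
      "2017-12-01 00:00:00", "2017-12-01 00:03:00"),
    tz = "UTC"
  ),
  value  = c(100, 101, 104, 50, 53)
)

fill_meter <- function(d) {
  d <- d[order(d$time), ]
  # Every minute between this meter's first and last reading.
  grid   <- seq(min(d$time), max(d$time), by = "1 min")
  t_obs  <- as.numeric(d$time)
  t_grid <- as.numeric(grid)
  # Index of the most recent reading at or before each grid minute.
  last <- findInterval(t_grid, t_obs)
  data.frame(
    dataid   = d$dataid[1],
    time     = grid,
    observed = t_grid %in% t_obs,                      # FALSE = filled in
    locf     = d$value[last],                          # last obs carried forward
    linear   = approx(t_obs, d$value, xout = t_grid)$y # linear interpolation
  )
}

# Apply the fill to each meter separately and stack the results.
filled <- do.call(rbind, lapply(split(readings, readings$dataid), fill_meter))
rownames(filled) <- NULL
print(filled)
```

Neither column deals with the December spikes; those would have to be flagged or removed before filling, otherwise a spike that happens to be the last reading before a gap gets carried forward or interpolated from.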


#4

I'm a big fan of this article on Missing Data Imputation (by Andrew Gelman, I believe) which includes R code for various methods, as well:
http://www.stat.columbia.edu/~gelman/arm/missing.pdf

Packages worth checking out (mentioned in this RViews post by Joseph Rickert, Missing Values, Data Science and R):

Amelia implements the Amelia II algorithm which assumes that the complete data set (missing and observed data) are multivariate normal. Imputations are done via the EMB (expectation-maximization with bootstrapping) algorithm. The JSS paper describes a strategy for combining the models resulting from each imputed data set. The Amelia vignette contains examples.

BaBooN provides two variants of the Bayesian Bootstrap predictive mean matching to impute multiple missing values. Originally developed for survey data, the imputation algorithms are described as being robust with respect to imputation model misspecification. The best description and rationale for the algorithms seems to be the PhD thesis of one of the package authors.

Hmisc contains several functions that are helpful for missing value imputation, including aregImpute(), impute() and transcan(). Documentation on Hmisc can be found here.

mi takes a Bayesian approach to imputing missing values. The imputation algorithm runs multiple MCMC chains to iteratively draw imputed values from conditional distributions of observed and imputed data. In addition to the imputation algorithm, the package contains functions for visualizing the pattern of missing values in a data set and assessing the convergence of the MCMC chains. A vignette shows a worked example and the associated JSS paper delves deeper into the theory and the mechanics of using the method.

mice, which is an acronym for multivariate imputation by chained equations, formalizes the multiple imputation process outlined above and is probably the gold standard for FCS (fully conditional specification) multiple imputation. Package features include:

  • Columnwise specification of the imputation model
  • Support for arbitrary patterns of missing data
  • Passive imputation techniques that maintain consistency among data transformations
  • Subset selection of predictors
  • Support of arbitrary complete-data methods
  • Support for pooling various types of statistics
  • Diagnostics for imputations
  • Callable user-written imputation functions
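
As a rough illustration of the mice workflow (impute, analyze each imputed data set, pool), here is a short sketch on nhanes, the small example data set that ships with the package. It is not the meter data, and the choices of m = 5, predictive mean matching ("pmm") and the bmi ~ age + chl model are only illustrative assumptions. Whether this kind of multiple imputation suits minute-level meter readings is a separate question that this sketch does not answer.

```r
library(mice)

# nhanes is an example data set bundled with mice; it contains NAs.
# m = 5 imputed data sets, predictive mean matching, fixed seed for repeatability.
imp <- mice(nhanes, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

# Extract one completed data set (the first of the five imputations).
completed <- complete(imp, 1)
head(completed)

# Fit the same model on every imputed data set, then pool the estimates.
fit    <- with(imp, lm(bmi ~ age + chl))
pooled <- pool(fit)
summary(pooled)
```

The pooled summary combines the five sets of coefficients, so the reported standard errors reflect the uncertainty introduced by the imputation as well as the usual sampling error.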