I am new to R. I am analyzing export Wh data for my project.
The original CSV file (org_data.csv) that I am analyzing can be found at this link:
The original CSV file contains 1,584,823 records in total from 157 meters,
recorded from 1-Oct-2015 00:00:00 to 31-Mar-2016 23:59:59.
The CSV file has three columns: local minutes, dataID, and meter_value.
- local minutes is a timestamp formatted as “yyyy-mm-dd hh:mm:ss-UTC”
- dataID is the ID number of each of the 157 meters
- meter_value is the export Wh reading
A quick view of the original CSV file is shown below:
We can observe that export Wh is recorded every minute for each dataID, but there are only 6 export Wh records for 2015-10-01 00:00:xx, so the records for the remaining 151 meters are missing.
The same goes for 2015-10-01 00:01:xx: only 7 meters are recorded in the CSV file, and the records for the other 150 meters are missing.
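The per-minute gaps described above can be counted directly. The following is a minimal sketch in base R on made-up toy data; the column names `localminute`, `dataid`, and `meter_value` and all values are assumptions for illustration, not the real contents of org_data.csv.

```r
# Toy data standing in for org_data.csv (IDs and values are made up).
df <- data.frame(
  localminute = as.POSIXct(c("2015-10-01 00:00:10", "2015-10-01 00:00:40",
                             "2015-10-01 00:01:05", "2015-10-01 00:01:30",
                             "2015-10-01 00:01:55"), tz = "UTC"),
  dataid      = c(35, 44, 35, 44, 77),
  meter_value = c(100, 200, 102, 203, 50)
)

# Truncate each timestamp to the start of its minute.
df$minute <- as.POSIXct(trunc(df$localminute, units = "mins"))

# Count how many distinct meters reported in each minute.
meters_per_minute <- aggregate(dataid ~ minute, data = df,
                               FUN = function(x) length(unique(x)))
names(meters_per_minute)[2] <- "n_meters"
print(meters_per_minute)
```

On the real file, the difference between the number of meters and `n_meters` would give the number of missing meters in each minute.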
The objective of this project is to write an algorithm that fills in these missing data for 56 of the meters. Hence, there should be 56 records for each minute from 1-Oct-2015 00:00:00 to 31-Mar-2016 23:59:59.
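To make the target shape concrete: the period covers 183 days, i.e. 263,520 minutes, so 56 meters would need 14,757,120 rows in total. A minimal sketch of building such a complete minute-by-meter grid and marking the gaps as `NA` is given here, using base R only. The three meter IDs, the five-minute window, and the readings are hypothetical placeholders, and timestamps are assumed to be already truncated to the minute.

```r
# Every minute in a short illustrative window (the real range would run
# from 2015-10-01 00:00:00 to 2016-03-31 23:59:00).
minutes <- seq(from = as.POSIXct("2015-10-01 00:00:00", tz = "UTC"),
               to   = as.POSIXct("2015-10-01 00:04:00", tz = "UTC"),
               by   = "1 min")

# Hypothetical meter IDs standing in for the 56 target meters.
target_ids <- c(35, 44, 77)

# One row per (minute, meter) combination.
grid <- expand.grid(minute = minutes, dataid = target_ids)

# Made-up observed readings: only 4 of the 15 combinations exist.
obs <- data.frame(minute      = minutes[c(1, 3, 5, 2)],
                  dataid      = c(35, 35, 35, 44),
                  meter_value = c(100, 104, 108, 200))

# Left join: combinations without a reading get meter_value = NA.
full <- merge(grid, obs, by = c("minute", "dataid"), all.x = TRUE)
full <- full[order(full$dataid, full$minute), ]

cat("rows:", nrow(full), " missing:", sum(is.na(full$meter_value)), "\n")
```

The `NA` entries in `full$meter_value` are exactly the records that the prediction step would have to fill.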
The dataIDs for which I would like to predict the missing data are listed below:
Before predicting the data, I have done the following:
- data importing: read the CSV file in RStudio
- data processing: converted “localminute” to a date-time type and “dataid” to a factor
- data visualization: plotted all the dataIDs using the facet function
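The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not the exact code used: the sample rows are made up, the timestamp is assumed to look like `2015-10-01 00:00:10-05` (only the first 19 characters are parsed, and they are treated as UTC for simplicity), and the faceted plot assumes the ggplot2 package, which is only used if it is installed.

```r
# Write a tiny stand-in for org_data.csv so the example runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "localminute,dataid,meter_value",
  "2015-10-01 00:00:10-05,35,100",
  "2015-10-01 00:01:10-05,35,102",
  "2015-10-01 00:00:40-05,44,200",
  "2015-10-01 00:01:40-05,44,203"
), csv_path)

# Step 1: import.
raw <- read.csv(csv_path, stringsAsFactors = FALSE)

# Step 2: convert types. Only the "yyyy-mm-dd hh:mm:ss" part is parsed;
# the trailing offset is ignored here (an assumption of this sketch).
raw$localminute <- as.POSIXct(substr(raw$localminute, 1, 19),
                              format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
raw$dataid <- factor(raw$dataid)
str(raw)

# Step 3: one panel per meter, if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(raw, ggplot2::aes(x = localminute, y = meter_value)) +
    ggplot2::geom_line() +
    ggplot2::facet_wrap(~ dataid, scales = "free_y")
  # print(p) draws the plot.
}
```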
This is the plot of the existing data for the 56 meters. For most of the meters, the existing data increase linearly. Based on these existing data, I have to write an algorithm to predict the missing data.
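Since the existing readings look roughly linear, one simple candidate is per-meter linear interpolation with `approx()` from base R. The sketch below is only one possible approach, not a settled method; `fill_meter` is a hypothetical helper name, the data are made up, and `rule = 2` is an assumed choice that carries the nearest known reading outward when a gap lies before the first or after the last observation.

```r
# Hypothetical helper: linearly interpolate the NA readings of one meter.
fill_meter <- function(minute, value) {
  ok <- !is.na(value)
  if (sum(ok) < 2) return(value)  # not enough points to interpolate
  approx(x = as.numeric(minute[ok]), y = value[ok],
         xout = as.numeric(minute), rule = 2)$y
}

minutes <- seq(as.POSIXct("2015-10-01 00:00:00", tz = "UTC"),
               by = "1 min", length.out = 5)

# Made-up grid for two meters, with NA where a reading is missing.
full <- data.frame(
  minute      = rep(minutes, 2),
  dataid      = rep(c(35, 44), each = 5),
  meter_value = c(100, NA, 104, NA, NA,
                  NA, 200, NA, NA, 206)
)

# Fill each meter separately so that readings never mix across meters.
full$filled <- NA_real_
for (id in unique(full$dataid)) {
  rows <- full$dataid == id
  full$filled[rows] <- fill_meter(full$minute[rows], full$meter_value[rows])
}
print(full)
```

Interior gaps are filled along the straight line between neighbouring readings; gaps at either end stay flat, which may underestimate a steadily increasing meter, so extrapolating the trend instead would be a reasonable alternative.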
Among these 56 meters, some meter IDs show spikes in December. These spikes (which I assume are noisy data) are another issue. Hence, I would also like to ask how I should predict the missing data for those meter IDs that have spikes in their existing data.
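One way to treat the spikes, assuming they really are noise, is to flag readings that sit far from a running median, set them to `NA`, and then fill them like any other gap. The following is a minimal sketch of that idea with base R's `runmed()`; `flag_spikes`, the window size, and the `max_dev` threshold (in Wh) are hypothetical and would need tuning on the real data.

```r
# Hypothetical helper: TRUE where a reading deviates from the running
# median by more than max_dev (in Wh).
flag_spikes <- function(value, window = 5, max_dev = 50) {
  smooth <- runmed(value, k = window, endrule = "median")
  abs(value - smooth) > max_dev
}

# Made-up readings of one meter with a single spike at position 4.
t     <- 1:7
value <- c(100, 102, 104, 500, 108, 110, 112)

spike <- flag_spikes(value)

# Treat flagged readings as missing and interpolate across them.
cleaned <- approx(x = t[!spike], y = value[!spike], xout = t, rule = 2)$y

print(data.frame(t, value, spike, cleaned))
```

A running median is used rather than a running mean because a single extreme reading barely moves the median, so the spike does not distort its own reference level.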