Data Preparation for Machine Learning


Hi Everyone,

I am hoping to use a NN for some data I am working with (environmental & meteorological data to hindcast pollution values) but I am having trouble determining how exactly my data should look before applying anything. Right now I’ve got everything in a 2D matrix - all doubles or ints - with each row an observation representing one day and the columns are things like average temp, average humidity, weekly ozone average, etc.

I have gone through some documentation on the Keras package - which was very helpful actually - and I know (think I know) that I need a matrix (rather than a df) of numerical values, but I have lots of little questions, like:

  1. Are ALL variables included in my matrix going to be used to predict my final value? I’m assuming yes, but I need to verify, because I would like to know…

  2. …can I leave variables like ‘observation number’ (or ‘day number’ in my case) or the date in the matrix and have them be ignored? They shouldn’t have any impact on the results, but I feel like I am going to want them afterward.

  3. Can the testing and training sets have a different number of variables? They definitely have a different number of observations, but do they need to be identical in terms of variables?

  4. I’m also confused about how to include the variable that gets used for training. I only have some data points with actual values/results to be used for training. The rest of my observations (dates) have missing values (estimating these missing values is the whole reason I am doing this). I don’t want to include a variable with mostly missing values. I don’t even think I can do that. But I also will need someplace for the results to go.

  5. Are there any references for dealing with these kinds of minor details?

I am probably confused about other things too but this seems like a good place to start. Thanks in advance for any help or any pointers in the right direction. For info - I’m using RStudio 1.1.423, R 3.4.3 on MacOS High Sierra, Keras package 2.1.4.



You should be clear about all these questions before carrying out any analysis. I would recommend reading the following books:

  1. An Introduction to Statistical Learning: with Applications in R by James et al. This book is available for free from the following link:

Then follow up with

  2. Applied Predictive Modeling by Kuhn and Johnson.

My guess is that your data includes time series. If that’s the case, then I recommend the excellent book “Forecasting: Principles and Practice” by Rob Hyndman and George Athanasopoulos. This book is available online at


Thanks, Timesaver!

I’ll have a look at those texts - thank you for the recommendations.


You should remove identifier columns such as the observation or day number from the matrix that you pass to the fitting function; keep them in a separate object if you want them afterward. You may want to leave the date in, since it might have some predictive utility, or at least derive some features from it (e.g. month, day of the week, etc.). lubridate and recipes can be good for this. The latter can also make predictors for holidays if that is relevant for your problem.
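As a rough sketch of that idea in base R (the column names and values here are made up for illustration, not taken from the original data), the identifiers can be set aside and the date turned into numeric features before building the matrix:

```r
# Hypothetical daily data; column names and values are invented for illustration.
daily <- data.frame(
  day_number = 1:6,
  date       = as.Date("2018-01-05") + 0:5,
  avg_temp   = c(2.1, 3.4, 1.8, 0.5, 4.2, 5.0),
  avg_humid  = c(80, 75, 82, 90, 70, 65)
)

# Derive numeric date features with base R
# (lubridate offers month(), wday(), etc. for the same purpose).
lt <- as.POSIXlt(daily$date)
daily$month       <- lt$mon + 1L  # 1 = January ... 12 = December
daily$day_of_week <- lt$wday      # 0 = Sunday ... 6 = Saturday

# Keep the identifiers in their own data frame so they can be re-attached later.
ids <- daily[, c("day_number", "date")]

# The matrix passed to the fitting function holds predictors only.
predictor_cols <- setdiff(names(daily), c("day_number", "date"))
x <- as.matrix(daily[, predictor_cols])
```

Because the rows of `ids` and `x` stay in the same order, the dates can be bound back onto the predictions afterward.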

The training and test sets should contain the same variables. Also, make sure that any preprocessing (such as range normalization) is estimated from the training set and applied to both sets. You should only use the test data for prediction and performance estimation. It should not be used in any way for fitting the model.
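A minimal sketch of range normalization done this way, using simulated data and hypothetical column names: the minimum and range are computed from the training rows only, and the same numbers are then applied to both sets.

```r
set.seed(1)

# Simulated predictors; the names and values are assumptions for illustration.
x <- cbind(avg_temp  = rnorm(10, mean = 10, sd = 5),
           avg_humid = runif(10, min = 40, max = 90))

train_idx <- 1:7
x_train <- x[train_idx, , drop = FALSE]
x_test  <- x[-train_idx, , drop = FALSE]

# Estimate the scaling parameters from the training set only.
train_min   <- apply(x_train, 2, min)
train_range <- apply(x_train, 2, max) - train_min

# Apply the same training-set parameters to any matrix with the same columns.
rescale <- function(m) {
  sweep(sweep(m, 2, train_min, "-"), 2, train_range, "/")
}

x_train_scaled <- rescale(x_train)
x_test_scaled  <- rescale(x_test)
```

The scaled training columns run from 0 to 1 by construction. The test columns can fall slightly outside that range, which is expected since the test data played no part in estimating the parameters.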

You would probably need to impute those missing values. recipes and other packages have some imputation methods. Again, the imputation models should be estimated only from the training set.

For example, if you use 5-nearest neighbors for imputation, then you use the 5 nearest neighbors from the training set to impute any data points in either the training or the test set.
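To make the nearest-neighbor idea concrete, here is a small hand-rolled sketch in base R. `knn_impute()` is a hypothetical helper written for this post, not a function from recipes or any other package, and it assumes the predictors are on comparable scales (in practice you would normalize first). Donor rows always come from the complete rows of the training set, whichever set is being imputed.

```r
# Hypothetical helper: fill NAs in `target` with the mean of the k nearest
# complete rows of `train`. Distance is Euclidean over the columns that are
# observed in the row being imputed.
knn_impute <- function(target, train, k = 5) {
  donors <- train[complete.cases(train), , drop = FALSE]
  for (i in which(!complete.cases(target))) {
    miss <- is.na(target[i, ])
    obs  <- !miss
    diffs <- sweep(donors[, obs, drop = FALSE], 2, target[i, obs], "-")
    d  <- sqrt(rowSums(diffs^2))
    nn <- order(d)[seq_len(min(k, nrow(donors)))]
    target[i, miss] <- colMeans(donors[nn, miss, drop = FALSE])
  }
  target
}

# Toy data; names and values are invented for illustration.
train <- cbind(temp  = c(1, 2, 3, 4, 5, 6, NA),
               humid = c(10, 20, 30, 40, 50, 60, 32))
test  <- cbind(temp  = c(NA, 2.8),
               humid = c(30, NA))

# Both sets are imputed from the training rows.
train_imp <- knn_impute(train, train, k = 3)
test_imp  <- knn_impute(test,  train, k = 3)
```

The packaged imputation steps handle scaling, ties, and mixed variable types more carefully; this is only meant to show where the neighbors come from.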

Leaving the incomplete rows out is an option, but it is risky. It has been well studied that your model can end up with considerable bias when rows with missing values are dropped, especially if the data are not missing at random or if the reason that they are missing is related to the outcome.

I’m not sure that I understand the part about needing someplace for the results to go. Can you elaborate?


Thanks Max for the detailed response - that helps a lot actually.

I updated the Keras package and was finally able to load in data for the example I was following and have a look at the structure - with that, combined with your response - things are much clearer now. And I’ll check out the recipes package - that might be useful for other things I am working on as well.

My last question was more about accessing the resulting predicted values at the end of the process. The examples I’ve read through don’t focus on this really, but I assume as I play around with everything I’ll figure that out. I am still relatively new to R, and I keep thinking in C++ - I want to declare everything before I use it, so I was thinking I must have to declare an array or a vector first to store resulting values.

Thanks again!


There is usually a predict function (or method) that can be used.
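Here is a small sketch of that workflow. It uses `lm()` from base R as a stand-in model on simulated data (the column names are made up); as far as I know, fitted models from the keras package are used the same way, by calling `predict()` on the model with a matrix of new predictors. Note that nothing has to be declared in advance: `predict()` returns a new vector that you simply assign.

```r
set.seed(42)
n <- 30

# Simulated data; names and values are assumptions for illustration.
dat <- data.frame(
  date      = as.Date("2018-01-01") + 0:(n - 1),
  avg_temp  = rnorm(n, mean = 10, sd = 5),
  avg_humid = runif(n, min = 40, max = 90)
)
dat$pollution <- 5 + 0.8 * dat$avg_temp - 0.05 * dat$avg_humid +
  rnorm(n, sd = 0.5)
dat$pollution[sample(n, 12)] <- NA  # days without a measurement

# Fit only on the days where the outcome was actually observed.
known <- !is.na(dat$pollution)
fit <- lm(pollution ~ avg_temp + avg_humid, data = dat[known, ])

# predict() creates the result vector; no pre-allocation is needed.
pred <- predict(fit, newdata = dat[!known, ])

# Re-attach the dates that were kept out of the model.
results <- data.frame(date = dat$date[!known],
                      predicted_pollution = pred)
```

The days with a missing outcome are never part of the training data; they only show up as new rows to predict, and the predictions land in `results` next to their dates.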