I am hoping to use a neural network (NN) on some data I am working with (environmental and meteorological data, to hindcast pollution values), but I am having trouble working out exactly how my data should look before I apply anything. Right now I’ve got everything in a 2D matrix - all doubles or ints - where each row is an observation representing one day and the columns are things like average temperature, average humidity, weekly ozone average, etc.
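To make the layout concrete, here is a toy stand-in for my matrix in base R. The column names and the random values are made up purely for illustration; my real data has more columns and many more days.

```r
# Toy stand-in for the real data: one row per day, all numeric.
# Column names and values are invented for illustration only.
set.seed(1)
n_days <- 10

obs <- cbind(
  day_number   = 1:n_days,                            # ID-type column
  avg_temp     = rnorm(n_days, mean = 15, sd = 5),    # average temperature
  avg_humidity = runif(n_days, min = 30, max = 90),   # average humidity
  ozone_wk_avg = runif(n_days, min = 20, max = 60)    # weekly ozone average
)

# Inspect the structure: a numeric matrix with named columns.
str(obs)
```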
I have gone through some documentation on the Keras package - which was actually very helpful - and I know (or think I know) that I need a matrix (rather than a data frame) of numerical values, but I have lots of little questions, like:
Are ALL variables included in my matrix going to be used to predict my final value? I’m assuming yes, but I need to verify, because I would like to know…
…can I leave in variables like ‘observation number’ (or ‘day number’ in my case) or the date and have them be ignored? They shouldn’t have any impact on the results, but I feel like I am going to want them afterward.
Can the testing and training sets have a different number of variables? They definitely have a different number of observations, but do they need to be identical in terms of variables?
I’m also confused about how to include the target variable, i.e. the one whose known values are used for training. Only some of my data points have actual measured values that can be used for training. The rest of my observations (dates) have missing values (estimating these missing values is the whole reason I am doing this). I don’t want to include a variable with mostly missing values, and I don’t think I even can. But I will also need someplace for the results to go.
Are there any references for dealing with these kinds of minor details?
I am probably confused about other things too, but this seems like a good place to start. Thanks in advance for any help or any pointers in the right direction. For info - I’m using RStudio 1.1.423, R 3.4.3 on macOS High Sierra, Keras package 2.1.4.
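In case it helps with the question about the target variable, this is roughly what I imagine doing, sketched in base R on made-up data. It assumes (and I have not confirmed this is the right approach) that the target lives in its own vector, that ID columns like the day number are set aside before fitting, and that the rows with a missing target are kept separately for prediction later. The names `dat`, `pollution`, and `predictor_cols` are hypothetical.

```r
# Made-up data: 12 days, of which only a few have a measured pollution value.
set.seed(42)
n_days <- 12

dat <- data.frame(
  day_number   = 1:n_days,
  avg_temp     = rnorm(n_days, mean = 15, sd = 5),
  avg_humidity = runif(n_days, min = 30, max = 90),
  pollution    = NA_real_               # target, mostly missing
)

# Pretend that only these five days have actual measurements.
measured <- c(1, 3, 4, 8, 11)
dat$pollution[measured] <- runif(length(measured), min = 5, max = 50)

# Predictor columns only; day_number is kept out of the model input.
predictor_cols <- c("avg_temp", "avg_humidity")
has_target <- !is.na(dat$pollution)

# Rows with a known target: candidates for training and testing.
x_known <- as.matrix(dat[has_target, predictor_cols])
y_known <- dat$pollution[has_target]

# Rows with a missing target: the days I actually want estimates for.
x_missing    <- as.matrix(dat[!has_target, predictor_cols])
days_missing <- dat$day_number[!has_target]  # kept aside to label results later
```

Here `x_known` and `x_missing` have the same columns but different numbers of rows, and `days_missing` is what I would use to match the estimates back to dates afterward.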