Can I use Neural Networks on this regression problem?

I'm working on a Machine Learning project where I have both continuous and discrete variables. The goal is to predict the target variable, score, in around 1 second or less.

The nature of the data is as you can see below:

> str(myds)
'data.frame':   841500 obs. of  30 variables:
 $ score                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ amount_sms_received       : int  0 0 0 0 0 0 3 0 0 3 ...
 $ amount_emails_received    : int  3 36 3 12 0 63 9 6 6 3 ...
 $ distance_from_server      : int  17 17 7 7 7 14 10 7 34 10 ...
 $ age                       : int  17 44 16 16 30 29 26 18 19 43 ...
 $ points_earned             : int  929 655 286 357 571 833 476 414 726 857 ...
 $ registrationYYYY          : Factor w/ 2 levels ...
 $ registrationDateMM        : Factor w/ 9 levels ...
 $ registrationDateDD        : Factor w/ 31 levels ...
 $ registrationDateHH        : Factor w/ 24 levels ...
 $ registrationDateWeekDay   : Factor w/ 7 levels ...
 $ catVar_05                 : Factor w/ 2 levels ...
 $ catVar_06                 : Factor w/ 140 levels ...
 $ catVar_07                 : Factor w/ 21 levels ...
 $ catVar_08                 : Factor w/ 1582 levels ...
 $ catVar_09                 : Factor w/ 70 levels ...
 $ catVar_10                 : Factor w/ 755 levels ...
 $ catVar_11                 : Factor w/ 23 levels ...
 $ catVar_12                 : Factor w/ 129 levels ...
 $ catVar_13                 : Factor w/ 15 levels ...
 $ city                      : Factor w/ 22750 levels ...
 $ state                     : Factor w/ 55 levels ...
 $ zip                       : Factor w/ 26659 levels ...
 $ catVar_17                 : Factor w/ 2 levels ...
 $ catVar_18                 : Factor w/ 2 levels ...
 $ catVar_19                 : Factor w/ 3 levels ...
 $ catVar_20                 : Factor w/ 6 levels ...
 $ catVar_21                 : Factor w/ 2 levels ...
 $ catVar_22                 : Factor w/ 4 levels ...
 $ catVar_23  

Question 1: Given the requirements above, what would be the best prediction algorithm?

If I go to the following Wizard link:

And I check:

Column types: { Numerical, Categorical }
Target type: Numerical
Number of columns: 10s
Number of rows: 100'000s

Then, the only enabled predictive algorithm is: KNN.

Unfortunately KNN is not an option for me because I have the requirement that the prediction needs to be done in 1 second or less.

Then, suppose we transform the dataset by removing the many rarely used levels of the discrete variables, one-hot encoding the discrete variables, and applying target/mean encoding to { city, zip }. We then get around 300 numerical columns.
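The three transformations described above can be sketched in base R roughly as follows. This is only a minimal illustration on a made-up toy data frame: the values, the helper lump_rare() and the min_count threshold are assumptions, not part of the real dataset.

```r
# Toy data standing in for the real data set (all values are hypothetical).
toy <- data.frame(
  score     = c(0, 1, 0, 2, 5, 3, 0, 4),
  catVar_07 = factor(c("a", "a", "b", "b", "c", "a", "d", "b")),
  zip       = factor(c("10001", "10001", "94105", "94105",
                       "10001", "60601", "60601", "94105"))
)

# 1. Lump rarely used levels into a single "other" level.
#    lump_rare() is a hypothetical helper, not a library function.
lump_rare <- function(f, min_count) {
  counts <- table(f)
  rare <- names(counts)[counts < min_count]
  f <- as.character(f)
  f[f %in% rare] <- "other"
  factor(f)
}
toy$catVar_07 <- lump_rare(toy$catVar_07, min_count = 2)

# 2. One-hot encode the (now smaller) factor: one 0/1 column per level.
one_hot <- model.matrix(~ catVar_07 - 1, data = toy)

# 3. Target (mean) encoding: replace each zip by the mean score of that zip.
#    On real data the means should come from the training rows only,
#    otherwise the target leaks into the feature.
zip_means <- tapply(toy$score, toy$zip, mean)
toy$zip_encoded <- as.numeric(zip_means[as.character(toy$zip)])

# Final all-numeric table.
encoded <- cbind(toy[, c("score", "zip_encoded")], one_hot)
print(encoded)
```

On the real data the same steps would be repeated for every discrete variable, which is where the roughly 300 columns come from.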

Then, we input that again into the Wizard:

Column types: { Numerical } (changed)
Target type: Numerical
Number of columns: 100s (changed)
Number of rows: 10'000s (changed)

and now we get Neural Networks as an option. By the way, if we change the number of rows from 10'000s back to 100'000s, the Neural Networks option disappears.

For now let's proceed with: Number of rows: 10'000s

If we change from:

Column types: { Numerical }
Target type: Numerical
Number of columns: 100s
Number of rows: 10'000s

To:

Column types: { Numerical, Binary } (changed)
Target type: Numerical
Number of columns: 100s
Number of rows: 10'000s

(just adding: Binary to the column types)

Then the Neural Networks option disappears again.

My concern here is that when we apply one-hot encoding to the discrete variables, the resulting columns are actually binary.

Question 2: Could you give me some hints about what's going on here?

Question 3: Do you know of any table or checklist that tells me which Machine Learning algorithms should be discarded given the nature of a particular problem? I searched on Google but didn't find a really reliable answer.

The Wizard above doesn't tell me why it is discarding the Neural Networks.

Thanks!

Hi,

There is no easy way to know which machine learning model is best for your data. There are many factors to consider when choosing the "right" model, but in the end you'll have to try different models and compare them to see which is best.
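As a minimal sketch of what "try different models and compare them" can look like in base R: fit each candidate on a training split and compare the error on held-out rows. The data here is simulated (the column names are borrowed from the question, but the values and the relationship to score are invented), and a plain linear model is compared against a predict-the-mean baseline purely for illustration.

```r
set.seed(123)

# Simulated stand-in data; the relationship to score is made up.
n <- 500
d <- data.frame(
  age           = sample(16:60, n, replace = TRUE),
  points_earned = sample(200:1000, n, replace = TRUE)
)
d$score <- 0.02 * d$points_earned - 0.1 * d$age + rnorm(n)

# Hold out 100 rows for testing.
train_idx <- sample(n, 400)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Root mean squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Candidate 1: always predict the training mean (baseline).
baseline_pred <- rep(mean(train$score), nrow(test))

# Candidate 2: ordinary linear regression.
fit <- lm(score ~ age + points_earned, data = train)
lm_pred <- predict(fit, newdata = test)

results <- c(
  baseline = rmse(test$score, baseline_pred),
  linear   = rmse(test$score, lm_pred)
)
print(results)
```

Any other model can be added to the comparison the same way, as long as every candidate is scored on the same held-out rows.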

Complexity vs. interpretation of a model
It's always good to balance the complexity / power of a model against how easily you can interpret its structure. Take simple regression, for example: it's still used as a basis for most machine learning tasks (or at least as a comparison) because the model is robust, very well understood and relatively easy to interpret. Compare that to neural networks, which are potentially more powerful but so complex that they are often called "black boxes", because you can't know how they reach their decisions (or at least it's very difficult). Then you have models like random forest where there is some way to get feature importance, though the bagging of many trees at the same time makes things more complex again.

Understanding your dataset
Which model might be best depends almost entirely on the type of dataset, meaning the variables/data types it contains. Some datasets, like images, are known to work very well with neural networks, whereas sets with a limited number of numerical features often do great with just regression-based analysis. Some models require huge amounts of data to get decent results (like neural networks), whereas others can work with smaller sets.

Feature selection
In datasets with a huge input space (like yours) it can be handy to perform feature selection, where you try to reduce the input to a smaller set of features that capture the most variability in the data. This can be done by hand if you know your data very well, or you can use certain techniques to help you with this (just google and you'll find tons of suggestions).

Feature engineering
Related to the above, but here you try to convert features in your data into new, more meaningful features, or collapse many features into one that might be able to capture the same information. This can often be a trial and error process, but thinking about what you know about the data can help a lot.

For example: your dataset has zip codes. In most models, if you'd like to use them as features you'll have to one-hot encode them, which in your case would add 26659 columns (that will almost certainly not produce a good model). Depending on what you know about the data and its use, you could try several things. You could, for example, aggregate the zip codes into larger areas to reduce the number of features. Another idea could be transforming each zip code into a longitude and latitude (there are many tools online that can help); this way you convert a categorical variable into 2 numeric ones, which reduces the features from 26659 to 2! The same can be done for dates and times (which you have as well) by using the POSIX date-time format, which converts them into a numeric value.
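Both conversions can be sketched in base R as below. This is only an illustration: zip_lookup is a hypothetical lookup table with rough, made-up coordinates (a real one would have to come from an external source), and the registered column is an invented timestamp, since the original data stores the date parts as separate factors.

```r
# Hypothetical zip-to-coordinates lookup table (approximate, illustrative values).
zip_lookup <- data.frame(
  zip       = c("10001", "94105"),
  latitude  = c(40.75, 37.79),
  longitude = c(-74.00, -122.39),
  stringsAsFactors = FALSE
)

# Toy user records (invented).
users <- data.frame(
  zip        = c("94105", "10001", "94105"),
  registered = c("2019-03-01 14:00:00",
                 "2019-03-02 09:30:00",
                 "2019-03-02 09:30:00"),
  stringsAsFactors = FALSE
)

# Replace the categorical zip by two numeric columns via the lookup table.
idx <- match(users$zip, zip_lookup$zip)
users$latitude  <- zip_lookup$latitude[idx]
users$longitude <- zip_lookup$longitude[idx]

# Convert the timestamp to seconds since 1970-01-01 00:00:00 UTC.
users$registered_num <- as.numeric(as.POSIXct(users$registered, tz = "UTC"))

print(users)
```

Zip codes missing from the lookup table would end up as NA here and need separate handling.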

Engineering features is not guaranteed to always work, but when it does it can do wonders for your model :)

Hope this already sheds a bit more light...
PJ

Thank you @pieterjanvc for your extensive explanation.

I really liked the idea about converting the zip code to latitude and longitude. Probably I will use it. It has 2 advantages:

  1. Converting a high-cardinality discrete variable to continuous numeric variables (obvious reason)
  2. Preventing the overfitting issue caused by Target Encoding

Going a bit back to my original question:

Do you have any idea why https://mod.rapidminer.com/#app discarded Neural Networks when using binary variables?

Thanks!

Hi,

I have limited experience with neural network theory and model building, but one of the things I do know is that they can't cope very well with sparse datasets (meaning many features are 0 at any given instance). This definitely happens when you one-hot encode your categorical variables, as every column except the one for the correct category will be set to 0.
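To make the sparsity concrete, here is a small base R sketch that one-hot encodes a single factor with 140 levels (the level count of catVar_06 in the question; the values themselves are randomly generated) and measures the share of zeros. With k levels, each row has exactly one 1 and k - 1 zeros.

```r
set.seed(1)

k <- 140  # number of levels, as for catVar_06 in the question

# Randomly generated factor values; only the level count matches the real data.
f <- factor(sample(seq_len(k), 1000, replace = TRUE), levels = seq_len(k))

# One 0/1 column per level.
one_hot <- model.matrix(~ f - 1)

# Share of cells that are zero: (k - 1) / k.
zero_fraction <- mean(one_hot == 0)
print(zero_fraction)
```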

Also, I think that neural networks operate better on continuous variables (most machine learning does actually) because of the underlying mathematics (of which I don't dare to go into detail because I'm not good at that haha)

I'm sure there are others here with better guesses :)
PJ


Thank you @pieterjanvc for your answer. I will keep the question open for a couple more days; maybe we'll get another answer as good as yours.

Summarizing my problem:

  • I have big data with many columns (after cleaning up) and many rows (around 800'000)
  • There are many discrete variables; after one-hot encoding I will have around 300 columns in total
  • mod.rapidminer.com only suggests KNN (discarding Neural Networks), but I need to make predictions in 1 sec or less and KNN takes longer than that (so it is not an option)
  • Neural Networks are suspected of not handling one-hot encoding well because of the binary values, but on the other hand (even though training takes a while) the prediction time is immediate

Is there any better algorithm than Neural Networks, using one-hot encoding, that gives a prediction time of 1 sec or less?
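As a rough way to check the prediction-time point, the sketch below times a single forward pass through a small network in base R. Everything here is an assumption for illustration: the weights are random rather than trained, and the 64-unit hidden layer is an arbitrary choice. It only shows that predicting one row with about 300 inputs amounts to a couple of small matrix products.

```r
set.seed(1)

n_inputs <- 300  # roughly the column count after one-hot encoding
n_hidden <- 64   # arbitrary hidden layer size (assumption)

# Random, untrained weights: this measures speed only, not accuracy.
W1 <- matrix(rnorm(n_inputs * n_hidden, sd = 0.1), nrow = n_inputs)
b1 <- rnorm(n_hidden, sd = 0.1)
W2 <- matrix(rnorm(n_hidden, sd = 0.1), nrow = n_hidden)
b2 <- 0

sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass for one observation given as a 1 x n_inputs matrix.
predict_one <- function(x) {
  hidden <- sigmoid(x %*% W1 + b1)  # 1 x n_hidden
  as.numeric(hidden %*% W2 + b2)    # single numeric prediction
}

# One mostly-zero input row, similar in shape to one-hot encoded data.
x <- matrix(rbinom(n_inputs, 1, 0.05), nrow = 1)

elapsed <- system.time(pred <- predict_one(x))["elapsed"]
print(elapsed)
```

The measured time depends on the machine, so the number should be read as an order of magnitude rather than a benchmark.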

Thanks!

Just in case, I'm going to put on the table the following thought from:

Adriana Santos-Caballero
University of Barcelona

Whether to convert input variables to binary depends on the input variable. You could think of neural network inputs as representing a kind of "intensity": i.e., larger values of the input variable represent greater intensity of that input variable. After all, assuming the network has only one input, a given hidden node of the network is going to learn some function f(wx+b), where f is the transfer function (e.g. the sigmoid) and x the input variable.

This setup does not make sense for categorical variables. If categories are represented by numbers, it makes no sense to apply the function f(wx+b) to them. E.g. imagine your input variable represents an animal, and sheep=1 and cow=2. It makes no sense to multiply sheep by w and add b to it, nor does it make sense for cow to be always greater in magnitude than sheep. In this case, you should convert the discrete encoding to a binary, 1-of-k encoding.

For real-valued variables, just leave them real-valued (but normalize inputs). E.g. say you have two input variables, one the animal and one the animal's temperature. You'd convert animal to 1-of-k, where k=number of animals, and you'd leave temperature as-is.
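The animal/temperature example can be sketched in base R as follows. The temperature values, the weights w and the bias b are made up for illustration; in a real network the weights would be learned during training.

```r
# Toy data for the animal / temperature example (invented values).
animals <- data.frame(
  animal      = factor(c("sheep", "cow", "sheep", "cow")),
  temperature = c(38.9, 38.6, 39.3, 38.4)
)

# 1-of-k encoding: one binary column per animal.
animal_1ofk <- model.matrix(~ animal - 1, data = animals)

# Real-valued input stays real-valued but is normalized (mean 0, sd 1).
temperature_scaled <- as.numeric(scale(animals$temperature))

inputs <- cbind(animal_1ofk, temperature = temperature_scaled)

# One hidden node computing f(wx + b) with a sigmoid transfer function.
sigmoid <- function(z) 1 / (1 + exp(-z))
w <- c(0.5, -0.5, 1.0)  # hypothetical weights: cow, sheep, temperature
b <- 0.1                # hypothetical bias
hidden_activation <- sigmoid(inputs %*% w + b)

print(cbind(inputs, activation = as.numeric(hidden_activation)))
```

With this encoding each animal gets its own weight, so no animal is treated as "larger" than another.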

https://stats.stackexchange.com/questions/33083/how-to-deal-with-a-mix-of-binary-and-continuous-inputs-in-neural-networks

That thought makes me a bit unhappy because Neural Networks was my first option for the prediction (bear in mind my prediction time limit of: 1 sec)

Thanks!
