I'm working on a `Machine Learning` project where I have both `continuous` and `discrete` variables. The goal is to predict the target variable `score` in around `1 second` or less.

The nature of the data is as you can see below:

```
> str(myds)
```

```
'data.frame': 841500 obs. of 30 variables:
$ score : num 0 0 0 0 0 0 0 0 0 0 ...
$ amount_sms_received : int 0 0 0 0 0 0 3 0 0 3 ...
$ amount_emails_received : int 3 36 3 12 0 63 9 6 6 3 ...
$ distance_from_server : int 17 17 7 7 7 14 10 7 34 10 ...
$ age : int 17 44 16 16 30 29 26 18 19 43 ...
$ points_earned : int 929 655 286 357 571 833 476 414 726 857 ...
$ registrationYYYY : Factor w/ 2 levels ...
$ registrationDateMM : Factor w/ 9 levels ...
$ registrationDateDD : Factor w/ 31 levels ...
$ registrationDateHH : Factor w/ 24 levels ...
$ registrationDateWeekDay : Factor w/ 7 levels ...
$ catVar_05 : Factor w/ 2 levels ...
$ catVar_06 : Factor w/ 140 levels ...
$ catVar_07 : Factor w/ 21 levels ...
$ catVar_08 : Factor w/ 1582 levels ...
$ catVar_09 : Factor w/ 70 levels ...
$ catVar_10 : Factor w/ 755 levels ...
$ catVar_11 : Factor w/ 23 levels ...
$ catVar_12 : Factor w/ 129 levels ...
$ catVar_13 : Factor w/ 15 levels ...
$ city : Factor w/ 22750 levels ...
$ state : Factor w/ 55 levels ...
$ zip : Factor w/ 26659 levels ...
$ catVar_17 : Factor w/ 2 levels ...
$ catVar_18 : Factor w/ 2 levels ...
$ catVar_19 : Factor w/ 3 levels ...
$ catVar_20 : Factor w/ 6 levels ...
$ catVar_21 : Factor w/ 2 levels ...
$ catVar_22 : Factor w/ 4 levels ...
$ catVar_23
```

**Question 1:** Given the requirements above, what would be the best prediction algorithm?

If I go to the following Wizard link:

And I check:

Column types: { Numerical, Categorical }

Target type: Numerical

Number of columns: 10s

Number of rows: 100'000s

Then, the only enabled predictive algorithm is `KNN`.

Unfortunately `KNN` is not an option for me, because I have the requirement that the prediction needs to be done in `1 second` or less, and `KNN` defers essentially all of its work to prediction time: each prediction has to search the full training set (~841,500 rows here).
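To make the latency concern concrete, here is a naive sketch (illustrative R on random stand-in data, not the real dataset or a tuned KNN implementation) showing that each `KNN` query touches every training row:

```r
# Naive 1-NN regression: each prediction scans all training rows,
# so per-query cost grows linearly with the training-set size.
set.seed(1)
n <- 100000                          # stand-in for the ~841,500 rows
train_x <- matrix(rnorm(n * 30), n)  # 30 numeric features
train_y <- rnorm(n)

predict_1nn <- function(q) {
  # squared Euclidean distance from the query to every training row
  d <- rowSums(sweep(train_x, 2, q)^2)
  train_y[which.min(d)]
}

query <- rnorm(30)
system.time(predict_1nn(query))  # whole training set touched per query
```

Specialized index structures (e.g. kd-trees) can speed this up, but the baseline cost is why wizards often flag KNN as slow to predict.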

Then, if we transform the dataset, first dropping the rarely used levels of the discrete variables, then one-hot encoding the discrete variables, then doing `target/mean encoding` for `{ city, zip }`, we end up with around `300` numerical columns.
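The transformation steps above could be sketched in base R roughly as follows (a minimal sketch on a toy data frame; `collapse_rare`, `target_encode`, and the threshold are hypothetical names/values, not from the real project):

```r
# Toy stand-in for `myds` (the real data has ~841,500 rows, 30 columns).
myds <- data.frame(
  score     = c(0, 1, 0, 1, 1, 0),
  catVar_05 = factor(c("a", "b", "a", "c", "c", "c")),
  city      = factor(c("X", "Y", "X", "Z", "Y", "X"))
)

# 1) Collapse rarely used factor levels into a single "other" bucket.
collapse_rare <- function(f, min_count = 2) {
  keep <- names(which(table(f) >= min_count))
  factor(ifelse(f %in% keep, as.character(f), "other"))
}
myds$catVar_05 <- collapse_rare(myds$catVar_05)

# 2) Target/mean encoding for high-cardinality factors such as city/zip:
#    replace each level with the mean of `score` over that level.
target_encode <- function(f, y) ave(y, f, FUN = mean)
myds$city <- target_encode(myds$city, myds$score)

# 3) One-hot encode the remaining factors (model.matrix expands them).
X <- model.matrix(score ~ . - 1, data = myds)
dim(X)  # rows x (numeric + dummy) columns
```

Note that naive target encoding like this leaks the target into the features; in practice it is usually done with out-of-fold means or smoothing.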

Then, we input that again into the Wizard:

Column types: { Numerical } `(changed)`

Target type: Numerical

Number of columns: 100s `(changed)`

Number of rows: 10'000s `(changed)`

and now we get `Neural Networks` as an option. By the way, if we change the number of rows from `10'000s` back to `100'000s`, the `Neural Networks` option disappears.

For now let's proceed with: Number of rows: 10'000s

If we change from:

Column types: { Numerical }

Target type: Numerical

Number of columns: 100s

Number of rows: 10'000s

to:

Column types: { Numerical, Binary } `(changed)`

Target type: Numerical

Number of columns: 100s

Number of rows: 10'000s

(just adding `Binary` to the column types), then `Neural Networks` disappears again.

My concern here is that when we do `one-hot encoding` on the discrete variables, the resulting columns are actually `binary`.
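The binary nature of the one-hot columns is easy to verify; a quick check with base R's `model.matrix` on a toy factor (illustrative names only):

```r
# One-hot encoding a factor: the dummy columns contain only 0/1,
# even though R stores them as plain numeric columns.
f <- factor(c("red", "green", "blue", "green"))
X <- model.matrix(~ f - 1)      # one indicator column per level, no intercept
stopifnot(all(X %in% c(0, 1)))  # every entry is binary
```

So declaring these columns as { Numerical, Binary } in the Wizard seems technically accurate, even though numerically nothing about the matrix changes.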

**Question 2:** Could you give me some hints about what's going on here?

**Question 3:** Do you know of any table or checklist that tells me which `Machine Learning` algorithms should be discarded given the nature of a given problem? I did a search on `Google` but didn't get a really reliable answer.

The Wizard above doesn't tell me why it is discarding `Neural Networks`.

Thanks!