Hello,

Integer encoding is generally not useful for categorical variables with more than two categories, because most algorithms will interpret the integer as a value on a numerical scale.

So if you have 100 neighbourhoods, labelled 1 - 100, then the algorithm assumes that neighbourhoods 1 and 5 are more closely related than 1 and 85, whereas there is no such relationship between neighbourhoods based purely on the name. As a counterexample: the **numerical values** of longitude and latitude are related both mathematically and in real life, as close values are also close in space.

The only time integer labels can make *some* sense is on **ordinal variables**, where there is a natural order in the data. For example, if you have 5 age groups (infant, child, teenager, adult and senior), you could label them 1 - 5, and a cutoff of, for example, < 4 in a decision tree would mean "younger than an adult", although it's not a perfect way of doing this.
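
As a minimal sketch of that ordinal labelling, using only the age groups named above (the names `AGE_ORDER`, `AGE_CODE` and `encode_age` are made up for illustration, and the sample rows are invented):

```python
# Ordered age groups from the example; position in the list defines the order.
AGE_ORDER = ["infant", "child", "teenager", "adult", "senior"]

# Map each group to an integer label 1 - 5 (hypothetical helper names).
AGE_CODE = {name: code for code, name in enumerate(AGE_ORDER, start=1)}


def encode_age(group):
    """Return the integer label for an age group; raises KeyError if unknown."""
    return AGE_CODE[group]


# Invented sample rows, just to show the encoding and the "< 4" cutoff.
samples = ["child", "senior", "infant", "adult"]
codes = [encode_age(g) for g in samples]

# A split such as "code < 4" keeps everyone younger than an adult.
younger_than_adult = [g for g in samples if encode_age(g) < 4]
```

The order only carries meaning because the groups are genuinely ordered; the same trick on neighbourhood names would invent an order that does not exist.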

For all other **categorical data**, one-hot vectors are the only real solution. That said, you can sometimes redesign categorical data to become numeric, which sidesteps the issue. A good example in your data is the neighbourhoods. Given that you have the longitude and latitude, it's likely (depending on the question asked) that they capture most of the information the much longer neighbourhood one-hot vector does, as these variables are strongly related. So training models without the neighbourhood would reduce the input space significantly and might yield even better results than a model with 100 extra variables (one-hot).
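
To make the one-hot idea concrete, here is a small sketch in plain Python. The neighbourhood names are placeholders, not taken from your data, and `one_hot` is a hypothetical helper:

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    column = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1  # exactly one column is set per row
        rows.append(row)
    return categories, rows


# Placeholder neighbourhood names, for illustration only.
neighbourhoods = ["Centrum", "Noord", "Centrum", "Zuid"]
categories, encoded = one_hot(neighbourhoods)
```

With 100 neighbourhoods each row would get 100 columns, which is the growth in input size mentioned above. In practice you would probably use a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` rather than writing this by hand.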

Another example (again, depending on their meaning) could be to convert the day and month to a date (if you also have the year). You can then make it numerical by converting the date to Unix epoch time. So the more creative you get, the more you can potentially reduce the input size for the models.
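
A rough sketch of that date conversion, assuming the year, month and day are available as integers and that treating every date as midnight UTC is acceptable (`to_unix_time` is a made-up name):

```python
from datetime import datetime, timezone


def to_unix_time(year, month, day):
    """Seconds since 1970-01-01 00:00:00 UTC for midnight UTC on the given date."""
    moment = datetime(year, month, day, tzinfo=timezone.utc)
    return int(moment.timestamp())


# Three separate columns collapse into a single numeric feature.
ts = to_unix_time(2020, 1, 1)  # 1577836800
```

Fixing the timezone explicitly keeps the result independent of the machine the code runs on.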

You should always test the difference between a one-hot encoding and a self-created new variable, and see which works best (make sure you always evaluate on the same test set).
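
One way to keep the test set identical is to split by row index once, with a fixed seed, and reuse those indices for every feature variant. This is only a sketch; `split_indices`, the 20% test fraction and the seed are assumptions, not part of your setup:

```python
import random


def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed and return (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# Split once, then select the same rows from each version of the features
# (for example the one-hot version and the longitude/latitude version).
train_idx, test_idx = split_indices(10)
```

Both models are then trained on the rows in `train_idx` and scored on the rows in `test_idx`, so any difference in score comes from the encoding rather than from a different split.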

Hope this helps,

PJ