Encode categorical variables in R or not

I have these variables in my dataset:
Premise types - house, apartment, commercial
Month - jan to dec
Day of week - mon, tue, wed, ...
neighbourhood - about 100 different neighbourhood names
latitude - numeric
longitude - numeric

If I use tree-based algorithms, say a decision tree or random forest, is integer/label encoding enough for premise type, occ month, occ day of week and neighbourhood? Is one-hot encoding required only if I use other algorithms? And can lat/lon stay as they are for these tree-based algorithms? Sorry for so many questions. Any help will be appreciated.
Thanks,

Hello,

Integer encoding is not going to be useful for categorical variables with more than 2 categories, because the algorithms will interpret the integers as values on a numerical scale.

So if you have 100 neighbourhoods, labelled 1 - 100, then the algorithm assumes that neighbourhoods 1 and 5 are more closely related than 1 and 85, whereas there is no such relationship between neighbourhoods based purely on the name. As a counterexample: the numerical values of longitude and latitude are related both mathematically and in real life, as close values are also close in space.

The only time integer labels can make some sense is on ordinal variables, where there is a natural order in the data. For example, if you have 5 age groups (infant, child, teenager, adult and senior), you could label them 1 - 5, and a cutoff of, for example, < 4 in a decision tree would mean "younger than an adult", although it's not a perfect way of doing this.

For all other categorical data, one-hot vectors are the only real solution. That said, you can sometimes redesign categorical data to become numeric, and solve the issue that way. A good example in your data is the neighbourhoods. Given you have the longitude and latitude, they likely capture (depending on the question asked) most of the information in the much longer neighbourhood one-hot vector, as these variables are very much related. So training models without the neighbourhood would reduce the input space significantly and might yield even better results than a model with 100 extra variables (one-hot).
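To make the one-hot idea concrete, here is a small sketch for the three premise types, again in plain Python rather than R. The helper names (`one_hot`, `label`, `squared_distance`) are hypothetical. It also shows why integer labels are misleading: they put "house" closer to "apartment" than to "commercial", while one-hot vectors keep every pair of categories equally far apart.

```python
PREMISE_TYPES = ["house", "apartment", "commercial"]


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def label(value, categories):
    """Integer/label encoding: position in the list, starting at 1."""
    return categories.index(value) + 1


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


house = one_hot("house", PREMISE_TYPES)
apartment = one_hot("apartment", PREMISE_TYPES)
commercial = one_hot("commercial", PREMISE_TYPES)

# Label encoding invents an order: house-commercial looks twice as far as house-apartment.
print(abs(label("house", PREMISE_TYPES) - label("apartment", PREMISE_TYPES)))
print(abs(label("house", PREMISE_TYPES) - label("commercial", PREMISE_TYPES)))

# One-hot encoding keeps all categories equidistant.
print(squared_distance(house, apartment), squared_distance(house, commercial))
```

With 100 neighbourhoods the same function would produce 100 columns, which is the input-space cost mentioned above.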

Another example (again, depending on their meaning) could be to convert the day and month to a date, if you also have the year. You can then make something numerical by converting the date to Unix epoch time. So the more creative you get, the more you are potentially able to reduce the input size for the models.
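As a rough sketch of that conversion, assuming the year, month, day and hour are available as integers and that the timestamps should be read as UTC (both are assumptions, not something stated about your data), Python's standard `datetime` module can do it. The function name `to_unix_time` is made up for illustration:

```python
from datetime import datetime, timezone


def to_unix_time(year, month, day, hour=0):
    """Seconds since 1970-01-01 00:00 UTC for the given date and hour.

    Assumes the inputs describe a UTC time; adjust the timezone if the
    data was recorded in local time.
    """
    moment = datetime(year, month, day, hour, tzinfo=timezone.utc)
    return int(moment.timestamp())


print(to_unix_time(2018, 1, 1))                             # 1514764800
print(to_unix_time(2018, 1, 2) - to_unix_time(2018, 1, 1))  # 86400 seconds in a day
```

The result is a single numeric column in which later dates always have larger values, so it replaces the separate year, month and day columns.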

You should always test the difference between a one-hot versus a self-created new variable, and see which works best (make sure you always have the same test-set to check on).

Hope this helps,
PJ

Okay thanks PJ. Just to restate:
So for neighbourhood, I can drop this and use only the lat/long for training? That will definitely reduce the input space.

So for month and day of week there is no need for label or integer encoding then?
Also, the original dataset did have the Unix epoch timestamp, and I had to extract the months, years, weeks and so on from it. Perhaps I can continue to use the Unix timestamp and drop the individual month, day and week columns when training? I'd have to try this.

For premise type, I would need the one hot encoding?

Thanks a ton again for your insights.

Hi,

The most important thing I'm missing at this point is what question you are trying to answer with the model. This really defines the approach in choosing the correct inputs and models.

For the neighbourhood: indeed, try to see if there is a difference between using it (one-hot) or not, given the lat/long might be enough.

If you have the Unix timestamp, I suggest you indeed use that instead of separate day/month/year. Again, it depends on the question you're asking. If you're comparing things over time, the Unix timestamp really works, but if, for example, you want to see whether things are different on weekdays than on weekends, the Unix timestamp won't be useful (in that case you could create a variable with 0 = weekday, 1 = weekend, for example).
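A minimal sketch of such a weekday/weekend variable derived from the Unix timestamp, in plain Python. It assumes the timestamp is in seconds and should be interpreted as UTC; if the events were recorded in local time, the timezone would need adjusting. The name `is_weekend` is hypothetical:

```python
from datetime import datetime, timezone


def is_weekend(unix_seconds):
    """Return 1 for Saturday or Sunday, 0 for Monday to Friday (UTC)."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    # weekday(): Monday is 0, ..., Saturday is 5, Sunday is 6
    return 1 if moment.weekday() >= 5 else 0


monday = 1514764800              # 2018-01-01 00:00 UTC, a Monday
saturday = monday + 5 * 86400    # five days later

print(is_weekend(monday))    # 0
print(is_weekend(saturday))  # 1
```

This gives one binary column instead of seven one-hot day-of-week columns, at the cost of losing any difference between individual weekdays.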

The premises should be one-hot, yes, and it's only 3 categories, so that should be fine. Nearly every algorithm would need one-hot for categorical data (depending on the implementation, it may do this automatically in the background, so don't be fooled into thinking it can handle categorical data by itself). That's why it's always a good idea to see if you can either group the categories (reducing the input space) or find a way to make the variable continuous (numeric).

PJ

Okay. Sorry, I should have mentioned more about the problem/model. For my project, I am trying to predict crime categories (classified as Assault, Break and Enter, Robbery, Theft Over and Auto Theft) given the inputs, which are occurrence year, day, month, hour, premise type, neighbourhood, latitude and longitude. I read somewhere that when applying a classifier in R, say a decision tree, categorical variables need not be encoded. Anyway, I plan to use a KNN classifier, a decision tree and a random forest, and compare them to see which model works well.

Is this to try to understand the past, or to make predictions of the future?
I wonder if year would be useful. If your data contains crimes from, I don't know, 2018, it's never going to be 2018 again, so future data won't ever fall into whatever part of the tree splits on year <= 2018 etc.

Month of year and day of month might be reasonable. You could also look up national holidays and use the day of week to accommodate any patterns relating to business/leisure. You could even look up related data from other sources, such as sunrise and sunset on the given days, and the weather on those occasions.
