Date Fields - Do they need processing for my analysis?

Dates and times are always a pain.
Look into:
?as.Date
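
As a rough illustration of what as.Date gives you, here is a minimal sketch. The date strings and their "%Y-%m-%d" format are made up for the example, not taken from your data, so adjust the format argument to whatever your date columns actually look like.

```r
# Hypothetical date strings; the real column and its format are assumptions
occ <- c("2014-03-02", "2014-12-25")

# Convert character strings to Date objects, stating the format explicitly
d <- as.Date(occ, format = "%Y-%m-%d")

# Once the values are Dates, parts can be extracted for analysis
occ_year  <- as.integer(format(d, "%Y"))  # year as an integer
occ_month <- as.integer(format(d, "%m"))  # month number, 1 to 12
occ_dow   <- weekdays(d)                  # day name, depends on the locale

# Date arithmetic works directly: number of days between the two events
days_between <- as.numeric(d[2] - d[1])
```

Whether you need this at all depends on the analysis: if the data already has separate month and day-of-week columns, converting may only matter for sorting or computing intervals.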

Please use the original dput structure I posted for the following question:

The first column, where you see GO-2019770786, is the event_unique_id. Although it is called unique, I see duplicates. I understand that one event can have multiple offences, i.e. several MCI categories in the dataset, and those are not duplicates. However, for some records I found duplicate event ids with the same MCI. In this case, how would I drop the duplicates?

I am not sure how to proceed here.

The first two records are duplicates whereas the last two are not.

event_unique_id  premisetype  ucr_code  ucr_ext  offence                         MCI
GO-20141262553   Other        1430      100      Assault                         Assault
GO-20141262553   Other        1430      100      Assault                         Assault
GO-20141296470   Commercial   2120      200      B&E                             Break and Enter
GO-20141296470   Commercial   1480      100      Assault - Resist/ Prevent Seiz  Assault

If you want a data frame with no duplicated rows, you can use the unique() function. If the data frame is named DF:

DF_uniq <- unique(DF)
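
To check this against the four rows posted above, here is a small sketch. The data frame is typed in by hand from that sample (the column types are a guess, since the dput output is not repeated here), and the key-based variant with duplicated() is an alternative in case only some columns should define a duplicate.

```r
# Sample rows re-typed from the post; column types are assumptions
DF <- data.frame(
  event_unique_id = c("GO-20141262553", "GO-20141262553",
                      "GO-20141296470", "GO-20141296470"),
  premisetype     = c("Other", "Other", "Commercial", "Commercial"),
  ucr_code        = c(1430, 1430, 2120, 1480),
  ucr_ext         = c(100, 100, 200, 100),
  offence         = c("Assault", "Assault", "B&E",
                      "Assault - Resist/ Prevent Seiz"),
  MCI             = c("Assault", "Assault", "Break and Enter", "Assault"),
  stringsAsFactors = FALSE
)

# Drop rows that are identical across every column
DF_uniq <- unique(DF)

# Alternative: treat rows as duplicates when only the key columns match,
# keeping the first occurrence of each event id / MCI pair
key_cols <- c("event_unique_id", "MCI")
DF_key <- DF[!duplicated(DF[, key_cols]), ]

# Count how many fully duplicated rows were present
n_dups <- sum(duplicated(DF))
```

On this sample both approaches keep three rows: the repeated Assault record is dropped, and the two GO-20141296470 rows stay because their MCI values differ. On the full data the two can give different results if other columns vary within the same event id and MCI.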

Thank you so much. This works.

If I train tree-based algorithms, say a decision tree or random forest, is integer/label encoding enough for the categorical variables in the dataset (premise type, occ month, occ day of week, neighbourhood)? Is one-hot encoding required only if I use other algorithms? Also, can lat/lon be left as is for these tree-based algorithms? Sorry for so many questions. Any help will be appreciated.
Thanks,

Please start a new thread for this question. It is very different from your initial question, and a new thread will be much more likely to attract someone with the right knowledge.

Okay. Sure. Thanks for your help. I just posted it.
