Hello
I am new to R and machine learning. I want to encode the ethnicity column of my dataset as a factor. The column contains no missing values when checked this way:
length(which(is.na(dataset$ethnicity)))
[1] 0
After converting it to a factor as below:
dataset$ethnicity <- factor(dataset$ethnicity,
                            levels = c("?", "Asian", "Black", "Hispanic", "Latino",
                                       "Middle Eastern", "Others", "South Asian",
                                       "Turkish", "White-European"),
                            labels = c(0,1,2,3,4,5,6,7,8,9))
the same check reports missing values:
length(which(is.na(dataset$ethnicity)))
[1] 29
Having looked more closely at the logical vector of missing values with
is.na(dataset$ethnicity)
It appears that the "Middle Eastern" and "South Asian" values are the ones that were not converted and became NA instead. I don't know whether this is because they are the only two-word values containing a space. In addition, some of the values ended up with codes other than the ones I intended. Is there a way to deal with this?
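One way to narrow this down is to compare the raw values against the levels vector, since factor() turns any value not listed in levels into NA. The sketch below uses a hypothetical toy vector in place of dataset$ethnicity and assumes, without having confirmed it for the real data, that the unmatched values carry stray trailing whitespace. The names raw, expected, unmatched, cleaned and encoded are made up for illustration.

```r
# Hypothetical toy data standing in for dataset$ethnicity. The trailing
# spaces are an ASSUMED cause, not something confirmed for the real data.
raw <- c("Asian", "Middle Eastern ", "South Asian ", "White-European", "?")

expected <- c("?", "Asian", "Black", "Hispanic", "Latino", "Middle Eastern",
              "Others", "South Asian", "Turkish", "White-European")

# Values present in the data but absent from `levels` become NA in factor(),
# so list them before converting.
unmatched <- setdiff(unique(raw), expected)
print(unmatched)

# Strip leading/trailing whitespace, then encode with the intended labels.
cleaned <- trimws(raw)
encoded <- factor(cleaned, levels = expected, labels = 0:9)

# Count the NAs introduced by the conversion.
print(sum(is.na(encoded)))

# as.integer() on a factor returns the internal 1-based level codes,
# not the labels; go through as.character() to recover the labels.
internal_codes <- as.integer(encoded)
label_codes <- as.integer(as.character(encoded))
```

If setdiff() on the real column returns values that differ from the levels only by whitespace or capitalisation, cleaning them before calling factor() should remove the NAs. The last two lines may also be relevant to the unexpected codes: reading the factor with as.integer() or as.numeric() yields 1 to 10 rather than the 0 to 9 given in labels.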
Thanks