I am new to R and machine learning. I want to encode the ethnicity column as a factor. It contains no missing values when I check it before the conversion.
After converting it to a factor as below:
dataset$ethnicity <- factor(dataset$ethnicity,
                            levels = c("?", "Asian", "Black", "Hispanic", "Latino", "Middle Eastern", "Others", "South Asian", "Turkish", "White-European"),
                            labels = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
the column now shows missing values.
Looking more closely at which entries are flagged as missing, it appears that the "Middle Eastern" and "South Asian" values were not converted. I don't know whether that is because they are the only two-word values containing a space. Also, some values received codes other than the ones I intended. Is there a way to deal with this?
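For reference, here is a minimal sketch of how the exact-match hypothesis could be checked. It uses toy data rather than the real column: the values in `ethnicity`, including the trailing space and the stray quotes, are assumptions made up for illustration, and the real data may differ. `factor()` turns any value that is not an exact match for one of the `levels` into `NA`, so listing the unmatched raw values shows what they actually look like.

```r
# Toy stand-in for dataset$ethnicity (hypothetical values, not the real data).
ethnicity <- c("Asian", "Middle Eastern ", "South Asian",
               "'South Asian'", "White-European", "?")

lvls <- c("?", "Asian", "Black", "Hispanic", "Latino", "Middle Eastern",
          "Others", "South Asian", "Turkish", "White-European")

# Values that do not match a level exactly become NA.
f_raw <- factor(ethnicity, levels = lvls)
n_missing_raw <- sum(is.na(f_raw))

# Distinct raw values with no matching level; print() wraps them in quotes,
# which makes stray spaces visible.
unmatched <- setdiff(unique(ethnicity), lvls)
print(unmatched)

# Assumed cleanup: drop quote characters and surrounding whitespace, then convert.
cleaned <- trimws(gsub("[\"']", "", ethnicity))
f_clean <- factor(cleaned, levels = lvls, labels = seq_along(lvls) - 1)
n_missing_clean <- sum(is.na(f_clean))
```

On the unintended codes: `as.numeric()` applied to a factor returns the internal level positions (1 to 10 here), not the labels, so `as.numeric(as.character(f_clean))` would be needed to get 0 to 9 back. Whether that explains the codes seen in the real data is only a guess.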