Spaced Double Word Data not Factored

Hello
I am new to R and machine learning. I want to encode this column as a factor. The column contains no missing values when checked this way:
length(which(is.na(dataset$ethnicity)))
[1] 0

After converting it to a factor as below:
dataset$ethnicity <- factor(dataset$ethnicity,
levels = c("?", "Asian", "Black", "Hispanic", "Latino", "Middle Eastern", "Others", "South Asian", "Turkish", "White-European" ),
labels = c(0,1,2,3,4,5,6,7,8,9))

the same check now reports missing values:

length(which(is.na(dataset$ethnicity)))
[1] 29

Looking more closely at the logical vector returned by
is.na(dataset$ethnicity)
it appears that the "Middle Eastern" and "South Asian" values are the ones that were not converted. I don't know whether that is because they are the only two-word values containing a space. Also, some of the values received codes other than the intended ones. Is there a way to deal with this?
Thanks

Is it possible that dataset$ethnicity before you converted it to a factor actually is not using exactly "Middle Eastern" or "South Asian"?

It may be worth looking at unique(dataset$ethnicity) before changing it to be a factor to see if that might be the case.
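
A minimal sketch of that check, using a small made-up vector in place of dataset$ethnicity (the sample values, including "Unlisted group", are assumptions for illustration only). It compares the unique values against the intended levels and quotes the ones that don't match, so stray whitespace becomes visible:

```r
# Hypothetical stand-in for dataset$ethnicity; note the trailing space
ethnicity <- c("Asian", "Middle Eastern ", "White-European", "Unlisted group", "?")

# Levels the original factor() call listed
wanted <- c("?", "Asian", "Black", "Hispanic", "Latino", "Middle Eastern",
            "Others", "South Asian", "Turkish", "White-European")

# Values present in the data but absent from the levels; factor() turns these into NA
unmatched <- setdiff(unique(ethnicity), wanted)

# Wrapping each value in quotes exposes leading or trailing spaces
print(encodeString(unmatched, quote = '"'))

# How many rows each unmatched value accounts for
print(table(ethnicity[!ethnicity %in% wanted]))
```

Running the same setdiff() call on the real column should list exactly the values that end up as NA.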

But you also don't need to specify the levels if you are simply converting the column to a factor. R automatically assigns levels based on the sorted unique values in the data. A reason to specify the levels explicitly is to control their order, but that introduces the potential for what seems to have happened here: any value that does not exactly match one of the listed levels becomes NA.
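
To illustrate the difference, here is a small sketch on a made-up vector (the values are assumptions, not the real data). It shows factor() with and without an explicit levels argument:

```r
# Hypothetical sample values
x <- c("White-European", "Asian", "Middle Eastern", "Asian")

# No levels given: factor() uses the sorted unique values, so nothing becomes NA
f_auto <- factor(x)
print(levels(f_auto))

# The underlying integer codes start at 1 and follow the level order
print(as.integer(f_auto))

# Explicit levels control the order, but any value not listed becomes NA
f_manual <- factor(x, levels = c("White-European", "Asian"))
print(f_manual)
print(sum(is.na(f_manual)))
```

Note that the labels argument only renames the levels; labels such as 0 to 9 are stored as the character level names "0" to "9", while the internal integer codes still run from 1.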

Thanks; I have found the solution. I ran unique(dataset$ethnicity) and found a value in the data that was not listed in the levels and labels. Also, "Middle Eastern" was actually stored as "Middle Eastern ", with a space after the final 'n'. It was only through unique(dataset$ethnicity) that I discovered this.
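
For anyone hitting the same problem, one possible fix (a sketch on made-up values, not necessarily what was done in this thread) is to strip surrounding whitespace with trimws() before building the factor, then confirm that no NA values were introduced:

```r
# Hypothetical raw values with stray leading and trailing spaces
raw <- c("Middle Eastern ", " South Asian", "Asian", "?")

# Intended levels (shortened here for the example)
wanted <- c("?", "Asian", "Middle Eastern", "South Asian")

# trimws() removes leading and trailing whitespace from each string
clean <- trimws(raw)

f <- factor(clean, levels = wanted)

# Stop early if any value still failed to match a level
stopifnot(!anyNA(f))
print(table(f))
```

A value that is genuinely missing from the levels (rather than just padded with spaces) would still need to be added to the levels vector.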
