I have two questions regarding data cleaning and recoding of a GENDER variable from a survey I did.
My original numeric data has the following values under the GENDER variable:
"1" (which is female)
"2" (which is male)
"4" (which is other)
"5" (which is unsure)
Then I apparently have two NAs.
NOTE: nobody selected option "3" which was "non binary" in our survey
I start by running this code in R:
...
# convert from numeric survey responses to a factor variable
newdata$gender <- as.factor(newdata$gender)
str(newdata$gender) # looks OK I got following: factor w/ 4 levels "1", "2", "4", "5"
table(newdata$gender) #
summary(newdata$gender) #
.... Question #1
Why do the table and summary outputs differ? table shows 4 levels, but summary also includes the NAs (see results pasted below).
Question #2.
How do I recode this factor simply so that it has only 3 levels (Male, Female, Other),
where anybody who selected 3, 4 or 5 (or is NA) is simply subsumed into "Other"? Remember, nobody ever selected "3" in the survey.
table's default behaviour is to ignore NAs; summary's is to include them. The behaviour of table can be changed by passing the useNA argument. See below, including an example for Q2.
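To make both points concrete, here is a minimal base R sketch. The vector gender_raw is made-up data that only mimics the coding described in the question (1, 2, 4, 5 plus two NAs), and the nested ifelse() recode is just one possible approach, not necessarily the one originally suggested.

```r
# Hypothetical data mimicking the survey coding: 1, 2, 4, 5 plus two NAs
gender_raw <- c(1, 2, 2, 4, 5, 1, NA, 2, NA, 1)
gender <- as.factor(gender_raw)

table(gender)                    # NAs are dropped by default
table(gender, useNA = "ifany")   # adds an <NA> count when NAs are present
summary(gender)                  # reports NA's when the factor contains any

# Q2: collapse to three levels. %in% returns FALSE for NA, so anything
# that is not 1 or 2 (3, 4, 5 or NA) ends up in "Other".
gender3 <- ifelse(gender_raw %in% 1, "Female",
           ifelse(gender_raw %in% 2, "Male", "Other"))
gender3 <- factor(gender3, levels = c("Male", "Female", "Other"))
table(gender3)
```

Using %in% rather than == matters here: a comparison with == would return NA for the missing responses, and they would stay NA instead of being counted as "Other".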
Great, thanks. A follow-up question re: including/excluding NAs. In fact, rather than “including” NAs, my main problem is getting rid of them, especially when I calculate % tables and draw histograms/bars. How do I get rid of them so they don't interfere with my (table/visual) presentation of the data? See attached pic for an example of the problem.
You can take your whole data.frame and use the function na.omit() on it to throw away records/observations with NA values in them, if that's what you wish to do...
This implies that there isn't a single observation in newdata that doesn't have an NA somewhere in one of its columns... this is bad.
However, you don't assign the result of na.omit(newdata) to any R name with <-, as I would have expected you to do. Therefore, when you group and summarise newdata, it's the same newdata as before you ran na.omit.
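A small illustration of the assignment point, using a made-up data frame in place of the real newdata (the column values and the name cleaned are assumptions for the example only):

```r
# Hypothetical stand-in for newdata
newdata <- data.frame(
  gender     = c(1, 2, NA, 4, 2),
  acceptance = c(3, NA, 5, 4, 2)
)

na.omit(newdata)              # prints the complete rows, but newdata is unchanged
nrow(newdata)                 # still counts every original row

cleaned <- na.omit(newdata)   # assign the result to actually keep it
nrow(cleaned)                 # only the rows with no NA in any column
```

Anything computed from newdata after the first call still sees the NAs; only cleaned has them removed.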
Wow, that is good to know, thx. I see now what you mean.
I therefore think (?) the solution might be to use na.omit on certain variables included in some calculation/tabulation, rather than the whole dataset, right? In other words, when analyzing variables like GENDER (independent var.) and ACCEPTANCE (dependent variable) I could use these commands (if I understand you correctly)
Rather than work with separate vectors and use na.omit on each, which would break the relationship between one element of one vector and the matching element of the other, it makes more sense to me to select() the two columns of interest into their own table, na.omit() that table, and send that to your summarising function or plot.
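A sketch of that idea, again on made-up data. It uses base R column subsetting so it runs without extra packages; dplyr's select() would do the same job. The column names gender, acceptance and comments, and the name pair, are assumptions for illustration.

```r
# Hypothetical data: the comments column is NA for every respondent
newdata <- data.frame(
  gender     = c(1, 2, NA, 4, 2, 1),
  acceptance = c(3, NA, 5, 4, 2, 1),
  comments   = NA
)

nrow(na.omit(newdata))   # 0 rows: every row has an NA somewhere

# Keep only the two columns of interest, then drop incomplete rows.
# Rows are removed as a whole, so gender and acceptance stay paired.
pair <- na.omit(newdata[, c("gender", "acceptance")])
nrow(pair)

# Percentage table computed on the complete pairs only
round(100 * prop.table(table(pair$gender)), 1)
```

This also shows why na.omit() on the whole table can return nothing: a single column that is mostly or entirely NA is enough to discard every row.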