R command to drop rows in a specified column with 'NA' not working?

amelio · December 16, 2022, 3:51pm

I successfully merged two data sets and am now trying to remove all rows with an NA value in the 'Provider_Name' column in R. I have tried the commands below, but neither of them deletes any rows when I write the result to a CSV file with write.table. R does not return any error messages. What am I missing?

APSIVMerged[!is.na(APSIVMerged$Provider_Name),] 

APSIVMerged %>% drop_na(Provider_Name)

FJCC · December 16, 2022, 4:13pm

Are you storing the result of the filtering?


APSIVMerged <- APSIVMerged[!is.na(APSIVMerged$Provider_Name),]
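
A minimal sketch of why the assignment matters, using a hypothetical toy data frame (`toy`, with an invented `Score` column) in place of APSIVMerged. Subsetting returns a new data frame and leaves the original untouched, so a result that is not stored is only printed and then discarded. The same applies to the `%>% drop_na(...)` version.

```r
# Hypothetical toy data standing in for APSIVMerged
toy <- data.frame(
  Provider_Name = c("A", NA, "C"),
  Score = c(1, 2, NA),
  stringsAsFactors = FALSE
)

# Without assignment, the filtered result is printed and then discarded
toy[!is.na(toy$Provider_Name), ]
nrow(toy)        # toy itself still has all 3 rows

# Assigning the result keeps the filtered data
toy_clean <- toy[!is.na(toy$Provider_Name), ]
nrow(toy_clean)  # 2 rows; the row with NA only in Score is kept
```

Whatever object is later passed to write.table needs to be the stored, filtered one.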

amelio · December 16, 2022, 4:44pm

OK, fixed that dumb mistake. But now I can see that, rather than deleting only the observations with NA in the "Provider_Name" column, it's deleting all rows that contain an NA in any column. Any ideas on where I'm going wrong there?

nirgrahamuk · December 16, 2022, 5:06pm

That's unexpected, as in principle it should work.


(ex_df <- data.frame(
  x = c(1, NA),
  y = c(NA, 2)
))

# Keeps only row 1: x is not NA there, even though y is NA
ex_df[!is.na(ex_df$x), ]

Can you demonstrate that your assessment is correct, i.e. that rows with NAs in other columns, but not in the provider name column, are also being lost?

amelio · December 16, 2022, 5:24pm

The original data set contains ~65,000 observations; merging in one new variable from another data set results in ~72,000 observations. NA appears in various variables for different reasons, and removing all observations with NA in Provider_Name should bring the data back to roughly 65,000 rows. Instead, the command we're discussing leaves ~20,500 observations. When I cleaned the data manually in Excel the other day, I was able to get it to ~65,000.

nirgrahamuk · December 16, 2022, 5:37pm

My recommendation is to go quantitative and count the NAs in the relevant files. Perhaps your join went awry?

If you are correct and you have way more NAs in your data sets than you should, you will need to backtrack through your steps and find out how you are introducing them. However, the shoe could be on the other foot, with the expectation you formed in Excel not being borne out. I can hardly comment: I didn't see the Excel file, and I haven't seen your data in R, nor the code you've used aside from the !is.na() line.
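
One way the counting could look, as a rough sketch on a hypothetical stand-in data frame (`merged` and its `Score` column are invented for illustration; substitute APSIVMerged and your real columns). It counts NAs per column, works out how many rows the filter should keep, and checks that against what the filter actually returns.

```r
# Hypothetical stand-in for the merged data
merged <- data.frame(
  Provider_Name = c("A", NA, "C", NA, "E"),
  Score = c(1, NA, NA, 4, 5),
  stringsAsFactors = FALSE
)

# Number of NAs in each column
na_counts <- colSums(is.na(merged))
na_counts

# Rows the filter should drop, and rows it should keep
n_drop <- sum(is.na(merged$Provider_Name))
n_keep <- nrow(merged) - n_drop

# Apply the filter and confirm the row count matches the expectation
filtered <- merged[!is.na(merged$Provider_Name), ]
stopifnot(nrow(filtered) == n_keep)

# NAs in other columns should survive the filter
sum(is.na(filtered$Score))
```

If `n_keep` computed on the real data is already close to 20,500, the filter is behaving as written and the extra NAs in Provider_Name are coming from an earlier step such as the merge.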

amelio · December 16, 2022, 5:58pm

Might it have anything to do with the way the variables are "formatted" - I have Provider_Name as a "character" variable (which makes the most sense). Would that have any impact?

nirgrahamuk · December 16, 2022, 6:07pm

No, NA values in a character column are no problem:

(ex_df <- data.frame(
  x = as.character(c(1, NA)),
  y = c(NA, 2)
))

ex_df[!is.na(ex_df$x), ]
