How to deal with NA values in R?

Hi,

I am engaged in a college project in R which is all about the application of logistic regression.

The data set I was given contains a lot of blank cells, so I converted all of them to NA. After fitting glm(), the output is not what I expect, because of the missing values in the data set.

I used na.omit() to delete the NA values, but when I do this all the rows and columns get deleted. I only want to remove the cells that actually contain NA.

If anyone can help me with this, it would be a great help for my project work. If any other information is needed, please let me know.

Regards
Saikat

Please show us an example of what you did.

This is the R code of what I did.

X2 <- read.csv(file.choose(), header = TRUE, na.strings = c("", " ", "NA"))
X2[X2 == 0] <- NA

head(X2)
A22 = X2

library(dplyr)
library(tidyverse)

#install.packages('tidyverse')
A22 = A22 %>% mutate(across(where(is.character), toupper))
str(A22)

A22$Time = as.factor(A22$Time)

A22$Time = as.numeric(A22$Time)
A22$Time
categ = cbind(num_cat=unique(A22$Time),Actual=unique(X2$Time))
categ
A2 = subset(A22, Time == "Baseline")
A2

for (i in 1:nrow(A22)) {
  if (A22[i, 1] == 'Baseline') {
    A22$Time3[i] = '1'
  } else {
    A22$Time3[i] = '0'
  }
}
A22$Time3 = as.numeric(A22$Time3)
A22$Time3

library(ISLR)
Fir <- glm(Time3 ~ Community + Location_cat + Drop_yn + Dose_yn + Dose + Age + Gender +
             Country_b + Time_aus + language + language_others + Ethnicity + Postcode_SEIFA +
             Postcode + Edu_combine + Education + Country_Q + Employ_comb + Employ_cat +
             Employment + House_cat + Household + Referral,
           data = A22,
           family = "binomial", control = list(maxit = 100))
summary(Fir)

step(Fir)

With this R code, glm() does not give the correct output because of the missing values.

We need a reproducible example (reprex)
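
A reprex for this kind of problem can be a few lines with made-up data. The sketch below is only a guess at what might be happening; the data frame `toy` and all of its values are invented, not taken from the real dataset. It shows that one column that is entirely NA is enough to make na.omit() drop every row:

```r
# Hypothetical toy data standing in for the real dataset (values are invented)
toy <- data.frame(
  Time3     = c(1, 0, 1, 0),
  Community = c("A", NA, "B", "A"),
  Referral  = c(NA, NA, NA, NA)   # a column that is missing everywhere
)

# Every row contains at least one NA, so nothing survives
rows_all_columns <- nrow(na.omit(toy))

# Restricting to the columns that matter keeps the rows that are complete there
rows_two_columns <- nrow(na.omit(toy[c("Time3", "Community")]))

rows_all_columns   # 0
rows_two_columns   # 3
```

Posting something like this, together with the exact error message, usually makes it much easier for others to help.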

Hello,

The solution is to convert the column to a factor:

data$column_name <- as.factor(data$column_name)

I hope this helps!

Hi,

I want to build a model using glm() (logistic regression),

but I am getting errors because of the missing values.

The error screenshot is attached.
[Screenshot: glm error]

The error message says that glm() cannot be computed because of the missing values.

If anyone can help me with this, I would be grateful.

Regards
Saikat

Hi,

The missing values are in the columns of the dataset that hold character values, and I want to replace them so that there are no errors when I run glm().

In the data set screenshot above, you can see that NA values are present, and when I try to use na.omit() the contents of all rows and columns get deleted.

If anyone has an idea how to solve this issue, it would be very helpful for me.

Regards
Saikat

So, before running glm(), please run:

data$alcohol_safe[is.na(data$alcohol_safe)] <- 0.0

You can decide, by your own criteria, which value should replace these NAs. In my case I usually assign 0 or 0.0, as in the code above (0.0 to recognize which values were NAs).

If your problem is NAs, you sometimes have to apply critical thinking or your own criteria to these values, and sometimes you have to replace them with zeros.

I hope this helps. Let's replace the NA values with this code and try again:

your_data[is.na(your_data)] <- 0

And your linear model will run without NAs.

Hello everyone!
I would disagree with @bustosmiguel. Simply replacing missing values with 0 is usually not what you want and can give very wrong results. na.omit() is the right call here. From your description, I suspect that either there is a column with only NAs in your dataset or there is at least one NA in each row, so that na.omit() returns an empty data.frame. Since you are using the tidyverse anyway, you can use filter() to drop only the rows that have NAs in the columns you want.
In your case this would be something like

A22 %>%
  filter(!is.na(Time3),
         !is.na(Community),
         ... # repeat for all
  )

You can make this less cumbersome by using if_any() and any_of()

A22 %>%
  filter(
    !if_any(
      any_of(c("Time3",
               "Community",
               ... # add all columns
      )),
      is.na
    )
  )

However, if this also results in an empty data frame, you don't have any complete cases. In that case it may be necessary to leave certain predictors out of your glm. Check which columns have many missing values by running:

A22 %>%
  summarise(
    across(
      everything(),
      ~ sum(is.na(.x))  # number of NAs in each column
    )
  )
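
A base R counterpart of the per-column count above, as a minimal sketch on invented data (the data frame `toy` and its values are hypothetical; only the column names are borrowed from the model formula):

```r
# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  Time3    = c(1, 0, 1, 0, 1, 0),
  Age      = c(34, NA, 29, 62, NA, 45),
  Gender   = c("F", "M", NA, "F", "M", "F"),
  Referral = rep(NA_character_, 6)   # missing in every row
)

# Number of NAs per column, and the share of rows that are missing
na_counts <- colSums(is.na(toy))
na_share  <- colMeans(is.na(toy))

sort(na_counts, decreasing = TRUE)

# Columns that are missing everywhere are the ones that empty na.omit()
all_missing <- names(na_share)[na_share == 1]
all_missing   # "Referral"
```

Columns with a very high share of NAs are the first candidates to leave out of the model formula.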

In base R, you could also subset your dataset to only those columns that you want to include in the glm, i.e. A22[c("Time3", "Community", ...)] (add your columns). Running na.omit() on that subset should then return the same rows as the dplyr approach above (but without the additional columns).
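
If the additional columns should be kept, one option is complete.cases() on the model columns only. This is a sketch under assumptions: the data frame `toy`, its values, and the `Notes` column are invented, and `model_cols` stands in for whatever columns end up in the formula.

```r
# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  Time3  = c(1, 0, 1, 0, 1, 0, 1, 0),
  Age    = c(34, 51, NA, 45, 29, 62, 38, NA),
  Gender = c("F", "M", "F", NA, "M", "F", "M", "F"),
  Notes  = c(NA, "x", NA, NA, "y", NA, NA, NA)  # not used in the model
)

# Columns that would go into the glm formula (assumed for this sketch)
model_cols <- c("Time3", "Age", "Gender")

# TRUE for rows with no NA in any of the model columns
keep <- complete.cases(toy[model_cols])

# Keep those rows, with all columns, including the unused Notes column
toy_complete <- toy[keep, ]

nrow(toy_complete)   # 5
ncol(toy_complete)   # 4
```

Note that glm() already drops incomplete rows on its own under the default na.action setting (na.omit), so filtering beforehand mainly helps you see how many rows the model is actually fitted on.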
Hope this helps!

Best,
Valentin

Of course, the idea is to share experience. Thanks for that!

Some NAs (but only some!) could be replaced by 0 or the mean; sometimes it is necessary to make decisions like that.

Obviously, changing a lot of NAs across the whole data set could produce unrealistic output.
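
To make the difference concrete, here is a small sketch with invented numbers (the vector `x` is hypothetical) comparing zero-filling with mean-filling for a numeric column:

```r
# Hypothetical numeric column with one missing value
x <- c(10, 12, NA, 14)

# Mean of the observed values only
observed_mean <- mean(x, na.rm = TRUE)

# Two simple replacement strategies
zero_filled <- ifelse(is.na(x), 0, x)
mean_filled <- ifelse(is.na(x), observed_mean, x)

observed_mean       # 12
mean(zero_filled)   # 9, pulled down by the artificial zero
mean(mean_filled)   # 12, mean preserved, but the spread is understated
```

Neither replacement is free: zero-filling shifts the average, and mean-filling shrinks the variance, so both should be used only when there is a reason to believe the filled value is plausible.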