How to deal with NA values in R?

spatra1992 · December 28, 2021, 10:51am

Hi,

I am working on a college project in R that applies logistic regression.

The data set I was given contains a lot of blank cells, so I converted all of them to NA. After fitting the model with glm(), I found that the output is not correct because of the missing values in the data set.

I applied na.omit() to delete the NA values, but when I do this, all rows and columns get deleted. I want to remove only the cells where NA values are present.

If anyone can help me with this, it would be a great help for my project work. If any other information is required, please let me know.

Regards
Saikat

HanOostdijk · December 28, 2021, 11:10am

Please show us an example of what you did.

spatra1992 · December 28, 2021, 11:52am

This is the R code I used.

X2 <- read.csv(file.choose(), header = TRUE, na.strings = c("", " ", "NA"))
X2[X2 == 0] <- NA

head(X2)
A22 <- X2

# install.packages("tidyverse")
library(dplyr)
library(tidyverse)

A22 <- A22 %>% mutate(across(where(is.character), toupper))
str(A22)

A22$Time <- as.factor(A22$Time)

A22$Time <- as.numeric(A22$Time)
A22$Time
categ <- cbind(num_cat = unique(A22$Time), Actual = unique(X2$Time))
categ
A2 <- subset(A22, Time == "Baseline")
A2

for (i in 1:nrow(A22)) {
  if (A22[i, 1:1] == "Baseline") {
    A22$Time3[i] <- "1"
  } else {
    A22$Time3[i] <- "0"
  }
}
A22$Time3 <- as.numeric(A22$Time3)
A22$Time3

library(ISLR)
Fir <- glm(Time3 ~ Community + Location_cat + Drop_yn + Dose_yn + Dose + Age + Gender
           + Country_b + Time_aus + language + language_others + Ethnicity + Postcode_SEIFA
           + Postcode + Edu_combine + Education + Country_Q + Employ_comb + Employ_cat
           + Employment + House_cat + Household + Referral,
           data = A22, family = "binomial", control = list(maxit = 100))
summary(Fir)

step(Fir)

spatra1992 · December 28, 2021, 11:53am

With this R code, glm() does not give the correct output because of the missing values.

jrkrideau · December 28, 2021, 2:47pm

We need a reproducible example (reprex)

FAQ: How to do a minimal reproducible example (reprex) for beginners

A minimal reproducible example consists of the following items:

A minimal dataset, necessary to reproduce the issue
The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages.

Let's quickly go over each one of these with examples:

Minimal Dataset (Sample Data)

You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame

head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.…

bustosmiguel · December 28, 2021, 2:57pm

Hello,

The solution is to convert the column to a factor:

data$column_name <- as.factor(data$column_name)

I hope this helps!

spatra1992 · December 29, 2021, 1:41pm

Hi,

I want to build a model using glm (logistic regression).

But I am getting errors due to missing values.

The error screenshot is attached.
glm screenshot

The error message says that glm can't be computed because of the presence of missing values.

If anyone can help me with this, it would be a great help.

Regards
Saikat

spatra1992 · December 29, 2021, 1:47pm

Hi,

The missing values are in the columns of the data set that hold character values, and I want to replace them so that there are no errors when I run glm.

In the data set screenshot given above, you can see that NA values are present, and when I try to use na.omit(), the contents of all rows and columns get deleted.

If anyone has an idea how to solve this issue, it would be very helpful.

Regards
Saikat

bustosmiguel · December 29, 2021, 2:34pm

So, before running glm(), please run:

data$alcohol_safe[data$alcohol_safe == "NA"] <- 0.0

You can choose, according to your own criteria, which value should replace these NAs. In my case I usually assign 0 or 0.0, as in the code above (0.0 to recognize which values were NAs).

If your problem is NAs, you sometimes have to apply critical thinking or your own criteria to these values, and sometimes you have to replace them with zeros.

I hope this helps. Let's replace the NA values with that code and try again.

bustosmiguel · December 30, 2021, 4:56pm

your_data[is.na(your_data)] <- 0

And your linear model will run without NAs.

valentingar · December 30, 2021, 6:14pm

Hello everyone!
I would disagree with @bustosmiguel. Simply replacing missing values with 0 is usually not what you want and can give very wrong results. na.omit() is the right call here. From your description, I suspect that either there is a column with only NAs in your dataset or there is at least one NA in each row, so that na.omit() returns an empty data.frame. Since you are using the tidyverse anyway, you can use filter() to drop the rows that have NAs in the columns you want.
In your case this would be something like:

A22 %>%
  filter(!is.na(Time3),
         !is.na(Community),
         ... # repeat for all columns
  )

You can make this less cumbersome by using if_any() and any_of()

A22 %>%
    filter(
           !if_any(
              any_of(c("Time3",
                       "Community",
                       ... # add all columns
                )), is.na)
           )
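
As a minimal sketch of the if_any() approach on made-up data: the data frame `toy` and its columns are hypothetical stand-ins, not the poster's data, and if_any() assumes dplyr 1.0.4 or later. It also shows why na.omit() on the whole data frame can return nothing when one column is entirely NA.

```r
library(dplyr)

# Hypothetical toy data standing in for A22 (names and values are made up)
toy <- data.frame(
  Time3     = c(1, 0, NA, 1),
  Community = c("A", NA, "B", "C"),
  Notes     = c(NA, NA, NA, NA)   # a column that is entirely NA
)

# na.omit() on the full data frame drops every row, because Notes is always NA
nrow(na.omit(toy))   # 0

# Restricting the NA check to the model columns keeps the complete rows
model_cols <- c("Time3", "Community")
kept <- toy %>%
  filter(!if_any(any_of(model_cols), is.na))

nrow(kept)           # 2 (rows 1 and 4)
```

Keeping the column names in a character vector such as `model_cols` means the same vector can be reused when building the model formula.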

However, if this also results in an empty data frame, you don't have any complete cases. In that case it may be necessary to leave certain predictors out of your glm. Check which columns have many missing values by running:

A22 %>%
  summarise(
    across(
      everything(),
      ~ sum(is.na(.x))   # number of NAs in each column
    )
  )
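
The same per-column counts can also be obtained in base R. The sketch below uses a made-up data frame (`toy` and its columns are hypothetical) and adds the share of missing values, which may make it easier to spot columns that are mostly empty.

```r
# Hypothetical toy data; the column names are made up for illustration
toy <- data.frame(
  Age    = c(34, NA, 51, NA),
  Gender = c("F", "M", NA, "M"),
  Dose   = c(NA, NA, NA, NA)
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# Share of missing values per column, with the worst columns first
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
na_share
```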

In base R, you could also subset your dataset to only those columns that you want to include in the glm:

A22[c("Time3", "Community", ...)] # add your columns

Running na.omit() on that subset should then return the same rows as the dplyr approach above (but without the additional columns).
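
A small base R sketch of this idea, again on made-up data (the `toy` data frame, its columns, and its values are hypothetical). It also illustrates that glm(), with R's default na.action setting of na.omit, only drops rows that are incomplete in the variables named in the formula, so an unused all-NA column does not empty the model data.

```r
# Hypothetical toy data; Notes is entirely missing and not used in the model
toy <- data.frame(
  Time3     = c(1, 0, 1, 0, 1, 0, 1, 0),
  Age       = c(23, 35, NA, 41, 52, 29, 47, 38),
  Community = c("A", "B", "A", NA, "B", "A", "B", "A"),
  Notes     = NA
)

# na.omit() on everything removes all rows because of the Notes column
nrow(na.omit(toy))          # 0

# Subset to the model columns first, then drop incomplete rows
model_cols <- c("Time3", "Age", "Community")
complete <- na.omit(toy[model_cols])
nrow(complete)              # 6

# Fit on the cleaned subset
fit <- glm(Time3 ~ Age + Community, data = complete, family = binomial)

# Fitting on the full data uses the same 6 rows under the default na.action
fit_direct <- glm(Time3 ~ Age + Community, data = toy, family = binomial)
nobs(fit_direct)            # 6
```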
Hope this helps!

Best,
Valentin

bustosmiguel · December 30, 2021, 6:52pm

Of course, the idea is to share experience, thanks for that!

Some NAs (but only some!) could be replaced by 0 or the mean; sometimes it is necessary to make decisions like that.

Obviously, changing a lot of NAs across the whole data set could give an unrealistic output.
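
For the case where replacing with the mean is judged acceptable, a minimal sketch on made-up data could look like this. The `toy` data frame and the column names `Age_imputed` and `Age_was_missing` are hypothetical; keeping a flag column is just one way to remember which values were filled in.

```r
# Hypothetical numeric column with two missing values
toy <- data.frame(Age = c(20, NA, 40, NA, 60))

# Mean of the observed values only
age_mean <- mean(toy$Age, na.rm = TRUE)   # 40

# Keep a flag so the imputed cells stay identifiable later
toy$Age_was_missing <- is.na(toy$Age)

# Replace only the missing cells; observed values are left unchanged
toy$Age_imputed <- ifelse(is.na(toy$Age), age_mean, toy$Age)

toy
```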
