Removing NA values from a specific column and row

Amonda · March 20, 2019, 11:06am

Hi everyone,
I have a data frame with NA values and I need to remove them.
I tried functions like "na.omit", "is.na", "complete.cases", and "drop_na" from tidyr.
All of these functions work, but the problem is that they remove the whole row.
For example:

> library(tidyr)
> DF <- data.frame(x = c(1, 2, 3, 7, 10), y = c(0, 10, 5, 5, 12), z = c(NA, 33, 22, 27, 35))
> DF %>% drop_na(y)
   x  y  z
1  1  0 NA
2  2 10 33
3  3  5 22
4  7  5 27
5 10 12 35
> DF %>% drop_na(z)
   x  y  z
2  2 10 33
3  3  5 22
4  7  5 27
5 10 12 35

With these functions, I'm removing all values in row 1.
What I want is to remove only the NA value from column z without deleting the values for x and y, maybe to get something like the output below, or to mask this value. Later I need to do a PCA and I can't afford to remove such important data in x and y.
   x  y  z
1  1  0
2  2 10 33
3  3  5 22
4  7  5 27
5 10 12 35

I hope I explained my problem clearly enough.
Thanks in advance

Yarnabrina · March 20, 2019, 11:25am

There's an existing thread on SO, and it seems very popular.

This works:

DF <- data.frame(x = c(1, 2, 3, 7, 10),
                 y = c(0, 10, 5, 5, 12),
                 z = c(NA, 33, 22, 27, 35))

DF
#>    x  y  z
#> 1  1  0 NA
#> 2  2 10 33
#> 3  3  5 22
#> 4  7  5 27
#> 5 10 12 35

DF[is.na(x = DF)] <- 0

DF
#>    x  y  z
#> 1  1  0  0
#> 2  2 10 33
#> 3  3  5 22
#> 4  7  5 27
#> 5 10 12 35

Created on 2019-03-20 by the reprex package (v0.2.1)

Amonda · March 20, 2019, 1:28pm

I already googled it a lot, but all the solutions either remove the column/row or replace the NA with 0 or with the mean.
Your code works, but for me zero is a real value. That's why I was hoping there is a way to take out the NA values without replacing them with 0 or any other value.
Thanks a lot for your response.

Yarnabrina · March 20, 2019, 1:38pm

I'm not sure I understand.

In your example, you substituted NA by 0. If you want, you can use anything you prefer following my code.

But from your reply, it seems that you want a blank in the dataset?

I don't know whether that can be done or not, but my understanding was that NA already behaves almost like that. I may be wrong, of course.

andresrcs · March 20, 2019, 1:57pm

I second Anirban's comment. NA stands for Not Available and is the way to represent a blank in R. You can't have columns of different lengths in a data frame or a matrix.
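
To make that concrete, here is a small base R sketch using the DF from the first post. It only illustrates the point about NA and column lengths; the names `z_mean`, `z_only` and `msg` are made up for the example.

```r
DF <- data.frame(x = c(1, 2, 3, 7, 10),
                 y = c(0, 10, 5, 5, 12),
                 z = c(NA, 33, 22, 27, 35))

# NA marks the missing cell; row 1 still keeps its x and y values
is.na(DF$z)

# Many functions can skip NA on request, so it does not have to be removed
z_mean <- mean(DF$z, na.rm = TRUE)

# Dropping the NA from z alone leaves a shorter vector ...
z_only <- DF$z[!is.na(DF$z)]
length(z_only)

# ... which cannot be put back next to the 5-row x and y columns
msg <- tryCatch(
  data.frame(x = DF$x, y = DF$y, z = z_only),
  error = function(e) conditionMessage(e)
)
msg
```

So the NA is the "blank": it keeps the row rectangular while telling R that the value is unknown.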

chris.prener · March 20, 2019, 2:38pm

Can you tell us why you need to "remove" the NA value? Understanding your use case might help us help you better...

It is possible to achieve the "blank" effect with character data, but I would not recommend this as a strategy for dealing with missing data.

Amonda · March 20, 2019, 3:54pm

Because I need to do a 3D PCA. I don't know why, but I have a problem with my NA values.
For example, if I compute a Spearman correlation on a table containing NA values there is no problem, everything works. But when I start doing the PCA, I get an error because of the NA values. That's why I asked if there is a way to remove them, or any other solution.
Apparently, there is a package called missMDA which can handle "PCA with NA", but I have never used it!

Yarnabrina · March 20, 2019, 4:08pm

PCA can take the correlation matrix as an argument. So, if you already have that, say R based on Spearman's correlation, you can try with princomp(covmat = R).
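
A minimal sketch of that idea, assuming the correlation matrix is built from pairwise-complete observations (that choice of `use` is my assumption, not something stated above). Note that a pairwise matrix is not guaranteed to be positive semi-definite, in which case princomp() will refuse it.

```r
DF <- data.frame(x = c(1, 2, 3, 7, 10),
                 y = c(0, 10, 5, 5, 12),
                 z = c(NA, 33, 22, 27, 35))

# Spearman correlation; each pair of columns uses all rows where both are present
R <- cor(DF, method = "spearman", use = "pairwise.complete.obs")

# PCA from the correlation matrix alone.
# No scores are returned, because the raw data are not supplied.
pca <- princomp(covmat = R)
summary(pca)
loadings(pca)
```

This keeps the x and y values of row 1 in the x-y correlation, but you only get loadings and variances, not component scores for the individual rows.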

chris.prener · March 20, 2019, 9:26pm

I can't add anything specific since I don't use PCA, but as a general piece of R advice, I'd encourage you to reframe your question @Amonda. It isn't that you necessarily need to get rid of NA values, but rather that you need to understand how PCA handles missing data and go from there. It seems like you're treating NA values as a nuisance or bug, when they're very much a feature.

It sounds like you've got at least two avenues here so far:

It would be great if you could give both options a try, and depending on what you find, create a reprex (reproducible example) similar to what you had in your initial post. The reprex package would be helpful now that you're adding in potentially other packages, though. You can get some details here on creating reprexs.

andresrcs · March 20, 2019, 9:56pm

I'm not a statistician, so my understanding of PCA is very vague, but as far as I know, when you deal with missing values (NAs) in general, you have two basic options: delete the observation (i.e. the whole row) or impute the missing values. For the latter, here is a nice article with several options for this task, but I can't advise on which is more suitable for PCA.

http://r-statistics.co/Missing-Value-Treatment-With-R.html#3.%20Imputation%20with%20mean%20/%20median%20/%20mode
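
As a rough illustration of the simplest option (median imputation) in base R, followed by a PCA on the filled-in data. The helper `impute_median` is a made-up name for this sketch, and whether a median fill is appropriate for your data is a separate question.

```r
DF <- data.frame(x = c(1, 2, 3, 7, 10),
                 y = c(0, 10, 5, 5, 12),
                 z = c(NA, 33, 22, 27, 35))

# Hypothetical helper: replace the NAs in a vector with the median of its observed values
impute_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

# Apply it column by column; columns without NA are returned unchanged
DF_imputed <- as.data.frame(lapply(DF, impute_median))

# All five rows are now available to the PCA
pca <- prcomp(DF_imputed, scale. = TRUE)
summary(pca)
```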

Amonda · March 22, 2019, 8:41am

Ok, I will try it.
Thanks a lot for your help and opinion.

tfruehbeck · March 24, 2019, 7:08am

What you need is imputation: replacing NA with as little bias as possible.
See the mice package.
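
A minimal sketch of what that looks like, assuming mice is installed. It uses `nhanes`, the small example data set that ships with the package, rather than the DF from this thread, and the values for `m` and `seed` are arbitrary.

```r
library(mice)  # assumes install.packages("mice") has been run

# nhanes is a small example data set with missing values, shipped with mice
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Extract the first of the five imputed data sets
completed <- complete(imp, 1)
anyNA(completed)
```

Each of the five completed data sets is a different plausible fill-in, so it is worth checking how much your PCA changes between them.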

Dalila1 · March 25, 2019, 8:26pm

Here is why you cannot just remove a value from a variable without removing the whole observation where the value is:
PCA is based on linear algebra: it works only with matrices and vectors, i.e. numerical variables. Because you are working with matrices, you can't remove a single value from one variable while keeping the other variables.
Even if a function exists that can deal with missing values for PCA, it will most likely still remove the whole observation in order to decompose your matrix.
Because PCA works with matrices, it assumes that you are providing a filled rectangle with r rows and n columns.
Not knowing your data, I have no opinion on imputing.

Bugs · March 26, 2019, 7:17pm

PCA routines generally either fail or drop the entire row if there is even one missing value. The choice is to impute a value or delete the row. Imputing a single value is generally accepted in a large data set. Imputing multiple values makes people more uneasy. Try missingdata.org, or there are hundreds of sites that you can find by searching for keywords such as imputing, missing, data. If you still need to remove NA, you could convert all data to text, replace NA with a blank or a period, and then convert back to numeric. This is the brute force approach and will work if the people are creative when entering data.

Duster · March 28, 2019, 9:47am

Your trouble is that 'NA' designates a missing value. So, anything you replace it with must either still indicate a missing value, or be a value. You have no real alternatives. So, you can either interpolate an estimated value (say, the column average) or use some more sophisticated approach to interpolating a missing value. R treats NA as a missing value.

Various routines in R deal with NAs in different ways, so your best approach is not to get fussy about the data if it is otherwise correct. Instead look at the commands you plan to use for your PCA. If you are employing prcomp(), look at the "na.action" section in help.
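
For reference, a short sketch of how that looks with the DF from the first post. The `na.action` argument belongs to the formula interface of prcomp(); the object names are made up for the example.

```r
DF <- data.frame(x = c(1, 2, 3, 7, 10),
                 y = c(0, 10, 5, 5, 12),
                 z = c(NA, 33, 22, 27, 35))

# The default (matrix) interface stops with an error when NA is present
default_try <- tryCatch(prcomp(DF), error = function(e) e)

# Formula interface with na.omit: row 1 is dropped before the PCA
pca_omit <- prcomp(~ x + y + z, data = DF, na.action = na.omit, scale. = TRUE)
nrow(pca_omit$x)

# With na.exclude the fit is the same, but the scores are padded back
# to five rows, with NA in the row that had the missing value
pca_excl <- prcomp(~ x + y + z, data = DF, na.action = na.exclude, scale. = TRUE)
pca_excl$x
```

Either way the incomplete row does not contribute to the components; na.exclude only keeps the scores aligned with the original rows.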
