replacing NA with for loop vs dplyr

Hi,

I work a dataset with 1 500 00 rows and I used the old method to delete NA : I use replace NA by 0 + loop for. It's too long (very very !!)
I would like to know if I can replace this method by a function from dplyr.

description :
If I have NA in 4 columns I have put 0 to replace NA of each cells
If I have NA in 2 columns (e.g X2,Y2) I have to put 0 in cells and put also to 0 in (X1,Y1)
If (X1,Y1) and (X2,Y2) different of NA I keep the values.

Thanks in advance to your help !

tab.na<-replace(tab,is.na(tab),0)
for (i in 1:nrow(tab.na)){
  if(tab.na$X1[i]!=0 | tab.na$Y1[i]!=0 | tab.na$X2[i]!=0 | tab.na$Y2[i]!=0)
    {
    tab.na$X1[i] <-0
    tab.na$Y1[i] <- 0
    tab.na$X2[i] <- 0
    tab.na$Y2[i] <- 0
    }
}
|X1     Y1      |X2|       Y2|
|NA   |NA     |NA|      NA|
|NA|   NA|     NA|      NA|
|NA   |NA|      NA      |NA|
|6002 |6002    |NA|      NA|
|5000 |500|    4000    |5440|
|5001 |501|    4001    |5441|
|5002 |502|    4002|    5442|
|5003 |503    |4003|    5443|
|5004 |504|    4004|    5444|


Some nice options in the thread here using a combination of tidyr::replace_na(), scoped dplyr verbs (e.g. mutate_at()), as well as some base R options:

2 Likes

I'm sorry for the late answer.

1.I tried mutate_all(funs(replace_na(., 0))) and it's OK (It's going really fast).
2. for mutate_if it's not the same thing
Example :
_ I want to put 0 in X1 and Y1 column when there are 0 in X2 and Y2 column
_ I want to put 0 in X2 and Y2 column when there are 0 in X1 and Y1 column
_ When the colunm X1,Y1, X2 and Y2 has fill, we don't change anything.
Is it possible ?

|X1      Y1       |X2|       Y2|
6002    6002       0           0
 0       0        |6002    |6002
|5000  |500|       4000    |5440|

Could you please turn this into a self-contained reprex (short for reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

install.packages("reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

What to do if you run into clipboard problems

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ.

you will find hereafter the reprex

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

data<-tab %>%
      select(X1,Y1,X2,Y2)
#> Error in eval(lhs, parent, parent): objet 'tab' introuvable

tab.na<-data%>%
  mutate_all(.,funs(replace_na(., 0))) %>%
  mutate_if(is.numeric,funs(replace_na(., 0)))
#> Error in UseMethod("tbl_vars"): pas de méthode pour 'tbl_vars' applicable pour un objet de classe "function"

You need to include your data for it to be reproducible by others. That's why you're getting errors right away— tab doesn't exist.

This post might help:

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
  library(tidyr)

  df <- data.frame(stringsAsFactors = FALSE,
                   X1 = c(NA,NA,1004,NA),
                   Y1= c(NA,NA,100,100),
                   X2 = c(NA,50,100,100),
                   Y2 = c(100,100,200,300)
                  )
  tab.na<-df%>%
    mutate_all(.,funs(replace_na(., 0))) %>%
    mutate_if(is.numeric,funs(replace_na(., 0)))

You can see that there are no error but the result is not OK because
I don't want a change to line 3
I want zeros everywhere else.

If I understand you correctly this is what you want

library(dplyr)
df <- data.frame(stringsAsFactors = FALSE,
                 X1 = c(NA,NA,1004,NA),
                 Y1= c(NA,NA,100,100),
                 X2 = c(NA,50,100,100),
                 Y2 = c(100,100,200,300)
)
df%>%
    mutate_all(~case_when(
        rowSums(is.na(df)) > 0 ~ 0,
        TRUE ~ .
    ))
#>     X1  Y1  X2  Y2
#> 1    0   0   0   0
#> 2    0   0   0   0
#> 3 1004 100 100 200
#> 4    0   0   0   0

Wonderful, it's exactly what I want !!!!
Thank you!
It's much better than my old loop for. :sunglasses:

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

I would like if I can apply ythis method if I add one column.
I can no longer use mutate_all because I want any a change in first column.

  df <- data.frame(stringsAsFactors = FALSE,
                   A = c("Terre","Env","Soleil","Lune"),
                   X1 = c(NA,NA,1004,NA),
                   Y1= c(NA,NA,100,100),
                   X2 = c(NA,50,100,100),
                   Y2 = c(100,100,200,300)
                  )

You can use mutate_if() or mutate_at()

library(dplyr)
df <- data.frame(stringsAsFactors = FALSE,
                 A = c("Terre","Env","Soleil","Lune"),
                 X1 = c(NA,NA,1004,NA),
                 Y1= c(NA,NA,100,100),
                 X2 = c(NA,50,100,100),
                 Y2 = c(100,100,200,300)
)
df %>%
    mutate_if(is.numeric, ~case_when(
        rowSums(is.na(df)) > 0 ~ 0,
        TRUE ~ .
    ))
#>        A   X1  Y1  X2  Y2
#> 1  Terre    0   0   0   0
#> 2    Env    0   0   0   0
#> 3 Soleil 1004 100 100 200
#> 4   Lune    0   0   0   0

df %>%
    mutate_at(vars(X1:Y2), ~case_when(
        rowSums(is.na(df)) > 0 ~ 0,
        TRUE ~ .
    ))
#>        A   X1  Y1  X2  Y2
#> 1  Terre    0   0   0   0
#> 2    Env    0   0   0   0
#> 3 Soleil 1004 100 100 200
#> 4   Lune    0   0   0   0

Thanks to your help !

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.