replacing NA with for loop vs dplyr

sbl_bah · February 26, 2019, 12:04pm

Hi,

I work a dataset with 1 500 00 rows and I used the old method to delete NA : I use replace NA by 0 + loop for. It's too long (very very !!)
I would like to know if I can replace this method by a function from dplyr.

description :
If I have NA in 4 columns I have put 0 to replace NA of each cells
If I have NA in 2 columns (e.g X2,Y2) I have to put 0 in cells and put also to 0 in (X1,Y1)
If (X1,Y1) and (X2,Y2) different of NA I keep the values.

Thanks in advance to your help !

tab.na<-replace(tab,is.na(tab),0)
for (i in 1:nrow(tab.na)){
  if(tab.na$X1[i]!=0 | tab.na$Y1[i]!=0 | tab.na$X2[i]!=0 | tab.na$Y2[i]!=0)
    {
    tab.na$X1[i] <-0
    tab.na$Y1[i] <- 0
    tab.na$X2[i] <- 0
    tab.na$Y2[i] <- 0
    }
}

|X1     Y1      |X2|       Y2|
|NA   |NA     |NA|      NA|
|NA|   NA|     NA|      NA|
|NA   |NA|      NA      |NA|
|6002 |6002    |NA|      NA|
|5000 |500|    4000    |5440|
|5001 |501|    4001    |5441|
|5002 |502|    4002|    5442|
|5003 |503    |4003|    5443|
|5004 |504|    4004|    5444|

mara · February 26, 2019, 12:09pm

Some nice options in the thread here using a combination of tidyr::replace_na(), scoped dplyr verbs (e.g. mutate_at()), as well as some base Ｒ options:

sbl_bah · February 26, 2019, 7:06pm

I'm sorry for the late answer.

1.I tried mutate_all(funs(replace_na(., 0))) and it's OK (It's going really fast).
2. for mutate_if it's not the same thing
Example :
_ I want to put 0 in X1 and Y1 column when there are 0 in X2 and Y2 column
_ I want to put 0 in X2 and Y2 column when there are 0 in X1 and Y1 column
_ When the colunm X1,Y1, X2 and Y2 has fill, we don't change anything.
Is it possible ?

|X1      Y1       |X2|       Y2|
6002    6002       0           0
 0       0        |6002    |6002
|5000  |500|       4000    |5440|

mara · February 26, 2019, 7:26pm

Could you please turn this into a self-contained reprex (short for reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

install.packages("reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

What to do if you run into clipboard problems

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ.

sbl_bah · February 26, 2019, 8:05pm

you will find hereafter the reprex

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

data<-tab %>%
      select(X1,Y1,X2,Y2)
#> Error in eval(lhs, parent, parent): objet 'tab' introuvable

tab.na<-data%>%
  mutate_all(.,funs(replace_na(., 0))) %>%
  mutate_if(is.numeric,funs(replace_na(., 0)))
#> Error in UseMethod("tbl_vars"): pas de méthode pour 'tbl_vars' applicable pour un objet de classe "function"

mara · February 26, 2019, 8:09pm

You need to include your data for it to be reproducible by others. That's why you're getting errors right away— tab doesn't exist.

This post might help:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

sbl_bah · February 26, 2019, 8:40pm

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
  library(tidyr)

  df <- data.frame(stringsAsFactors = FALSE,
                   X1 = c(NA,NA,1004,NA),
                   Y1= c(NA,NA,100,100),
                   X2 = c(NA,50,100,100),
                   Y2 = c(100,100,200,300)
                  )
  tab.na<-df%>%
    mutate_all(.,funs(replace_na(., 0))) %>%
    mutate_if(is.numeric,funs(replace_na(., 0)))

You can see that there are no error but the result is not OK because
I don't want a change to line 3
I want zeros everywhere else.

andresrcs · February 26, 2019, 9:28pm

If I understand you correctly this is what you want

library(dplyr)
df <- data.frame(stringsAsFactors = FALSE,
                 X1 = c(NA,NA,1004,NA),
                 Y1= c(NA,NA,100,100),
                 X2 = c(NA,50,100,100),
                 Y2 = c(100,100,200,300)
)
df%>%
    mutate_all(~case_when(
        rowSums(is.na(df)) > 0 ~ 0,
        TRUE ~ .
    ))
#>     X1  Y1  X2  Y2
#> 1    0   0   0   0
#> 2    0   0   0   0
#> 3 1004 100 100 200
#> 4    0   0   0   0

sbl_bah · February 26, 2019, 9:42pm

Wonderful, it's exactly what I want !!!!
Thank you!
It's much better than my old loop for.

andresrcs · February 26, 2019, 9:43pm

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

sbl_bah · February 26, 2019, 10:24pm

I would like if I can apply ythis method if I add one column.
I can no longer use mutate_all because I want any a change in first column.

  df <- data.frame(stringsAsFactors = FALSE,
                   A = c("Terre","Env","Soleil","Lune"),
                   X1 = c(NA,NA,1004,NA),
                   Y1= c(NA,NA,100,100),
                   X2 = c(NA,50,100,100),
                   Y2 = c(100,100,200,300)
                  )

andresrcs · February 26, 2019, 10:50pm

You can use mutate_if() or mutate_at()

library(dplyr)
df <- data.frame(stringsAsFactors = FALSE,
                 A = c("Terre","Env","Soleil","Lune"),
                 X1 = c(NA,NA,1004,NA),
                 Y1= c(NA,NA,100,100),
                 X2 = c(NA,50,100,100),
                 Y2 = c(100,100,200,300)
)
df %>%
    mutate_if(is.numeric, ~case_when(
        rowSums(is.na(df)) > 0 ~ 0,
        TRUE ~ .
    ))
#>        A   X1  Y1  X2  Y2
#> 1  Terre    0   0   0   0
#> 2    Env    0   0   0   0
#> 3 Soleil 1004 100 100 200
#> 4   Lune    0   0   0   0

df %>%
    mutate_at(vars(X1:Y2), ~case_when(
        rowSums(is.na(df)) > 0 ~ 0,
        TRUE ~ .
    ))
#>        A   X1  Y1  X2  Y2
#> 1  Terre    0   0   0   0
#> 2    Env    0   0   0   0
#> 3 Soleil 1004 100 100 200
#> 4   Lune    0   0   0   0

sbl_bah · February 27, 2019, 6:34am

Thanks to your help !

system · March 6, 2019, 6:34am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.