So what are the recommended idiomatic dplyr equivalents of complete.cases() and na.omit()?

JohnMount · February 22, 2018, 12:02am

What are the currently advised idiomatic dplyr equivalents of complete.cases() and na.omit()? I am interested in the case where the data is remote (in a database) and we are using dbplyr and can not use stats::complete.cases() or stats::na.omit().

Thanks!

mara · February 22, 2018, 12:58pm

Hi John,

I believe the equivalents are: filter_all() + any_vars() for complete.cases(), and
all_vars() using is.na for na.omit().

JohnMount · February 22, 2018, 1:31pm

That looks good, thanks.

Expanding it a bit:

filter_all(mtcars, all_vars(!is.na(.))) 
mutate_at(mtcars, colnames(mtcars), funs(if_else(is.na(.),0,.)))

tbradley · February 22, 2018, 2:26pm

@JohnMount, your mutate_at call could be simplified using mutate_all in this case:

mutate_all(mtcars, funs(if_else(is.na(.), 0, .)))

and if all the variables are not numeric, you could do this:

mutate_if(mtcars, is.numeric, funs(if_else(is.na(.), 0, .)))

JohnMount · February 22, 2018, 3:30pm

So I am now unsure if the code works on databases, or if users would understand the code works on databases. The issue is executing an example issues the warning "Applying predicate on the first 100 rows." Does this mean the predicate was or was not applied to all row?. If a customer asks why one trusted the result in the face of such a message what can one say?

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

db <- DBI::dbConnect(RSQLite::SQLite(),
                     ":memory:")
table <- dplyr::copy_to(db,
                        data.frame(idx = 1:200,
                                   val = NA_real_))

res <- table %>%
  mutate_if(.,
            is.numeric,
            funs(if_else(is.na(.), 0, .))) %>%
  collect(.)

## Applying predicate on the first 100 rows

head(res)

## # A tibble: 6 x 2
##     idx   val
##   <int> <dbl>
## 1     1     0
## 2     2     0
## 3     3     0
## 4     4     0
## 5     5     0
## 6     6     0

tail(res)

## # A tibble: 6 x 2
##     idx   val
##   <int> <dbl>
## 1   195     0
## 2   196     0
## 3   197     0
## 4   198     0
## 5   199     0
## 6   200     0

DBI::dbDisconnect(db)

tbradley · February 22, 2018, 3:51pm

I believe this is referring to the predicate is.numeric rather than is.na, so essentially it is making the assumption that if all of the first 100 values are numeric then so are the rest. That is why the results still work. So as long as your predicate is not something that is not true for the whole column you should be fine. One example I could think of is if you are trying to search specifically for columns that contains NA before replacing them and no NA values being in the first 100 rows:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(DBI)

temp <- data.frame(idx = 1:200,val = c(1:100, rep(NA_real_, 100)))
temp2 <- data.frame(idx = 1:200,val = c(1:99, rep(NA_real_, 101)))

db <- DBI::dbConnect(RSQLite::SQLite(),
                     ":memory:")

table <- dplyr::copy_to(db, temp)

table2 <- dplyr::copy_to(db, temp2)

res <- table %>% 
  mutate_if(all_vars(any(is.na(.))), funs(if_else(is.na(.), 0, .))) %>% 
  collect()
#> Applying predicate on the first 100 rows

res2 <- table2 %>% 
  mutate_if(all_vars(any(is.na(.))), funs(if_else(is.na(.), 0, .))) %>% 
  collect()
#> Applying predicate on the first 100 rows

head(res)
#> # A tibble: 6 x 2
#>     idx   val
#>   <int> <dbl>
#> 1     1  1.00
#> 2     2  2.00
#> 3     3  3.00
#> 4     4  4.00
#> 5     5  5.00
#> 6     6  6.00

tail(res)
#> # A tibble: 6 x 2
#>     idx   val
#>   <int> <dbl>
#> 1   195    NA
#> 2   196    NA
#> 3   197    NA
#> 4   198    NA
#> 5   199    NA
#> 6   200    NA

head(res2)
#> # A tibble: 6 x 2
#>     idx   val
#>   <int> <dbl>
#> 1     1  1.00
#> 2     2  2.00
#> 3     3  3.00
#> 4     4  4.00
#> 5     5  5.00
#> 6     6  6.00

tail(res2)
#> # A tibble: 6 x 2
#>     idx   val
#>   <int> <dbl>
#> 1   195     0
#> 2   196     0
#> 3   197     0
#> 4   198     0
#> 5   199     0
#> 6   200     0

Created on 2018-02-22 by the reprex package (v0.2.0).

So to answer your question about confidence, if your predicate function is something that applies to the entire column like is.numeric than you know that a column can't have (to my knowledge) two different variable types. If you predicate is something that does not have this quality then you should be more cautious.

I remember seeing this explained somewhere, but unfortunately I am drawing a blank on where If I think of it, I will post it.

EDIT: I added a second example with NA_real_ value in the 100th row of the table and you can see that the column is correctly changed from NA to 0

tbradley · February 22, 2018, 3:59pm

If I remember the article about this correctly, this was done so that applying these predicate functions wouldn't create a huge bottleneck when applied to databases with millions of rows. I am not sure whether there is a way to change this behaviour to apply the predicate function to every row

JohnMount · February 22, 2018, 4:18pm

For some predicate on type-carying databases we should not need to apply to any rows (i.e. the schema details can tell us if a column is numeric).

JohnMount · February 22, 2018, 10:49pm

Back to the complete.cases() question. Any idea how to append a column that has the value of complete.cases() in it? I.e. what is the dplyr version of:

d <- dplyr::starwars
numcols <- colnames(d)[vapply(d, 
                              is.numeric, 
                              logical(1))]
chrcols <- colnames(d)[vapply(d,
                              is.character,
                              logical(1))]
d$is_complete <- complete.cases(d[ , 
                                   c(numcols, chrcols),
                                   drop = FALSE])
dplyr::glimpse(d)

mara · February 23, 2018, 2:35pm

Not an answer, just adding a full reprex of the above example you've given:

suppressPackageStartupMessages(library(dplyr))
d <- dplyr::starwars
numcols <- colnames(d)[vapply(d, 
                              is.numeric, 
                              logical(1))]
chrcols <- colnames(d)[vapply(d,
                              is.character,
                              logical(1))]
d$is_complete <- complete.cases(d[ , 
                                   c(numcols, chrcols),
                                   drop = FALSE])
dplyr::glimpse(d)
#> Observations: 87
#> Variables: 14
#> $ name        <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader",...
#> $ height      <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 18...
#> $ mass        <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, ...
#> $ hair_color  <chr> "blond", NA, NA, "none", "brown", "brown, grey", "...
#> $ skin_color  <chr> "fair", "gold", "white, blue", "white", "light", "...
#> $ eye_color   <chr> "blue", "yellow", "red", "yellow", "brown", "blue"...
#> $ birth_year  <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24....
#> $ gender      <chr> "male", NA, NA, "male", "female", "male", "female"...
#> $ homeworld   <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alde...
#> $ species     <chr> "Human", "Droid", "Droid", "Human", "Human", "Huma...
#> $ films       <list> [<"Revenge of the Sith", "Return of the Jedi", "T...
#> $ vehicles    <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>...
#> $ starships   <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adv...
#> $ is_complete <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE,...

Created on 2018-02-23 by the reprex package (v0.2.0).

JohnMount · February 23, 2018, 3:21pm

Thanks, Mara. By the way the your notes on filter and later notes on mutate_if did in fact help me a lot. What I am doing is: for a model that throws an exception if any of the inputs are NA do the following: 1) make all the inputs not NA 2) go back and make the model predictions NA on rows that were altered in step 1. With everybody's help here I can now do this by adding row-ids and doing a join (not bad!). However, if there is a quick performant way to land a column saying which rows got altered that would be even better (and fun to teach).

So long-winded: thanks!

t-kalinowski · March 19, 2018, 9:48pm

check out tidyr::drop_na(). I think it covers both complete.cases() and na.omit()

JohnMount · March 20, 2018, 2:49pm

Thanks. I see how tidyr::drop_na() can filter down to complete rows. But I do not see how it makes landing a column indicating which rows are complete (something one can do with complete.cases()) easy.

t-kalinowski · March 23, 2018, 2:32am

If the code needs to run on a database, then I would probably build it like this

complete_cases <- db %>% filter_all(any_vars(is.na(.))) %>% pull(rowid)
db %>% mutate(is_complete_case = rowid %in% complete_cases)

Or, depending on your backend and needs, you might use a join instead of pull

complete_cases <- db %>% 
    filter_all(any_vars(is.na(.))) %>% 
    select(rowid) %>% mutate(complete = TRUE)
db %>% 
    left_join(complete_cases) %>% 
    mutate_at(vars(complete), funs(coalesce(., FALSE))