So what are the recommended idiomatic dplyr equivalents of complete.cases() and na.omit()?

database

#1

What are the currently advised idiomatic dplyr equivalents of complete.cases() and na.omit()? I am interested in the case where the data is remote (in a database) and we are using dbplyr and can not use stats::complete.cases() or stats::na.omit().

Thanks!


#2

Hi John,

I believe the equivalents are: filter_all() + any_vars() for complete.cases(), and
all_vars() using is.na for na.omit().


#3

That looks good, thanks.

Expanding it a bit:

filter_all(mtcars, all_vars(!is.na(.))) 
mutate_at(mtcars, colnames(mtcars), funs(if_else(is.na(.),0,.)))

#4

@JohnMount, your mutate_at call could be simplified using mutate_all in this case:

mutate_all(mtcars, funs(if_else(is.na(.), 0, .)))

and if all the variables are not numeric, you could do this:

mutate_if(mtcars, is.numeric, funs(if_else(is.na(.), 0, .)))

#5

So I am now unsure if the code works on databases, or if users would understand the code works on databases. The issue is executing an example issues the warning “Applying predicate on the first 100 rows.” Does this mean the predicate was or was not applied to all row?. If a customer asks why one trusted the result in the face of such a message what can one say?

library("dplyr")
## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
db <- DBI::dbConnect(RSQLite::SQLite(),
                     ":memory:")
table <- dplyr::copy_to(db,
                        data.frame(idx = 1:200,
                                   val = NA_real_))

res <- table %>%
  mutate_if(.,
            is.numeric,
            funs(if_else(is.na(.), 0, .))) %>%
  collect(.)
## Applying predicate on the first 100 rows
head(res)
## # A tibble: 6 x 2
##     idx   val
##   <int> <dbl>
## 1     1     0
## 2     2     0
## 3     3     0
## 4     4     0
## 5     5     0
## 6     6     0
tail(res)
## # A tibble: 6 x 2
##     idx   val
##   <int> <dbl>
## 1   195     0
## 2   196     0
## 3   197     0
## 4   198     0
## 5   199     0
## 6   200     0
DBI::dbDisconnect(db)

#6

I believe this is referring to the predicate is.numeric rather than is.na, so essentially it is making the assumption that if all of the first 100 values are numeric then so are the rest. That is why the results still work. So as long as your predicate is not something that is not true for the whole column you should be fine. One example I could think of is if you are trying to search specifically for columns that contains NA before replacing them and no NA values being in the first 100 rows:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(DBI)

temp <- data.frame(idx = 1:200,val = c(1:100, rep(NA_real_, 100)))
temp2 <- data.frame(idx = 1:200,val = c(1:99, rep(NA_real_, 101)))

db <- DBI::dbConnect(RSQLite::SQLite(),
                     ":memory:")

table <- dplyr::copy_to(db, temp)

table2 <- dplyr::copy_to(db, temp2)

res <- table %>% 
  mutate_if(all_vars(any(is.na(.))), funs(if_else(is.na(.), 0, .))) %>% 
  collect()
#> Applying predicate on the first 100 rows

res2 <- table2 %>% 
  mutate_if(all_vars(any(is.na(.))), funs(if_else(is.na(.), 0, .))) %>% 
  collect()
#> Applying predicate on the first 100 rows

head(res)
#> # A tibble: 6 x 2
#>     idx   val
#>   <int> <dbl>
#> 1     1  1.00
#> 2     2  2.00
#> 3     3  3.00
#> 4     4  4.00
#> 5     5  5.00
#> 6     6  6.00

tail(res)
#> # A tibble: 6 x 2
#>     idx   val
#>   <int> <dbl>
#> 1   195    NA
#> 2   196    NA
#> 3   197    NA
#> 4   198    NA
#> 5   199    NA
#> 6   200    NA

head(res2)
#> # A tibble: 6 x 2
#>     idx   val
#>   <int> <dbl>
#> 1     1  1.00
#> 2     2  2.00
#> 3     3  3.00
#> 4     4  4.00
#> 5     5  5.00
#> 6     6  6.00

tail(res2)
#> # A tibble: 6 x 2
#>     idx   val
#>   <int> <dbl>
#> 1   195     0
#> 2   196     0
#> 3   197     0
#> 4   198     0
#> 5   199     0
#> 6   200     0

Created on 2018-02-22 by the reprex package (v0.2.0).

So to answer your question about confidence, if your predicate function is something that applies to the entire column like is.numeric than you know that a column can’t have (to my knowledge) two different variable types. If you predicate is something that does not have this quality then you should be more cautious.

I remember seeing this explained somewhere, but unfortunately I am drawing a blank on where :man_shrugging: If I think of it, I will post it.

EDIT: I added a second example with NA_real_ value in the 100th row of the table and you can see that the column is correctly changed from NA to 0


#7

If I remember the article about this correctly, this was done so that applying these predicate functions wouldn’t create a huge bottleneck when applied to databases with millions of rows. I am not sure whether there is a way to change this behaviour to apply the predicate function to every row


#8

For some predicate on type-carying databases we should not need to apply to any rows (i.e. the schema details can tell us if a column is numeric).


#9

Back to the complete.cases() question. Any idea how to append a column that has the value of complete.cases() in it? I.e. what is the dplyr version of:

d <- dplyr::starwars
numcols <- colnames(d)[vapply(d, 
                              is.numeric, 
                              logical(1))]
chrcols <- colnames(d)[vapply(d,
                              is.character,
                              logical(1))]
d$is_complete <- complete.cases(d[ , 
                                   c(numcols, chrcols),
                                   drop = FALSE])
dplyr::glimpse(d)

#10

Not an answer, just adding a full reprex of the above example you’ve given:

suppressPackageStartupMessages(library(dplyr))
d <- dplyr::starwars
numcols <- colnames(d)[vapply(d, 
                              is.numeric, 
                              logical(1))]
chrcols <- colnames(d)[vapply(d,
                              is.character,
                              logical(1))]
d$is_complete <- complete.cases(d[ , 
                                   c(numcols, chrcols),
                                   drop = FALSE])
dplyr::glimpse(d)
#> Observations: 87
#> Variables: 14
#> $ name        <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader",...
#> $ height      <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 18...
#> $ mass        <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, ...
#> $ hair_color  <chr> "blond", NA, NA, "none", "brown", "brown, grey", "...
#> $ skin_color  <chr> "fair", "gold", "white, blue", "white", "light", "...
#> $ eye_color   <chr> "blue", "yellow", "red", "yellow", "brown", "blue"...
#> $ birth_year  <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24....
#> $ gender      <chr> "male", NA, NA, "male", "female", "male", "female"...
#> $ homeworld   <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alde...
#> $ species     <chr> "Human", "Droid", "Droid", "Human", "Human", "Huma...
#> $ films       <list> [<"Revenge of the Sith", "Return of the Jedi", "T...
#> $ vehicles    <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>...
#> $ starships   <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adv...
#> $ is_complete <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE,...

Created on 2018-02-23 by the reprex package (v0.2.0).


#11

Thanks, Mara. By the way the your notes on filter and later notes on mutate_if did in fact help me a lot. What I am doing is: for a model that throws an exception if any of the inputs are NA do the following: 1) make all the inputs not NA 2) go back and make the model predictions NA on rows that were altered in step 1. With everybody’s help here I can now do this by adding row-ids and doing a join (not bad!). However, if there is a quick performant way to land a column saying which rows got altered that would be even better (and fun to teach).

So long-winded: thanks!


#12

check out tidyr::drop_na(). I think it covers both complete.cases() and na.omit()


#13

Thanks. I see how tidyr::drop_na() can filter down to complete rows. But I do not see how it makes landing a column indicating which rows are complete (something one can do with complete.cases()) easy.


#14

If the code needs to run on a database, then I would probably build it like this

complete_cases <- db %>% filter_all(any_vars(is.na(.))) %>% pull(rowid)
db %>% mutate(is_complete_case = rowid %in% complete_cases)

Or, depending on your backend and needs, you might use a join instead of pull

complete_cases <- db %>% 
    filter_all(any_vars(is.na(.))) %>% 
    select(rowid) %>% mutate(complete = TRUE)
db %>% 
    left_join(complete_cases) %>% 
    mutate_at(vars(complete), funs(coalesce(., FALSE))