Drop all-NA columns from a dataframe

Andrea · March 5, 2018, 1:36pm

Hi,

I want to drop the columns of a data frame which are completely filled with NA, and keep all the others (including those which have a few NA, but are not all NA). My current solution:

library(purrr)

big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8, NA), 1e6, T), 
                                     sample(250, 1e6, T)), simplify = FALSE)
bd <- do.call(data.frame, big_data)
names(bd) <- paste0('X', seq_len(30))
rm(big_data)

# current solution
index <- map_lgl(bd, ~ all(is.na(.)))
bd_sans_NA_cols <- bd[, index]

It works and it's blazing fast. Is this the way you would do it, or is there a more tidyverse-y way? Note that the solution must be fast because my real use case (which I can't share for IP reasons) is a data frame over 10 Gb big (in memory: weirdly, on disk it's about half as big).

nutterb · March 5, 2018, 2:06pm

This is pretty much identical to how I would do it. Although I'd be more likely to write

bd_sans_NA_cols <- bd[!map_lgl(bd, ~ all(is.na(.)))]

This takes out one line of code (not really a big deal) and using the [ extractor without the comma indexes the object like a list, and will guarantee you get a data frame back. Alternatively, you could use

bd_sans_NA_cols <- bd[, !map_lgl(bd, ~ all(is.na(.))), drop = FALSE]

This guards against getting back a single column and then having it quietly coerced to a vector.

edgararuiz · March 5, 2018, 2:09pm

Hi, try this:

db %>%
  map(~.x) %>%
  discard(~all(is.na(.x))) %>%
  map_df(~.x)

cole · March 5, 2018, 2:10pm

I think you need to reverse it... because at present you are keeping only all-NA columns. So you really should finish with:

bd_sans_NA_cols <- bd[, !index]

In any case, I prefer the dplyr select helpers. They take some finagling to work sometimes if you're not familiar, but they are very readable (and in some cases able to be dispatched against a database!)

library(dplyr)
bd %>% select_if(function(x){!all(is.na(x))})

EDIT: especially with large tables, I think it is worth bd <- as_tibble(bd) to get nicer printing in the console!

EDIT2: Warning, perfectionist at work. Ultimate readability

not_all_na <- function(x) {!all(is.na(x))}
bd %>% select_if(not_all_na)

nutterb · March 5, 2018, 2:26pm

Since you have mentioned your data being large and your impression of "blazing fast," I'll point out that there are sometimes some tradeoffs between speed and clarity. In the recommendations thus far:

library(microbenchmark)
microbenchmark(
  as_written = {
    index <- map_lgl(bd, ~ all(is.na(.)))
    bd[, !index]
  },
  single_line = bd[!map_lgl(bd, ~ all(is.na(.)))],
  edgar = {
    bd %>%
      discard(~all(is.na(.x))) %>%
      map_df(~.x)
  },
  cole = {
    not_all_na <- function(x) {!all(is.na(x))}
    bd %>% select_if(not_all_na)
  }
)

Unit: milliseconds
        expr       min        lq      mean    median        uq      max neval
  as_written  43.76033  45.29519  55.75666  55.64834  59.51291 129.0887   100
 single_line  43.05741  46.01761  56.45448  57.01282  60.15601 133.7674   100
       edgar 121.80319 135.51564 143.75961 140.58312 143.98598 207.4292   100
        cole  86.31394  94.40614 101.40569 100.67682 105.60633 166.4541   100

Even on a 10Gb data frame (the one in your example comes in at 1 Gb on my computer), that difference isn't much. If you get into Tb and Pb area, however, that may start to add up.

edgararuiz · March 5, 2018, 2:30pm

@nutterb

Would you mind trying the benchmark with this new code? I should have remove the first map:

db %>%
  discard(~all(is.na(.x))) %>%
  map_df(~.x)

Andrea · March 5, 2018, 2:32pm

yeah @cole, you're right, since the reprex doesn't include the output in this case I didn't notice that I forgot the !

nutterb · March 5, 2018, 2:33pm

Updated above. [And now I'm filling in 20 characters.]

cole · March 5, 2018, 2:35pm

Thanks for the useful benchmark!!

So that my bias can be made abundantly clear, only one of these can currently be translated naturally into SQL and dispatch against a database to deal with data that doesn't fit into memory (i.e. Tb or Pb scale) But yes, there can definitely be a tradeoff between clarity and speed. I always love seeing how different ways of attacking a problem perform!

Andrea · March 5, 2018, 2:41pm

@nutterb sure, as I said, the data frame I put in the reprex is smaller than my real use case:

Note that the solution must be fast because my real use case (which I can’t share for IP reasons) is a data frame over 10 Gb big (in memory: weirdly, on disk it’s about half as big).

Ok, so my solution is the fastest, but cole solution is the most readable and it can be used for data on a database. Unfortunately data come to me not as a data base connection, but as myriads of csv files (.......). I'll check with the data owners whether these csv are the results of multiple extractions from a data base

Andrea · March 5, 2018, 2:44pm

@cole I choose you! Your code comes second as speed after mine, but first as readability and future usefulness.

hughparsonage · March 7, 2018, 10:13am

Late to the party, but there's

library(hutils)
drop_empty_cols(bd)

Comparable speed when bd is a data.frame, but much faster when bd is a data.table:

Unit: milliseconds
                 expr     min       lq     mean   median       uq      max neval cld
  drop_empty_cols(bd) 23.5529 24.47631 26.36756 25.55753 26.32990 44.02416   100  a 
       as_written(bd) 41.6780 43.08841 45.37851 44.33739 45.94657 62.48236   100   b

Andrea · March 7, 2018, 10:27am

I don't like data.table API so I don't use it, but it's still nice to know. Thanks!

benjamin.gowan · March 12, 2018, 4:38am

I'm also late to the party, but for what its worth the nifty janitor package also has a remove_empty_cols() function as well.

Cheers,
Ben

Andrea · March 12, 2018, 8:54am

Very nice, I like janitor but I forgot about remove_empty_cols(). janitor is indeed tideverse-y, so this solution is probably on-par with @cole's - I don't know about timings though.

t-kalinowski · March 19, 2018, 11:52am

So long as you're not unnecessarily making extra copies, a simple for loop can often times be just as fast.

For example:

drop_empty_cols_for_loop <- function(df) {
  for (nm in names(df))
    if(all(is.na(df[[nm]])))
      df[[nm]] <- NULL
  df
}

Unit: milliseconds
        expr       min        lq      mean    median        uq      max neval cld
  as_written  58.80964  79.04760  91.39597  90.13768  98.80640 165.4905   100  a 
 single_line  58.31732  82.85494  92.44658  90.55527 103.38178 141.1990   100  a 
       edgar 178.98066 201.05968 217.73046 216.26938 232.51264 290.2671   100   b
        cole  69.54099  82.84602  94.62421  93.20986 105.09101 144.8547   100  a 
      tomasz  67.37430  82.23327  92.41387  90.74170  97.82132 154.3680   100  a

If you're interested, I have this and a few other related functions in my personal package.

pjperez · April 29, 2019, 9:44am

I used your solution and it worked, but apparently the janitor package is going to deprecate remove_empty_cols(). Right now the prefered function is: df_ok <- janitor::remove_empty(df, which = c("cols"))

Thanks Andrea and to all of you