I want to drop the columns of a data frame which are completely filled with NA, and keep all the others (including those which have a few NA, but are not all NA). My current solution:
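The original code block isn't shown here; judging from the discussion below, it was presumably a purrr-based approach along these lines (with `bd` as the data frame):

```r
library(purrr)

# Flag the columns that are entirely NA...
index <- map_lgl(bd, ~ all(is.na(.)))

# ...then keep every other column.
bd_sans_NA_cols <- bd[, !index]
```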
It works and it's blazing fast. Is this the way you would do it, or is there a more tidyverse-y way? Note that the solution must be fast, because my real use case (which I can't share for IP reasons) is a data frame over 10 GB in memory (weirdly, on disk it's about half as big).
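The one-liner being referred to below isn't reproduced in the thread; presumably it was something like this, dropping the intermediate index variable and using the list-style `[` extractor:

```r
bd_sans_NA_cols <- bd[!map_lgl(bd, ~ all(is.na(.)))]
```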
This takes out one line of code (not really a big deal), and using the `[` extractor without the comma indexes the object like a list, which guarantees you get a data frame back. Alternatively, you could use
```r
bd_sans_NA_cols <- bd[, !map_lgl(bd, ~ all(is.na(.))), drop = FALSE]
```
This guards against getting back a single column and then having it quietly coerced to a vector.
I think you need to reverse it... because at present you are keeping only all-NA columns. So you really should finish with:
```r
bd_sans_NA_cols <- bd[, !index]
```
In any case, I prefer the dplyr select helpers. They can take some finagling if you're not familiar with them, but they are very readable (and in some cases can be dispatched against a database!)
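The select-helper code itself isn't shown in this excerpt; with current dplyr it would look something like this (`where()` needs dplyr >= 1.0.0; older versions used `select_if()`):

```r
library(dplyr)

# Keep only the columns that are not entirely NA.
bd_sans_NA_cols <- bd %>%
  select(where(~ !all(is.na(.x))))

# Pre-1.0.0 equivalent:
# bd_sans_NA_cols <- select_if(bd, ~ !all(is.na(.)))
```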
Since you have mentioned your data being large and your impression of "blazing fast," I'll point out that there are sometimes tradeoffs between speed and clarity. Among the recommendations thus far:
```
Unit: milliseconds
        expr       min        lq      mean    median        uq      max neval
  as_written  43.76033  45.29519  55.75666  55.64834  59.51291 129.0887   100
 single_line  43.05741  46.01761  56.45448  57.01282  60.15601 133.7674   100
       edgar 121.80319 135.51564 143.75961 140.58312 143.98598 207.4292   100
        cole  86.31394  94.40614 101.40569 100.67682 105.60633 166.4541   100
```
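The output above is from microbenchmark; a sketch of how such a comparison could be set up follows (the function bodies are reconstructions from the posts above, not the posters' exact code, and the edgar variant isn't shown in this excerpt):

```r
library(microbenchmark)
library(purrr)
library(dplyr)

# Reconstructed candidates, named after the expr column above:
as_written  <- function(bd) { index <- map_lgl(bd, ~ all(is.na(.))); bd[, !index] }
single_line <- function(bd) bd[!map_lgl(bd, ~ all(is.na(.)))]
cole        <- function(bd) select(bd, where(~ !all(is.na(.x))))

microbenchmark(
  as_written  = as_written(bd),
  single_line = single_line(bd),
  cole        = cole(bd),
  times = 100
)
```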
Even on a 10 GB data frame (the one in your example comes in at 1 GB on my computer), that difference isn't much. If you get into TB and PB territory, however, it may start to add up.
So that my bias can be made abundantly clear: only one of these can currently be translated naturally into SQL and dispatched against a database to deal with data that doesn't fit into memory (i.e., TB or PB scale). But yes, there can definitely be a tradeoff between clarity and speed. I always love seeing how different ways of attacking a problem perform!
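A predicate like `all(is.na(.))` has to inspect the data, so on a database you would typically compute per-column missingness server-side first. A rough dbplyr sketch, assuming a hypothetical DBI connection `con` and table name "bd", and dbplyr >= 2.0 for `across()`:

```r
library(DBI)
library(dplyr)
library(dbplyr)

tbl_bd <- tbl(con, "bd")  # con is a hypothetical DBI connection

# Count non-missing values per column on the database side...
non_na <- tbl_bd %>%
  summarise(across(everything(), ~ sum(ifelse(is.na(.x), 0L, 1L), na.rm = TRUE))) %>%
  collect()

# ...then keep only the columns that have at least one non-NA value.
keep <- names(non_na)[unlist(non_na) > 0]
bd_sans_NA_cols <- tbl_bd %>% select(all_of(keep))
```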
@nutterb sure, as I said, the data frame I put in the reprex is smaller than my real use case:
> Note that the solution must be fast, because my real use case (which I can't share for IP reasons) is a data frame over 10 GB in memory (weirdly, on disk it's about half as big).
OK, so my solution is the fastest, but cole's solution is the most readable, and it can be used for data in a database. Unfortunately, the data come to me not as a database connection but as myriads of CSV files (.......). I'll check with the data owners whether these CSVs are the result of multiple extractions from a database.
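For what it's worth, one sketch of stitching many CSVs together before dropping the all-NA columns (the directory, file pattern, and reliance on read_csv defaults are all assumptions):

```r
library(purrr)
library(readr)

files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
bd <- map_dfr(files, read_csv)  # row-bind every file into one data frame
```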
Comparable speed when bd is a data.frame, but much faster when bd is a data.table:
```
Unit: milliseconds
                expr     min       lq     mean   median       uq      max neval cld
 drop_empty_cols(bd) 23.5529 24.47631 26.36756 25.55753 26.32990 44.02416   100  a
      as_written(bd) 41.6780 43.08841 45.37851 44.33739 45.94657 62.48236   100   b
```
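The body of drop_empty_cols() isn't reproduced above; one illustrative data.table-friendly possibility (not necessarily the poster's actual code) is:

```r
library(data.table)

# Keep columns with at least one non-NA value; handles both
# data.frame and data.table inputs. Illustrative only.
drop_empty_cols <- function(dt) {
  keep <- vapply(dt, function(col) !all(is.na(col)), logical(1))
  if (is.data.table(dt)) dt[, which(keep), with = FALSE] else dt[keep]
}
```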
Very nice, I like janitor but I forgot about remove_empty_cols(). janitor is indeed tidyverse-y, so this solution is probably on par with @cole's; I don't know about timings, though.
I used your solution and it worked, but apparently the janitor package is going to deprecate remove_empty_cols(). Right now the preferred function is:

```r
df_ok <- janitor::remove_empty(df, which = c("cols"))
```