Summarizing all columns of a dataframe with multiple functions -- code cleanup help request

jchou · April 12, 2018, 2:22pm

I would like to take a dataframe (or tibble) with columns of multiple types and apply an arbitrary number of functions to each column, returning a tibble where each row corresponds to a column in the original dataframe, and each column is the result of a different summarizing function.

Having the final output as a tibble allows easy identification of columns in the original input that meet particular criteria; e.g., all columns in the dataframe input with greater than a certain proportion of NA.
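To illustrate that kind of filtering, here is a minimal base R sketch. The data, the column names, and the 0.25 threshold are all made up for the example; only the idea of one row per original column comes from the description above.

```r
# Hypothetical data: column names and values are illustrative only.
df <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# One row per original column, with the fraction of missing values.
na_summary <- data.frame(
  column  = names(df),
  na_frac = vapply(df, function(v) mean(is.na(v)), numeric(1)),
  stringsAsFactors = FALSE
)

# Names of the columns where more than 25% of the values are NA
# (the threshold is an arbitrary choice for this sketch).
high_na <- na_summary$column[na_summary$na_frac > 0.25]
```

Because the summary is an ordinary table, the same subsetting works for any other summary column, such as the type or the number of unique values.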

The following code does work, but I would describe it as having not just code smell but more like code reek. Any suggestions on cleaning it up? I like working in the tidyverse, but base R would also be acceptable.

library(tidyverse) # dplyr for funs(), tibble for rownames_to_column(), tidyr for unnest()

num_unique = function(v) { length(unique(v)) }

fxns = funs( # use funs so can use '.' and include additional arguments, like na.rm = TRUE
  typeof,
  num_unique, # can't define a named function in here; need to define outside of funs()
  mean(., na.rm = TRUE),
  na_frac = mean(is.na(.)),
  na_num = sum(is.na(.))
)

x <- tibble(
  ints = 1:10,
  char = letters[1:10],
  fac = as.factor(letters[1:10]),
  lgl = c(rep(TRUE, 5), rep(FALSE, 5))
  )
x[5, ] <- NA # to test that na.rm is working
x

#suppressWarnings, so that applying numeric functions to non-numeric columns is quieter
suppressWarnings(sapply(fxns, function(fn) {x %>% summarise_all(fn)})) %>% # returns list, with dim and dimnames
  as.data.frame %>% # convert list to dataframe; if convert to tibble, lose rownames
  rownames_to_column(var = 'column') %>% # requires a dataframe
  as.tibble() %>% # have column of rownames now, but each column is still a list-column
  unnest() # simplify list-columns back to vectors

Thus, the input x is:

## # A tibble: 10 x 4
##     ints char  fac   lgl  
##    <int> <chr> <fct> <lgl>
##  1     1 a     a     TRUE 
##  2     2 b     b     TRUE 
##  3     3 c     c     TRUE 
##  4     4 d     d     TRUE 
##  5    NA <NA>  <NA>  NA   
##  6     6 f     f     FALSE
##  7     7 g     g     FALSE
##  8     8 h     h     FALSE
##  9     9 i     i     FALSE
## 10    10 j     j     FALSE

... and the output is:

## # A tibble: 4 x 6
##   column typeof    num_unique   mean na_frac na_num
##   <chr>  <chr>          <int>  <dbl>   <dbl>  <int>
## 1 ints   integer           10  5.56    0.100      1
## 2 char   character         10 NA       0.100      1
## 3 fac    integer           10 NA       0.100      1
## 4 lgl    logical            3  0.444   0.100      1

Things I'd like to clean up, if possible:

- be able to define the function within the funs() list of functions
- not have to go through the convoluted transformations from:
  - an sapply output of a list with dim and dimnames attributes,
  - to a data.frame (with all columns as list-columns; can't use a tibble, as it would lose the rownames)
  - to a tibble (so I can more easily simplify the list-columns, now with a column made from the rownames)
  - to a tibble, simplified back to normal vectors for columns

I guess I'm happy that this Frankenstein code works, but can't help but think there should be a more elegant way to do it.
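One possible direction for the first wish, sketched in base R rather than with funs(): keep the summarizing functions in a plain named list, where anonymous functions can be defined inline. This is only a rough sketch under assumptions. The small data frame and the names fxns, cols, and summary_df are illustrative, and mean is left out because it would need a type guard for non-numeric columns.

```r
# A named list of summarizing functions; anonymous functions can be
# defined inline, which funs() did not allow in the code above.
fxns <- list(
  typeof     = typeof,
  num_unique = function(v) length(unique(v)),
  na_frac    = function(v) mean(is.na(v)),
  na_num     = function(v) sum(is.na(v))
)

# Hypothetical input, kept small for illustration.
df <- data.frame(
  ints = c(1L, 2L, NA),
  char = c("a", NA, "a"),
  stringsAsFactors = FALSE
)

# For each function, apply it to every column; each result is a vector
# with one element per original column.
cols <- lapply(fxns, function(f) unname(sapply(df, f)))

# Assemble one row per original column, one column per function.
summary_df <- data.frame(column = names(df), cols, stringsAsFactors = FALSE)
```

Each function here returns a single value per column, so sapply() simplifies every result to a plain vector and no list-columns appear.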

lbusett · April 12, 2018, 7:54pm

You may consider having a look at the skimr package:

https://cran.r-project.org/web/packages/skimr/vignettes/Using_skimr.html

jchou · April 12, 2018, 8:25pm

Thank you! That's a nice package -- far more robust than anything I would have implemented.

skim_to_wide() creates the output type I want, and skim_with() allows addition of customized functions.

Although it looks like skimr will be what I'll actually use, I'd still like to try to improve my own coding. I modified my code to the following, but I'm not sure whether it's 'better' (slightly tidier) or just 'different'.

summary_plus <- function(df) {
  # appears that factors get clobbered into integers; may want to coerce factors back to strings
  num_unique = function(column) { length(unique(column)) }
  
  fxns = funs( # use funs so can use '.' and include additional arguments, like na.rm = TRUE
    typeof,
    num_unique, # can't define a named function in here; need to define outside of funs()
    mean(., na.rm = TRUE),
    na_frac = mean(is.na(.)),
    na_num = sum(is.na(.))
  )
  
  suppressWarnings( map(fxns, function(fxn) { df %>% summarise_all(fxn) }) ) %>% # list by function of 1 x col's tibbles
    map(t) %>% # list by function of col's x 1 of ?matrices
    map(as.vector) %>% # needed for as.tibble() to convert without error 'must be 1d atomic vectors or lists'
    as.tibble() %>% # tibble, but missing column of original column names
    bind_cols(list(columns = names(df)), .) # add column names into first position
}

markdly · April 13, 2018, 5:56pm

In the interest of refining the code, perhaps this is a good opportunity to use imap_dfr from the purrr package.

For this to work, the summary_plus function takes a vector v and its name name_v as arguments and returns a one-row tibble. imap_dfr calls it once per column, passing the column and the column name, and row-binds the results.

library(tidyverse)
x <- tibble(
  ints = 1:10, 
  char = letters[1:10], 
  fac = as.factor(letters[1:10]),
  lgl = c(rep(TRUE, 5), rep(FALSE, 5))
)
x[5, ] <- NA

summary_plus <- function(v, name_v) {
  tibble(
    column     = name_v,
    typeof     = typeof(v),
    num_unique = length(unique(v)),
    mean       = ifelse(is.numeric(v) | is.logical(v), mean(v, na.rm = TRUE), NA),
    na_frac    = mean(is.na(v)),
    na_num     = sum(is.na(v))
  )}

x %>% imap_dfr(summary_plus)
#> # A tibble: 4 x 6
#>   column typeof    num_unique   mean na_frac na_num
#>   <chr>  <chr>          <int>  <dbl>   <dbl>  <int>
#> 1 ints   integer           10  5.56    0.100      1
#> 2 char   character         10 NA       0.100      1
#> 3 fac    integer           10 NA       0.100      1
#> 4 lgl    logical            3  0.444   0.100      1

Created on 2018-04-14 by the reprex package (v0.2.0).
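Since the original question said base R would also be acceptable, roughly the same one-row-per-column idea can be sketched without purrr, using Map() and rbind(). Treat this as an illustration under assumptions rather than a drop-in replacement: summarise_column is a hypothetical name, the result is a plain data.frame instead of a tibble, and a scalar if/else stands in for ifelse().

```r
x <- data.frame(
  ints = 1:10,
  char = letters[1:10],
  fac  = factor(letters[1:10]),
  lgl  = rep(c(TRUE, FALSE), each = 5),
  stringsAsFactors = FALSE
)
x[5, ] <- NA  # same missing row as in the example above

# Hypothetical helper: summarises one column as a one-row data.frame.
summarise_column <- function(v, name_v) {
  data.frame(
    column     = name_v,
    typeof     = typeof(v),
    num_unique = length(unique(v)),
    # mean only makes sense for numeric and logical columns
    mean       = if (is.numeric(v) || is.logical(v)) mean(v, na.rm = TRUE) else NA_real_,
    na_frac    = mean(is.na(v)),
    na_num     = sum(is.na(v)),
    stringsAsFactors = FALSE
  )
}

# Map() walks the columns and their names in parallel;
# do.call(rbind, ...) stacks the one-row results.
result <- do.call(rbind, Map(summarise_column, x, names(x)))
rownames(result) <- NULL
```

Map() plays the role of imap here by pairing each column with its name, and rbind() does the row-binding that the _dfr suffix provides in purrr.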

jchou · April 14, 2018, 1:21pm

Beautiful! That's exactly what I was hoping for; I knew there had to be a better way. If this were StackOverflow, your response would get the big green checkmark.

I really need to work on wrapping my head around purrr -- I wasn't even aware of the imap_dfr function...