summarize with summary function

lhunsicker · May 11, 2019, 12:05am

How does one use group_by with a function like summary or quantile that yields more than one column? No problem with a single value summary like mean:

babynames %>% group_by(year) %>% summarize(average = mean(n))

But if I want to calculate a set of summaries by year, I am told that the function in summarise needs to have only one value, not a vector of values:

babynames %>% group_by(year) %>% summarize(Summary = summary(n))
Error: Column Summary must be length 1 (a summary value), not 6

Is there an alternative to summarize() to apply the function by the grouping variable?

Thanks in advance for any help on this.

Larry Hunsicker

Yarnabrina · May 11, 2019, 3:10am

You can use it in a list, but you'll loose the details of the values. To retain those, use enframe first,and then list. See below:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
iris %>%
  group_by(Species) %>%
  summarise(list(tibble::enframe(x = summary(object = Petal.Width)))) %>%
  tidyr::unnest()
#> # A tibble: 18 x 3
#>    Species    name    value
#>    <fct>      <chr>   <dbl>
#>  1 setosa     Min.    0.1  
#>  2 setosa     1st Qu. 0.2  
#>  3 setosa     Median  0.2  
#>  4 setosa     Mean    0.246
#>  5 setosa     3rd Qu. 0.3  
#>  6 setosa     Max.    0.6  
#>  7 versicolor Min.    1    
#>  8 versicolor 1st Qu. 1.2  
#>  9 versicolor Median  1.3  
#> 10 versicolor Mean    1.33 
#> 11 versicolor 3rd Qu. 1.5  
#> 12 versicolor Max.    1.8  
#> 13 virginica  Min.    1.4  
#> 14 virginica  1st Qu. 1.8  
#> 15 virginica  Median  2    
#> 16 virginica  Mean    2.03 
#> 17 virginica  3rd Qu. 2.3  
#> 18 virginica  Max.    2.5

^{Created on 2019-05-11 by the reprex package (v0.2.1)}

Hope this helps.

lhunsicker · May 11, 2019, 4:48pm

Thanks, Yarnabina. Basically, one sends summarise() a set of single objects that are named lists of the components of summary(). At the end, one could alternatively use deframe() to get the summaries as a set of separate lists. Now, what would be the simplest way to get the results into a single data.frame (or tibble) with rows having the species in the first column and the components of the summary list as the next six columns?

Larry Hunsicker

Yarnabrina · May 11, 2019, 5:16pm

Well, the first thing that comes to my mind is a base R solution. It gives a matrix, but you can always use as.data.frame.

do.call(what = rbind,
        args = with(data = iris,
                    expr = by(data = Petal.Length,
                              INDICES = Species,
                              FUN = summary)))
#>            Min. 1st Qu. Median  Mean 3rd Qu. Max.
#> setosa      1.0     1.4   1.50 1.462   1.575  1.9
#> versicolor  3.0     4.0   4.35 4.260   4.600  5.1
#> virginica   4.5     5.1   5.55 5.552   5.875  6.9

Update

You can use the new unnest_wider function available in the development version of tidyr.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

iris %>%
  group_by(Species) %>%
  summarise(temp = list(c(summary(object = Petal.Width)))) %>%
  unnest_wider(temp)
#> # A tibble: 3 x 7
#>   Species     Min. `1st Qu.` Median  Mean `3rd Qu.`  Max.
#>   <fct>      <dbl>     <dbl>  <dbl> <dbl>     <dbl> <dbl>
#> 1 setosa       0.1       0.2    0.2 0.246       0.3   0.6
#> 2 versicolor   1         1.2    1.3 1.33        1.5   1.8
#> 3 virginica    1.4       1.8    2   2.03        2.3   2.5

lhunsicker · May 11, 2019, 5:47pm

Excellent! That's exactly what I want.
Could you remind me how to load the development version of tidyr? Somewhere on git_hub, I know.

Reese · May 11, 2019, 7:07pm

Thanks for mentioning about enframe. How can I do this for multiple columns at once Sepal.Width, Petal.Length and Petal.Width ? And give the value columns the names of the summarized variables?

EconomiCurtis · May 12, 2019, 5:15pm

2 posts were split to a new topic: issues installing rlang on windows 10

joels · May 13, 2019, 5:42pm

Below is a function that returns quantiles for all numeric columns in a data frame. This function adds in the quantile names separately at the end of the pipe. At the end of this post is another version of the function that extracts the quantile names within the pipe directly from the named output of the quantile function. Although this second approach seems more tidyverse-ish, it also seems more complicated to reason about. I'd be interested in a more transparent way to extract the quantile names if anyone can suggest a better approach.

library(tidyverse) 

quantile_summary = function(data, group, probs=seq(0,1,0.25), ...) {

  group=enquo(group)
  q.names = paste0(probs*100, "%")
   
  data %>%
    group_by(!!group) %>%
    summarise_if(is.numeric, funs(list(quantile(., probs=probs, ...)))) %>% 
    unnest() %>% 
    # Add column with quantile names. Make quantile names a factor so that 
    #  they'll be ordered correctly when sorted
    group_by(!!group) %>% 
    mutate(quantiles = factor(q.names, levels=q.names)) %>% 
    # Reorder columns to put quantile names second
    select(!!group, quantiles, everything())
}

iris %>% 
  quantile_summary(Species)

quantile_summary of iris data frame

   Species    quantiles Sepal.Length Sepal.Width Petal.Length Petal.Width
   <fct>      <fct>            <dbl>       <dbl>        <dbl>       <dbl>
 1 setosa     0%                4.3         2.3          1            0.1
 2 setosa     25%               4.8         3.2          1.4          0.2
 3 setosa     50%               5           3.4          1.5          0.2
 4 setosa     75%               5.2         3.68         1.58         0.3
 5 setosa     100%              5.8         4.4          1.9          0.6
 6 versicolor 0%                4.9         2            3            1  
 7 versicolor 25%               5.6         2.52         4            1.2
 8 versicolor 50%               5.9         2.8          4.35         1.3
 9 versicolor 75%               6.3         3            4.6          1.5
10 versicolor 100%              7           3.4          5.1          1.8
11 virginica  0%                4.9         2.2          4.5          1.4
12 virginica  25%               6.22        2.8          5.1          1.8
13 virginica  50%               6.5         3            5.55         2  
14 virginica  75%               6.9         3.18         5.88         2.3
15 virginica  100%              7.9         3.8          6.9          2.5

mtcars %>% 
  quantile_summary(cyl, probs=c(0.25, 0.5, 0.75))

quantile_summary of mtcars data frame

    cyl quantiles   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <fct>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     4 25%        22.8  78.8  65.5  3.81  1.88  18.6     1   0.5   4    1   
2     4 50%        26   108    91    4.08  2.2   18.9     1   1     4    2   
3     4 75%        30.4 121.   96    4.16  2.62  20.0     1   1     4    2   
4     6 25%        18.6 160   110    3.35  2.82  16.7     0   0     3.5  2.5 
5     6 50%        19.7 168.  110    3.9   3.22  18.3     1   0     4    4   
6     6 75%        21   196.  123    3.91  3.44  19.2     1   1     4    4   
7     8 25%        14.4 302.  176.   3.07  3.53  16.1     0   0     3    2.25
8     8 50%        15.2 350.  192.   3.12  3.76  17.2     0   0     3    3.5 
9     8 75%        16.2 390   241.   3.22  4.01  17.6     0   0     3    4

starwars %>% 
  quantile_summary(species, na.rm=TRUE, probs=0.5)

quantile_summary of starwars data frame

   species   quantiles height  mass birth_year
   <chr>     <fct>      <dbl> <dbl>      <dbl>
 1 NA        50%         180.  48           62
 2 Aleena    50%          79   15           NA
 3 Besalisk  50%         198  102           NA
 4 Cerean    50%         198   82           92
 5 Chagrian  50%         196   NA           NA
 6 Clawdite  50%         168   55           NA
 7 Droid     50%         132   53.5         33
 8 Dug       50%         112   40           NA
 9 Ewok      50%          88   20            8
10 Geonosian 50%         183   80           NA
# … with 28 more rows

Here is another approach that extracts the quantile names within the pipe directly from the output of quantile. It works, but the code seems like it would probably be hard to follow, given the use of map and multiple "pronouns" within mutate. Perhaps there's a simpler approach.

quantile_summary2 = function(data, group, probs=seq(0,1,0.25), ...) {
  
  group=enquo(group)
  
  data %>%
    group_by(!!group) %>%
    summarise_if(is.numeric, funs(list(quantile(., probs=probs, ...)))) %>% 
    # Add quantile names
    #  Get quantile names from the last column since this will always be a 
    #  nested list of summary values (assuming there is at least one numeric 
    #  column in the input data frame)
    mutate(quantiles = .[[ncol(.)]] %>% 
                         map(~factor(names(.x), levels=names(.x)))) %>%  
    unnest() %>% 
    # Reorder columns to put quantile names second
    select(!!group, quantiles, everything())
}

Yarnabrina · May 14, 2019, 5:33am

Hi Joel,

As you have guessed, it's hard to follow your second approach, especially for me as I'm quite unfamiliar with tidyverse.

I tried to modify my previous code in two ways, which I believe are simpler, but with limitations.

In the 1st one, I'd like to change the value of enframe with each variable as their name, which I don't know how to do it. For the 2nd one, it is essentially similar to your 1st method of manually insertion, but a little indirectly. I'd like to extract the names of the results of summary without running it on a fake data.

Here are my approaches, and their results:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)
library(tidyr)

# way 1
# how to name the value columns with the variable names inside enframe
iris %>%
  group_by(Species) %>%
  summarise_if(.predicate = is.numeric,
               .funs = ~ list(enframe(x = summary(object = .)))) %>%
  unnest() %>%
  select(which(x = !duplicated(x = lapply(X = .,
                                          FUN = summary))))
#> # A tibble: 18 x 6
#>    Species    name    value value1 value2 value3
#>    <fct>      <chr>   <dbl>  <dbl>  <dbl>  <dbl>
#>  1 setosa     Min.     4.3    2.3    1     0.1  
#>  2 setosa     1st Qu.  4.8    3.2    1.4   0.2  
#>  3 setosa     Median   5      3.4    1.5   0.2  
#>  4 setosa     Mean     5.01   3.43   1.46  0.246
#>  5 setosa     3rd Qu.  5.2    3.68   1.58  0.3  
#>  6 setosa     Max.     5.8    4.4    1.9   0.6  
#>  7 versicolor Min.     4.9    2      3     1    
#>  8 versicolor 1st Qu.  5.6    2.52   4     1.2  
#>  9 versicolor Median   5.9    2.8    4.35  1.3  
#> 10 versicolor Mean     5.94   2.77   4.26  1.33 
#> 11 versicolor 3rd Qu.  6.3    3      4.6   1.5  
#> 12 versicolor Max.     7      3.4    5.1   1.8  
#> 13 virginica  Min.     4.9    2.2    4.5   1.4  
#> 14 virginica  1st Qu.  6.22   2.8    5.1   1.8  
#> 15 virginica  Median   6.5    3      5.55  2    
#> 16 virginica  Mean     6.59   2.97   5.55  2.03 
#> 17 virginica  3rd Qu.  6.9    3.18   5.88  2.3  
#> 18 virginica  Max.     7.9    3.8    6.9   2.5

# way 2
# how to get the names of the summary output without a fake data
iris %>%
  group_by(Species) %>%
  summarise_if(.predicate = is.numeric,
               .funs = ~ list(summary(object = .))) %>%
  unnest() %>%
  group_by(Species) %>%
  mutate(Statistic = names(x = summary(object = rnorm(n = 1)))) %>%
  ungroup() %>%
  select(Species, Statistic, everything())
#> # A tibble: 18 x 6
#>    Species    Statistic Sepal.Length Sepal.Width Petal.Length Petal.Width
#>    <fct>      <chr>            <dbl>       <dbl>        <dbl>       <dbl>
#>  1 setosa     Min.              4.3         2.3          1          0.1  
#>  2 setosa     1st Qu.           4.8         3.2          1.4        0.2  
#>  3 setosa     Median            5           3.4          1.5        0.2  
#>  4 setosa     Mean              5.01        3.43         1.46       0.246
#>  5 setosa     3rd Qu.           5.2         3.68         1.58       0.3  
#>  6 setosa     Max.              5.8         4.4          1.9        0.6  
#>  7 versicolor Min.              4.9         2            3          1    
#>  8 versicolor 1st Qu.           5.6         2.52         4          1.2  
#>  9 versicolor Median            5.9         2.8          4.35       1.3  
#> 10 versicolor Mean              5.94        2.77         4.26       1.33 
#> 11 versicolor 3rd Qu.           6.3         3            4.6        1.5  
#> 12 versicolor Max.              7           3.4          5.1        1.8  
#> 13 virginica  Min.              4.9         2.2          4.5        1.4  
#> 14 virginica  1st Qu.           6.22        2.8          5.1        1.8  
#> 15 virginica  Median            6.5         3            5.55       2    
#> 16 virginica  Mean              6.59        2.97         5.55       2.03 
#> 17 virginica  3rd Qu.           6.9         3.18         5.88       2.3  
#> 18 virginica  Max.              7.9         3.8          6.9        2.5

Can you please let me know how to modify them?

(I'm using summary just to continue from my last code. You can obviously use quantile(x = ., probs = p instead of summary(object = .), if you want.)

joels · May 15, 2019, 4:27am

When I was putting together my answer, I first tried to create the names inside enframe, but I couldn't come up with anything that worked. Your second approach works, but I also don't like having to generate a "dummy" summary (or quantile) just to get the names.

I'd still like to find a method that can work within a pipe but that is simpler than my second approach. For now, here's some additional explanation of my code. First, let's stop after summarise_if and look at the output:

d = iris %>%
  group_by(Species) %>%
  summarise_if(is.numeric, funs(list(quantile(.)))) 

d

  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>      <list>       <list>      <list>       <list>     
1 setosa     <dbl [5]>    <dbl [5]>   <dbl [5]>    <dbl [5]>  
2 versicolor <dbl [5]>    <dbl [5]>   <dbl [5]>    <dbl [5]>  
3 virginica  <dbl [5]>    <dbl [5]>   <dbl [5]>    <dbl [5]>

Now look at just the last column. It has three rows, each of which is a list element containing a named vector of quantiles.

d[[ncol(d)]]

[[1]]
  0%  25%  50%  75% 100% 
 0.1  0.2  0.2  0.3  0.6 

[[2]]
  0%  25%  50%  75% 100% 
 1.0  1.2  1.3  1.5  1.8 

[[3]]
  0%  25%  50%  75% 100% 
 1.4  1.8  2.0  2.3  2.5

If we just do d %>% unnest we get the quantiles for all four numeric columns, but no column with the names of the quantiles. However, we can see above that each of the quantile vectors has the names we need. We just need to find a way to access them.

.[[ncol(.)]] returns the last column of the output of summarise_if (just as d[[ncol(d)]] did above). I chose that column because we know it will always be one of the quantile columns. This column is a three-element list. But instead of the values, we want the names. We can get them with d %>% mutate(quantiles = .[[ncol(.)]] %>% map(names)), which maps over the three list elements of .[[ncol(.)]] and returns just the names instead of the values. I also used factor to ensure that the quantiles would be ordered in increasing magnitude and then reordered the columns with select.

tbradley · May 15, 2019, 4:38pm

You can also do it using purrr::partial like this:

library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)

p <- c(0.2, 0.5, 0.8)

p_names <- map_chr(p, ~paste0(.x*100, "%"))

p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>% 
  set_names(nm = p_names)

mtcars %>% 
  group_by(cyl) %>% 
  summarize_at(vars(mpg, hp), p_funs)
#> # A tibble: 3 x 7
#>     cyl `mpg_20%` `hp_20%` `mpg_50%` `hp_50%` `mpg_80%` `hp_80%`
#>   <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
#> 1     4      22.8       65      26        91       30.4       97
#> 2     6      18.3      110      19.7     110       21        123
#> 3     8      13.9      175      15.2     192.      16.8      245

Created on 2019-05-15 by the reprex package (v0.2.0).

You can see a full write up for using this approach here

Yarnabrina · May 15, 2019, 5:24pm

Thanks for providing another way to do it.

As it happened, I became too much obsessed with this question, and ended up asking on SO. utubun provided a nice simple solution there, which am I providing below for completeness.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)
library(tidyr)

iris %>%
  group_by(Species) %>%
  summarise_if(.predicate = is.numeric,
               .funs = ~ list(enframe(x = summary(object = .)))) %>%
  gather(key = "attributes",
         value = "summaries",
         -Species) %>%
  unnest() %>%
  spread(key = "attributes",
         value = "value")
#> # A tibble: 18 x 6
#>    Species    name    Petal.Length Petal.Width Sepal.Length Sepal.Width
#>    <fct>      <chr>          <dbl>       <dbl>        <dbl>       <dbl>
#>  1 setosa     1st Qu.         1.4        0.2           4.8         3.2 
#>  2 setosa     3rd Qu.         1.58       0.3           5.2         3.68
#>  3 setosa     Max.            1.9        0.6           5.8         4.4 
#>  4 setosa     Mean            1.46       0.246         5.01        3.43
#>  5 setosa     Median          1.5        0.2           5           3.4 
#>  6 setosa     Min.            1          0.1           4.3         2.3 
#>  7 versicolor 1st Qu.         4          1.2           5.6         2.52
#>  8 versicolor 3rd Qu.         4.6        1.5           6.3         3   
#>  9 versicolor Max.            5.1        1.8           7           3.4 
#> 10 versicolor Mean            4.26       1.33          5.94        2.77
#> 11 versicolor Median          4.35       1.3           5.9         2.8 
#> 12 versicolor Min.            3          1             4.9         2   
#> 13 virginica  1st Qu.         5.1        1.8           6.22        2.8 
#> 14 virginica  3rd Qu.         5.88       2.3           6.9         3.18
#> 15 virginica  Max.            6.9        2.5           7.9         3.8 
#> 16 virginica  Mean            5.55       2.03          6.59        2.97
#> 17 virginica  Median          5.55       2             6.5         3   
#> 18 virginica  Min.            4.5        1.4           4.9         2.2

^{Created on 2019-05-15 by the reprex package (v0.2.1)}

joels · May 15, 2019, 5:49pm

Thanks for your efforts on this Anirban. That's a nice solution! So for quantiles, you could do this:

iris %>%
  group_by(Species) %>%
  summarise_if(is.numeric, ~list(enframe(quantile(.), name="quantiles"))) %>%
  gather(variable, value, -Species) %>%
  unnest() %>%
  mutate(quantiles = factor(quantiles, levels=unique(quantiles)))

And if you want it in wide format, add %>% spread(variable, value)

system · May 22, 2019, 5:50pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.