A statistics question: Summarize grouped data by distribution

Hi

What I'm trying to do is probably right in front of me with tidyverse but I'm not that experienced in R. Here's my data:

GroupA, GroupB, x
1,1,53.1
1,1,49490
1,1,30129301
1,1,....
1,1,1973
1,2,4394829
1,2,313221
1,2,...
1,2,9090909
1,3...
1,15,967647634
...
4,1,...
4,15,...
5,1,...
5,5,877656
5,5,...
5,5,54321 (last row of data)

In short, GroupA has 5 possible values and GroupB has 15, so each value of x falls into one of 75 combinations of GroupA and GroupB.

For each of the 75 combinations, I want not just the min, max, standard deviation, mean and so on, but a distribution (probably quantiles at every 5%) so that I can build a frequency distribution, or 75 of them to be precise. For example, for each combination of GroupA and GroupB, I want R to calculate the 0th, 5th, 10th, ..., 95th and 100th percentile values, then count the number of values of x that fall between consecutive percentiles, ending up with a frequency distribution over 21 points (including 0).

I'd also like these automatically plotted (i.e. 75 charts) if possible - but not so worried about that.

What I'm really after is seeing whether any of these distributions shows a long tail on either side of the mean. I've always hated box plots (and interquartile ranges, which are too imprecise for me), so I much prefer seeing the distribution itself. Then I can worry about the umpteen ways of comparing distributions (coefficient of skewness, kurtosis, etc.).

Thanks
RM

If you just want to see the distributions, try geom_density() and facet_grid() from ggplot2. With 15 plots along one axis, you will probably want to open a separate plot window with windows() (on Windows) or X11() (on Linux) and set its width and height to large values.

library(ggplot2)
DF <- data.frame(GroupA = rep(c(1,2), each = 400),
                 GroupB = rep(rep(1:4, each = 100), 2),
                 x = c(rnorm(100), rnorm(100, 2, 1), rnorm(100, 0.5, 2), rnorm(100, -1, .2),
                       rnorm(100, 4, 2), rnorm(100,0, 2), rnorm(100, -2, 0.5), 
                       rnorm(100, .4, 4)))
ggplot(DF, aes(x)) + geom_density() + facet_grid(GroupA ~ GroupB)

Created on 2019-10-08 by the reprex package (v0.3.0.9000)
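
As an alternative to a large on-screen window, the full 5 x 15 grid could be written straight to a file. The following is only a sketch under assumptions that are not in the original post: the toy data are made up (exponential draws to mimic a long right tail), the file name and page size are arbitrary, and scales = "free" is a guess at what helps when groups differ a lot in magnitude.

```r
library(ggplot2)

set.seed(1)

# Toy data (made up): 5 x 15 groups with 50 right-skewed values each
groups <- expand.grid(GroupA = 1:5, GroupB = 1:15)
DF <- groups[rep(seq_len(nrow(groups)), each = 50), ]
DF$x <- rexp(n = nrow(DF), rate = 0.00001)

# One density panel per GroupA/GroupB combination;
# scales = "free" lets the axes vary across rows and columns
p <- ggplot(DF, aes(x)) +
    geom_density() +
    facet_grid(GroupA ~ GroupB, scales = "free")

# Write a wide page so that 15 columns of panels stay readable
# (the path and the 30 x 12 inch size are arbitrary choices)
out_file <- file.path(tempdir(), "densities.pdf")
ggsave(filename = out_file, plot = p, width = 30, height = 12, units = "in")
```

The PDF can then be zoomed freely, which may be easier than resizing a plot window.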


Here's one way to get the quantiles and the frequencies in those intervals:

set.seed(seed = 41791)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

fake_dataset <- data.frame(A = sample.int(n = 5,
                                          size = 10000,
                                          replace = TRUE),
                           B = sample.int(n = 15,
                                          size = 10000,
                                          replace = TRUE),
                           x = rexp(n = 10000,
                                    rate = 0.00001))

fake_dataset %>%
    group_by(A, B) %>%
    summarise(quantiles = list(quantile(x = x,
                                        probs = seq(from = 0,
                                                    to = 1,
                                                    by = 0.05))),
              frequencies = list(tabulate(match(x = cut(x = x,
                                                        breaks = unlist(x = quantiles),
                                                        labels = 1:20,
                                                        include.lowest = TRUE,
                                                        right = TRUE),
                                                table = 1:20)))) %>%
    ungroup()
#> # A tibble: 75 x 4
#>        A     B quantiles  frequencies
#>    <int> <int> <list>     <list>     
#>  1     1     1 <dbl [21]> <int [20]> 
#>  2     1     2 <dbl [21]> <int [20]> 
#>  3     1     3 <dbl [21]> <int [20]> 
#>  4     1     4 <dbl [21]> <int [20]> 
#>  5     1     5 <dbl [21]> <int [20]> 
#>  6     1     6 <dbl [21]> <int [20]> 
#>  7     1     7 <dbl [21]> <int [20]> 
#>  8     1     8 <dbl [21]> <int [20]> 
#>  9     1     9 <dbl [21]> <int [20]> 
#> 10     1    10 <dbl [21]> <int [20]> 
#> # … with 65 more rows

I kept the quantiles and frequencies as list columns because I don't know what you plan to do with them later. You can use the new unnest_wider() function from tidyr to spread them into 21 and 20 separate columns, respectively.
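
Here is a minimal sketch of that widening step, assuming a tidyr version that provides unnest_wider() with its names_sep argument. The toy data and the object names (toy, probs, summary_tbl, wide) are made up for illustration, and the counting is done with tabulate() on the cut() result rather than with the match() call above.

```r
library(dplyr)
library(tidyr)

set.seed(1)

# Toy data (made up): 2 x 2 groups with 100 values each
toy <- data.frame(A = rep(1:2, each = 200),
                  B = rep(rep(1:2, each = 100), times = 2),
                  x = rexp(n = 400, rate = 0.00001))

probs <- seq(from = 0, to = 1, by = 0.05)

# Same idea as above: one list element per group for each summary
summary_tbl <- toy %>%
    group_by(A, B) %>%
    summarise(quantiles = list(quantile(x, probs = probs)),
              frequencies = list(tabulate(cut(x,
                                              breaks = quantile(x, probs = probs),
                                              include.lowest = TRUE),
                                          nbins = 20))) %>%
    ungroup()

# Spread each list column into one column per element;
# names_sep prefixes the new columns with the original column name
wide <- summary_tbl %>%
    unnest_wider(quantiles, names_sep = "_") %>%
    unnest_wider(frequencies, names_sep = "_")

dim(wide)
```

Note that the counts between consecutive 5% quantiles are all close to n/20 by construction, so a long tail shows up in how far apart the quantile values are rather than in the frequencies.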

For the plotting part, see the answer Francis suggested.
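
Since the question is ultimately about long tails, a per-group skewness figure could sit alongside the quantiles. This is only a base R sketch: skewness_g1() is a hypothetical helper that computes the moment-based sample skewness, m3 / m2^(3/2), and the toy data are made up.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 3/2 (hypothetical helper)
skewness_g1 <- function(v) {
    centred <- v - mean(v)
    mean(centred^3) / mean(centred^2)^(3 / 2)
}

set.seed(1)

# Toy data (made up): 2 x 2 groups with 200 right-skewed values each
toy <- data.frame(A = rep(1:2, each = 400),
                  B = rep(rep(1:2, each = 200), times = 2),
                  x = rexp(n = 800, rate = 0.00001))

# One skewness value per combination of A and B
skew_by_group <- aggregate(x ~ A + B, data = toy, FUN = skewness_g1)
names(skew_by_group)[3] <- "skewness"
skew_by_group
```

Positive values point to a longer right tail and negative values to a longer left tail; values near zero suggest a roughly symmetric distribution.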

