Descriptive Statistics Table Help

sft_10 · April 25, 2020, 7:25pm

Hi-- just wondering what the best package/method would be to make a table of descriptive statistics if I have both continuous and categorical variables? For context, I'd want it to look something like this

EconomiCurtis · April 25, 2020, 8:20pm

I'd suggest creating a small data.frame with example data to give folks a sense of what you're working with.

There's a bunch of R packages that help you create summary tables. A good way to review which will work best for you is to check out the vignettes. For example, qwraps2 has this nice guide: https://cran.r-project.org/web/packages/qwraps2/vignettes/summary-statistics.html

Also the Tidyverse has a bunch of packages that make munging and summarizing data quite easy, and then tools like gt and kable help get your summary table polished for publication.
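
As a rough sketch of that workflow (summarise with dplyr, then format with `knitr::kable()`), here is a minimal example. It assumes the built-in `mtcars` data as a stand-in, and the column labels are made up for illustration.

```r
library(dplyr)
library(knitr)

# mtcars is a stand-in dataset, used only for illustration
summary_tbl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n        = n(),
    mean_mpg = round(mean(mpg), 2),
    sd_mpg   = round(sd(mpg), 2)
  )

# kable() renders the summary data frame as a Markdown table
kable(summary_tbl, col.names = c("Cylinders", "N", "Mean mpg", "SD mpg"))
```

The same `summary_tbl` could be passed to `gt::gt()` instead if you prefer that package's styling tools.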

I personally like to just build the table I'd like to see. As an example of that, see below. But more often than not, one of the summary table packages has a set of functions that does this kind of thing for you.

library(dplyr)
library(tidyr)
library(glue)
df <- tibble(
  ob   = 1:4,
  num1 = c(1,2,3,3),
  num2 = c(3,4,4,5),
  cat1 = sample(c(1,0), 4, replace = TRUE),
  cat2 = sample(c(1,0), 4, replace = TRUE)
)

result <- bind_rows(
  
  df %>% 
    select(ob, num1, num2) %>%   # select(where(is.numeric)) requires dplyr 1.0.0
    pivot_longer(c(-ob)) %>% 
    group_by(name) %>% 
    summarize(
      `Mean (SD)` = glue("{mean(value)} ({sd(value) %>% round(2)})"),
    ),
  
  df %>%  
    select(ob, cat1, cat2) %>% 
    pivot_longer(c(-ob)) %>%
    group_by(name) %>% 
    summarise(
      # as.character() so the "-" placeholder below matches the column type
      `%` = as.character(mean(value))
    )
  
) %>% 
    replace_na(
      list(
        `Mean (SD)` = "",
        `%` = "-"
      )
    )

result
#> # A tibble: 4 x 3
#>   name  `Mean (SD)`   `%`  
#>   <chr> <glue>        <chr>
#> 1 num1  "2.25 (0.96)" -    
#> 2 num2  "4 (0.82)"    -    
#> 3 cat1  ""            1    
#> 4 cat2  ""            0.5

Created on 2020-04-25 by the reprex package (v0.3.0)

sft_10 · April 26, 2020, 4:47am

Hi, thanks for your help! For context, here's a bit of the data I'm working with.

| |read|write|math|science|socst|female|race|ses|schtyp|prog|
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|1|57|52|41|47|57|Male|White|Low|Public|General|
|2|68|59|53|63|61|Female|White|Medium|Public|Vocational|
|3|44|33|54|58|31|Male|White|High|Public|General|
|4|63|44|47|53|56|Male|White|High|Public|Vocational|
|5|47|52|57|53|61|Male|White|Medium|Public|Academic|
|6|44|52|51|63|61|Male|White|Medium|Public|Academic|
|7|50|59|42|53|61|Male|Hispanic|Medium|Public|General|
|8|34|46|45|39|36|Male|African American|Medium|Public|Academic|
|9|63|57|54|58|51|Male|White|Medium|Public|General|
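
For data shaped like this, one possible approach is to summarise the continuous columns as "mean (SD)" and the categorical columns as "n (%)", then stack the two pieces. The sketch below retypes a few columns from the nine rows above; the `variable`, `level`, and `value` column names are hypothetical choices, not from any package.

```r
library(dplyr)
library(tidyr)

# A few columns from the rows above, retyped by hand
dat <- tibble(
  read   = c(57, 68, 44, 63, 47, 44, 50, 34, 63),
  write  = c(52, 59, 33, 44, 52, 52, 59, 46, 57),
  female = c("Male", "Female", "Male", "Male", "Male", "Male", "Male", "Male", "Male"),
  ses    = c("Low", "Medium", "High", "High", "Medium", "Medium", "Medium", "Medium", "Medium")
)

# Continuous variables: one row per variable, formatted as "mean (SD)"
continuous <- dat %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "variable") %>%
  group_by(variable) %>%
  summarise(value = sprintf("%.1f (%.1f)", mean(value), sd(value)))

# Categorical variables: one row per level, formatted as "n (%)"
categorical <- dat %>%
  select(where(is.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "level") %>%
  count(variable, level) %>%
  group_by(variable) %>%
  mutate(value = sprintf("%d (%.0f%%)", n, 100 * n / sum(n))) %>%
  ungroup() %>%
  transmute(variable = paste0(variable, ": ", level), value)

bind_rows(continuous, categorical)
```

Because both pieces end up with the same two character columns, they stack cleanly with `bind_rows()`.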

Jason.C · May 12, 2020, 4:06pm

Hi, I basically have the same question. I’ve looked through several options but haven't found a way to create a table that shows: means(sd) for multiple variables organized by group.

Using my data (below), I can use 'group_by()' and 'across()' to estimate the 𝑥̄ and sd.

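library(dplyr)

means_sds <- list(
	mean = ~mean(.x, na.rm = TRUE),
	sd = ~sd(.x, na.rm = TRUE)
)

# where() wraps the predicate; passing is.numeric directly to across() is deprecated
vars.avg.sd <- dat.sm %>% 
	group_by(group.factor) %>% 
	summarise(across(where(is.numeric), means_sds))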

However, I still have to transpose the result to approximate my goal, and even then the 𝑥̄ and sd end up on alternating rows.

Any help or suggestions would be greatly appreciated.

Cheers,
~ Jason the #rstatsnewbie

t(vars.avg.sd)

Data

 dput(head(dat.sm, 30))
structure(list(group.factor = structure(c(1L, 1L, 1L, 1L, 1L, 
1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L), .Label = c("HC", "CLBP", 
"FM", "TRUE"), class = "factor"), slpQual = c(0, 1, 1, 1, 1, 
0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 2, 0, 2, 1, 
1, 2, 1, 2), slpLat = c(1, 0, 2, 1, 1, 0, 1, 1, 1, 2, 0, 1, 1, 
1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 2, 0, 1, 1, 1, 3), slpDur = c(1, 
0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 
1, 0, 2, 1, 1, 0, 1, 3), slpEff = c(0, 0, 1, 0, 1, 1, 0, 0, 0, 
0, 0, 0, 0, 3, 3, 0, 3, 1, 0, 0, 3, 0, 0, 3, 1, 3, 0, 1, 1, 3
), slpDist = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), slpMeds = c(0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 
0, 0, 0, 0, 0, 3, 1), slpDayFcn = c(0, 0, 1, 0, 1, 0, 0, 1, 0, 
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
), psqi_Global = c(3, 2, 7, 3, 5, 3, 3, 4, 2, 6, 0, 3, 4, 6, 
7, 1, 7, 5, 6, 3, 5, 2, 6, 4, 8, 6, 4, 5, 8, 13), slpQualraw = c(0, 
1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 
2, 0, 2, 1, 1, 2, 1, 2), slpLatraw = c(12, 6, 33, 17, 17, 6, 
19.5, 32, 12, 33, 6, 7, 12, 12, 18, 11, 17, 22, 24.5, 6, 13.5, 
12, 22, 11, 33, 11, 17, 17, 19.5, 109), slpDurraw = c(6.5, 7.5, 
7, 7.5, 8, 7, 7, 7.5, 7.5, 6.5, 9, 7.25, 8, 8, 6.5, 8, 7, 7, 
7.5, 7, 7, 7.5, 7, 8, 5, 6, 7, 7.5, 7, 4.5), slpEffraw = c(92.8571428571429, 
100, 82.3529411764706, 93.75, 84.2105263157895, 82.3529411764706, 
93.3333333333333, 88.2352941176471, 100, 92.8571428571429, 100, 
90.625, 100, 25, 20.9677419354839, 94.1176470588235, 21.875, 
82.3529411764706, 85.7142857142857, 87.5, 21.875, 93.75, 93.3333333333333, 
24.8062015503876, 83.3333333333333, 20, 93.3333333333333, 83.3333333333333, 
82.3529411764706, 50), slpDistraw = c(2, 3, 1, 4, 5, 3, 5, 8, 
1, 7, 0, 2, 7, 5, 6, 1, 6, 3, 4, 5, 8, 5, 7, 3, 3, 4, 3, 16, 
11, 13), slpMedsraw = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 1), slpDayFcnraw = c(0, 
0, 1, 2, 1, 0, 0, 1, 0, 1, 0, 0, 1, 2, 3, 0, 3, 0, 1, 3, 2, 0, 
2, 0, 0, 0, 0, 3, 3, 0), psqi_Globalraw = c(113.357142857143, 
117.5, 125.352941176471, 125.25, 116.210526315789, 98.3529411764706, 
124.833333333333, 137.735294117647, 120.5, 141.357142857143, 
115, 107.875, 129, 53, 55.4677419354839, 114.117647058824, 55.875, 
115.352941176471, 125.714285714286, 109.5, 52.375, 118.25, 134.333333333333, 
46.8062015503876, 126.333333333333, 42, 121.333333333333, 128.833333333333, 
126.852941176471, 179.5), ess_total = c(5, 5, 4, 7, 10, 2, 5, 
5, 4, 9, 6, 10, 8, 9, 6, 8, 9, 3, 6, 10, 8, 10, 3, 0, 10, 6, 
6, 6, 9, 6), isi_total = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA), bdi_total = c(0, 1, 1, 13, 5, 0, 1, 6, 0, 
7, 0, 2, 0, 6, 0, 3, 1, 0, 0, 7, 12, 5, 7, 0, 2, 8, 1, 9, 12, 
4), mcgill_total = c(0, 0, 0, 0, 0, 0, 9, 0, 0, 4, 0, 1, 0, 5, 
0, 0, 0, 0, 0, 38, 34, 16, 0, 0, 0, 5, 2, 14, 0, 0), TIB = c(7, 
7.5, 8.5, 8, 9.5, 8.5, 7.5, 8.5, 7.5, 7, 9, 8, 8, 32, 31, 8.5, 
32, 8.5, 8.75, 8, 32, 8, 7.5, 32.25, 6, 30, 7.5, 9, 8.5, 9)), row.names = c(NA, 
30L), class = "data.frame")

nirgrahamuk · May 12, 2020, 4:29pm

I don't have the version of dplyr with across(), so I used the older summarise_if().

library(dplyr)
library(tidyr)
library(stringr)

means_sds <- list(
  mean = ~ mean(.x, na.rm = TRUE),
  sd = ~ sd(.x, na.rm = TRUE)
)

vars.avg.sd <- dat.sm %>%
  group_by(group.factor) %>%
  summarise_if(is.numeric, means_sds)

# Split the wide summary into its mean and sd columns, then reshape each to long form
vars.sd  <- vars.avg.sd %>% select(group.factor, ends_with("_sd"))
vars.avg <- vars.avg.sd %>% select(group.factor, ends_with("_mean"))
(vars.sdp  <- pivot_longer(vars.sd,  cols = -1, values_to = "sd"))
(vars.avgp <- pivot_longer(vars.avg, cols = -1, values_to = "mean"))

# Rows are in the same order in both tables, so the sd column can be bound on directly
(vars_both <- bind_cols(vars.avgp, select(vars.sdp, sd)))

# Drop the "_mean" suffix so `name` holds only the variable name
(vars_both_tidy <- vars_both %>%
  mutate(name = str_remove(name, "_mean$")))

# A tibble: 63 x 4
   group.factor name          mean    sd
   <fct>        <chr>        <dbl> <dbl>
 1 HC           slpQual      0.826 0.576
 2 HC           slpLat       0.826 0.650
 3 HC           slpDur       0.522 0.593
 4 HC           slpEff       0.870 1.22 
 5 HC           slpDist      0.957 0.209
 6 HC           slpMeds      0.130 0.458
 7 HC           slpDayFcn    0.261 0.449
 8 HC           psqi_Global  4.39  2.13 
 9 HC           slpQualraw   0.826 0.576
10 HC           slpLatraw   16.5   9.26 
# ... with 53 more rows
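
With dplyr 1.0.0 or later, a long table of this shape could probably also be built in one step by letting `pivot_longer()` split the `_mean` / `_sd` suffix. This is only a sketch: `toy` is a made-up stand-in for `dat.sm`, and the regular expression assumes every summary column ends in `_mean` or `_sd`.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for dat.sm; the values are made up for illustration
toy <- tibble(
  group.factor = factor(c("HC", "HC", "FM", "FM")),
  slpQual      = c(0, 1, 1, 2),
  psqi_Global  = c(3, 5, 8, 13)
)

long <- toy %>%
  group_by(group.factor) %>%
  summarise(across(
    where(is.numeric),
    list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE))
  )) %>%
  # ".value" sends the captured suffix (mean / sd) to its own column;
  # the greedy "(.*)" keeps underscores inside names such as psqi_Global
  pivot_longer(
    -group.factor,
    names_to      = c("name", ".value"),
    names_pattern = "(.*)_(mean|sd)$"
  )

long
```

The result has the columns `group.factor`, `name`, `mean`, and `sd`, with one row per group and variable.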

Jason.C · May 12, 2020, 5:42pm

Thank you, this is very helpful! I'll need some time to figure out how all of this works.

My next goal is to put this into a table like you did above. I'll try using glue and report back soon!

Edit 1 = failure

Ok, I’ve tried for several hours now and I cannot figure out how to convert the tibble above to a table I want.

My goal is a table similar to this:

|Variable|HC|CLBP|FM|
| --- | --- | --- | --- |
|slpQual|mean(SD)|mean(SD)|mean(SD)|
|slpLat|mean(SD)|mean(SD)|mean(SD)|
|slpDur|mean(SD)|mean(SD)|mean(SD)|

I know this should be a simple task, but I simply can't figure out a way to make this simple table. Any help is greatly appreciated.

~ Jason the #rstatsnewbie
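
One possible way to get from a long tibble with `mean` and `sd` columns to that layout is to paste the two numbers into a single string and then spread the groups into columns with `pivot_wider()`. The sketch below uses a small made-up tibble (`vars_long`) in place of the real summary; the `cell` and `Variable` names are illustrative choices.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the long mean/sd tibble; the numbers are made up
vars_long <- tibble(
  group.factor = factor(rep(c("HC", "CLBP", "FM"), each = 2),
                        levels = c("HC", "CLBP", "FM")),
  name         = rep(c("slpQual", "slpLat"), times = 3),
  mean         = c(0.83, 0.83, 0.33, 0.67, 1.50, 1.50),
  sd           = c(0.58, 0.65, 0.58, 0.58, 0.58, 1.00)
)

wide <- vars_long %>%
  # Combine the two numbers into one display string per cell
  mutate(cell = sprintf("%.2f (%.2f)", mean, sd)) %>%
  select(Variable = name, group.factor, cell) %>%
  # One column per group, one row per variable
  pivot_wider(names_from = group.factor, values_from = cell)

wide
```

The resulting data frame has one row per variable and one column per group, so it can be passed straight to a table formatter such as `knitr::kable()`.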
