Descriptive Summary - Mean and SD

Hello,
I am able to create a simple descriptive table using the describe function, but it does not dig deep enough to show the statistics within each variable. For example, if my data set has gender, political party, education, etc., I want to see the statistics for gender (male and female), for political party (republican, democrat, other), and for education (no education, hs, college, etc.). This information could help with understanding the mean and SD within each variable more accurately.

Can you provide an example of how to accomplish this?

It does not make sense to calculate the mean of a variable that consists of categories. If you have 27 republicans, 29 democrats, and 2 other, how would you calculate a mean? The Hmisc::describe() function does give you the count and the proportion in each category.
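As a rough illustration of what "count and proportion in each category" looks like, here is a minimal sketch using base R's table() and prop.table() rather than Hmisc. It assumes mtcars$cyl can stand in for a categorical variable such as political party.

```r
# Counts per category (cyl stands in for a categorical variable)
cyl_counts <- table(mtcars$cyl)

# Proportions per category; these sum to 1
cyl_props <- prop.table(cyl_counts)

cyl_counts
round(cyl_props, 3)
```

For a categorical variable, counts and proportions are the meaningful summaries; a mean and SD only make sense for a numeric variable measured within each category.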

Here are the mean and standard deviation of mpg and qsec for automatic (am = 0) and manual (am = 1) transmissions, then the same for the number of cylinders (4, 6, 8).

library(tidyverse)

mtcars |> 
  group_by(am) |> 
  summarise(across(c(mpg, qsec), list(mean = mean, sd = sd)))
#> # A tibble: 2 × 5
#>      am mpg_mean mpg_sd qsec_mean qsec_sd
#>   <dbl>    <dbl>  <dbl>     <dbl>   <dbl>
#> 1     0     17.1   3.83      18.2    1.75
#> 2     1     24.4   6.17      17.4    1.79

mtcars |> 
  group_by(cyl) |> 
  summarise(across(c(mpg, qsec), list(mean = mean, sd = sd)))
#> # A tibble: 3 × 5
#>     cyl mpg_mean mpg_sd qsec_mean qsec_sd
#>   <dbl>    <dbl>  <dbl>     <dbl>   <dbl>
#> 1     4     26.7   4.51      19.1    1.68
#> 2     6     19.7   1.45      18.0    1.71
#> 3     8     15.1   2.56      16.8    1.20

Created on 2022-10-08 with reprex v2.0.2

Thanks for the help. This picture shows what I am trying to achieve.

Are you trying to find the mean and standard deviation of some other variable grouped by gender or political party? If so, look at @EconProf's use of group_by().

The table of what you want to achieve does not make sense to me. For gender, instead of finding the differences in the median (not mean?) and standard deviation for males and females, you have just one value of the median and one of the standard deviation for males and females combined.

You also give no indication of what variable you are calculating the median and standard deviation for. If you have a median of 3.4 for that variable for 700 cases (350 male and 350 female) then there should be a median of 3.4 for the same 700 cases (200 democrats, 300 republicans and 200 others).

It would be very helpful if you could provide a reprex with a sample of your data and any R code you have tried so far.

Hi EconProf,

The information in the table is just for illustration and is not even accurate. But you are correct. In my data analysis, it would make sense to extract additional information from variables like "political party" and "ideology", so that the end result does not sound so biased.

Suppose that I wanted to calculate the mean and standard deviation of mpg for cars with 4, 6 and 8 cylinders. Given that I am only doing this for one variable, mpg, the code is simpler than shown in my earlier post.

library(tidyverse)

mtcars |> 
  group_by(cyl) |> 
  summarise(mean = mean(mpg, na.rm = TRUE), 
            sd = sd(mpg, na.rm = TRUE)
  )
#> # A tibble: 3 × 3
#>     cyl  mean    sd
#>   <dbl> <dbl> <dbl>
#> 1     4  26.7  4.51
#> 2     6  19.7  1.45
#> 3     8  15.1  2.56

Created on 2022-10-09 with reprex v2.0.2

If I then wanted to do the same thing for cars with automatic and manual transmissions, I would substitute am for cyl in the code above.

For you, group_by(gender) and replace mpg with the name of the relevant variable. Repeat this grouping by political_party and then by education.
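To make that substitution concrete, here is a sketch on made-up survey data. The data frame survey, the column names gender, political_party and approval, and all the values are hypothetical; swap in your own data and variable names.

```r
library(tidyverse)

# Hypothetical survey data: every name and value here is invented for illustration
survey <- tibble(
  gender          = c("male", "female", "female", "male", "female", "male"),
  political_party = c("republican", "democrat", "other",
                      "democrat", "republican", "other"),
  approval        = c(3, 4, 2, 5, 3, 4)
)

# Mean, SD and count of approval for each gender
by_gender <- survey |>
  group_by(gender) |>
  summarise(mean = mean(approval, na.rm = TRUE),
            sd = sd(approval, na.rm = TRUE),
            count = n())

# Same summary, grouped by political party instead
by_party <- survey |>
  group_by(political_party) |>
  summarise(mean = mean(approval, na.rm = TRUE),
            sd = sd(approval, na.rm = TRUE),
            count = n())

by_gender
by_party
```

Note that sd() returns NA for any category with only one observation, so very small groups will show a missing SD.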

It might also be helpful to have a count for each category (e.g., how many males and females). For my example, there are 11 cars with 4 cylinders, 7 with 6 cylinders and 14 with 8.

library(tidyverse)

mtcars |> 
  group_by(cyl) |> 
  summarise(mean = mean(mpg, na.rm = TRUE), 
            sd = sd(mpg, na.rm = TRUE),
            count = n()
  )
#> # A tibble: 3 × 4
#>     cyl  mean    sd count
#>   <dbl> <dbl> <dbl> <int>
#> 1     4  26.7  4.51    11
#> 2     6  19.7  1.45     7
#> 3     8  15.1  2.56    14

Created on 2022-10-09 with reprex v2.0.2

If you want one table with the results for all groupings (gender, political_party, education), I am not sure of the best way. Perhaps someone else has an idea.
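One possible approach, offered as a sketch rather than the definitive answer, is to stack the grouping variables into long format with tidyr::pivot_longer() and then group by both the variable name and its level. The column names grouping and level are my own choice, and mtcars again stands in for the survey data.

```r
library(tidyverse)

combined <- mtcars |>
  select(mpg, am, cyl) |>
  # Stack the two grouping variables: one row per car per grouping variable
  pivot_longer(c(am, cyl), names_to = "grouping", values_to = "level") |>
  group_by(grouping, level) |>
  summarise(mean = mean(mpg, na.rm = TRUE),
            sd = sd(mpg, na.rm = TRUE),
            count = n(),
            .groups = "drop")

combined
```

This works directly here because am and cyl are both numeric. With real data, where gender, political_party and education may be a mix of character and factor columns, you would probably need to convert them to a common type first, for example with mutate(across(c(gender, political_party, education), as.character)).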

You might want to look at the gtsummary package. It's great for this sort of thing.
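For instance, something along these lines should give the mean and SD of mpg and qsec split by transmission type. This is only a sketch: it assumes the gtsummary package is installed, and the choice of variables and the "{mean} ({sd})" display format are just for illustration.

```r
library(gtsummary)

summary_table <- mtcars |>
  tbl_summary(
    by = am,                                        # one column per transmission type
    include = c(mpg, qsec),                         # variables to summarise
    statistic = all_continuous() ~ "{mean} ({sd})"  # show mean (SD) instead of the default median (IQR)
  )

summary_table
```

By default tbl_summary() reports the median and interquartile range for continuous variables, which is why the statistic argument is set explicitly here.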
