Count of observations after group_by()

Hi all,

I'm trying to get the number of observations for a specific variable after using dplyr::group_by():

df %>%
  group_by(country, year) %>%
  summarise(mean = mean(variable, na.rm = TRUE),
            sd = sd(variable, na.rm = TRUE),
            N = n()) -> df2

The idea is to get the count of observations of the variable "variable" for each "country" and "year", so I can compute standard errors and confidence intervals. I believe the code above gives me the count of all rows for a country in a specific year, but because "variable" has some NAs, that isn't what I need for computing the SE and CI. To clarify my question further: I think n() above isn't using the same count as mean() does once the NAs are dropped, and that count is the one I need.

I've tried add_count() to no avail. What would you suggest? Thanks!

Are you looking for the count of values that are not NA?
N = sum(!is.na(variable)) could be what you want.
Otherwise, you could use the wt argument of tally(): %>% add_tally(wt = !is.na(variable))

But I'm not sure I understood correctly.

@cderv Thanks for your reply. I was so focused on n() that I didn't think of looking up sum(). Just to learn how to use add_tally(), could you elaborate how/where it fits in the code below instead of sum()? Thanks a lot!

df %>%
  group_by(cntry, essround) %>%
  summarise(mean = mean(trstep2, na.rm = T),
            sd = sd(trstep2, na.rm = T),
            N = sum(!is.na(trstep2))) -> df2

My understanding is that add_tally() doesn't go inside summarise(), but if I pipe it like below it doesn't work.

df %>%
  group_by(country, year) %>%
  summarise(mean = mean(variable, na.rm = T),
            sd = sd(variable, na.rm = T)) %>%
  add_tally(df, wt = !is.na(df$variable)) -> df2

Here is an example to show you the difference

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
mtcars$cyl[5] <- NA
gp_df <- mtcars %>%
  mutate(dummy_cat_for_reprex = rep_len(c("dummy1", "dummy2"), n())) %>%
  group_by(dummy_cat_for_reprex) 

gp_df %>%
  summarise(mean = mean(cyl, na.rm = T),
            sd = sd(cyl, na.rm = T),
            N = n(),
            N_without_NA = sum(!is.na(cyl)))
#> # A tibble: 2 x 5
#>   dummy_cat_for_reprex  mean    sd     N N_without_NA
#>   <chr>                <dbl> <dbl> <int>        <int>
#> 1 dummy1                6.4   1.88    16           15
#> 2 dummy2                5.88  1.71    16           16
gp_df %>%
  tally(wt = !is.na(cyl))
#> # A tibble: 2 x 2
#>   dummy_cat_for_reprex     n
#>   <chr>                <int>
#> 1 dummy1                  15
#> 2 dummy2                  16

gp_df %>%
  add_tally(wt = !is.na(cyl)) %>%
  distinct(dummy_cat_for_reprex, n)
#> # A tibble: 2 x 2
#> # Groups:   dummy_cat_for_reprex [2]
#>   dummy_cat_for_reprex     n
#>   <chr>                <int>
#> 1 dummy1                  15
#> 2 dummy2                  16

Created on 2019-01-21 by the reprex package (v0.2.1)

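Your add_tally() does not work because the table is already summarised: after summarise() there is only one row per group, and the original column is no longer there to count.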
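If you did want add_tally() in that pipeline, it would have to come before summarise(), while the raw column still exists. Here is a rough sketch of that, using the same mtcars setup as the reprex above. The names mtcars2, grp and res_tally are just made up for this illustration, n is the default column name that add_tally() creates, and carrying it through with first(n) is only one way to do it:

```r
library(dplyr)

# Same toy data as the reprex: one missing value in cyl
mtcars2 <- mtcars
mtcars2$cyl[5] <- NA

res_tally <- mtcars2 %>%
  mutate(grp = rep_len(c("dummy1", "dummy2"), n())) %>%
  group_by(grp) %>%
  # add_tally() goes BEFORE summarise(): it needs the raw cyl column
  # and adds a column n holding the per-group count of non-NA values
  add_tally(wt = !is.na(cyl)) %>%
  summarise(mean = mean(cyl, na.rm = TRUE),
            sd = sd(cyl, na.rm = TRUE),
            # n is constant within a group, so take its first value
            N = first(n))

res_tally
```

This gives the same N as sum(!is.na(cyl)) inside summarise(), just with an extra step, which is why the sum() version is probably the simpler choice here.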

For what you want to do, sum(!is.na(variable)) inside summarise() is fine, I think.
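Since you mentioned standard errors and confidence intervals, here is a minimal sketch of how the non-NA count could feed into them. You did not say which kind of interval you need, so the 95% t-based interval is my assumption, and the names mtcars2, grp, res_ci, se, t_crit, ci_low and ci_high are only for illustration:

```r
library(dplyr)

# Same toy data as the reprex: one missing value in cyl
mtcars2 <- mtcars
mtcars2$cyl[5] <- NA

res_ci <- mtcars2 %>%
  mutate(grp = rep_len(c("dummy1", "dummy2"), n())) %>%
  group_by(grp) %>%
  summarise(mean = mean(cyl, na.rm = TRUE),
            sd = sd(cyl, na.rm = TRUE),
            # count only the values that mean() and sd() actually used
            N = sum(!is.na(cyl))) %>%
  mutate(se = sd / sqrt(N),                 # standard error of the mean
         t_crit = qt(0.975, df = N - 1),    # assumed: 95% t-based interval
         ci_low = mean - t_crit * se,
         ci_high = mean + t_crit * se)

res_ci
```

Using n() here instead of the non-NA count would make N too large for any group with missing values, so the standard error would come out too small.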

I understand now, thanks a lot for taking the time to show me!

No problem! Feel free to ask!