Using Summarise function, aggregation conditionally



Is there a way to use summarise function in the following way:

df %>% groupby(ColumnA) %>% summarise(count_colB = n()) # count only if ColumnA == "str1"


What exactly are you trying to do?
From your description above you can filter df before/after grouping to only include rows with str1 in them and it will produce the result you want. But I assume there is more to your question, so can you please clarify.


Yes, exactly, you have interpreted the situation behind my question right, given what I had explained. Now, in more detail:


ColumnA -------- ColumnB
str1 --------------- Employed
str1 --------------- Employed
str1 --------------- Not Employed
str2 --------------- Employed
str2 --------------- Not Employed
str2 --------------- Not Employed
... (40,000
more rows)

Actually what I want is, is this:

after filtering df and grouping by columnA (which has cases like 'str1', 'str2', 'str3', etc), I want R to calculate counts and percentages of "Employed" in ColumnB for every str1, str2, str3, etc.

I don't want to slice data by 'str1', 'str2', 'str3', etc. since it would be very tedious, and I want all the results in same file.

I am new to R so I don't know how to go about it. Would appreciate if you could help.



I am not perfectly sure what you want either, but does this solve your problem?


# Generate random data to use in this example
df <- data.frame(ColumnA = paste0("str", sample(1:3, 15, replace = TRUE)),
                 ColumnB = sample(c("Employed", "Not Employed"), 15, replace = TRUE),
                 stringsAsFactors = FALSE)
df # print
#>    ColumnA      ColumnB
#> 1     str1 Not Employed
#> 2     str1     Employed
#> 3     str2     Employed
#> 4     str1     Employed
#> 5     str2 Not Employed
#> 6     str2 Not Employed
#> 7     str3 Not Employed
#> 8     str2 Not Employed
#> 9     str2 Not Employed
#> 10    str1     Employed
#> 11    str2     Employed
#> 12    str3 Not Employed
#> 13    str1 Not Employed
#> 14    str2 Not Employed
#> 15    str3     Employed

# Using dplyr, if you need to keep the data for further calculations this will be preferred, also calculates the precentage
df %>%
  group_by(ColumnA) %>%
  summarise(count_employed = sum(ColumnB == "Employed"),
            count_not_emplyed = sum(ColumnB == "Not Employed"),
            employed_percentage = count_employed / n() * 100)

#> # A tibble: 3 x 4
#>   ColumnA count_employed count_not_emplyed employed_percentage
#>   <chr>            <int>             <int>               <dbl>
#> 1 str1                 3                 2                60  
#> 2 str2                 2                 5                28.6
#> 3 str3                 1                 2                33.3

# Very simple approach using base that does the same if you only need the results dislayed and do not need percentages

#>        ColumnB
#> ColumnA Employed Not Employed
#>    str1        3            2
#>    str2        2            5
#>    str3        1            2


Thanks for replying.

Actually, ColumnA has names of centers, which I just mentioned as 'str1'...'str10'... Those centers already have a numeric code.. from 1 to 150. Anyways, the question was if I can tabulate the Employment percentage per center through summarize function.

I just realized, it can be done by simple table() and prop() table function also.

Now, the next problem is to save the data in user-friendly way in pdf.



There is this overview by the author of huxtable package that might be helpful for you.
Most common packages to use are kable and xtable. huxtable can be a bit of overkill for your needs, but it is a nice package that makes reporting fairly easy.