Using Summarise function, aggregation conditionally

Divya1608 · May 24, 2018, 6:18am

Is there a way to use summarise function in the following way:

df %>% groupby(ColumnA) %>% summarise(count_colB = n()) # count only if ColumnA == "str1"

mishabalyasin · May 24, 2018, 8:10am

What exactly are you trying to do?
From your description above you can filter df before/after grouping to only include rows with str1 in them and it will produce the result you want. But I assume there is more to your question, so can you please clarify.

Divya1608 · May 24, 2018, 8:41am

Yes, exactly, you have interpreted the situation behind my question right, given what I had explained. Now, in more detail:

df

ColumnA -------- ColumnB
str1 --------------- Employed
str1 --------------- Employed
str1 --------------- Not Employed
str2 --------------- Employed
str2 --------------- Not Employed
str2 --------------- Not Employed
... (40,000
more rows)

Actually what I want is, is this:

after filtering df and grouping by columnA (which has cases like 'str1', 'str2', 'str3', etc), I want R to calculate counts and percentages of "Employed" in ColumnB for every str1, str2, str3, etc.

I don't want to slice data by 'str1', 'str2', 'str3', etc. since it would be very tedious, and I want all the results in same file.

I am new to R so I don't know how to go about it. Would appreciate if you could help.

Thanks!

agranholm · May 24, 2018, 8:58am

I am not perfectly sure what you want either, but does this solve your problem?

suppressPackageStartupMessages(library(tidyverse))

# Generate random data to use in this example
set.seed(100)
df <- data.frame(ColumnA = paste0("str", sample(1:3, 15, replace = TRUE)),
                 ColumnB = sample(c("Employed", "Not Employed"), 15, replace = TRUE),
                 stringsAsFactors = FALSE)
df # print
#>    ColumnA      ColumnB
#> 1     str1 Not Employed
#> 2     str1     Employed
#> 3     str2     Employed
#> 4     str1     Employed
#> 5     str2 Not Employed
#> 6     str2 Not Employed
#> 7     str3 Not Employed
#> 8     str2 Not Employed
#> 9     str2 Not Employed
#> 10    str1     Employed
#> 11    str2     Employed
#> 12    str3 Not Employed
#> 13    str1 Not Employed
#> 14    str2 Not Employed
#> 15    str3     Employed

# Using dplyr, if you need to keep the data for further calculations this will be preferred, also calculates the precentage
df %>%
  group_by(ColumnA) %>%
  summarise(count_employed = sum(ColumnB == "Employed"),
            count_not_emplyed = sum(ColumnB == "Not Employed"),
            employed_percentage = count_employed / n() * 100)

#> # A tibble: 3 x 4
#>   ColumnA count_employed count_not_emplyed employed_percentage
#>   <chr>            <int>             <int>               <dbl>
#> 1 str1                 3                 2                60  
#> 2 str2                 2                 5                28.6
#> 3 str3                 1                 2                33.3

# Very simple approach using base that does the same if you only need the results dislayed and do not need percentages
table(df)

#>        ColumnB
#> ColumnA Employed Not Employed
#>    str1        3            2
#>    str2        2            5
#>    str3        1            2

Divya1608 · May 24, 2018, 9:10am

Thanks for replying.

Actually, ColumnA has names of centers, which I just mentioned as 'str1'...'str10'... Those centers already have a numeric code.. from 1 to 150. Anyways, the question was if I can tabulate the Employment percentage per center through summarize function.

I just realized, it can be done by simple table() and prop() table function also.

Now, the next problem is to save the data in user-friendly way in pdf.

Thanks,

mishabalyasin · May 24, 2018, 10:13am

There is this overview by the author of huxtable package that might be helpful for you.
Most common packages to use are kable and xtable. huxtable can be a bit of overkill for your needs, but it is a nice package that makes reporting fairly easy.