Summarize by distinct count of another column (Trying to replicate Excel's pivot table feature, 'summarize by distinct count' and 'summarize by distinct count percent of grant total')

Hi RStudio Community,

I'm trying to replicate a summary table in R, which I have successfully created in Excel.

Goal:

  • Create a summary table of 2 columns (country, and type), and summarize by the distinct count of another column : customer_id.

Given the following sample dataframe:

library(tidyverse)

customer_id <- c('A1', 'A1', 'A1', 'B2', 'B2', 'C3', 'C3', 'D4', 'D4', 'E5')
country <- c('Brazil', 'Brazil', 'Peru', 'Brazil', 'Costa Rica', 'Brazil', 'Peru', 'Brazil', 'Costa Rica', 'Hawaii')
type <- c('Arabica', 'Robusta', 'Excelsa', 'Arabica', 'Robusta', 'Arabica', 'Excelsa','Robusta', 'Excelsa', 'Arabica')

df <- data.frame(customer_id, country, type)

I want to pivot country and type to the rows, and create two summary columns:

  • distinct_count_of_customer_id
  • percentage_of_grand_total_distinct_count_of_customer_id

The attached screenshot from what I've done in Excel should help explain better what I'm trying to accomplish: a pivot table with country and type, summarized by the distinct count of customer ids in the customer_id column.

As you can see, there are 5 customers, so the grand total is 5. (And the grand total also sums to 100%)

I have tried various combinations of group_by(), summarize(), distinct(), and/or distinct_n()... but I just can't make it work after after a lot of research, and trial and error. Really appreciate any advice.

Getting the information in the pivot table is easy enough. Melding the two types of summaries together takes a little more work and, frankly, I find the combination confusing. For example, the 3 and the 2 in your summary of Brazil by Type do not result in the 4 in the Brazil Total row. And the 5 and the 100% in your Grand Total are not obviously the sum of any of the preceding numbers. However, this gets you most of the way to the final table.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
customer_id <- c('A1', 'A1', 'A1', 'B2', 'B2', 'C3', 'C3', 'D4', 'D4', 'E5')
country <- c('Brazil', 'Brazil', 'Peru', 'Brazil', 'Costa Rica', 'Brazil', 'Peru', 'Brazil', 'Costa Rica', 'Hawaii')
type <- c('Arabica', 'Robusta', 'Excelsa', 'Arabica', 'Robusta', 'Arabica', 'Excelsa','Robusta', 'Excelsa', 'Arabica')

df <- data.frame(customer_id, country, type)
df
#>    customer_id    country    type
#> 1           A1     Brazil Arabica
#> 2           A1     Brazil Robusta
#> 3           A1       Peru Excelsa
#> 4           B2     Brazil Arabica
#> 5           B2 Costa Rica Robusta
#> 6           C3     Brazil Arabica
#> 7           C3       Peru Excelsa
#> 8           D4     Brazil Robusta
#> 9           D4 Costa Rica Excelsa
#> 10          E5     Hawaii Arabica

TotalCust <- length(unique(df$customer_id))

SummCountryType <- df |> group_by(country, type) |> 
  summarize(N = n()) |> 
  mutate(Perc = N/TotalCust)
#> `summarise()` has grouped output by 'country'. You can override using the
#> `.groups` argument.
SummCountryType
#> # A tibble: 6 x 4
#> # Groups:   country [4]
#>   country    type        N  Perc
#>   <chr>      <chr>   <int> <dbl>
#> 1 Brazil     Arabica     3   0.6
#> 2 Brazil     Robusta     2   0.4
#> 3 Costa Rica Excelsa     1   0.2
#> 4 Costa Rica Robusta     1   0.2
#> 5 Hawaii     Arabica     1   0.2
#> 6 Peru       Excelsa     2   0.4

SummaryCountry <- df |> group_by(country) |> 
  summarize(N = length(unique(customer_id))) |> 
  mutate(Perc = N/TotalCust)
SummaryCountry
#> # A tibble: 4 x 3
#>   country        N  Perc
#>   <chr>      <int> <dbl>
#> 1 Brazil         4   0.8
#> 2 Costa Rica     2   0.4
#> 3 Hawaii         1   0.2
#> 4 Peru           2   0.4

SummaryCountry$type <- ""
GlueFunc <- function(Country) {
  tmp1 <- filter(SummCountryType, country == Country)
  tmp2 <- filter(SummaryCountry, country == Country) |> 
    mutate(country = paste(country, "Total"))
  rbind(tmp1, tmp2)
}

COUNTRIES <- unique(df$country)

FINAL <- map_dfr(COUNTRIES, GlueFunc)
FINAL
#> # A tibble: 10 x 4
#> # Groups:   country [8]
#>    country          type          N  Perc
#>    <chr>            <chr>     <int> <dbl>
#>  1 Brazil           "Arabica"     3   0.6
#>  2 Brazil           "Robusta"     2   0.4
#>  3 Brazil Total     ""            4   0.8
#>  4 Peru             "Excelsa"     2   0.4
#>  5 Peru Total       ""            2   0.4
#>  6 Costa Rica       "Excelsa"     1   0.2
#>  7 Costa Rica       "Robusta"     1   0.2
#>  8 Costa Rica Total ""            2   0.4
#>  9 Hawaii           "Arabica"     1   0.2
#> 10 Hawaii Total     ""            1   0.2

Created on 2022-05-15 by the reprex package (v2.0.1)

1 Like

This did the trick, THANK YOU!

I agree with you, it is a weird way to represent the data (where the totals don't add intuitively in the table) but it works for my use case. I wonder if there's a better way to represent the data in a tabular format -- open to any ideas you have.

(My made-up data probably made it more confusing, but you managed to solve it anyway.) :slight_smile:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.