R group small values together

Hi R Community

i have this df:

Item.group = c("Breast", "Breast", "Whole", "Whole", "Whole", "Breast", "Minced", "Thigh"),
Customer = c("D17", "D17", "D17", "D21", "D7", "D17", "D17", "D17"),
Multiple dates this time
Date = c(rep(x = "2018-01-01", times = 4), rep(x = "2018-01-02", times = 4)),
Sales = c(255, 138, 368, NA_integer_, 335, 75, 188, 118),
Campaign.flag = c(0, 0, 0, 0, 1, 0, 0, 0)

Customer is a factor with 40 levels. I want to group them together based on Sales (and the Date) so that I get the 12 biggest customers and a new group "others" that gather the remaining
28 customers with the lowest Sale.

Any good ideas?

Hi @jak123,
You need the forcats package.
It has various fct_lump_*() functions for combining factor levels which have low frequencies.

And how do that work in practice?

anyone that can help?

Hi @jak123,

library(forcats)
your_df$customer_group <- fct_lump_n(your_df$Customer, n=12)

# Check the groupings
table(your_df$customer_group)

Cool ty, but how does it know that it is in terms of Sales i want to sort the top n customers?

Hi @jak123,
Sorry, I didn't read your original question carefully enough. This is how you can arrange your customers by "total Sales" and then pick out say, the top 4, and group the rest as "5".

suppressPackageStartupMessages(library(tidyverse))

# Expand the posted data to make example clearer
your_df <- data.frame(
  Item.group = rep(c("Breast", "Breast", "Whole", "Whole", "Whole", 
                     "Breast", "Minced", "Thigh"), times=2),
  Customer = c("D17", "D17", "D17", "D21", "D7", "D17", "D17", "D17",
               "D21","D23","D11","D28","D28","D30","D31","D34"),
  Date = c(rep(x = "2018-01-01", times = 8), rep(x = "2018-01-02", times = 8)),
  Sales = c(255, 138, 368, NA_integer_, 335, 75, 188, 118,
            121, 56, 98, 145, 133, 108, 41, 67),
  Campaign.flag = c(0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0))
  
# Assume we want the top 4 customers and the rest lumped together.
your_df %>% 
  group_by(Customer) %>% 
  summarise(total_sales = sum(Sales, na.rm=TRUE),
            freq = n()) %>% 
  arrange(desc(total_sales)) %>% 
  mutate(new_cust = seq(1, nrow(.)),
         new_cust = ifelse(new_cust > 5, 5, new_cust)) -> new_df

new_df
#> # A tibble: 9 x 4
#>   Customer total_sales  freq new_cust
#>   <chr>          <dbl> <int>    <dbl>
#> 1 D17             1142     6        1
#> 2 D7               335     1        2
#> 3 D28              278     2        3
#> 4 D21              121     2        4
#> 5 D30              108     1        5
#> 6 D11               98     1        5
#> 7 D34               67     1        5
#> 8 D23               56     1        5
#> 9 D31               41     1        5

Created on 2021-09-13 by the reprex package (v2.0.1)
Further summarizing is then possible, if required.

df %>%
group_by(Customer) %>%
summarise(total_sales = sum(Sales, na.rm=TRUE),
freq = n()) %>%
arrange(desc(total_sales)) %>%
mutate(new_cust = seq(1, nrow(.)),
new_cust = ifelse(new_cust > 7, 7, new_cust)) -> new_df

Not working ( new_df is 40 observations and 4 variables) i need it to be 7 observations with the 6 biggest customers and 1 group with the others based on sales :slight_smile:

also im not sure what the two last lines does :slight_smile:

Im also trying to make a ggplot of the sales on y-axis and then facet wrap customers so I can see which customer buys different item group.

df %>%
group_by(Item.group) %>%
summarize(sum_sale = sum(Sales, na.rm = TRUE)) %>%
ggplot() +
# You can use reorder() for ascending sorting of values
geom_bar(aes(x = reorder(Item.group, sum_sale, FUN = desc),
y = sum_sale), stat = "identity") +
theme_bw() +
facet_wrap(~Customer)
xlab("Item Group") +
ylab("Sum Sales") +
scale_y_continuous(breaks = c(0,1000000, 5000000, 8000000),
labels = c(0,"1m", "5m", "8m"))

Error: At least one layer must contain all faceting variables: Customer.

  • Plot is missing Customer
  • Layer 1 is missing Customer

Any idea?

Try this:

# Assume we want the top 6 customers and the rest lumped together as "Others".
your_df %>%
  group_by(Customer) %>%
  summarise(total_sales = sum(Sales, na.rm=TRUE),
            freq = n()) %>%
  arrange(desc(total_sales)) %>%
  mutate(seq_cust = seq(1, nrow(.)),
         new_cust = ifelse(seq_cust > 6, "Others", Customer)) -> new_df

new_df

# Summarise even further
new_df %>%
  group_by(new_cust) %>%
  summarise(total_sales = sum(total_sales, na.rm=TRUE),
            freq = n()) %>% 
  arrange(desc(total_sales))

Here's how to get a facetted grid of histograms of Item.group sales per Customer

your_df %>%
  group_by(Item.group, Customer) %>%
  summarize(sum_sale = sum(Sales, na.rm = TRUE)) %>%
  ggplot() +
    geom_col(aes(x = Item.group,
                 y = sum_sale)) +
    theme_bw() +
    facet_wrap(~ Customer) +
    xlab("Item Group") +
    ylab("Sum Sales") +
    scale_y_continuous(
      breaks = c(0, 1000000, 5000000, 8000000),
      labels = c(0, "1m", "5m", "8m")
  )

Hi again and thanks a lot for the help, the point with this:

df %>%
group_by(Item.group, Customer) %>%
summarize(sum_sale = sum(Sales, na.rm = TRUE)) %>%
ggplot() +
geom_col(aes(x = Item.group,
y = sum_sale)) +
theme_bw() +
facet_wrap(~ Customer) +
xlab("Item Group") +
ylab("Sum Sales") +
scale_y_continuous(
breaks = c(0, 1000000, 5000000, 8000000),
labels = c(0, "1m", "5m", "8m")
)

was that I wanted to see top 7 customers based on sales -> but with the above code I got all 40 customers :slight_smile: can we fix that?