R group small values together

jak123 · September 6, 2021, 7:41am

Hi R Community

i have this df:

Item.group = c("Breast", "Breast", "Whole", "Whole", "Whole", "Breast", "Minced", "Thigh"),
Customer = c("D17", "D17", "D17", "D21", "D7", "D17", "D17", "D17"),
Multiple dates this time
Date = c(rep(x = "2018-01-01", times = 4), rep(x = "2018-01-02", times = 4)),
Sales = c(255, 138, 368, NA_integer_, 335, 75, 188, 118),
Campaign.flag = c(0, 0, 0, 0, 1, 0, 0, 0)

Customer is a factor with 40 levels. I want to group them together based on Sales (and the Date) so that I get the 12 biggest customers and a new group "others" that gather the remaining
28 customers with the lowest Sale.

Any good ideas?

DavoWW · September 7, 2021, 11:39am

Hi @jak123,
You need the forcats package.
It has various fct_lump_*() functions for combining factor levels which have low frequencies.

jak123 · September 8, 2021, 8:05am

And how do that work in practice?

jak123 · September 10, 2021, 5:22am

anyone that can help?

DavoWW · September 10, 2021, 8:40am

Hi @jak123,

library(forcats)
your_df$customer_group <- fct_lump_n(your_df$Customer, n=12)

# Check the groupings
table(your_df$customer_group)

jak123 · September 12, 2021, 7:43pm

Cool ty, but how does it know that it is in terms of Sales i want to sort the top n customers?

DavoWW · September 13, 2021, 3:02am

Hi @jak123,
Sorry, I didn't read your original question carefully enough. This is how you can arrange your customers by "total Sales" and then pick out say, the top 4, and group the rest as "5".

suppressPackageStartupMessages(library(tidyverse))

# Expand the posted data to make example clearer
your_df <- data.frame(
  Item.group = rep(c("Breast", "Breast", "Whole", "Whole", "Whole", 
                     "Breast", "Minced", "Thigh"), times=2),
  Customer = c("D17", "D17", "D17", "D21", "D7", "D17", "D17", "D17",
               "D21","D23","D11","D28","D28","D30","D31","D34"),
  Date = c(rep(x = "2018-01-01", times = 8), rep(x = "2018-01-02", times = 8)),
  Sales = c(255, 138, 368, NA_integer_, 335, 75, 188, 118,
            121, 56, 98, 145, 133, 108, 41, 67),
  Campaign.flag = c(0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0))
  
# Assume we want the top 4 customers and the rest lumped together.
your_df %>% 
  group_by(Customer) %>% 
  summarise(total_sales = sum(Sales, na.rm=TRUE),
            freq = n()) %>% 
  arrange(desc(total_sales)) %>% 
  mutate(new_cust = seq(1, nrow(.)),
         new_cust = ifelse(new_cust > 5, 5, new_cust)) -> new_df

new_df
#> # A tibble: 9 x 4
#>   Customer total_sales  freq new_cust
#>   <chr>          <dbl> <int>    <dbl>
#> 1 D17             1142     6        1
#> 2 D7               335     1        2
#> 3 D28              278     2        3
#> 4 D21              121     2        4
#> 5 D30              108     1        5
#> 6 D11               98     1        5
#> 7 D34               67     1        5
#> 8 D23               56     1        5
#> 9 D31               41     1        5

^{Created on 2021-09-13 by the reprex package (v2.0.1)}
Further summarizing is then possible, if required.

jak123 · September 13, 2021, 6:11am

df %>%
group_by(Customer) %>%
summarise(total_sales = sum(Sales, na.rm=TRUE),
freq = n()) %>%
arrange(desc(total_sales)) %>%
mutate(new_cust = seq(1, nrow(.)),
new_cust = ifelse(new_cust > 7, 7, new_cust)) -> new_df

Not working ( new_df is 40 observations and 4 variables) i need it to be 7 observations with the 6 biggest customers and 1 group with the others based on sales

jak123 · September 13, 2021, 6:26am

also im not sure what the two last lines does

jak123 · September 13, 2021, 7:30am

Im also trying to make a ggplot of the sales on y-axis and then facet wrap customers so I can see which customer buys different item group.

df %>%
group_by(Item.group) %>%
summarize(sum_sale = sum(Sales, na.rm = TRUE)) %>%
ggplot() +
# You can use reorder() for ascending sorting of values
geom_bar(aes(x = reorder(Item.group, sum_sale, FUN = desc),
y = sum_sale), stat = "identity") +
theme_bw() +
facet_wrap(~Customer)
xlab("Item Group") +
ylab("Sum Sales") +
scale_y_continuous(breaks = c(0,1000000, 5000000, 8000000),
labels = c(0,"1m", "5m", "8m"))

Error: At least one layer must contain all faceting variables: Customer.

Plot is missing Customer
Layer 1 is missing Customer

Any idea?

DavoWW · September 14, 2021, 3:43am

Try this:

# Assume we want the top 6 customers and the rest lumped together as "Others".
your_df %>%
  group_by(Customer) %>%
  summarise(total_sales = sum(Sales, na.rm=TRUE),
            freq = n()) %>%
  arrange(desc(total_sales)) %>%
  mutate(seq_cust = seq(1, nrow(.)),
         new_cust = ifelse(seq_cust > 6, "Others", Customer)) -> new_df

new_df

# Summarise even further
new_df %>%
  group_by(new_cust) %>%
  summarise(total_sales = sum(total_sales, na.rm=TRUE),
            freq = n()) %>% 
  arrange(desc(total_sales))

DavoWW · September 14, 2021, 3:47am

Here's how to get a facetted grid of histograms of Item.group sales per Customer

your_df %>%
  group_by(Item.group, Customer) %>%
  summarize(sum_sale = sum(Sales, na.rm = TRUE)) %>%
  ggplot() +
    geom_col(aes(x = Item.group,
                 y = sum_sale)) +
    theme_bw() +
    facet_wrap(~ Customer) +
    xlab("Item Group") +
    ylab("Sum Sales") +
    scale_y_continuous(
      breaks = c(0, 1000000, 5000000, 8000000),
      labels = c(0, "1m", "5m", "8m")
  )

jak123 · September 15, 2021, 9:18am

Hi again and thanks a lot for the help, the point with this:

df %>%
group_by(Item.group, Customer) %>%
summarize(sum_sale = sum(Sales, na.rm = TRUE)) %>%
ggplot() +
geom_col(aes(x = Item.group,
y = sum_sale)) +
theme_bw() +
facet_wrap(~ Customer) +
xlab("Item Group") +
ylab("Sum Sales") +
scale_y_continuous(
breaks = c(0, 1000000, 5000000, 8000000),
labels = c(0, "1m", "5m", "8m")
)

was that I wanted to see top 7 customers based on sales -> but with the above code I got all 40 customers can we fix that?

system · October 6, 2021, 9:19am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.