Reshaping followed by ggplot: A few questions

Piranha · February 27, 2018, 5:38pm

Hello. This is my second question on this forum, hopefully I have made everything reproducible and easy to follow. I am self-taught in R.

I am using raw data that is in the form of a .csv file. I am reading that data into R using the readxl package. I have re-created the structure of the data below (using fake numbers).

Annual sales of 5 brands (Brands A though E) for Company X

library(tidyverse)
library(readxl)

df <- tibble(
  Year = as.Date(c("2014-12-01", "2015-12-01", "2016-12-01", "2017-12-01", "2014-12-01", 
                   "2015-12-01", "2016-12-01", "2017-12-01", "2014-12-01", "2015-12-01", 
                   "2016-12-01", "2017-12-01", "2014-12-01", "2015-12-01", "2016-12-01", 
                   "2017-12-01", "2014-12-01", "2015-12-01", "2016-12-01", "2017-12-01")),
  Brand = as.character(c("Brand_A", "Brand_A", "Brand_A", "Brand_A", "Brand_B", "Brand_B", "Brand_B", 
                         "Brand_B", "Brand_C", "Brand_C", "Brand_C", "Brand_C", "Brand_D", "Brand_D", 
                         "Brand_D", "Brand_D", "Brand_E", "Brand_E", "Brand_E", "Brand_E")), 
  Sales = c(1200, 1100, 1000, 950, 290, 265, 225, 200, 730, 900, 950, 1500, 500, 700, 900, 
            1000, 1000, 1100, 1200, 1300))

Lets say that I am an analyst at company X. My business objectives with the data:

For my business, I want to create the following (basic) business insights:

What was my sales mix by Brand in 2016?
What was my sales mix by Brand in 2017?
Create a simple line chart with the Year-over-Year percentagae sales growth by brand
for only my top 3 brands as of 2016
Create a simple line chart with the Year-over-Year percentagae sales growth by brand
for only my top 3 brands as of 2017

My approach to the solution is shown below. But first, here are my questions to the community

Q1. Maybe it is because I am still a beginner, but my approach below seems to have a lot of steps and intermediate data frames. Do you have any suggestions for simplifying the code?

Q2a. In order to calculate the sales mix (and rank the brands) I reshaped the original data from the "long" form, into the "wide" form. Is there a way to calculate sum of the variables for each point in the time series (in my case below, rowSums), without having the convert from long to wide?

Q2b. In order to create the multi-variable line chart in ggplot, I took the "wide" format data and turned it back into a "long" format, then used that as the data argument in my ggplot. Related to Q2a, is there a way to take the steps: long data >> wide data >> calculations >> long data >> ggplot, without having to make multiple flips between long and wide?

Q3. In the two output charts, you can see that the variables have different colors. E.g. in 2016 Brand_C is green, but in 2017 Brand_C is red. Is there a way to assign a color to a variable so that they stay consistent across charts, even if their order of appearance changes. In my case, the order of appearance for the variables went from A,C,E in 2016 to C,D,E in 2017.

Here is my code:

# long to wide
df_wide <- df %>%    
  spread(Brand, Sales)

 # calculate Sum of all brands, then % sales mix
df_product_mix <- df_wide %>%  
  mutate(Total_Company = rowSums(.[2:ncol(.)], na.rm = TRUE)) %>%
  mutate_at(2:ncol(.), function(x){x/.$Total_Company * 100})

# Find the names of the Top 3 brands, this will be used later
Top3_2016 <- df_product_mix %>%
  select(-Total_Company) %>%
  filter(Year == "2016-12-01") %>%
  gather(Brand, Mix, -Year) %>%
  arrange(desc(Mix)) %>%
  .[["Brand"]] %>%
  head(3)

# Find the names of the Top 3 brands, this will be used later
Top3_2017 <- df_product_mix %>%
  select(-Total_Company) %>%
  filter(Year == "2017-12-01") %>%
  gather(Brand, Mix, -Year) %>%
  arrange(desc(Mix)) %>%
  .[["Brand"]] %>%
  head(3)

# Function to Calculate Percentage Change vs Year Ago

YY <-  function(x){
  ((x - lag(x,1))/lag(x,1)) * 100 
}

# data frame with the YY calcs
df_wide_YY <- df_wide %>%
  mutate_if(is.numeric, YY)

# 2016 data frame that will be the data argument for ggplot
For_2016_chart <- df_wide_YY %>%
  select(Year, Top3_2016) %>%
  filter(Year != "2017-12-01") %>%
  gather(Brand, Dollars, -Year) %>%
  na.omit()

# 2017 data frame that will be the data argument for ggplot
For_2017_chart <- df_wide_YY %>%
  select(Year, Top3_2017) %>%
  gather(Brand, Dollars, -Year) %>%
  na.omit()  

# ggplot for 2016 sales growth
Chart_2016 <- ggplot(data = For_2016_chart, aes(x=Year, y=Dollars, color=Brand)) +
  geom_line() +
  geom_hline(yintercept=0) +
  scale_x_date(breaks = For_2016_chart$Year, date_labels = "%b-%y") +
  ggtitle("YY Growth Rates: Top 3 Brands of 2016")
Chart_2016

# ggplot for 2017 sales growth
Chart_2017 <- ggplot(data = For_2017_chart, aes(x=Year, y=Dollars, color=Brand)) +
  geom_line() +
  geom_hline(yintercept=0) +
  scale_x_date(breaks = For_2017_chart$Year, date_labels = "%b-%y") +
  ggtitle("YY Growth Rates: Top 3 Brands of 2017")
Chart_2017

mara · February 27, 2018, 6:10pm

Apologies that I don't have time to give you a proper answer, but you might want to check out the dplyr group_by() function. It can save you from doing that wide conversion.

library(tidyverse)
library(readxl)

df <- tibble(
  Year = as.Date(c("2014-12-01", "2015-12-01", "2016-12-01", "2017-12-01", "2014-12-01", 
                   "2015-12-01", "2016-12-01", "2017-12-01", "2014-12-01", "2015-12-01", 
                   "2016-12-01", "2017-12-01", "2014-12-01", "2015-12-01", "2016-12-01", 
                   "2017-12-01", "2014-12-01", "2015-12-01", "2016-12-01", "2017-12-01")),
  Brand = as.character(c("Brand_A", "Brand_A", "Brand_A", "Brand_A", "Brand_B", "Brand_B", "Brand_B", 
                         "Brand_B", "Brand_C", "Brand_C", "Brand_C", "Brand_C", "Brand_D", "Brand_D", 
                         "Brand_D", "Brand_D", "Brand_E", "Brand_E", "Brand_E", "Brand_E")), 
  Sales = c(1200, 1100, 1000, 950, 290, 265, 225, 200, 730, 900, 950, 1500, 500, 700, 900, 
            1000, 1000, 1100, 1200, 1300))

df %>%
  mutate(Year = lubridate::year(Year)) %>%
  group_by(Year) %>%
  summarise(year_total = sum(Sales))
#> # A tibble: 4 x 2
#>    Year year_total
#>   <dbl>      <dbl>
#> 1 2014.      3720.
#> 2 2015.      4065.
#> 3 2016.      4275.
#> 4 2017.      4950.

df %>%
  group_by(Brand) %>%
  summarise(total_sales = sum(Sales))
#> # A tibble: 5 x 2
#>   Brand   total_sales
#>   <chr>         <dbl>
#> 1 Brand_A       4250.
#> 2 Brand_B        980.
#> 3 Brand_C       4080.
#> 4 Brand_D       3100.
#> 5 Brand_E       4600.

Created on 2018-02-27 by the reprex package (v0.2.0).

Piranha · February 27, 2018, 6:15pm

Thank you!

I am going to look into group_by. I was aware of that function, but had trouble with it earlier and was put off from using it for some reason. Will do some more reading on it now.

joels · March 28, 2018, 9:35pm

You can save a little typing by doing group_by(Year = lubridate::year(Year)), which avoids the need for mutate.

joels · March 28, 2018, 10:15pm

If you want to highlight changes in rank from year to year, how about a stacked bar chart with stacking order varying based on annual sales rank?

library(tidyverse)
library(lubridate)
library(scales)

df %>% 
  # Calculate total sales by brand by year
  group_by(Year=year(Year), Brand) %>% 
  summarise(tot_sales = sum(Sales)) %>%
  # Order brand within each Year by tot_sales
  arrange(Year, desc(tot_sales)) %>% 
  # Create an ordering variable based on the new ordering
  mutate(order_var = paste(Year, Brand)) %>% 
  ungroup %>% 
  mutate(order_var = factor(order_var, levels=unique(order_var))) %>% 
  # Pipe the data into ggplot
  # group=order_var stacks the brands by their sales rank within each year
  ggplot(aes(Year, tot_sales, fill=Brand, group=order_var)) +
    # Stacked bars
    geom_col(colour="white", size=0.3) +
    # Text labels with brand and sales
    geom_text(aes(label=paste0(gsub("Brand_", "", Brand), ": $", format(tot_sales, big.mark=",", trim=TRUE))), 
              position=position_stack(vjust=0.5),
              colour="white", size=3) +
    theme_classic() +
    labs(y="", title="Annual Sales by Brand",
         caption="Within each year, brands are stacked in order of total sales") +
    scale_y_continuous(labels=dollar) +
    guides(fill=FALSE) +
    theme(plot.title=element_text(hjust=0.5),
          plot.caption=element_text(margin=margin(t=12), size=8, hjust=0.5))

Rplot112

joels · March 28, 2018, 10:54pm

Regarding your question Q3 about consistent colors for the same variable across different plots: You can do this by creating a named vector of colors. Here's an example:

n = n_distinct(df$Brand)
color_vec = set_names(hcl(seq(15,375,length=n + 1)[1:n], 100, 65), unique(df$Brand))

color_vec

##   Brand_A   Brand_B   Brand_C   Brand_D   Brand_E 
##  "#F8766D" "#A3A500" "#00BF7D" "#00B0F6" "#E76BF3"

Then you can add the following to each plot:

+ scale_color_manual(values=color_vec)

Since each color is explicitly assigned to a brand, the colors will be consistent across plots.

Piranha · March 29, 2018, 8:27pm

@joels Thank you for these suggestions. I will give these a try!