Hello. This is my second question on this forum, hopefully I have made everything reproducible and easy to follow. I am self-taught in R.
I am using raw data that is in the form of a .csv file. I am reading that data into R using the readxl package. I have re-created the structure of the data below (using fake numbers).
Annual sales of 5 brands (Brands A though E) for Company X
library(tidyverse)
library(readxl)
df <- tibble(
Year = as.Date(c("2014-12-01", "2015-12-01", "2016-12-01", "2017-12-01", "2014-12-01",
"2015-12-01", "2016-12-01", "2017-12-01", "2014-12-01", "2015-12-01",
"2016-12-01", "2017-12-01", "2014-12-01", "2015-12-01", "2016-12-01",
"2017-12-01", "2014-12-01", "2015-12-01", "2016-12-01", "2017-12-01")),
Brand = as.character(c("Brand_A", "Brand_A", "Brand_A", "Brand_A", "Brand_B", "Brand_B", "Brand_B",
"Brand_B", "Brand_C", "Brand_C", "Brand_C", "Brand_C", "Brand_D", "Brand_D",
"Brand_D", "Brand_D", "Brand_E", "Brand_E", "Brand_E", "Brand_E")),
Sales = c(1200, 1100, 1000, 950, 290, 265, 225, 200, 730, 900, 950, 1500, 500, 700, 900,
1000, 1000, 1100, 1200, 1300))
Lets say that I am an analyst at company X. My business objectives with the data:
For my business, I want to create the following (basic) business insights:
- What was my sales mix by Brand in 2016?
- What was my sales mix by Brand in 2017?
- Create a simple line chart with the Year-over-Year percentagae sales growth by brand
for only my top 3 brands as of 2016 - Create a simple line chart with the Year-over-Year percentagae sales growth by brand
for only my top 3 brands as of 2017
My approach to the solution is shown below. But first, here are my questions to the community
Q1. Maybe it is because I am still a beginner, but my approach below seems to have a lot of steps and intermediate data frames. Do you have any suggestions for simplifying the code?
Q2a. In order to calculate the sales mix (and rank the brands) I reshaped the original data from the "long" form, into the "wide" form. Is there a way to calculate sum of the variables for each point in the time series (in my case below, rowSums), without having the convert from long to wide?
Q2b. In order to create the multi-variable line chart in ggplot, I took the "wide" format data and turned it back into a "long" format, then used that as the data argument in my ggplot. Related to Q2a, is there a way to take the steps: long data >> wide data >> calculations >> long data >> ggplot, without having to make multiple flips between long and wide?
Q3. In the two output charts, you can see that the variables have different colors. E.g. in 2016 Brand_C is green, but in 2017 Brand_C is red. Is there a way to assign a color to a variable so that they stay consistent across charts, even if their order of appearance changes. In my case, the order of appearance for the variables went from A,C,E in 2016 to C,D,E in 2017.
Here is my code:
# long to wide
df_wide <- df %>%
spread(Brand, Sales)
# calculate Sum of all brands, then % sales mix
df_product_mix <- df_wide %>%
mutate(Total_Company = rowSums(.[2:ncol(.)], na.rm = TRUE)) %>%
mutate_at(2:ncol(.), function(x){x/.$Total_Company * 100})
# Find the names of the Top 3 brands, this will be used later
Top3_2016 <- df_product_mix %>%
select(-Total_Company) %>%
filter(Year == "2016-12-01") %>%
gather(Brand, Mix, -Year) %>%
arrange(desc(Mix)) %>%
.[["Brand"]] %>%
head(3)
# Find the names of the Top 3 brands, this will be used later
Top3_2017 <- df_product_mix %>%
select(-Total_Company) %>%
filter(Year == "2017-12-01") %>%
gather(Brand, Mix, -Year) %>%
arrange(desc(Mix)) %>%
.[["Brand"]] %>%
head(3)
# Function to Calculate Percentage Change vs Year Ago
YY <- function(x){
((x - lag(x,1))/lag(x,1)) * 100
}
# data frame with the YY calcs
df_wide_YY <- df_wide %>%
mutate_if(is.numeric, YY)
# 2016 data frame that will be the data argument for ggplot
For_2016_chart <- df_wide_YY %>%
select(Year, Top3_2016) %>%
filter(Year != "2017-12-01") %>%
gather(Brand, Dollars, -Year) %>%
na.omit()
# 2017 data frame that will be the data argument for ggplot
For_2017_chart <- df_wide_YY %>%
select(Year, Top3_2017) %>%
gather(Brand, Dollars, -Year) %>%
na.omit()
# ggplot for 2016 sales growth
Chart_2016 <- ggplot(data = For_2016_chart, aes(x=Year, y=Dollars, color=Brand)) +
geom_line() +
geom_hline(yintercept=0) +
scale_x_date(breaks = For_2016_chart$Year, date_labels = "%b-%y") +
ggtitle("YY Growth Rates: Top 3 Brands of 2016")
Chart_2016
# ggplot for 2017 sales growth
Chart_2017 <- ggplot(data = For_2017_chart, aes(x=Year, y=Dollars, color=Brand)) +
geom_line() +
geom_hline(yintercept=0) +
scale_x_date(breaks = For_2017_chart$Year, date_labels = "%b-%y") +
ggtitle("YY Growth Rates: Top 3 Brands of 2017")
Chart_2017