Creating a barplot with ordered bars

dplyr
ggplot2

#1

Hi,

I'm trying to create a barplot with bars ordered from the most frequent category to the least frequent one (btw, this is the right plot to create for factor variables, right? A boxplot would only make sense for categorical x and continuous y). I know of this question, which is similar:

But it's not the same: I don't have any facets here. my_df has only two columns: month, containing abbreviations of the first 10 months of the year, and state, which is either on or off. I want to create a barplot which shows the counts for each month, ideally by state, and ordered by count. I tried ordering my data frame by month count (sorted_df_easy) or by state and month count (sorted_df_hard) before plotting it. Neither approach works:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)
library(ggplot2)
# library(microbenchmark)

n <- 10^5
key <- as.factor(sample(month.abb[1:10], 10))
my_df <- data.frame(month = sample(key, n, replace = TRUE, prob = seq(0.1, 1, 0.1)), 
                    state = sample(c("on", "off"), n, replace = TRUE))
my_df$month[sample(seq_len(n), 100)] <- NA

sorted_df_easy <- my_df %>%
  count(month) %>%
  arrange(-n)

# this doesn't work
ggplot(sorted_df_easy, aes(x = month, y = n)) +
  geom_bar(stat="identity") + 
  coord_flip()


sorted_df_hard <- my_df %>%
  count(state, month) %>%
  arrange(state, -n)

# of course, this is even worse
ggplot(sorted_df_hard, aes(x = month, y = n, fill = state)) +
  geom_bar(stat="identity") + 
  coord_flip()

Created on 2018-09-04 by the reprex package (v0.2.0).

Any solutions? Preferably, I'd rather not use forcats: this is for an edge system, and the less stuff I depend on, the better (that's why I don't load tidyverse, btw). Of course, if the forcats solution is considerably shorter and more readable than the non-forcats one, I could change my mind.


#2

You can do it by calling factor() inside a mutate() call after your arrange() step, so that the factor levels follow the sorted row order. Here is an example:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)
library(ggplot2)
# library(microbenchmark)

n <- 10^5
key <- as.factor(sample(month.abb[1:10], 10))
my_df <- data.frame(month = sample(key, n, replace = TRUE, prob = seq(0.1, 1, 0.1)), 
                    state = sample(c("on", "off"), n, replace = TRUE))
my_df$month[sample(seq_len(n), 100)] <- NA

sorted_df_easy <- my_df %>%
  count(month) %>%
  arrange(-n) %>% 
  mutate(month = factor(month, levels = unique(month)))

# this now works: ggplot2 orders the bars by the factor levels set above
ggplot(sorted_df_easy, aes(x = month, y = n)) +
  geom_bar(stat="identity") + 
  coord_flip()



sorted_df_hard <- my_df %>%
  count(state, month) %>%
  arrange(state, -n) %>% 
  mutate(month = factor(month, levels = unique(month)))

# this works too; the levels follow the counts within the first state ("off")
ggplot(sorted_df_hard, aes(x = month, y = n, fill = state)) +
  geom_bar(stat="identity") + 
  coord_flip()

Created on 2018-09-04 by the reprex package (v0.2.0).
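
For readers wondering why arrange() alone had no effect: ggplot2 lays out a discrete axis in factor-level order, not in row order. The sketch below shows this with base R only, and uses reorder() from the stats package as one possible way to order the levels by the total count across both states. The toy data frame `counts` and its numbers are made up for illustration; they stand in for the output of count(state, month).

```r
# Hypothetical toy counts, standing in for the output of count(state, month)
counts <- data.frame(
  state = rep(c("off", "on"), each = 3),
  month = factor(rep(c("Jan", "Feb", "Mar"), times = 2)),
  n     = c(10, 30, 20, 5, 50, 45)
)

# Sorting the rows leaves the factor levels untouched (still alphabetical),
# which is why the bars do not move after arrange()
sorted_rows <- counts[order(-counts$n), ]
levels(sorted_rows$month)

# reorder() sorts the levels by a per-level summary of its second argument;
# negating n and summing puts the month with the largest total first
counts$month <- reorder(counts$month, -counts$n, FUN = sum)
levels(counts$month)
```

The reordered month column can then be passed to ggplot() exactly as in the code above. Unlike the unique(month) approach, this orders by the total over both states rather than by the counts of whichever state happens to be sorted first.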


#3

Here is the same thing just with forcats (just for comparison):

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)
library(ggplot2)
library(forcats)


n <- 10^5
key <- as.factor(sample(month.abb[1:10], 10))
my_df <- data.frame(month = sample(key, n, replace = TRUE, prob = seq(0.1, 1, 0.1)), 
                    state = sample(c("on", "off"), n, replace = TRUE))
my_df$month[sample(seq_len(n), 100)] <- NA

sorted_df_easy <- my_df %>%
  count(month) %>%
  mutate(month = fct_reorder(month, -n))

# this now works: fct_reorder() sets the levels by decreasing n
ggplot(sorted_df_easy, aes(x = month, y = n)) +
  geom_bar(stat="identity") + 
  coord_flip()



sorted_df_hard <- my_df %>%
  count(state, month) %>%
  mutate(month = fct_reorder(month, -n))

# this works too; fct_reorder() summarises -n per month with the median by default
ggplot(sorted_df_hard, aes(x = month, y = n, fill = state)) +
  geom_bar(stat="identity") + 
  coord_flip()

Created on 2018-09-04 by the reprex package (v0.2.0).
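
One detail worth checking in all the versions above: with coord_flip() the first factor level is drawn at the bottom of the plot, so levels sorted by decreasing count put the most frequent month at the bottom. If the most frequent month should appear at the top, reversing the levels is one option. A minimal base R sketch, with a made-up data frame `counts`:

```r
# Hypothetical toy counts, one row per month
counts <- data.frame(
  month = c("Jan", "Feb", "Mar"),
  n     = c(15, 80, 65),
  stringsAsFactors = FALSE
)

# Sort rows so the most frequent month comes first
counts <- counts[order(-counts$n), ]

# Reverse the level order: the most frequent month becomes the LAST level,
# which coord_flip() draws at the top of the plot
counts$month <- factor(counts$month, levels = rev(unique(counts$month)))
levels(counts$month)
```

With forcats, the equivalent would be to reorder by n instead of -n.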


#4

You could replace the line
key <- as.factor(sample(month.abb[1:10], 10))
with:
key <- factor(sample(month.abb[1:10], 10), levels = month.abb)
to get the factor levels in calendar order, rather than just alphabetical.
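
A quick way to see the difference between the two lines is to compare the resulting levels. This is only a sketch; the seed and the variable names `months`, `alphabetical` and `calendar` are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the shuffle repeatable
months <- sample(month.abb[1:10], 10)

# as.factor() sorts the levels alphabetically
alphabetical <- as.factor(months)
levels(alphabetical)

# Passing levels = month.abb keeps calendar order (Jan, Feb, ..., Dec)
calendar <- factor(months, levels = month.abb)
levels(calendar)
```

Note that calendar also carries Nov and Dec as unused levels; ggplot2's discrete scales drop unused levels by default, so they should not show up as empty bars.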

I would then actually query whether ordering the bars by count makes sense, but if this is what you want then @tbradley has provided the solution.


#5

Amaaaazing! Thank you very much :blush:


#6

This is a good point. Ordering by count on a variable that is already inherently ordered may be confusing to your audience. So it is something to consider.


#7

No, it's actually correct in my use case. I kept it simple to avoid bothering you with a more complex data frame, but let's say that in my real case the inherent ordering of the variable doesn't matter. Think of it this way: you may want to know, across several years, which month is the one where a certain event happens most often. In this case it's October, so it makes sense to order by count and not by calendar order.