Data wrangling and line graphs

ggplot2

#1

I am new to R, so I suspect this is a common question. I am importing data from an Excel file (neb_rail). One of the columns contains the month as a character string (e.g., January, February, etc.), and another contains the year as a double. A sample of the tibble is below.

# A tibble: 77 x 6
    year month      vol_m3 volume_m3d volume_bbl volume_bpd
   <dbl> <chr>       <dbl>      <dbl>      <dbl>      <dbl>
 1 2018. May       979274.     31589.   6162433.    198788.
 2 2018. April     922323.     30744.   5804048.    193468.
 3 2018. March     840522.     27114.   5289282.    170622.
 4 2018. February  596565.     21306.   3754096.    134075.
 5 2018. January   717726.     23152.   4516550.    145695.
 6 2017. December  748489.     24145.   4710133.    151940.
 7 2017. November  707046.     23568.   4449337.    148311.
 8 2017. October   675768.     21799.   4252509.    137178.
 9 2017. September 639449.     21315.   4023963.    134132.
10 2017. August    590833.     19059.   3718026.    119936.
# ... with 67 more rows

I am trying to graph the data using a multiple line graph, grouped by year with x-axis as month. The code is as follows:

ggplot(data=neb_rail) +
  geom_line(mapping=aes(x=month, y= volume_bpd, group = year, 
                        color=year), size = 1) +
  labs(y="bbl/d", x="Month", title="Oil Exports by Rail") 

I am struggling to get the x-axis to plot in the proper order (Jan, Feb, Mar, etc.) and to make the line colors more contrasted than the current ones, which are shades of blue and black. A snapshot of the graph is as follows:

I tried converting month to factor, numeric, etc., and nothing worked. Any suggestions on how to fix this would be greatly appreciated.

I apologize if this is not posted correctly. This is my first post.


#2

Change group = year for group = factor(year)


#3

Take a look at this post that describes the process for creating a reproducible example (especially the last post by John Mount describing how to use his package to create an example of a data frame):

That being said, the problem you are seeing arises because your month column is currently a character, while it should be a date. Take a look at the lubridate package, specifically the month() and ymd() functions.
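
A minimal sketch of the year-plus-month-name to date conversion suggested above. It uses base R's match() and as.Date() in place of lubridate so that it runs without extra packages (lubridate offers equivalent helpers). neb_rail_small is a hypothetical stand-in built from three rows of the sample, and pinning each date to the first day of the month is an assumption.

```r
# Hypothetical stand-in for a few rows of neb_rail (values taken from the sample)
neb_rail_small <- data.frame(
  year  = c(2018, 2018, 2017),
  month = c("May", "April", "December"),
  stringsAsFactors = FALSE
)

# Position (1-12) of each month name in the built-in month.name vector
month_num <- match(neb_rail_small$month, month.name)

# Build a Date for the first day of each month (assumed day = 1)
neb_rail_small$date <- as.Date(
  sprintf("%d-%02d-01", as.integer(neb_rail_small$year), month_num)
)

format(neb_rail_small$date)
# "2018-05-01" "2018-04-01" "2017-12-01"
```

Using match() against month.name avoids the locale dependence of parsing month names with a "%B" format string.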


#4

Didn't work as suggested, so I modified it using the following:

neb_rail$year <- as.factor(neb_rail$year)
neb_rail$month <- as.factor(neb_rail$month)

That helped the graphing colors, but the x-axis is still messed up.
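
A small sketch of why as.factor() alone likely leaves the axis out of order: when no levels are given, the levels are the sorted unique values, which for month names means alphabetical order. months_seen is a hypothetical vector used only for illustration.

```r
# Hypothetical sample of month names, in the order they appear in the data
months_seen <- c("May", "April", "March", "February", "January")

# as.factor() derives the levels by sorting the unique values alphabetically
alphabetical <- as.factor(months_seen)
levels(alphabetical)
# "April" "February" "January" "March" "May"

# Passing levels explicitly keeps calendar order instead
calendar <- factor(months_seen, levels = month.name)
levels(droplevels(calendar))
# "January" "February" "March" "April" "May"
```

A discrete axis in ggplot2 follows the factor's level order, so the alphabetical levels carry straight through to the plot.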

Will try lubridate.

Thanks for the feedback.


#5

To get the months in the correct order, convert month to a factor and set the levels in the desired order. This can be done using the built-in month.name vector, so you don't have to type out all the months. In the code below, I've done this on the fly, but you can also do this before plotting. To get discrete colors, use color=factor(year), which turns year into a categorical variable (no need to change the underlying data).

library(tidyverse)
library(scales)

# Fake data
set.seed(2)
d = tibble(year=rep(2012:2018, each=12), 
           month=rep(month.name, 7),
           volume_bpd = c(replicate(7, 1e5 + cumsum(rnorm(12, 0, 10000)))))

ggplot(d %>% mutate(month=factor(month, levels=month.name))) +
  geom_line(aes(x=month, y= volume_bpd, group=year, color=factor(year)), size = 1) +
  labs(y="bbl/d", x="Month", title="Oil Exports by Rail", colour="Year")

Because of the large number of lines crossing each other, this plot might be easier to read with each line directly labeled. I've also switched to month abbreviations and made a few other changes.

# Add month abbreviations to the data frame
d = d %>% 
  left_join(tibble(month=month.name, mon=month.abb), by="month") %>% 
  mutate(mon=factor(mon, levels=month.abb))

ggplot(d, aes(x=mon, y= volume_bpd, group=year, color=factor(year))) +
  geom_line(size = 1) +
  geom_text(data=d %>% filter(mon=="Dec"), aes(label=year), 
                    position=position_nudge(0.1), hjust=0) +
  labs(y="bbl/d", x="Month", title="Oil Exports by Rail", colour="Year") +
  theme_classic() +
  expand_limits(x=length(unique(d$month)) + 0.8) +
  scale_y_continuous(labels=comma) +
  guides(colour="none")
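
As an alternative to the join used above for the abbreviations, the same lookup can be sketched with match(), since month.name and month.abb are parallel built-in vectors. month_full is a hypothetical input used only for illustration.

```r
# Hypothetical input: full month names as they would appear in the data
month_full <- c("January", "May", "December")

# The position of each name in month.name indexes the matching abbreviation
mon <- factor(month.abb[match(month_full, month.name)], levels = month.abb)

as.character(mon)
# "Jan" "May" "Dec"
as.integer(mon)
# 1 5 12
```

The integer codes show that the factor keeps calendar order, which is what the x-axis relies on.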


#6

Edited to add: I didn't check for updates before posting :woman_facepalming: ... What @joels said, all of it! (I also endorse making year a factor on the fly, even though that's not what I did below)


I am usually the first to tell people to convert their date-like data into actual dates, but in this scenario I am actually going to advocate for making month and year factors. Your x-axis doesn't represent a single progression through time, but instead several years overlaid. If we expect the month positions to actually represent, say, the first day of each month in each year, they wouldn't even have consistent spacing since there are leap years in your dataset. In this scenario, it seems to me that months are functioning as category labels with an intrinsic order to them, the exact thing factors were made for.
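
To make the spacing point concrete, here is a minimal sketch comparing the gaps between month starts in a leap year (2016) and a non-leap year (2017). The variable names are hypothetical and used only for illustration.

```r
# First day of January through March for a leap year and a non-leap year
starts_2016 <- as.Date(c("2016-01-01", "2016-02-01", "2016-03-01"))
starts_2017 <- as.Date(c("2017-01-01", "2017-02-01", "2017-03-01"))

# Days between consecutive month starts
gaps_2016 <- as.integer(diff(starts_2016))
gaps_2017 <- as.integer(diff(starts_2017))

gaps_2016
# 31 29
gaps_2017
# 31 28
```

On a true date axis, March would therefore sit one day further from February in 2016 than in 2017, so overlaid years would not line up exactly.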

Regardless of what you do with month, I think you definitely want to convert year into a factor, as @felipeflores suggested.

Since year is currently numeric, ggplot() doesn't know that it's supposed to function as a set of category labels with visually distinguishable colors. Instead, it's treating it as numbers (that happen to range between 2,012 and 2,018), and applying a continuous color ramp.
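
A quick sketch of the type difference behind this behaviour: ggplot2 chooses a continuous or discrete scale from the type of the mapped vector. year_num and year_fac are hypothetical names used only for illustration.

```r
# Hypothetical year values, first as numbers and then as a factor
year_num <- c(2012, 2015, 2018)
year_fac <- factor(year_num)

is.numeric(year_num)
# TRUE  (mapped to a continuous colour ramp)
is.numeric(year_fac)
# FALSE (mapped to discrete, distinguishable colours)
levels(year_fac)
# "2012" "2015" "2018"
```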

Here are some code examples, working with the limited glimpse of the data provided, reformatted by hand :weary: into something usable (I definitely concur about checking out the post about how to include data in your questions!).

To make the consequences of the two approaches (date vs. factor) more obvious, I've tweaked the data a bit so that the 2017 and 2018 data cover the same range of months (as is the case in the full data set).

library(tidyverse)

neb_rail <- wrapr::build_frame(
  "year", "month"    , "vol_m3", "volume_m3d", "volume_bbl", "volume_bpd" |
    2018L , "May"      , 979274L , 31589L      , 6162433L    , 198788L      |
    2018L , "April"    , 922323L , 30744L      , 5804048L    , 193468L      |
    2018L , "March"    , 840522L , 27114L      , 5289282L    , 170622L      |
    2018L , "February" , 596565L , 21306L      , 3754096L    , 134075L      |
    2018L , "January"  , 717726L , 23152L      , 4516550L    , 145695L      |
    2017L , "May"      , 748489L , 24145L      , 4710133L    , 151940L      |
    2017L , "April"    , 707046L , 23568L      , 4449337L    , 148311L      |
    2017L , "March"    , 675768L , 21799L      , 4252509L    , 137178L      |
    2017L , "February" , 639449L , 21315L      , 4023963L    , 134132L      |
    2017L , "January"  , 590833L , 19059L      , 3718026L    , 119936L      )

# `month` and `year` are factors
neb_rail_fac <- neb_rail %>% 
  mutate(
    year = factor(year),
    month = factor(month, levels = month.name)
  )

ggplot(data = neb_rail_fac) +
  geom_line(mapping = aes(
    x = month,
    y = volume_bpd,
    group = year,
    color = year
  ),
  size = 1) +
  labs(y = "bbl/d", x = "Month", title = "Oil Exports by Rail") 

# `year` is a factor, create a new date variable
neb_rail_date <- neb_rail %>% 
  mutate(
    year_month = lubridate::parse_date_time(paste(year, month), orders = "ym"),
    year = factor(year)
  )

ggplot(data = neb_rail_date) +
  geom_line(mapping = aes(
    x = year_month,
    y = volume_bpd,
    group = year,
    color = year
  ),
  size = 1) +
  labs(y = "bbl/d", x = "Month", title = "Oil Exports by Rail") 

Created on 2018-07-31 by the reprex package (v0.2.0).