Plotting Annual Geometric Means On a Time Series Scatterplot

Hello, I am trying to plot concentration of total phosphorus (tp) over a long (mulit-decade) time period. Samples were collected varying number of times per year. I am able to make the scatterplot, but I need to plot a line with points representing the annual geometric means. Is there a way to do this, or do I have to make multiple graphs (after calculating annual geometric means)?

I have looked at SteveXD solution, but in his example he only has values for each year (e.g., 1, 2, etc.), and not months within those years: Plot an average line over the scatter plot within each cell of facet_grid

Here is a mall subset of my data (I actually have several lakes, and parameters ("parm"), with 20 years,. Datafile is named "tpexample", with six columns:

tpexample <- tibble::tribble(
                 ~lake,     ~source,       ~date, ~parm, ~value,  ~unit,
               "lake1", "lakewatch", "30-Jul-92",  "tp",   0.06, "mg/L",
               "lake1", "lakewatch", "18-Aug-92",  "tp",   0.07, "mg/L",
               "lake1", "lakewatch", "29-Sep-92",  "tp",   0.13, "mg/L",
               "lake1", "lakewatch", "29-Oct-92",  "tp",    0.1, "mg/L",
               "lake1", "lakewatch", "16-Nov-92",  "tp",   0.16, "mg/L",
               "lake1", "lakewatch", "16-Dec-92",  "tp",   0.13, "mg/L",
               "lake1", "lakewatch", "19-Jan-93",  "tp",   0.09, "mg/L",
               "lake1", "lakewatch", "10-Feb-93",  "tp",   0.09, "mg/L",
               "lake1", "lakewatch", "24-Mar-93",  "tp",   0.05, "mg/L",
               "lake1", "lakewatch", "28-Apr-93",  "tp",   0.05, "mg/L",
               "lake1", "lakewatch", "28-May-93",  "tp",   0.04, "mg/L",
               "lake1", "lakewatch", "29-Jun-93",  "tp",   0.04, "mg/L",
               "lake1", "lakewatch", "29-Jul-93",  "tp",   0.04, "mg/L",
               "lake1", "lakewatch", "20-Aug-93",  "tp",   0.03, "mg/L",
               "lake1", "lakewatch", "27-Sep-93",  "tp",   0.03, "mg/L",
               "lake1", "lakewatch", "20-Oct-93",  "tp",   0.04, "mg/L",
               "lake1", "lakewatch", "22-Nov-93",  "tp",   0.04, "mg/L",
               "lake1", "lakewatch", "22-Dec-93",  "tp",   0.05, "mg/L",
               "lake1", "lakewatch", "26-Jan-94",  "tp",   0.06, "mg/L",
               "lake1", "lakewatch", "23-Feb-94",  "tp",   0.05, "mg/L",
               "lake1", "lakewatch", "23-Mar-94",  "tp",   0.04, "mg/L",
               "lake1", "lakewatch", "27-Apr-94",  "tp",   0.03, "mg/L",
               "lake1", "lakewatch", "28-Jun-94",  "tp",   0.05, "mg/L",
               "lake1", "lakewatch",  "9-Jul-94",  "tp",   0.05, "mg/L",
               "lake1", "lakewatch",  "5-Oct-94",  "tp",   0.08, "mg/L",
               "lake1", "lakewatch",  "1-Nov-94",  "tp",   0.09, "mg/L",
               "lake1", "lakewatch", "22-Dec-94",  "tp",   0.11, "mg/L",
               "lake1", "lakewatch", "31-Jan-95",  "tp",    0.1, "mg/L",
               "lake1", "lakewatch", "16-Feb-95",  "tp",   0.08, "mg/L",
               "lake1", "lakewatch", "14-Mar-95",  "tp",   0.08, "mg/L",
               "lake1", "lakewatch", "13-Apr-95",  "tp",   0.06, "mg/L",
               "lake1", "lakewatch", "11-May-95",  "tp",   0.05, "mg/L",
               "lake1", "lakewatch", "20-Jun-95",  "tp",   0.03, "mg/L",
               "lake1", "lakewatch", "25-Jul-95",  "tp",   0.03, "mg/L",
               "lake1", "lakewatch", "21-Aug-95",  "tp",   0.17, "mg/L",
               "lake1", "lakewatch", "12-Sep-95",  "tp",   0.15, "mg/L",
               "lake1", "lakewatch", "16-Oct-95",  "tp",    0.1, "mg/L",
               "lake1", "lakewatch", "14-Nov-95",  "tp",    0.1, "mg/L",
               "lake1", "lakewatch", "12-Dec-95",  "tp",   0.06, "mg/L",
               "lake1", "lakewatch", "23-Jan-96",  "tp",   0.05, "mg/L",
               "lake1", "lakewatch", "15-Feb-96",  "tp",   0.07, "mg/L",
               "lake1", "lakewatch", "20-Mar-96",  "tp",   0.06, "mg/L"
               )

Created on 2020-11-12 by the reprex package (v0.3.0)

I made lake, source, parm and unit factors, and date as a date. I then wrote the following script, but again, not sure how to overlay annual geometric mean values over this plot.:

I also made new factors for year, month, and day:


tpexample <- tpexample %>%
  dplyr::mutate(date = ymd(date)) %>% 
  mutate_at(vars(date), funs(year, month, day))

Created on 2020-11-12 by the reprex package (v0.3.0)

And here is the code for the graph

graph2 <- tpexample %>%
  filter(lake=="lake1", parm =="tp") %>% 
  ggplot(aes(x=date, y=value, color=source)) +
  geom_point(size =2) +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") + 
  theme(legend.position = "bottom") 

This is the plot:

And this is the plot I want to create, with a hand-drawn green line and blue dots as an a example.

Thank you

(I think I finally did my reprex correctly....)

I would calculate a separate data frame with the geometric mean. I included the mean date in that data frame but you could set the date to January 1, or any other date you care to use.

library(dplyr)
library(lubridate)
library(ggplot2)
tpexample <- tpexample %>% mutate(date = dmy(date),
                                  year = year(date))
GeomMean <- tpexample %>% group_by(year) %>% 
  summarize(GMean = exp(mean(log(value))), MeanDate = mean(date))

ggplot() + geom_point(mapping = aes(date, value), data = tpexample) +
  geom_point(mapping = aes(MeanDate, GMean), data = GeomMean, size = 3, color = "red")

@FJCC Thank you!!!! This saved me, as I have to finish this up today. A few odd questions:

  1. I had to change from summarize to summarise. Maybe R knows what country I live in???
  2. My dates were brought in (from Excel) as ymd, so I had to change your mutate script. This was first time I have seen that, as usually it is brought in as dmy?
  3. The mean date works fine in this situation, but how would you set to a specific date, such as January 1 of that year?

Thanks again!
Cheers.

  1. I thought either spelling of summarise would work in any situation. I guess that isn't true but I have never looked into it. Both spellings work for me.
  2. From my limited experience importing from Excel, I would say the dates usually come in formatted as they are in Excel. If your dates came in as ymd, I would check whether they came in as characters or as a numeric dates. You might not have to transform them.
  3. Lubridate has a make_date() function. You could use the year column in GeoMean and make a new column with mutate.
GeoMean <- GeoMean %>% mutate(MyDate = make_date(year = year, month = 1, day = 1))

@FJCC--thanks again.

I am getting this error on your lubridate example--see below. I ran the script on my original data, and it added the correct Jan 1 column.

Geomean <- Geomean %>% 
  mutate(MyDate = make_date(year = year, month = 1, day = 1))
#> Error in Geomean %>% mutate(MyDate = make_date(year = year, month = 1, : could not find function "%>%"

Created on 2020-11-13 by the reprex package (v0.3.0)

The %>% pipe comes from the magrittr package and is automatically loaded with dplyr. Have you loaded one of those two packages?

@FJCC Yes, dplyr is loaded. Sometimes I have to put dplyr in front of mutate:, but I getting same error.

Geomean <- Geomean %>% 
 dplyr::mutate(MyDate = make_date(year = year, month = 1, day = 1))
#> Error in Geomean %>% dplyr::mutate(MyDate = make_date(year = year, month = 1, : could not find function "%>%"

Created on 2020-11-13 by the reprex package (v0.3.0)

Could it be the fact that the Geoman data frame has these columns:
year, GMean, MeanDate

?

@FJCC--Disregard my qeustion--it worked. Spelling error on my part!!

Thanks again

If you have to put dplyr:: in front of mutate for R to find mutate, that means that the dplyr package is not loaded. You can see what is loaded by running search(). Try running

library(dplyr)

and removing dplyr:: from your code.

@FJCC Thanks--I had an error a few months ago, and the solution was to put dplyr::mutate

Working now without the dplyr::

Take care, and thanks again--super helpful.

@FJCC--sorry to bother you again, but I found a big error.

This code:

GeomMean <- tpexample %>% group_by(year) %>% 
  summarise(GMean = exp(mean(log(value))), MeanDate = mean(date))
#> Error in tpexample %>% group_by(year) %>% summarise(GMean = exp(mean(log(value))), : could not find function "%>%"

Created on 2020-11-13 by the reprex package (v0.3.0)

Provides a column with Geometric Means, but the geometric means by year are incorrect--I checked by hand.

For 1992, it produces: 0.045, and it should be 0.1.

I even rewrote the code to use the geoMean function, and got the same answer. Is it in the "summarise"?

Here is the output I get from running code . I hand checked the 1992 value and it is correct.

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
tpexample <- tibble::tribble(
  ~lake,     ~source,       ~date, ~parm, ~value,  ~unit,
  "lake1", "lakewatch", "30-Jul-92",  "tp",   0.06, "mg/L",
  "lake1", "lakewatch", "18-Aug-92",  "tp",   0.07, "mg/L",
  "lake1", "lakewatch", "29-Sep-92",  "tp",   0.13, "mg/L",
  "lake1", "lakewatch", "29-Oct-92",  "tp",    0.1, "mg/L",
  "lake1", "lakewatch", "16-Nov-92",  "tp",   0.16, "mg/L",
  "lake1", "lakewatch", "16-Dec-92",  "tp",   0.13, "mg/L",
  "lake1", "lakewatch", "19-Jan-93",  "tp",   0.09, "mg/L",
  "lake1", "lakewatch", "10-Feb-93",  "tp",   0.09, "mg/L",
  "lake1", "lakewatch", "24-Mar-93",  "tp",   0.05, "mg/L",
  "lake1", "lakewatch", "28-Apr-93",  "tp",   0.05, "mg/L",
  "lake1", "lakewatch", "28-May-93",  "tp",   0.04, "mg/L",
  "lake1", "lakewatch", "29-Jun-93",  "tp",   0.04, "mg/L",
  "lake1", "lakewatch", "29-Jul-93",  "tp",   0.04, "mg/L",
  "lake1", "lakewatch", "20-Aug-93",  "tp",   0.03, "mg/L",
  "lake1", "lakewatch", "27-Sep-93",  "tp",   0.03, "mg/L",
  "lake1", "lakewatch", "20-Oct-93",  "tp",   0.04, "mg/L",
  "lake1", "lakewatch", "22-Nov-93",  "tp",   0.04, "mg/L",
  "lake1", "lakewatch", "22-Dec-93",  "tp",   0.05, "mg/L",
  "lake1", "lakewatch", "26-Jan-94",  "tp",   0.06, "mg/L",
  "lake1", "lakewatch", "23-Feb-94",  "tp",   0.05, "mg/L",
  "lake1", "lakewatch", "23-Mar-94",  "tp",   0.04, "mg/L",
  "lake1", "lakewatch", "27-Apr-94",  "tp",   0.03, "mg/L",
  "lake1", "lakewatch", "28-Jun-94",  "tp",   0.05, "mg/L",
  "lake1", "lakewatch",  "9-Jul-94",  "tp",   0.05, "mg/L",
  "lake1", "lakewatch",  "5-Oct-94",  "tp",   0.08, "mg/L",
  "lake1", "lakewatch",  "1-Nov-94",  "tp",   0.09, "mg/L",
  "lake1", "lakewatch", "22-Dec-94",  "tp",   0.11, "mg/L",
  "lake1", "lakewatch", "31-Jan-95",  "tp",    0.1, "mg/L",
  "lake1", "lakewatch", "16-Feb-95",  "tp",   0.08, "mg/L",
  "lake1", "lakewatch", "14-Mar-95",  "tp",   0.08, "mg/L",
  "lake1", "lakewatch", "13-Apr-95",  "tp",   0.06, "mg/L",
  "lake1", "lakewatch", "11-May-95",  "tp",   0.05, "mg/L",
  "lake1", "lakewatch", "20-Jun-95",  "tp",   0.03, "mg/L",
  "lake1", "lakewatch", "25-Jul-95",  "tp",   0.03, "mg/L",
  "lake1", "lakewatch", "21-Aug-95",  "tp",   0.17, "mg/L",
  "lake1", "lakewatch", "12-Sep-95",  "tp",   0.15, "mg/L",
  "lake1", "lakewatch", "16-Oct-95",  "tp",    0.1, "mg/L",
  "lake1", "lakewatch", "14-Nov-95",  "tp",    0.1, "mg/L",
  "lake1", "lakewatch", "12-Dec-95",  "tp",   0.06, "mg/L",
  "lake1", "lakewatch", "23-Jan-96",  "tp",   0.05, "mg/L",
  "lake1", "lakewatch", "15-Feb-96",  "tp",   0.07, "mg/L",
  "lake1", "lakewatch", "20-Mar-96",  "tp",   0.06, "mg/L"
)

tpexample <- tpexample %>% mutate(date = dmy(date),
                                  year = year(date))
GeomMean <- tpexample %>% group_by(year) %>% 
  summarize(GMean = exp(mean(log(value))), MeanDate = mean(date))
#> `summarise()` ungrouping output (override with `.groups` argument)
GeomMean
#> # A tibble: 5 x 3
#>    year  GMean MeanDate  
#>   <dbl>  <dbl> <date>    
#> 1  1992 0.102  1992-10-07
#> 2  1993 0.0461 1993-07-08
#> 3  1994 0.0577 1994-06-25
#> 4  1995 0.0739 1995-07-02
#> 5  1996 0.0594 1996-02-18

Created on 2020-11-13 by the reprex package (v0.3.0)

@FJCC Sorry to continue on this-- When I ran your script on the subset of my data, you are correct, I get the correct geomean values.

However, when I run on the full data, I get an incorrect value (0.04 as a geomean for 1992).

Is there a way to post the full data, other than what I did below?

@FJCC I think I figured it out. I was not selecting for "lake1"!!!

@FJCC --one last (hopefully) last thing. In your helpful post, you showed how to create and plot geometric means over a scatterplot for one lake ("lake1"). What if my data has several lakes, and I want to select out one lake at a time from the geomean data. Do I have to make a separate geomean data frame for each lake?

this is my code, derived from your code, and I am trying to create a plot for "lake1" and I have multiple lakes:


graph2 <- tpexample %>%
  filter(lake=="lake1", parm =="tp") %>% 
  ggplot() +
  geom_point(mapping =aes(x=date, y=value, color=source)) +
  geom_point(mapping = aes(MyDate, GMean), data = GeomMean, lake =="lake1", size = 3, color = "red") +
  geom_line(mapping = aes(MyDate, GMean), data = GeomMean, lake =="lake1",  size = 1, color = "green") +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") + 
  theme(legend.position = "bottom") 
#> Error in tpexample %>% filter(lake == "lake1", parm == "tp") %>% ggplot(): could not find function "%>%"
graph2 
#> Error in eval(expr, envir, enclos): object 'graph2' not found

Created on 2020-11-15 by the reprex package (v0.3.0)

And this is the output--it is putting multiple geomeans (from three different lakes):

thanks again

You could filter the GeomMean data frame inside of geom_point and geom_line.

library(dplyr)
library(ggplot2)
graph2 <- tpexample %>%
  filter(lake=="lake1", parm =="tp") %>% 
  ggplot() +
  geom_point(mapping =aes(x=date, y=value, color=source)) +
  geom_point(mapping = aes(MyDate, GMean), data = filter(GeomMean, lake =="lake1"), size = 3, color = "red") +
  geom_line(mapping = aes(MyDate, GMean), data = filter(GeomMean, lake =="lake1"),  size = 1, color = "green") +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") + 
  theme(legend.position = "bottom") 

If there are not too many lakes, you might be able to use facet_wrap to make a few plots at once.

graph2 <- tpexample %>%
  filter(parm =="tp") %>% 
  ggplot() +
  geom_point(mapping =aes(x=date, y=value, color=source)) +
  geom_point(mapping = aes(MyDate, GMean), data = GeomMean, size = 3, color = "red") +
  geom_line(mapping = aes(MyDate, GMean), data = GeomMean,  size = 1, color = "green") +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") + 
  theme(legend.position = "bottom")  +
facet_wrap(~lake)

@FJCC Fabulous! I just needed that help where to put the "filter".

I love the facet_wrap--I did not think R could figure out what geomean goes with which lake!

This just saved my week of work!!!!

thanks again

@FJCC--ugh... Sorry, this thing will never die. I created fabulous graphs thanks to you, but I am simply trying to put a legend title ("Geometric Mean") for my geometric mean line I created on these graphs. I tried multiple things such as scale_color_manual, but could not get the legend to show up.

How do I do this?

My graph:

My code:

graph3 <- ocoee %>%
  filter(parm =="tn") %>% 
  ggplot() +
  geom_point(mapping =aes(x=date, y=value, color=source)) +
  geom_point(mapping = aes(MyDate2, GMean2), data = GeomMean2, size = 1, color = "red") +
  geom_line(mapping = aes(MyDate2, GMean2), data = GeomMean2, size = 1, color = "purple") +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") + 
  labs(x="Year", y="TN (mg/L)")+ 
  theme(legend.position = "bottom") +
  facet_wrap(~lake, ncol=1, scales= "free", labeller = labeller(lake=lak.labs))
#> Error in ocoee %>% filter(parm == "tn") %>% ggplot(): could not find function "%>%"
graph3 
#> Error in eval(expr, envir, enclos): object 'graph3' not found

Created on 2020-11-16 by the reprex package (v0.3.0)

Legends are triggered when the aes() function has an aesthetic like fill or color that depends on a data value. I don't have time to experiment with this now but I suggest you add a column to GeomMean2 with a single value, maybe NA or a single space, and make the color of geom_line depend on that value. You will need to drop, I think, the color = "purple" setting that is outside of aes(). This is obviously a hack and someone else may have an elegant solution.

Sounds like @FJCC got you covered but you could probably uses stat_summary() with arguments as such, geom = "line", fun = exp(mean(log(value))) to calculate the geometric mean without creating a new dataframe. This is just getting fancy but can be useful in situations where you would have to make lots of smaller data frames. It also results with less objects in your environment (saves memory) as well as less overall code.