Production of a BSTS Mean Absolute Percentage Error (MAPE) Plot from a Bayesian Time Series Analysis with MCMC using ggplot() and bsts() packages

Kaikash777 · November 8, 2020, 6:49am

Problem:

I have a data frame called FID (see below) that contains two columns for Year & Month, and Sighting_Frequency (counts of birds).

The data frame contains 3 years of observations between 2015-2017 , indicating I have 36 months of data. I have run a Bayesian time series analysis with MCMC using the bsts() function in the bsts package (see the R-code below) by following the tutorial below.

I want to produce a holdout Mean Absolute Percentage Error (MAPE) Plot as seen in the diagram below, which illustrates the actual vs predicted values with credible intervals for the holdout period using the package ggplot().

I am getting stuck when I am attempting to produce the d2 data frame (see the tutorial and R-code below) because I keep on experiencing this error message:-

Error in data.frame(c(10^as.numeric(-colMeans(bsts.model$one.step.prediction.errors[-(1:burn),  : 
  arguments imply differing number of rows: 48, 32

I have been struggling to figure out the problem. If anyone can help me solve this issue, I would be deeply appreciative.

Many thanks in advance.

Tutorial

Plot I am trying to produce

R-code:
######################################
##Time Series Model using the bsts() function
#######################################

#Open packages for the time series analysis
library(lubridate)
library(bsts)
library(dplyr)
library(ggplot2)

##Create a time series object
myts2 <- ts(BSTS_Dataframe$Sightings_Frequency, start=c(2015, 1), end=c(2017, 12), frequency=12)

#upload the data into the windows() function
x <- window(myts2, start=c(2015, 01), end=c(2017, 12))
y <- log(x)

Run the bsts model

ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons = 3)

##Produce the bsts model using the bsts function
bsts.model <- bsts(y, state.specification = ss, family = "logit", niter = 100, ping = 0, seed = 123)

##Open plotting window
dev.new()

#plot the bsts.model
plot(bsts.model)

##Get a suggested number of burns
burn<-bsts::SuggestBurn(0.1, bsts.model)

#predict

p<-predict.bsts(bsts.model, horizon = 12, burn=burn, quantiles=c(.25, .975))

##Actual vs predicted
##Fitted vs predictions
##Acutal data and dates

d2 <- data.frame(
c(10^as.numeric(-colMeans(bsts.model$one.step.prediction.errors[-(1:burn),])+y), #fitted vs predictions
10^as.numeric(p$mean)),
as.numeric(BSTS_Dataframe$Sightings_Frequency), ## actual data and dates
as.Date(time(BSTS_Dataframe$Sightings_Frequency)))

######################################
This is the error message
######################################

Error in data.frame(c(10^as.numeric(-colMeans(bsts.model$one.step.prediction.errors[-(1:burn), :
arguments imply differing number of rows: 48, 32

######################################

##Name the d2 data frame columns
names(d2) <- c("Fitted", "Actual", "Date")

MAPE (mean absolute percentage error)

MAPE <- dplyr::filter(d2, year(Date)>2017) %>% dplyr::summarise(MAPE=mean(abs(Actual-Fitted)/Actual))

95% forecast credible interval

posterior.interval <- cbind.data.frame(
10^as.numeric(p$interval[1,]),
10^as.numeric(p$interval[2,]),
subset(d2, year(Date)>2017)$Date)
names(posterior.interval) <- c("LL", "UL", "Date")

Join intervals to the forecast

d3 <- left_join(d2, posterior.interval, by="Date")

Plot actual versus predicted with credible intervals for the holdout period

ggplot(data=d3, aes(x=Date)) +
geom_line(aes(y=Actual, colour = "Actual"), size=1.2) +
geom_line(aes(y=Fitted, colour = "Fitted"), size=1.2, linetype=2) +
theme_bw() + theme(legend.title = element_blank()) + ylab("") + xlab("") +
geom_vline(xintercept=as.numeric(as.Date("2017-12-01")), linetype=2) +
geom_ribbon(aes(ymin=LL, ymax=UL), fill="grey", alpha=0.5) +
ggtitle(paste0("BSTS -- Holdout MAPE = ", round(100*MAPE,2), "%")) +
theme(axis.text.x=element_text(angle = -90, hjust = 0))

**FID Data frame:**

structure(list(Year = structure(1:32, .Label = c("2015-01", "2015-02",
"2015-03", "2015-04", "2015-05", "2015-08", "2015-09", "2015-10",
"2015-11", "2015-12", "2016-01", "2016-02", "2016-03", "2016-04",
"2016-05", "2016-07", "2016-08", "2016-09", "2016-10", "2016-11",
"2016-12", "2017-01", "2017-02", "2017-03", "2017-04", "2017-05",
"2017-07", "2017-08", "2017-09", "2017-10", "2017-11", "2017-12"
), class = "factor"), Sightings_Frequency = c(36L, 28L, 39L,
46L, 5L, 22L, 10L, 15L, 8L, 33L, 33L, 29L, 31L, 23L, 8L, 9L,
40L, 41L, 40L, 30L, 30L, 44L, 37L, 41L, 42L, 20L, 7L, 27L, 35L,
27L, 43L, 38L)), class = "data.frame", row.names = c(NA, -32L
))

technocrat · November 8, 2020, 9:35pm

Attempt at answer ends here. Nothing to work from.

Kaikash777 · November 8, 2020, 10:00pm

Thank you for replying, it meant a lot! Could you please elaborate on your answer? I’m an R newbie and I feel confused by the predict equation. Many thanks

technocrat · November 8, 2020, 10:24pm

See the reprex FAQ. The idea is to reduce the friction in communicating between poster and answerers. Cutting and pasting the complete code with representative data goes a long-way to reducing misunderstanding and being able to suggest solutions. It's all about getting on the same page.

Using the reprex add-in in RStudio makes it just a matter of cut-and-paste for both sides. Be sure to start with a fresh R session and make sure that the reprex throws the error of interest. Critical is representative data. It doesn't have to be all your data, or even your data at all if it behaves the same way.

If you can do that, I'm happy to work with you on getting further than just offering "that what happens when you try to cbind two data frames that don't have the same number of rows."

Kaikash777 · November 20, 2020, 4:01pm

Hi Techocrat,

Thank you for replying to my question, and I found the article you supplied very interesting. I think I have already used a reprex method in this question because I used the function dput() to copy and paste my data frame called 'FID', in order for readers have a reproducible version. This can be found at the end of the question. Is this the type of reproducible data frame you are referring too?

I look forward to speaking with you soon,

Take care

technocrat · November 20, 2020, 11:33pm

Thanks. Assigning the dput to BSTS_Dataframe allows me to reproduce the error message, and 'll take a look. One of the advantages of using the reprex plugin is that the example is more or less guaranteed to run up to the same point as the error message and doesn't rely on someone parsing way to the end for the data.

technocrat · November 20, 2020, 11:52pm

Here's a reprex showing the root cause of the problem for the error, which is attempting to create a data frame from vectors with unequal lengths. Also, part4 at the end makes me think that these are not the dates you intend.

All R can be approached as school algebra: f(x) = y. The three objects (everything in R is an object) are x, what is at hand, y what is desired and f the function object that transforms one to another.

In this case think of d2 as y, even though it may only be intermediate. What do you want out of y? If it is to be another intermediate x, then what does that x require to do its job as an argument to a further f?

Come back with questions.

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(bsts)
#> Loading required package: BoomSpikeSlab
#> Loading required package: Boom
#> Loading required package: MASS
#> 
#> Attaching package: 'Boom'
#> The following object is masked from 'package:stats':
#> 
#>     rWishart
#> 
#> Attaching package: 'BoomSpikeSlab'
#> The following object is masked from 'package:stats':
#> 
#>     knots
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
#> Loading required package: xts
#> 
#> Attaching package: 'bsts'
#> The following object is masked from 'package:BoomSpikeSlab':
#> 
#>     SuggestBurn
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:xts':
#> 
#>     first, last
#> The following object is masked from 'package:MASS':
#> 
#>     select
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

## Data should come before provided as an argument

BSTS_Dataframe <- structure(list(Year = structure(1:32, .Label = c(
  "2015-01", "2015-02",
  "2015-03", "2015-04", "2015-05", "2015-08", "2015-09", "2015-10",
  "2015-11", "2015-12", "2016-01", "2016-02", "2016-03", "2016-04",
  "2016-05", "2016-07", "2016-08", "2016-09", "2016-10", "2016-11",
  "2016-12", "2017-01", "2017-02", "2017-03", "2017-04", "2017-05",
  "2017-07", "2017-08", "2017-09", "2017-10", "2017-11", "2017-12"
), class = "factor"), Sightings_Frequency = c(
  36L, 28L, 39L,
  46L, 5L, 22L, 10L, 15L, 8L, 33L, 33L, 29L, 31L, 23L, 8L, 9L,
  40L, 41L, 40L, 30L, 30L, 44L, 37L, 41L, 42L, 20L, 7L, 27L, 35L,
  27L, 43L, 38L
)), class = "data.frame", row.names = c(NA, -32L))

## Create a time series object
myts2 <- ts(BSTS_Dataframe$Sightings_Frequency, start = c(2015, 1), end = c(2017, 12), frequency = 12)

## Upload the data into the windows() function
x <- window(myts2, start = c(2015, 01), end = c(2017, 12))
y <- log(x)

ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons = 3)

## Produce the bsts model using the bsts function
bsts.model <- bsts(y, state.specification = ss, family = "logit", niter = 100, ping = 0, seed = 123)

## Open plotting window
dev.new()

## Plot the bsts.model
plot(bsts.model)

## Get a suggested number of burns
burn <- bsts::SuggestBurn(0.1, bsts.model)

## Predict

p <- predict.bsts(bsts.model, horizon = 12, burn = burn, quantiles = c(.25, .975))

## Actual vs predicted
## Fitted vs predictions
## Acutal data and dates

# here comes the error
# Error in data.frame(c(10^as.numeric(-colMeans(bsts.model$one.step.prediction.errors[-(1:burn),  :
# arguments imply differing number of rows: 48, 32
# d2 <- data.frame(
#   c(10^as.numeric(-colMeans(bsts.model$one.step.prediction.errors[-(1:burn),])+y), #fitted vs predictions
#     10^as.numeric(p$mean)),
#   as.numeric(BSTS_Dataframe$Sightings_Frequency), ## actual data and dates
#   as.Date(time(BSTS_Dataframe$Sightings_Frequency)))

## Examine each element separatly

part1 <- 10^as.numeric(-colMeans(bsts.model$one.step.prediction.errors[-(1:burn), ]) + y)
part2 <- 10^as.numeric(p$mean)
part3 <- as.numeric(BSTS_Dataframe$Sightings_Frequency)
part4 <- as.Date(time(BSTS_Dataframe$Sightings_Frequency))

# get lengths -- only part3 and part4 are equal
# can't make a data frame out of unequal lengths

length(part1)
#> [1] 36
length(part2)
#> [1] 12
length(part3)
#> [1] 32
length(part4)
#> [1] 32

# while we're at it

part4
#>  [1] "1970-01-02" "1970-01-03" "1970-01-04" "1970-01-05" "1970-01-06"
#>  [6] "1970-01-07" "1970-01-08" "1970-01-09" "1970-01-10" "1970-01-11"
#> [11] "1970-01-12" "1970-01-13" "1970-01-14" "1970-01-15" "1970-01-16"
#> [16] "1970-01-17" "1970-01-18" "1970-01-19" "1970-01-20" "1970-01-21"
#> [21] "1970-01-22" "1970-01-23" "1970-01-24" "1970-01-25" "1970-01-26"
#> [26] "1970-01-27" "1970-01-28" "1970-01-29" "1970-01-30" "1970-01-31"
#> [31] "1970-02-01" "1970-02-02"

^{Created on 2020-11-20 by the reprex package (v0.3.0.9001)}

Kaikash777 · November 23, 2020, 12:54pm

Thank you for your feedback. Do you have any suggestions with how to fix this issue based on the steps in the tutorial (link above)? I am under the impression d2 binds the columns for the actual and fitted data with the dates in order to plot them onto a MAPE plot? Many thanks in advance for your advice

technocrat · November 23, 2020, 9:42pm

suppressPackageStartupMessages({
  library(bsts)
  library(dplyr)
  library(ggplot2)
})


BSTS_Dataframe <- structure(list(Year = structure(1:32, .Label = c(
  "2015-01", "2015-02",
  "2015-03", "2015-04", "2015-05", "2015-08", "2015-09", "2015-10",
  "2015-11", "2015-12", "2016-01", "2016-02", "2016-03", "2016-04",
  "2016-05", "2016-07", "2016-08", "2016-09", "2016-10", "2016-11",
  "2016-12", "2017-01", "2017-02", "2017-03", "2017-04", "2017-05",
  "2017-07", "2017-08", "2017-09", "2017-10", "2017-11", "2017-12"
), class = "factor"), Sightings_Frequency = c(
  36L, 28L, 39L,
  46L, 5L, 22L, 10L, 15L, 8L, 33L, 33L, 29L, 31L, 23L, 8L, 9L,
  40L, 41L, 40L, 30L, 30L, 44L, 37L, 41L, 42L, 20L, 7L, 27L, 35L,
  27L, 43L, 38L
)), class = "data.frame", row.names = c(NA, -32L))

myts2 <- ts(BSTS_Dataframe$Sightings_Frequency, start = c(2015, 1), end = c(2017, 12), frequency = 12)

x <- window(myts2, start = c(2015, 01), end = c(2017, 12))
y <- log(x)

ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons = 3)

bsts.model <- bsts(y, state.specification = ss, family = "logit", niter = 100, ping = 0, seed = 123)

plot(bsts.model)

burn <- bsts::SuggestBurn(0.1, bsts.model)

p <- predict(bsts.model, horizon = 12, burn = burn, quantiles = c(.25, .975))

### Actual versus predicted: Follow the example in the tutorial
d2 <- data.frame(
  # fitted values and predictions
  c(
    10^as.numeric(-colMeans(bsts.model$one.step.prediction.errors[-(1:burn), ]) + y),
    10^as.numeric(p$mean)
  ),
  # actual data and dates
  as.numeric(myts2),
  as.Date(time(myts2))
)
#> Error in data.frame(c(10^as.numeric(-colMeans(bsts.model$one.step.prediction.errors[-(1:burn), : arguments imply differing number of rows: 48, 36
names(d2) <- c("Fitted", "Actual", "Date")
#> Error in names(d2) <- c("Fitted", "Actual", "Date"): object 'd2' not found

MAPE <- filter(d2, year(Date) > 2014) %>% summarise(MAPE = mean(abs(Actual - Fitted) / Actual))
#> Error in filter(d2, year(Date) > 2014): object 'd2' not found

### 95% forecast credible interval
posterior.interval <- cbind.data.frame(
  10^as.numeric(p$interval[1, ]),
  10^as.numeric(p$interval[2, ]),
  subset(d2, year(Date) > 2014)$Date
)
#> Error in subset(d2, year(Date) > 2014): object 'd2' not found
names(posterior.interval) <- c("LL", "UL", "Date")
#> Error in names(posterior.interval) <- c("LL", "UL", "Date"): object 'posterior.interval' not found

### Join intervals to the forecast
d3 <- left_join(d2, posterior.interval, by = "Date")
#> Error in left_join(d2, posterior.interval, by = "Date"): object 'd2' not found

### Plot actual versus predicted with credible intervals for the holdout period
ggplot(data = d3, aes(x = Date)) +
  geom_line(aes(y = Actual, colour = "Actual"), size = 1.2) +
  geom_line(aes(y = Fitted, colour = "Fitted"), size = 1.2, linetype = 2) +
  theme_bw() +
  theme(legend.title = element_blank()) +
  ylab("") +
  xlab("") +
  geom_vline(xintercept = as.numeric(as.Date("1959-12-01")), linetype = 2) +
  geom_ribbon(aes(ymin = LL, ymax = UL), fill = "grey", alpha = 0.5) +
  ggtitle(paste0("BSTS -- Holdout MAPE = ", round(100 * MAPE, 2), "%")) +
  theme(axis.text.x = element_text(angle = -90, hjust = 0))
#> Error in ggplot(data = d3, aes(x = Date)): object 'd3' not found

^{Created on 2020-11-23 by the reprex package (v0.3.0.9001)}

Kaikash777 · November 24, 2020, 11:37am

Hey Technocrat! I just had someone answer my question on Stackoverflow! This additional code at the beginning before creating the time series object works (see below). Would it be possible to please explain in more detail why this extra code solved the problem? Many thanks in advance.

BSTS_Dataframe$Year <- lubridate::ymd(paste0(FID$Year,"-01"))

allDates <- seq.Date(
               min(FID$Year),
               max(FID$Year),
               "month")

FID <- FID %>% right_join(data.frame(Year = allDates), by = c("Year")) %>% dplyr::arrange(Year) %>%
                     tidyr::fill(Sightings_Frequency, .direction = "down")

##Create a time series object
myts2 <- ts(FID$Sightings_Frequency, start=c(2015, 1), end=c(2017, 12), frequency=12)

##Upload the data into the windows() function
x <- window(myts2, start=c(2015, 01), end=c(2016, 12))
y <- log(x)

##Produce a list for the object ss
ss <- list()

#ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons = 12)
ss <- AddLocalLevel(ss, y)
# bsts.model <- bsts(y, state.specification = ss, family = "poisson", niter = 2, ping=0, seed=1234)
# If these are poisson distributed, no need to use logit because it bounds reponse
# between 0-1
bsts.model <- bsts(y, state.specification = ss,  niter = 100, ping = 0, seed = 123)

##Open plotting window
dev.new()

##Plot the bsts.model
plot(bsts.model)

##Get a suggested number of burns
burn<-bsts::SuggestBurn(0.1, bsts.model)

##Predict

p<-predict.bsts(bsts.model, horizon = 12, burn=burn, quantiles=c(.25, .975))

p$mean

##Actual vs predicted

d2 <- data.frame(
  # fitted values and predictions
  c(exp(as.numeric(-colMeans(bsts.model$one.step.prediction.errors[-(1:burn),])+y)),  
    exp(as.numeric(p$mean))),
  # actual data and dates
  as.numeric(FID$Sightings_Frequency),
  as.Date(FID$Year))

names(d2) <- c("Fitted", "Actual", "Date")

### MAPE (mean absolute percentage error)
MAPE <- dplyr::filter(d2, lubridate::year(Date)>=2017) %>%
  dplyr::summarise(MAPE=mean(abs(Actual-Fitted)/Actual))

### 95% forecast credible interval
posterior.interval <- cbind.data.frame(
              exp(as.numeric(p$interval[1,])),
              exp(as.numeric(p$interval[2,])),
              tail(d2,12)$Date)

names(posterior.interval) <- c("LL", "UL", "Date")

### Join intervals to the forecast
d3 <- left_join(d2, posterior.interval, by="Date")

##Open plotting window
dev.new()

### Plot actual versus predicted with credible intervals for the holdout period
ggplot(data=d3, aes(x=Date)) +
  geom_line(aes(y=Actual, colour = "Actual"), size=1.2) +
  geom_line(aes(y=Fitted, colour = "Fitted"), size=1.2, linetype=2) +
  theme_bw() + theme(legend.title = element_blank()) + ylab("") + xlab("") +
  geom_vline(xintercept=as.numeric(as.Date("2017-12-01")), linetype=2) +
  geom_ribbon(aes(ymin=LL, ymax=UL), fill="grey", alpha=0.5) +
  ggtitle(paste0("BSTS -- Holdout MAPE = ", round(100*MAPE,2), "%")) +
  theme(axis.text.x=element_text(angle = -90, hjust = 0))

technocrat · November 24, 2020, 6:53pm

Looks like it put in plug values for the missing June observations

Kaikash777 · November 25, 2020, 2:08am

I understand now! Therefore, all the columns now contain the same number of rows as the missing values are filled. Thank you for your advice and help throughout this process. I am a newbie to time series analysis in R.