Outlier identification of time series using PCA and bivariate kernel density estimates in Rainbow package

jgarrigan · May 15, 2023, 8:21pm

I'm interested in applying outlier detection models to my database of temporal information to identify some interesting data points.

I'm following along to @robjhyndman examples in his vignettes for the tsfeatures package which can be found at the bottom of this page as well as his research paper which can be found here

# INSTALL PACKAGES
pacman::p_load(tsfeatures, dplyr, broom, tidyverse, tictoc, forecast, factoextra, dbscan, ggfortify, rainbow, cowplot)

# GET DATA
yahoo <- yahoo_data()

# CREATE TIME SERIES FEATURES
tic()
hwl <- bind_cols(
  tsfeatures(yahoo, c("acf_features", "entropy", "lumpiness", "flat_spots", "crossing_points")),
  tsfeatures(yahoo, "stl_features", s.window = "periodic", robust = TRUE),
  tsfeatures(yahoo, "max_kl_shift", width = 48),
  tsfeatures(yahoo, c("mean", "var"), scale = FALSE, na.rm = TRUE),
  tsfeatures(yahoo, c("max_level_shift", "max_var_shift"), trim = TRUE)
) %>%
  dplyr::select(
    mean, var, x_acf1, trend, linearity, curvature,
    seasonal_strength, peak, trough, entropy, lumpiness, spike, max_level_shift, max_var_shift, flat_spots,
    crossing_points, max_kl_shift, time_kl_shift
  )
toc()

# PERFORM PCA
hwl_pca <- hwl %>%
  na.omit() %>%
  prcomp(scale = TRUE)

# GET A SUMMARY OF THE PCA
summary(hwl_pca)

# SCREE PLOT OF THE PRINCIPAL COMPONENTS
fviz_eig(hwl_pca)

# PLOT THE FIRST TWO PRINCIPAL COMPONENTS AND THEIR LOADINGS
autoplot(hwl_pca,loadings = TRUE, loadings.label = TRUE)

# EXTRACT THE PCA SCORES
pc1 <- hwl_pca$x[,1]

pc2 <- hwl_pca$x[,2]

pc_matrix <- cbind(pc1, pc2)

According to this article from Rob's website he uses the Rainbow package to calculate the density rankings but it's at this stage that I struggle.

The rainbow package requires the two main PCs to be converted to functional data (either fds or fts format) and I believe the fds function does that which I've completed below:

x_var <- seq(1,nrow(pc_matrix))

functional_data <- rainbow::fds(x = x_var, y = pc_matrix, xname = "xvar", yname = "yvar")

fboxplot(data = functional_data, plot.type = "functional", alpha = 0.5)

When I run the fxboxplot function I get the following error:

[1] "not enough data points"
Error in if (pcbag$is.one.dim == TRUE) { : argument is of length zero

Checking the structure of the functional data input I can see that it contains vectors of length 1748 so I'm unsure what the error is referring to:

str(functional_data)
List of 4
 $ x    : int [1:1748] 1 2 3 4 5 6 7 8 9 10 ...
 $ y    : num [1:1748, 1:2] 1.58 -1.23 -2.73 -1.07 -3.09 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "pc1" "pc2"
 $ xname: chr "xvar"
 $ yname: chr "yvar"
 - attr(*, "class")= chr "fds"

If anyone has experience with converting data to functional form I'd be very grateful for any help that could be shared.

robjhyndman · May 15, 2023, 11:03pm

Anomaly detection in the rainbow package involves taking functional data and computing functional principal components so that the data can be represented in a 2-d scatterplot of the first two PC scores. Then it uses some standard approaches to find outliers in the 2d space.

You create a similar 2d space from the time series features. So it doesn't make much sense to then try to create functional data. You already have the 2d space from which anomalies can be identified.

For example, here is a plot showing anomalies using HDRs:

# Load packages
library(tsfeatures)
library(tidyverse)
library(hdrcde)

# GET DATA
yahoo <- yahoo_data()

# CREATE TIME SERIES FEATURES
hwl <- bind_cols(
  tsfeatures(yahoo, c("acf_features", "entropy", "lumpiness", "flat_spots", "crossing_points")),
  tsfeatures(yahoo, "stl_features", s.window = "periodic", robust = TRUE),
  tsfeatures(yahoo, "max_kl_shift", width = 48),
  tsfeatures(yahoo, c("mean", "var"), scale = FALSE, na.rm = TRUE),
  tsfeatures(yahoo, c("max_level_shift", "max_var_shift"), trim = TRUE)
) %>%
  dplyr::select(
    mean, var, x_acf1, trend, linearity, curvature,
    seasonal_strength, peak, trough, entropy, lumpiness, spike, max_level_shift, max_var_shift, flat_spots,
    crossing_points, max_kl_shift, time_kl_shift
  )

# PERFORM PCA
hwl_pca <- hwl %>%
  na.omit() %>%
  prcomp(scale = TRUE)

# EXTRACT THE PCA SCORES
pc1 <- hwl_pca$x[, 1]
pc2 <- hwl_pca$x[, 2]

# Find outliers using HDRs
hdrscatterplot(pc1, pc2)

^{Created on 2023-05-16 with reprex v2.0.2}

jgarrigan · May 16, 2023, 2:19pm

Many thanks @robjhyndman, I appreciate the help.

system · June 27, 2023, 2:19pm

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.