Grouping in Anomaly Detection

Hi All,

I have been trying to do anomaly detection using the anomalize package.
The problem I face is with grouping variables. If there is only column with one distinct value I could be able to run the anomaly detection but if there is multiple time series involved with more than one grouping variable it ends up failing.
Here I have two grouping variables- Product and Location and I tried grouping the data by Product and Location and converted to a tibble object but when i pass the glimpse function it is still not identifying the groups.
I have attached a sample code, if anyone could please help me with this that would be really helpful.

setwd("~/Outlier/R")
Test.df2<-structure(list(Product = c("A1", "A1", "A1", "A1", "B1", "B1", "B1", "B1"), Location = c("A", "A", "A", "A", "A", "B", "A", "C"), Period = c("Sep 2017", "Oct 2017", "Nov 2017", "Dec 2017", "Sep 2017", "Oct 2017", "Nov 2017", "Dec 2017"), Units = c(5.399993, 9, 10, 9, 5, 1, 3, 1)), row.names = c(NA, 8L), class = "data.frame")
Test.df2$Period<-lubridate::my(Test.df2$Period)
Test.df2[is.na(Test.df2)]<-0
Test.df2<-Test.df2 %>% group_by(Product,Location)
Test.df2<-as_tibble(Test.df2)
glimpse(Test.df2)
Test.df2.anomaly<- Test.df2 %>% time_decompose(Units,merge = T,method = 'stl',frequency = "auto", trend = "auto")%>% anomalize(remainder,method = 'gesd',alpha = 0.05,max_anoms = 0.2) %>% time_recompose()

I think you are making no mistakes but the grouping is lost by the unnecessary as_tibble.
See annotations here and the full reprex below that.

Because you did not mention which package you use for the analysis of the anomalies I can not guess the impact of this.

Test.df2$Period<-lubridate::my(Test.df2$Period)
Test.df2[is.na(Test.df2)]<-0 # ???
Test.df2<-Test.df2 %>% group_by(Product,Location)
class(Test.df2) # already a tibble
print(Test.df2) # this shows it is a grouped tibble
glimpse(Test.df2) # and here
Test.df2<-as_tibble(Test.df2) # not necessary (already a tibble) but ...
class(Test.df2) # a tibble but no longer grouped
print(Test.df2) # as show here
glimpse(Test.df2) # and here
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

Test.df2<-structure(list(Product = c("A1", "A1", "A1", "A1", "B1", "B1", "B1", "B1"), 
                         Location = c("A", "A", "A", "A", "A", "B", "A", "C"), 
                         Period = c("Sep 2017", "Oct 2017", "Nov 2017", "Dec 2017", "Sep 2017", "Oct 2017", "Nov 2017", "Dec 2017"), 
                         Units = c(5.399993, 9, 10, 9, 5, 1, 3, 1)), row.names = c(NA, 8L), class = "data.frame")
Test.df2$Period<-lubridate::my(Test.df2$Period)
Test.df2[is.na(Test.df2)]<-0 # ???
Test.df2<-Test.df2 %>% group_by(Product,Location)
class(Test.df2) # already a tibble
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
print(Test.df2) # this shows it is a grouped tibble
#> # A tibble: 8 x 4
#> # Groups:   Product, Location [4]
#>   Product Location Period     Units
#>   <chr>   <chr>    <date>     <dbl>
#> 1 A1      A        2017-09-01  5.40
#> 2 A1      A        2017-10-01  9   
#> 3 A1      A        2017-11-01 10   
#> 4 A1      A        2017-12-01  9   
#> 5 B1      A        2017-09-01  5   
#> 6 B1      B        2017-10-01  1   
#> 7 B1      A        2017-11-01  3   
#> 8 B1      C        2017-12-01  1
glimpse(Test.df2) # and here
#> Rows: 8
#> Columns: 4
#> Groups: Product, Location [4]
#> $ Product  <chr> "A1", "A1", "A1", "A1", "B1", "B1", "B1", "B1"
#> $ Location <chr> "A", "A", "A", "A", "A", "B", "A", "C"
#> $ Period   <date> 2017-09-01, 2017-10-01, 2017-11-01, 2017-12-01, 2017-09-01, 2~
#> $ Units    <dbl> 5.399993, 9.000000, 10.000000, 9.000000, 5.000000, 1.000000, ~
Test.df2<-as_tibble(Test.df2) # not necessary (already a tibble) but ...
class(Test.df2) # a tibble but no longer grouped
#> [1] "tbl_df"     "tbl"        "data.frame"
print(Test.df2) # as show here
#> # A tibble: 8 x 4
#>   Product Location Period     Units
#>   <chr>   <chr>    <date>     <dbl>
#> 1 A1      A        2017-09-01  5.40
#> 2 A1      A        2017-10-01  9   
#> 3 A1      A        2017-11-01 10   
#> 4 A1      A        2017-12-01  9   
#> 5 B1      A        2017-09-01  5   
#> 6 B1      B        2017-10-01  1   
#> 7 B1      A        2017-11-01  3   
#> 8 B1      C        2017-12-01  1
glimpse(Test.df2) # and here
#> Rows: 8
#> Columns: 4
#> $ Product  <chr> "A1", "A1", "A1", "A1", "B1", "B1", "B1", "B1"
#> $ Location <chr> "A", "A", "A", "A", "A", "B", "A", "C"
#> $ Period   <date> 2017-09-01, 2017-10-01, 2017-11-01, 2017-12-01, 2017-09-01, 2~
#> $ Units    <dbl> 5.399993, 9.000000, 10.000000, 9.000000, 5.000000, 1.000000, ~
Created on 2021-09-15 by the reprex package (v2.0.0)

@HanOostdijk Thank you. I have used the anomalize package to decompose, anomalize and recompose the time series data.
May I know what is the best way to group the data and pass the anomalize function to individual groups? If there is more than 10K+ product and location combinations does a normal group by works while passing the anomalize function to 10K+ production and location combinations?

Hello @arvind ,
the anomalize package is new for me. I have been experimenting a bit to get it working on dummy data.
I understand (but may be wrong) that the functions don't support grouped data, so in one way or another we have to loop over the various groups.

Maybe it will work for you in the way I did it, but have no opinion or experience about the 'best' way to do this.

I only got as far as the time_decompose part, but I suppose the remainder could be done in the same way

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(anomalize)
#> == Use anomalize to improve your Forecasts by 50%! =============================
#> Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
#> </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>

set.seed(2021)
Test.df2<-expand.grid(Product  = "A1", 
                      Location = c("A","B","C"), 
                      Period   = seq(as.Date("2000/1/1"), by = "month", length.out = 36)) %>%
  mutate(Units  = rnorm( n() ) ) %>%
  mutate(Units  = ifelse(Period==as.Date("2001/1/1"),4,Units)) %>%
  as_tibble() %>%
  nest_by(Product,Location) 

Test.df2.anomaly <- Test.df2 %>% 
  mutate(td = list(time_decompose(data,Units,merge = T,method = 'stl',frequency = "auto", trend = "auto")) ) %>% 
  select(-data) %>%
  tidyr::unnest(td)
#> Converting from tbl_df to tbl_time.
#> Auto-index message: index = Period
#> frequency = 12 months
#> trend = 12 months
#> Registered S3 method overwritten by 'quantmod':
#>   method            from
#>   as.zoo.data.frame zoo
#> Warning: `type_convert()` only converts columns of type 'character'.
#> - `df` has no columns of type 'character'
#> Converting from tbl_df to tbl_time.
#> Auto-index message: index = Period
#> frequency = 12 months
#> trend = 12 months
#> Warning: `type_convert()` only converts columns of type 'character'.
#> - `df` has no columns of type 'character'
#> Converting from tbl_df to tbl_time.
#> Auto-index message: index = Period
#> frequency = 12 months
#> trend = 12 months
#> Warning: `type_convert()` only converts columns of type 'character'.
#> - `df` has no columns of type 'character'

head(as.data.frame(Test.df2.anomaly))
#>   Product Location     Period      Units   observed     season       trend
#> 1      A1        A 2000-01-01 -0.1224600 -0.1224600 -0.4702341  0.02034267
#> 2      A1        A 2000-02-01  0.3596322  0.3596322  0.6328513 -0.06916940
#> 3      A1        A 2000-03-01  0.2617444  0.2617444  0.6603725 -0.15868147
#> 4      A1        A 2000-04-01  1.7299632  1.7299632 -0.1566939 -0.25227637
#> 5      A1        A 2000-05-01  0.1819954  0.1819954  0.5792449 -0.34587127
#> 6      A1        A 2000-06-01 -1.8414756 -1.8414756 -1.4435384 -0.42927376
#>     remainder
#> 1  0.32743144
#> 2 -0.20404961
#> 3 -0.23994669
#> 4  2.13893341
#> 5 -0.05137818
#> 6  0.03133656
Created on 2021-09-15 by the reprex package (v2.0.0)