correlation between time series/panel data

Tamim · May 6, 2020, 12:27am

Hi,

I have panel data/time series for 150 neighbourhoods and their annual crime rates for 8 years and my question is that what is the easiest way to do correlation/lagged correlations between crime rates in different neighbourhoods.

the data looks like:

neighbourhood   year1 year2 year3 year4  .  .  .  year8
       1        23.4   26.3  25.1  28.4 
       2        12.3   14.6  16.7  15.4
       3        20.8   22.3  21.8  23.3
       4        33.4   30.6  28.6  27.3
       .
       .
       .

robjhyndman · May 7, 2020, 11:40pm

The tsibble and feasts packages can be used to do this relatively easily.

First I'll create some data that is in the same format as yours.

library(tidyverse)
library(tsibble)
library(feasts)

df <- data.frame(
  neighbourhood = 1:150,
  year1 = 100 * runif(150),
  year2 = 100 * runif(150),
  year3 = 100 * runif(150),
  year4 = 100 * runif(150),
  year5 = 100 * runif(150),
  year6 = 100 * runif(150),
  year7 = 100 * runif(150),
  year8 = 100 * runif(150)
)

Now we will turn it into a tsibble with year as the index and neighbourhood as a key.

df <- df  %>%
  pivot_longer(year1:year8, names_to = "year", values_to = "crime") %>%
  mutate(
    year = as.numeric(stringr::str_sub(year, 5, 5))
  ) %>%
  as_tsibble(index = year, key = neighbourhood)
df
#> # A tsibble: 1,200 x 3 [1]
#> # Key:       neighbourhood [150]
#>    neighbourhood  year crime
#>            <int> <dbl> <dbl>
#>  1             1     1 32.7 
#>  2             1     2 13.0 
#>  3             1     3 70.6 
#>  4             1     4 96.1 
#>  5             1     5 28.4 
#>  6             1     6 43.3 
#>  7             1     7 73.2 
#>  8             1     8 37.4 
#>  9             2     1 95.3 
#> 10             2     2  4.57
#> # … with 1,190 more rows

The autocorrelations for all 150 neighbourhoods are now very easy to compute using ACF().

# Lagged correlations within neighbourhoods
df %>% ACF(crime, lag_max = 3)
#> # A tsibble: 450 x 3 [1]
#> # Key:       neighbourhood [150]
#>    neighbourhood   lag     acf
#>            <int> <lag>   <dbl>
#>  1             1     1 -0.0842
#>  2             1     2 -0.592 
#>  3             1     3  0.225 
#>  4             2     1 -0.421 
#>  5             2     2  0.0537
#>  6             2     3 -0.0429
#>  7             3     1  0.0206
#>  8             3     2 -0.626 
#>  9             3     3 -0.108 
#> 10             4     1  0.0885
#> # … with 440 more rows

Computing cross-correlations can be done using CCF() but it requires two variables. So first we have to extend our tsibble to be keyed by all pairs of neighbourhoods.

# Create long tibble containing all pairs of neighbourhoods
pairs <- df %>%
  mutate(n2 = neighbourhood) %>%
  expand(year, neighbourhood, n2) %>%
  left_join(df, by = c("year", "neighbourhood")) %>%
  rename(
    neighbourhoodA = neighbourhood,
    crimeA = crime,
    neighbourhood = n2
  ) %>%
  left_join(df, by = c("year", "neighbourhood")) %>%
  rename(
    crimeB = crime,
    neighbourhoodB = neighbourhood
  ) %>%
  as_tsibble(index = year, key = c(neighbourhoodA, neighbourhoodB))
pairs
#> # A tsibble: 180,000 x 5 [1]
#> # Key:       neighbourhoodA, neighbourhoodB [22,500]
#>     year neighbourhoodA neighbourhoodB crimeA crimeB
#>    <dbl>          <int>          <int>  <dbl>  <dbl>
#>  1     1              1              1   32.7  32.7 
#>  2     2              1              1   13.0  13.0 
#>  3     3              1              1   70.6  70.6 
#>  4     4              1              1   96.1  96.1 
#>  5     5              1              1   28.4  28.4 
#>  6     6              1              1   43.3  43.3 
#>  7     7              1              1   73.2  73.2 
#>  8     8              1              1   37.4  37.4 
#>  9     1              1              2   32.7  95.3 
#> 10     2              1              2   13.0   4.57
#> # … with 179,990 more rows

Finally, we can compute the cross-correlations for all 22,500 pairs of neighbourhoods.

# Lagged cross-correlations between neighbourhoods
pairs %>% CCF(crimeA, crimeB, lag_max = 3)
#> # A tsibble: 157,500 x 4 [1]
#> # Key:       neighbourhoodA, neighbourhoodB [22,500]
#>    neighbourhoodA neighbourhoodB   lag     ccf
#>             <int>          <int> <lag>   <dbl>
#>  1              1              1    -3  0.225 
#>  2              1              1    -2 -0.592 
#>  3              1              1    -1 -0.0842
#>  4              1              1     0  1     
#>  5              1              1     1 -0.0842
#>  6              1              1     2 -0.592 
#>  7              1              1     3  0.225 
#>  8              1              2    -3  0.191 
#>  9              1              2    -2  0.459 
#> 10              1              2    -1 -0.237 
#> # … with 157,490 more rows

^{Created on 2020-05-08 by the reprex package (v0.3.0)}

Tamim · May 8, 2020, 12:58am

robjhyndman, Thank you so much for taking the time to write this detailed answer. I really appreciate it. My final goal is to have a correlation matrix (a matrix of 150 by 150 or 22500 element) and then I want to transform this matrix into an adjacency matrix to draw a network of highly correlated neighbourhood. So is possible to have the laged-correlation results as matrix 150 by 150? I mean converting the long-shaped df to a correlation matrix

Thanks again

robjhyndman · May 8, 2020, 2:29am

There are 22500 correlations for each lag. My analysis did this for 7 lags (-3,-2,-1,0,1,2,3) so there are 157500 correlations computed. So I assume you want a 150x150 matrix for each lag.

Here is an example for lag 1, converted to a matrix with neighbourhoods as row and column names.

ccf <- pairs %>% CCF(crimeA, crimeB, lag_max = 3)

lag1 <- ccf %>%
  as_tibble() %>%
  filter(lag==1) %>%
  select(-lag) %>%
  pivot_wider(names_from="neighbourhoodB", values_from="ccf") %>%
  as.data.frame(row.names=neighbourhoodA) %>%
  select(-neighbourhoodA) %>%
  as.matrix(rownames.force=TRUE)

Tamim · May 10, 2020, 6:27pm

many thanks robjhyndman for your time. It worked well. Last question, I wanted to get the matrix upper/lower triangular I noticed that that the diagonal elements of the laged-correlation matrix are not Ones which I always see in correlation matrix, I mean the correlation between a neighbourhood with its self is not 1, is that normal?
Thanks

robjhyndman · May 10, 2020, 10:09pm

The lag k correlation between time series \{X_t\} and \{Y_t\} is given by
r_{XY}(k) = \text{Corr}(X_t, Y_{t-k})
In general, r_{YY}(k) \ne 1 unless k=0.

system · May 17, 2020, 10:09pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.