plot lm/spline coefficients

I have a time series data for neighbourhood and crime rate per year for about 300 observations (neighbourhoods) (example below).

Neighbourhood  year  rate
      5001        2009  43.5   
      5001        2010  34.7
      5001        2011  40.8
      5002        2009  28.9
      5002        2010  33.8
      5002        2011  24.4
      .           .     .
      .           .     .

I applied spline regression by group (for each neighbourhood) to plot their curves over time and ended up with 6 coefficients for each observations:

	CensusTract (Intercept)          bs(rep$year, 4)1   bs(rep$year, 4)2   bs(rep$year, 4)3   bs(rep$year, 4)4
 [1,]	      5001    36.14530           17.502968           7.3978890          13.9133907          5.70852015
 [2,]        5002    64.62910           19.849905         -16.5157307         -10.7476461        -20.38942429
 [3,]        5003    60.62435           30.698498           3.8573157          10.3343372          2.42415531
 [4,]        5004   117.41211           20.563574         -56.1543275         -42.1466082        -70.43030322
 [5,]        5005    65.11512            6.628532          -0.1630097         -13.9509698        -32.82296251
 [6,]        5006    71.56126           11.982026         -21.9261412         -20.3717788        -39.88968841
 [7,]        5007    51.69142           13.757720          -0.9959946          17.3522501          1.03887025

I run k-mean clustering to cluster spline coefficients and ended up with 3 clusters

Cluster means: 

  CensusTract   (Intercept)   bs(rep$quarter, 6)1   bs(rep$quarter, 6)2   bs(rep$quarter, 6)3   bs(rep$quarter, 6)4
1    5046.074    45.52149          23.9330434               7.38961              7.318444                 -0.938248
2    5033.760    63.23948          12.8276297        	   -14.20825              -15.074422               -23.002576
3    5031.500   113.91280          -0.8379427               -68.82647            -56.071503               -80.903656

My question is:

How can I plot these coefficients to see the 3 curves of these clusters?

If you have a data set with n observations and fit it to a spline with df degrees of freedom, the bs() function from the splines package returns a (n x df) matrix. The columns of the matrix represent the basis functions used to do the fit. Let's assume df = 4. If b1 represents the first column of the matrix, b2 is the second column, etc., you are fitting

y ~ b1 + b2 + b3 + b4 

Below is an example where I plot the both the fit as produced by the predict() function and a manually calculated set of predictions using the method outlined above. I got this method from the following Stack Overflow thread

library(splines)
set.seed(34556)
dat <- data.frame(x = seq(1, 10, 0.2), y = seq(1, 10, 0.2) + rnorm(46, 0, 2))
FIT <- lm( y ~ bs(x, df = 4 ), data =dat) 
summary(FIT)
#> 
#> Call:
#> lm(formula = y ~ bs(x, df = 4), data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.9201 -0.9719  0.1797  1.2219  4.7808 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)      1.8029     1.2656   1.425  0.16185    
#> bs(x, df = 4)1   2.9686     2.6567   1.117  0.27032    
#> bs(x, df = 4)2  -0.3138     2.2283  -0.141  0.88870    
#> bs(x, df = 4)3  10.5438     2.3718   4.446 6.54e-05 ***
#> bs(x, df = 4)4   4.9131     1.6975   2.894  0.00606 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.003 on 41 degrees of freedom
#> Multiple R-squared:  0.5959, Adjusted R-squared:  0.5565 
#> F-statistic: 15.12 on 4 and 41 DF,  p-value: 1.131e-07

MAT <- bs(dat$x, df = 4) #get the matrix returned by bs()
head(MAT)
#>              1           2            3 4
#> [1,] 0.0000000 0.000000000 0.000000e+00 0
#> [2,] 0.1245981 0.002875171 2.194787e-05 0
#> [3,] 0.2323402 0.011149520 1.755830e-04 0
#> [4,] 0.3241481 0.024296296 5.925926e-04 0
#> [5,] 0.4009438 0.041788752 1.404664e-03 0
#> [6,] 0.4636488 0.063100137 2.743484e-03 0
X <- dat[[1]] #get the x values from dat for plotting later
Ypred <- MAT[, 1] * FIT$coefficients[2] +  
  MAT[, 2] * FIT$coefficients[3] +
  MAT[, 3] * FIT$coefficients[4] +
  MAT[, 4] * FIT$coefficients[5] + FIT$coefficients[1]

plot(dat$x, dat$y)
Pts <- seq(1,10, length.out = 100)
lines(Pts, predict(FIT, newdata = list(x = Pts)))
points(X, Ypred, col = "red")

Created on 2019-05-30 by the reprex package (v0.2.1)
You should be able to use your three sets of spline coefficients similarly with the matrix returned by bs() with your data.

EDIT:
I neater way to calculate Ypred is

Ypred <- MAT %*% FIT$coefficients[2:5] + FIT$coefficients[1]

Thank you FJCC so much for your detailed reply. (I've been trying for 4 days ): )
I'll try your suggestions and get back here.

Hi FJCC,
Thanks again for your hep and detailed answer.

When I run your code on my data I get this error:

"Error in as.vector(x, mode) :
cannot coerce type 'closure' to vector of type 'any'"

plot(detgam$quarter, detgam$rate)
Pts <- seq(1,10, length.out = 100)
lines(Pts, predict(FIT, newdata = list(x = Pts))) # it shows the error here.
points(X, Ypred, col = "red")

It might be a stupid error but I am still bigger in this world ):

Here is my data:

My main goal is to cluster spline coefficients of each neighbourhood and then plot the clustered coefficients

I suspect the error in that line is due to the fact that your FIT was not generated with an independent variable named x, as mine was with

FIT <- lm( y ~ bs(x, df = 4 ), data =dat) 

Judging from you plot() command, something like

lines(Pts, predict(FIT, newdata = list(quarter = Pts)))

or

lines(Pts, predict(FIT, newdata = list(Pts)))

should work.

Sorry I forgot to put my code:
I will try what you suggest. but here is the code:

Subset for one CensusTract(neighborhood) by quarter:

dat <- subset(fulldat, CensusTract == 5007)
head(dat)
FIT <- lm( rate ~ bs(quarter, df = 4 ), data =dat)
summary(FIT)

MAT <- bs(dat$quarter, df = 4)
head(MAT)

#get the x values from dat for plotting later
X <- detgam[[1]]
Ypred <- MAT[, 1] * FIT$coefficients[2] +
MAT[, 2] * FIT$coefficients[3] +
MAT[, 3] * FIT$coefficients[4] +
MAT[, 4] * FIT$coefficients[5] + FIT$coefficients[1]

plot(detgam$quarter, detgam$rate)
Pts <- seq(1,10, length.out = 100)
lines(Pts, predict(FIT, newdata = list(x = Pts))) #here is the error.
points(X, Ypred, col = "red")

Here is a modified version of your code. I made several changes.

library(splines)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date


fulldat <- read.csv("c:/users/fjcc/Documents/R/Play/fulldat.csv")
dat <- subset(fulldat, CensusTract == 5007)
dat <- dat %>% mutate(quarter = ymd(quarter))
head(dat)
#>   CensusTract    quarter  rate
#> 1        5007 2009-01-01 51.18
#> 2        5007 2009-04-01 59.99
#> 3        5007 2009-07-01 55.83
#> 4        5007 2009-10-01 57.59
#> 5        5007 2010-01-01 57.09
#> 6        5007 2010-04-01 60.00
FIT <- lm( rate ~ bs(quarter, df = 4 ), data =dat)
summary(FIT)
#> 
#> Call:
#> lm(formula = rate ~ bs(quarter, df = 4), data = dat)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -10.7133  -3.5360   0.0582   3.0774  13.9050 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)           55.8388     5.2731  10.589 2.57e-10 ***
#> bs(quarter, df = 4)1  -0.3403    11.2624  -0.030    0.976    
#> bs(quarter, df = 4)2  15.7921     9.9300   1.590    0.125    
#> bs(quarter, df = 4)3  -6.8076    10.0944  -0.674    0.507    
#> bs(quarter, df = 4)4  -6.3157     7.1517  -0.883    0.386    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.986 on 23 degrees of freedom
#> Multiple R-squared:  0.287,  Adjusted R-squared:  0.163 
#> F-statistic: 2.315 on 4 and 23 DF,  p-value: 0.08789

MAT <- bs(dat$quarter, df = 4)
head(MAT)
#>              1           2            3 4
#> [1,] 0.0000000 0.000000000 0.000000e+00 0
#> [2,] 0.1958968 0.007618045 9.750159e-05 0
#> [3,] 0.3492859 0.029216443 7.930853e-04 0
#> [4,] 0.4629917 0.062796399 2.721273e-03 0
#> [5,] 0.5394632 0.105693760 6.503734e-03 0
#> [6,] 0.5824479 0.154272771 1.259849e-02 0

#get the x values from dat for plotting later
X <- dat[[2]] #x axis comes from quarter column
# Ypred <- MAT[, 1] * FIT$coefficients[2] +
#   MAT[, 2] * FIT$coefficients[3] +
#   MAT[, 3] * FIT$coefficients[4] +
#   MAT[, 4] * FIT$coefficients[5] + FIT$coefficients[1]
Ypred <- MAT %*% FIT$coefficients[2:5] + FIT$coefficients[1] #Equiv to above but neater

plot(dat$quarter, dat$rate)
#Pts <- seq(1,10, length.out = 100)
lines(X, predict(FIT, newdata = list(quarter = X))) #here is the error.
points(X, Ypred, col = "red")

Created on 2019-05-31 by the reprex package (v0.2.1)

FJCC, I I can't thank you enough for taking all the time to help me and working on my code to run it. I really appreciate that. Now it's working and I learned a lot.

this is the last question ( I promise you) (:

My main goal is to find the pattern of the trajectory (curves) of the neighbourhoods.

I could not apply what you suggested on the clustered coefficients because I could not find the matrix as returned by bs().

# spline for each neighbourhood:

models <- dlply(fulldat, "CensusTract", function(fulldat) 
  lm(formula = fulldat$rate ~ bs(fulldat$quarter, df=4 )))

coeff <- ldply(models, coefficients)  

CensusTract (Intercept)          bs(rep$year, 4)1   bs(rep$year, 4)2   bs(rep$year, 4)3   bs(rep$year, 4)4
 [1,]	      5001    36.14530           17.502968           7.3978890          13.9133907          5.70852015
 [2,]        5002    64.62910           19.849905         -16.5157307         -10.7476461        -20.38942429
 [3,]        5003    60.62435           30.698498           3.8573157          10.3343372          2.42415531
 [4,]        5004   117.41211           20.563574         -56.1543275         -42.1466082        -70.43030322
 [5,]        5005    65.11512            6.628532          -0.1630097         -13.9509698        -32.82296251
 [6,]        5006    71.56126           11.982026         -21.9261412         -20.3717788        -39.88968841
 [7,]        5007    51.69142           13.757720          -0.9959946          17.3522501          1.03887025

# Then I clustered these, (not sure if it's the right choice to cluster).

clust1 <- kmeans(coeff, centers = 3, nstart = 20)

AND then ended up with these clustered coefficients:

Cluster means: 

  CensusTract   (Intercept)   bs(rep$quarter, 6)1   bs(rep$quarter, 6)2   bs(rep$quarter, 6)3   bs(rep$quarter, 6)4
1    5046.074    45.52149          23.9330434               7.38961              7.318444                 -0.938248
2    5033.760    63.23948          12.8276297        	   -14.20825              -15.074422               -23.002576
3    5031.500   113.91280          -0.8379427               -68.82647            -56.071503               -80.903656

I could not find the matrix that returned by bs() for these coefficients, so could not apply what you suggested to plot these clustered coefficients.

The matrix returned by bs() depends only on the values the independent variable, in this case, quarter. You should be able to make the matrix with

MAT <- bs(fulldat$quarter, df=4 )

then use the values from a single row of your matrix of coefficients to calculate predicted y values. Maybe

y5001 <- MAT %*% coeff[1, 3:6] + coeff[1, 2]

if I understand correctly what coeff is and if sub setting that data.frame can work with matrix multiplication.

Edit: Oops, I was looking at the wrong part of your post. Multiply MAT by the first row of your cluster means coefficients.

Ypred1 <- MAT %*% ClusterMeans[1, 3:6] +  ClusterMeans[1,2]

Also, you seem to have included the census track "number" in the clustering, which I doubt you want to do.

Thanks FJCC, apologise for taking your time and bothering.

you are right, I remove CensusTract from clustering.

sorry for bothering you but everything went well until the line() that gives this error: "Error in xy.coords(x, y) : 'x' and 'y' lengths differ"

as shown below:

MAT2 <- bs(fulldat$quarter, df=4 ) #this result 756 4 matrix because I have 27 CensusTract with 28 quarters for each Tract.

# convert the clustered values to matrix clustx:

Ypred1 <- MAT2 %*% clustx[1, 2:5] +  clustx[1,1]

plot(dat$quarter, detgam$rate)

lines(dat$quarter, Ypred1, col = "red") 
# this gives error "'x' and 'y' lengths differ" I believe this because Ypred1 is 756 and dat$quarter is 28. 

But
When the change x to fulldat$quarter it works but shows strange line:

Now that you say it, it is obvious that making MAT from the full data set will cause a problem because of the 756 values. Try making a vector of the unique values of quarter and make MAT2 from that.

MAT2 <- bs(unique(fulldat$quarter), df = 4)

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.