Conceptual Understanding of the MARS Algorithm

mars

#1

Hi,

I'm currently reading Applied Predictive Modelling, specifically the section on the MARS algorithm, and I have a couple of questions I hope someone can help me with.

The algorithm uses hinges, where the data is effectively partitioned in two per predictor, and a linear regression is then fit on each of the partitions split by the hinge. In the book, and in The Elements of Statistical Learning, the examples (well, the diagrams; I got lost in the notation) show the predictor space being partitioned only in two. I found another website where it is partitioned into multiple regions. To my untrained eye the algorithm looks very similar to regression splines, only with pruning. Is this the case?

Does MARS allow the building of Smoothing Splines?

Finally, Applied Predictive Modelling (pg 149) talks about interactions and how the algorithm adds more hinge functions within the partitioned data. From the book (I can remove this quote if it violates copyright):

The search procedure attempts to find hinge functions C and D that, when
multiplied by A, result in an improvement in the model; in other words, the
model would have terms for A, A×B and A×C.

My question on this is what happened to hinge function D?

It might just be a case of me mixing up apples and oranges, but I would like to get a grasp of the conceptual process of the model before using it in R. If you have any recommendations of blog posts or research papers which focus on the concepts with lots of examples, I would be very grateful.

Thank you all for your time.

References

ISLR - Regression Splines, pg 271
ESL - Multivariate Adaptive Regression Splines (MARS), pg 321


#2

It is similar to regression splines but not the same. These are single-knot linear splines (so no attempt at smoothness: the fit is continuous, but its slope changes abruptly at each knot) and they are derived in a sequential (and completely different) manner.
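To make the hinge idea concrete, here is a minimal sketch of a single-knot hinge pair. MARS itself lives in R (the earth package); this is plain Python, just to show the arithmetic, with a hand-picked knot at 5.

```python
# A MARS hinge pair at knot c consists of h(x - c) = max(0, x - c) and its
# mirror h(c - x) = max(0, c - x). Each basis function is continuous at the
# knot, but its first derivative jumps there.

def hinge(u):
    return max(0.0, u)

c = 5.0  # illustrative knot location, not fitted

# Left of the knot only the mirrored term is nonzero; right of it only the
# direct term is. A model beta0 + beta1*h(x - c) + beta2*h(c - x) is therefore
# piecewise linear: slope -beta2 below the knot, slope beta1 above it.
for x in [3.0, 5.0, 7.0]:
    print(x, hinge(c - x), hinge(x - c))
```

So a single split gives exactly the two-partition picture in the book's diagrams; multiple partitions only appear when the forward pass later places more knots.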

If you can tolerate more of my writing, there is more written here.

MARS can create multiple segments when the same predictor is split on multiple times. With multiple predictors (in a second-degree split), it can isolate a specific region in 2D space. There are more training materials here.
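A quick sketch of that second-degree point (again in Python just for illustration; the knot values 2 and 3 are made up): a degree-2 term is a product of hinges on two predictors, and the product is nonzero only where both hinges are positive, which carves out a region of the 2D predictor space.

```python
# A degree-2 MARS term is a product of two hinges, e.g. h(x1 - 2) * h(3 - x2).
# It is nonzero only in the region {x1 > 2 and x2 < 3} and exactly zero
# everywhere else, which is how an interaction isolates a 2D region.

def hinge(u):
    return max(0.0, u)

def interaction_term(x1, x2):
    return hinge(x1 - 2.0) * hinge(3.0 - x2)

print(interaction_term(4.0, 1.0))  # inside the region -> 4.0
print(interaction_term(1.0, 1.0))  # x1 <= 2 -> 0.0
print(interaction_term(4.0, 5.0))  # x2 >= 3 -> 0.0
```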

That's a typo (that nobody has yet submitted for the errata page).

It should read A, A×C and A×D.

One thing that helped me learn the details is to increase the verbosity of the earth::earth function. For example:

> earth(mpg ~ ., data = mtcars, trace = 4, degree = 2)
Call: earth(formula=mpg~., data=mtcars, trace=4, degree=2)

x[32,10]:
              cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4       6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710      4  108  93 3.85 2.320 18.61  1  1    4    1
...             6  258 110 3.08 3.215 19.44  1  0    3    1
Volvo 142E      4  121 109 4.11 2.780 18.60  1  1    4    2

y[32,1]:
     mpg
1   21.0
2   21.0
3   22.8
... 21.4
32  21.4

Forward pass: minspan 5 endspan 10    x[32,10] 2.5 kB    bx[32,21] 5.25 kB

         GRSq    RSq     DeltaRSq Pred     PredName         Cut  Terms   Par Deg
1      0.0000 0.0000                    (Intercept)
2      0.8012 0.8602       0.8602    2         disp         145  2   3         1 
4      0.7952 0.8823      0.02215   10         carb           1< 4         2   2 
6      0.7278 0.9031      0.02075    9         gear           4  5   6         1 
8      0.5941 0.9230      0.01997    3           hp         123  7   8         1 
10     0.1754 0.9380      0.01497    4         drat        3.15  9   10        1 
12    -0.4433 0.9459     0.007927    7           vs           0< 11        6   2 
14    -2.5203 0.9551     0.009193    6         qsec        14.5< 12        3   2 
16   -31.2057 0.9665      0.01136    7           vs           0< 13        9   2 reject (negative GRSq)

Reached minimum GRSq -10 at 15 terms, 12 terms used (GRSq -31)
After forward pass GRSq -31.206 RSq 0.966
Forward pass complete: 15 terms, 12 terms used

Subset size        GRSq     RSq  DeltaGRSq nPreds  Terms (col nbr in bx)
          1      0.0000  0.0000     0.0000      0  1
          2      0.6590  0.7118     0.6590      1  1 3
chosen    3      0.8283  0.8792     0.1694      2  1 3 4
          4      0.8135  0.8928    -0.0148      3  1 3 4 5
          5      0.8231  0.9188     0.0096      6  1 4 8 11 12
          6      0.8024  0.9296    -0.0208      6  1 4 5 8 11 12
          7      0.7659  0.9376    -0.0365      6  1 3 4 5 8 11 12
          8      0.7089  0.9448    -0.0570      7  1 3 4 5 8 10 11 12
          9      0.6113  0.9511    -0.0976      7  1 2 3 4 5 8 10 11 12
         10      0.4003  0.9549    -0.2110      7  1 2 3 4 5 6 8 10 11 12
         11     -0.1989  0.9551    -0.5991      7  1 2 3 4 5 6 8 9 10 11 12
         12     -2.5203  0.9551    -2.3214      7  1 2 3 4 5 6 7 8 9 10 11 12

Prune method "backward" penalty 3 nprune null: selected 3 of 12 terms, and 2 of 10 preds
After pruning pass GRSq 0.828 RSq 0.879
Selected 3 of 12 terms, and 2 of 10 predictors
Termination condition: GRSq -10 at 12 terms
Importance: disp, carb, cyl-unused, hp-unused, drat-unused, wt-unused, qsec-unused, vs-unused, am-unused, ...
Number of terms at each degree of interaction: 1 1 1
GCV 6.436613    RSS 135.9734    GRSq 0.828338    RSq 0.8792471
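The GCV and GRSq numbers in that trace can be reproduced by hand, which helps demystify them. The sketch below (Python, purely to check the arithmetic) assumes the formula earth documents: GCV = (RSS/n) / (1 - enp/n)^2, where enp is the number of terms plus a per-knot penalty (the "penalty 3" shown in the pruning line), and GRSq = 1 - GCV/GCV_null. The total sum of squares of mtcars$mpg is about 1126.047.

```python
# Reproduce the GCV / GRSq / RSq arithmetic from the earth() trace above.
n = 32            # rows in mtcars
rss = 135.9734    # RSS printed in the trace
tss = 1126.047    # total sum of squares of mtcars$mpg (intercept-only RSS)
penalty = 3       # 'penalty 3' from the pruning line

def gcv(rss, nterms):
    # effective number of parameters: one per term plus penalty per knot pair
    enp = nterms + penalty * (nterms - 1) / 2
    return (rss / n) / (1 - enp / n) ** 2

gcv_model = gcv(rss, 3)     # the selected 3-term model
gcv_null = gcv(tss, 1)      # intercept-only model
grsq = 1 - gcv_model / gcv_null
rsq = 1 - rss / tss

# These match the printed GCV 6.436613, GRSq 0.828338 and RSq 0.8792471
# up to the rounding of the inputs.
print(gcv_model, grsq, rsq)
```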

#3

Hi @Max

Sorry for the delay in getting back to you. I am happy to submit it to the errata page if you need it.

I had a read of the other notes and the book that you linked to, and it certainly makes it clearer for me. Just a couple of final questions, if you have the time:

  • How do you determine the number of hinges? Is this ascertained through cross-validation?

  • The hinges are strictly for modelling linear data? So a curved line would not be possible, if I am understanding the concept correctly?

  • In the linked-to book you mention each hinge is determined by h(x) = xI(x > 0), where I is an indicator function that is "zero when x is greater than zero and zero otherwise". Looking at the example further down, should the indicator function be 1 when the x > 0 criterion is met?

  • Finally, unrelated to the topic: I notice at the top of the book you intend to publish it in paper form, but the page numbers will be different. Is it possible to reference the online book if I use some of your explanations/recommendations for feature engineering?

Thank you again for your time; it really is great when this stuff starts clicking.


#4

Usually, yes. caret does an external CV to do this, whereas most packages use the generalized cross-validation (GCV) statistic to make this determination.

They can do a piecewise linear approximation to nonlinear curves. If there is a significant nonlinear trend for a predictor, the algorithm will usually add repeated hinge features on that variable to approximate it well.
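A sketch of that piecewise approximation (Python with numpy, not earth's actual forward pass; the knots here are hand-picked rather than searched for, but the basis is the same):

```python
# Repeated hinges on ONE predictor give a broken-line approximation to a
# smooth curve. We place three knots by hand and fit the coefficients by
# ordinary least squares, then compare against a plain straight-line fit.
import numpy as np

x = np.linspace(0.0, 10.0, 200)
y = x ** 2  # a clearly nonlinear target

knots = [2.5, 5.0, 7.5]  # hand-picked; earth would search for these
# design matrix: intercept, x, and one hinge column per knot
X = np.column_stack([np.ones_like(x), x] +
                    [np.maximum(0.0, x - k) for k in knots])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fit = X @ coef

# straight-line baseline uses only the intercept and x columns
line_coef, *_ = np.linalg.lstsq(X[:, :2], y, rcond=None)
line_fit = X[:, :2] @ line_coef

# the broken-line fit tracks the parabola far better than the straight line
print(np.mean((y - fit) ** 2), np.mean((y - line_fit) ** 2))
```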

Also, most implementations also test to see if a simple linear term does better than the best set of nonlinear features.

They usually encode them as a reflected pair, h(x - 3) and h(3 - x), where h(u) = max(0, u), so the math works on both sides of the knot.
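To see why the reflected pair covers both cases (Python sketch with the same example knot of 3): linear combinations of the two hinges can bend the fit independently on each side of the knot, and together they even recover the plain linear term.

```python
# Reflected hinge pair at knot 3: h(x - 3) = max(0, x - 3) and h(3 - x).
# Two identities show the pair spans both sides of the knot:
#   h(x - 3) - h(3 - x) = x - 3    (the plain linear term)
#   h(x - 3) + h(3 - x) = |x - 3|  (a V shape kinked at the knot)

def hinge(u):
    return max(0.0, u)

for x in [0.0, 3.0, 4.5, 10.0]:
    assert hinge(x - 3) - hinge(3 - x) == x - 3
    assert hinge(x - 3) + hinge(3 - x) == abs(x - 3)
print("identities hold")
```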

You can, but I would wait until the final version of the online book is finished. I don't think the links will change, but I can't be certain.

No problem!


#5

Thank you @Max for your answer. It really helped.