I'm reading the excellent book Introduction to Feature Engineering... (Kuhn & Johnson) and am confused by the example with the Ames housing data that shows how to use two-stage modeling when building models with interactions (for models that have lots of base variables).
At a high level, I understand the idea as:
- Identify which base predictors are important (e.g. by identifying predictors remaining after using lasso regression)
- Create all pairwise interactions from selected variables
- Input base predictors and interactions into another model that will again do variable selection (e.g. lasso) to get final model
What I am confused by is how the modeling of the error in the 2nd model building stage (as described at the top of the example) fits into the `ames` modeling example. Was this just an explanatory note -- as it seems like in both step 1 and 3 when the models are being built it would just have a target of `Sale_Price` in both cases, rather than `Sale_Price` in the former and `Error` in the latter, correct?
Using the `ames` data, say we are predicting `Sale_Price` and have 6 initial variables:
Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Lot_Area + Year_Sold
Step 1: Build a lasso model with the 6 base variables as input. Say lasso selects 3 variables.
Step 2: Create interactions among the selected variables (based on the strong heredity principle).
Step 3 (HERE'S WHERE I'M CONFUSED):
Build a new lasso model using the selected main effects and corresponding interactions, with a target of `Sale_Price`? I.e. the model:
Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Lot_Area + Year_Sold + Year_Built*Year_Sold + Year_Sold*Lot_Area + Year_Built*Lot_Area
(This seems to be what is done in the [code](https://github.com/topepo/FES/blob/master/07_Detecting_Interaction_Effects/7_04_The_Brute-Force_Approach_to_Identifying_Predictive_Interactions/ames_glmnet.R).)
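
To pin down this first reading, here is a minimal sketch in which both lasso fits use the same outcome as the target. It is Python rather than the book's R, and it runs on simulated data rather than the Ames data. Everything in it is a hypothetical stand-in, not the book's code: the `standardize` and `lasso` helpers, the `x0`..`x5` predictor names, and the fixed penalty `LAM`. The lasso is a small coordinate-descent implementation, not glmnet, and the penalty is fixed rather than tuned.

```python
import random
from itertools import combinations


def standardize(cols):
    """Center each column and scale it to unit variance."""
    out = []
    for c in cols:
        m = sum(c) / len(c)
        sd = (sum((v - m) ** 2 for v in c) / len(c)) ** 0.5
        out.append([(v - m) / sd for v in c])
    return out


def lasso(cols, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent on standardized columns.

    Minimizes (1/2n)*||y - Xb||^2 + lam*||b||_1 and returns b.
    """
    n, p = len(y), len(cols)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]  # residual with all coefficients at zero
    beta = [0.0] * p
    for _ in range(n_iter):
        for j, c in enumerate(cols):
            # Covariance of column j with the partial residual (column j added back).
            rho = sum(ci * ri for ci, ri in zip(c, resid)) / n + beta[j]
            # Soft-thresholding step; columns have unit variance, so no rescaling.
            new = max(abs(rho) - lam, 0.0) * (1 if rho > 0 else -1)
            if new != beta[j]:
                resid = [ri - ci * (new - beta[j]) for ci, ri in zip(c, resid)]
                beta[j] = new
    return beta


random.seed(0)
n = 400
names = [f"x{j}" for j in range(6)]  # hypothetical base predictors
base = [[random.gauss(0, 1) for _ in range(n)] for _ in names]
# Simulated outcome: three main effects plus one interaction (x0:x1).
y = [2.0 * base[0][i] + 1.5 * base[1][i] + 1.0 * base[2][i]
     + 1.5 * base[0][i] * base[1][i] + random.gauss(0, 0.5)
     for i in range(n)]

LAM = 0.3  # fixed penalty for illustration only (assumption, not tuned)

# Step 1: lasso on the base predictors, target = y.
b1 = lasso(standardize(base), y, LAM)
selected = [j for j, b in enumerate(b1) if b != 0.0]

# Step 2: pairwise interactions among the selected predictors only (strong heredity).
pairs = list(combinations(selected, 2))
inter = [[base[j][i] * base[k][i] for i in range(n)] for j, k in pairs]

# Step 3: lasso on selected main effects + interactions, target = y again.
terms = [names[j] for j in selected] + [f"{names[j]}:{names[k]}" for j, k in pairs]
cols = [base[j] for j in selected] + inter
b3 = lasso(standardize(cols), y, LAM)
final = {t: round(b, 3) for t, b in zip(terms, b3) if b != 0.0}

print("step 1 kept:", [names[j] for j in selected])
print("final model:", final)
```

In this version nothing about the error from the first fit enters the second fit; the first fit is only used to decide which interactions get created.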
OR is there a step where the resulting interaction terms are modeled against the error, per the opening example of this section and the comment at the bottom about using modeling error in the classification context? E.g.
error_mod1 = Sale_Price - pred_mod1
Build lasso regression for:
error_mod1 ~ Year_Built*Year_Sold + Year_Sold*Lot_Area + Year_Built*Lot_Area
... and then follow-up step...?
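
Written out, the two readings differ only in the response of the second fit. The notation is my own shorthand, not the book's: $S$ is the set of predictors selected by the first lasso, $\hat{y}^{(1)}_i$ is the prediction from that first lasso, and the last expression in the second reading (adding the two fits back together) is only my guess at what the follow-up step would be.

First reading (target is `Sale_Price` again):

$$
y_i = \beta_0 + \sum_{j \in S} \beta_j x_{ij} + \sum_{\substack{j < k \\ j,k \in S}} \gamma_{jk}\, x_{ij} x_{ik} + \varepsilon_i
$$

Second reading (target is the error from the first model):

$$
e_i = y_i - \hat{y}^{(1)}_i, \qquad
e_i = \sum_{\substack{j < k \\ j,k \in S}} \gamma_{jk}\, x_{ij} x_{ik} + \varepsilon^{*}_i, \qquad
\hat{y}_i = \hat{y}^{(1)}_i + \hat{e}_i
$$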
My guess is that the former is correct, but I just wanted to make sure.
P.S. Let me know if the GitHub page for the book (or another location) would be a better place to post this question.