I'm reading the excellent book Introduction to Feature Engineering... (Kuhn & Johnson) and am confused by the Ames housing data example, which shows how to use two-stage modeling when building models with interactions (where there are lots of base variables).
At a high level, I understand the idea as:
- Identify which base predictors are important (e.g. by identifying predictors remaining after using lasso regression)
- Create all pairwise interactions from selected variables
- Input base predictors and interactions into another model that will again do variable selection (e.g. lasso) to get the final model
What I am confused by is how modeling the error in the second model-building stage (as described at the top of the example) fits into the `ames` modeling example. Was this just an explanatory note? It seems that in both Step 1 and Step 3 the models would be built with a target of `Sale_Price`, rather than `Sale_Price` in the former and `Error` in the latter, correct?
E.g. for the `ames` data, say we are predicting `Sale_Price` and have 6 initial variables:
`Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Lot_Area + Year_Sold`
Step 1:
Build a lasso model with the 6 base variables as input. Say lasso selects 3 variables: `Year_Built`, `Year_Sold`, and `Lot_Area`.
Step 2:
Create interactions (based on the strong heredity principle): `Year_Built*Year_Sold`, `Year_Sold*Lot_Area`, `Year_Built*Lot_Area`.
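To make Step 2 concrete, here is a minimal Python sketch of how I understand it. The function name `pairwise_interactions` is my own and is not from the book or its code. It only builds term labels from the names of the selected variables and does no model fitting. Under strong heredity, an interaction is considered only when both of its parent main effects were selected, so the candidates are all pairs of the selected variables.

```python
from itertools import combinations


def pairwise_interactions(selected):
    """Return labels for all pairwise interactions among the selected main effects.

    Strong heredity: a pair is only a candidate if BOTH parents were selected,
    so pairing the selected variables with each other is sufficient.
    The "A*B" label just mirrors the notation used in this post.
    """
    return [f"{a}*{b}" for a, b in combinations(selected, 2)]


# Hypothetical Step 1 result, taken from the example in this post
selected = ["Year_Built", "Year_Sold", "Lot_Area"]
interactions = pairwise_interactions(selected)
print(interactions)
```

With 3 selected variables this gives the same 3 interaction terms listed above (possibly in a different order). In general, `k` selected variables give `k * (k - 1) / 2` candidate interactions.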
Step 3 (HERE'S WHERE I'M CONFUSED):
Build a new lasso model using the selected main effects and corresponding interactions with a target of `Sale_Price`? I.e. the model:
`Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Lot_Area + Year_Sold + Year_Built*Year_Sold + Year_Sold*Lot_Area + Year_Built*Lot_Area`
(This seems to be what is done in the [code](https://github.com/topepo/FES/blob/master/07_Detecting_Interaction_Effects/7_04_The_Brute-Force_Approach_to_Identifying_Predictive_Interactions/ames_glmnet.R).)
OR is there a step where the resulting interaction terms are modeled against the error, per the opening example of this section and the comment at the bottom about using modeling error in the classification context? E.g.
`error_mod1 = Sale_Price - pred_mod1`
Build a lasso regression for:
`error_mod1 ~ Year_Built*Year_Sold + Year_Sold*Lot_Area + Year_Built*Lot_Area`
... and then follow-up step...?
My guess is that the former is correct, but I wanted to make sure.
P.S. Let me know if the GitHub page for the book (or another location) would be a better place to post this question.