Two-stage modeling example in "Feature Engineering..." (Kuhn & Johnson)

I'm reading the excellent book Feature Engineering and Selection (Kuhn & Johnson) and am confused by the Ames housing example showing how to use two-stage modeling when building models with interactions (where there are lots of base variables).

High level, I understand the idea as:

  1. Identify which base predictors are important (e.g., by keeping the predictors that remain after fitting a lasso regression)
  2. Create all pairwise interactions from the selected variables
  3. Feed the base predictors and interactions into another model that again does variable selection (e.g., lasso) to get the final model

What I am confused by is how modeling the error in the second model-building stage (as described at the top of the example) fits into the Ames modeling example. Was this just an explanatory note? It seems like, in both step 1 and step 3, the models would have Sale_Price as the target, rather than Sale_Price in the former and the error in the latter -- correct?

E.g., for the Ames data, say we are predicting Sale_Price and have 6 initial variables: Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Lot_Area + Year_Sold

Step 1:
Build a lasso model with the 6 base variables as input. Say the lasso selects 3 variables: Year_Built, Year_Sold, and Lot_Area.
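
Just to make my reading concrete, here's a minimal sketch of how I picture this step (my own code, not the book's; it assumes an `ames` data frame with these columns, and I've left Sale_Price untransformed for simplicity even though I believe the book works on the log scale):

```r
# Sketch of Step 1: lasso on the base predictors only
library(glmnet)

x_base <- model.matrix(
  ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Lot_Area + Year_Sold,
  data = ames
)[, -1]               # drop the intercept column
y <- ames$Sale_Price  # untransformed here; the book may use log Sale_Price

cv_main <- cv.glmnet(x_base, y, alpha = 1)   # alpha = 1 -> lasso
coef(cv_main, s = "lambda.min")              # nonzero rows = "selected" predictors
```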

Step 2:
Create interactions (based on the strong heredity principle): Year_Built*Year_Sold, Year_Sold*Lot_Area, Year_Built*Lot_Area

Step 3 (HERE'S WHERE I'M CONFUSED):

Build a new lasso model using the selected main effects and corresponding interactions, with Sale_Price as the target? I.e., the model:

Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Lot_Area + Year_Sold + Year_Built*Year_Sold + Year_Sold*Lot_Area + Year_Built*Lot_Area

(This seems to be what is done in the [code](https://github.com/topepo/FES/blob/master/07_Detecting_Interaction_Effects/7_04_The_Brute-Force_Approach_to_Identifying_Predictive_Interactions/ames_glmnet.R).)
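
In code, I picture something like the following (again just my sketch of that reading, reusing `ames` and `y` from the Step 1 sketch above, not the repo's actual script):

```r
# Sketch of Step 3 as I read it: one lasso with Sale_Price as the target,
# main effects plus the hereditary interactions penalized together
library(glmnet)

x_full <- model.matrix(
  ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Lot_Area + Year_Sold +
    Year_Built:Year_Sold + Year_Sold:Lot_Area + Year_Built:Lot_Area,
  data = ames
)[, -1]

cv_full <- cv.glmnet(x_full, y, alpha = 1)
coef(cv_full, s = "lambda.min")
```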

OR is there a step where the resulting interaction terms are modeled against the error, per the opening example of this section and the comment at the bottom about modeling the error in the classification context? E.g.:

error_mod1 = Sale_Price - pred_mod1

Build lasso regression for:
error_mod1 ~ Year_Built*Year_Sold + Year_Sold*Lot_Area + Year_Built*Lot_Area

...and then some follow-up step...?
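
For clarity, this alternative reading would look roughly like this (again my own sketch, reusing `x_base`, `y`, and `cv_main` from the Step 1 sketch above):

```r
# Sketch of the residual version I'm asking about
library(glmnet)

pred_mod1  <- predict(cv_main, newx = x_base, s = "lambda.min")  # stage-1 predictions
error_mod1 <- y - as.vector(pred_mod1)                           # stage-1 residuals

x_int <- model.matrix(
  ~ Year_Built:Year_Sold + Year_Sold:Lot_Area + Year_Built:Lot_Area,
  data = ames
)[, -1]

cv_err <- cv.glmnet(x_int, error_mod1, alpha = 1)  # lasso on the residuals
coef(cv_err, s = "lambda.min")
```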

My guess is that the former is correct, but I just wanted to make sure.

P.S. Let me know if the GitHub page for the book (or some other location) would be a better place to post this question.

So there's no universally right answer here, and I'll mostly come at this from a practical perspective.

Your approach (modeling the error) is reasonable and a good idea if the underlying model didn't do any type of intrinsic feature selection. So, if this were ordinary least-squares linear regression, you could make a strong argument that this is the best approach.

My preference for the second stage, when a regularized method is used, is to have both sets of terms regularized simultaneously. Here's why: suppose you modeled the error as a function of only the interactions. You would select out some (especially when the number of them is large), and this would help whittle them down. That's great until you have to put them back into a model with all of the other predictors. Now they get regularized in a completely different way, and the selection process might not choose the same interaction terms.

Basically, since the regularization term is a function of all of the model parameters, you want them regularized together as much as possible.
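
A toy way to see the difference, reusing the (hypothetical) `x_base`, `x_int`, and `y` objects from the sketches above:

```r
# Compare: interactions penalized alone against the residuals vs. everything
# penalized together in one lasso
library(glmnet)

err   <- y - as.vector(predict(cv.glmnet(x_base, y, alpha = 1),
                               newx = x_base, s = "lambda.min"))
stage <- cv.glmnet(x_int, err, alpha = 1)               # residual-only stage
joint <- cv.glmnet(cbind(x_base, x_int), y, alpha = 1)  # everything together

picked <- function(fit) {
  b <- coef(fit, s = "lambda.min")
  rownames(b)[as.vector(b != 0)]
}
# Interactions kept when fit against the residuals but dropped in the joint fit:
setdiff(intersect(picked(stage), colnames(x_int)), picked(joint))
```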

You can use this argument to suggest that we shouldn't do a second stage at all. I think we mentioned trying a single stage, but it selected an infeasibly huge number of interaction terms, so we came up with the more manual heuristic of the two-stage approach. This happened when I tried it on several examples.

(Aside: there is a paper out there about using the lasso to select interactions that contends with this. There is an R package, but the interface is so bad that we deliberately left it out (despite it being a great theoretical approach). Running our example through that package left me consulting the source code for a long time just to make sure that I was processing the results correctly (and I've been doing this a while).)

Either is fine with me. Next time you might want to use @Max so I'll see it sooner (theoretically).


Thanks @Max, very helpful!

In the model-the-error case (e.g., for plain linear regression), I assume the final step (if ultimately making predictions and not just exploring residuals) would then be estimating a model with both the selected main effects and the selected interactions. E.g.:

Step 1:
Use the feasible solution algorithm (FSA), recursive feature elimination (RFE), or some other method to select among the base variables:
Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Lot_Area + Year_Sold

Say again Year_Built, Year_Sold, and Lot_Area are selected.

Step 2:
Create interactions (e.g., based on the strong heredity principle): Year_Built*Year_Sold, Year_Sold*Lot_Area, Year_Built*Lot_Area

Step 3:
Use FSA, RFE, etc. to build a model of the error using the interactions:

error_mod1 = Sale_Price - pred_mod1
error_mod1 ~ Year_Built*Year_Sold + Year_Sold*Lot_Area + Year_Built*Lot_Area

Say only Year_Built*Year_Sold is selected.

Step 4:
Build simple linear regression for:
Sale_Price ~ Year_Sold + Lot_Area + Year_Built + Year_Built*Year_Sold
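
Put together, I picture something like this (a rough sketch only, using step() as a crude stand-in for FSA/RFE and assuming the same `ames` data frame as before, with no missing rows):

```r
# Step 1: select among the base predictors (step() standing in for FSA/RFE)
mod1 <- step(
  lm(Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area +
       Lot_Area + Year_Sold, data = ames),
  trace = 0
)
# suppose Year_Built, Year_Sold, and Lot_Area survive

# Steps 2-3: model the stage-1 residuals with the hereditary interactions
error_mod1 <- resid(mod1)
mod_err <- step(
  lm(error_mod1 ~ Year_Built:Year_Sold + Year_Sold:Lot_Area +
       Year_Built:Lot_Area, data = ames),
  trace = 0
)
# suppose only Year_Built:Year_Sold survives

# Step 4: one ordinary regression with the kept main effects and interaction
mod_final <- lm(Sale_Price ~ Year_Sold + Lot_Area + Year_Built +
                  Year_Built:Year_Sold, data = ames)
summary(mod_final)
```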

Closing note:
In a lot of these cases (with many, many variables), though, I figure I'd lean heavily towards using penalized regression techniques (as covered in the prior post) anyway.

