Rule Fit with Tidymodels `Rules`

john.smith · September 2, 2022, 1:33pm

Hi,

I am following along with the rules article on tidymodels. There is a section that talks about redundant rules. More specifically:

Looking at the rules, there are examples of some rules, such as:

( bill_depth_mm >= 14.1499996 ) & 
( flipper_length_mm <  227 ) & 
( flipper_length_mm <  228.5 ) & 
( flipper_length_mm >= 197.5 ) & 
( flipper_length_mm >= 224.5 )

which can be simplified to have fewer conditions:

( bill_depth_mm >= 14.1499996 ) & 
( flipper_length_mm <  227 ) & 
( flipper_length_mm >= 224.5 )

I was curious. Is there a way to automatically find the redundant rules and remove them?

Thank you for your time
Cheers
John

nirgrahamuk · September 2, 2022, 2:56pm

I'm not aware of any one-liner that solves this easily, but it seems to be a question of writing code that orders the conditions by the number at the end of each one and by which inequality symbol is present. With some effort you could write a function to do this.
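
For what it's worth, here is a minimal sketch of what such a function might look like in base R. It is only an illustration: `simplify_rule()` is a hypothetical helper, not part of tidymodels or rules, and it assumes every condition has the form `( variable operator number )` joined by `&`, exactly as in the rule quoted above. For each variable it keeps the largest lower bound and the smallest upper bound.

```r
# Hypothetical helper (not part of any package): collapse redundant numeric
# bounds inside a single rule written as "( var op number ) & ( ... )".
simplify_rule <- function(rule) {
  # Split the conjunction into conditions and strip parentheses
  conds <- strsplit(rule, "&", fixed = TRUE)[[1]]
  conds <- trimws(gsub("[()]", "", conds))
  conds <- conds[nzchar(conds)]

  # Each condition is assumed to be "variable operator number"
  parts <- do.call(rbind, strsplit(conds, "[[:space:]]+"))
  parsed <- data.frame(
    var = parts[, 1],
    op  = parts[, 2],
    val = as.numeric(parts[, 3])
  )
  parsed$lower <- parsed$op %in% c(">", ">=")

  # Within each variable and direction, keep only the tightest bound
  groups <- split(parsed, list(parsed$var, parsed$lower), drop = TRUE)
  kept <- lapply(groups, function(d) {
    best <- if (d$lower[1]) max(d$val) else min(d$val)
    d <- d[d$val == best, , drop = FALSE]
    # On a tie, the strict inequality is the tighter condition
    strict <- d$op %in% c("<", ">")
    if (any(strict)) d[strict, , drop = FALSE][1, ] else d[1, ]
  })

  out <- do.call(rbind, kept)
  rownames(out) <- NULL
  out[, c("var", "op", "val")]
}

rule <- paste(
  "( bill_depth_mm >= 14.1499996 ) &",
  "( flipper_length_mm <  227 ) &",
  "( flipper_length_mm <  228.5 ) &",
  "( flipper_length_mm >= 197.5 ) &",
  "( flipper_length_mm >= 224.5 )"
)

simplified <- simplify_rule(rule)
print(simplified)
```

The result is a data frame with one row per remaining condition. A simplified conjunction selects exactly the same rows as the original one, so this only changes how a single rule is written, not what it matches.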

Max · September 5, 2022, 11:22pm

The final model is a (generalized) linear model so, although the rules have a lot of overlap, they are not redundant.

Each has a model coefficient, so using the same variable several times with different cut-points models a 1 degree of freedom spline function.

This is very similar to what MARS does and it enables your model to use nonlinear relationships between predictors and the outcome.

tl;dr it's a feature and not a bug.
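
To illustrate the point about each term getting its own coefficient, here is a small sketch on simulated data. Everything in it is an assumption made for illustration: the data are randomly generated (not the penguins data), the slope and jump sizes are made up, and only the two cut points 197.5 and 224.5 are borrowed from the rule quoted earlier.

```r
set.seed(123)

# Simulated predictor and outcome; the outcome jumps at both cut points
n <- 300
x <- runif(n, min = 170, max = 234)
y <- 3000 + 10 * (x - 170) +
  400 * (x >= 197.5) +
  600 * (x >= 224.5) +
  rnorm(n, sd = 100)
sim <- data.frame(x = x, y = y)

# Linear model with the original predictor plus two overlapping indicators
fit <- lm(y ~ x + I(x >= 197.5) + I(x >= 224.5), data = sim)

# The overlapping indicators are distinct columns of the design matrix,
# so the matrix has full column rank and every term can be estimated
X <- model.matrix(fit)
full_rank <- qr(X)$rank == ncol(X)
print(full_rank)

# One coefficient per term: intercept, slope, and one jump per cut point
print(coef(fit))
```

Although every observation with `x >= 224.5` also satisfies `x >= 197.5`, the two indicator columns differ for values between the cut points, which is why the linear model can give each one a separate coefficient.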

john.smith · September 6, 2022, 8:24am

Hi @Max,

Thanks very much for taking the time out to answer this question.
I'm not sure I follow. If the algorithm generates the rule below, aren't two of its conditions redundant?

( bill_depth_mm >= 14.1499996 ) & 
( flipper_length_mm <  227 ) & 
( flipper_length_mm <  228.5 ) & 
( flipper_length_mm >= 197.5 ) & 
( flipper_length_mm >= 224.5 )

I ask because, in my use case, we have to fill out an Excel sheet for a tool that picks items for inspection. We only have one slot for each attribute, so in the penguin example above we would only be able to enter one value each for bill_depth_mm and flipper_length_mm.

Again thank you for your time

All the best

Max · September 7, 2022, 12:49pm

These might be redundant if they were used in a tree-based model, but RuleFit adds them to a linear model along with the original predictor. This allows it to model the predictors used in the splits in nonlinear ways.

It ends up being similar to a really crude spline model. Similar approaches are discussed in Feature Engineering and Selection.

For the example above, a linear regression shows the nonlinearity, although it is not very strong in this case:

library(broom)
library(ggplot2)

data(penguins, package = "modeldata")

# Keep only rows with no missing values
penguins <- penguins[complete.cases(penguins), ]

# The original predictor plus one indicator term per condition in the rule
f <- body_mass_g ~ flipper_length_mm + I( flipper_length_mm <  227 ) + 
  I( flipper_length_mm <  228.5 ) + I( flipper_length_mm >= 197.5 ) + 
  I( flipper_length_mm >= 224.5 )

pen_fit <- lm(f, data = penguins)

# Predict over a fine grid of flipper lengths to trace the fitted curve
grid <- data.frame(flipper_length_mm = seq(170, 234, by = 1 / 4))
pen_res <- augment(pen_fit, newdata = grid)

# Raw data, fitted curve (red), and the four cut points (dotted lines)
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point(alpha = 1 / 2) + 
  geom_line(data = pen_res, aes(y = .fitted), col = "red") +
  geom_vline(xintercept = c(227, 228.5, 197.5, 224.5), lty = 3) +
  theme_bw()

Created on 2022-09-07 with reprex v2.0.2

john.smith · September 8, 2022, 5:26am

Ah ok, I see now.

Thanks very much for taking the time to answer.
