rpart$where output interpretation

  1. How do you interpret the rpart$where output?

  2. How do you interpret rpart output when there is an offset variable in a Poisson distribution (for example, number of injuries and hours worked)?

  • Example:
    tree <- rpart(formula = cbind(hours_worked, number_of_injuries), data = training_set, method = "poisson"....)

How is the "formula" interpreted in this case? How would the output of the decision tree differ if just one target variable were used in the "formula"?

The where component of the rpart() output tells you in which leaf of the tree each observation ended up. Leaves are identified by their row number in the rpart$frame data frame. Below is a simple example. Each observation is in category A or B. The two predictor variables are X1 and X2. If both of them have high values or both have low values, the observation has an Outcome of A. If one is high and the other is low, the Outcome is B.

library(rpart)
#> Warning: package 'rpart' was built under R version 3.5.3
library(rpart.plot)
#> Warning: package 'rpart.plot' was built under R version 3.5.3
suppressPackageStartupMessages(library(dplyr))
set.seed(34562)
df <- data.frame(Outcome = rep(c("A", "B"), each = 10),
                 X1 = c(rnorm(8, 1, 1), rnorm(2, 6, 1), 
                        rnorm(8, 6, 1), rnorm(2, 1, 1)),
                 X2 = c(10,10,10,10, 10,10,10,10,20, 20, 10,10,10,10, 10,10,10,10,20,20))
tree <- rpart(formula = Outcome ~ X1 + X2, data = df, method = "class",
              control = rpart.control(minsplit = 5))
rpart.plot(tree, extra = 1)

tree$frame %>% select(var, n, dev)
#>      var  n dev
#> 1     X1 20  10
#> 2     X2 10   2
#> 4 <leaf>  8   0
#> 5 <leaf>  2   0
#> 3     X2 10   2
#> 6 <leaf>  2   0
#> 7 <leaf>  8   0
tree$where
#>  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
#>  3  3  3  3  3  3  3  3  6  6  7  7  7  7  7  7  7  7  4  4

Created on 2019-05-31 by the reprex package (v0.2.1)
Looking at the selected columns of tree$frame, there are four leaves, in rows 3, 4, 6 and 7. n is the number of observations in the node and dev is the number of incorrectly assigned observations. Don't be confused by the numbers printed at the left of each row, in the order 1, 2, 4, 5, 3, 6, 7. Those are node numbers (the row names of tree$frame), not row positions! Looking at the output of tree$where, you can see that the first 8 observations ended up in row 3. That is the leaf with low X1 and low X2 at the far left of the plot. You can tell that by looking at the values in df. The next two observations ended up in row 6 of tree$frame: the leaf with high X1 and high X2, which holds 2 observations.
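If it helps, the lookup described above can be done in code rather than by eye. This is a minimal sketch that rebuilds the same toy data as the reprex; the names leaf_info and lookup are just illustrative choices of mine.

```r
library(rpart)

# Same toy data and model as in the reprex above
set.seed(34562)
df <- data.frame(Outcome = rep(c("A", "B"), each = 10),
                 X1 = c(rnorm(8, 1, 1), rnorm(2, 6, 1),
                        rnorm(8, 6, 1), rnorm(2, 1, 1)),
                 X2 = rep(c(10, 20, 10, 20), times = c(8, 2, 8, 2)))
tree <- rpart(formula = Outcome ~ X1 + X2, data = df, method = "class",
              control = rpart.control(minsplit = 5))

# tree$where is a row position in tree$frame, so it can index the frame directly
leaf_info <- tree$frame[tree$where, c("var", "n", "dev")]

# The row names of tree$frame are the node numbers (1, 2, 4, 5, 3, 6, 7 above)
node_id <- as.integer(rownames(tree$frame))[tree$where]

# One line per observation: frame row, node number, and size of its leaf
lookup <- data.frame(obs       = seq_len(nrow(df)),
                     frame_row = unname(tree$where),
                     node_id   = node_id,
                     leaf_n    = leaf_info$n)
head(lookup, 10)

# How many observations of each Outcome landed in each frame row
table(frame_row = lookup$frame_row, Outcome = df$Outcome)
```

Every row picked out by tree$where should have var equal to <leaf>, which is a quick way to check that you are indexing by row position and not by node number.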

Very helpful, thank you! Any insights about question #2?

I have never worked with rpart poisson models and I cannot address your second question. There is a section on Poisson Regression in the package vignette but I can't say how useful that would be.
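For what it is worth, here is a sketch based only on my reading of the package documentation, not on experience. As I understand it, with method = "poisson" the response can be a two-column matrix with the observation time first and the number of events second, and a single-column response is treated as event counts with the same exposure for every row. The data below are simulated, and the names injuries_df, hours, night, age and injuries are all made up for illustration.

```r
library(rpart)

# Hypothetical data: one row per worker, with exposure (hours) and an event count
set.seed(1)
n <- 200
injuries_df <- data.frame(hours = runif(n, 500, 2000),
                          night = rbinom(n, 1, 0.5),
                          age   = round(runif(n, 20, 60)))

# Assumed for the simulation: night-shift workers have a higher injury rate per hour
rate <- ifelse(injuries_df$night == 1, 0.004, 0.001)
injuries_df$injuries <- rpois(n, rate * injuries_df$hours)

# Two-column response: exposure time first, event count second
fit_rate <- rpart(cbind(hours, injuries) ~ night + age,
                  data = injuries_df, method = "poisson")

# Single-column response: counts only, so hours worked are ignored
fit_count <- rpart(injuries ~ night + age,
                   data = injuries_df, method = "poisson")

# In the two-column fit, yval is an estimated rate (injuries per hour) for each node
fit_rate$frame[, c("var", "n", "dev", "yval")]

# In the single-column fit, yval is an estimated count per row instead
fit_count$frame[, c("var", "n", "dev", "yval")]
```

If this reading is right, the practical difference is that the two-column fit splits on differences in injury rate per hour, while the single-column fit splits on differences in raw counts, so workers with more hours can look riskier just because they were exposed longer.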
