Poisson model decision tree - offsets

Example:
We have two variables:

  1. number of hours
  2. number of injuries

I'm using rpart to fit a Poisson decision tree:

decision_tree <- rpart(formula = cbind(number_of_hours, number_of_injuries) ~ ., .....)

In this case, will the output of the tree show the number of hours or the number of injuries at each node?

Does this help?

library(rpart)
#> Warning: package 'rpart' was built under R version 3.5.3
library(rpart.plot)
#> Warning: package 'rpart.plot' was built under R version 3.5.3
set.seed(3465)
# Simulate Poisson counts with lambda = 1 or 10:
# A+C and B+D have lambda = 1; A+D and B+C have lambda = 10
df <- data.frame(Time = rep(1, 800), Cnt = c(rpois(200, 1), rpois(200, 10),
                                            rpois(200, 10), rpois(200, 1)),
                 Attr1 = rep(c("A", "B"), each = 400),
                 Attr2 = c(rep(c("C", "D"), each = 200), rep(c("C", "D"), each = 200)))
#Fit Cnt only
tree1 <- rpart(Cnt ~ Attr1 + Attr2, data = df, method = "poisson")
rpart.plot(tree1)

## Fit Time and Cnt (observation time in column 1, event count in column 2)
tree2 <- rpart(cbind(Time, Cnt) ~ Attr1 + Attr2, data = df, method = "poisson")
rpart.plot(tree2)


# Double the Time value
df$Time <- 2
## Fit Cnt only
tree3 <- rpart(Cnt ~ Attr1 + Attr2, data = df, method = "poisson")
rpart.plot(tree3)

## Fit Time and Cnt (observation time in column 1, event count in column 2)
tree4 <- rpart(cbind(Time, Cnt) ~ Attr1 + Attr2, data = df, method = "poisson")
rpart.plot(tree4, extra = 1)

Created on 2019-06-01 by the reprex package (v0.2.1)

This is really helpful.

Does this mean:

  • the tree1 node of "5.4" is a mean of 5.4 per 1 unit of time
  • the tree4 node of "2.7" is a mean of 5.4 per 2 units of time

Since tree1 is fit against Cnt alone, with no explicit reference to time, I would say the 5.4 is in units of counts per bin. Notice that tree3 is identical to tree1: Time has been doubled by the time tree3 is fit, but that fit has no knowledge of Time.

For tree4, the fit does know about the time, and the 2.7 is a rate per unit time. That is why the value has halved relative to tree2: Cnt is unchanged but Time has doubled.
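
One way to check this reading without relying on the plots is to compare the fitted rates stored in the frame component of each tree. This is only a sketch, not the data from the reprex above: the data frames d1 and d2, the column Grp, and the seed are made up for illustration. It assumes that the first row of fit$frame is the root node and that yval holds the estimated rate.

```r
library(rpart)

set.seed(1)
n <- 400

# Hypothetical data: two groups with event rates of about 1 and 10
d1 <- data.frame(
  Time = 1,
  Cnt  = c(rpois(n / 2, 1), rpois(n / 2, 10)),
  Grp  = factor(rep(c("low", "high"), each = n / 2))
)

# Same counts, but every observation now covers 2 units of time
d2 <- d1
d2$Time <- 2

# Count-only response: the fit never sees Time
fit_cnt <- rpart(Cnt ~ Grp, data = d1, method = "poisson")

# Two-column response: observation time first, event count second
fit_t1 <- rpart(cbind(Time, Cnt) ~ Grp, data = d1, method = "poisson")
fit_t2 <- rpart(cbind(Time, Cnt) ~ Grp, data = d2, method = "poisson")

# Estimated rate at each node, side by side
rates <- data.frame(
  count_only = fit_cnt$frame$yval,
  time_1     = fit_t1$frame$yval,
  time_2     = fit_t2$frame$yval
)
print(rates)
```

In this sketch the count_only and time_1 columns should match, and the time_2 column should be half of them, which is the same pattern as tree1/tree2 versus tree4 above.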

How does the model know that it's count per unit time, and not time per unit count?

From page 43 of the vignette I linked earlier:

The y variable for Poisson partitioning may be a two column matrix containing the observation time in column 1 and the number of events in column 2, or it may be a vector of event counts alone.
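
To make the column order concrete, here is a small sketch using hypothetical Hours and Injuries columns (named after the original question, not taken from real data). It puts the observation time first and the event count second, then compares the root-node rate with the pooled rate sum(Injuries) / sum(Hours) and with an intercept-only Poisson glm that uses log(Hours) as an offset. If column 1 is being read as exposure time, all three numbers should agree. The assumption that fit$frame$yval[1] is the root-node rate is mine.

```r
library(rpart)

set.seed(2)

# Hypothetical exposure data: hours worked and injuries per worker
d <- data.frame(
  Hours = runif(300, min = 0.5, max = 3),
  Grp   = factor(rep(c("a", "b", "c"), times = 100))
)
# Group "c" is given a higher injury rate per hour than "a" and "b"
true_rate  <- ifelse(d$Grp == "c", 4, 1)
d$Injuries <- rpois(nrow(d), lambda = true_rate * d$Hours)

# Column 1 = observation time, column 2 = number of events
fit <- rpart(cbind(Hours, Injuries) ~ Grp, data = d, method = "poisson")

# Rate reported at the root node of the tree
root_rate <- fit$frame$yval[1]

# Pooled events per unit time, computed by hand
pooled_rate <- sum(d$Injuries) / sum(d$Hours)

# Intercept-only Poisson regression with log(Hours) as the offset
glm_fit  <- glm(Injuries ~ 1 + offset(log(Hours)), family = poisson, data = d)
glm_rate <- unname(exp(coef(glm_fit)))

print(c(rpart_root = root_rate, pooled = pooled_rate, glm_offset = glm_rate))
```

Writing the response as cbind(Injuries, Hours) instead would hand rpart the injury counts as the observation time, so the position in the matrix is the only thing that tells the model which column is which.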
