Confidence Intervals in Boosted Regression Trees

Hello,
I'm working with boosted regression trees and I'm trying to calculate confidence intervals using bootstrap sampling.
The script I'm using is the following:

fun=function(x){
  x=gbm.step(data=Dados, gbm.x = 2:21, gbm.y = 1,
             family = "gaussian", tree.complexity = 2,
             learning.rate = 0.001, bag.fraction = 0.5)
  y=plot.gbm(x,i.var = 9,return.grid = TRUE)
  return(y)
}

library(boot)
air.boot <- boot(Dados, fun, R = 2, sim = "parametric")

After running this, the following error appears:

Error in t.star[r, ] <- res[[r]] : 
  incorrect number of subscripts on matrix

I was hoping that the output of this script would be partial dependence plots with confidence intervals already included, or at least a list of those values.
I appreciate any help, thank you in advance!

Hi @Beatriz!

I'm not familiar with the gbm.step function, but what seems odd to me is that your function fun never uses its argument x. Is that intentional?

Also, note that this is not the way boot works. The dataset is supposed to change in each replication, and your code gives no such possibility. Here's the description of the statistic argument of boot, where you're using fun:

statistic A function which when applied to data returns a vector containing the statistic(s)
of interest. When sim = "parametric", the first argument to statistic must
be the data. For each replicate a simulated dataset returned by ran.gen will be
passed. In all other cases statistic must take at least two arguments. The first
argument passed will always be the original data. The second will be a vector
of indices, frequencies or weights which define the bootstrap sample. Further,
if predictions are required, then a third argument is required which would be a
vector of the random indices used to generate the bootstrap predictions. Any
further arguments can be passed to statistic through the ... argument.
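
To make the description above concrete, here is a minimal sketch of an ordinary (non-parametric) bootstrap with boot. It is not your model: the toy data frame, the prediction grid, and the names toy, grid and stat_fun are all made up for illustration, and lm stands in for gbm.step so that the example runs quickly. The points to notice are that the statistic takes the data and a vector of indices, refits the model on the resampled rows, and returns a plain numeric vector of the same length every time (here, predictions on a fixed grid) rather than a data frame.

```r
library(boot)

set.seed(1)

# Hypothetical toy data: one predictor, one response
toy <- data.frame(x = runif(100, 0, 10))
toy$y <- 2 + 0.5 * toy$x + rnorm(100)

# Fixed grid of predictor values, shared by every bootstrap replicate
grid <- data.frame(x = seq(0, 10, length.out = 5))

# statistic: first argument is the original data, second the resampled row indices
stat_fun <- function(data, indices) {
  d <- data[indices, , drop = FALSE]        # the bootstrap sample
  fit <- lm(y ~ x, data = d)                # stand-in for the real model fit
  as.numeric(predict(fit, newdata = grid))  # numeric vector, one value per grid point
}

b <- boot(toy, stat_fun, R = 200)

# Percentile intervals computed column by column from the replicates in b$t
ci <- t(apply(b$t, 2, quantile, probs = c(0.025, 0.975)))
result <- cbind(grid, fit = b$t0, lower = ci[, 1], upper = ci[, 2])
print(result)
```

With a boosted model, the same pattern would presumably apply: fit on data[indices, ] inside the statistic and return only the response column of the partial dependence grid as a vector, taking care that the grid is the same for every replicate.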

Another point is that your code fails on its own. The call gbm.step(data=Dados, gbm.x = 2:21, gbm.y = 1, family = "gaussian", tree.complexity = 2, learning.rate = 0.001, bag.fraction = 0.5) leads to errors by itself when run on the dataset you provided. Please check that function call.

It would be helpful for people in this community if you could provide a REPRoducible EXample (reprex) of your problem. It gives more specifics and helps others understand what you are facing.

If you don't know how to do it, take a look at this thread:

I read the description of the statistic argument of boot and was confused; I thought the function should be written the same way as when running the gbm model normally.
The results are different each time these models run, though, and I would like to be able to apply the boot function the way it is meant to be used.

data.frame(stringsAsFactors=FALSE,
             Big = c(0, 0, 0.333333333333333, 0, 0.222222222222222),
      Tree_Cover = c("O", "MC", "SO", "SO", "SO"),
    Shrubs_Cover = c("SC", "SO", "MC", "MC", "MC"),
     Grass_Cover = c("O", "MC", "SO", "SO", "SO"),
      Naked_Soil = c(3, 3, 2, 4, 4),
     Tree_Height = c("L", "L", "L", "S", "L"),
   Shrubs_Height = c("T", "H", "H", "T", "H"),
    Grass_Height = c("L", "L", "L", "L", "L"),
         Class_1 = c(0.117021276595745, 0.0957446808510638, 0.448275862068966,
                     0.241379310344828, 0.174418604651163),
         Class_2 = c(1.27659574468085, 0.98936170212766, 0.436781609195402, 1,
                     0.930232558139535),
          ASPECT = c(249.43603515625, 180, 331.714447021484, 234.409713745117,
                     352.326171875),
             LST = c(35.2816772460938, 34.1432838439941, 33.1653289794922,
                     31.4487133026123, 32.2971000671387),
      STD_NDVI_B = c(0.01500074858112, 0.014454394578934, 0.011638629995286,
                     0.013905543973669, 0.019954839814454),
       SHANNON_B = c(0.689610832538761, 0.346549997727076, 0.692177726555679,
                     0.071274728544297, 0.425752818584442),
    TREE_COVER_B = c(18.25, 40.5384615384615, 40.28, 48.2, 37.36),
         SLOPE_B = c(4.79538814510618, 9.47576512893041, 14.4894498870486,
                     6.86611137219838, 17.1000535033998),
           DEM_B = c(37.5714285714286, 33.4761904761905, 46.6190476190476,
                     71.6666666666667, 73.7142857142857),
          NDVI_B = c(0.262836583350834, 0.307793959190971, 0.29465841149029,
                     0.25947109727483, 0.239990872889757),
              DB = c(84.9651798863, 91.8427150336, 94.7049892021,
                     96.0842022325, 8.1394943871),
              OB = c(13.466707799, 1.1228884385, 1.9836147833, 0.010425148,
                     11.9405545364),
               G = c(0.0018932598, 5.4682419431, 1.7454539994, 2.3386234488,
                     78.3532818967)
)
