issue with gbm.step

Hi, I'm running the function gbm.step() from the dismo package on the following dataset:

'data.frame':	427205 obs. of  17 variables:
 $ Nsamples  : num  92.2 92.2 92.2 92.2 92.2 92.2 92.2 92.2 92.2 92.2 ...
 $ R         : num  44.9 44.9 44.9 44.9 44.9 ...
 $ P50       : num  0.845 0.847 0.846 0.846 0.846 ...
 $ unc_reg   : Factor w/ 465 levels "ur1","ur10","ur100",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ HasRes    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ use       : num  9.54 9.54 9.54 9.54 9.54 ...
 $ acc       : num  1.49 1.49 1.49 1.49 1.49 ...
 $ tmp       : num  2.45 2.45 2.45 2.45 2.45 ...
 $ irg       : num  1.76 1.76 1.76 1.76 1.76 ...
 $ PgExt     : num  3.41 3.41 3.41 3.41 3.41 ...
 $ PgInt     : num  3.74 3.74 3.74 3.74 3.74 ...
 $ ChExt     : num  4.22 4.22 4.22 4.22 4.22 ...
 $ ChInt     : num  5.57 5.57 5.57 5.57 5.57 ...
 $ Ca        : num  3.68 3.68 3.68 3.68 3.68 ...
 $ veg       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Region.num: num  3 3 3 3 3 3 3 3 3 3 ...
 $ Region    : num  3 3 3 3 3 3 3 3 3 3 ...
 - attr(*, "na.action")= 'omit' Named int [1:33915] 85075 85078 85084 85088 85090 85091 85092 85095 85099 85101 ...
  ..- attr(*, "names")= chr [1:33915] "13379" "477" "101000" "14103" ...

using the following command

    myBRT2<- gbm.step(data = DFbrt_df2, 
                     gbm.x = ColIndexCov, 
                     gbm.y = ColIndexResp,
                     tree.complexity = 3,
                     learning.rate = 0.000005,
                     n.trees = 50,
                     family = "bernoulli",
                     n.folds = 4,
                     fold.vector = DFbrt_df2$Region.num,
                     step.size = 5,
                     verbose = F,
                     silent = T )

but it returns an empty vector. If I reduce the amount of data and run the function several times, only a few of the runs return a result; the rest still return an empty vector. Does anyone know why this happens?

Thanks for your help!

Showing str() of your data is not the best way to share a portion of it, because other users cannot easily rebuild the data from that output. Consider an alternative such as combining head() with dput().
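As a minimal sketch of that suggestion, assuming a small data frame named `df` (a made-up stand-in, not the real data): `dput(head(df))` prints R code that rebuilds the first rows, which others can paste straight into their session.

```r
# Hypothetical stand-in for the real data frame
df <- data.frame(
  HasRes = c(1, 0, 1, 1, 0, 1, 0, 0),
  use    = c(9.54, 9.69, 8.12, 7.95, 8.80, 9.01, 8.33, 7.64),
  Region = c(3, 3, 1, 2, 4, 1, 2, 4)
)

# Print a copy-pasteable definition of the first six rows
dput(head(df))

# Round trip: capture the dput() text and rebuild the object from it
txt     <- capture.output(dput(head(df)))
rebuilt <- eval(parse(text = txt))

# Should report that the rebuilt copy matches the original rows
all.equal(rebuilt, head(df))
```

The round trip at the end is only there to show that the pasted text is enough to recreate the rows; in a post, the `dput()` output alone is what matters.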

What are ColIndexCov and ColIndexResp?

I'm not sure why you say that the return value is an empty vector. The return value would be expected to be a complex structure of class gbm, wouldn't it?
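One way to pin down what actually came back is to test the result rather than inspect it by eye. My understanding, not verified against the dismo source here, is that gbm.step() returns NULL instead of raising an error when it gives up and asks for a smaller learning rate or step size. The helper name `describe_fit` below is made up for illustration.

```r
# Hypothetical helper: classify whatever a gbm.step() call returned
describe_fit <- function(fit) {
  if (is.null(fit)) {
    "NULL: no model was fitted"
  } else if (inherits(fit, "gbm")) {
    "gbm object: model was fitted"
  } else {
    paste("unexpected class:", paste(class(fit), collapse = ", "))
  }
}

# Stand-ins for the two outcomes discussed in this thread
describe_fit(NULL)
describe_fit(structure(list(), class = "gbm"))
```

Calling `describe_fit(myBRT2)` right after the fit would then say which of the two cases applies.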

Sorry, here is a better presentation of my data:

head(DFbrt_df2)
  Nsamples     R       P50 unc_reg HasRes      use      acc      tmp      irg    PgExt    PgInt    ChExt
1     92.2 44.92 0.8449981     ur1      1 9.535665 1.491690 2.447730 1.761640 3.408592 3.739075 4.219147
2     92.2 44.92 0.8468543     ur1      1 9.535665 1.491690 2.447730 1.761640 3.408592 3.739075 4.219147
3     92.2 44.92 0.8458765     ur1      1 9.535665 1.491690 2.447730 1.761640 3.408592 3.739075 4.219147
4     92.2 44.92 0.8460921     ur1      1 9.535665 1.491690 2.447730 1.761640 3.408592 3.739075 4.219147
5     92.2 44.92 0.8463000     ur1      1 9.535665 1.491690 2.447730 1.761640 3.408592 3.739075 4.219147
6     92.2 44.92 0.8455794     ur1      1 9.690327 1.286499 2.448238 1.854695 3.606503 3.882413 4.417110
     ChInt       Ca veg Region.num Region
1 5.572977 3.680983   0          3      3
2 5.572977 3.680983   0          3      3
3 5.572977 3.680983   0          3      3
4 5.572977 3.680983   0          3      3
5 5.572977 3.680983   0          3      3
6 5.758398 3.607605   0          3      3

ColIndexCov and ColIndexResp are the column indices of my covariates and my response variable, respectively. More precisely:

> colnames(DFbrt_df2)[ColIndexCov]
 [1] "use"   "acc"   "tmp"   "irg"   "PgExt" "PgInt" "ChExt" "ChInt" "Ca"    "veg"  
> ColIndexCov
 [1]  6  7  8  9 10 11 12 13 14 15
> colnames(DFbrt_df2)[ColIndexResp]
[1] "HasRes"
> ColIndexResp
[1] 5

Regarding your second question, my problem is exactly what you mentioned: the output should be a complex structure of class gbm, but instead it is an empty vector and I can't figure out why.

No, consider the question: how can I refer to this 'data' in my code when I investigate your issue?
That is why I recommended dput().

You're right. But I have 427205 observations, and the dput() output is too large to post here. Is there a better way to share my data?
You can use the following code to generate a dataset similar to mine, although it is simulated and therefore not the same as the real one:

    HasRes <- rbinom(n = 427205, size = 1, prob = 0.5217)
    use <- rnorm(n = 427205, mean = 8.735, sd = 1.158753)
    acc <- rnorm(n = 427205, mean = 1.637 , sd = 0.6275683)
    tmp <- rnorm(n = 427205, mean = 2.450  , sd = 0.01050098)
    irg <- rnorm(n = 427205, mean = 1.0245  , sd = 0.6520517)
    pgExt <- rnorm(n = 427205, mean = 3.039  , sd = 1.126698)
    pgInt <- rnorm(n = 427205, mean = 2.594  , sd = 1.534927)
    ChExt <- rnorm(n = 427205, mean = 3.569  , sd = 1.169632)
    ChInt <- rnorm(n = 427205, mean = 4.158  , sd = 1.447912)
    Ca <- rnorm(n = 427205, mean = 2.383  , sd = 1.189579)
    veg <- rnorm(n = 427205, mean = 0.5522   , sd = 0.6824301)
    Region <- sample(1:4, 427205, replace = T )
    DFbrt_df2 <-
      data.frame(
        HasRes = HasRes,
        use = use,
        acc = acc,
        tmp = tmp,
        irg = irg,
        pgExt = pgExt,
        pgInt = pgInt,
        ChExt = ChExt,
        ChInt = ChInt,
        Ca = Ca,
        veg = veg,
        Region.num = Region
      )
    ColIndexCov <- 2:(ncol(DFbrt_df2)-1)
    ColIndexResp <- 1

Thanks so much for your help!

This can be addressed by combining head() and dput() (or some form of sampling and dput()).
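A rough sketch of the sampling variant, assuming a large data frame `big` with a grouping column (both simulated here, not the real data): take a few rows from each region so every fold is represented, then dput() the small result.

```r
set.seed(1)  # make the sample reproducible

# Hypothetical stand-in for a large data frame with a grouping column
big <- data.frame(
  HasRes     = rbinom(1000, size = 1, prob = 0.5),
  use        = rnorm(1000, mean = 8.7, sd = 1.2),
  Region.num = sample(1:4, 1000, replace = TRUE)
)

# Draw up to 5 row indices from each region; sample.int() on the group
# length avoids the surprise sample() gives for a length-one vector
rows_by_region <- split(seq_len(nrow(big)), big$Region.num)
idx <- unlist(
  lapply(rows_by_region, function(rows) {
    rows[sample.int(length(rows), min(5, length(rows)))]
  }),
  use.names = FALSE
)

small <- big[idx, ]
dput(small)  # paste this output into the post
```

A sample this small will not necessarily reproduce the problem, so it is worth rerunning the failing call on `small` before posting it.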

However, in this case I think if you made the gbm.step() call non-silent, you would read potentially useful feedback

,silent = T )

It will probably advise adjusting step-size and/or learning rate.

I had already set silent to F previously. It advised adjusting the step size and learning rate, and I did so,
but the output is still an empty vector.
I shared my dataset on Dropbox at this link: https://www.dropbox.com/home/giulia%20brunelli?preview=DFbrt_df2.RData

    load("DFbrt_df2.RData")
    step_size <- round(seq.int(from = 1, to = 50, length.out = 10))
    learning_rate <- seq(from = 0.0005, to = 0.00000005, length.out = 10)
    myBRT <- list()
    for (i in 1:10) {
      myBRT[[i]]<-list()
      for (j in 1:10) {
        myBRT[[i]][[j]] <- gbm.step(data = DFbrt_df2, 
                                    gbm.x = ColIndexCov, 
                                    gbm.y = ColIndexResp,
                                    tree.complexity = 3,
                                    learning.rate = learning_rate[i],
                                    n.trees = 50,
                                    family = "bernoulli",
                                    n.folds = 4,
                                    fold.vector = DFbrt_df2$Region.num,
                                    step.size = step_size[j],
                                    verbose = F,
                                    silent = F )
      }
    }
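A side note on a loop like this one: in R, assigning NULL to a list element with `[[<-` removes the element rather than storing it, so any failed fit that comes back as NULL leaves no trace in the nested list. The sketch below records the outcome of each combination separately. `fit_once` is a made-up stub standing in for the gbm.step() call, and its rule that only small step sizes succeed is invented purely so the example runs without the data.

```r
# Hypothetical wrapper around the fitting call; replace the stub body
# with the real gbm.step() call, passing lr and ss through.
fit_once <- function(lr, ss) {
  # Stub: pretend only small step sizes produce a model
  if (ss <= 5) structure(list(lr = lr, ss = ss), class = "gbm") else NULL
}

# One row per (learning rate, step size) combination
grid <- expand.grid(learning_rate = c(5e-4, 5e-6, 5e-8),
                    step_size     = c(1, 5, 25))

fits <- vector("list", nrow(grid))   # preallocated, one slot per row
grid$status <- NA_character_

for (k in seq_len(nrow(grid))) {
  res <- tryCatch(
    fit_once(grid$learning_rate[k], grid$step_size[k]),
    error = function(e) e            # keep the error, do not stop the loop
  )
  grid$status[k] <- if (inherits(res, "error")) {
    "error"
  } else if (is.null(res)) {
    "NULL"
  } else {
    "fitted"
  }
  # Only store real fits; assigning NULL would drop the slot
  if (grid$status[k] == "fitted") fits[[k]] <- res
}

grid
```

The `status` column then shows at a glance which combinations produced a model, which returned NULL, and which failed with an error.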

I can't access Dropbox, as it's blocked on my corporate LAN.
However, I used your original code, modified to have an adjustable sample size:

library(dismo)
sizeparm <- 42720 
HasRes <- rbinom(n = sizeparm, size = 1, prob = 0.5217)
use <- rnorm(n = sizeparm, mean = 8.735, sd = 1.158753)
acc <- rnorm(n = sizeparm, mean = 1.637 , sd = 0.6275683)
tmp <- rnorm(n = sizeparm, mean = 2.450  , sd = 0.01050098)
irg <- rnorm(n = sizeparm, mean = 1.0245  , sd = 0.6520517)
pgExt <- rnorm(n = sizeparm, mean = 3.039  , sd = 1.126698)
pgInt <- rnorm(n = sizeparm, mean = 2.594  , sd = 1.534927)
ChExt <- rnorm(n = sizeparm, mean = 3.569  , sd = 1.169632)
ChInt <- rnorm(n = sizeparm, mean = 4.158  , sd = 1.447912)
Ca <- rnorm(n = sizeparm, mean = 2.383  , sd = 1.189579)
veg <- rnorm(n = sizeparm, mean = 0.5522   , sd = 0.6824301)
Region <- sample(1:4, sizeparm, replace = T )
DFbrt_df2 <-
  data.frame(
    HasRes = HasRes,
    use = use,
    acc = acc,
    tmp = tmp,
    irg = irg,
    pgExt = pgExt,
    pgInt = pgInt,
    ChExt = ChExt,
    ChInt = ChInt,
    Ca = Ca,
    veg = veg,
    Region.num = Region
  )
ColIndexCov <- 2:(ncol(DFbrt_df2)-1)
ColIndexResp <- 1
myBRT2<- gbm.step(data = DFbrt_df2, 
                  gbm.x = ColIndexCov, 
                  gbm.y = ColIndexResp,
                  tree.complexity = 3,
                  learning.rate = 0.000005,
                  n.trees = 50,
                  family = "bernoulli",
                  n.folds = 4,
                  fold.vector = DFbrt_df2$Region.num,
                  step.size = 5,
                  verbose = F,
                  silent = FALSE )

I received an error advising to adjust learning rate and/or step.size.
I adjusted step size to 1 and it worked without error.
