Hi there! Thanks as ever for all the incredible work that's gone into creating the tidymodels framework; I can't convey how useful it's been to my research!

My question is about using `xgboost`: specifically, how can I access the underlying model's predictions on (i.e. fit to) the training data, *without* using `predict()`?

To clarify what I mean: when fitting a random forest model, I can explore the fitted model (`rf_fit` in the reprex below) and its predictions on the training data in two ways:

- Method 1: using `predict()`, i.e. calling `predict(rf_fit, cells, type = "prob")`.
- Method 2: getting predictions from `rf_fit` directly, via `rf_fit$fit$predictions`.

These result in different predictions for reasons that have been clarified here.

In this case, I'm particularly interested in the equivalent of `rf_fit$fit$predictions` (i.e. Method 2) for boosted regression trees and my `xgb_fit` object. My questions are two-fold:

- Where in `xgb_fit` are the predictions from the trained model? (I.e., where is the equivalent of `rf_fit$fit$predictions` that we get for random forest models?) Or, what do I need to add to get those predictions outputted?
- If the above is possible, how should I interpret these predictions? Are they different from calling `predict()`? If so, what do they represent? (I gather out-of-bag estimates are non-trivial for boosted regression trees.)

(Basically, I'd like the predictions from the model that produced the `training_logloss` error at iteration 1000 of `xgb_fit$fit$evaluation_log`.)

```
# Load required libraries
library(tidymodels)
library(modeldata)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
# Set seed
set.seed(123)
# Load in data
data(cells, package = "modeldata")
# Define Random Forest Model
rf_mod <- rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")
# Define BRT Model
xgb_mod <- boost_tree(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("xgboost",
             objective = "binary:logistic",
             eval_metric = "logloss")
# Fit the models to training data
rf_fit <- rf_mod %>%
  fit(class ~ ., data = cells)
xgb_fit <- xgb_mod %>%
  fit(class ~ ., data = cells)
xgb_fit$fit$evaluation_log
#> iter training_logloss
#> 1: 1 0.542353
#> 2: 2 0.443275
#> 3: 3 0.382232
#> 4: 4 0.333377
#> 5: 5 0.303415
#> ---
#> 996: 996 0.001918
#> 997: 997 0.001917
#> 998: 998 0.001917
#> 999: 999 0.001916
#> 1000: 1000 0.001915
# Examine output predictions on training data for RANDOM FOREST Model
rf_whole <- predict(rf_fit, cells, type = "prob") # predictions based on whole fitted model
rf_oob <- head(rf_fit$fit$predictions) # predictions based on out of bag samples
## these are different to each other as we would expect
rf_whole$.pred_PS[1]
#> [1] 0.9229111
rf_oob[1, "PS"]
#> PS
#> 0.8503902
# Examine output predictions on training data for BOOSTED REGRESSION TREE Model
xgb_whole <- predict(xgb_fit, cells, type = "prob")
```

<sup>Created on 2021-10-05 by the reprex package (v2.0.1)</sup>