Hi there! Thanks as ever for all the incredible work that's gone into creating the tidymodels framework; I can't convey how useful it's been to my research!

My question is about using `xgboost`: specifically, how can I access the underlying model's predictions on (i.e. fit to) the training data, *without* using `predict()`?

To clarify what I mean: when fitting a random forest model, I can explore the fitted model (`rf_fit` in the reprex below) and its predictions on the training data in two ways:

- Method 1: using `predict()`, i.e. calling `predict(rf_fit, cells, type = "prob")`.
- Method 2: getting predictions from `rf_fit` directly, via `rf_fit$fit$predictions`.

These result in different predictions for reasons that have been clarified here.

In this case, I'm particularly interested in the equivalent of `rf_fit$fit$predictions` (i.e. Method 2) for boosted regression trees and my `xgb_fit` object. My questions are two-fold:

- Where in `xgb_fit` are the predictions from the trained model? (I.e., where is the equivalent of `rf_fit$fit$predictions` that we get for random forest models?) Or, what do I need to add to get those predictions outputted?
- If the above is possible, how should I interpret these predictions? Are they different from calling `predict()`? If so, what do they represent? (I gather out-of-bag estimates are non-trivial for boosted regression trees.)

(Basically, I'd like the predictions from the model that produced the `training_logloss` error at iteration 1000 of `xgb_fit$fit$evaluation_log`.)

```
# Load required libraries
library(tidymodels)
library(modeldata)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
# Set seed
set.seed(123)
# Load in data
data(cells, package = "modeldata")
# Define Random Forest Model
rf_mod <- rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")
# Define BRT Model
xgb_mod <- boost_tree(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("xgboost",
             objective = "binary:logistic",
             eval_metric = "logloss")
# Fit the models to training data
rf_fit <- rf_mod %>%
  fit(class ~ ., data = cells)
xgb_fit <- xgb_mod %>%
  fit(class ~ ., data = cells)
xgb_fit$fit$evaluation_log
#> iter training_logloss
#> 1: 1 0.542353
#> 2: 2 0.443275
#> 3: 3 0.382232
#> 4: 4 0.333377
#> 5: 5 0.303415
#> ---
#> 996: 996 0.001918
#> 997: 997 0.001917
#> 998: 998 0.001917
#> 999: 999 0.001916
#> 1000: 1000 0.001915
# Examine output predictions on training data for RANDOM FOREST Model
rf_whole <- predict(rf_fit, cells, type = "prob") # predictions based on whole fitted model
rf_oob <- head(rf_fit$fit$predictions) # predictions based on out of bag samples
## these are different to each other as we would expect
rf_whole$.pred_PS[1]
#> [1] 0.9229111
rf_oob[1, "PS"]
#> PS
#> 0.8503902
# Examine output predictions on training data for BOOSTED REGRESSION TREE Model
xgb_whole <- predict(xgb_fit, cells, type = "prob")
```

<sup>Created on 2021-10-05 by the reprex package (v2.0.1)</sup>