Hi there! Thanks for all your work creating the tidymodels framework, it's been invaluable to my research!
I'm getting different predictions and different results for model performance when using fit() and predict(), despite applying them to the same (training) dataset, and I'm struggling to understand why. I'm sure it relates to nuances between the two that I've not understood, but I'm kind of stumped - any help would be massively appreciated!
Below's my attempt at a reproducible example - I'm using the cells dataset and training a random-forest on the data (rf_fit). The object rf_fit$fit$predictions is one of the sets of predictions I assess the accuracy of. I then use rf_fit to make predictions on the same data via the predict() function (yielding rf_training_pred, the other set of predictions I assess the accuracy of).
My question is - why are these sets of predictions different from each other? And why are they so different?
I presume something must be going on under the hood that I'm not aware of, but I'd expected these to be identical, as I'd assumed that fit() trained a model (and has some predictions associated with this trained model) and then predict() takes that exact model and just re-applies it to (in this case) the same data - hence the predictions of both should be identical.
What am I missing? Any suggestions or help in understanding would be hugely appreciated - thank you!
# Load required libraries
library(tidymodels); library(modeldata)
#> Registered S3 method overwritten by 'tune':
#>   method                   from
#>   required_pkgs.model_spec parsnip

# Set seed
set.seed(123)

# Split up data into training and test
data(cells, package = "modeldata")

# Define Model
rf_mod <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Fit the model to training data and then predict on same training data
rf_fit <- rf_mod %>%
  fit(class ~ ., data = cells)
rf_training_pred <- rf_fit %>%
  predict(cells, type = "prob")

# Evaluate accuracy
data.frame(rf_fit$fit$predictions) %>%
  bind_cols(cells %>% select(class)) %>%
  roc_auc(truth = class, PS)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.903

rf_training_pred %>%
  bind_cols(cells %>% select(class)) %>%
  roc_auc(truth = class, .pred_PS)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary          1.00
Created on 2021-09-25 by the reprex package (v2.0.1)