select the best model

supermarco · March 30, 2024, 2:49pm

Hello! I'm experimenting with model estimation in R.

I want to retrieve the best model on the training data among ten candidates:

Bestmodel <- which.max(models$r2)

Bestmodel

[1] ols

+My first and main problem: I would like to print the estimation itself instead of only the name of the fit, "ols". How can I do that?

+My second question: is there a function in the caret package to extract the estimated parameters of the best model, like coef(Bestmodel)?

+My last question is broader: do you think the model that is best on the training sample is more likely to be the best at prediction on the test sample?

Thanks a lot!

AlexisW · April 1, 2024, 9:24pm

No, the specifics depend on the context, but in general, your training error will continue decreasing with more parameters, while the testing error will have a "sweet spot", decreasing first then increasing (overfitting). That's why you should select the best model on a testing or validation error, not on the training error. Typically cross-validation can be used.

In addition, once you've chosen the best model (based on its cross-validation error), it's common to re-train it using the whole dataset (training + validation), to yield a final model (keeping an additional test sample not used for model selection to evaluate the generalization error of the final model).
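The workflow described above could be sketched in base R roughly as follows. This is only an illustration: the mtcars data, the three candidate formulas, and the helper cv_rmse() are assumptions made for the example, not anything from the original question.

```r
set.seed(42)

# Hold out a final test set that plays no part in model selection.
n         <- nrow(mtcars)
test_idx  <- sample(n, size = 8)
train_set <- mtcars[-test_idx, ]
test_set  <- mtcars[test_idx, ]

# Candidate models (hypothetical formulas, chosen only for illustration).
formulas <- list(
  small  = mpg ~ wt,
  medium = mpg ~ wt + hp,
  large  = mpg ~ wt + hp + disp + drat + qsec
)

# Assign each training row to one of k folds.
k     <- 4
folds <- sample(rep(seq_len(k), length.out = nrow(train_set)))

# Cross-validated RMSE of one formula: fit on k - 1 folds,
# predict the held-out fold, pool the squared errors.
cv_rmse <- function(f) {
  sq_err <- unlist(lapply(seq_len(k), function(i) {
    fit  <- lm(f, data = train_set[folds != i, ])
    pred <- predict(fit, newdata = train_set[folds == i, ])
    (train_set$mpg[folds == i] - pred)^2
  }))
  sqrt(mean(sq_err))
}

cv_scores <- sapply(formulas, cv_rmse)
best_name <- names(which.min(cv_scores))

# Refit the selected model on all training rows, then estimate its
# generalization error on the untouched test set.
final_fit <- lm(formulas[[best_name]], data = train_set)
test_rmse <- sqrt(mean((test_set$mpg - predict(final_fit, newdata = test_set))^2))

print(cv_scores)
print(best_name)
print(test_rmse)
```

The selection uses only the cross-validated error; the test set is touched once, at the very end.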

I'm not familiar with {caret}, but it should have all the functionality to extract the best model, see here, and in ?train the return value (specifically bestTune).

supermarco · April 2, 2024, 9:54am

Hello

Thank you very much for the clear explanation!
I read all the discussions at the two links, thank you.
I will look into the difference between validation and test sets.

It was a very important question. My most basic question remains unsolved, though. I was just asking how to extract the best model: when I call the best model, R returns something like [1] model_5, which I store in "best-model", but when I type best-model in the script I still get [1] model_5 instead of the estimates of model 5.

Again thank you all.

supermarco · April 3, 2024, 12:11pm

Please let me put it simply.
I fit several OLS models (model1, model2, etc., which are all recorded in MODELS).
I wrote a call to select the model with the highest R2: which.max(MODELS$r2).
I would like to see the estimation of the best model when I call it through which.max() (not only the name of the model).
Thanks a lot.

AlexisW · April 3, 2024, 4:17pm

Taking the example from the link above:

library(caret)
#> Warning: package 'caret' was built under R version 4.3.3
#> Loading required package: ggplot2
#> Loading required package: lattice
set.seed(998)

iris_ran <- iris[sample(nrow(iris)), ]

inTraining <- createDataPartition(iris_ran$Species, p = .75, list = FALSE)
training <- iris_ran[ inTraining,]
testing  <- iris_ran[-inTraining,]


fitControl <- trainControl(## 10-fold CV
  method = "cv",
  number = 10)




gbmFit1 <- train(Species ~ ., data = training, 
                 method = "gbm", 
                 trControl = fitControl,
                 verbose = FALSE)


gbmFit1$bestTune
#>   n.trees interaction.depth shrinkage n.minobsinnode
#> 1      50                 1       0.1             10

predict(gbmFit1, newdata = head(testing))
#> [1] virginica setosa    setosa    virginica virginica setosa   
#> Levels: setosa versicolor virginica

head(testing)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 106          7.6         3.0          6.6         2.1 virginica
#> 47           5.1         3.8          1.6         0.2    setosa
#> 32           5.4         3.4          1.5         0.4    setosa
#> 122          5.6         2.8          4.9         2.0 virginica
#> 142          6.9         3.1          5.1         2.3 virginica
#> 49           5.3         3.7          1.5         0.2    setosa

Created on 2024-04-03 with reprex v2.0.2

So model$bestTune gives you the tuning parameters of the best model, and model$results gives the metrics for all the tested models.
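For the question about coef(), a possible sketch is below. It assumes two regressions fitted with method = "lm" on the iris data (both chosen only for illustration) and compares them on cross-validated R-squared. The finalModel element of a train object holds the underlying fit; whether coef() works on it depends on the method, and for "lm" it is an ordinary lm object.

```r
library(caret)
set.seed(998)

fitControl <- trainControl(method = "cv", number = 10)

# Two candidate regressions (hypothetical choices for the example).
fits <- list(
  lm_small = train(Sepal.Length ~ Petal.Length, data = iris,
                   method = "lm", trControl = fitControl),
  lm_large = train(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                   data = iris, method = "lm", trControl = fitControl)
)

# Cross-validated R-squared of each train object.
cv_r2 <- sapply(fits, function(f) max(f$results$Rsquared))

# Pick the train object itself, not just its name.
best_fit <- fits[[which.max(cv_r2)]]

print(best_fit)             # resampling summary
coef(best_fit$finalModel)   # coefficients of the underlying lm fit
```

Keeping the train objects in a named list means the index returned by which.max() can be used directly to pull out the whole fit.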

If MODELS is a list of fitted models that you created with sapply() or a for loop, and r2 is a vector holding each model's R² in the same order, then which.max() should give you the index, which you can use to extract from the list: MODELS[[ which.max(r2) ]]

I'm not sure how you created MODELS and what class() it has.
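As a minimal sketch, assuming MODELS is a named list of lm fits (the three formulas on mtcars are made up for the example), the index-based extraction could look like this:

```r
# A named list that stores the fitted models themselves.
MODELS <- list(
  model1 = lm(mpg ~ wt, data = mtcars),
  model2 = lm(mpg ~ wt + hp, data = mtcars),
  model3 = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

# One R-squared per model, in the same order as the list.
r2 <- sapply(MODELS, function(m) summary(m)$r.squared)

# which.max() returns the position (and name) of the largest value.
best_index <- which.max(r2)
best_model <- MODELS[[best_index]]

print(r2)
summary(best_model)   # full estimation output
coef(best_model)      # estimated coefficients only
```

Since these models are nested and R² is computed on the training data, the largest model always wins here, which is the reason for selecting on a validation error instead.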

supermarco · April 4, 2024, 11:14am

Thank you very much for your response. I agree.
But how do I get the same metrics when calling the best model among several models? Below is a basic fictitious example.

head(MODELS)
A tibble: 5 x 1
  Model    r2
1 model1 1.00
2 model2 0.98
3 model3 0.97
4 model4 0.96
5 model5 0.95

best <- noquote(results[which.max(na.omit(results)$r2),1])
best

model1
=> I would like the estimation from "model1", not the name "model1".

Thanks a lot.

AlexisW · April 4, 2024, 2:22pm

Can you paste the result of head(MODELS) between backquotes:

```
head(MODELS)
```

It's unclear what format your data is in: is it a tibble with a single column which is a character string containing "model1 1.00", "model2 0.98", ...? In that case which.max() should fail (because the max of a character string is meaningless). Or do you have 2 columns, contrary to the header that says "A tibble: 5 x 1"?

How did you run the models? Are the models themselves stored somewhere? What's the result of these:

class(MODELS)
dim(MODELS)
names(MODELS)
class( MODELS[[1]] )
length( MODELS[[1]] )

supermarco · April 4, 2024, 3:24pm

Hello

Thank you for your message. I'm sorry about the confusion: I did not post the actual code; instead I invented some code lines to explain the situation in simple terms.

Below is the response:

head(MODELS)
A tibble: 6 x 3
  Model    mean n
1 model1   1.00 1
2 model2   0.99 1
3 model3   0.98 1
4 model4   0.97 1
5 model15  0.96 1
6 model16  0.95 1

class(MODELS)
[1] "tbl_df" "tbl" "data.frame"
dim(MODELS)
[1] 93 3
names(MODELS)
[1] "Model" "mean" "n"
class( MODELS[[1]] )
[1] "character"
length( MODELS[[1]] )
[1] 93

Then the best model selection:

best <- noquote(tba3[which.max(na.omit(MODELS)$mean),1])
best
A tibble: 1 x 1
  Model
1 model1

AlexisW · April 4, 2024, 3:59pm

So it looks like you didn't save the models themselves: your first column is character, so it only contains the names of the models, and the other 2 columns only contain values.

The model itself should be a "big" and complex object, contained in the results of the train() function. In the page I linked above, it's the case of gbmFit3 which contains the details of the fit.
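One way to keep the fitted objects next to the summary table, sketched with made-up lm fits on mtcars (the names fits, MODELS, and the r2 column are assumptions for the example), is to store the models in a named list and use the name from the table to look up the object:

```r
# Fitted objects, kept in a named list.
fits <- list(
  model1 = lm(mpg ~ wt, data = mtcars),
  model2 = lm(mpg ~ wt + hp, data = mtcars)
)

# Summary table: names and metrics only, like the tibble shown above.
MODELS <- data.frame(
  Model = names(fits),
  r2    = sapply(fits, function(m) summary(m)$r.squared),
  stringsAsFactors = FALSE
)

# The table yields the NAME of the best model (a character string) ...
best_name <- MODELS$Model[which.max(MODELS$r2)]

# ... and the list turns that name back into the fitted object.
best_fit <- fits[[best_name]]

print(best_name)
summary(best_fit)
```

The table alone can never return the estimation, because it only holds strings and numbers; the list is what holds the models.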

supermarco · April 5, 2024, 8:12am

Thank you very much. You are right, it is a string. When I call the "best" model, a string appears, such as:

[1] model1
BUT when I type "model1" itself, the entire estimation appears, as it corresponds to the complex object you refer to.

supermarco · April 5, 2024, 8:54am

Please let me put it differently. Imagine your caret code with two models, say 'gbm' and 'lm'. Now call the model with the highest R2. Can you do it? Thank you.

nirgrahamuk · April 5, 2024, 9:01am

To programmatically get the model, use the get() or mget() functions.
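A small sketch of both functions, assuming the models exist as separate objects named model1 and model2 in the workspace (the fits themselves are made up for the example):

```r
# Two fitted models stored as separate objects.
model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + hp, data = mtcars)

# get() takes ONE object name as a character string and returns the object.
best_name <- "model2"
best_fit  <- get(best_name)
coef(best_fit)

# mget() takes a character vector of names and returns a named list,
# which makes it easy to compute a metric for every model.
all_fits <- mget(c("model1", "model2"))
r2       <- sapply(all_fits, function(m) summary(m)$r.squared)

# Object with the highest R-squared, retrieved from the list.
best_from_list <- all_fits[[which.max(r2)]]
summary(best_from_list)
```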

supermarco · April 5, 2024, 9:10am

Right. Thank you. I did that, but when I write get(best_model), I get an error message telling me that the object (best_model) does not exist, which is true, because it is a string that corresponds to, for example, 'model1', which is an object.

supermarco · April 5, 2024, 9:18am

I think it is all about converting a string into an object name. Perhaps it should not be a string but a matrix or something else for the get() function to work.
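One possible cause, assuming the selection returns a 1 x 1 table as in the tibble output shown earlier: get() needs a plain character string, not a table that contains one. The sketch below uses a base data.frame with drop = FALSE to mimic that behaviour; the models and the values in the mean column are made up.

```r
model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + hp, data = mtcars)

MODELS <- data.frame(
  Model = c("model1", "model2"),
  mean  = c(0.75, 0.83),          # made-up metric values
  stringsAsFactors = FALSE
)

# drop = FALSE keeps a 1 x 1 table, as single-bracket tibble indexing does.
best <- MODELS[which.max(MODELS$mean), 1, drop = FALSE]
class(best)

# Passing the table to get() raises an error, captured here.
res <- tryCatch(get(best), error = function(e) e)
inherits(res, "error")

# Extract the string itself with [[ ]], then get() finds the object.
best_name <- best[[1]]
best_fit  <- get(best_name)
coef(best_fit)
```

Note also that get(best_model) looks up the string stored in a variable called best_model, so that variable has to exist; which.max() already skips missing values, so na.omit() is not needed to find the row.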

nirgrahamuk · April 5, 2024, 9:35am

Seems like you don't know what your objects are called... This is impossible for us to solve without seeing your code.

supermarco · April 5, 2024, 9:51am

Thank you, and sorry about that. Let me look again at how to solve this issue based on your many comments.
Please note that I have posted a second question, on which sampling approach is best for time series. Many thanks.