When ranking models with the tune package, the metrics are averaged over the resamples assuming equal size (and linearity).
That's fine for equal-size slices and linear metrics like MSE or MAE, but it is dubious for unequal sizes and/or non-linear metrics like AUC.
Say I do walk-forward splits by month and my last slice has much less data in its validation set: the unweighted average will favour models that happened to get lucky on that small slice, instead of weighting each slice by its size.
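A minimal, self-contained illustration with synthetic data (make_slice is a made-up helper, not anything from tune): with one large slice and one tiny "lucky" slice, the plain mean of per-slice AUCs diverges from both the size-weighted mean and the AUC computed on the pooled predictions.

```r
library(dplyr)
library(yardstick)

set.seed(1)
# Hypothetical helper: simulate one validation slice with n rows and a given
# amount of separation between the two classes.
make_slice <- function(n, signal) {
  truth <- factor(rbinom(n, 1, 0.5), levels = c("1", "0"))
  prob  <- ifelse(truth == "1", rbeta(n, 1 + signal, 1), rbeta(n, 1, 1 + signal))
  tibble(truth = truth, .pred_1 = prob)
}

slices <- list(
  big   = make_slice(1000, 3),  # large validation set, decent model
  small = make_slice(30, 8)     # tiny last slice where the model looks great
)

per_slice <- bind_rows(lapply(names(slices), function(slice_name) {
  d <- slices[[slice_name]]
  roc_auc(d, truth, .pred_1) %>% mutate(id = slice_name, n = nrow(d))
}))

mean(per_slice$.estimate)                             # unweighted mean (current behaviour)
weighted.mean(per_slice$.estimate, per_slice$n)       # size-weighted mean
roc_auc(bind_rows(slices), truth, .pred_1)$.estimate  # AUC on the pooled predictions
```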
I think the averages computed by collect_metrics should at least be weighted by the size of each validation set (or, better, recomputed from the predictions over the union of all validation sets).
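For reference, a hedged sketch of both workarounds as I understand them, assuming a result `res` from fit_resamples()/tune_grid() run with save_pred = TRUE, a binary outcome column called `outcome`, and `.pred_yes` as the event-probability column (all placeholder names):

```r
library(dplyr)
library(tune)
library(yardstick)

# 1. Size-weight the per-slice estimates returned by collect_metrics().
per_slice <- collect_metrics(res, summarize = FALSE)  # one row per slice/metric/config
slice_n   <- collect_predictions(res) %>%             # needs save_pred = TRUE
  count(id, .config, name = "n")                      # validation-set size per slice

per_slice %>%
  inner_join(slice_n, by = c("id", ".config")) %>%
  group_by(.config, .metric, .estimator) %>%
  summarise(.estimate = weighted.mean(.estimate, n), .groups = "drop")

# 2. Recompute the metric once over the union of all validation sets.
collect_predictions(res) %>%
  group_by(.config) %>%
  roc_auc(truth = outcome, .pred_yes)  # placeholder column names
```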