Averaging metrics over different-sized samples is misleading

When ranking models with the tune package, the metrics are averaged over the resamples, implicitly assuming equal sizes (and metric linearity).

That's fine for equal-size slices and linear metrics like MSE or MAE, but it's dubious for unequal sizes and/or non-linear metrics like AUC.

Say I do walk-forward time splits by month and my last slice has much less data in its validation set. The plain average will favour lucky models that happened to perform well on that small last slice, instead of properly weighting each slice by its size, for example.

I think the averages computed by collect_metrics() should at least be weighted by the size of each validation set (or, better, recomputed from the predictions pooled over the union of all validation sets).
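To make the bias concrete, here is a minimal illustration (the AUCs and slice sizes are made-up numbers) of how a tiny, lucky slice distorts the plain mean:

```r
# Hypothetical per-slice AUCs and validation-set sizes (illustrative only).
auc   <- c(0.70, 0.71, 0.69, 0.95)   # last slice happens to look great...
n_val <- c(1000, 1000, 1000, 50)     # ...but it only holds 50 rows

mean(auc)                    # unweighted mean: 0.7625, pulled up by the tiny slice
weighted.mean(auc, n_val)    # size-weighted mean: ~0.704, closer to the bulk of the data
```

The unweighted mean lets the 50-row slice count as much as each 1000-row slice, which is exactly the ranking distortion described above.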

For a long-term solution, the place to file feature requests is Issues · tidymodels/tune (github.com).

In the short term, perhaps you can try collect_metrics()'s summarize = FALSE option and apply your own summarisation?
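A sketch of what that could look like, assuming `res` is a tuning result (e.g. from tune_grid()) fitted on resamples `rs` (e.g. walk-forward splits from rsample); both names are hypothetical placeholders, and the join key assumes the usual `id` column:

```r
library(tune)     # collect_metrics()
library(rsample)  # assessment()
library(dplyr)
library(purrr)

# One row per resample x metric x candidate, instead of pre-averaged rows:
per_fold <- collect_metrics(res, summarize = FALSE)

# Validation-set size of each resample, keyed by the resample id:
sizes <- tibble(
  id    = rs$id,
  n_val = map_int(rs$splits, ~ nrow(assessment(.x)))
)

# Size-weighted summary instead of the plain mean:
per_fold %>%
  inner_join(sizes, by = "id") %>%
  group_by(.config, .metric) %>%
  summarise(mean = weighted.mean(.estimate, n_val), .groups = "drop")
```

This weights each slice by its validation-set size; pooling the raw predictions and recomputing the metric once over the union would go further still, at the cost of more bookkeeping.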


Yes good idea!
I did not start an issue as I did not know if this was handled somewhere I did not see.
The problem with your solution is that it's more work for me! :slight_smile:
I tried to use workflowsets::rank_results, but it calls collect_metrics internally, so I can't modify the call there.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.
