Still make training and test sets when using {spatialsample}?

Hi @julia, thank you for maintaining the {spatialsample} package!

I've been reading through its vignettes and am trying to figure out whether I should still generate training and test sets before resampling, as in Figure 10.1 of your book. The catch is that, for the test set to be useful, the initial split would need to account for spatial autocorrelation, say via a stratification variable — but no such variable exists, which is why we're reaching for spatial resampling methods in the first place.

So it seems like the answer is no: don't do an initial split into training and test sets before spatial resampling. Instead, let the spatial resampling divide each fold into analysis and assessment sets, and trust that this gives a representative picture of model performance?

Thanks for your patience on this! I really like the writing that the mlr3 folks have done on this topic. You can read here, and also in the book Geocomputation with R. The main argument is that spatial resampling will give you a better estimate of performance than evaluating on a randomly held-out test set.
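For what it's worth, here's a minimal sketch of setting up spatial resamples with {spatialsample}. The data frame `dat` is a placeholder — assume it's an sf object with point geometries, which is what current versions of the package expect:

```r
library(spatialsample)

# `dat` is a hypothetical sf data frame with point geometries
set.seed(123)
folds <- spatial_clustering_cv(dat, v = 10)  # clusters observations by location
folds

# autoplot(folds)  # map which locations land in each fold
```

Each resample in `folds` then has its own analysis and assessment set, with nearby observations kept together, so the assessment data is more spatially distinct from the analysis data than a random split would be.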

This does mean that you probably won't want to use last_fit() to fit one time to the whole training set and evaluate one time on the testing set (because we expect that estimate to be too optimistic). Instead, you would use your spatial resampling results to estimate your performance and just plain old fit() your best (tuned) model to your training set. You can read about using fit() on workflows.
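Putting that together, a sketch of the whole flow might look like this. Everything here is illustrative: `dat`, the outcome `y`, and the predictors `x1` and `x2` are hypothetical, and the model is a plain linear regression just to keep the example small:

```r
library(tidymodels)
library(spatialsample)

# Hypothetical sf data frame `dat` with outcome y and predictors x1, x2
set.seed(123)
folds <- spatial_clustering_cv(dat, v = 10)

wf <- workflow() |>
  add_formula(y ~ x1 + x2) |>
  add_model(linear_reg())

# Estimate performance with the spatial resamples (instead of a test set)...
res <- fit_resamples(wf, resamples = folds)
collect_metrics(res)

# ...then fit the final model once with fit(), not last_fit()
final_model <- fit(wf, data = dat)
```

The key point is that the performance estimate comes from `collect_metrics()` on the spatial resampling results, while `fit()` just produces the model object you'll actually use.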
