Data partitioning for spatial data


I am building several Random Forest configurations to investigate how well-design variables and location influence the first-year production volumes of shale oil wells within a given area in the US. The configurations control for location in different ways, to show how the estimated influence of well-design variables may be biased when the spatial resolution of the model is inadequate. Here, location acts as a proxy for geological properties/reservoir quality.

I have a dataset of ~4500 wells, with 6 variables. The response is the first-year production volume, and the predictors are three different well-design variables in addition to longitude and latitude.

I have been researching and thinking about data partitioning for spatial data. For instance, in this chapter of Lovelace et al., they highlight the importance of spatial cross-validation (CV): "Randomly splitting spatial data can lead to training points that are neighbors in space with test points. Due to spatial autocorrelation, test and training datasets would not be independent in this scenario, with the consequence that CV fails to detect possible overfitting. Spatial CV alleviates this problem and is the central theme of this chapter."

Further, they illustrate how a spatial partitioning may differ from a random partitioning (figure not reproduced here).

They then show the results for the classification problem at hand, where the AUC from conventional CV is about 0.05 higher than the AUC from spatial CV.

The point is that due to spatial autocorrelation (near things are more related than distant things), you will end up with some observations in the training set that are very similar to observations in the test set if the proximity of observations is not accounted for when splitting the data. This may cause "information leakage" between the sets.
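To make the leakage concrete, here is a minimal sketch in plain Python (not the book's code, and not my actual data). It assumes hypothetical well locations drawn uniformly on a 100 x 100 square and compares a random fold assignment with a simple grid-block assignment, by measuring how far each test point is from its nearest training point. The function names, the cell size of 25, and the choice of 4 folds are all illustrative assumptions.

```python
import math
import random


def random_folds(n, k, seed=0):
    """Assign n points to k balanced folds in random order."""
    rng = random.Random(seed)
    folds = [i % k for i in range(n)]
    rng.shuffle(folds)
    return folds


def block_folds(points, cell_size, k):
    """Assign folds per grid cell, so all points in a cell share one fold."""
    def cell(p):
        return (math.floor(p[0] / cell_size), math.floor(p[1] / cell_size))

    cells = sorted({cell(p) for p in points})
    cell_to_fold = {c: i % k for i, c in enumerate(cells)}
    return [cell_to_fold[cell(p)] for p in points]


def mean_nearest_train_distance(points, folds, test_fold):
    """Mean distance from each test point to its nearest training point."""
    train = [p for p, f in zip(points, folds) if f != test_fold]
    test = [p for p, f in zip(points, folds) if f == test_fold]
    return sum(min(math.dist(t, tr) for tr in train) for t in test) / len(test)


# Hypothetical locations (assumption): 400 points, uniform on a 100 x 100 square
rng = random.Random(42)
points = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(400)]

k = 4
rand = random_folds(len(points), k)
block = block_folds(points, cell_size=25, k=k)

d_random = mean_nearest_train_distance(points, rand, test_fold=0)
d_spatial = mean_nearest_train_distance(points, block, test_fold=0)

print(f"random split : mean distance to nearest training point = {d_random:.2f}")
print(f"spatial split: mean distance to nearest training point = {d_spatial:.2f}")
```

With the random split, almost every test point has a training point right next to it, while the block split pushes the nearest training point much further away. That gap in distance is what turns into a gap in apparent performance when the response is spatially autocorrelated.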

My question is: does this information leakage necessarily pose a problem? I figure that the similarity between nearby observations may simply be representative of the problem at hand, in which case keeping it makes the performance assessment more representative of a real-life application of the model. I understand that a spatially disjoint test set gives a more representative assessment if the model is to be used for predicting in a completely new and distant area. But if you want to assess a model's predictive performance on a mix of near and distant locations, wouldn't a random split be more reasonable?

Hoping for some input here, thanks!
