Stratification should not be performed on continuous variables; it's undefined in that context.
You could choose one or more cutoffs to discretise Sale Price into strata, and then attempt to achieve a balanced sample with respect to those strata. For example,
see how the source data set
has a close to even balance of TRUE/FALSE on our variable sp_gt_160000: 1467 vs 1463;
the splits without strata vary a lot: 349 vs 384;
whereas the stratified splits stay balanced: 367 vs 366.
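The comparison above can be sketched in a language-agnostic way. This is only an illustration in plain Python, not the rsample code behind the numbers quoted: the prices are synthetic, the cutoff is the sample median rather than 160000, and `split_counts` is a hypothetical helper.

```python
import random

def split_counts(flags, test_frac, stratified, rng):
    """Return (n_true, n_false) in a held-out test set drawn from `flags`."""
    idx = list(range(len(flags)))
    if stratified:
        # Draw the same fraction separately from the TRUE and FALSE groups.
        test = []
        for level in (True, False):
            members = [i for i in idx if flags[i] == level]
            test.extend(rng.sample(members, round(len(members) * test_frac)))
    else:
        # Plain random split that ignores the flag.
        test = rng.sample(idx, round(len(idx) * test_frac))
    n_true = sum(flags[i] for i in test)
    return n_true, len(test) - n_true

rng = random.Random(1)
# Synthetic, right-skewed "prices" (an assumption, not the real data set).
prices = [rng.lognormvariate(12.0, 0.4) for _ in range(2930)]
cutoff = sorted(prices)[len(prices) // 2]  # roughly the median, so the flag is balanced
flags = [p > cutoff for p in prices]

for label, strat in (("no strata", False), ("strata", True)):
    print(label, [split_counts(flags, 0.25, strat, rng) for _ in range(3)])
```

The unstratified counts should wander from run to run, while the stratified counts are fixed by construction, because each group contributes a fixed share of the test set.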
That's what rsample does with numeric strata. Admittedly, it is not documented well in the help files but is described here with an example:
For regression problems, the outcome data can be artificially binned into quartiles and then stratified sampling conducted four separate times. This is an effective method for keeping the distributions of the outcome similar between the training and test set.
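The quoted idea (bin the outcome into quartiles, then sample within each bin) might look roughly like the sketch below. It is written in plain Python on synthetic data and is not rsample's actual implementation; `quartile_bins` and `stratified_split` are hypothetical names, and rsample's exact binning rules may differ.

```python
import random
import statistics

def quartile_bins(values):
    """Label each value 0-3 according to the sample quartile it falls in."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return [(v > q1) + (v > q2) + (v > q3) for v in values]

def stratified_split(values, test_frac, rng):
    """Draw the test fraction separately inside each quartile bin."""
    bins = quartile_bins(values)
    test = []
    for b in range(4):
        members = [i for i, lab in enumerate(bins) if lab == b]
        test.extend(rng.sample(members, round(len(members) * test_frac)))
    test_set = set(test)
    train = [i for i in range(len(values)) if i not in test_set]
    return train, sorted(test)

rng = random.Random(42)
# Synthetic, right-skewed outcome (an assumption, not the real data set).
prices = [rng.lognormvariate(12.0, 0.4) for _ in range(2000)]
train, test = stratified_split(prices, 0.25, rng)

bins = quartile_bins(prices)
per_bin_test = [sum(1 for i in test if bins[i] == b) for b in range(4)]
print("test rows per quartile bin:", per_bin_test)
print("train median:", statistics.median(prices[i] for i in train))
print("test median: ", statistics.median(prices[i] for i in test))
```

Because every bin gives up the same share of its rows, the test set cannot over- or under-represent any quarter of the outcome distribution, which is what keeps the train and test distributions similar.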
I have confirmed that the distribution is similar for test and train with and without strata.
Is this because the data set is large (the full data set is large, so the test set is also large and therefore keeps the population distribution)?
Is there a good way to check the effect of strata on continuous variables?