What is the meaning of strata in rsample?

I learned that the strata argument in rsample is what keeps the sample evenly distributed across the splits.

However, I compared them with the following code and cannot see any difference.
What is different about the results?

library(tidymodels)
data(ames)

# no strata
set.seed(123)
ames_split <- initial_split(ames, prop = 0.80)
ames_train <- training(ames_split) %>% mutate(id = "train")
ames_test  <-  testing(ames_split) %>% mutate(id = "test") 
bind_df <- bind_rows(ames_train,ames_test)

bind_df %>% 
  ggplot(aes(x = Sale_Price,fill=id)) + 
  geom_histogram(bins = 50,position = "identity", alpha = 0.8)

bind_df %>% 
  ggplot(aes(x = Sale_Price,fill=id)) + 
  geom_histogram(bins = 50,position = "fill", alpha = 0.8)

# set strata
set.seed(123)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split) %>% mutate(id = "train")
ames_test  <-  testing(ames_split) %>% mutate(id = "test") 
bind_df <- bind_rows(ames_train,ames_test)

bind_df %>% 
  ggplot(aes(x = Sale_Price,fill=id)) + 
  geom_histogram(bins = 50,position = "identity", alpha = 0.8)

bind_df %>% 
  ggplot(aes(x = Sale_Price,fill=id)) + 
  geom_histogram(bins = 50,position = "fill", alpha = 0.8)

Stratification should not be performed on continuous variables; it's undefined in that context.
You could choose one or more cutoffs to discretise Sale_Price into strata, and then try to achieve a balanced sample with respect to those. For example:


library(tidyverse)
library(tidymodels)
data("ames")

# logical stratum: is the sale price above 160000?
ames2 <- mutate(ames, sp_gt_160000 = Sale_Price > 160000)

# no strata
# prop = 0.75 is the default and matches the 2197/733 split in the tables below
set.seed(123)
ames_split1 <- initial_split(ames2, prop = 0.75)
ames_train1 <- training(ames_split1) %>% mutate(id = "train")
ames_test1  <- testing(ames_split1) %>% mutate(id = "test")
bind_df_no_strata <- bind_rows(ames_train1, ames_test1)


# set strata
set.seed(123)
ames_split2 <- initial_split(ames2, prop = 0.75, strata = sp_gt_160000)
ames_train2 <- training(ames_split2) %>% mutate(id = "train")
ames_test2  <- testing(ames_split2) %>% mutate(id = "test")
bind_df_strata <- bind_rows(ames_train2, ames_test2)


table(ames2$sp_gt_160000)
table(bind_df_no_strata$id,bind_df_no_strata$sp_gt_160000)
table(bind_df_strata$id,bind_df_strata$sp_gt_160000)
> table(ames2$sp_gt_160000)

FALSE  TRUE 
 1467  1463 
> table(bind_df_no_strata$id,bind_df_no_strata$sp_gt_160000)
       
        FALSE TRUE
  test    349  384
  train  1118 1079
> table(bind_df_strata$id,bind_df_strata$sp_gt_160000)
       
        FALSE TRUE
  test    367  366
  train  1100 1097

See how the source data set has a close to even balance of FALSE/TRUE on our variable sp_gt_160000 (1467 vs 1463).
The test set from the unstratified split is noticeably less balanced (349 vs 384) than the test set from the stratified split (367 vs 366).

That's what rsample does with numeric strata. Admittedly, it is not documented well in the help files but is described here with an example:

For regression problems, the outcome data can be artificially binned into quartiles and then stratified sampling conducted four separate times. This is an effective method for keeping the distributions of the outcome similar between the training and test set.
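
To make the quoted idea concrete, here is a minimal base-R sketch of quartile binning followed by sampling within each bin. It is only an illustration under stated assumptions, not rsample's actual implementation: the outcome `y` is simulated (made-up data standing in for a price variable), and the 80% proportion and the names `bins`, `train_idx` and `test_idx` are chosen for the example.

```r
set.seed(123)

# Simulated right-skewed outcome (hypothetical data, not the ames set)
y <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Bin the outcome into quartiles; labels = FALSE returns bin numbers 1..4
bins <- cut(y,
            breaks = quantile(y, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE,
            labels = FALSE)

# Draw 80% of the row indices separately within each quartile bin
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))
test_idx <- setdiff(seq_along(y), train_idx)

# Each quartile bin contributes the same share to train and to test
prop.table(table(bins[train_idx]))
prop.table(table(bins[test_idx]))
```

In rsample itself, the number of bins used for a numeric strata variable is controlled by the breaks argument of initial_split(), which defaults to 4.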

Thank you for correcting me!
I should get into the tidymodels world more...

I have confirmed that the distributions of test and train are similar both with and without strata.
Is this because the data set is large (with this many rows, even a simple random test set tends to follow the population distribution)?

Is there a good way to check the effect of strata on continuous variables?

You can check visually, or you could use the Kolmogorov-Smirnov test to compare the two distributions. I don't particularly like that method, but that's what it is for.
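
As a rough sketch of the Kolmogorov-Smirnov idea, the code below repeats the split many times and compares the KS statistic (the largest gap between the train and test empirical distributions of Sale_Price) with and without strata. The helper `ks_stat`, the 50 repetitions and the 80% proportion are assumptions made for the example; it also assumes the rsample and modeldata packages are installed.

```r
library(rsample)
data(ames, package = "modeldata")

# KS statistic between train and test Sale_Price for one split.
# Sale_Price contains tied values, so ks.test() may warn that the
# p-value is approximate; only the statistic is used here.
ks_stat <- function(split) {
  res <- suppressWarnings(
    ks.test(training(split)$Sale_Price, testing(split)$Sale_Price)
  )
  unname(res$statistic)
}

set.seed(123)
n_rep <- 50

# Repeated splits without and with stratification on the numeric outcome
plain <- replicate(n_rep, ks_stat(initial_split(ames, prop = 0.80)))
strat <- replicate(n_rep, ks_stat(initial_split(ames, prop = 0.80,
                                                strata = Sale_Price)))

# Smaller values mean the train and test distributions are closer
summary(plain)
summary(strat)
```

Looking at many splits rather than one seed helps separate the effect of stratification from the luck of a single random draw, which matters most when the data set is small.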
