Group-Wise Random Sampling

aaronbaggett · August 15, 2018, 3:11pm

Hi all:

Using the tidyverse, how might I randomly sample without replacement the iris data frame, grouped by Species, so that I end up with 50 unique groups of 3 where each species is represented?

In other words, what's the best approach to repeat the chunk below 50 times without replacement so that each combination of three is unique?

iris %>% 
  group_by(Species) %>% 
  sample_n(1)

davis · August 15, 2018, 4:45pm

There might be a better way, but my first thought was stratified vfold crossvalidation using rsample, where you only care about the assessment set.

library(dplyr)
library(rsample)
library(purrr)
library(tidyr)

ex <- vfold_cv(iris, v = 50, strata = "Species")

# Extract only the assessment set
ex$resamples <- map(ex$splits, assessment)

ex
#> #  50-fold cross-validation using stratification 
#> # A tibble: 50 x 3
#>    splits       id     resamples           
#>    <list>       <chr>  <list>              
#>  1 <S3: rsplit> Fold01 <data.frame [3 x 5]>
#>  2 <S3: rsplit> Fold02 <data.frame [3 x 5]>
#>  3 <S3: rsplit> Fold03 <data.frame [3 x 5]>
#>  4 <S3: rsplit> Fold04 <data.frame [3 x 5]>
#>  5 <S3: rsplit> Fold05 <data.frame [3 x 5]>
#>  6 <S3: rsplit> Fold06 <data.frame [3 x 5]>
#>  7 <S3: rsplit> Fold07 <data.frame [3 x 5]>
#>  8 <S3: rsplit> Fold08 <data.frame [3 x 5]>
#>  9 <S3: rsplit> Fold09 <data.frame [3 x 5]>
#> 10 <S3: rsplit> Fold10 <data.frame [3 x 5]>
#> # ... with 40 more rows

unnest(ex, resamples)
#> # A tibble: 150 x 6
#>    id     Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>    <chr>         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1 Fold01          5           3            1.6         0.2 setosa    
#>  2 Fold01          5.5         2.4          3.7         1   versicolor
#>  3 Fold01          6.7         3.3          5.7         2.1 virginica 
#>  4 Fold02          4.9         3.6          1.4         0.1 setosa    
#>  5 Fold02          5.8         2.7          3.9         1.2 versicolor
#>  6 Fold02          6.4         3.1          5.5         1.8 virginica 
#>  7 Fold03          5.1         3.5          1.4         0.3 setosa    
#>  8 Fold03          6.8         2.8          4.8         1.4 versicolor
#>  9 Fold03          7.2         3            5.8         1.6 virginica 
#> 10 Fold04          5.1         3.3          1.7         0.5 setosa    
#> # ... with 140 more rows

# Extract the row IDs that each assessment set corresponds to
ex$row_ids <- map(ex$splits, ~setdiff(1:150, .x$in_id))

# Still 150 rows, confirms that the V fold cross validation
# was without replacement
ex %>%
  unnest(resamples, row_ids) %>%
  distinct(row_ids)
#> # A tibble: 150 x 1
#>    row_ids
#>      <int>
#>  1      26
#>  2      82
#>  3     125
#>  4      38
#>  5      83
#>  6     138
#>  7      18
#>  8      77
#>  9     130
#> 10      24
#> # ... with 140 more rows

Created on 2018-08-15 by the reprex package (v0.2.0).

aaronbaggett · August 15, 2018, 7:16pm

That's it! Thanks much.