How to guarantee a proper number of random numbers generated using sample with group_by?

cstjohn · January 5, 2023, 5:56am

I'm attempting to create a random selection by group based on a table, and I'm having trouble even describing what I want to do, which makes finding help difficult. If anyone can sort this out, I would be very appreciative.

I have two grouping variables, action and region, and would like every row to get a random number from 1:33. But within each region there must be exactly 4 of each number from 1:33.

library(tidyverse)
df <- tibble(
  action = rep(c("A", "B", "C", "D", "E", "F"), each = 110),
  region = rep(rep(c("North", "South", "East", "West", "Central"), each = 22),6)
)

For each action + region, I want some random number from 1:33. But then, when grouped by region and random number, I want each group to be size 4.
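A quick sanity check on those numbers, as a sketch that uses only the df defined above (rows_per_region is just an illustrative name): each region has 6 actions x 22 rows = 132 rows, and 33 numbers x 4 copies is also 132, so an exact split of 4 per number within each region is at least arithmetically possible.

```r
library(dplyr)

# Same example data as above: 6 actions x 5 regions x 22 rows = 660 rows
df <- tibble(
  action = rep(c("A", "B", "C", "D", "E", "F"), each = 110),
  region = rep(rep(c("North", "South", "East", "West", "Central"), each = 22), 6)
)

# Number of rows in each region
rows_per_region <- count(df, region)

# TRUE if every region has exactly 33 * 4 = 132 rows
all(rows_per_region$n == 33 * 4)
```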

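set.seed(33)
df %>% 
  group_by(action, region) %>% 
  # 22 distinct numbers per action + region, drawn independently for each group
  mutate(ran_num = sample(1:33, 22, replace = FALSE)) %>% 
  ungroup() %>% 
  # size of each region + ran_num group
  count(region, ran_num) %>% 
  # how many groups there are of each size
  count(n, name = "n_groups", sort = TRUE)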

That gets me pretty close, but the groups range from size 1 to 4. Is there any way to create the distribution to force equal amounts per this random group? Or perhaps build the tibble a different way to solve it from another angle?

Thanks!

DavoWW · January 6, 2023, 7:04am

Hi @cstjohn,

I think this does what you are asking; however, I'm not sure why you are using each=22 in your code, so I may have got the wrong end of the stick:

# Since number of actions * number of regions = 30
# 30 random numbers (from 1:33) * groups of 4 = 120
df <- data.frame(rand_num = rep(sample(1:33, 30, replace = FALSE), each=4),
                 action = rep(c("A", "B", "C", "D", "E", "F"), each = 4, times = 5),
                 region = rep(c("North", "South", "East", "West", "Central"), each = 4, times=6))

head(df, n=24)
#>    rand_num action  region
#> 1        23      A   North
#> 2        23      A   North
#> 3        23      A   North
#> 4        23      A   North
#> 5        21      B   South
#> 6        21      B   South
#> 7        21      B   South
#> 8        21      B   South
#> 9         1      C    East
#> 10        1      C    East
#> 11        1      C    East
#> 12        1      C    East
#> 13        4      D    West
#> 14        4      D    West
#> 15        4      D    West
#> 16        4      D    West
#> 17       11      E Central
#> 18       11      E Central
#> 19       11      E Central
#> 20       11      E Central
#> 21       29      F   North
#> 22       29      F   North
#> 23       29      F   North
#> 24       29      F   North
dim(df)
#> [1] 120   3

with(df, table(action, region))
#>       region
#> action Central East North South West
#>      A       4    4     4     4    4
#>      B       4    4     4     4    4
#>      C       4    4     4     4    4
#>      D       4    4     4     4    4
#>      E       4    4     4     4    4
#>      F       4    4     4     4    4
with(df, table(rand_num, region))
#>         region
#> rand_num Central East North South West
#>       1        0    4     0     0    0
#>       2        0    0     0     0    4
#>       4        0    0     0     0    4
#>       5        0    0     0     0    4
#>       6        0    0     0     4    0
#>       7        0    0     4     0    0
#>       8        0    0     0     4    0
#>       10       0    0     0     0    4
#>       11       4    0     0     0    0
#>       12       0    0     0     0    4
#>       13       0    0     4     0    0
#>       14       0    4     0     0    0
#>       15       0    0     0     0    4
#>       16       0    4     0     0    0
#>       17       0    0     4     0    0
#>       18       0    0     0     4    0
#>       19       0    4     0     0    0
#>       20       0    0     0     4    0
#>       21       0    0     0     4    0
#>       22       4    0     0     0    0
#>       23       0    0     4     0    0
#>       24       0    4     0     0    0
#>       25       0    0     4     0    0
#>       26       4    0     0     0    0
#>       28       4    0     0     0    0
#>       29       0    0     4     0    0
#>       30       4    0     0     0    0
#>       31       0    4     0     0    0
#>       32       0    0     0     4    0
#>       33       4    0     0     0    0

Created on 2023-01-06 with reprex v2.0.2
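One detail of the construction above: the two rep() calls cover all 30 action/region pairs only because the vectors have lengths 6 and 5, which share no common factor. A hedged sketch of the same idea with expand.grid(), which does not depend on that; the names combos and df2 and the seed are illustrative choices, not part of the code above.

```r
set.seed(33)  # assumed seed, only so the sketch is reproducible

actions <- c("A", "B", "C", "D", "E", "F")
regions <- c("North", "South", "East", "West", "Central")

# One row per action/region combination (6 x 5 = 30 rows)
combos <- expand.grid(action = actions, region = regions,
                      stringsAsFactors = FALSE)

# One distinct number from 1:33 per combination
combos$rand_num <- sample(1:33, nrow(combos), replace = FALSE)

# Repeat every combination four times to get groups of 4
df2 <- combos[rep(seq_len(nrow(combos)), each = 4), ]
rownames(df2) <- NULL

# Every action/region cell should hold exactly 4 rows
all(table(df2$action, df2$region) == 4)
```

As in the original version, only 30 of the 33 numbers are used, and each number ends up in a single region.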

nirgrahamuk · January 6, 2023, 11:40am

Isn't it just

df %>% 
  group_by(region) %>% 
  mutate(ran_num = sample(rep(1:33,4), 132, replace = FALSE))  %>% 
  count(region, ran_num)

?
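A minimal sketch to check that suggestion, assuming the df from the first post (rebuilt here so the block runs on its own; res and counts are illustrative names). Each region has 132 rows, so shuffling a pool that holds every number from 1:33 four times assigns each number exactly 4 times per region.

```r
library(dplyr)

# Example data from the first post: 6 actions x 5 regions x 22 rows
df <- tibble(
  action = rep(c("A", "B", "C", "D", "E", "F"), each = 110),
  region = rep(rep(c("North", "South", "East", "West", "Central"), each = 22), 6)
)

set.seed(33)  # seed reused from the first post, only for reproducibility

res <- df %>%
  group_by(region) %>%
  # shuffle a pool holding each of 1:33 exactly four times (132 values per region)
  mutate(ran_num = sample(rep(1:33, 4))) %>%
  ungroup()

# Every region + ran_num combination should have exactly 4 rows
counts <- count(res, region, ran_num)
all(counts$n == 4)

# Largest number of repeats of one number inside a single action + region cell
max(count(res, action, region, ran_num)$n)
```

One thing this does not enforce: the 22 numbers inside a single action + region cell need not be distinct, because each region's pool contains four copies of every number. The last line reports the largest repeat count, so that can be checked if it matters.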
