Making Combinations of Items

Suppose I have the following lists of factors:

factor_1 = c("A1", "A2", "A3")
factor_2 = c("B1", "B2")
factor_3 = c("C1", "C2", "C3", "C4")
factor_4 = c("D1", "D2", "D3")

I made the following data frame that contains all (3 * 2 * 4 * 3 = ) 72 combinations of these factors:

data_exp <- expand.grid(factor_1, factor_2, factor_3, factor_4) 
data_exp$id = 1:nrow(data_exp)

> head(data_exp)
  Var1 Var2 Var3 Var4 id
1   A1   B1   C1   D1  1
2   A2   B1   C1   D1  2
3   A3   B1   C1   D1  3
4   A1   B2   C1   D1  4
5   A2   B2   C1   D1  5
6   A3   B2   C1   D1  6

I want to randomly split this data (data_exp) into 3 datasets such that each row appears in exactly one of them. The 3 datasets do not have to be the same size. I tried to do this with the following code.

First, I generate 3 random integers that add up to 72, one per dataset, to use as the number of rows in each:

# https://stackoverflow.com/questions/24845909/generate-n-random-integers-that-sum-to-m-in-r

rand_vect <- function(N, M, sd = 1, pos.only = TRUE) {
  vec <- rnorm(N, M/N, sd)
  if (abs(sum(vec)) < 0.01) vec <- vec + 1
  vec <- round(vec / sum(vec) * M)
  deviation <- M - sum(vec)
  for (. in seq_len(abs(deviation))) {
    vec[i] <- vec[i <- sample(N, 1)] + sign(deviation)
  }
  if (pos.only) while (any(vec < 0)) {
    negs <- vec < 0
    pos  <- vec > 0
    vec[negs][i] <- vec[negs][i <- sample(sum(negs), 1)] + 1
    vec[pos][i]  <- vec[pos ][i <- sample(sum(pos ), 1)] - 1
  }
  vec
}

r = rand_vect(3, 72)
[1] 26 23 23

Next, I tried to create these datasets using these random numbers:

data_1 = data_exp[sample(nrow(data_exp), r[1]), ]
data_2 = data_exp[sample(nrow(data_exp), r[2]), ]
data_3 = data_exp[sample(nrow(data_exp), r[3]), ]

The problem with this approach is that data_1, data_2, and data_3 can share rows, and not every row of data_exp is guaranteed to appear in one of them.
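
To make the issue concrete, here is a minimal sketch that counts the shared and uncovered rows. The seed and the fixed sizes c(26, 23, 23) (copied from the run shown earlier) are assumptions for illustration only; each call to sample() draws from all 72 rows independently, which is why overlap and gaps can occur.

```r
# Rebuild the 72-row grid so the sketch runs on its own
data_exp <- expand.grid(c("A1", "A2", "A3"),
                        c("B1", "B2"),
                        c("C1", "C2", "C3", "C4"),
                        c("D1", "D2", "D3"))
data_exp$id <- seq_len(nrow(data_exp))

set.seed(1)            # assumed seed, only for reproducibility
r <- c(26, 23, 23)     # sizes taken from the run shown earlier

# Three independent draws from the full set of rows
data_1 <- data_exp[sample(nrow(data_exp), r[1]), ]
data_2 <- data_exp[sample(nrow(data_exp), r[2]), ]
data_3 <- data_exp[sample(nrow(data_exp), r[3]), ]

# ids that appear in both data_1 and data_2
shared_12 <- intersect(data_1$id, data_2$id)

# ids that appear in at least one dataset, and ids that appear in none
covered <- unique(c(data_1$id, data_2$id, data_3$id))
missing <- setdiff(data_exp$id, covered)

length(shared_12)
length(missing)
```

A non-zero count for either quantity means the three datasets are not a clean split of data_exp.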

Is there a way to fix this problem?

Thank you!

Hope this is of some use to you

factor_1 = c("A1", "A2", "A3")
factor_2 = c("B1", "B2")
factor_3 = c("C1", "C2", "C3", "C4")
factor_4 = c("D1", "D2", "D3")

data_exp <- expand.grid(factor_1, factor_2, factor_3, factor_4)
data_exp$id = 1:nrow(data_exp)

set.seed(1234)
idx <- sample(3, size = nrow(data_exp), replace = TRUE, prob = c(0.33, 0.33, 0.34))
df1 <- data_exp[idx == 1,]
df2 <- data_exp[idx == 2,]
df3 <- data_exp[idx == 3,]
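
Because every row receives exactly one label in idx, the three data frames cannot overlap and together cover all rows. A small sketch of how that could be checked, assuming the same seed and probabilities as above (the names parts and all_ids are only illustrative):

```r
# Rebuild the grid and the group labels so the check runs on its own
data_exp <- expand.grid(c("A1", "A2", "A3"),
                        c("B1", "B2"),
                        c("C1", "C2", "C3", "C4"),
                        c("D1", "D2", "D3"))
data_exp$id <- seq_len(nrow(data_exp))

set.seed(1234)
idx <- sample(3, size = nrow(data_exp), replace = TRUE,
              prob = c(0.33, 0.33, 0.34))

# Split into a list of three data frames; the factor levels keep a
# group in the list even if it happens to receive no rows
parts <- split(data_exp, factor(idx, levels = 1:3))

# Collect the ids from all groups
all_ids <- unlist(lapply(parts, function(d) d$id))

# No id appears twice, and every original id is present
stopifnot(anyDuplicated(all_ids) == 0)
stopifnot(setequal(all_ids, data_exp$id))

# Group sizes (these vary from seed to seed)
sapply(parts, nrow)
```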

Thank you so much! Ideally I would like the number of rows in df1, df2, and df3 to be fully random and still add up to nrow(data_exp). Is this possible? Thank you so much!

Hi,
I checked before posting: the assignment is random, and the row counts of the three data frames add up to the number of rows in the original dataset.
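
With fixed probabilities the group sizes are random but tend to land near a third each. If "fully random" means the sizes themselves should range more widely, one possible sketch (not something proposed in this thread) is to shuffle the rows once and cut the shuffled order at two random points. The seed and the names cuts, sizes, and parts are assumptions for illustration.

```r
# Rebuild the grid so the sketch runs on its own
data_exp <- expand.grid(c("A1", "A2", "A3"),
                        c("B1", "B2"),
                        c("C1", "C2", "C3", "C4"),
                        c("D1", "D2", "D3"))
data_exp$id <- seq_len(nrow(data_exp))
n <- nrow(data_exp)

set.seed(42)  # assumed seed, only for reproducibility

# Two distinct cut points in 1..(n - 1) give three non-empty sizes
cuts  <- sort(sample(n - 1, 2))
sizes <- diff(c(0, cuts, n))

# Shuffle the rows once, then hand out consecutive blocks of the shuffle
shuffled <- data_exp[sample(n), ]
group    <- rep(1:3, times = sizes)
parts    <- split(shuffled, group)

df1 <- parts[[1]]
df2 <- parts[[2]]
df3 <- parts[[3]]

sizes
```

Since each row is placed exactly once in the shuffled order, the three data frames are disjoint and their row counts always add up to nrow(data_exp).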
