Simulating Random Draws from a "Hat"

I am working with the R programming language.

Suppose I have the following 10 variables (num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5):

set.seed(123)


num_var_1 <- rnorm(1000, 10, 1)
num_var_2 <- rnorm(1000, 10, 5)
num_var_3 <- rnorm(1000, 10, 10)
num_var_4 <- rnorm(1000, 10, 10)
num_var_5 <- rnorm(1000, 10, 10)

factor_1 <- c("A","B", "C")
factor_2 <- c("AA","BB", "CC")
factor_3 <- c("AAA","BBB", "CCC", "DDD")
factor_4 <- c("AAAA","BBBB", "CCCC", "DDDD", "EEEE")
factor_5 <- c("AAAAA","BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFFF")

factor_var_1 <- as.factor(sample(factor_1, 1000, replace=TRUE, prob=c(0.3, 0.5, 0.2)))
factor_var_2 <-  as.factor(sample(factor_2, 1000, replace=TRUE, prob=c(0.5, 0.3, 0.2)))
factor_var_3 <-  as.factor(sample(factor_3, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.2, 0.1)))
factor_var_4 <-  as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
factor_var_5 <-  as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.3, 0.2, 0.1, 0.1, 0.1)))

id = 1:1000

my_data = data.frame(id,num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5)


> head(my_data)
  id num_var_1 num_var_2 num_var_3 num_var_4  num_var_5 factor_var_1 factor_var_2 factor_var_3 factor_var_4 factor_var_5
1  1  9.439524  5.021006  4.883963  8.496925  11.965498            B           AA          AAA         CCCC         AAAA
2  2  9.769823  4.800225 12.369379  6.722429  16.501132            B           AA          AAA         AAAA         AAAA
3  3 11.558708  9.910099  4.584108 -4.481653  16.710042            C           AA          BBB         AAAA         CCCC
4  4 10.070508  9.339124 22.192276  3.027154  -2.841578            B           CC          DDD         BBBB         AAAA
5  5 10.129288 -2.746714 11.741359 35.984902 -10.261096            B           AA          AAA         DDDD         DDDD
6  6 11.715065 15.202867  3.847317  9.625850  32.053261            B           AA          CCC         BBBB         EEEE

My Question: I would like to select a random number of variables from this data, draw a random condition (subset) on each selected variable, and then repeat this process many times. For example, I would like to record a list like this:

  • Iteration 1: num_var_2 > 12, factor_var_1 = "A, C", factor_var_4 = "BBBB, DDDD, EEEE"
  • Iteration 2: num_var_1 >0, num_var_3 <10, factor_var_2 = "AA, BB, CC", factor_var_3 = "AAA", factor_var_5 = "CCCCC, DDDDD"
  • Iteration 3: num_var_2 <5, num_var_5 <10, factor_var_1 = "B", factor_var_3 = "AAA"
  • Iteration 4: factor_var_4 = "BBBB"

etc.

I can do the above manually, but this would take a long time even for a small number of iterations (e.g. 10). Is there a way to automate this process so that, in the end, it just outputs this kind of list:

  Iteration                                                                                               Condition
         1                                    num_var_2 > 12, factor_var_1 = A, C, factor_var_4 = BBBB, DDDD, EEEE
         2 num_var_1 >0, num_var_3 <10, factor_var_2 = AA, BB, CC, factor_var_3 = AAA, factor_var_5 = CCCCC, DDDDD
         3                                       num_var_2 <5, num_var_5 <10, factor_var_1 = B, factor_var_3 = AAA
         4                                                                                     factor_var_4 = BBBB

Can someone please show me how to do this?

Thanks!

Can you please clarify the following?

  1. How is drawing from a hat, as mentioned in the title, related to the problem?
  2. As your output, do you need only the conditions, i.e. which variables to keep and how to filter? Or do you want the final subset of the data, given the conditions as input? Or do you want to generate both the variable subset and the filter conditions programmatically and then store the corresponding subset?
  3. Can the filters be based on variables that are not part of the current subset?

(It seems to me that your requirement is very similar to a random forest, except that the SRSWR part, i.e. simple random sampling with replacement, is not being used.)

@Yarnabrina: Thank you for your reply!

  1. The "hat" in this case refers to the data set "my_data":

> summary(my_data)

   num_var_1        num_var_2        num_var_3         num_var_4         num_var_5       factor_var_1 factor_var_2 factor_var_3 factor_var_4 factor_var_5
 Min.   : 5.980   Min.   :-6.221   Min.   :-17.899   Min.   :-22.450   Min.   :-21.730   A:308        AA:503       AAA:517      AAAA:508     AAAA:393    
 1st Qu.: 9.323   1st Qu.: 6.799   1st Qu.:  3.175   1st Qu.:  3.596   1st Qu.:  2.727   B:488        BB:293       BBB:192      BBBB:198     BBBB:253    
 Median : 9.999   Median :10.131   Median :  9.725   Median : 10.228   Median :  9.565   C:204        CC:204       CCC:189      CCCC:119     CCCC:112    
 Mean   : 9.986   Mean   :10.087   Mean   :  9.694   Mean   : 10.049   Mean   :  9.771                             DDD:102      DDDD: 77     DDDD:127    
 3rd Qu.:10.664   3rd Qu.:13.405   3rd Qu.: 16.367   3rd Qu.: 16.459   3rd Qu.: 16.443                                          EEEE: 98     EEEE:115    
 Max.   :12.923   Max.   :27.350   Max.   : 41.034   Max.   : 40.088   Max.   : 40.576
  2. For the output, I would like to keep only the "conditions". I do not actually need the rows of data associated with the conditions. The output should be a two-column table (Iteration, Condition) with one row per iteration, like this:
 Iteration                                                                                               Condition
         1                                    num_var_2 > 12, factor_var_1 = A, C, factor_var_4 = BBBB, DDDD, EEEE
         2 num_var_1 >0, num_var_3 <10, factor_var_2 = AA, BB, CC, factor_var_3 = AAA, factor_var_5 = CCCCC, DDDDD
         3                                       num_var_2 <5, num_var_5 <10, factor_var_1 = B, factor_var_3 = AAA
         4                                                                                     factor_var_4 = BBBB

I am not sure I understand question 3)?

Thank you so much!

I was asking about the filters. You want to filter on both rows and columns, and my question was whether those two need to match, i.e. whether the row filters may only use columns that are kept in the column subset.
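
To make the distinction concrete, here is a small sketch on made-up data (the `toy` data frame and its columns are hypothetical, not part of `my_data`). It keeps one column while filtering rows on two other columns that are not kept:

```r
# Toy data, for illustration only (not the my_data from the question)
toy <- data.frame(
  x = c(1, 5, 9),
  y = c(2, 4, 6),
  g = factor(c("A", "B", "A"))
)

# Column subset: keep only "y"
kept_columns <- "y"

# Row filter: uses "x" and "g", neither of which is in the column subset
row_filter <- toy$x > 3 & toy$g %in% c("A")

# drop = FALSE keeps the result a data frame even with a single column
result <- toy[row_filter, kept_columns, drop = FALSE]
print(result)
```

If the filters had to match the kept columns, only conditions on `y` would be allowed here.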

Here is a very naive attempt to answer the question. I'm sure it can be improved a lot, and I look forward to a better solution from you or someone else.

# data setup

set.seed(123)

num_var_1 <- rnorm(1000, 10, 1)
num_var_2 <- rnorm(1000, 10, 5)
num_var_3 <- rnorm(1000, 10, 10)
num_var_4 <- rnorm(1000, 10, 10)
num_var_5 <- rnorm(1000, 10, 10)

factor_1 <- c("A","B", "C")
factor_2 <- c("AA","BB", "CC")
factor_3 <- c("AAA","BBB", "CCC", "DDD")
factor_4 <- c("AAAA","BBBB", "CCCC", "DDDD", "EEEE")
factor_5 <- c("AAAAA","BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFFF")

factor_var_1 <- as.factor(sample(factor_1, 1000, replace=TRUE, prob=c(0.3, 0.5, 0.2)))
factor_var_2 <-  as.factor(sample(factor_2, 1000, replace=TRUE, prob=c(0.5, 0.3, 0.2)))
factor_var_3 <-  as.factor(sample(factor_3, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.2, 0.1)))
factor_var_4 <-  as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
factor_var_5 <-  as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.3, 0.2, 0.1, 0.1, 0.1)))

id <- 1:1000

my_data <- data.frame(id, num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5)

# utility functions

generate_function_to_get_filter_condition <- function(filter_threshold, filter_type = c("<", ">", "<=", ">=", "==", "!=", "in")) {
  filter_type <- match.arg(filter_type)
  filter_threshold_as_string <- deparse1(filter_threshold)

  get_filter_condition <- function(column_name) {
    switch(filter_type,
      "<" = paste0(column_name, " < ", filter_threshold_as_string),
      ">" = paste0(column_name, " > ", filter_threshold_as_string),
      "<=" = paste0(column_name, " <= ", filter_threshold_as_string),
      ">=" = paste0(column_name, " >= ", filter_threshold_as_string),
      "==" = paste0(column_name, " == ", filter_threshold_as_string),
      "!=" = paste0(column_name, " != ", filter_threshold_as_string),
      "in" = paste0(column_name, " %in% ", filter_threshold_as_string)
    )
  }

  return(get_filter_condition)
}

generate_nominal_factor_column_condition <- function(factor_levels) {
  number_of_levels_to_keep <- sample.int(length(factor_levels), 1)
  levels_to_keep <- sample(factor_levels, number_of_levels_to_keep, FALSE)

  return(generate_function_to_get_filter_condition(levels_to_keep, "in"))
}

generate_ordinal_factor_column_condition <- function(factor_levels) {
  condition_type <- sample(c("<", ">", "<=", ">=", "==", "!=", "in"), 1)

  if(condition_type == "<") {
    condition_level <- sample(tail(factor_levels, -1), 1)
  } else if(condition_type == ">") {
    condition_level <- sample(head(factor_levels, -1), 1)
  } else if(condition_type == "in") {
    number_of_levels <- sample(length(factor_levels), 1)
    condition_level <- sample(factor_levels, number_of_levels, FALSE)
  } else {
    condition_level <- sample(factor_levels, 1)
  }

  return(generate_function_to_get_filter_condition(condition_level, condition_type))
}

generate_factor_column_condition <- function(factor_column_values) {
  factor_levels <- levels(factor_column_values)

  if(is.ordered(factor_column_values)) {
    return(generate_ordinal_factor_column_condition(factor_levels))
  }

  return(generate_nominal_factor_column_condition(factor_levels))
}

generate_numeric_column_condition <- function(numeric_column_values) {
  condition_type <- sample(c("<", ">", "<=", ">="), 1)
  condition_cutoff <- runif(1, min(numeric_column_values) + .Machine$double.eps, max(numeric_column_values) - .Machine$double.eps)

  return(generate_function_to_get_filter_condition(condition_cutoff, condition_type))
}

generate_column_condition <- function(column_name, column_values) {
  if(is.factor(column_values)) {
    return(generate_factor_column_condition(column_values)(column_name))
  }

  return(generate_numeric_column_condition(column_values)(column_name))
}

# main functions

generate_data_conditions <- function(dataset) {
  columns <- names(dataset)

  number_of_columns_to_keep <- sample.int(length(columns), 1)
  columns_to_keep <- sample(columns, number_of_columns_to_keep, FALSE)

  number_of_columns_to_filter <- sample.int(length(columns), 1)
  columns_to_filter <- sample(columns, number_of_columns_to_filter, FALSE)

  filter_conditions <- vapply(columns_to_filter, \(filter_column) generate_column_condition(filter_column, dataset[[filter_column]]), character(1), USE.NAMES = FALSE)

  return(c(subset_columns=paste0(columns_to_keep, collapse = " , "), subset_conditions=paste0(filter_conditions, collapse = " , ")))
}

get_subset_details <- function(dataset, number_of_replications) {
  results <- replicate(number_of_replications, generate_data_conditions(dataset), simplify = TRUE)
  return(as.data.frame(t(results)))
}

# timing demonstration
tic <- Sys.time()
my_data_result_time <- get_subset_details(my_data, 100)
toc <- Sys.time()
print(toc - tic)

# result demonstration
my_data_result_small <- get_subset_details(my_data, 2)
print(my_data_result_small)
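
If the exact two-column layout from the question (Iteration, Condition) is wanted, something along the following lines might work. This is only a compact sketch, not a replacement for the functions above: `random_condition()` and `random_condition_table()` are hypothetical helper names, the `toy` data stands in for `my_data`, numeric columns only get `<` or `>` with a rounded cutoff, and the `id` column is skipped by assumption.

```r
# Hypothetical helper: build one random condition string for a single column
random_condition <- function(column_name, values) {
  if (is.factor(values)) {
    lv <- levels(values)
    # keep a random, non-empty set of levels
    keep <- sample(lv, sample.int(length(lv), 1))
    paste0(column_name, " = ", paste(keep, collapse = ", "))
  } else {
    operator <- sample(c("<", ">"), 1)
    cutoff <- round(runif(1, min(values), max(values)), 2)
    paste(column_name, operator, cutoff)
  }
}

# Hypothetical helper: one row per iteration, in the layout from the question
random_condition_table <- function(dataset, n_iterations, skip = "id") {
  candidates <- setdiff(names(dataset), skip)
  conditions <- vapply(seq_len(n_iterations), function(i) {
    # random number of columns, reported in their original order
    chosen <- sample(candidates, sample.int(length(candidates), 1))
    chosen <- candidates[candidates %in% chosen]
    parts <- vapply(
      chosen,
      function(col) random_condition(col, dataset[[col]]),
      character(1)
    )
    paste(parts, collapse = ", ")
  }, character(1))
  data.frame(Iteration = seq_len(n_iterations), Condition = conditions)
}

# Small stand-in for my_data
set.seed(1)
toy <- data.frame(
  id = 1:20,
  num_var_1 = rnorm(20, 10, 1),
  factor_var_1 = factor(sample(c("A", "B", "C"), 20, replace = TRUE))
)

result <- random_condition_table(toy, 4)
print(result)
```

The strings produced this way are meant for reading, not for evaluation, since `factor_var_1 = A, C` is not valid R syntax.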

Hope this helps.
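
In case the rows matching a generated condition are needed later, the sketch below shows one possible way to apply a condition string of the shape produced above (conditions separated by " , "). The `toy` data and the example string are made up for illustration, and it assumes no factor level itself contains " , ". Note that `eval(parse(text = ...))` should only be used on strings you generated yourself.

```r
# Toy data, for illustration only
toy <- data.frame(
  num_var = c(1, 5, 9, 12),
  factor_var = factor(c("A", "B", "C", "A"))
)

# A condition string in the same shape as subset_conditions above
condition_string <- 'num_var > 3 , factor_var %in% c("A", "C")'

# Split on the " , " separator and join the pieces with a logical AND
parts <- strsplit(condition_string, " , ", fixed = TRUE)[[1]]
combined <- paste0("(", parts, ")", collapse = " & ")

# Evaluate the combined expression with the data frame columns in scope
keep_row <- eval(parse(text = combined), envir = toy)
filtered <- toy[keep_row, , drop = FALSE]
print(filtered)
```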
