Playing "Guess Who" in R?

I am working with the R programming language.

I am trying to create a game similar to the "Guess Who Game" (Guess Who? - Wikipedia) - a game in which players try to narrow down an in-game character based down a series of guesses.

Here is a dataset I simulated that contains the "counts" for athletes having different characteristics:

hair_color = factor(c("black", "brown", "blonde", "bald"))
glasses = factor(c("yes", "no", "contact lenses"))
sport = factor(c("football", "basketball", "tennis"))
gender = factor(c("male", "female", "other"))

problem = expand.grid(var1 = hair_color, var2 = glasses, var3 = sport, var4 = gender)
problem$counts = as.integer(rnorm(108, 20,5))
dataset = problem

    var1 var2     var3 var4 counts
1  black  yes football male     22
2  brown  yes football male     16
3 blonde  yes football male     12
4   bald  yes football male     22
5  black   no football male     14
6  brown   no football male     19

I then wrote a function that lets the user select rows from this dataset corresponding to a certain profile of characteristics:

   my_function <- function(dataset, var1 = NULL, var2 = NULL, var3 = NULL, var4 = NULL) {
    
    # Create a logical vector to store the rows that match the specified criteria
    selection <- rep(TRUE, nrow(dataset))
    
    # Filter rows based on the specified levels of var1
    if (!is.null(var1)) {
        selection <- selection & dataset$var1 %in% var1
    }
    
    # Filter rows based on the specified levels of var2
    if (!is.null(var2)) {
        selection <- selection & dataset$var2 %in% var2
    }
    
    # Filter rows based on the specified levels of var3
    if (!is.null(var3)) {
        selection <- selection & dataset$var3 %in% var3
    }
    
    # Filter rows based on the specified levels of var4
    if (!is.null(var4)) {
        selection <- selection & dataset$var4 %in% var4
    }
    
    # Select the rows that match the specified criteria
    selected_rows <- dataset[selection, ]
    
    # Return the selected rows
    return(selected_rows)
}

And now, to call the function - select all rows where : the hair is "BLACK OR BROWN" AND the glasses are YES:

head(my_function(dataset, var1 = c("black", "brown"), var2 = c("yes")))

    var1 var2       var3   var4 counts
1  black  yes   football   male     22
2  brown  yes   football   male     16
13 black  yes basketball   male     14
14 brown  yes basketball   male      9
25 black  yes     tennis   male     13

Another example, to call the function - select all rows where : the hair is "BLACK OR BROWN" AND the glasses are NO:

    head(my_function(dataset, var1 = c("black", "brown"), var2 = c("no")))

     var1 var2       var3   var4 counts
5   black   no   football   male     14
6   brown   no   football   male     19
17  black   no basketball   male     17
18  brown   no basketball   male     27

This leads me to my question - suppose I wanted to know the following: What is the (conditional) probability that an athlete wears glasses, given that they have black or brown hair?

Manually, I could answer the question like this:

a = my_function(dataset, var1 = c("black", "brown"), var2 = c("yes"))
b = my_function(dataset, var1 = c("black", "brown"), var2 = c("no"))
prob_yes = sum(a$counts) / (sum(a$counts) + sum(b$counts))
prob_no = sum(b$counts) / (sum(a$counts) + sum(b$counts))

> prob_yes
[1] 0.481203

> prob_no
[1] 0.518797

I was wondering if I could somehow extend this function to the general sense - suppose I wanted my function to take inputs as:

  • Which variables and which levels of these variables (e.g. - not all variables need to be selected)
  • Which variable (single variable) should the conditional probability be calculated on (e.g. "glasses")

And as an output:

  • All probabilities for all variables of this variable should be calculated

As an example - the desired function could be called like this:

my_function(dataset, input_var_list = c(var1 = c("black", "brown"), var3 = c("football")),  conditional_var = c("var2"))

And this desired function would return:

  • The probability of wearing glasses given that the athlete has black/brown hair and plays football
  • The probability of not wearing glasses given that the athlete has black/brown hair and plays football

Can someone please help me re-write this function?

Thanks!

In the function below, if cond_var is not specified, a data frame of matching rows is returned. If cond_var is specified, a data frame showing the conditional probability of each cond_var option is returned. Does this get to your intended outcome?

library(tidyverse)

hair_color = factor(c("black", "brown", "blonde", "bald"))
glasses = factor(c("yes", "no", "contact lenses"))
sport = factor(c("football", "basketball", "tennis"))
gender = factor(c("male", "female", "other"))

problem = expand.grid(var1 = hair_color, var2 = glasses, var3 = sport, var4 = gender)
problem$counts = as.integer(rnorm(108, 20,5))
dataset = problem

my_function = function(data, var1 = NULL, var2 = NULL, var3 = NULL, var4 = NULL,
                       cond_var = NULL) {
  
  if(!is.null(var1)) {
    data = data[data$var1 %in% var1, ]
  }
  
  if(!is.null(var2)) {
    data = data[data$var2 %in% var2, ]
  }
  
  if(!is.null(var3)) {
    data = data[data$var3 %in% var3, ]
  }
  
  if(!is.null(var4)) {
    data = data[data$var4 %in% var4, ]
  }
  
  if(!is.null(cond_var)) {
    target = which(names(data) == cond_var)
    
    denominator = sum(data$counts)
    
    data = data %>%
      rename_at(target, ~paste('cond')) %>%
      group_by(cond) %>%
      summarise(n = sum(counts),
                .groups = 'drop') %>%
      mutate(total = denominator,
             pct = n/denominator) %>%
      rename_at(1, ~paste0(cond_var))
  }
  
  data
}

# without a conditional variable specified
my_function(dataset,
            var1 = c("black", "brown"),
            var2 = NULL,
            var3 = c("football"),
            var4 = NULL,
            cond_var = NULL)
#>     var1           var2     var3   var4 counts
#> 1  black            yes football   male     19
#> 2  brown            yes football   male     23
#> 5  black             no football   male     17
#> 6  brown             no football   male     15
#> 9  black contact lenses football   male     29
#> 10 brown contact lenses football   male     16
#> 37 black            yes football female     20
#> 38 brown            yes football female     13
#> 41 black             no football female     30
#> 42 brown             no football female     21
#> 45 black contact lenses football female     16
#> 46 brown contact lenses football female     18
#> 73 black            yes football  other     23
#> 74 brown            yes football  other     15
#> 77 black             no football  other     20
#> 78 brown             no football  other     21
#> 81 black contact lenses football  other     18
#> 82 brown contact lenses football  other     23

# with a conditional variable specified
my_function(dataset, 
            var1 = c("black", "brown"),
            var2 = NULL,
            var3 = c("football"),
            var4 = NULL,
            cond_var = 'var2')
#> # A tibble: 3 × 4
#>   var2               n total   pct
#>   <fct>          <int> <int> <dbl>
#> 1 contact lenses   120   357 0.336
#> 2 no               124   357 0.347
#> 3 yes              113   357 0.317

Created on 2022-12-24 with reprex v2.0.2

1 Like

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.