Creating Functions with unctions with Dynamic Structures

Suppose there is a classroom of students - each student flips the same coin many times (the students don't flip the coin the same number of times). Here is a simulate dataset to represent this example:

library(tidyverse)
library(dplyr)

set.seed(123)
ids = 1:100
student_id = sample(ids, 1000, replace = TRUE)
coin_result = sample(c("H", "T"), 1000, replace = TRUE)
my_data = data.frame(student_id, coin_result)

my_data =  my_data[order(my_data$student_id),]
  • I want to count the number of "3 Flip Sequences" recorded by each student (e.g. Student 1 got HHHTH : HHH 1 time, HHT 1 time, HTH 1 time)
  • And the probability of the 3rd Flip based on the previous 2 flips (e.g. in general, over all students, the probability of a H following HH was 0.54)

Here is some R code that performs these tasks:

results = my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2)), .groups = 'drop') %>%
  filter(!is.na(Sequence)) %>%
  count(Sequence)


final = results %>%
    mutate(two_seq = substr(Sequence, 1, 2)) %>%
    group_by(two_seq) %>%
    mutate(third = substr(Sequence, 3, 3)) %>%
    group_by(two_seq, third) %>%
    summarize(sums = sum(n)) %>%
    mutate(prob = sums / sum(sums))

My Question: Suppose I want to now extend this problem to "4 Flip Sequences" (e.g. probability of H given HHH) - I can manually extend this code:

results = my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2), lead(coin_result, 3)), .groups = 'drop') %>%
  filter(!is.na(Sequence)) %>%
  count(Sequence)

final = results %>%
    mutate(three_seq = substr(Sequence, 1, 3)) %>%
    group_by(three_seq) %>%
    mutate(fourth = substr(Sequence, 4, 4)) %>%
    group_by(three_seq, fourth) %>%
    summarize(sums = sum(n)) %>%
    mutate(prob = sums / sum(sums))

Is it possible to convert the above code into a function such that I can repeat this for arbitrary combinations? For example:

results <- function(i) {return(my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, i+1), lead(coin_result, i+2) .....### insert code here ####), .groups = 'drop') %>%
  filter(!is.na(Sequence)) %>%
  count(Sequence))}

final <- function(i) 
return(results %>%
    mutate(three_seq = substr(Sequence, 1, i)) %>%
    group_by(three_seq) %>%
    mutate(fourth = substr(Sequence, i+1, i+1)) %>%
    group_by(three_seq, fourth) %>%
    summarize(sums = sum(n)) %>%
    mutate(prob = sums / sum(sums)))
}

I am not sure how exactly I would do this, seeing as the first function would require to be "dynamically changed" depending on the value of "i".

Can someone please show me how to do this?

Thanks!

Below is one way to write a function that achieves the desired output. Rather than using lead(), the function creates one large string of outcomes for each student_id and takes substrings of that value based on row number.

get_probs = function(flips) {
  
  results = my_data %>%
    group_by(student_id) %>%
    mutate(Sequence = str_c(coin_result, collapse = '')) %>%
    mutate(Sequence = str_sub(Sequence, row_number(), row_number() + flips - 1)) %>% 
    ungroup() %>%
    filter(!is.na(Sequence) & nchar(Sequence) == flips) %>%
    count(Sequence)
  
  out = results %>%
    mutate(prior = substr(Sequence, 1, flips - 1)) %>%
    group_by(prior) %>%
    mutate(final = substr(Sequence, flips, flips)) %>%
    ungroup() %>%
    group_by(prior, final) %>%
    summarise(sums = sum(n), .groups = 'drop') %>%
    group_by(prior) %>%
    mutate(prob = sums / sum(sums)) %>%
    ungroup()
  
  out
}

get_probs(2)
#> # A tibble: 4 × 4
#>   prior final  sums  prob
#>   <chr> <chr> <int> <dbl>
#> 1 H     H       242 0.522
#> 2 H     T       222 0.478
#> 3 T     H       227 0.521
#> 4 T     T       209 0.479

get_probs(3)
#> # A tibble: 8 × 4
#>   prior final  sums  prob
#>   <chr> <chr> <int> <dbl>
#> 1 HH    H       112 0.514
#> 2 HH    T       106 0.486
#> 3 HT    H       108 0.537
#> 4 HT    T        93 0.463
#> 5 TH    H        97 0.5  
#> 6 TH    T        97 0.5  
#> 7 TT    H        93 0.497
#> 8 TT    T        94 0.503

get_probs(4)
#> # A tibble: 16 × 4
#>    prior final  sums  prob
#>    <chr> <chr> <int> <dbl>
#>  1 HHH   H        54 0.557
#>  2 HHH   T        43 0.443
#>  3 HHT   H        54 0.557
#>  4 HHT   T        43 0.443
#>  5 HTH   H        47 0.528
#>  6 HTH   T        42 0.472
#>  7 HTT   H        45 0.549
#>  8 HTT   T        37 0.451
#>  9 THH   H        46 0.523
#> 10 THH   T        42 0.477
#> 11 THT   H        46 0.541
#> 12 THT   T        39 0.459
#> 13 TTH   H        38 0.481
#> 14 TTH   T        41 0.519
#> 15 TTT   H        39 0.470
#> 16 TTT   T        44 0.530

Created on 2023-02-01 with reprex v2.0.2.9000

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.