Counting Coin Flips for Multiple Students

I have the following dataset, in which each student flips a coin a different number of times:

set.seed(123)
ids = 1:100
student_id = sample(ids, 1000, replace = TRUE)
coin_result = sample(c("H", "T"), 1000, replace = TRUE)
my_data = data.frame(student_id, coin_result)

my_data =  my_data[order(my_data$student_id),]

I want to count how often each 3-flip sequence (e.g. "HHH", "HHT") occurs for each student.

I know how to do this for the entire dataset at once:


results = my_data$coin_result

n_sequences <- function(n, results) {
  helper <- function(i, n) if (n < 1) "" else sprintf(
    "%s%s", 
    helper(i, n - 1), 
    results[i + n - 1]
  )
  result <- data.frame(
    table(
      sapply(
        1:(length(results) - n + 1),
        function(i) helper(i, n)
      )
    )
  )
  colnames(result) <- c("Sequence", "Frequency")
  result
}


n_sequences(3, results)

  Sequence Frequency
1      HHH       140
2      HHT       129
3      HTH       132
4      HTT       119
5      THH       129
6      THT       121
7      TTH       119
8      TTT       109

Now I am trying to do the same calculation per student and then sum the counts over all students. That is, I want the "counter" to restart every time a new student starts flipping the coin, so that no sequence spans two students. This would tell me, for example, the total number of times "HHH" appears when each student's flips are counted separately.

I thought of a very slow and inefficient way to do this:

library(dplyr)

my_list = list()

for (i in 1:length(unique(ids))) {
    tryCatch({
        frame_i = my_data[my_data$student_id == i,]
        results_i = frame_i$coin_result
        results = results_i
        results_i = n_sequences(3, results)
        final_i = cbind(student_id = i, results_i)
        my_list[[i]] = final_i
        #print(final_i)
    }, error = function(e) {})
}


goal = do.call(rbind.data.frame, my_list)

summary = goal %>% group_by(Sequence) %>% summarise(sums = sum(Frequency))

> summary
# A tibble: 8 x 2
  Sequence  sums
  <fct>    <int>
1 HTT         93
2 TTH         93
3 HHH        112
4 HHT        106
5 HTH        108
6 THH         97
7 TTT         94
8 THT         97

Even if my approach is correct, I suspect this loop will take a long time on big datasets (e.g. when there are over 1 million distinct student_id values).

Can someone please suggest a more efficient way to solve this problem?

Thanks!

Note: I am not sure whether the n_sequences() function works if any student in the data frame has fewer than n flips, e.g. when calling n_sequences(n = 5, results). This is why I added the tryCatch() statement, to skip such cases.
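
For what it's worth, the short-student case can also be handled explicitly rather than through tryCatch(). The following is only a minimal sketch: count_windows() is a hypothetical helper name (not part of the code above), and it assumes the flips arrive as a plain character vector.

```r
# Hypothetical helper: counts every run of n consecutive flips in `results`.
# Returns a zero-row data frame when there are fewer than n flips,
# so no error handling is needed for short inputs.
count_windows <- function(results, n) {
  len <- length(results)
  if (len < n) {
    return(data.frame(Sequence = character(0), Frequency = integer(0)))
  }
  starts <- seq_len(len - n + 1)
  # Paste n shifted copies of the vector together, one string per window.
  parts <- lapply(seq_len(n) - 1, function(k) results[starts + k])
  windows <- do.call(paste0, parts)
  out <- as.data.frame(table(windows), stringsAsFactors = FALSE)
  colnames(out) <- c("Sequence", "Frequency")
  out
}

count_windows(c("H", "T"), 3)            # zero rows, no error
count_windows(c("H", "H", "T", "H"), 3)  # windows HHT and HTH, once each
```

Because the windows are built with vectorised paste0() instead of a recursive helper, this should also be cheaper per student, although I have not benchmarked it.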

I would do this sort of thing:

library(tidyverse)
library(slider)

# solution 1: 3-flip windows over the whole dataset (student boundaries are ignored)
slide_chr(.x = my_data$coin_result,
          .f = ~paste0(.x, collapse = ""),
          .before = 1L, .after = 1L,
          .complete = TRUE) |> na.omit() |>
  enframe(name = NULL, value = "Sequence") |>
  group_by_all() |> count(name = "Frequency")


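# solution 2: 3-flip windows within each student, so no window spans two students
# (note: dplyr >= 1.1.0 warns about multi-row summarise() and suggests reframe())
(indv <- my_data |> group_by(student_id) |>
  summarise(coin_results = slide_chr(coin_result,
                                     paste0, collapse = "", .before = 2L, .complete = TRUE)) |> na.omit() |>
  group_by(student_id,
           coin_results) |> count(name = "Frequency"))

# totals over all students: n = number of students with the sequence, sm = total count
group_by(indv, coin_results) |> summarise(
  n = n(),
  sm = sum(Frequency))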
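
If the number of students gets very large, a loop-free base R variant may be worth trying. This is only a sketch under one assumption: the rows are already sorted so that each student's flips are contiguous (as after the order() call in the question). It builds all 3-flip windows over the full vector and then keeps only those whose first and last flip belong to the same student. count_within_students() is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: counts length-n windows that lie entirely within one student.
# Assumes `ids` is sorted, so equal first and last ids imply one student throughout.
count_within_students <- function(ids, flips, n) {
  starts <- seq_len(max(length(flips) - n + 1, 0))
  # One string per window, built from n shifted copies of the flips vector.
  parts <- lapply(seq_len(n) - 1, function(k) flips[starts + k])
  windows <- do.call(paste0, parts)
  # Drop windows that start with one student and end with another.
  same_student <- ids[starts] == ids[starts + n - 1]
  table(windows[same_student])
}

# Toy data: student 1 flips H H T, student 2 flips T T T H.
toy_ids   <- c(1, 1, 1, 2, 2, 2, 2)
toy_flips <- c("H", "H", "T", "T", "T", "T", "H")
count_within_students(toy_ids, toy_flips, 3)
```

On the toy data only HHT (student 1) and TTT, TTH (student 2) are counted; the windows HTT and TTT that straddle the two students are dropped. For the data in the question the call would be count_within_students(my_data$student_id, my_data$coin_result, 3), and students with fewer than n flips simply contribute nothing.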