Calculating the Probability of Failing an Exam in R

I have the following dataset - students ("id") take an exam multiple times, they either pass ("1") or fail ("0"). The data looks something like this:

id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)


my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]

my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL

      id results date_exam_taken exam_number
63018  1       0      2001-08-15           1
72324  1       1      2002-09-03           2
98866  1       0      2003-01-13           3
56137  1       1      2005-06-15           4
77746  1       0      2007-06-26           5
21438  1       0      2011-09-23           6

I then transformed the data into the following format:

library(tidyr)

my_data = my_data %>% 
  pivot_wider(id, names_from = "exam_number", values_from = "results")

# A tibble: 10,000 x 24
      id   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`  `11`  `12`  `13`  `14`  `15`  `16`  `17`  `18`  `19`  `20`  `21`  `22`  `23`
   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1     0     1     0     1     0     0     0     1     0     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2     2     1     0     1     1     0     0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 3     3     1     0     1     0     1     1     1     1     0     1     1     1     0     0     0     1     1     1    NA    NA    NA    NA    NA
 4     4     1     1     0     0     0     1     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5     5     1     0     1     0     0     1     0     0     0     0     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6     6     1     1     0     1     1     0     0     1     0     0     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7     7     0     0     1     1     0     1     1     0     1     0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 8     8     0     1     0     1     0     1     0     1     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 9     9     0     0     0     0     0     0     1     1     0     1     0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
10    10     0     0     1     1     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# ... with 9,990 more rows

Now, suppose I have the following sequences:

my_grid= expand.grid(0:1, 0:1, 0:1)
n = nrow(my_grid)
n = c(1:n)

my_grid$sequence = paste("sequence", n)
my_grid$seq = paste0(my_grid$Var1, my_grid$Var2, my_grid$Var3)

     Var1 Var2 Var3 sequence seq
1    0    0    0 sequence 1  000
2    1    0    0 sequence 2  100
3    0    1    0 sequence 3  010
4    1    1    0 sequence 4  110
5    0    0    1 sequence 5  001
6    1    0    1 sequence 6  101
7    0    1    1 sequence 7  011
8    1    1    1 sequence 8  111

GOAL: Within the entire dataset, I want to find out the number of times each sequence appears (at the row level). For example, given that a student in this population failed two consecutive tests (e.g. failed tests 4&5, failed test 1&2) - what is the probability that such a student will also fail the next test?

I tried to approach this problem as follows - I took the exam scores of each students and concatenated them into a single string, and made this into a new row. This should make it easier to recognize a desired pattern:

my_list = list()
for (i in 1:length(1:nrow(my_data)))
{
 val_i = paste(my_data[i,-1],collapse="")
print(val_i)
 my_list[[i]] = val_i
}

my_data$cols <- my_list

my_fun <- function(seq, data){
return(lengths(gregexpr(seq, data)))
}

PROBLEM: Then, I tried to apply this function to obtain the final counts - but I am getting this error:

#PROBLEM
my_grid$counts = mapply(my_fun, c(my_grid$seq), my_data$cols)
Error in input[i, ] : incorrect number of dimensions 

Ideally, I am looking for the final result to look something like this (from here, I could simply calculate the conditional probabilities):

# FINAL RESULT
  Var1 Var2 Var3   sequence seq counts
1    0    0    0 sequence 1 000    ...
2    1    0    0 sequence 2 100    ...
3    0    1    0 sequence 3 010    ...
4    1    1    0 sequence 4 110    ...
5    0    0    1 sequence 5 001    ...
6    1    0    1 sequence 6 101    ...
7    0    1    1 sequence 7 011    ...
8    1    1    1 sequence 8 111    ...

QUESTION: Can someone please show me what I am doing wrong and what I can do to fix this?

Thanks!

  • NOTE 1: Instead of using a function, I tried to do this with a for loop.

Here is the code I wrote:

my_list = list()
for (i in 1:length(my_grid$seq))
{
    seq_i = my_grid$seq[i]
    val_i = sum(lengths(gregexpr(seq_i, my_data$cols)))
    print(c(i, seq_i, val_i))
}

[1] "1"     "000"   "11255"
[1] "2"     "100"   "12743"
[1] "3"     "010"   "12145"
[1] "4"     "110"   "12676"
[1] "5"     "001"   "12765"
[1] "6"     "101"   "12085"
[1] "7"     "011"   "12672"
[1] "8"     "111"   "11201"

But for some reason, I don't think this is correct (i.e. counts look rather high)?

  • NOTE 2: I am also trying to make sure that the conditional probabilities are calculated using individual students scores and not by "clumping" all student scores together.

E.g.

student 1 = 1,1,0,0,1,0,0
student 2 = 1,0,0,1,1,1,0

It would be incorrect to combine the scores of both of these students into a single string "1,1,0,0,1,0,0, 1,0,0,1,1,1,0" and then calculate the frequency counts - I would like to calculate these counts at the student level and then add them up together.

Maybe this works?

library(dplyr, warn.conflicts = FALSE)
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)


my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]

my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL


s <- c(1,1,0)

# create a list of vectors
my_grid = expand.grid(0:1, 0:1, 0:1) %>% purrr::pmap(c)


result <- purrr::map_dfr(my_grid,  function(s) {
 my_data %>% 
   group_by(id) %>% 
   mutate(indicator = (results == s[3]) & lag(results) == s[2] & lag(results, 2) == s[1]) %>% 
   ungroup() %>% 
   summarise(counts = sum(indicator, na.rm = T)) %>% 
   mutate(Var1 = s[1], Var2 = s[2], Var3 = s[3], seq = paste(s, collapse = "")) %>% 
   select(Var1, Var2, Var3, seq, counts)
})


result
#> # A tibble: 8 × 5
#>    Var1  Var2  Var3 seq   counts
#>   <int> <int> <int> <chr>  <int>
#> 1     0     0     0 000     9829
#> 2     1     0     0 100     9965
#> 3     0     1     0 010     9954
#> 4     1     1     0 110    10066
#> 5     0     0     1 001    10062
#> 6     1     0     1 101    10028
#> 7     0     1     1 011    10083
#> 8     1     1     1 111    10019

Created on 2022-12-10 with reprex v2.0.2

1 Like

@ablack3 : Thank you so much for your answer!

I had two questions about the code you provided:

Question 1:
I tried running your code on a specific example and got the following answer:

library(dplyr, warn.conflicts = FALSE)
set.seed(123)
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)


my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]

my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL


s <- c(1,1,0)

# create a list of vectors
my_grid = expand.grid(0:1, 0:1, 0:1) %>% purrr::pmap(c)


result <- purrr::map_dfr(my_grid,  function(s) {
    my_data %>% 
        group_by(id) %>% 
        mutate(indicator = (results == s[3]) & lag(results) == s[2] & lag(results, 2) == s[1]) %>% 
        ungroup() %>% 
        summarise(counts = sum(indicator, na.rm = T)) %>% 
        mutate(Var1 = s[1], Var2 = s[2], Var3 = s[3], seq = paste(s, collapse = "")) %>% 
        select(Var1, Var2, Var3, seq, counts)
})

This produces the following answer:

> result
# A tibble: 8 x 5
   Var1  Var2  Var3 seq   counts
  <int> <int> <int> <chr>  <int>
1     0     0     0 000    10032
2     1     0     0 100    10135
3     0     1     0 010    10079
4     1     1     0 110     9904
5     0     0     1 001    10012
6     1     0     1 101     9976
7     0     1     1 011     9851
8     1     1     1 111    10015

I did the following comparison:

> dim(my_data)
[1] 100000      4
> 
> sum(result$counts)
[1] 80004

I was just curious, should the number of rows in the original dataset and the number of combinations be the same? I was not sure about this.

Thanks!

Question 2:

Suppose I wanted to check all combinations at the 4-th level, is the following modification correct?

# some arbitrary combination to be checked
s <- c(1,1,1,1)

# create a list of vectors (add one more layer here)
my_grid = expand.grid(0:1, 0:1, 0:1, 0:1) %>% purrr::pmap(c)

# Other modifications are added here
result <- purrr::map_dfr(my_grid,  function(s) {
 my_data %>% 
   group_by(id) %>% 
   mutate(indicator = (results == s[4]) & lag(results) == s[3]  & lag(results) == s[2] & lag(results, 2) == s[1]) %>% 
   ungroup() %>% 
   summarise(counts = sum(indicator, na.rm = T)) %>% 
   mutate(Var1 = s[1], Var2 = s[2], Var3 = s[3], Var4 = s[4], seq = paste(s, collapse = "")) %>% 
   select(Var1, Var2, Var3,  Var4, seq, counts)
})

The results then look something like this:

> result
# A tibble: 16 x 6
    Var1  Var2  Var3  Var4 seq   counts
   <int> <int> <int> <int> <chr>  <int>
 1     0     0     0     0 0000   10032
 2     1     0     0     0 1000   10135
 3     0     1     0     0 0100       0
 4     1     1     0     0 1100       0
 5     0     0     1     0 0010       0
 6     1     0     1     0 1010       0
 7     0     1     1     0 0110   10079
 8     1     1     1     0 1110    9904
 9     0     0     0     1 0001   10012
10     1     0     0     1 1001    9976
11     0     1     0     1 0101       0
12     1     1     0     1 1101       0
13     0     0     1     1 0011       0
14     1     0     1     1 1011       0
15     0     1     1     1 0111    9851
16     1     1     1     1 1111   10015

Have I done this correctly?

Thank you so much!

Consider using rle()

# create two vectors of scores
set.seed(42)
scores_1 <- sample(c(1,0,NA),100,replace = TRUE)
set.seed(222)
scores_2 <- sample(c(1,0,NA),100,replace = TRUE)

# remove NAs
scores_1 <- scores_1[which(!is.na(scores_1))]
scores_2 <- scores_2[which(!is.na(scores_2))]

# trim to a multiple of 3
length(scores_1) %% 3
#> [1] 1
length(scores_2) %% 3
#> [1] 1
scores_1 <- scores_1[-length(scores_1)]
scores_2 <- scores_2[-length(scores_2)]

scores_1
#>  [1] 1 1 1 1 0 0 0 1 1 1 0 0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0
#> [39] 0 1 0 0 0 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1
#> [77] 0 1
scores_2
#>  [1] 1 0 0 0 0 1 0 1 0 1 0 0 1 1 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0
#> [39] 0 1 0 0 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 1 1 0 0 0 1

# find sequences

seq_1 <- rle(scores_1)
seq_2 <- rle(scores_2)
seq_1
#> Run Length Encoding
#>   lengths: int [1:31] 4 3 3 3 5 2 1 6 2 2 ...
#>   values : num [1:31] 1 0 1 0 1 0 1 0 1 0 ...
seq_2
#> Run Length Encoding
#>   lengths: int [1:33] 1 4 1 1 1 1 1 2 2 1 ...
#>   values : num [1:33] 1 0 1 0 1 0 1 0 1 0 ...

# tally results 
success_1 <- length(which(seq_1$value == 0 & seq_1$lengths == 2))
failure_1 <- length(which(seq_1$value == 0 & seq_1$lengths == 3))

success_2 <- length(which(seq_2$value == 0 & seq_2$lengths == 2))
failure_2 <- length(which(seq_2$value == 0 & seq_2$lengths == 3))

# count trials
trial_1 <- success_1 + failure_1
trial_2 <- success_2 + failure_2

# failure proportion
outcomes_1 <- failure_1 / trial_1
outcomes_2 <- failure_2 / trial_2
outcomes_1
#> [1] 0.4285714
outcomes_2
#> [1] 0.5

Created on 2022-12-10 by the reprex package (v2.0.1)

I wouldn't expect that these would be equal based on my understanding of the question.

We can use lag to identify rows that match the given pattern. Then just sum the indicator variables. The same approach works for a sequence of length 4.

Here is a revised example that shows how this works.

library(dplyr, warn.conflicts = FALSE)

data.frame(id = rep(1:4, each = 5), results = sample(0:1, 20, TRUE)) %>% 
  group_by(id) %>% 
  mutate(s001 = lag(results, 2) == 0 & lag(results, 1) == 0 & lag(results, 0) == 1) %>% 
  mutate(s010 = lag(results, 2) == 0 & lag(results, 1) == 1 & lag(results, 0) == 0) %>% 
  mutate(s011 = lag(results, 2) == 0 & lag(results, 1) == 1 & lag(results, 0) == 1) %>% 
  mutate(s111 = lag(results, 2) == 1 & lag(results, 1) == 1 & lag(results, 0) == 1) %>% 
  mutate(s1111 = lag(results, 3) == 1 & lag(results, 2) == 1 & lag(results, 1) == 1 & lag(results, 0) == 1) 
#> # A tibble: 20 × 7
#> # Groups:   id [4]
#>       id results s001  s010  s011  s111  s1111
#>    <int>   <int> <lgl> <lgl> <lgl> <lgl> <lgl>
#>  1     1       0 FALSE NA    FALSE FALSE FALSE
#>  2     1       1 NA    FALSE FALSE FALSE FALSE
#>  3     1       0 FALSE TRUE  FALSE FALSE FALSE
#>  4     1       0 FALSE FALSE FALSE FALSE FALSE
#>  5     1       1 TRUE  FALSE FALSE FALSE FALSE
#>  6     2       1 NA    FALSE NA    NA    NA   
#>  7     2       1 FALSE FALSE NA    NA    NA   
#>  8     2       1 FALSE FALSE FALSE TRUE  NA   
#>  9     2       1 FALSE FALSE FALSE TRUE  TRUE 
#> 10     2       1 FALSE FALSE FALSE TRUE  TRUE 
#> 11     3       0 FALSE NA    FALSE FALSE FALSE
#> 12     3       1 NA    FALSE FALSE FALSE FALSE
#> 13     3       0 FALSE TRUE  FALSE FALSE FALSE
#> 14     3       1 FALSE FALSE FALSE FALSE FALSE
#> 15     3       1 FALSE FALSE TRUE  FALSE FALSE
#> 16     4       0 FALSE NA    FALSE FALSE FALSE
#> 17     4       0 FALSE FALSE FALSE FALSE FALSE
#> 18     4       0 FALSE FALSE FALSE FALSE FALSE
#> 19     4       1 TRUE  FALSE FALSE FALSE FALSE
#> 20     4       1 FALSE FALSE TRUE  FALSE FALSE

Created on 2022-12-12 with reprex v2.0.2

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.