I have the following dataset. It represents students (id = 1, id = 2, id = 3) who took an exam on different dates, together with the result of each exam (0 = pass, 1 = fail).
library(data.table)
my_data = data.table( structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), results = c(0,
0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0,
1), date_exam_taken = structure(c(12889, 12943, 15445, 15528,
17840, 10623, 10680, 11186, 11971, 12826, 13744, 13805, 14904,
15089, 15815, 16883, 17511, 17673, 11500, 12743, 14906, 15675,
16774), class = "Date"), exam_number = c(1L, 2L, 3L, 4L, 5L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 1L, 2L,
3L, 4L, 5L)), row.names = c(NA, 23L), class = "data.frame"))
> head(my_data)
   id results date_exam_taken exam_number
1:  1       0      2005-04-16           1
2:  1       0      2005-06-09           2
3:  1       1      2012-04-15           3
4:  1       1      2012-07-07           4
5:  1       1      2018-11-05           5
6:  2       0      1999-02-01           1
Using the following code in R, I was able to count the number of "3 exam transitions", that is, the number of times each student experienced:
- "pass, pass, pass"
- "pass, pass, fail"
- etc
- "fail, fail, fail"
The R code looks something like this:
my_data$current_exam = shift(my_data$results, 0)
my_data$prev_exam = shift(my_data$results, 1)
my_data$prev_2_exam = shift(my_data$results, 2)
# Count how often each (current, previous, second previous) combination occurs per student
out <- my_data[!is.na(prev_exam), .(tally = .N), by = .(id, current_exam, prev_exam, prev_2_exam)]
out = na.omit(out)
> head(out)
   id current_exam prev_exam prev_2_exam tally
1:  1            1         0           0     1
2:  1            1         1           0     1
3:  1            1         1           1     1
4:  2            0         1           1     3
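One thing to check here: `shift()` as called above runs over the whole table, so the first exams of one student appear to be paired with the last results of the previous student (for example, the first exam of student 2 is matched with the last two results of student 1). A minimal sketch of lagging within each student instead, on a small made-up table (`toy` and `counts` are illustrative names, not part of the original data):

```r
library(data.table)

# Toy data (hypothetical): two students, results in exam order (0 = pass, 1 = fail)
toy <- data.table(id      = c(1L, 1L, 1L, 2L, 2L, 2L),
                  results = c(0, 0, 1, 1, 1, 0))

# Lag within each student so that no history leaks across ids
toy[, `:=`(current_exam = results,
           prev_exam    = shift(results, 1),
           prev_2_exam  = shift(results, 2)), by = id]

# Keep only rows with two earlier exams, then count each combination
counts <- toy[!is.na(prev_2_exam), .(tally = .N),
              by = .(id, current_exam, prev_exam, prev_2_exam)]
print(counts)
```

With `by = id`, the first two exams of every student get `NA` lags and drop out of the count, so the tallies may differ from the ones shown above.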
Now, I want to calculate the probability of a student passing or failing the current exam, conditional on the results of the previous exam and the second previous exam.
I thought the best way to do this was to first perform an aggregation:
library(dplyr)
agg = out %>% group_by(current_exam, prev_exam, prev_2_exam) %>% summarise(total = sum(tally))
> agg
# A tibble: 6 x 4
# Groups: current_exam, prev_exam [3]
  current_exam prev_exam prev_2_exam total
         <dbl>     <dbl>       <dbl> <int>
1            0         1           0     1
2            0         1           1     4
3            1         0           0     1
4            1         0           1     5
5            1         1           0     4
6            1         1           1     6
From here, I am looking for an efficient way to calculate all conditional probabilities (e.g. P(current_exam = 0 | prev_exam = 0 & prev_2_exam = 0)).
I figured out how to do this manually:
# prob(current = 1, given prev = 1, 2nd_prev = 1)
p1 = agg[ agg$current_exam == 1 & agg$prev_exam == 1 & agg$prev_2_exam == 1,]
p2 = agg[ agg$current_exam == 0 & agg$prev_exam == 1 & agg$prev_2_exam == 1,]
final_prob_1_1_1 = sum(p1$total)/(sum(p1$total) + sum(p2$total))
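The manual calculation above is one instance of a general ratio. Written out, with n(c, p_1, p_2) standing for the value in the `total` column for a given combination (this notation is introduced here for illustration only):

```latex
P(\text{current} = c \mid \text{prev} = p_1,\ \text{prev\_2} = p_2)
  = \frac{n(c, p_1, p_2)}{n(0, p_1, p_2) + n(1, p_1, p_2)}
```

For c = 1, p_1 = 1, p_2 = 1 this gives 6 / (4 + 6) = 0.6 with the counts in `agg`.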
But is there an easier way to do this? Is there a dplyr function that can "look back", count all combinations up to the second-to-last column, and calculate all the conditional probabilities?
In the end, I would like an output with 8 rows that looks something like this:
second_prev_prev  current_exam  probs
11                1             prob1
11                0             prob2
10                1             prob3
10                0             prob4
01                1             prob5
01                0             prob6
00                1             prob7
00                0             prob8
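A possible sketch of how such an 8-row table could be produced. It assumes tidyr is available, re-creates `agg` from the counts printed earlier, and reads the label as "second previous result followed by previous result" (that ordering is a guess from the column name). `tidyr::complete()` is used to add the two combinations that never occur, with a count of zero:

```r
library(dplyr)
library(tidyr)

# Counts copied from the agg output shown earlier
agg <- tibble(
  current_exam = c(0, 0, 1, 1, 1, 1),
  prev_exam    = c(1, 1, 0, 0, 1, 1),
  prev_2_exam  = c(0, 1, 0, 1, 0, 1),
  total        = c(1L, 4L, 1L, 5L, 4L, 6L)
)

probs <- agg %>%
  ungroup() %>%
  # add the unobserved combinations with a count of zero
  complete(current_exam = c(0, 1), prev_exam = c(0, 1), prev_2_exam = c(0, 1),
           fill = list(total = 0L)) %>%
  # condition on the two earlier results
  group_by(prev_2_exam, prev_exam) %>%
  mutate(probs = total / sum(total)) %>%
  ungroup() %>%
  # history label: second previous result, then previous result (assumed order)
  mutate(second_prev_prev = paste0(prev_2_exam, prev_exam)) %>%
  select(second_prev_prev, current_exam, probs) %>%
  arrange(desc(second_prev_prev), desc(current_exam))

print(probs)
```

Every history occurs at least once in these counts, so no group total is zero; with other data, a history that never occurs would give `NaN` and would need separate handling.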
Thanks!
Note: My attempt - is this correct?
# my own attempt
> agg %>%
group_by(prev_exam, prev_2_exam) %>%
mutate(probability = total / sum(total))
# A tibble: 6 x 5
# Groups: prev_exam, prev_2_exam [4]
  current_exam prev_exam prev_2_exam total probability
         <dbl>     <dbl>       <dbl> <int>       <dbl>
1            0         1           0     1         0.2
2            0         1           1     4         0.4
3            1         0           0     1         1
4            1         0           1     5         1
5            1         1           0     4         0.8
6            1         1           1     6         0.6
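One way to sanity-check the attempt is to compare it with the manual calculation for a single combination. The sketch below re-creates `agg` from the counts printed earlier (`attempt`, `manual` and `from_attempt` are illustrative names) and checks that both routes agree for current = 1, prev = 1, 2nd_prev = 1:

```r
library(dplyr)

# Counts copied from the agg output shown earlier
agg <- tibble(
  current_exam = c(0, 0, 1, 1, 1, 1),
  prev_exam    = c(1, 1, 0, 0, 1, 1),
  prev_2_exam  = c(0, 1, 0, 1, 0, 1),
  total        = c(1L, 4L, 1L, 5L, 4L, 6L)
)

# Grouped version: share of each outcome within its (prev, prev_2) history
attempt <- agg %>%
  group_by(prev_exam, prev_2_exam) %>%
  mutate(probability = total / sum(total)) %>%
  ungroup()

# Manual version for one combination, as in the question
p1 <- agg[agg$current_exam == 1 & agg$prev_exam == 1 & agg$prev_2_exam == 1, ]
p2 <- agg[agg$current_exam == 0 & agg$prev_exam == 1 & agg$prev_2_exam == 1, ]
manual <- sum(p1$total) / (sum(p1$total) + sum(p2$total))

from_attempt <- attempt$probability[attempt$current_exam == 1 &
                                    attempt$prev_exam == 1 &
                                    attempt$prev_2_exam == 1]

print(all.equal(manual, from_attempt))
```

Agreement on one combination does not prove the general case, but the grouped `mutate()` applies the same ratio to every history, which is what the manual formula does.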