# Conditional Probabilities in R

I have the following dataset:

``````
my_data = structure(list(Sequence = structure(1:8, .Label = c("HTT", "TTH",
"HHH", "HHT", "HTH", "THH", "TTT", "THT"), class = "factor"),
sums = c(93L, 93L, 112L, 106L, 108L, 97L, 94L, 97L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -8L))

> my_data
# A tibble: 8 x 2
Sequence  sums
<fct>    <int>
1 HTT         93
2 TTH         93
3 HHH        112
4 HHT        106
5 HTH        108
6 THH         97
7 TTT         94
8 THT         97
``````

Using the counts in the `sums` column, I want to find the probability of the third flip being "H" vs. "T" conditional on the first two flips (e.g. H given HH, H given TH, T given TT, etc.).
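Stated as a formula, this is a ratio of counts. The notation below is an assumption and not part of the original post: $n(\cdot)$ stands for the value in `sums` for a given sequence, $a$ and $b$ are the first two flips, and the counts are treated as observed frequencies.

$$
\hat{P}(\text{third} = H \mid \text{first two} = ab) = \frac{n(abH)}{n(abH) + n(abT)}
$$

For the prefix HH, for example, this gives $112 / (112 + 106) = 112 / 218 \approx 0.514$.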

I tried to do this with the dplyr package:

``````
library(dplyr)

my_data %>%
  mutate(two_seq = substr(Sequence, 1, 2)) %>%
  group_by(two_seq) %>%
  mutate(third = substr(Sequence, 3, 3)) %>%
  group_by(two_seq, third) %>%
  summarize(sums = sum(sums)) %>%
  mutate(prob = sums / sum(sums))
``````

Here is the output of my code:

``````
`summarise()` has grouped output by 'two_seq'. You can override using the `.groups` argument.
# A tibble: 8 x 4
# Groups:   two_seq 
two_seq third  sums  prob
<chr>   <chr> <int> <dbl>
1 HH      H       112 0.514
2 HH      T       106 0.486
3 HT      H       108 0.537
4 HT      T        93 0.463
5 TH      H        97 0.5
6 TH      T        97 0.5
7 TT      H        93 0.497
8 TT      T        94 0.503
``````

Can someone please tell me if I have done this correctly?

Thanks!

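You get the right answer, but the process has unnecessary steps. Each sequence appears exactly once in the data, so the `summarize()` step changes no counts. The result is correct only because `summarize()` by default drops just the last grouping variable (`third`) and leaves its output grouped by `two_seq`, so `sum(sums)` in the final `mutate()` is computed within each `two_seq` group. Here is a comparison of your code and a simplified version.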
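The dependence on the grouping that `summarize()` leaves behind can be checked with a small sketch. It rebuilds the counts from the question as a plain data frame and runs the same pipeline twice, once with `.groups = "drop_last"` (the default behavior here) and once with `.groups = "drop"`. The names `by_pair`, `kept`, and `dropped` are illustrative only.

```r
library(dplyr)

# Same counts as in the question, rebuilt as a plain data frame
my_data <- data.frame(
  Sequence = c("HTT", "TTH", "HHH", "HHT", "HTH", "THH", "TTT", "THT"),
  sums = c(93L, 93L, 112L, 106L, 108L, 97L, 94L, 97L)
)

by_pair <- my_data %>%
  mutate(two_seq = substr(Sequence, 1, 2),
         third   = substr(Sequence, 3, 3)) %>%
  group_by(two_seq, third)

# Only the last grouping variable (third) is dropped, so sum(sums)
# in mutate() is taken within each two_seq group: conditional probability.
kept <- by_pair %>%
  summarize(sums = sum(sums), .groups = "drop_last") %>%
  mutate(prob = sums / sum(sums)) %>%
  ungroup()

# All grouping is dropped, so sum(sums) is the grand total of all counts:
# prob is now a joint probability of the full three-flip sequence.
dropped <- by_pair %>%
  summarize(sums = sum(sums), .groups = "drop") %>%
  mutate(prob = sums / sum(sums))
```

In `kept`, the probabilities sum to 1 within each `two_seq`; in `dropped`, they sum to 1 over all eight rows, which is not the quantity asked for. The comparison of the two pipelines follows.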

``````
my_data = structure(list(Sequence = structure(1:8, .Label = c("HTT", "TTH",
"HHH", "HHT", "HTH", "THH", "TTT", "THT"), class = "factor"),
sums = c(93L, 93L, 112L, 106L, 108L, 97L, 94L, 97L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -8L))
library(dplyr)

my_data
#> # A tibble: 8 × 2
#>   Sequence  sums
#>   <fct>    <int>
#> 1 HTT         93
#> 2 TTH         93
#> 3 HHH        112
#> 4 HHT        106
#> 5 HTH        108
#> 6 THH         97
#> 7 TTT         94
#> 8 THT         97

# Original code
my_data %>%
  mutate(two_seq = substr(Sequence, 1, 2)) %>%
  group_by(two_seq) %>%
  mutate(third = substr(Sequence, 3, 3)) %>%
  group_by(two_seq, third) %>%
  summarize(sums = sum(sums)) %>%
  mutate(prob = sums / sum(sums))
#> `summarise()` has grouped output by 'two_seq'. You can override using the
#> `.groups` argument.
#> # A tibble: 8 × 4
#> # Groups:   two_seq 
#>   two_seq third  sums  prob
#>   <chr>   <chr> <int> <dbl>
#> 1 HH      H       112 0.514
#> 2 HH      T       106 0.486
#> 3 HT      H       108 0.537
#> 4 HT      T        93 0.463
#> 5 TH      H        97 0.5
#> 6 TH      T        97 0.5
#> 7 TT      H        93 0.497
#> 8 TT      T        94 0.503

# Simplified version: the commented-out steps are not needed
my_data %>%
  mutate(two_seq = substr(Sequence, 1, 2)) %>%
  #group_by(two_seq) %>%
  mutate(third = substr(Sequence, 3, 3)) %>%
  #group_by(two_seq, third) %>%
  #summarize(sums = sum(sums)) %>%
  group_by(two_seq) %>%
  mutate(prob = sums / sum(sums)) %>%
  arrange(two_seq) # This line makes comparing to the original result easier
#> # A tibble: 8 × 5
#> # Groups:   two_seq 
#>   Sequence  sums two_seq third  prob
#>   <fct>    <int> <chr>   <chr> <dbl>
#> 1 HHH        112 HH      H     0.514
#> 2 HHT        106 HH      T     0.486
#> 3 HTT         93 HT      T     0.463
#> 4 HTH        108 HT      H     0.537
#> 5 THH         97 TH      H     0.5
#> 6 THT         97 TH      T     0.5
#> 7 TTH         93 TT      H     0.497
#> 8 TTT         94 TT      T     0.503
``````

Created on 2023-01-26 with reprex v2.0.2
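As an independent cross-check that does not rely on dplyr's grouping behavior, the same conditional probabilities can be sketched in base R with `xtabs()` and `prop.table()`. This is only one possible way to verify the result; the names `counts` and `cond_prob` are illustrative, and the data is rebuilt here with a plain character column instead of the factor used in the question.

```r
# Same counts as in the question, as a plain data frame
my_data <- data.frame(
  Sequence = c("HTT", "TTH", "HHH", "HHT", "HTH", "THH", "TTT", "THT"),
  sums = c(93L, 93L, 112L, 106L, 108L, 97L, 94L, 97L)
)

# Split each sequence into its first two flips and its third flip
my_data$two_seq <- substr(my_data$Sequence, 1, 2)
my_data$third   <- substr(my_data$Sequence, 3, 3)

# Cross-tabulate the counts: rows are the first two flips, columns the third
counts <- xtabs(sums ~ two_seq + third, data = my_data)

# margin = 1 divides each cell by its row total, i.e. conditions on two_seq
cond_prob <- prop.table(counts, margin = 1)

round(cond_prob, 3)
```

Each row of `cond_prob` sums to 1, and its entries should agree with the `prob` column in both pipelines above (for example, 112/218 for H given HH).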
