# Conditional Probability with dplyr

#1

For an introduction to probability, I am experimenting with using dplyr (well, tidyverse) to connect programming concepts to the idea of conditional probability. In my code below, I am using `mutate` to store numbers that I need later (simply the "numerator" and the "denominator"). My query is this: does anyone have a cleaner way of doing this calculation?

Example: Compute the probability that a randomly selected passenger on the Titanic was female given that the passenger was at least 35 years old.

``````
library("tidyverse") # for data wrangling tools
library("titanic")
tdf <- titanic_train # training set of Titanic data

conditional_probability <- tdf %>%
  filter(Age >= 35) %>%           # condition on age; rows with a missing Age are dropped
  mutate(denominator = n()) %>%   # number of passengers aged at least 35
  filter(Sex == "female") %>%
  mutate(numerator = n()) %>%     # number of those passengers who are female
  summarize(unique(numerator / denominator))
``````
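
For reference, the pipeline above follows the definition of conditional probability step by step. The symbols $F$ (the passenger is female) and $A$ (the passenger is at least 35) are labels introduced here for illustration, not names from the data set:

```latex
P(F \mid A) = \frac{P(F \cap A)}{P(A)}
            = \frac{\#\{\text{Sex} = \text{female and Age} \ge 35\}}{\#\{\text{Age} \ge 35\}}
```

Taking the counts from the tabulated output further down the thread (81 women among 81 + 154 = 235 passengers with a recorded age of at least 35), this works out to 81/235 ≈ 0.345.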

#2

This is a verbose but not necessarily wrong approach. In fact, it might be a good idea for students who are familiar with the concept of probability but still learning their way around R / dplyr / tidyverse.

I will be interested in other opinions.

#3

You can use `sum` inside your `summarize` call to do this all in one step:

``````
tdf %>%
  summarize(
    prob = sum(Age >= 35 & Sex == "female", na.rm = TRUE) /
      sum(Age >= 35, na.rm = TRUE)
  )
``````
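
To see why the `na.rm = TRUE` arguments matter in the one-step `sum` approach above, here is a minimal sketch on a made-up data frame. The `toy` data and the `toy_prob` name are hypothetical and only for illustration; they are not part of the Titanic data.

```r
library(dplyr)

# Hypothetical toy data (not the Titanic set): five passengers, one with a missing Age
toy <- data.frame(
  Age = c(40, 50, 20, NA, 36),
  Sex = c("female", "male", "female", "female", "male")
)

toy_prob <- toy %>%
  summarize(
    # TRUE counts as 1, so sum() counts the rows that satisfy each condition;
    # the comparison on the missing Age yields NA, which na.rm = TRUE drops
    prob = sum(Age >= 35 & Sex == "female", na.rm = TRUE) /
      sum(Age >= 35, na.rm = TRUE)
  )

toy_prob$prob  # 1 female among the 3 passengers aged 35 or more
```

Without `na.rm = TRUE`, both sums would be `NA` here, because a single missing `Age` makes the comparison `NA` for that row.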

#4

I was just teaching conditional probabilities today! I chose the following method:

``````
library(tidyverse)
library(titanic)

titanic_train %>%
  filter(
    !is.na(Sex),
    !is.na(Age)
  ) %>%
  mutate(age_cat = ifelse(Age >= 35, "at least 35", "less than 35")) %>%
  count(age_cat, Sex) %>%
  group_by(age_cat) %>%
  mutate(prop = n / sum(n))
#> # A tibble: 4 x 4
#> # Groups:   age_cat [2]
#>   age_cat      Sex        n  prop
#>   <chr>        <chr>  <int> <dbl>
#> 1 at least 35  female    81 0.345
#> 2 at least 35  male     154 0.655
#> 3 less than 35 female   180 0.376
#> 4 less than 35 male     299 0.624
``````

However, if you're also introducing `tidyr`, the following is another good way of going about it:

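``````
library(tidyverse)
library(titanic)

titanic_train %>%
  filter(
    !is.na(Sex),
    !is.na(Age)
  ) %>%
  mutate(age_cat = ifelse(Age >= 35, "at least 35", "less than 35")) %>%
  count(age_cat, Sex) %>%
  # reshape to one column per Sex so that `female` and `male` exist below
  pivot_wider(names_from = Sex, values_from = n) %>%
  mutate(prop = female / (female + male))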
#> # A tibble: 2 x 4
#>   age_cat      female  male  prop
#>   <chr>         <int> <int> <dbl>
#> 1 at least 35      81   154 0.345
#> 2 less than 35    180   299 0.376
``````

#5

Something I've just been learning about in a DataCamp course is that you can take the mean of a logical vector to calculate the proportion of cases where it is `TRUE`.

``````
library(tidyverse)
library(titanic)
tdf <- titanic_train # training set of Titanic data

tdf %>%
  filter(Age >= 35) %>%
  summarize(prob = mean(Sex == "female", na.rm = TRUE))
``````
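
As a small check of why this works, the sketch below uses a made-up data frame (the `toy`, `older`, and `is_female` names are hypothetical, not from the Titanic data) to show that the mean of a logical vector is the count of `TRUE` values divided by the length:

```r
library(dplyr)

# Hypothetical toy data (not the Titanic set)
toy <- data.frame(
  Age = c(40, 50, 20, NA, 36, 61),
  Sex = c("female", "male", "female", "female", "male", "female")
)

# filter() keeps only rows where the condition is TRUE,
# so the row with a missing Age is dropped along with the under-35 row
older <- toy %>% filter(Age >= 35)

is_female <- older$Sex == "female"

mean(is_female)                     # proportion of TRUE values
sum(is_female) / length(is_female)  # the same ratio written as count / total
```

Because `filter()` has already restricted the rows to the conditioning event, the mean is taken over exactly the denominator of the conditional probability.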

#6

One other cool thing you can do is group by not just a variable but also an expression that uses that variable, such as `Age >= 35`:

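``````
library(tidyverse)
library(titanic)
tdf <- titanic_train # training set of Titanic data

tdf %>%
  group_by(Age >= 35) %>%
  select(Sex) %>%     # the grouping column is kept alongside Sex
  table() %>%         # cross-tabulate `Age >= 35` against Sex
  prop.table(1)       # margin 1: proportions within each row
``````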
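
If you would rather stay inside dplyr, a grouped expression can also be given a name and summarized directly. This is only a sketch on made-up data; the `toy` data frame and the `at_least_35`, `prop_female`, and `by_age` names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data (not the Titanic set)
toy <- data.frame(
  Age = c(40, 50, 20, NA, 36, 61, 18),
  Sex = c("female", "male", "female", "female", "male", "female", "female")
)

by_age <- toy %>%
  filter(!is.na(Age)) %>%                  # drop rows with a missing Age first
  group_by(at_least_35 = Age >= 35) %>%    # group on a named expression
  summarize(prop_female = mean(Sex == "female"))

by_age
```

Each row of `by_age` then holds the proportion of women conditional on one value of `at_least_35`.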