Permutation Test by Year

I am trying to conduct permutation test by year. In the following example, I only did it for a particular year. Rather than manually conducting the test for each year, how could I could I implement it for all the available years? The goal is to have a data frame with year, number of male obs, number of female obs, observed difference, and p-values.

suppressWarnings(suppressMessages(library(tidyverse)))

################
# Complete Data
################
df <- tibble(
  delay = c(320, 480, 290, 350, 180, 260, 220, 325, 485, 295, 355, 185, 265, 225),
  gender = rep(c("F", "M", "F", "M"), c(4,3,5,2)),
  year = rep(c(2020, 2021), each = 7)
)


#########################################################################
#                        PERMUTATION TEST FOR 2020
#########################################################################

# H0: mu_female = mu_male
# HA: mu_female > mu_male

# Data 2020
df2020 <- df %>% 
  filter(year == 2020)
# Step 1:
# A: Pull all delays
delay_all_2020 <- df2020 %>% pull(delay) 
length(delay_all_2020) 
#> [1] 7

# B: Female Delay
delay_female_2020 <- df2020 %>%
  filter(gender == "F") %>%
  pull(delay)

length(delay_female_2020) 
#> [1] 4

# C: Male Delay
delay_male_2020 <- df2020 %>%
  filter(gender == "M") %>%
  pull(delay)
length(delay_male_2020) 
#> [1] 3
# Step 2: Find Mean Difference (Female - Male)
observed_2020 <- mean(delay_female_2020) - mean(delay_male_2020)
observed_2020
#> [1] 140
# Step 3: Simulate
N <- 10^4-1  #set number of times to repeat this process
set.seed(99)

result_2020 <- numeric(N) # space to save the random differences
for(i in 1:N)
{
  index_2020 <- sample(length(delay_all_2020),
                       size = length(delay_female_2020),
                       replace = FALSE)
  result_2020[i] <- mean(delay_all_2020[index_2020]) - mean(delay_all_2020[-index_2020])
}
# Step 4: Plot
ggplot() +
  geom_histogram(aes(result_2020)) +
  geom_vline(xintercept=observed_2020, colour="red")+
  xlab("Female - Male") +
  ggtitle("2020")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Step 5: P-value ; H0: mu_female = mu_male, HA: mu_female > mu_male
(sum(result_2020 >= observed_2020) + 1)/(N + 1)  #P-value
#> [1] 0.029

Created on 2020-11-11 by the reprex package (v0.3.0)

Hello,

First of all, thanks for creating a nice reprex! That really speed things up :slight_smile:

Here is an example of my interpretation of your goal

suppressWarnings(suppressMessages(library(tidyverse)))

################
# Complete Data
################
df <- tibble(
  delay = c(320, 480, 290, 350, 180, 260, 220, 325, 485, 295, 355, 185, 265, 225),
  gender = rep(c("F", "M", "F", "M"), c(4,3,5,2)),
  year = rep(c(2020, 2021), each = 7)
)


#############################
# PERMUTATION TEST FOR 2020
#############################

# H0: mu_female = mu_male
# HA: mu_female > mu_male
purrr::map_df(unique(df$year), function(myYear){
  
  # Data 2020
  dfYear <- df %>% 
    filter(year == myYear)
  
  # Step 1:
  # A: Pull all delays
  delay_all <- dfYear %>% pull(delay) 

  # B: Female Delay
  delay_female <- dfYear %>%
    filter(gender == "F") %>%
    pull(delay)
  

  # C: Male Delay
  delay_male <- dfYear %>%
    filter(gender == "M") %>%
    pull(delay)
   
  # Step 2: Find Mean Difference (Female - Male)
  observed <- mean(delay_female) - mean(delay_male)
  
  # Step 3: Simulate
  N <- 10^4-1  #set number of times to repeat this process
  set.seed(99)
  result <- numeric(N) # space to save the random differences
  for(i in 1:N)
  {
    index <- sample(length(delay_all),
                    size = length(delay_female),
                    replace = FALSE)
    result[i] <- mean(delay_all[index]) - mean(delay_all[-index])
  }

  pVal = (sum(result >= observed) + 1)/(N + 1)  #P-value
  
  return(data.frame(year = myYear, nMale = length(delay_male), 
             nFemale = length(delay_female), nTotal = length(delay_all),
             diff = observed, pVal = pVal))
  
})
#>   year nMale nFemale nTotal diff   pVal
#> 1 2020     3       4      7  140 0.0290
#> 2 2021     2       5      7   84 0.1895

Created on 2020-11-12 by the reprex package (v0.3.0)

All I did was use the map_df function from the purrr package (it is part of the Tidyverse you are loading) and combined all results into a data frame for each year. The map_df will then merge them all together into one.

I did not know what your plan with the plot was, so I left it out.

Hope this helps,
PJ

1 Like

Hello,

Sure there is. You can use the map() function instead of the map_df and return multiple elements. You still have to merge things afterwards then.

suppressWarnings(suppressMessages(library(tidyverse)))

################
# Complete Data
################
df <- tibble(
  delay = c(320, 480, 290, 350, 180, 260, 220, 325, 485, 295, 355, 185, 265, 225),
  gender = rep(c("F", "M", "F", "M"), c(4,3,5,2)),
  year = rep(c(2020, 2021), each = 7)
)


#############################
# PERMUTATION TEST FOR 2020
#############################

# H0: mu_female = mu_male
# HA: mu_female > mu_male
results = purrr::map(unique(df$year), function(myYear){
  
  # Data 2020
  dfYear <- df %>% 
    filter(year == myYear)
  
  # Step 1:
  # A: Pull all delays
  delay_all <- dfYear %>% pull(delay) 
  
  # B: Female Delay
  delay_female <- dfYear %>%
    filter(gender == "F") %>%
    pull(delay)
  
  
  # C: Male Delay
  delay_male <- dfYear %>%
    filter(gender == "M") %>%
    pull(delay)
  
  # Step 2: Find Mean Difference (Female - Male)
  observed <- mean(delay_female) - mean(delay_male)
  
  # Step 3: Simulate
  N <- 10^4-1  #set number of times to repeat this process
  set.seed(99)
  result <- numeric(N) # space to save the random differences
  for(i in 1:N)
  {
    index <- sample(length(delay_all),
                    size = length(delay_female),
                    replace = FALSE)
    result[i] <- mean(delay_all[index]) - mean(delay_all[-index])
  }
  result = as.data.frame(result)
  colnames(result) = myYear
  
  pVal = (sum(result >= observed) + 1)/(N + 1)  #P-value
  
  return(list(
    yearData = data.frame(year = myYear, nMale = length(delay_male), 
                    nFemale = length(delay_female), nTotal = length(delay_all),
                    diff = observed, pVal = pVal),
    otherData = result))
  
})

#Merge the results from the first data frame
df1 = bind_rows(sapply(results, "[", "yearData"))
df1
#>   year nMale nFemale nTotal diff   pVal
#> 1 2020     3       4      7  140 0.0290
#> 2 2021     2       5      7   84 0.1895

#Merge the results from the second data frame
df2 = bind_cols(sapply(results, "[", "otherData"))
head(df2)
#>         2020 2021
#> 1  -35.00000   84
#> 2   64.16667   42
#> 3 -110.83333   35
#> 4   99.16667   -7
#> 5  -29.16667  140
#> 6   11.66667 -140

Created on 2020-11-13 by the reprex package (v0.3.0)

The sapply(results, "[", "yearData") is a neat trick to extract the Nth element from a list of lists. It will loop over all lists and in this case extract the sublist yearData. Same for the other one

PJ

Hi,

In my previous post, I just gave an example of how you can pass multiple data frames. The one I created was a dummy one :slight_smile:

I now updated it with the data you requested, though I think it is easiest to have the data frame be a column per year with all the values in it.

Let me know what you think...

PJ

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.

Thanks a lot! This is really helpful.

Is it possible to have a two data frame: one that you have already produced and another with simulated differences with corresponding year for plotting?

I am thinking of plotting histogram and observed difference by year using facet_wrap. If there is a better way of doing it, please let me know.

Thanks again!

Many thanks for the quick reply.

I am afraid I did not manage to pose the question correctly.

By other data I meant this part: result[i] <- mean(delay_all[index]) - mean(delay_all[-index]). That is , the desired second data frame will have two columns: difference and year. So, for each year there will be 10^4-1 observations; total observations will be 2*10^4-1.

Thank you again for your help.

Thanks a lot PJ! Just what I wanted :smiley: