 # Permutation Test by Year

I am trying to conduct permutation test by year. In the following example, I only did it for a particular year. Rather than manually conducting the test for each year, how could I could I implement it for all the available years? The goal is to have a data frame with year, number of male obs, number of female obs, observed difference, and p-values.

``````suppressWarnings(suppressMessages(library(tidyverse)))

################
# Complete Data
################
df <- tibble(
delay = c(320, 480, 290, 350, 180, 260, 220, 325, 485, 295, 355, 185, 265, 225),
gender = rep(c("F", "M", "F", "M"), c(4,3,5,2)),
year = rep(c(2020, 2021), each = 7)
)

#########################################################################
#                        PERMUTATION TEST FOR 2020
#########################################################################

# H0: mu_female = mu_male
# HA: mu_female > mu_male

# Data 2020
df2020 <- df %>%
filter(year == 2020)
``````
``````# Step 1:
``````
``````# A: Pull all delays
delay_all_2020 <- df2020 %>% pull(delay)
length(delay_all_2020)
#>  7

# B: Female Delay
delay_female_2020 <- df2020 %>%
filter(gender == "F") %>%
pull(delay)

length(delay_female_2020)
#>  4

# C: Male Delay
delay_male_2020 <- df2020 %>%
filter(gender == "M") %>%
pull(delay)
length(delay_male_2020)
#>  3
``````
``````# Step 2: Find Mean Difference (Female - Male)
``````
``````observed_2020 <- mean(delay_female_2020) - mean(delay_male_2020)
observed_2020
#>  140
``````
``````# Step 3: Simulate
``````
``````N <- 10^4-1  #set number of times to repeat this process
set.seed(99)

result_2020 <- numeric(N) # space to save the random differences
for(i in 1:N)
{
index_2020 <- sample(length(delay_all_2020),
size = length(delay_female_2020),
replace = FALSE)
result_2020[i] <- mean(delay_all_2020[index_2020]) - mean(delay_all_2020[-index_2020])
}
``````
``````# Step 4: Plot
``````
``````ggplot() +
geom_histogram(aes(result_2020)) +
geom_vline(xintercept=observed_2020, colour="red")+
xlab("Female - Male") +
ggtitle("2020")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`````` ``````# Step 5: P-value ; H0: mu_female = mu_male, HA: mu_female > mu_male
``````
``````(sum(result_2020 >= observed_2020) + 1)/(N + 1)  #P-value
#>  0.029
``````

Created on 2020-11-11 by the reprex package (v0.3.0)

Hello,

First of all, thanks for creating a nice reprex! That really speed things up Here is an example of my interpretation of your goal

``````suppressWarnings(suppressMessages(library(tidyverse)))

################
# Complete Data
################
df <- tibble(
delay = c(320, 480, 290, 350, 180, 260, 220, 325, 485, 295, 355, 185, 265, 225),
gender = rep(c("F", "M", "F", "M"), c(4,3,5,2)),
year = rep(c(2020, 2021), each = 7)
)

#############################
# PERMUTATION TEST FOR 2020
#############################

# H0: mu_female = mu_male
# HA: mu_female > mu_male
purrr::map_df(unique(df\$year), function(myYear){

# Data 2020
dfYear <- df %>%
filter(year == myYear)

# Step 1:
# A: Pull all delays
delay_all <- dfYear %>% pull(delay)

# B: Female Delay
delay_female <- dfYear %>%
filter(gender == "F") %>%
pull(delay)

# C: Male Delay
delay_male <- dfYear %>%
filter(gender == "M") %>%
pull(delay)

# Step 2: Find Mean Difference (Female - Male)
observed <- mean(delay_female) - mean(delay_male)

# Step 3: Simulate
N <- 10^4-1  #set number of times to repeat this process
set.seed(99)
result <- numeric(N) # space to save the random differences
for(i in 1:N)
{
index <- sample(length(delay_all),
size = length(delay_female),
replace = FALSE)
result[i] <- mean(delay_all[index]) - mean(delay_all[-index])
}

pVal = (sum(result >= observed) + 1)/(N + 1)  #P-value

return(data.frame(year = myYear, nMale = length(delay_male),
nFemale = length(delay_female), nTotal = length(delay_all),
diff = observed, pVal = pVal))

})
#>   year nMale nFemale nTotal diff   pVal
#> 1 2020     3       4      7  140 0.0290
#> 2 2021     2       5      7   84 0.1895
``````

Created on 2020-11-12 by the reprex package (v0.3.0)

All I did was use the map_df function from the purrr package (it is part of the Tidyverse you are loading) and combined all results into a data frame for each year. The map_df will then merge them all together into one.

I did not know what your plan with the plot was, so I left it out.

Hope this helps,
PJ

1 Like

Hello,

Sure there is. You can use the map() function instead of the map_df and return multiple elements. You still have to merge things afterwards then.

``````suppressWarnings(suppressMessages(library(tidyverse)))

################
# Complete Data
################
df <- tibble(
delay = c(320, 480, 290, 350, 180, 260, 220, 325, 485, 295, 355, 185, 265, 225),
gender = rep(c("F", "M", "F", "M"), c(4,3,5,2)),
year = rep(c(2020, 2021), each = 7)
)

#############################
# PERMUTATION TEST FOR 2020
#############################

# H0: mu_female = mu_male
# HA: mu_female > mu_male
results = purrr::map(unique(df\$year), function(myYear){

# Data 2020
dfYear <- df %>%
filter(year == myYear)

# Step 1:
# A: Pull all delays
delay_all <- dfYear %>% pull(delay)

# B: Female Delay
delay_female <- dfYear %>%
filter(gender == "F") %>%
pull(delay)

# C: Male Delay
delay_male <- dfYear %>%
filter(gender == "M") %>%
pull(delay)

# Step 2: Find Mean Difference (Female - Male)
observed <- mean(delay_female) - mean(delay_male)

# Step 3: Simulate
N <- 10^4-1  #set number of times to repeat this process
set.seed(99)
result <- numeric(N) # space to save the random differences
for(i in 1:N)
{
index <- sample(length(delay_all),
size = length(delay_female),
replace = FALSE)
result[i] <- mean(delay_all[index]) - mean(delay_all[-index])
}
result = as.data.frame(result)
colnames(result) = myYear

pVal = (sum(result >= observed) + 1)/(N + 1)  #P-value

return(list(
yearData = data.frame(year = myYear, nMale = length(delay_male),
nFemale = length(delay_female), nTotal = length(delay_all),
diff = observed, pVal = pVal),
otherData = result))

})

#Merge the results from the first data frame
df1 = bind_rows(sapply(results, "[", "yearData"))
df1
#>   year nMale nFemale nTotal diff   pVal
#> 1 2020     3       4      7  140 0.0290
#> 2 2021     2       5      7   84 0.1895

#Merge the results from the second data frame
df2 = bind_cols(sapply(results, "[", "otherData"))
#>         2020 2021
#> 1  -35.00000   84
#> 2   64.16667   42
#> 3 -110.83333   35
#> 4   99.16667   -7
#> 5  -29.16667  140
#> 6   11.66667 -140
``````

Created on 2020-11-13 by the reprex package (v0.3.0)

The `sapply(results, "[", "yearData")` is a neat trick to extract the Nth element from a list of lists. It will loop over all lists and in this case extract the sublist yearData. Same for the other one

PJ

Hi,

In my previous post, I just gave an example of how you can pass multiple data frames. The one I created was a dummy one I now updated it with the data you requested, though I think it is easiest to have the data frame be a column per year with all the values in it.

Let me know what you think...

PJ

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.

Thanks a lot! This is really helpful.

Is it possible to have a two data frame: one that you have already produced and another with simulated differences with corresponding year for plotting?

I am thinking of plotting histogram and observed difference by year using `facet_wrap`. If there is a better way of doing it, please let me know.

Thanks again!

Many thanks for the quick reply.

I am afraid I did not manage to pose the question correctly.

By other data I meant this part: `result[i] <- mean(delay_all[index]) - mean(delay_all[-index])`. That is , the desired second data frame will have two columns: difference and year. So, for each year there will be 10^4-1 observations; total observations will be 2*10^4-1.

Thank you again for your help.

Thanks a lot PJ! Just what I wanted 