Hi All,
This is probably a very basic statistics question. I am currently trying to check whether a feature in my dataset would make a good attribute for predicting a binary outcome. I have read about correspondence analysis, which seems useful when you have lots of factors; however, I am currently looking at a chi-squared test using permutation.
I want to check whether there is an association between my variable and the binary outcome I'm trying to predict. To this end I have set up the example below, where I am trying to see whether there is an association between the sex of a student and which school they go to. This is obviously a nonsense example.
My understanding of permutation analysis (with respect to the chi-squared test) is as follows:
- Calculate the chi-squared statistic on my data
- Generate permutations by shuffling one of the columns, so that any remaining association between the two columns is purely random
- After each permutation, recalculate the test statistic
- Visualize to inspect the results
- Obtain a p-value as the proportion of test statistics from the permuted data that were greater than my observed test statistic
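To make the steps above concrete, here is a minimal base-R sketch. It is only an illustration under stated assumptions: the data are toy values invented for the sketch (not the dataset used in the example further down), `chisq_stat()` is a hypothetical helper, the statistic is computed without a continuity correction, and ties are counted with `>=` rather than the strict `>` described in the list.

```r
# Toy data, invented for illustration only
set.seed(42)
n <- 300
school <- sample(c("Pasteur", "Grant-White"), n, replace = TRUE)
sex <- sample(c("M", "F"), n, replace = TRUE)

# Hypothetical helper: Pearson chi-squared statistic of a two-way table,
# computed without the continuity correction (an assumption of this sketch)
chisq_stat <- function(x, y) {
  unname(suppressWarnings(chisq.test(table(x, y), correct = FALSE)$statistic))
}

# Step 1: statistic on the observed data
observed <- chisq_stat(school, sex)

# Steps 2-3: shuffle one column to break any association, then recompute
reps <- 2000
null_stats <- replicate(reps, chisq_stat(school, sample(sex)))

# Step 4: inspect the permutation distribution against the observed value
hist(null_stats, breaks = 30, main = "Permutation null distribution")
abline(v = observed, col = "red")

# Step 5: proportion of permuted statistics at least as large as the observed one
p_perm <- mean(null_stats >= observed)
p_perm
```

Because the toy columns are generated independently, the resulting proportion should usually not be small; with real data the same five steps apply unchanged.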
Using the below as an example, I have a couple of questions:
- Why does my p-value from the mathematical (theoretical) method differ so greatly from the one from the computational (permutation) method?
- How do I interpret the probability generated by the last line?
library(lavaan)  # provides the HolzingerSwineford1939 dataset
library(infer)
library(tidyverse)
library(janitor)
mydf <- HolzingerSwineford1939 %>%
  mutate(sex = ifelse(sex == 1, 'M', 'F'))
# Step 1: calculate the observed test statistic
# Chi-squared test of independence: checks whether there is a relationship
# between the rows and the columns of the table
tabyl(mydf, school, sex)
actual_score <- mydf %>% chisq_test(school ~ sex)
# Null hypothesis: the categories are independent
actual_score
# Step 2: generate 5000 permuted chi-squared statistics
chisq_null <- mydf %>%
  specify(school ~ sex, success = 'Pasteur') %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  # `order` is dropped: it only applies to difference or ratio statistics
  calculate(stat = "Chisq")
# Visualize: this is very slow and takes a while to render
# Recent versions of infer shade the p-value region with shade_p_value()
visualize(chisq_null, method = "both") +
  shade_p_value(obs_stat = actual_score$statistic, direction = "greater")
# Attempt to get the p-value from the permutations
chisq_null %>% summarise(p_val = mean(stat > actual_score$statistic))