How should I test whether two versions of the same categorical variable are distributed significantly differently?

Tung · September 28, 2020, 1:17am

This is my population (true value):

Full Membership
membership level
annual	40.13%
bargain	44.02%
life	9.14%
not yet retired	1.41%
out of area	2.22%
out of area bargain	3.08%

I have the same information for my sample, including the counts. Now I want to know whether the sample's distribution differs significantly from the population's. What test can I use?

I don't think a chi-square test is suitable here. I referred to the book 'R in Action', which says: "Chi-square tests are often used to assess the relationship between two categorical variables. The null hypothesis is typically that the variables are independent versus a research hypothesis that they aren't."

For example: ethnicity versus individuals expected to be promoted.

In my case, I am comparing the same categorical variable: one distribution comes from the sample, the other from the population (true value). The goal of this step is to show that there is no bias in the responses; in other words, that my sample is representative of the population. What test should I use?

RahulB · September 28, 2020, 2:28am

Hey Tung,

You would still use a chi-square test, specifically the chi-square goodness-of-fit test. People often confuse it with the chi-square test of independence.
See:

Tung · September 28, 2020, 2:47am

Thank you so much. Do you know the code in R?

Tung · September 28, 2020, 1:30pm

RahulB · September 28, 2020, 6:34pm

It is the same command as regular chisq.test().

The chi-square test here tells you whether the distribution in the sample differs from the distribution in the population. You do not want a significant difference, since that would mean the sample is not representative. Hence, a p-value greater than 0.05 is desirable: it suggests that the sample is not significantly different from the population.

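set.seed(42)  # make the simulation reproducible
membersVec <- c("annual", "bargain", "life", "notYetRetired", "outOfArea", "outOfAreaBargain")

## Simulated population; factor() keeps all six levels even if one is never drawn
membersDf <- factor(sample(membersVec, size = 2000, prob = c(0.4013, 0.4402, 0.0914, 0.0141, 0.0222, 0.0308), replace = TRUE), levels = membersVec)
membersTbl <- table(membersDf)

## Sample
membersSample <- sample(membersDf, size = 200, replace = TRUE)
membersSampleTbl <- table(membersSample)

## ChiSq goodness-of-fit test: sample counts against the population proportions
## (passing both tables as x and y would cross-tabulate the counts, which is not the intended test)
chisq.test(membersSampleTbl, p = as.vector(prop.table(membersTbl)), simulate.p.value = TRUE, B = 1000)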
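
When the true population percentages are already known, as in the first post, the simulation step is not strictly needed: the percentages can be passed to chisq.test() through its p argument. The following is only a minimal sketch. The population percentages are copied from the first post, but the sample counts (sampleCounts) are made-up numbers for illustration, not Tung's data.

```r
# Population percentages from the first post
popPct <- c(annual = 40.13, bargain = 44.02, life = 9.14,
            notYetRetired = 1.41, outOfArea = 2.22, outOfAreaBargain = 3.08)

# HYPOTHETICAL sample counts, in the same category order (assumed for illustration)
sampleCounts <- c(annual = 210, bargain = 215, life = 40,
                  notYetRetired = 8, outOfArea = 12, outOfAreaBargain = 15)

# Goodness-of-fit test: do the sample counts match the population proportions?
# rescale.p = TRUE rescales p to sum to 1, guarding against rounding in the percentages
res <- chisq.test(sampleCounts, p = popPct, rescale.p = TRUE)

res$expected   # expected counts under the population proportions
res$statistic  # chi-square statistic
res$p.value    # a large p-value gives no evidence of a difference
```

The order of the categories in the counts and in p has to match, because chisq.test() pairs them by position, not by name.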

Tung · September 30, 2020, 4:57pm

In my population, I have

Bundle administrator Bundle member
1224 599

In my sample, we have
Bundle administrator Bundle member
712 476

It looks like my sample is representative. However, this is my code and its output:

bundle <- c(712, 476)
res <- chisq.test(bundle, p = c(1224/1823, 599/1823))
res

	Chi-squared test for given probabilities

data:  bundle
X-squared = 27.989, df = 1, p-value = 1.22e-07

The outcome is a highly significant difference. I don't understand.
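
The reported statistic can be reproduced by hand, which may help show where the significance comes from. The sketch below uses only the counts from this post. The variable names are illustrative, and it assumes the test is the plain goodness-of-fit statistic, the sum of (observed - expected)^2 / expected, with one degree of freedom.

```r
# Counts copied from the post
population   <- c(admin = 1224, member = 599)
sampleCounts <- c(admin = 712,  member = 476)

# Proportions side by side: about 67.1% administrators in the population
# versus about 59.9% in the sample
popProp    <- population / sum(population)
sampleProp <- sampleCounts / sum(sampleCounts)
round(rbind(population = popProp, sample = sampleProp), 3)

# Expected sample counts if the sample followed the population proportions
expected <- sum(sampleCounts) * popProp

# Goodness-of-fit statistic and p-value, computed manually
statistic <- sum((sampleCounts - expected)^2 / expected)
pValue    <- pchisq(statistic, df = length(sampleCounts) - 1, lower.tail = FALSE)

# Same test through chisq.test(), for comparison
res <- chisq.test(sampleCounts, p = popProp)
```

The share of administrators differs by roughly seven percentage points between population and sample. With 1188 observations, a gap of that size is large relative to sampling noise, which is consistent with the small p-value rather than a sign that the code is wrong.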
