How should I test whether the same categorical variable is distributed significantly differently in my sample and in my population?

This is my population (true value):

membership level        Full Membership
annual                  40.13%
bargain                 44.02%
life                     9.14%
not yet retired          1.41%
out of area              2.22%
out of area bargain      3.08%

I have the same information for my sample, including the counts. Now I want to know whether the two distributions differ significantly. Which test can I use?

I don't think a chi-square test is suitable here. The book 'R in Action' says: "Chi-square tests are often used to assess the relationship between two categorical variables. The null hypothesis is typically that the variables are independent versus a research hypothesis that they aren't."

For example: ethnicity versus whether individuals are expected to be promoted.

In my case, I am comparing the same categorical variable: once in the sample and once in the population (true values). The goal of this step is to show that there is no bias in the responses, in other words that my sample is representative of the population. What test should I use?

Hey Tung,

You would still use a chi-square test, specifically the chi-square goodness-of-fit test. People often confuse it with the chi-square test of independence.
See:


Thank you so much. Do you know the code in R?

It is the same command as the regular chisq.test(); you pass the observed sample counts as x and the population proportions as p.

The chi-square test here tells you whether the distribution in the sample differs from the distribution in the population. You do not want a significant difference, since that would mean the sample is not representative. Hence, a p-value greater than 0.05 is desirable: it suggests that the sample is not significantly different from the population.

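set.seed(42)  # make the simulated example reproducible
membersVec <- c("annual", "bargain", "life", "notYetRetired", "outOfArea", "outOfAreaBargain")
## Simulated population; a factor keeps all six levels even if one is never drawn
membersDf <- factor(sample(membersVec, size = 2000, prob = c(0.4013, 0.4402, 0.0914, 0.0141, 0.0222, 0.0308), replace = TRUE), levels = membersVec)
membersTbl <- table(membersDf)

## Sample
membersSample <- sample(membersDf, size = 200, replace = TRUE)
membersSampleTbl <- table(membersSample)

## ChiSq goodness-of-fit test: sample counts (x) against population proportions (p)
chisq.test(membersSampleTbl, p = as.vector(prop.table(membersTbl)), simulate.p.value = TRUE, B = 1000)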
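To connect this with the numbers in the original question, here is a minimal sketch that runs the goodness-of-fit test against the posted population percentages and also recomputes the statistic by hand. Only the percentages come from the question; the sample counts in `obs` are invented for illustration, and the names `popProp`, `obs`, `gof` and `statByHand` are placeholders.

```r
# Population proportions taken from the question (they sum to 1)
popProp <- c(annual = 0.4013, bargain = 0.4402, life = 0.0914,
             notYetRetired = 0.0141, outOfArea = 0.0222,
             outOfAreaBargain = 0.0308)

# HYPOTHETICAL sample counts, made up for this example only
obs <- c(annual = 85, bargain = 82, life = 20,
         notYetRetired = 3, outOfArea = 4, outOfAreaBargain = 6)

# Goodness-of-fit test: are the sample counts consistent with popProp?
# The p-value is simulated because some expected counts are small.
gof <- chisq.test(obs, p = popProp, simulate.p.value = TRUE, B = 2000)

# The same statistic by hand: sum of (observed - expected)^2 / expected
expected <- sum(obs) * popProp
statByHand <- sum((obs - expected)^2 / expected)

gof
statByHand
```

The hand-computed value should agree with the X-squared that chisq.test() reports. Only the p-value varies from run to run, because it is simulated.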

In my population, I have

Bundle administrator   Bundle member
1224                   599

In my sample, I have

Bundle administrator   Bundle member
712                    476

It looks to me like my sample is representative. However, this is my code and its output:

bundle <- c(712, 476)
res <- chisq.test(bundle, p = c(1224/1823, 599/1823))
res

	Chi-squared test for given probabilities

data:  bundle
X-squared = 27.989, df = 1, p-value = 1.22e-07

The result says the difference is highly significant. I don't understand why.
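One way to see where the small p-value comes from is to put the proportions side by side. Administrators make up about 59.9% of the sample (712/1188) but about 67.1% of the population (1224/1823). With nearly 1,200 responses, a gap of roughly seven percentage points is far larger than sampling noise alone would usually produce. The sketch below only rearranges the numbers posted above; the names `popCounts`, `popProp` and `expected` are placeholders of mine.

```r
# Counts as posted in the thread
bundle    <- c(admin = 712, member = 476)    # sample
popCounts <- c(admin = 1224, member = 599)   # population
popProp   <- popCounts / sum(popCounts)

# Shares in percent: sample versus population
round(100 * prop.table(bundle), 1)
round(100 * popProp, 1)

# Counts the sample would show if it matched the population exactly
expected <- sum(bundle) * popProp
round(expected, 1)

# Same test as above; the residuals show which group is over- or under-represented
res <- chisq.test(bundle, p = popProp)
res$statistic
round(res$residuals, 2)
```

If the sample mirrored the population, it would contain roughly 798 administrators rather than 712, so administrators appear under-represented and members over-represented among the respondents.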
