What test to use instead of Wilcoxon signed rank test?

Hi!

I'm pretty new to R and statistics in general and have a question about which test to use in R. I am trying to find out whether the average number of words of a specific type differs between type A and type B conversations (two independent groups, 32 conversations in total). My null hypothesis is that the averages do not differ; the alternative hypothesis is that there is a significant difference. I thought I could not use a t-test for two independent groups, because my samples are too small (type A: n = 14; type B: n = 18). I wanted to opt for a Wilcoxon test, but I have too many ties in the data. Is there an alternative?

Thanks in advance for your help!

There is no minimum sample size for a t-test. You should opt for a Wilcoxon test if your data is non-normal.
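As a rough illustration of the ties issue, using made-up counts rather than the poster's data: for two independent groups the relevant version is the rank-sum (Mann-Whitney) test, which is what `wilcox.test()` runs when given two unpaired samples; the signed rank test is the paired version. With ties, an exact p-value cannot be computed, so R falls back to a normal approximation with a warning. Setting `exact = FALSE` asks for that approximation explicitly.

```r
# Hypothetical word counts for illustration only; note the repeated values (ties)
type_a <- c(3, 5, 5, 6, 7, 7, 7, 8, 9, 9, 10, 11, 12, 12)
type_b <- c(5, 7, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 15, 15, 16, 17, 18)

# Two-sample (rank-sum) test for independent groups;
# exact = FALSE uses the normal approximation, which tolerates ties
res <- wilcox.test(type_a, type_b, exact = FALSE)
res$p.value
```

So ties do not stop the test from running; they only rule out the exact p-value.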

I'll apologise in advance for the long post, but hopefully it'll help you to get more robust results, rather than relying on the standard testing repertoire that a lot of people use, which is often not fit for purpose (which I believe is the case here based on what you've said about your data).

Based on the type of data you have, I'd suggest fitting a Poisson model (which is usually used for count data) and using the group as a covariate in your model. The output for the Poisson model will give you a coefficient, standard error and p-value for being in group B compared to group A.

Here's an example (which also demonstrates that your data should be in long format). I'll generate group A to have an average of 20 and group B to have an average of 24 and use your sample sizes as above (I'll set the seed so you can replicate the random number generation):

set.seed(101)

df_A <- data.frame(grp = "A",conv = rpois(14,20))
df_B <- data.frame(grp = "B",conv = rpois(18,24))

df <- rbind(df_A,df_B)

mod <- glm(conv ~ grp, data=df, family="poisson")

So the df has a column with the word count for each conversation, conv, and a column indicating which group the conversation came from, grp. When we create the model we use the formula conv ~ grp, which means we want to regress conv on grp.

Outputting this model doesn't give an awful lot of information:

> mod

Call:  glm(formula = conv ~ grp, family = "poisson", data = df)

Coefficients:
(Intercept)         grpB  
      2.992        0.202  

Degrees of Freedom: 31 Total (i.e. Null);  30 Residual
Null Deviance:	    35.31 
Residual Deviance: 28.26 	AIC: 190.1

But the value here, grpB = 0.202, indicates that being in group B increases the log-mean by 0.202 relative to group A (I'll get back to that in a minute).
We can get more information from the mod by running it through the summary() function:

> summary(mod)

Call:
glm(formula = conv ~ grp, family = "poisson", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5336  -0.7031   0.1806   0.5101   1.8357  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.99215    0.05987  49.979  < 2e-16 ***
grpB         0.20197    0.07656   2.638  0.00834 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 35.310  on 31  degrees of freedom
Residual deviance: 28.257  on 30  degrees of freedom
AIC: 190.06

Number of Fisher Scoring iterations: 4

Here we see a table of coefficients. The first column matches what was originally output for the (Intercept) and grpB, the second gives the standard error (which can be used to calculate confidence intervals) and the last one is the p-value. The p-value for the intercept just means that the log-mean in group A isn't 0, i.e. the average in group A isn't 1 (duh), but the p-value for grpB is only 0.00834, so we can say that the difference is statistically significant. Therefore, we can reject H_0 in favour of H_1.
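To make the confidence interval remark concrete, here is a minimal sketch that refits the example model and builds an approximate 95% Wald interval for grpB from the estimate and its standard error. The variable names (`est`, `se`, `ci_log`, `rate_ratio`, `ci_ratio`) are my own, and the normal approximation is an assumption that gets rougher with small samples.

```r
# Recreate the example data and model from above
set.seed(101)
df <- rbind(
  data.frame(grp = "A", conv = rpois(14, 20)),
  data.frame(grp = "B", conv = rpois(18, 24))
)
mod <- glm(conv ~ grp, data = df, family = "poisson")

# Pull the estimate and standard error for grpB from the coefficient table
coefs <- summary(mod)$coefficients
est <- coefs["grpB", "Estimate"]
se  <- coefs["grpB", "Std. Error"]

# Approximate 95% Wald interval on the log scale
ci_log <- est + c(-1, 1) * qnorm(0.975) * se

# Exponentiate to express it as a ratio of group means (B relative to A)
rate_ratio <- exp(est)
ci_ratio   <- exp(ci_log)
ci_ratio
```

If the interval for the ratio excludes 1, that agrees with the small p-value for grpB.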

What do I mean by "log-mean"? Well, the way that Poisson regression works is that we try to find a function \theta, which acts on our covariates, z (in this case just the group), to give us an estimate of the average, \lambda, of our Poisson distribution. So we're trying to solve:

\lambda = \exp\left(\theta(z)\right)

The results from the model are the coefficients that make up the \theta function, so in our example:

\theta(z) = 2.99215 + 0.20197*(\textrm{group}=B)

So if the group is A, \theta(z) = 2.99215, since the second term resolves to 0, and taking the exponential of that gives \lambda = 19.92857. We can find this in R by pulling out the first coefficient from the model and taking its exponential:

> exp(mod$coefficients[1])
(Intercept) 
   19.92857 

OR by noting that this is the average of group A:

> mean(df_A$conv)
[1] 19.92857

If we move over to group B, we set \textrm{group}=B to be 1 and get \theta(z) = 2.99215 + 0.20197 = 3.19412 and therefore \lambda = 24.38889. Again, we can find this in R by summing both the coefficients and taking the exponential:

> exp(sum(mod$coefficients))
[1] 24.38889

OR by noting that this is the average of group B:

> mean(df_B$conv)
[1] 24.38889
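The same numbers can be obtained without doing the arithmetic by hand. As a small sketch (the `new_groups` data frame and the other variable names are mine), `predict()` returns \theta(z) with `type = "link"` and \lambda with `type = "response"`:

```r
# Recreate the example data and model from above
set.seed(101)
df <- rbind(
  data.frame(grp = "A", conv = rpois(14, 20)),
  data.frame(grp = "B", conv = rpois(18, 24))
)
mod <- glm(conv ~ grp, data = df, family = "poisson")

# One row per group we want a prediction for
new_groups <- data.frame(grp = c("A", "B"))

# type = "link" gives theta(z); type = "response" gives lambda = exp(theta(z))
theta  <- predict(mod, newdata = new_groups, type = "link")
lambda <- predict(mod, newdata = new_groups, type = "response")

# With group as the only covariate, lambda should match the raw group means
group_means <- tapply(df$conv, df$grp, mean)
```

This becomes more useful once the model has more than one covariate, where the fitted means no longer coincide with simple group averages.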

By using a Poisson regression rather than a t-test or a Wilcoxon test, we are assuming that the data are Poisson-distributed, and since this is count data, that is a fair starting assumption. When the assumption holds, the tighter assumptions strengthen the conclusion that there is a difference (or that there isn't, if that be the case).
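One way to sanity-check the Poisson assumption, offered as a sketch rather than a required step: a Poisson model fixes the variance equal to the mean, so the Pearson chi-square statistic divided by the residual degrees of freedom should be close to 1. Values well above 1 suggest overdispersion, in which case a quasi-Poisson fit (which estimates that ratio instead of fixing it at 1) is one common alternative. The names `dispersion` and `mod_quasi` are my own.

```r
# Recreate the example data and model from above
set.seed(101)
df <- rbind(
  data.frame(grp = "A", conv = rpois(14, 20)),
  data.frame(grp = "B", conv = rpois(18, 24))
)
mod <- glm(conv ~ grp, data = df, family = "poisson")

# Pearson chi-square divided by residual degrees of freedom;
# roughly 1 is consistent with the Poisson variance assumption
dispersion <- sum(residuals(mod, type = "pearson")^2) / df.residual(mod)
dispersion

# Quasi-Poisson keeps the same coefficients but scales the standard errors
# by the estimated dispersion
mod_quasi <- glm(conv ~ grp, data = df, family = "quasipoisson")
summary(mod_quasi)$dispersion
```

With real conversation data the ratio may well be larger than for simulated Poisson counts, so it is worth looking at before trusting the p-value.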

A t-test essentially does the same as the above, but it is equivalent to an ordinary linear regression of conv on grp, which assumes normally distributed errors with the same variance in both groups. If your data is count data, these assumptions may not hold. You can see the equivalence by running the glm() function without specifying that we want a Poisson regression (i.e. without the family = "poisson" argument), which fits a regular linear regression, and comparing the results with those of an equal-variance t-test:

> mod_2 <- glm(conv~grp,data=df)
> t_test <- t.test(conv ~ grp,data=df,var.equal = TRUE)
> summary(mod_2)$coefficients["grpB","Pr(>|t|)"]
[1] 0.01086051
> t_test$p.value
[1] 0.01086051

Thanks so much! I'll try this out!
