Puzzling Kruskal-Wallis test results

Hi everyone. I have the following problem: for two of my datasets, each containing a categorical variable and a numeric variable of interest, the kruskal.test() function gives me exactly the same result.

Concerning my datasets, they do share similarities: while the numeric variable contains quite different values in the two datasets, the categorical variable has exactly the same content in both. Both datasets have the same number of data points. Also, the categorical and numeric variables have the same names in both datasets.

Now, let me show you what I mean exactly:

> kruskal.test(Signaltonoise ~ MatrixSolution, data = Datasheet_2_matrix_sel_105)

	Kruskal-Wallis rank sum test

data:  Signaltonoise by MatrixSolution
Kruskal-Wallis chi-squared = 18.701,    df = 3,    p-value = 0.0003152

> kruskal.test(Signaltonoise ~ MatrixSolution, data = Datasheet_2_matrix_sel_163)

	Kruskal-Wallis rank sum test

data:  Signaltonoise by MatrixSolution
Kruskal-Wallis chi-squared = 18.701,    df = 3,    p-value = 0.0003152

My categorical variable (MatrixSolution) splits my data points (21 in total) in both sets the same way:

category 1 --> datapoints 1 to 5 (5 in total)
category 2 --> datapoints 6 to 11 (6 in total)
category 3 --> datapoints 12 to 17 (6 in total)
category 4 --> datapoints 18 to 21 (4 in total)

The two numerical vectors (Signaltonoise) are:

for the dataset that ends with 105

|1|3.2752879|
|2|1.9166651|
|3|2.5643237|
|4|2.3300389|
|5|2.2994027|
|6|1.1736778|
|7|1.0620759|
|8|0.5249439|
|9|0.7423361|
|10|1.1883668|
|11|0.4138182|
|12|30.1478089|
|13|36.7350398|
|14|16.2086811|
|15|26.2752890|
|16|35.4236749|
|17|25.2327129|
|18|10.8551473|
|19|10.8011864|
|20|12.1467999|
|21|9.4906094|

for the dataset that ends with 163

|1|4.0289918|
|2|2.8699921|
|3|2.8377330|
|4|3.1811226|
|5|2.5667746|
|6|1.6326522|
|7|1.6269232|
|8|1.2101360|
|9|0.8997288|
|10|1.2155427|
|11|0.4847995|
|12|81.2482026|
|13|77.7213035|
|14|40.7294365|
|15|61.1245143|
|16|82.0395821|
|17|66.3434549|
|18|23.2610273|
|19|20.6216288|
|20|26.6061189|
|21|25.1866147|

I have other pairs of datasets that share the same similarities with each other as the two datasets I described above, and there I do not have this problem. Theoretically it's possible that these two datasets really do share the exact same Kruskal-Wallis output, but I find that more than hard to believe.

Can anybody explain to me what's going on? Thanks a lot in advance.

Hi @candidaorelmex,
Welcome to the RStudio Community Forum.

Looks like you hit the jackpot! I ran your data from scratch and got exactly the same result for both sets of data:

text105 <- "
|1|3.2752879|
|2|1.9166651|
|3|2.5643237|
|4|2.3300389|
|5|2.2994027|
|6|1.1736778|
|7|1.0620759|
|8|0.5249439|
|9|0.7423361|
|10|1.1883668|
|11|0.4138182|
|12|30.1478089|
|13|36.7350398|
|14|16.2086811|
|15|26.2752890|
|16|35.4236749|
|17|25.2327129|
|18|10.8551473|
|19|10.8011864|
|20|12.1467999|
|21|9.4906094|
"
data105 <- read.delim(text=text105, header=FALSE, sep="|")
data105 <- data105[,c(2,3)]
names(data105) <- c("index", "Signaltonoise")
data105$MatrixSolution <- c(rep(1,5), rep(2,6), rep(3,6), rep(4,4))
head(data105)
#>   index Signaltonoise MatrixSolution
#> 1     1      3.275288              1
#> 2     2      1.916665              1
#> 3     3      2.564324              1
#> 4     4      2.330039              1
#> 5     5      2.299403              1
#> 6     6      1.173678              2


text163 <- "
|1|4.0289918|
|2|2.8699921|
|3|2.8377330|
|4|3.1811226|
|5|2.5667746|
|6|1.6326522|
|7|1.6269232|
|8|1.2101360|
|9|0.8997288|
|10|1.2155427|
|11|0.4847995|
|12|81.2482026|
|13|77.7213035|
|14|40.7294365|
|15|61.1245143|
|16|82.0395821|
|17|66.3434549|
|18|23.2610273|
|19|20.6216288|
|20|26.6061189|
|21|25.1866147|
"
data163 <- read.delim(text=text163, header=FALSE, sep="|")
data163 <- data163[,c(2,3)]
names(data163) <- c("index", "Signaltonoise")
data163$MatrixSolution <- c(rep(1,5), rep(2,6), rep(3,6), rep(4,4))
head(data163)
#>   index Signaltonoise MatrixSolution
#> 1     1      4.028992              1
#> 2     2      2.869992              1
#> 3     3      2.837733              1
#> 4     4      3.181123              1
#> 5     5      2.566775              1
#> 6     6      1.632652              2

kruskal.test(Signaltonoise ~ MatrixSolution, data = data105)
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  Signaltonoise by MatrixSolution
#> Kruskal-Wallis chi-squared = 18.701, df = 3, p-value = 0.0003152
kruskal.test(Signaltonoise ~ MatrixSolution, data = data163)
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  Signaltonoise by MatrixSolution
#> Kruskal-Wallis chi-squared = 18.701, df = 3, p-value = 0.0003152

# Let's check we are not going mad!
# Change one data point
data105$Signaltonoise[5] <- 10.000

# Different result
kruskal.test(Signaltonoise ~ MatrixSolution, data = data105)
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  Signaltonoise by MatrixSolution
#> Kruskal-Wallis chi-squared = 18.479, df = 3, p-value = 0.0003503

Created on 2020-06-02 by the reprex package (v0.3.0)

HTH

Dear DavoWW,

thanks a lot for looking into my problem. Or my not-problem, though I can still hardly believe it. Exactly the same p-value down to the 8th decimal point?!? I would have never thought that I'd see this.

Do you by chance know any other ways to perform a Kruskal-Wallis test in R? I know that kruskal.test() is reliable, but I would just like to exclude the possibility that it's the function that turns things upside down.

It's not so surprising, as the test works on the ranks of the input, and it turns out the ranks in your two datasets are equal (well, maybe that is surprising, but I don't know how you arrived at this data...)


library(dplyr)  # provides arrange()

# data105 and data163 as originally posted, before the single data point was changed
data163$SignaltonoiseRank <- rank(data163$Signaltonoise)
data105$SignaltonoiseRank <- rank(data105$Signaltonoise)
kruskal.test(SignaltonoiseRank ~ MatrixSolution, data = data163)
# same result, p-value = 0.0003152 again

# sort each dataset by group, then by rank within group
data163 <- arrange(data163, MatrixSolution, SignaltonoiseRank)
data105 <- arrange(data105, MatrixSolution, SignaltonoiseRank)

data105$SignaltonoiseRank - data163$SignaltonoiseRank
# all zeros: the within-group ranks are the same
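
As a cross-check that does not rely on kruskal.test(), here is a minimal sketch that computes the Kruskal-Wallis statistic by hand from the rank sums. The helper name kw_by_hand is made up for this illustration, and the formula assumes no tied values (which holds for the data posted above). It uses the originally posted values, not the version with the changed data point.

```r
# Signal-to-noise values exactly as posted in the question
sn105 <- c(3.2752879, 1.9166651, 2.5643237, 2.3300389, 2.2994027,
           1.1736778, 1.0620759, 0.5249439, 0.7423361, 1.1883668, 0.4138182,
           30.1478089, 36.7350398, 16.2086811, 26.2752890, 35.4236749, 25.2327129,
           10.8551473, 10.8011864, 12.1467999, 9.4906094)
sn163 <- c(4.0289918, 2.8699921, 2.8377330, 3.1811226, 2.5667746,
           1.6326522, 1.6269232, 1.2101360, 0.8997288, 1.2155427, 0.4847995,
           81.2482026, 77.7213035, 40.7294365, 61.1245143, 82.0395821, 66.3434549,
           23.2610273, 20.6216288, 26.6061189, 25.1866147)
group <- factor(rep(1:4, times = c(5, 6, 6, 4)))

# Hypothetical helper: Kruskal-Wallis H without tie correction
kw_by_hand <- function(x, g) {
  n <- length(x)
  r <- rank(x)                       # only the ranks enter the statistic
  rank_sums <- tapply(r, g, sum)     # sum of ranks per group
  sizes <- tapply(r, g, length)      # group sizes
  h <- 12 / (n * (n + 1)) * sum(rank_sums^2 / sizes) - 3 * (n + 1)
  p <- pchisq(h, df = nlevels(g) - 1, lower.tail = FALSE)
  c(statistic = h, p.value = p)
}

res105 <- kw_by_hand(sn105, group)
res163 <- kw_by_hand(sn163, group)
print(res105)
print(res163)

# Rank sums per group: identical for both datasets
print(tapply(rank(sn105), group, sum))
print(tapply(rank(sn163), group, sum))
```

Because the statistic depends on the data only through the per-group rank sums, any two datasets whose groups occupy the same blocks of ranks must give the same chi-squared value and p-value, whatever the raw numbers are.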

Hi @candidaorelmex,
It's the data, not the coding.
Kruskal-Wallis is a rank-based test, so maybe not such a surprise:

rank(data105$Signaltonoise)
 [1] 11  7 10  9  8  5  4  2  3  6  1 19 21 16 18 20 17 14 13 15 12
rank(data163$Signaltonoise)
 [1] 11  9  8 10  7  6  5  3  2  4  1 20 19 16 17 21 18 13 12 15 14

boxplot(Signaltonoise ~ MatrixSolution, data=data105)
boxplot(Signaltonoise ~ MatrixSolution, data=data163)

Your homework is to find the probability of this occurring given 21 samples in 4 groups!!
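
One possible starting point for that homework, as a rough sketch under a strong assumption: if the group labels were completely unrelated to the values (the null hypothesis) and there were no ties, every way of splitting ranks 1 to 21 into groups of sizes 5, 6, 6 and 4 would be equally likely. The variable names below are just for illustration.

```r
sizes <- c(5, 6, 6, 4)

# Items still unassigned before each group is filled: 21, 16, 10, 4
remaining <- sum(sizes) - c(0, cumsum(sizes)[-length(sizes)])

# Multinomial coefficient: number of distinct splits of the 21 ranks
n_splits <- prod(choose(remaining, sizes))
print(n_splits)

# Chance that a second, unrelated dataset lands on one specific split
p_same_split <- 1 / n_splits
print(p_same_split)
```

That works out to about 3.4e10 possible splits, so roughly a 3e-11 chance of hitting one particular split. Note this is only a lower bound for seeing the same test statistic, since different splits can produce the same rank sums, and it says nothing about the case where the groups really do differ.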

Those are helpful inputs, thank you so much!
Now, under the assumption that the category a data point is in is highly relevant for the respective numeric value, does this make it less unusual?

For example: we have four nations (A, B, C, and D), in each nation the range of annual salary is well defined. These ranges are:
30 - 300 dollars/y for citizens from A
400 - 4000 dollars/y for citizens from B
0.1 - 1 dollars/y for citizens from C and
2 - 20 dollars/y for citizens from D.

If I now took 5 random salary samples from A, 6 from B, 6 from C, and 4 from D (the same sampling I used in my real experiment), I would get a rank vector analogous to the one for my datasets. Ranking from the highest salary (rank 1) to the lowest (rank 21):
all salaries from C (my 3rd category) would rank lowest, from 16 - 21
all salaries from D (my 4th category) would rank second lowest, from 12 - 15
all salaries from A (my 1st category) would rank second highest, from 7 - 11
all salaries from B (my 2nd category) would rank highest, from 1 - 6
Correct, or am I mistaken?

@candidaorelmex,
Thanks, that sounds logical.
If you really want to test it, then you could write some simulation code.
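
A minimal sketch of such a simulation for the four-nation salary example, assuming (purely for illustration) that salaries are uniformly distributed within each nation's range. The seed, the number of replicates and the function name draw_salaries are arbitrary choices.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Same sampling scheme as in the example: 5 from A, 6 from B, 6 from C, 4 from D
group <- factor(rep(c("A", "B", "C", "D"), times = c(5, 6, 6, 4)))

# Hypothetical helper: one random sample of 21 salaries, in the order of `group`
draw_salaries <- function() {
  c(runif(5, 30, 300),    # nation A
    runif(6, 400, 4000),  # nation B
    runif(6, 0.1, 1),     # nation C
    runif(4, 2, 20))      # nation D
}

# Kruskal-Wallis statistic for 1000 independent samples
stats <- replicate(1000, unname(kruskal.test(draw_salaries(), group)$statistic))

# How many different values of the statistic occurred?
n_distinct <- length(unique(round(stats, 8)))
print(n_distinct)
print(range(stats))
```

Since the salary ranges do not overlap, every sample puts each nation in the same block of ranks, so the statistic should come out identical in every replicate. With overlapping ranges the statistic would start to vary from sample to sample.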
