Mahalanobis distances between groups

There are two problems. The first is that the following code gives a "non-numeric argument to a mathematical function" error in the calculation of pchisq. The second is that I want to find the Mahalanobis distance between males and females and test whether it is significantly different from zero.

dff1 = data.frame(score = c(91, 93, 72, 87, 86, 73, 68, 87, 78, 99, 95, 76, 84, 96, 76, 80, 83, 84, 73, 74),
                hours = c(16, 6, 3, 1, 2, 3, 2, 5, 2, 5, 2, 3, 4, 3, 3, 3, 4, 3, 4, 4),
                prep = c(3, 4, 0, 3, 4, 0, 1, 2, 1, 2, 3, 3, 3, 2, 2, 2, 3, 3, 2, 2),
                grade = c(70, 88, 80, 83, 88, 84, 78, 94, 90, 93, 89, 82, 95, 94, 81, 93, 93, 90, 89, 89),
                mf = c("m","f","f","m","m","m","f","m","m","f","f","f","m","f","f","m","m","f","f","f"))
dff1$maha1<-mahalanobis(dff1, colMeans(dff1), cov(dff1))
dff1$p <- pchisq(dff1$mahal, df=3.0, lower.tail=FALSE)

Because the mf column of the data is not numeric.
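A small sketch of that point: colMeans() and cov() need numeric input, so the character column has to be dropped before calling them. The three-row data frame d below is a cut-down illustration, not the full data set, and the names is_num and d_num are made up for the example.

```r
# Cut-down illustration: two numeric columns plus the character column mf
d <- data.frame(score = c(91, 93, 72),
                hours = c(16, 6, 3),
                mf    = c("m", "f", "f"))

# colMeans() stops with an error when a column is not numeric
failed <- inherits(try(colMeans(d), silent = TRUE), "try-error")
failed

# Keep only the numeric columns, whatever their position
is_num <- sapply(d, is.numeric)
d_num  <- d[, is_num, drop = FALSE]
colMeans(d_num)
```

Selecting by is.numeric rather than by position (dff1[, 1:4]) keeps working if columns are added or reordered.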

Thank you for the reply. I should have seen that problem, but didn't. I should be able to subset the data frame, but even if I remove mf entirely I still get the same error.

dff1 = data.frame(score = c(91, 93, 72, 87, 86, 73, 68, 87, 78, 99, 95, 76, 84, 96, 76, 80, 83, 84, 73, 74),
                  hours = c(16, 6, 3, 1, 2, 3, 2, 5, 2, 5, 2, 3, 4, 3, 3, 3, 4, 3, 4, 4),
                  prep = c(3, 4, 0, 3, 4, 0, 1, 2, 1, 2, 3, 3, 3, 2, 2, 2, 3, 3, 2, 2),
                  grade = c(70, 88, 80, 83, 88, 84, 78, 94, 90, 93, 89, 82, 95, 94, 81, 93, 93, 90, 89, 89))
dff1$maha1<-mahalanobis(dff1, colMeans(dff1), cov(dff1))
dff1$p <- pchisq(dff1$mahal, df=3.0, lower.tail=FALSE)

The first argument of the mahalanobis() function is a vector or matrix, not a data frame.

Never mind, the function works fine for me with the data frame, without mf that is.
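For what it's worth, mahalanobis() converts its first argument with as.matrix(), so an all-numeric data frame and the equivalent matrix give the same result. A quick check on three of the thread's columns (the names d, from_df, and from_mat are just for illustration):

```r
d <- data.frame(
  score = c(91, 93, 72, 87, 86, 73, 68, 87, 78, 99, 95, 76, 84, 96, 76, 80, 83, 84, 73, 74),
  hours = c(16, 6, 3, 1, 2, 3, 2, 5, 2, 5, 2, 3, 4, 3, 3, 3, 4, 3, 4, 4),
  grade = c(70, 88, 80, 83, 88, 84, 78, 94, 90, 93, 89, 82, 95, 94, 81, 93, 93, 90, 89, 89)
)

# Same call, once with the data frame and once with an explicit matrix
from_df  <- mahalanobis(d, colMeans(d), cov(d))
from_mat <- mahalanobis(as.matrix(d), colMeans(d), cov(d))

all.equal(from_df, from_mat)
```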

This is just a typo: the column is created as maha1 (ending in the digit 1) but then read back as mahal (ending in the letter l).
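A short sketch of why the typo produces that particular error: asking a data frame for a column that does not exist returns NULL without complaint, and pchisq() then fails on the NULL. The two-row data frame d is made up for the illustration.

```r
# Hypothetical two-row data frame with the column spelled with the digit 1
d <- data.frame(maha1 = c(1.2, 3.4))

# The misspelled name (letter l) matches no column, so `$` quietly returns NULL
is.null(d$mahal)

# pchisq() on NULL is what raises the error; capture the message instead of stopping
msg <- tryCatch(pchisq(d$mahal, df = 3, lower.tail = FALSE),
                error = function(e) conditionMessage(e))
msg

# With the correct name the call works
pchisq(d$maha1, df = 3, lower.tail = FALSE)
```

Using d[["mahal"]] instead of d$mahal would also return NULL here, so checking names(d) after creating a column is the simplest safeguard.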


Thank you @nirgrahamuk. With that problem solved I can make the original code work with mf. I checked it with and without mf, and I get the same answer for dff1$p. So any thoughts on how to calculate the Mahalanobis distance between males and females?

dff1 = data.frame(score = c(91, 93, 72, 87, 86, 73, 68, 87, 78, 99, 95, 76, 84, 96, 76, 80, 83, 84, 73, 74),
                hours = c(16, 6, 3, 1, 2, 3, 2, 5, 2, 5, 2, 3, 4, 3, 3, 3, 4, 3, 4, 4),
                prep = c(3, 4, 0, 3, 4, 0, 1, 2, 1, 2, 3, 3, 3, 2, 2, 2, 3, 3, 2, 2),
                grade = c(70, 88, 80, 83, 88, 84, 78, 94, 90, 93, 89, 82, 95, 94, 81, 93, 93, 90, 89, 89),
                mf = c("m","f","f","m","m","m","f","m","m","f","f","f","m","f","f","m","m","f","f","f"))
dff1$maha1<-mahalanobis(dff1[,1:4], colMeans(dff1[,1:4]), cov(dff1[,1:4]))
dff1$p <- pchisq(dff1$maha1, df=3.0, lower.tail=FALSE)  # Mahalanobis distance is distributed chi-square with k-1 degrees of freedom, where k is the number of variables.
dff1$p
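A note on the degrees of freedom: the reference distribution usually quoted for the squared Mahalanobis distance of k jointly normal variables is chi-square with k degrees of freedom (k = 4 here) rather than k - 1, and it is only approximate when the mean and covariance are estimated from the same data. A minimal sketch comparing the two choices on the thread's data, under that assumption (the names X, d2, p_k, and p_km1 are illustrative):

```r
X <- data.frame(
  score = c(91, 93, 72, 87, 86, 73, 68, 87, 78, 99, 95, 76, 84, 96, 76, 80, 83, 84, 73, 74),
  hours = c(16, 6, 3, 1, 2, 3, 2, 5, 2, 5, 2, 3, 4, 3, 3, 3, 4, 3, 4, 4),
  prep  = c(3, 4, 0, 3, 4, 0, 1, 2, 1, 2, 3, 3, 3, 2, 2, 2, 3, 3, 2, 2),
  grade = c(70, 88, 80, 83, 88, 84, 78, 94, 90, 93, 89, 82, 95, 94, 81, 93, 93, 90, 89, 89)
)

# mahalanobis() returns SQUARED distances of each row from the centroid
d2 <- mahalanobis(X, colMeans(X), cov(X))
k  <- ncol(X)

p_k   <- pchisq(d2, df = k,     lower.tail = FALSE)  # df = number of variables
p_km1 <- pchisq(d2, df = k - 1, lower.tail = FALSE)  # df used in the post above

round(cbind(d2, p_k, p_km1), 3)
```

The smaller df always gives the smaller p-value for the same distance, so df = k - 1 flags more rows as unusual than df = k does.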

Is that a single thing? Wouldn't there be distances between every male individual and every female individual? I don't think your question is clear. I don't know what "distance between males and females" means. Perhaps you are working with assumptions I don't have.

It is a single thing. In general terms I have several populations (I used male-female, but it could be types of fish, diamonds, cars, etc...) and for each population I have measured several traits. Given the variability within each population how different are the populations from each other? I am less clear on what happens next. I could take each observation from two populations and calculate D2, or I could calculate D2 between each observation in one population and the centroid of the other population. I am also not sure if I am comparing all the individual distances within a population to individual distances between populations, or just using the between population differences. The distance from A to B is probably not equal to the distance from B to A, but I am not clear on the next step.
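One common way to get a single number for two known groups is the squared Mahalanobis distance between the two group centroids, scaled by the pooled within-group covariance. With a pooled covariance the result is the same in both directions. A minimal sketch on the thread's data, assuming that definition is the one wanted (the names vars, S_pooled, and D2 are illustrative):

```r
dff1 <- data.frame(
  score = c(91, 93, 72, 87, 86, 73, 68, 87, 78, 99, 95, 76, 84, 96, 76, 80, 83, 84, 73, 74),
  hours = c(16, 6, 3, 1, 2, 3, 2, 5, 2, 5, 2, 3, 4, 3, 3, 3, 4, 3, 4, 4),
  prep  = c(3, 4, 0, 3, 4, 0, 1, 2, 1, 2, 3, 3, 3, 2, 2, 2, 3, 3, 2, 2),
  grade = c(70, 88, 80, 83, 88, 84, 78, 94, 90, 93, 89, 82, 95, 94, 81, 93, 93, 90, 89, 89),
  mf    = c("m","f","f","m","m","m","f","m","m","f","f","f","m","f","f","m","m","f","f","f")
)
vars <- c("score", "hours", "prep", "grade")

m <- dff1[dff1$mf == "m", vars]
f <- dff1[dff1$mf == "f", vars]
n_m <- nrow(m)
n_f <- nrow(f)

# Pooled within-group covariance: weighted average of the two group covariances
S_pooled <- ((n_m - 1) * cov(m) + (n_f - 1) * cov(f)) / (n_m + n_f - 2)

# Squared distance between the male and female centroids
D2 <- mahalanobis(colMeans(m), colMeans(f), S_pooled)
D2        # squared distance
sqrt(D2)  # distance in pooled standard-deviation units
```

If each group's own covariance were used instead of the pooled one, the distance from A to B and from B to A would generally differ, which is the asymmetry mentioned above.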

How do you intend to summarise the distances, given that there are several of them? You've shown that when you do it for m and f together the result is a sequence of numbers. If you separated out m and f and did them separately you would have two sequences. What math do you want to use?

If it helps, SAS uses an F-test to estimate the probability that the distance between two groups is zero.
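I can't confirm exactly what SAS computes, but the classical test that a two-group Mahalanobis distance is zero is Hotelling's two-sample T-squared, which converts to an F statistic. A sketch on the thread's data, assuming multivariate normality and equal covariance matrices in the two groups (the names D2, T2, F_stat, and p_value are illustrative):

```r
dff1 <- data.frame(
  score = c(91, 93, 72, 87, 86, 73, 68, 87, 78, 99, 95, 76, 84, 96, 76, 80, 83, 84, 73, 74),
  hours = c(16, 6, 3, 1, 2, 3, 2, 5, 2, 5, 2, 3, 4, 3, 3, 3, 4, 3, 4, 4),
  prep  = c(3, 4, 0, 3, 4, 0, 1, 2, 1, 2, 3, 3, 3, 2, 2, 2, 3, 3, 2, 2),
  grade = c(70, 88, 80, 83, 88, 84, 78, 94, 90, 93, 89, 82, 95, 94, 81, 93, 93, 90, 89, 89),
  mf    = c("m","f","f","m","m","m","f","m","m","f","f","f","m","f","f","m","m","f","f","f")
)
vars <- c("score", "hours", "prep", "grade")
m <- dff1[dff1$mf == "m", vars]
f <- dff1[dff1$mf == "f", vars]
n1 <- nrow(m); n2 <- nrow(f); p <- length(vars)

# Squared centroid distance with the pooled within-group covariance
S_pooled <- ((n1 - 1) * cov(m) + (n2 - 1) * cov(f)) / (n1 + n2 - 2)
D2 <- mahalanobis(colMeans(m), colMeans(f), S_pooled)

# Hotelling's T^2 and its exact F transformation
T2     <- (n1 * n2 / (n1 + n2)) * D2
df1    <- p
df2    <- n1 + n2 - p - 1
F_stat <- df2 / (p * (n1 + n2 - 2)) * T2
p_value <- pf(F_stat, df1, df2, lower.tail = FALSE)

c(D2 = D2, T2 = T2, F = F_stat, df1 = df1, df2 = df2, p = p_value)
```

A small p-value would suggest the two centroids differ, given the within-group variability; with only 20 observations the test has limited power.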

I guess that you want to do ANOVA, but I agree with Yarnabrina. We can use R functions to calculate things, but that doesn't mean we should. Without the context we are sort of hacking around, and it feels uncomfortable pointing you in any one direction without clarity on the maths/stats. If you need help clarifying the math, that will require context.

The context is sort of like clustering. However in clustering one does not know the groups and one wants to combine individuals to define groups (or split groups to define individuals). I know the groups. So I want to discriminate between groups. As part of that I would like to know which group is least like the others. Something like:
dff1.mlm <- lm(cbind(score, hours, prep, grade) ~ mf, data=dff1)
if(!require(candisc)){install.packages("candisc")}
library(candisc)
dff1.can<- candisc(dff1.mlm)

but the next step results in an error

mahalanobis(dff1.can)
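The error is expected: mahalanobis() takes three arguments (x, center, cov) and has no method for a candisc object. A base-R sketch of the "which group is least like the others" idea, using pairwise squared distances between group centroids with a pooled within-group covariance. The built-in iris data is swapped in only because it has three known groups; the names centroids, S_pooled, and D2 are illustrative.

```r
X   <- as.matrix(iris[, 1:4])
grp <- iris$Species
lev <- levels(grp)

# One row of means per group (groups x variables)
centroids <- t(sapply(split(iris[, 1:4], grp), colMeans))

# Pooled within-group covariance from deviations around each group's own centroid
within   <- X - centroids[as.character(grp), ]
S_pooled <- crossprod(within) / (nrow(X) - length(lev))

# Squared Mahalanobis distance between every pair of centroids
D2 <- matrix(0, length(lev), length(lev), dimnames = list(lev, lev))
for (i in seq_along(lev)) {
  for (j in seq_along(lev)) {
    D2[i, j] <- mahalanobis(centroids[i, ], centroids[j, ], S_pooled)
  }
}
round(D2, 2)

# Average distance from each group to the others; the largest is the "odd one out"
avg_to_others <- rowSums(D2) / (length(lev) - 1)
names(which.max(avg_to_others))
```

The same loop works for any number of groups, as long as the pooled covariance is a reasonable summary of the within-group spread.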

Context of actual problem. I have weather data for 5 years. Measurements are averaged to 1 hour intervals: air temperature, soil temperature, humidity, solar radiation, rainfall, wind speed. One year was unusual in that fruits split before harvest. The known causes for fruit split include irrigation, nutrition, and weather (or some combination). It is possible that an event shortly after bloom resulted in fruit split 7 months later, but the critical event might also be much closer to when symptoms appear. So I was trying to look at the difference between years within each month to see if the one year was unusual in any one month. This year we had almost no fruit split, but we expect that fruit split will be a problem again in some future year. The 2019 season was the bad one.
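One way the "is one year unusual within a given month" question could be framed: within each month, measure how far each year's mean weather vector sits from that month's overall mean, scaled by the pooled within-year covariance. The sketch below runs on synthetic random numbers purely to show the mechanics; the column names (air_temp, humidity, wind_speed), the helper year_d2, and the data are all made up, not the FAWN data. Hourly weather is also strongly autocorrelated, so any p-values built on these distances should be treated with caution.

```r
set.seed(42)

# Synthetic stand-in: 48 "hourly" rows per month, 3 months, 5 years
wx <- expand.grid(hour = 1:48, month = 1:3, year = 2016:2020)
wx$air_temp   <- rnorm(nrow(wx), mean = 22, sd = 4)
wx$humidity   <- rnorm(nrow(wx), mean = 75, sd = 10)
wx$wind_speed <- rnorm(nrow(wx), mean = 3,  sd = 1)
vars <- c("air_temp", "humidity", "wind_speed")

# Hypothetical helper: squared distance of each year's centroid from the
# overall centroid of the rows passed in (one month's worth of data)
year_d2 <- function(d, vars) {
  X  <- as.matrix(d[, vars])
  yr <- factor(d$year)
  centroids <- t(sapply(split(d[, vars], yr), colMeans))
  within    <- X - centroids[as.character(yr), , drop = FALSE]
  S_pooled  <- crossprod(within) / (nrow(X) - nlevels(yr))
  mahalanobis(centroids, colMeans(X), S_pooled)
}

# Rows are years, columns are months
d2_by_month <- sapply(split(wx, wx$month), year_d2, vars = vars)
round(d2_by_month, 3)
```

With real data, a year that stands out in one month would show up as a large value in that month's column relative to the other years.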

Is the data public or private? If public, we could maybe have a mini-Kaggle competition on the board :laughing:

The data is public. It is available at https://fawn.ifas.ufl.edu/
The problem is that I am not able to get R to download more than a little bit of data. So I downloaded it manually, but the file is now rather large. Hourly means are available through the report generator (under Data Access tab), and 15-minute measures available through "FTP: yearly csv data." The location code is 330.
"mini-kaggle competition on the board": I don't understand what this means. I have not considered NN approaches because my limited experience with them involves classifying objects/images. I suppose one could think of "accuracy" as a measure of distance?
