Cluster analysis: best practices and a few how-to questions

Dear community,

I am a paleoecologist working with a data set of 19 sediment samples containing 245 species of beetles in total. Only a few species appear in each sample, and many species occur in only one of the samples.

Now I want to see a) how similar the samples are to each other with regard to the beetle fossils I found in them, and b) how each identified species relates to all other species with regard to their occurrence in the 19 samples.

I've been trying out Ward hierarchical clustering. Example code:

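d <- dist(mydata, method = "euclidean")  # distance matrix between the rows of mydata
# "ward" has been renamed "ward.D"; "ward.D2" applies Ward's criterion to unsquared distances
fit <- hclust(d, method = "ward.D2")
plot(fit)  # display dendrogram
groups <- cutree(fit, k = 5)  # cut tree into 5 clusters
# draw red borders around the 5 clusters on the dendrogram
rect.hclust(fit, k = 5, border = "red")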

Question 1: Would you recommend any other method, or is this one as good as any other (also considering how difficult it is to get the code and plots right)?
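As one point of comparison for Question 1, here is a minimal sketch of an alternative that is sometimes used for sparse occurrence data: a Jaccard distance on presence/absence combined with average linkage. The `counts` matrix is made up for illustration, and whether this suits the real beetle data is an assumption that would need checking.

```r
# Made-up sample-by-species counts (rows = samples, columns = species).
counts <- matrix(
  c(3, 0, 1, 0, 2,
    0, 2, 0, 0, 2,
    1, 0, 0, 4, 0,
    0, 3, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:4), paste0("sp_", LETTERS[1:5]))
)

# method = "binary" treats non-zero entries as "present" and returns, for
# each pair of samples, the share of species found in only one of the two
# among the species found in at least one of them (the Jaccard distance).
d_jac <- dist(counts, method = "binary")

# Average linkage (UPGMA) as one alternative to Ward for this distance.
fit_jac <- hclust(d_jac, method = "average")
plot(fit_jac)
```

Base R's `dist()` and `hclust()` are enough for this, so the code stays about as simple as the Ward version.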

My original data set was imported from an Excel sheet: the first column is chr (species names), the other columns are num, and the sample numbers are in the first row. I used transpose to be able to compare the samples to each other as well. (I call the two versions Site_species and Site_samples, where Site_species is the original data frame.)

Question 2: The transposition didn't seem to work properly. I could see the table, but there was no little blue arrow to show more information, so something seems wrong or missing, and I was not able to make a dendrogram for the transposed data frame. Is there anything you could advise just from this description?
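One possible cause, assumed here because the actual data was not shown, is that `t()` applied to a data frame with a character column returns a character matrix, which `dist()` cannot use sensibly. A minimal sketch of a workaround with made-up data (the column name `species` and all values are hypothetical):

```r
# Toy stand-in for the real data: species names in the first column,
# one numeric column per sample (all names and counts are made up).
Site_species <- data.frame(
  species = c("sp_A", "sp_B", "sp_C", "sp_D"),
  S1 = c(3, 0, 1, 0),
  S2 = c(0, 2, 0, 0),
  S3 = c(1, 0, 0, 4),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns and carry the names over as row names.
species_mat <- as.matrix(Site_species[, -1])
rownames(species_mat) <- Site_species$species

# Transposing a numeric matrix keeps it numeric; samples are now rows.
Site_samples <- t(species_mat)

d_samples <- dist(Site_samples, method = "euclidean")
fit_samples <- hclust(d_samples, method = "ward.D2")
plot(fit_samples)  # leaves are labelled with the sample names
```

The result is a matrix rather than a data frame, which may also explain a different appearance in the data viewer.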

The dendrogram for the original data frame Site_species gives me 245 data points, but it labels them only by row number. I would like to find out how these species are related and to have a bit more information available directly in the diagram, so I was wondering:

Question 3: Is there any way to have the name belonging to each row show up in the dendrogram? If not, is there an easy way to look up which row number belongs to which name?
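A minimal sketch of both options for Question 3, using made-up data (the column name `species` is an assumption): `plot()` for an `hclust` object accepts a `labels` argument, and `fit$order` gives the row numbers in left-to-right dendrogram order.

```r
# Made-up example data; the column name `species` is an assumption.
Site_species <- data.frame(
  species = c("sp_A", "sp_B", "sp_C", "sp_D", "sp_E"),
  S1 = c(3, 0, 1, 0, 2),
  S2 = c(0, 2, 0, 0, 2),
  S3 = c(1, 0, 0, 4, 0),
  stringsAsFactors = FALSE
)

# Cluster on the numeric columns only; rows keep their default numbering.
d <- dist(Site_species[, -1], method = "euclidean")
fit <- hclust(d, method = "ward.D2")

# Option 1: pass the names directly when plotting.
plot(fit, labels = Site_species$species, cex = 0.6)

# Option 2: a lookup table from row number to name.
lookup <- data.frame(row = seq_len(nrow(Site_species)),
                     species = Site_species$species)
lookup[fit$order, ]  # names in left-to-right dendrogram order
```

With 245 leaves the labels will be crowded, so a small `cex` or a larger plotting device is likely needed.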

Question 4: Is there a way to give an extra attribute to each species (like Tree-hugger, Meadow-dweller, Wood-muncher, Dung-digger), which would then also be obvious in the dendrogram? Different colors for different rows maybe?
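One way this could be approached in base R is to convert the tree to a dendrogram and set a label colour per leaf with `dendrapply()`. The sketch below uses made-up abundances; the `guild` assignments and the colour choices are purely illustrative assumptions.

```r
# Made-up abundances with species names as row names.
abund <- matrix(
  c(3, 0, 1,
    0, 2, 0,
    1, 0, 0,
    0, 0, 4,
    2, 2, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("sp_", LETTERS[1:5]), c("S1", "S2", "S3"))
)

# Hypothetical attribute per species, and one colour per attribute value.
guild <- c(sp_A = "Tree-hugger", sp_B = "Meadow-dweller",
           sp_C = "Tree-hugger", sp_D = "Dung-digger",
           sp_E = "Wood-muncher")
guild_cols <- c("Tree-hugger" = "forestgreen", "Meadow-dweller" = "goldenrod",
                "Wood-muncher" = "saddlebrown", "Dung-digger" = "black")

fit <- hclust(dist(abund), method = "ward.D2")
dend <- as.dendrogram(fit)

# Colour each leaf label by the guild of the species it represents.
colour_leaf <- function(node) {
  if (is.leaf(node)) {
    lab <- attr(node, "label")
    attr(node, "nodePar") <- list(lab.col = guild_cols[[guild[[lab]]]],
                                  pch = NA)  # pch = NA hides the leaf marker
  }
  node
}
dend <- dendrapply(dend, colour_leaf)

plot(dend)
legend("topright", legend = names(guild_cols),
       text.col = guild_cols, bty = "n")
```

The named vector `guild` could equally come from an extra column in the original spreadsheet.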

Thank you for any kind of help. I hope that my examples and questions are specific enough.

Hi Nick, welcome!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide to see how to create one:
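For this particular question, a reprex could look roughly like the sketch below: a small made-up data frame with the layout described in the post, followed by the smallest piece of code that shows the problem. All names and values are invented for illustration.

```r
# Made-up data mimicking the described layout: first column = species names
# (character), remaining columns = counts per sample (numeric).
Site_species <- data.frame(
  species = c("sp_A", "sp_B", "sp_C", "sp_D"),
  S1 = c(3, 0, 1, 0),
  S2 = c(0, 2, 0, 0),
  S3 = c(1, 0, 0, 4),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates an object, so a small slice of the
# real data can be pasted into a post instead of attaching the Excel file.
dput(head(Site_species, 3))

# The step that misbehaves, reduced to the minimum needed to show it.
Site_samples <- t(Site_species)
str(Site_samples)  # inspect what the transposed object actually is
```

Anyone can paste such a script into a fresh R session and see the same behaviour, which makes the question far easier to answer.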

