External Cluster Validation - Categorical Data

Dear All,

I've recently been attempting to evaluate output from k-modes (a cluster label), relative to a so-called True cluster label (labelled 'class' below).

In other words, I've been attempting to externally validate the clustering output. However, when I tried the external validation measures from the 'fpc' package, I was unsuccessful (error message posted below the script).

I've attached my code for the mushroom dataset. I would appreciate it if anyone could show me how to successfully execute these external validation measures in the context of categorical data.

Any help appreciated.


# LIBRARIES 

install.packages('klaR')
install.packages('fpc')

library(klaR)
library(fpc)

#MUSHROOM DATA

mushrooms <- read.csv(file = "https://raw.githubusercontent.com/miachen410/Mushrooms/master/mushrooms.csv", header = FALSE)

names(mushrooms) <- c("edibility", "cap-shape", "cap-surface", "cap-color", 
                      "bruises", "odor", "gill-attachment", "gill-spacing", 
                      "gill-size", "gill-color", "stalk-shape", "stalk-root",
                      "stalk-surface-above-ring", "stalk-surface-below-ring", 
                      "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", 
                      "veil-color", "ring-number", "ring-type", "spore-print-color", 
                      "population", "habitat")

names(mushrooms)[names(mushrooms)=="edibility"] <- "class"

indexes <- apply(mushrooms, 2, function(x) any(is.na(x) | is.infinite(x)))

colnames(mushrooms)[indexes]
table(mushrooms$class)
str(mushrooms)

#REMOVING CLASS VARIABLE

mushroom.df <- subset(mushrooms, select = -c(class))

#KMODES ANALYSIS

result.kmode <- kmodes(mushroom.df, 2, iter.max = 50, weighted = FALSE)

#EXTERNAL VALIDATION ATTEMPT


class <- as.numeric(mushroom$class)
clust_stats <- cluster.stats(d = dist(mushroom.df), 
                             class, result.kmode$cluster)

#ERROR TERM 

Error in silhouette.default(clustering, dmatrix = dmat) : 
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In dist(mushroom.df) : NAs introduced by coercion

Using a reprex (see the FAQ) helps catch errors such as this one: the object is called mushrooms, so mushroom$class should be mushrooms$class. With that correction:

class <- as.numeric(mushrooms$class) 
#> Warning: NAs introduced by coercion
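all(is.na(class))
#> [1] TRUE

In other words, class contains nothing but NA values, because as.numeric() cannot convert the character labels to numbers. dist() runs into the same problem on the categorical columns of mushroom.df (hence its "NAs introduced by coercion" warning), and those NAs are what silhouette() then rejects.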
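
One possible way around both coercion problems is to turn the labels into integer codes via factor() and to replace dist() with a dissimilarity that is defined for categorical data. The sketch below is only an illustration under stated assumptions: it uses Gower dissimilarity from cluster::daisy(), which is my choice and not something the original script used, and it runs on a small made-up data frame (toy, truth and clust are hypothetical stand-ins for mushroom.df, mushrooms$class and result.kmode$cluster).

```r
library(cluster)  # daisy(): dissimilarities for categorical / mixed data
library(fpc)      # cluster.stats(): internal and external validation measures

# Hypothetical toy data standing in for mushroom.df; columns must be factors
toy <- data.frame(
  odor  = c("a", "a", "a", "n", "n", "n"),
  color = c("w", "y", "b", "w", "y", "b"),
  ring  = c("p", "p", "p", "e", "e", "e"),
  stringsAsFactors = TRUE
)

# Hypothetical "true" labels and cluster assignments
truth <- c("e", "e", "e", "p", "p", "p")   # stands in for mushrooms$class
clust <- c(1L, 1L, 1L, 2L, 2L, 2L)         # stands in for result.kmode$cluster

# Character labels -> integer codes (as.numeric() on characters gives NA)
truth_int <- as.integer(factor(truth))

# Gower dissimilarity (an assumption; any categorical dissimilarity would do)
d <- daisy(toy, metric = "gower")

# External validation: compare the clustering against the reference labels
stats <- cluster.stats(d, clustering = clust, alt.clustering = truth_int)

stats$corrected.rand  # adjusted Rand index
stats$vi              # variation of information
```

To apply the same idea to the mushroom data, the columns would first need converting, for example with mushroom.df[] <- lapply(mushroom.df, factor). Note that a full dissimilarity matrix for roughly 8,000 rows is large, so this step can take a noticeable amount of memory and time.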
