Confusion matrix - how to read

Hi, I'm having trouble understanding how to read confusion matrix results when there are multiple predicted classes. I have two examples below. Can someone help me interpret either one of them? My guess is that the diagonal entries are the accuracies, but that can't be quite right, because in my first example I can't say that setosa is 33% correct.

Example 1
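library(caret)   # provides train(), trainControl() and confusionMatrix()

data(iris)
TrainData <- iris[, 1:4]     # the four numeric predictors
TrainClasses <- iris[, 5]    # the Species factor
knnFit <- train(TrainData, TrainClasses,
                method = "knn",
                preProcess = c("center", "scale"),
                tuneLength = 10,
                trControl = trainControl(method = "cv"))
confusionMatrix(knnFit)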

Example 2
library(caret)
library(dslabs)   # provides the tissue_gene_expression data set
data("tissue_gene_expression")
set.seed(1991)

x <- tissue_gene_expression$x
y <- tissue_gene_expression$y
fit2 <- train(x, y, method = "rpart", tuneGrid = data.frame(cp = seq(0, 0.1, 0.01)))
confusionMatrix(fit2)

It seems to me that the numbers in the confusion matrix are percentages of the total number of samples, one for each combination of predicted and true class. Looking at your first example, notice that the iris data are perfectly balanced across the three species.

> table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 

Note that each column of the confusion matrix sums to 33.3, matching the percentage of each species in the population, and that all of the displayed percentages correspond to whole numbers of samples (after a little rounding) when applied to the 150 samples. The confusion matrix says that 33.3% (50) of the samples are predicted to be setosa, and all of them truly are setosa. 32% (48 samples) are predicted to be versicolor and truly are versicolor, while 2% (3 samples) are predicted to be versicolor but are truly virginica.

            Reference
Prediction   setosa versicolor virginica
  setosa       33.3        0.0       0.0
  versicolor    0.0       32.0       2.0
  virginica     0.0        1.3      31.3
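
As a quick check of that reading, the percentages can be converted back to counts by scaling them by the total number of samples. This is only a minimal sketch: the matrix is typed in by hand from the output above, and `pct`, `n_samples` and `counts` are illustrative names, not anything caret returns.

```r
# Percentages copied by hand from the confusionMatrix(knnFit) output above
species <- c("setosa", "versicolor", "virginica")
pct <- matrix(c(33.3,  0.0,  0.0,
                 0.0, 32.0,  2.0,
                 0.0,  1.3, 31.3),
              nrow = 3, byrow = TRUE,
              dimnames = list(Prediction = species, Reference = species))

n_samples <- nrow(iris)   # 150 rows in the built-in iris data

# Each entry is a percentage of all samples, so scale by n / 100 and round
counts <- round(pct / 100 * n_samples)
counts

# Each Reference column should add up to the 50 samples of that species
colSums(counts)
```

If the entries really are percentages of all 150 samples, every column total comes out at 50, which is what the balanced `table(iris$Species)` output suggests.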

To add to FJCC's answer, when you apply confusionMatrix to an object of class train, it dispatches to confusionMatrix.train, whose default call is confusionMatrix(data, norm = "overall", dnn = c("Prediction", "Reference"), ...).

If you want to get the absolute values as FJCC showed in the above post, use norm = "none". Here's an excerpt from the documentation:

There are several ways to show the table entries. Using norm = "none" will show the aggregated counts of samples on each of the cells (across all resamples). For norm = "average", the average number of cell counts across resamples is computed (this can help evaluate how many holdout samples there were on average). The default is norm = "overall", which is equivalent to "average" but in percentages.
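
To make the relationship between the three options concrete, here is a rough sketch in plain arithmetic, not a call into caret. It assumes 10 resamples (10-fold cross-validation, as in the iris example) and starts from the aggregated counts for that example, typed in by hand. The names `agg`, `avg` and `overall` are illustrative, and this reflects my reading of the excerpt, not necessarily caret's internal code.

```r
species <- c("setosa", "versicolor", "virginica")

# Aggregated counts across all resamples, i.e. what norm = "none" reports
agg <- matrix(c(50,  0,  0,
                 0, 48,  3,
                 0,  2, 47),
              nrow = 3, byrow = TRUE,
              dimnames = list(Prediction = species, Reference = species))

n_resamples <- 10   # assumption: 10-fold cross-validation

# norm = "average": mean cell count per resample
avg <- agg / n_resamples

# norm = "overall": the same table expressed as percentages of all samples
overall <- avg / sum(avg) * 100

round(avg, 1)
round(overall, 1)
sum(avg)   # average number of holdout samples per resample
```

Under these assumptions the "overall" table reproduces the 33.3 / 32.0 / 2.0 / 1.3 / 31.3 figures from the default output.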

See below:

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(e1071)

data(iris)
TrainData <- iris[,1:4]
TrainClasses <- iris[,5]

knnFit <- train(TrainData, TrainClasses,
                method = "knn",
                preProcess = c("center", "scale"),
                tuneLength = 10,
                trControl = trainControl(method = "cv"))

confusionMatrix(knnFit, "none")
#> Cross-Validated (10 fold) Confusion Matrix 
#> 
#> (entries are un-normalized aggregated counts)
#>  
#>             Reference
#> Prediction   setosa versicolor virginica
#>   setosa         50          0         0
#>   versicolor      0         48         3
#>   virginica       0          2        47
#>                             
#>  Accuracy (average) : 0.9667

Created on 2019-05-05 by the reprex package (v0.2.1)
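
Coming back to the original question about the diagonal: with raw counts, the diagonal holds the correctly classified samples. Dividing it by the column totals gives the fraction of each true class that was predicted correctly, and dividing its sum by the grand total gives a pooled accuracy. A minimal sketch, with the counts typed in by hand from the output above (`cm`, `per_class` and `accuracy` are illustrative names):

```r
species <- c("setosa", "versicolor", "virginica")

# Counts copied by hand from the confusionMatrix(knnFit, "none") output
cm <- matrix(c(50,  0,  0,
                0, 48,  3,
                0,  2, 47),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))

# Fraction of each true class (Reference column) that was predicted correctly
per_class <- diag(cm) / colSums(cm)
per_class

# Pooled accuracy: correctly classified samples over all samples
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 4)
```

Note that caret labels its figure "Accuracy (average)", i.e. averaged over the folds. As far as I can tell, that is only guaranteed to match the pooled value computed here when all folds have the same size.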

Hope this helps.
