When performing multi-class classification, confusion matrices do a good job of presenting the results
while preserving all the information: the percentage of correct classifications, the percentage of misclassifications, and the
misclassification classes for each predicted class. It's when the number of classes goes beyond ~5 that these visualizations
start to become impractical. The matrices become too large to present anywhere, whether on a presentation
slide or as a figure in a manuscript. The issue is amplified further with hierarchical classification, where we
want to show inherited (mis)classifications down a tree.
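To make the quantities above concrete, here is a minimal sketch of how the per-class proportions and the inherited, parent-level confusion could be computed. The counts, class names and the two-level hierarchy are made up for illustration and are not taken from my data.

```python
import numpy as np

# Illustrative counts only: rows = target classes, columns = predicted classes.
children = ["a1", "a2", "b1", "b2"]
parent_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}  # hypothetical hierarchy
counts = np.array([
    [40,  6,  3,  1],
    [ 8, 35,  5,  2],
    [ 2,  4, 42,  2],
    [ 1,  1, 10, 38],
])

# Row-normalise: each row becomes the fraction of a target class
# assigned to every predicted class (the diagonal is per-class accuracy).
row_frac = counts / counts.sum(axis=1, keepdims=True)

# Child-to-parent membership matrix, shape (n_children, n_parents).
parents = sorted(set(parent_of.values()))
membership = np.array(
    [[int(parent_of[c] == p) for p in parents] for c in children]
)

# Collapse rows and columns onto the parents to get the inherited confusion.
parent_counts = membership.T @ counts @ membership
parent_frac = parent_counts / parent_counts.sum(axis=1, keepdims=True)

print(np.round(row_frac, 2))
print(parent_counts)
```

With more than a handful of classes, printing `row_frac` already shows the problem: the table is complete but hard to read at a glance.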
This is what a confusion matrix for a relatively large number of classes looks like:
When visualizing data, it's always a matter of balance between information and simplicity. In my case, I’m interested
in the relative proportions of misclassifications for a target class and in which classes those misclassifications
fall into. Since I’m performing hierarchical classification, I’m also interested in grouping the classes so that I can
identify misclassification classes at the upper hierarchy level at a quick glance. High accuracy values are not a
priority, so I came up with a semi-quantitative visualization which I’m calling “confused pie plots”:
Yes, customized pie charts. The inner ellipse shows the expected target class, and the outer ellipse shows the
predicted classes. Rows represent child nodes belonging to the same parent (column). It's relatively straightforward to
see where misclassifications occurred. Obviously less so when the color scale becomes limiting with a very large number
of classes. But even then, misclassifications at the parent nodes are still easy to see when a specific color scale is assigned
to each parent (I tested it with up to 35 classes and it works quite well. The results are part of a manuscript under review; I will update this post
with the figure once published).
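For readers who just want the idea, the layout can be sketched with matplotlib's nested pies. This is a simplified stand-in rather than my actual plotting code: it draws circles instead of ellipses, uses made-up fractions, and the `confused_pie` helper and the `tab10` color choice are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical classes and row-normalised confusion fractions (rows = targets).
classes = ["a1", "a2", "b1", "b2"]
parent_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
row_frac = np.array([
    [0.80, 0.12, 0.06, 0.02],
    [0.16, 0.70, 0.10, 0.04],
    [0.04, 0.08, 0.84, 0.04],
    [0.02, 0.02, 0.20, 0.76],
])
colors = plt.get_cmap("tab10")(np.arange(len(classes)))  # one color per class


def confused_pie(ax, target_idx):
    """Outer ring = predicted classes, inner disc = expected target class."""
    ax.pie(row_frac[target_idx], radius=1.0, colors=colors,
           wedgeprops=dict(width=0.35, edgecolor="white"))
    ax.pie([1.0], radius=0.65, colors=[colors[target_idx]],
           wedgeprops=dict(edgecolor="white"))
    ax.set_title(classes[target_idx])


# One column per parent, one row per child within that parent.
parents = sorted(set(parent_of.values()))
n_rows = max(sum(parent_of[c] == p for c in classes) for p in parents)
fig, axes = plt.subplots(n_rows, len(parents), squeeze=False, figsize=(6, 6))
next_row = {p: 0 for p in parents}
for idx, cls in enumerate(classes):
    p = parent_of[cls]
    confused_pie(axes[next_row[p], parents.index(p)], idx)
    next_row[p] += 1

# fig.savefig("confused_pies.png", dpi=150)  # uncomment to write the figure
```

Giving each parent its own color family (for example, shades of one hue per parent) instead of a single qualitative palette is what keeps parent-level misclassifications readable as the number of classes grows.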
Here’s the code to generate this (or fork it on GitHub). It requires a confusion matrix in CSV format as input, with target classes
as rows and predicted classes as columns. Parent class labels go in the first column and first row, and child class labels in the
second column and second row.
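As a rough guide to that input layout, here is a small sketch that parses such a file using only the standard `csv` module and NumPy. The example content and the `read_hier_confusion` helper are hypothetical, and the layout reflects one reading of the description above (two label rows on top, two label columns on the left).

```python
import csv
import io

import numpy as np

# Hypothetical file content: parent labels in row 1 / column 1,
# child labels in row 2 / column 2, counts everywhere else.
example_csv = """\
,,A,A,B,B
,,a1,a2,b1,b2
A,a1,40,6,3,1
A,a2,8,35,5,2
B,b1,2,4,42,2
B,b2,1,1,10,38
"""


def read_hier_confusion(handle):
    """Return (parent labels, child labels, count matrix) for the target rows."""
    rows = [r for r in csv.reader(handle) if r]  # skip blank lines
    col_children = rows[1][2:]
    body = rows[2:]
    row_parents = [r[0] for r in body]
    row_children = [r[1] for r in body]
    counts = np.array([[float(v) for v in r[2:]] for r in body])
    # Targets and predictions should list the same classes in the same order.
    if row_children != col_children:
        raise ValueError("row and column child labels do not match")
    return row_parents, row_children, counts


parents, children, counts = read_hier_confusion(io.StringIO(example_csv))
print(parents, children, counts.shape)
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` wrapper.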
Originally a biologist and chemist, now a PhD student at Imperial College London working on computational methods for biomedical data; analysis, visualizations, setting up data repositories, developing online tools, and mostly: fixing bugs. On my right is Kaiser - my Siberian husky and best friend.