Unsupervised Learning: Clustering

4.14. Unsupervised Learning: Clustering#

Clustering is a machine learning algorithm designed to find the natural groupings that exist in a dataset. The way clustering achieves this is by assigning data points to groups so that the similarity within each group is high and the similarity between groups is low.

A cluster is a group of samples that we say are similar.

A commonly used clustering algorithm is k-means clustering. This is a type of unsupervised learning algorithm because we don’t know the groups beforehand. The data does not come with labels that identify which group (or class) each data point belongs to. The groups are only found after the clustering is complete.

For example, consider the following dataset of flowers. None of these had labels, i.e. there are no pre-defined groups.

We could group these flowers in a number of different ways.

Example grouping

Example grouping

Example grouping

The k-means clustering algorithm is a machine learning method to automatically group data using the distance between samples to measure similarities. It will then try to group samples based on their similarities.