Agenda

What is cluster analysis
Similarity and dissimilarity measures
K-means clustering
Hierarchical clustering
Gaussian Mixture Models (GMM)
Validation and model selection
Other clustering methods

Cluster analysis

Unsupervised classification

Separating or clustering observations
Purpose: seeing structure in data (gaining understanding), dimensionality reduction, outlier detection, etc.
Intuitive but vague definition: Given an underlying set of points, partition them into a collection of clusters so that points in the same cluster are close together, while points in different clusters are far apart.
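The definition above can be made concrete with a small sketch. It compares the mean distance between points that share a cluster with the mean distance between points in different clusters. The toy points, the two labelings, the helper name `mean_within_between`, and the choice of Euclidean distance are all assumptions made for illustration, not part of the original material.

```python
import math
from itertools import combinations

def mean_within_between(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes Euclidean distance and at least one pair of each kind.
    """
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])  # Euclidean distance (Python 3.8+)
        if labels[i] == labels[j]:
            within.append(d)
        else:
            between.append(d)
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical toy data: two visually obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

good = [0, 0, 0, 1, 1, 1]  # partition that follows the two groups
bad = [0, 1, 0, 1, 0, 1]   # partition that mixes the two groups

w_good, b_good = mean_within_between(points, good)
w_bad, b_bad = mean_within_between(points, bad)

print(f"good partition: within={w_good:.2f}, between={b_good:.2f}")
print(f"bad partition:  within={w_bad:.2f}, between={b_bad:.2f}")
```

Under these assumptions, the partition that follows the two groups has a small within-cluster distance relative to its between-cluster distance, while the mixed partition does not.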

Similarity and dissimilarity measures

In what sense are points close in one cluster and far from points in another cluster? How do we measure that?

Similarity takes a large value when points are close; dissimilarity takes a large value when points are far apart. A dissimilarity therefore reflects the distance between observations. Any monotone-decreasing function can convert similarities to dissimilarities, and vice versa. Both similarity and dissimilarity measures can be subjective, for example when comparing the taste of three ice creams.
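As a minimal sketch of converting between the two kinds of measure, the example below uses Euclidean distance as the dissimilarity and the map s = exp(-d / scale) as one possible monotone-decreasing conversion. The function names, the `scale` parameter, and the sample points are assumptions chosen for illustration. Many other monotone-decreasing maps would serve equally well.

```python
import math

def euclidean(a, b):
    """Dissimilarity: Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def to_similarity(d, scale=1.0):
    """Monotone-decreasing map from a dissimilarity d >= 0 to a similarity in (0, 1]."""
    return math.exp(-d / scale)

def to_dissimilarity(s, scale=1.0):
    """Inverse map: a similarity in (0, 1] back to a dissimilarity >= 0."""
    return -scale * math.log(s)

# Hypothetical sample points.
points = {"a": (0.0, 0.0), "b": (0.0, 1.0), "c": (3.0, 4.0)}

d_ab = euclidean(points["a"], points["b"])  # a and b are close
d_ac = euclidean(points["a"], points["c"])  # a and c are far apart

s_ab = to_similarity(d_ab)
s_ac = to_similarity(d_ac)

print(f"d(a,b)={d_ab:.1f}  s(a,b)={s_ab:.3f}")
print(f"d(a,c)={d_ac:.1f}  s(a,c)={s_ac:.3f}")
```

Because the map is monotone decreasing, the ordering of pairs is reversed but otherwise preserved: the closer pair has the smaller dissimilarity and the larger similarity.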

Clustering as an optimization problem

What happens if we minimize the dissimilarity within clusters?