Agenda
- What is cluster analysis
- Similarity and dissimilarity measures
- K-means clustering
- Hierarchical clustering
- Gaussian Mixture Models (GMM)
- Validation and model selection
- Other clustering methods
Cluster analysis
Unsupervised classification
- Separating or clustering observations
- Purpose: seeing structure in data (gaining understanding), dimensionality reduction, outlier detection, etc.
- Intuitive but vague definition: Given an underlying set of points, partition them into a collection of clusters so that points in the same cluster are close together, while points in different clusters are far apart.
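The intuitive definition above can be made concrete with a small check. The following is a minimal sketch, assuming Euclidean distance and a hypothetical toy data set with hand-assigned labels (neither comes from the slides). It compares the mean distance between points in the same cluster with the mean distance between points in different clusters.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_within_between(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes at least one same-cluster pair and one different-cluster pair.
    """
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = euclidean(points[i], points[j])
        if labels[i] == labels[j]:
            within.append(d)
        else:
            between.append(d)
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = [0, 0, 0, 1, 1, 1]

within, between = mean_within_between(points, labels)
print(f"mean within-cluster distance:  {within:.3f}")
print(f"mean between-cluster distance: {between:.3f}")
```

For a labeling that matches the definition, the within-cluster mean should be clearly smaller than the between-cluster mean. How much smaller is "enough" is exactly what the definition leaves vague.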
Similarity and dissimilarity measures
In what sense are points within a cluster close to each other and far from points in other clusters?
How do we measure that?
Similarity takes a large value when points are close.
Dissimilarity takes a large value when points are far apart. It reflects the distance between observations.
Any monotone-decreasing function converts a similarity into a dissimilarity (and vice versa).
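As an illustration of the conversion, here is a minimal sketch assuming similarities that lie in (0, 1]. The two conversion functions (1 - s and -log s) and the similarity values are illustrative choices, not taken from the slides. Because both functions are monotone decreasing, the most similar pair becomes the least dissimilar one, and the ordering of pairs is preserved in reverse.

```python
import math


def one_minus(s):
    """Dissimilarity d = 1 - s, for a similarity s in [0, 1]."""
    return 1.0 - s


def neg_log(s):
    """Dissimilarity d = -log(s), for a similarity s in (0, 1]."""
    return -math.log(s)


# Hypothetical pairwise similarities between three items a, b, c.
similarities = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.1}

dis_linear = {pair: one_minus(s) for pair, s in similarities.items()}
dis_log = {pair: neg_log(s) for pair, s in similarities.items()}

# Rank pairs from most similar to least similar, and from least
# dissimilar to most dissimilar; the rankings should coincide.
by_sim = sorted(similarities, key=similarities.get, reverse=True)
by_dis_linear = sorted(dis_linear, key=dis_linear.get)
by_dis_log = sorted(dis_log, key=dis_log.get)

print(by_sim == by_dis_linear == by_dis_log)
```

The two conversions agree on the ranking of pairs but not on the numerical values, so a method that uses the magnitudes of the dissimilarities can still give different results depending on which conversion is chosen.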
Both similarity and dissimilarity measures can be subjective, for example when comparing the taste of three ice creams.
Clustering as an optimization problem

- What happens if we minimize dissimilarity?
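One way to explore this question is to write down a candidate objective and evaluate it on a few partitions. The sketch below assumes the objective is the sum of pairwise dissimilarities between points that share a cluster, with absolute difference as the dissimilarity. Both are assumptions made for illustration, as are the one-dimensional toy data.

```python
from itertools import combinations


def within_cluster_dissimilarity(points, labels):
    """Sum of |x_i - x_j| over all pairs i < j that share a cluster label."""
    total = 0.0
    for i, j in combinations(range(len(points)), 2):
        if labels[i] == labels[j]:
            total += abs(points[i] - points[j])
    return total


# Hypothetical one-dimensional data with two visible groups.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]

partitions = {
    "one cluster": [0, 0, 0, 0, 0, 0],
    "two clusters": [0, 0, 0, 1, 1, 1],
    "singletons": [0, 1, 2, 3, 4, 5],
}

scores = {name: within_cluster_dissimilarity(points, labels)
          for name, labels in partitions.items()}

for name, score in scores.items():
    print(f"{name}: {score}")
```

Under these assumptions the objective is 98.0 for one cluster, 8.0 for the two natural groups, and 0.0 when every point is its own cluster. Minimizing within-cluster dissimilarity alone therefore favors the trivial all-singletons partition, which suggests the problem needs an extra constraint, such as a fixed number of clusters.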