"The marketing team wants to know not just WHAT groups exist, but HOW they relate to each other. Are 'budget shoppers' a completely separate segment, or are they a sub-group of 'value seekers'? Which premium customers are closest to standard-tier? We need a hierarchy, not just a partition."
— CMO of a $2B retail company during a strategy meeting
K-Means gave us k clusters. Great. But it can't tell us:
• How the clusters relate to each other
• Whether one segment is a sub-group of another
• Which clusters sit closest together
Hierarchical clustering builds a tree of relationships, called a dendrogram, that answers all of these at once. You can cut it at any level to get 2, 3, 5, or 10 clusters without rerunning anything.
Agglomerative (bottom-up) clustering starts with every point as its own cluster, then repeatedly merges the two closest clusters until only one remains. Watch it happen:
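A minimal sketch of this merge process using SciPy's `linkage` function (the six points here are made-up toy data): each row of the returned matrix records one merge, from singletons up to a single root cluster.

```python
# Sketch of agglomerative clustering with SciPy (toy data).
import numpy as np
from scipy.cluster.hierarchy import linkage

# Six 2-D points forming two loose groups.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 5.1], [5.1, 4.8]])

# Each row of Z records one merge: [cluster_i, cluster_j, distance, new_size].
# n points always produce exactly n-1 merges.
Z = linkage(X, method="ward")
print(Z.shape)  # (5, 4)
```

The last row of `Z` is the final merge, whose `new_size` column equals the total number of points.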
The dendrogram records every merge: which clusters merged and at what distance. The height of each horizontal bar = the distance between clusters when they merged. Cut the tree at any height to get your clusters.
Drag the slider or click on the dendrogram to set the cut height. Each color = one cluster.
Taller bars = clusters that were far apart when merged. Look for large gaps.
Cut at height h → the number of vertical lines crossing the cut line = the number of clusters.
Clusters at k=4 are sub-divisions of clusters at k=2. No re-running needed.
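This nesting property can be checked directly: build the tree once with `linkage`, then cut it at two levels with `fcluster`. The four blob centers below are illustrative toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Four tight blobs; blobs near x=0 and x=10 form two broader groups.
centers = np.array([[0, 0], [0, 3], [10, 0], [10, 3]])
X = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

Z = linkage(X, method="ward")  # build the tree ONCE

# Cut the same tree at two levels -- no re-clustering.
labels_k2 = fcluster(Z, t=2, criterion="maxclust")
labels_k4 = fcluster(Z, t=4, criterion="maxclust")

# Every k=4 cluster sits entirely inside one k=2 cluster (nesting).
for c in np.unique(labels_k4):
    assert len(np.unique(labels_k2[labels_k4 == c])) == 1
```

Because both label sets come from cuts of the same tree, the finer partition is always a refinement of the coarser one.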
Once we have clusters (not just points), how do we measure distance between clusters? Different answers give dramatically different trees.
Point clusters
Resulting dendrogram
| Linkage | Distance = ? | Pros | Cons | Best For |
|---|---|---|---|---|
| Single | Minimum distance between any two points across clusters | Handles non-spherical shapes | Chaining effect: long, stringy clusters | Detecting elongated shapes |
| Complete | Maximum distance between any two points | Compact, tight clusters | Sensitive to outliers | Equal-size, compact clusters |
| Average | Average of all pairwise distances | Balanced, robust | Less intuitive | General-purpose use |
| Ward | Increase in total within-cluster variance | Minimizes variance (similar to the K-Means criterion) | Tends toward equal-size clusters | Most business applications |
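A quick way to see the linkage methods disagree is to cluster the same data with each one. The dataset below is an invented example: two blobs joined by a thin bridge of points, the kind of shape where single linkage tends to chain.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Two blobs plus a thin "bridge" of points between them.
blob_a = rng.normal([0, 0], 0.3, size=(30, 2))
blob_b = rng.normal([6, 0], 0.3, size=(30, 2))
bridge = np.column_stack([np.linspace(1, 5, 9), np.zeros(9)])
X = np.vstack([blob_a, blob_b, bridge])

# Same data, same k=2 cut -- only the linkage rule changes.
results = {}
for method in ["single", "complete", "average", "ward"]:
    labels = fcluster(linkage(X, method=method), t=2, criterion="maxclust")
    results[method] = np.bincount(labels)[1:]  # cluster sizes
    print(method, sorted(results[method].tolist(), reverse=True))
```

The cluster sizes typically differ across methods: single linkage lets the bridge chain the blobs together, while complete and Ward prefer compact, balanced splits.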
Before we can merge clusters, we need to define distance between points. The choice can change your results dramatically, especially with high-dimensional or text data.
Euclidean: straight-line distance. √(Σ(aᵢ-bᵢ)²)
Use for: spatial, continuous numeric data
Manhattan: grid-path distance. Σ|aᵢ-bᵢ|
Use for: data with outliers (more robust than Euclidean), city-block movement
Cosine: angle between vectors. 1 - (a·b)/(|a||b|)
Use for: text, high-dimensional data, NLP
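The three formulas above can be evaluated directly with `scipy.spatial.distance` (the vectors here are made-up examples chosen so the answers are easy to verify by hand):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])   # same direction as a, twice the length
c = np.array([-4.0, 3.0])  # orthogonal to a

print(euclidean(a, b))  # 5.0 : sqrt((3-6)^2 + (4-8)^2)
print(cityblock(a, b))  # 7.0 : |3-6| + |4-8|
print(cosine(a, b))     # 0.0 : same direction -> zero angular distance
print(cosine(a, c))     # 1.0 : orthogonal vectors
```

Note that cosine distance ignores vector length entirely: `a` and `b` point the same way, so their cosine distance is 0 even though they are 5 units apart in Euclidean terms. That is exactly why it suits text vectors, where document length shouldn't matter.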
With hierarchical clustering, you can choose k after building the tree. Two methods help you pick the optimal cut:
The red arrow points to the largest gap; that's where you should cut!
Higher silhouette = better-separated clusters. Pick the k with the highest score.
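Both methods can be computed from a single tree. A sketch on invented data with three well-separated blobs: the gap method looks for the largest jump in merge distance (column 2 of the linkage matrix), and the silhouette method scores each candidate cut.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
centers = np.array([[0, 0], [8, 0], [4, 7]])  # three well-separated blobs
X = np.vstack([c + 0.5 * rng.standard_normal((40, 2)) for c in centers])

Z = linkage(X, method="ward")

# Gap method: find the largest jump between consecutive merge heights.
# Cutting between merges i and i+1 leaves n - i - 1 clusters.
i = int(np.argmax(np.diff(Z[:, 2])))
best_k_gap = X.shape[0] - i - 1

# Silhouette method: score each candidate k on the same tree.
scores = {k: silhouette_score(X, fcluster(Z, t=k, criterion="maxclust"))
          for k in range(2, 7)}
best_k_sil = max(scores, key=scores.get)

print(best_k_gap, best_k_sil)  # 3 3
```

On clean data like this, both methods agree on k=3; on messy real data they can disagree, which is itself a useful signal that the cluster structure is ambiguous.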
| Level | Segment | Strategy | Expected ROI |
|---|---|---|---|
| Broad (2 clusters) | Premium vs. Rest | VIP program vs. mass campaign | 15% lift |
| Medium (3 clusters) | Premium / Standard / Budget | Tiered loyalty rewards | 23% lift |
| Granular (5+ clusters) | All sub-segments | Personalized 1:1 messaging | 31% lift |
Key advantage: You ran hierarchical clustering once. Marketing can cut the dendrogram at whatever level makes sense for this month's campaign: no rerunning, no choosing k upfront.
| Dimension | K-Means | Hierarchical |
|---|---|---|
| Input required | Must specify k upfront | No k needed (choose after) |
| Result | Flat partition (k clusters) | Full tree (all k at once) |
| Scalability | ✅ Scales to millions of points | ⚠️ Slow on large data (O(n²) or O(n³)) |
| Interpretability | Cluster centroids easy to explain | Dendrogram shows relationships |
| Reproducibility | Random init → different results | Deterministic (same tree every run) |
| Cluster shape | Assumes spherical/convex | Flexible (depends on linkage) |
| Outlier handling | Sensitive to outliers | Can isolate outliers as singletons |
| Data size | 10K–10M+ rows | Practical up to ~5K–10K rows |
| Best use case | Large-scale segmentation with known k | Exploratory analysis, unknown k, need hierarchy |
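Both methods are a few lines in scikit-learn, so the choice is about the trade-offs above rather than effort. A minimal side-by-side sketch on invented three-blob data:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(7)
# Three well-separated toy blobs of 50 points each.
X = np.vstack([rng.normal(c, 0.4, size=(50, 2))
               for c in [(0, 0), (6, 0), (3, 5)]])

# K-Means: must commit to k upfront; random init, so we fix random_state.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Hierarchical: deterministic; n_clusters here is just where we cut the tree.
hc = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)

print(len(set(km.labels_)), len(set(hc.labels_)))  # 3 3
```

Note that `AgglomerativeClustering` needs no `random_state`: rerunning it on the same data always yields the same tree, matching the reproducibility row in the table.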
Choose K-Means when:
• Dataset is large (>10K rows)
• You know how many clusters you want
• Speed matters
• Clusters are roughly spherical

Choose hierarchical when:
• You want to explore k without rerunning
• You need cluster relationships (hierarchy)
• Dataset is moderate size (<10K rows)
• You want a reproducible, deterministic result
Test your understanding of hierarchical clustering. Answer all 5 to complete the module.
In agglomerative clustering, what happens at the very first step?
What does the height of a bar in a dendrogram represent?
Which linkage method is most prone to the "chaining effect" (long, stringy clusters)?
You have 500,000 customer records and need to segment them quickly into 5 groups. Which method is better and why?
When using the dendrogram gap method to choose k, what should you look for?