"The marketing team wants to know not just WHAT groups exist, but HOW they relate to each other. Are 'budget shoppers' a completely separate segment, or are they a sub-group of 'value seekers'? Which premium customers are closest to standard-tier? We need a hierarchy, not just a partition."
— CMO of a $2B retail company during a strategy meeting
K-Means gave us k clusters. Great. But it can't tell us:
• How the clusters relate to each other
• Whether one segment is a sub-group of another
• Which clusters sit closest together
Hierarchical clustering builds a tree of relationships, called a dendrogram, that answers all of these at once. You can cut it at any level to get 2, 3, 5, or 10 clusters without rerunning anything.
Agglomerative (bottom-up) clustering starts with every point as its own cluster, then repeatedly merges the two closest clusters until only one remains. Watch it happen:
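A minimal sketch of this merge process using SciPy's `linkage` function (the six points here are made-up toy data): each row of the returned matrix records one merge, from singletons up to a single root cluster.

```python
# Sketch of agglomerative clustering with SciPy (toy data).
import numpy as np
from scipy.cluster.hierarchy import linkage

# Six 2-D points forming two loose groups.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 5.1], [5.1, 4.8]])

# Each row of Z records one merge: [cluster_i, cluster_j, distance, new_size].
# n points always produce exactly n-1 merges.
Z = linkage(X, method="ward")
print(Z.shape)  # (5, 4)
```

The last row of `Z` is the final merge, whose `new_size` column equals the total number of points.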
The dendrogram records every merge: which clusters merged and at what distance. The height of each horizontal bar = the distance between clusters when they merged. Cut the tree at any height to get your clusters.
Drag the slider or click on the dendrogram to set the cut height. Each color = one cluster.
Taller bars = clusters that were far apart when merged. Look for large gaps.
Cut at height h → the number of vertical lines crossing the cut line = the number of clusters.
Clusters at k=4 are sub-divisions of clusters at k=2. No re-running needed.
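This nesting property can be checked directly: build the tree once with `linkage`, then cut it at two levels with `fcluster`. The four blob centers below are illustrative toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Four tight blobs; blobs near x=0 and x=10 form two broader groups.
centers = np.array([[0, 0], [0, 3], [10, 0], [10, 3]])
X = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

Z = linkage(X, method="ward")  # build the tree ONCE

# Cut the same tree at two levels -- no re-clustering.
labels_k2 = fcluster(Z, t=2, criterion="maxclust")
labels_k4 = fcluster(Z, t=4, criterion="maxclust")

# Every k=4 cluster sits entirely inside one k=2 cluster (nesting).
for c in np.unique(labels_k4):
    assert len(np.unique(labels_k2[labels_k4 == c])) == 1
```

Because both label sets come from cuts of the same tree, the finer partition is always a refinement of the coarser one.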
Once we have clusters (not just points), how do we measure distance between clusters? Different answers give dramatically different trees.
Point clusters
Resulting dendrogram
| Linkage | Distance = ? | Pros | Cons | Best For |
|---|---|---|---|---|
| Single | Minimum distance between any two points across clusters | Handles non-spherical shapes | Chaining effect: long, stringy clusters | Detecting elongated shapes |
| Complete | Maximum distance between any two points | Compact, tight clusters | Sensitive to outliers | Equal-size, compact clusters |
| Average | Average of all pairwise distances | Balanced, robust | Less intuitive | General-purpose use |
| Ward | Increase in total within-cluster variance | Minimizes variance (similar to the K-Means criterion) | Tends toward equal-size clusters | Most business applications |
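A quick way to see the linkage methods disagree is to cluster the same data with each one. The dataset below is an invented example: two blobs joined by a thin bridge of points, the kind of shape where single linkage tends to chain.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Two blobs plus a thin "bridge" of points between them.
blob_a = rng.normal([0, 0], 0.3, size=(30, 2))
blob_b = rng.normal([6, 0], 0.3, size=(30, 2))
bridge = np.column_stack([np.linspace(1, 5, 9), np.zeros(9)])
X = np.vstack([blob_a, blob_b, bridge])

# Same data, same k=2 cut -- only the linkage rule changes.
results = {}
for method in ["single", "complete", "average", "ward"]:
    labels = fcluster(linkage(X, method=method), t=2, criterion="maxclust")
    results[method] = np.bincount(labels)[1:]  # cluster sizes
    print(method, sorted(results[method].tolist(), reverse=True))
```

The cluster sizes typically differ across methods: single linkage lets the bridge chain the blobs together, while complete and Ward prefer compact, balanced splits.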
Before we can merge clusters, we need to define distance between points. The choice can change your results dramatically, especially with high-dimensional or text data.
Euclidean: straight-line distance. √(Σ(aᵢ-bᵢ)²)
Use for: spatial, continuous numeric data
Manhattan: grid-path distance. Σ|aᵢ-bᵢ|
Use for: data with outliers (more robust than Euclidean), city-block movement
Cosine: angle between vectors. 1 - (a·b)/(|a||b|)
Use for: text, high-dimensional data, NLP
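The three formulas above can be evaluated directly with `scipy.spatial.distance` (the vectors here are made-up examples chosen so the answers are easy to verify by hand):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])   # same direction as a, twice the length
c = np.array([-4.0, 3.0])  # orthogonal to a

print(euclidean(a, b))  # 5.0 : sqrt((3-6)^2 + (4-8)^2)
print(cityblock(a, b))  # 7.0 : |3-6| + |4-8|
print(cosine(a, b))     # 0.0 : same direction -> zero angular distance
print(cosine(a, c))     # 1.0 : orthogonal vectors
```

Note that cosine distance ignores vector length entirely: `a` and `b` point the same way, so their cosine distance is 0 even though they are 5 units apart in Euclidean terms. That is exactly why it suits text vectors, where document length shouldn't matter.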
With hierarchical clustering, you can choose k after building the tree. Two methods help you pick the optimal cut:
The red arrow points to the largest gap; that's where you should cut!
Higher silhouette = better-separated clusters. Pick the k with the highest score.
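Both methods can be computed from a single tree. A sketch on invented data with three well-separated blobs: the gap method looks for the largest jump in merge distance (column 2 of the linkage matrix), and the silhouette method scores each candidate cut.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
centers = np.array([[0, 0], [8, 0], [4, 7]])  # three well-separated blobs
X = np.vstack([c + 0.5 * rng.standard_normal((40, 2)) for c in centers])

Z = linkage(X, method="ward")

# Gap method: find the largest jump between consecutive merge heights.
# Cutting between merges i and i+1 leaves n - i - 1 clusters.
i = int(np.argmax(np.diff(Z[:, 2])))
best_k_gap = X.shape[0] - i - 1

# Silhouette method: score each candidate k on the same tree.
scores = {k: silhouette_score(X, fcluster(Z, t=k, criterion="maxclust"))
          for k in range(2, 7)}
best_k_sil = max(scores, key=scores.get)

print(best_k_gap, best_k_sil)  # 3 3
```

On clean data like this, both methods agree on k=3; on messy real data they can disagree, which is itself a useful signal that the cluster structure is ambiguous.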
| Level | Segment | Strategy | Expected ROI |
|---|---|---|---|
| Broad (2 clusters) | Premium vs. Rest | VIP program vs. mass campaign | 15% lift |
| Medium (3 clusters) | Premium / Standard / Budget | Tiered loyalty rewards | 23% lift |
| Granular (5+ clusters) | All sub-segments | Personalized 1:1 messaging | 31% lift |
Key advantage: You ran hierarchical clustering once. Marketing can cut the dendrogram at whatever level makes sense for this month's campaign: no rerunning, no choosing k upfront.
| Dimension | K-Means | Hierarchical |
|---|---|---|
| Input required | Must specify k upfront | No k needed (choose after) |
| Result | Flat partition (k clusters) | Full tree (all k at once) |
| Scalability | ✅ Scales to millions of points | ⚠️ Slow on large data (O(n²) or O(n³)) |
| Interpretability | Cluster centroids easy to explain | Dendrogram shows relationships |
| Reproducibility | Random init → different results | Deterministic (same tree every run) |
| Cluster shape | Assumes spherical/convex | Flexible (depends on linkage) |
| Outlier handling | Sensitive to outliers | Can isolate outliers as singletons |
| Data size | 10K–10M+ rows | Practical up to ~5K–10K rows |
| Best use case | Large-scale segmentation with known k | Exploratory analysis, unknown k, need hierarchy |
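Both methods are a few lines in scikit-learn, so the choice is about the trade-offs above rather than effort. A minimal side-by-side sketch on invented three-blob data:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(7)
# Three well-separated toy blobs of 50 points each.
X = np.vstack([rng.normal(c, 0.4, size=(50, 2))
               for c in [(0, 0), (6, 0), (3, 5)]])

# K-Means: must commit to k upfront; random init, so we fix random_state.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Hierarchical: deterministic; n_clusters here is just where we cut the tree.
hc = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)

print(len(set(km.labels_)), len(set(hc.labels_)))  # 3 3
```

Note that `AgglomerativeClustering` needs no `random_state`: rerunning it on the same data always yields the same tree, matching the reproducibility row in the table.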
Choose K-Means when:
• Dataset is large (>10K rows)
• You know how many clusters you want
• Speed matters
• Clusters are roughly spherical

Choose hierarchical when:
• You want to explore k without rerunning
• You need cluster relationships (hierarchy)
• Dataset is moderate size (<10K rows)
• You want a reproducible, deterministic result
Test your understanding of hierarchical clustering. Answer all 5 to complete the module.
In agglomerative clustering, what happens at the very first step?
What does the height of a bar in a dendrogram represent?
Which linkage method is most prone to the "chaining effect" (long, stringy clusters)?
You have 500,000 customer records and need to segment them quickly into 5 groups. Which method is better and why?
When using the dendrogram gap method to choose k, what should you look for?