๐Ÿ  Course Home ๐ŸŽฏ K-Means Interactive ๐ŸŒณ Hierarchical ๐Ÿ”ฌ Lab

📊 Module 9: Clustering – Find the Hidden Groups

🎯 The Challenge

RetailMax has 2 million customers but treats them all the same. Every customer gets the same emails, the same promotions, the same recommendations. Result: 12% open rate on emails, $3.2M wasted on irrelevant promotions per year, and customers churning because they feel misunderstood.

Your job: find the hidden groups. Within those 2 million customers are segments with distinct behaviors, values, and needs. Discover them, and you can personalize at scale.

• 2M customers to segment
• $3.2M annual waste from generic targeting
• 340% ROI from personalized campaigns


1. What is Clustering?

Unsupervised learning – finding structure when you have no labels

Supervised vs. Unsupervised Learning

Everything we've done so far was supervised: we had labeled examples (spam/not-spam, price, churn/no-churn) and learned to predict them.

Clustering is different. You have data with no labels. You're exploring: are there natural groups? How many? What defines each group?

Key insight: Clustering is about discovering structure, not predicting outcomes. Two customers who buy premium brands, live in urban areas, and are 25–35 years old are "similar" even if we never labeled them as such.

Three Main Approaches

🎯 K-Means

Assign each point to the nearest centroid. Iterate until stable. Fast, scalable.

Best for: known # clusters, spherical shapes

🌳 Hierarchical

Merge closest points iteratively. Produces a tree (dendrogram).

Best for: unknown # clusters, nested groups

🌊 DBSCAN

Find dense regions, label sparse points as noise.

Best for: irregular shapes, outlier detection

2. K-Means Clustering

The workhorse of customer segmentation

How K-Means Works

  1. Choose k: Decide how many clusters you want
  2. Initialize centroids: Place k random points
  3. Assign points: Each point joins its nearest centroid
  4. Update centroids: Move each centroid to the mean of its cluster
  5. Repeat: Until centroids stop moving (convergence)
The math: K-Means minimizes the within-cluster sum of squares (WCSS):
WCSS = Σₖ Σᵢ∈Cₖ ||xᵢ - μₖ||²
where μₖ is the centroid of cluster k and Cₖ is the set of points assigned to it.
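The five steps above fit in a few lines of NumPy. This is a minimal sketch on made-up toy data (the `kmeans` helper and the two-blob dataset are illustrative only); a production run would use scikit-learn's `KMeans`, which adds k-means++ initialization and multiple restarts.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means following steps 1-5 above (illustrative helper)."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids as k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # The objective being minimized: within-cluster sum of squares
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss

# Toy data: two well-separated blobs, so k=2 recovers them cleanly
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids, wcss = kmeans(X, k=2)
```

Because the blobs are far apart, the algorithm converges to the same split regardless of which data points the random initialization picks.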

🎮 Interactive K-Means Visualization

Click on the canvas to place data points. Then click "Run K-Means" to watch the algorithm cluster them in real time.


K-Means Strengths & Limitations

| Aspect | Detail | Implication |
|---|---|---|
| ✅ Speed | O(n·k·iterations) | Scales to millions of customers |
| ✅ Simplicity | Easy to understand & explain | Stakeholders can grasp the segments |
| ⚠️ Needs k upfront | Must specify number of clusters | Use elbow method or business knowledge |
| ⚠️ Spherical clusters | Assumes similar-sized round clusters | Fails on elongated or irregular shapes |
| ⚠️ Sensitive to init | Random start → different results | Use k-means++ or multiple restarts |
| ❌ No noise handling | Every point must join a cluster | Outliers distort centroids |

3. Choosing k: The Elbow Method

How many clusters is the right number?

The hardest part of K-Means is choosing k. Too few clusters and your segments are too broad; too many and they become tiny and meaningless.

The elbow method plots WCSS (inertia) against k. As k increases, WCSS always decreases, but there is a point of diminishing returns, the "elbow", where adding more clusters stops being worth it.
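A typical elbow sweep looks like this. The sketch uses scikit-learn on a synthetic three-blob dataset standing in for the RetailMax features (the blob centers are invented):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in with 3 true clusters at invented centers
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [7, 0], [3.5, 6]],
                  cluster_std=1.0, random_state=42)

# WCSS (inertia) for each candidate k
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

# WCSS always falls as k grows; the elbow is where the drop flattens
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 8)}
```

With three true clusters, the drop from k=2 to k=3 should be large and the drop from k=3 to k=4 small: the elbow sits at k=3.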

๐Ÿ“ Interactive Elbow Method

Adjust the slider to see how WCSS and silhouette score change with different values of k on the RetailMax dataset.


Silhouette Score: A Better Metric

The silhouette score measures how similar each point is to its own cluster vs. the nearest other cluster:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where a(i) is the mean distance from point i to the other points in its own cluster and b(i) is the mean distance to the points of the nearest other cluster. Scores range from -1 to +1; values near +1 indicate well-separated clusters.
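In practice you sweep k and keep the value with the highest average silhouette. A sketch with scikit-learn's `silhouette_score` on synthetic, well-separated blobs (centers invented):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [4, 7]],
                  cluster_std=0.8, random_state=7)

# Silhouette peaks at the k that best matches the true structure
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

Unlike WCSS, the silhouette score does not automatically improve as k grows, which is why it is the better model-selection metric.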

โ“ Quick Check 1

RetailMax runs K-Means with k=2 (WCSS=1200) and k=5 (WCSS=380). Should they always choose k=5?

Yes – lower WCSS always means better clustering
No – more clusters doesn't mean more useful segments
Yes – k=5 captures more customer nuance
No – WCSS can't be compared across different k values

4. Hierarchical Clustering

Building a tree of relationships

Instead of picking k upfront, hierarchical clustering builds a complete tree of merges. You can cut the tree at any height to get any number of clusters.

Agglomerative (Bottom-Up) Algorithm

  1. Start: each point is its own cluster (n clusters)
  2. Find the two closest clusters
  3. Merge them into one
  4. Repeat until one cluster remains
  5. Cut the dendrogram at the desired height
Linkage methods define "closest":
• Single: minimum distance between any two points
• Complete: maximum distance between any two points
• Average: average of all pairwise distances
• Ward: minimizes increase in total WCSS (best for most use cases)
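With SciPy, the whole agglomerative procedure is two calls: `linkage` builds the full merge tree (the dendrogram data) and `fcluster` cuts it. A sketch on a made-up two-group dataset:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two tight groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(4, 0.3, (20, 2))])

# Build the complete merge tree with Ward linkage
Z = linkage(X, method="ward")

# Cut the tree to yield exactly 2 clusters (instead of picking a height)
labels = fcluster(Z, t=2, criterion="maxclust")
```

The same `Z` matrix can be cut again at any other level, which is exactly the "decide k later" advantage described above.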

🌳 Interactive Dendrogram

Click on the dendrogram to cut at different heights and see how many clusters result. The colored bands show different cluster groupings.


โ“ Quick Check 2

A dendrogram shows a very long vertical line before the final merge at the top. What does this suggest?

The data has many outliers
There are likely 2 natural clusters – the two groups are very different
K-Means would be a better algorithm here
The linkage method is wrong

5. DBSCAN: Density-Based Clustering

Find clusters of any shape, ignore noise

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) doesn't look for compact spheres. Instead, it finds dense regions and labels sparse points as noise.

Two Parameters

• ε (epsilon): the radius of the neighborhood searched around each point
• minPts: the minimum number of neighbors within ε for a point to count as dense

Three Types of Points

🔵 Core Point

Has ≥ minPts neighbors within ε. Center of a dense region.

🟡 Border Point

Fewer than minPts neighbors, but reachable from a core point.

🔴 Noise Point

Not core, not reachable from any core. Outlier.
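A sketch with scikit-learn's `DBSCAN`, whose `eps` and `min_samples` parameters correspond to ε and minPts. The two blobs and three outliers are made-up toy data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus three far-away outliers
rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal(0, 0.15, (40, 2)),              # dense region A
    rng.normal(5, 0.15, (40, 2)),              # dense region B
    [[10.0, 10.0], [-8.0, 9.0], [9.0, -9.0]],  # isolated outliers
])

# eps plays the role of epsilon, min_samples of minPts
labels = DBSCAN(eps=0.75, min_samples=5).fit_predict(X)

# DBSCAN labels noise points -1; clusters are numbered 0, 1, ...
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
```

Note that no number of clusters was specified anywhere: the density parameters alone determine both the cluster count and which points end up as noise.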

🌊 Interactive DBSCAN

Adjust epsilon to see how the density threshold changes cluster formation. Watch how noise points (red ✕) appear when epsilon is too small.


When to Use DBSCAN

DBSCAN shines when: clusters have irregular, non-spherical shapes; the data contains outliers you want flagged explicitly; the number of clusters is unknown in advance.

DBSCAN struggles when: clusters have very different densities (a single ε can't suit them all), or the data is high-dimensional, where distance-based density estimates become unreliable.

โ“ Quick Check 3

Which algorithm is BEST for finding fraudulent transactions (rare, unusual patterns) in a customer dataset?

K-Means with k=10
Hierarchical with Ward linkage
DBSCAN – fraud cases will be labeled as noise/outliers
Any algorithm works equally well here

6. Business Case: RetailMax Customer Segmentation

From algorithm to actionable strategy

The Data

RetailMax has 3 key features per customer: Purchase Frequency (orders/year), Average Order Value ($), and Recency (days since last purchase).
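Before clustering, these three features must be put on a common scale: otherwise recency (days) and order value (dollars) swamp frequency (orders/year) in the distance computation. A sketch with invented numbers for six hypothetical customers:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [frequency (orders/yr), avg order value ($), recency (days)]
X = np.array([
    [24.0, 150.0, 5.0], [20.0, 140.0, 9.0], [22.0, 160.0, 7.0],  # active, high-value
    [2.0, 30.0, 200.0], [3.0, 25.0, 180.0], [1.0, 35.0, 220.0],  # dormant, low-value
])

# Standardize so dollars and days don't drown out orders/year
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
```

After clustering, you would profile each segment by averaging the original (unscaled) features per label, which is how tables like the one below are built.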

🎯 RetailMax Segmentation Demo

Click "Segment Customers" to run K-Means on the RetailMax data and see the resulting business segments.

From Segments to Strategy

| Segment | Profile | Size | Strategy | Expected Lift |
|---|---|---|---|---|
| 💎 Champions | High frequency, high value, recent | 8% | VIP program, early access, referral rewards | +25% AOV |
| 🌱 Potential Loyalists | Mid frequency, growing value | 22% | Loyalty tier, personalized recommendations | +40% retention |
| 💤 At Risk | Was frequent, now dormant | 31% | Win-back campaign, "We miss you" discounts | Recover 15% |
| 👋 New Customers | Low frequency, recent | 39% | Onboarding series, first-purchase discounts | +60% 2nd purchase |

โ“ Quick Check 4

After running K-Means, one cluster contains 80% of all customers while the other two are tiny. What went wrong?

The data wasn't normalized – features with large ranges dominate
k is too small – you need more clusters
Bad centroid initialization
All of the above could cause this – always normalize data, validate k, and use k-means++

7. Choosing the Right Algorithm

The right tool for the right job

| Criterion | K-Means | Hierarchical | DBSCAN |
|---|---|---|---|
| Need to specify k? | ✅ Yes | ❌ No (cut later) | ❌ No (automatic) |
| Scalability | ✅ Excellent (millions) | ⚠️ O(n²) – medium datasets | ✅ Good |
| Cluster shapes | Spherical only | Any (via dendrogram) | Any shape |
| Handles noise? | ❌ No | ❌ No | ✅ Yes (explicitly) |
| Interpretability | High (centroids) | High (dendrogram) | Medium |
| Best use case | Customer segmentation, profiling | Taxonomy, hierarchy understanding | Anomaly detection, geo clusters |

โ“ Quick Check 5 (Final)

RetailMax wants to identify customer segments to target with email campaigns. The marketing team insists on exactly 4 segments (one per team member). Which algorithm is MOST appropriate?

K-Means with k=4 – fast, interpretable, meets the constraint
DBSCAN โ€” it finds the natural number of clusters
Hierarchical with single linkage
No algorithm โ€” you need labels for email marketing

๐Ÿ“ Module 9 Summary
