🔬 Module 2 Lab: Linear Regression on Housing Data

Apply OLS, multiple regression, regularization, and evaluation to real-world housing price data.

Lab Instructions: Work through all 5 tasks in order, running each code block and checking its output. Use the hints if you get stuck.

Task 1: Load and Explore Housing Data

You're working with a dataset of 506 Boston-area housing observations. Each row is a neighborhood. Your target variable is MEDV (median home value in $1,000s). Let's understand the data before modeling.

# task1_explore.py
Hint: Look at the correlation values carefully. Features with high absolute correlation to MEDV are good candidates for predictors. Notice that LSTAT (% lower status population) has the strongest negative correlation, and RM (avg rooms per dwelling) has the strongest positive correlation. These will be your first two predictors.
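The lab's actual loader isn't shown here, so the sketch below uses a synthetic stand-in (an assumption) that mimics the dataset's broad shape: it screens RM and LSTAT by their Pearson correlation with MEDV, which is the exploration step the hint describes.

```python
# task1_explore.py -- sketch: screen features by correlation with MEDV.
# NOTE: the real lab loads the 506-row housing dataset; the synthetic
# stand-in below is an assumption so the example runs anywhere.
import numpy as np

rng = np.random.default_rng(0)
n = 506
RM = rng.normal(6.3, 0.7, n)                   # avg rooms per dwelling
LSTAT = np.clip(rng.normal(12, 7, n), 1, 38)   # % lower-status population
MEDV = 22.5 + 5.0 * RM - 0.6 * LSTAT + rng.normal(0, 3, n)  # target, $1000s

def corr(a, b):
    """Pearson correlation of two 1-D arrays."""
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

for name, col in [("RM", RM), ("LSTAT", LSTAT)]:
    print(f"corr({name}, MEDV) = {corr(col, MEDV):+.2f}")
```

As in the real data, RM comes out strongly positive and LSTAT strongly negative, which is why the hint picks them as the first two predictors.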

📊 Data Explorer

Key variables in the housing dataset (506 observations):

- RM — average rooms per dwelling
- LSTAT — % lower-status population
- AGE — % of units built before 1940
- PTRATIO — pupil-teacher ratio
- CRIM — per-capita crime rate
- MEDV — median home value ($1,000s), the target


Task 2: Simple Linear Regression

Use a single predictor (RM — average number of rooms) to predict MEDV. Implement OLS from scratch, then compare to sklearn's output.

# task2_simple_lr.py
Hint: The OLS formulas are β₁ = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)² and β₀ = ȳ − β₁x̄. With RM as predictor, expect R² around 0.48 — RM explains about half the variance in prices. That's actually useful but leaves room for improvement with multiple predictors.
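A minimal sketch of the from-scratch fit, again on synthetic stand-in data (an assumption; the slope and noise level are chosen so R² lands near the ~0.48 the hint mentions). It applies the two OLS formulas from the hint directly, then cross-checks against numpy's least-squares fit — the lab itself compares to sklearn instead.

```python
# task2_simple_lr.py -- sketch: OLS with one predictor, from scratch.
# Synthetic stand-in for the lab data (an assumption).
import numpy as np

rng = np.random.default_rng(1)
RM = rng.normal(6.3, 0.7, 506)
MEDV = -34.7 + 9.1 * RM + rng.normal(0, 6.6, 506)

# beta1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2); beta0 = ybar - beta1*xbar
x_bar, y_bar = RM.mean(), MEDV.mean()
beta1 = np.sum((RM - x_bar) * (MEDV - y_bar)) / np.sum((RM - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

y_hat = beta0 + beta1 * RM
r2 = 1 - np.sum((MEDV - y_hat) ** 2) / np.sum((MEDV - y_bar) ** 2)
print(f"beta0={beta0:.2f}  beta1={beta1:.2f}  R^2={r2:.3f}")

# Cross-check: np.polyfit(deg=1) returns [slope, intercept] from the same
# least-squares problem, so it must agree with the hand-computed fit.
b1_np, b0_np = np.polyfit(RM, MEDV, 1)
assert np.allclose([beta0, beta1], [b0_np, b1_np])
```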

📉 Scatter Plot: RM vs. MEDV


Task 3: Multiple Regression + Multicollinearity Check

Add more features to improve prediction. But beware: when predictors are correlated with each other (multicollinearity), coefficient estimates become unstable.

# task3_multiple_lr.py
Hint: Adding LSTAT and PTRATIO should push R² from ~0.48 to ~0.72. VIF > 5 signals multicollinearity. In this dataset, AGE and LSTAT can be somewhat correlated (older neighborhoods often have more lower-income residents). Solutions: drop correlated features, use PCA, or apply Ridge regression (Task 4).
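The following sketch shows both halves of the task — a multiple OLS fit and a VIF check — on synthetic stand-in data (an assumption). AGE is deliberately generated to be correlated with LSTAT, as the hint describes, so a high VIF actually shows up.

```python
# task3_multiple_lr.py -- sketch: multiple OLS plus a VIF multicollinearity check.
# Synthetic stand-in data (an assumption); AGE is built from LSTAT on purpose.
import numpy as np

rng = np.random.default_rng(2)
n = 506
RM = rng.normal(6.3, 0.7, n)
LSTAT = np.clip(rng.normal(12, 7, n), 1, 38)
AGE = 30 + 3.0 * LSTAT + rng.normal(0, 7, n)   # correlated with LSTAT
PTRATIO = rng.normal(18.5, 2.1, n)
MEDV = 22.5 + 5.0 * RM - 0.6 * LSTAT - 1.0 * PTRATIO + rng.normal(0, 4, n)

feats = np.column_stack([RM, LSTAT, AGE, PTRATIO])
X = np.column_stack([np.ones(n), feats])           # prepend intercept column
beta, *_ = np.linalg.lstsq(X, MEDV, rcond=None)

y_hat = X @ beta
r2 = 1 - np.sum((MEDV - y_hat) ** 2) / np.sum((MEDV - MEDV.mean()) ** 2)
print(f"R^2 = {r2:.3f}")

def vif(F, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    feature j on all the other features (plus an intercept)."""
    A = np.column_stack([np.ones(len(F)), np.delete(F, j, axis=1)])
    coef, *_ = np.linalg.lstsq(A, F[:, j], rcond=None)
    resid = F[:, j] - A @ coef
    return 1.0 / (resid.var() / F[:, j].var())

for name, j in [("RM", 0), ("LSTAT", 1), ("AGE", 2), ("PTRATIO", 3)]:
    print(f"VIF({name}) = {vif(feats, j):.2f}")
```

Here LSTAT and AGE exceed the VIF > 5 threshold while RM and PTRATIO stay near 1, matching the hint's warning about that feature pair.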

🎛️ Interactive: Add/Remove Features

Select features to include in the multiple regression model.

Task 4: Regularization Preview (Ridge vs. Lasso)

Regularization adds a penalty term to prevent overfitting and handle multicollinearity. Ridge (L2) shrinks all coefficients. Lasso (L1) can set some to exactly zero, performing feature selection.

# task4_regularization.py
Hint: The penalty term modifies the loss function: Ridge minimizes Σ(yᵢ−ŷᵢ)² + α·Σβⱼ² while Lasso minimizes Σ(yᵢ−ŷᵢ)² + α·Σ|βⱼ|. The L1 norm in Lasso creates sparsity. Try increasing α to see more shrinkage. At α=10+, Lasso will zero out several features.
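A sketch of both penalties on synthetic data (an assumption: centered features, no intercept, and two truly-zero coefficients so Lasso's sparsity is visible). Ridge has the closed form β = (XᵀX + αI)⁻¹Xᵀy for the loss in the hint; Lasso has no closed form, so this uses plain coordinate descent with soft-thresholding — the lab presumably uses sklearn's solvers instead.

```python
# task4_regularization.py -- sketch: Ridge (closed form) vs Lasso (coordinate descent).
# Synthetic data (an assumption): two of the five true coefficients are zero.
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 1.5])
y = X @ true_beta + rng.normal(0, 1, n)

def ridge(X, y, alpha):
    """Closed form: beta = (X^T X + alpha*I)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def lasso(X, y, alpha, n_iter=200):
    """Coordinate descent: each coefficient gets a soft-threshold update."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]          # partial residual
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0) / (X[:, j] @ X[:, j])
    return beta

print("ridge alpha=10:", np.round(ridge(X, y, 10.0), 2))
print("lasso alpha=50:", np.round(lasso(X, y, 50.0), 2))
```

Note the α scales differ from sklearn's (which normalizes the squared-error term by the sample count), but the qualitative behavior is the same: raising α shrinks Ridge coefficients toward zero and drives the weakest Lasso coefficients to exactly zero.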

🎛️ Interactive: Regularization Strength Demo


Task 5: Model Evaluation — Residual Analysis

A good regression model has residuals that look like random noise. If you see patterns in residual plots, your model is misspecified (missing features or the wrong functional form) or the error variance is non-constant (heteroscedasticity).

# task5_evaluation.py
Hint: Look for three things in residual plots: (1) No pattern in Residuals vs. Fitted (random cloud = good), (2) Roughly linear Q-Q plot (normality assumption holds), (3) No funnel shape (homoscedasticity). If you see a curved pattern in residuals vs. fitted, you may need polynomial features.
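The lab draws the three diagnostic plots; the sketch below (on synthetic stand-in data, an assumption) computes numeric versions of the same checks. One subtlety: OLS with an intercept makes corr(fitted, residuals) exactly zero by construction, so the pattern check here correlates residuals with *squared* fitted values to look for curvature instead.

```python
# task5_evaluation.py -- sketch: numeric residual diagnostics (no plotting).
# Synthetic stand-in data (an assumption) fit with 2 predictors + intercept.
import numpy as np

rng = np.random.default_rng(5)
n = 506
RM = rng.normal(6.3, 0.7, n)
LSTAT = np.clip(rng.normal(12, 7, n), 1, 38)
X = np.column_stack([np.ones(n), RM, LSTAT])
y = X @ np.array([20.0, 4.0, -0.5]) + rng.normal(0, 4, n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

rmse = np.sqrt(np.mean(resid ** 2))
mae = np.mean(np.abs(resid))

# (1) Curvature check: residuals vs squared fitted values should be
#     uncorrelated if the functional form is right.
curv = np.corrcoef((fitted - fitted.mean()) ** 2, resid)[0, 1]

# (2) Funnel check: residual spread in the low vs high half of fitted
#     values should be similar if errors are homoscedastic.
mid = np.median(fitted)
ratio = resid[fitted >= mid].std() / resid[fitted < mid].std()

print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  curvature={curv:.3f}  spread ratio={ratio:.2f}")
```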

📊 Residual Diagnostic Plots

Panels: Residuals vs. Fitted · Q-Q Plot · Residual Histogram

Model metrics (3 features):

- R²: 0.731
- RMSE: $4.47K
- MAE: $3.20K
- CV R² (5-fold): 0.714