🔬 Module 2 Lab: Linear Regression on Housing Data

Apply OLS, multiple regression, regularization, and evaluation to real-world housing price data.

Lab Instructions: Work through all 5 tasks in order, running each code block and checking its output. Use the hints if you get stuck.

Task 1: Load and Explore Housing Data

You're working with a dataset of 506 Boston-area housing observations. Each row is a neighborhood. Your target variable is MEDV (median home value in $1,000s). Let's understand the data before modeling.

# task1_explore.py
Hint: Look at the correlation values carefully. Features with high absolute correlation to MEDV are good candidates for predictors. Notice that LSTAT (% lower status population) has the strongest negative correlation, and RM (avg rooms per dwelling) has the strongest positive correlation. These will be your first two predictors.
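The lab's actual loader isn't shown here, so the sketch below uses a synthetic stand-in (an assumption) that mimics the dataset's broad shape: it screens RM and LSTAT by their Pearson correlation with MEDV, which is the exploration step the hint describes.

```python
# task1_explore.py -- sketch: screen features by correlation with MEDV.
# NOTE: the real lab loads the 506-row housing dataset; the synthetic
# stand-in below is an assumption so the example runs anywhere.
import numpy as np

rng = np.random.default_rng(0)
n = 506
RM = rng.normal(6.3, 0.7, n)                   # avg rooms per dwelling
LSTAT = np.clip(rng.normal(12, 7, n), 1, 38)   # % lower-status population
MEDV = 22.5 + 5.0 * RM - 0.6 * LSTAT + rng.normal(0, 3, n)  # target, $1000s

def corr(a, b):
    """Pearson correlation of two 1-D arrays."""
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

for name, col in [("RM", RM), ("LSTAT", LSTAT)]:
    print(f"corr({name}, MEDV) = {corr(col, MEDV):+.2f}")
```

As in the real data, RM comes out strongly positive and LSTAT strongly negative, which is why the hint picks them as the first two predictors.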

📊 Data Explorer

Key variables in the housing dataset (506 observations):

- RM — average rooms per dwelling
- LSTAT — % lower-status population
- AGE — % of units built before 1940
- PTRATIO — pupil-teacher ratio
- CRIM — per-capita crime rate
- MEDV — median home value ($1,000s), the target


Task 2: Simple Linear Regression

Use a single predictor (RM — average number of rooms) to predict MEDV. Implement OLS from scratch, then compare to sklearn's output.

# task2_simple_lr.py
Hint: The OLS formulas are β₁ = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)² and β₀ = ȳ − β₁x̄. With RM as predictor, expect R² around 0.48 — RM explains about half the variance in prices. That's actually useful but leaves room for improvement with multiple predictors.
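A minimal sketch of the from-scratch fit, again on synthetic stand-in data (an assumption; the slope and noise level are chosen so R² lands near the ~0.48 the hint mentions). It applies the two OLS formulas from the hint directly, then cross-checks against numpy's least-squares fit — the lab itself compares to sklearn instead.

```python
# task2_simple_lr.py -- sketch: OLS with one predictor, from scratch.
# Synthetic stand-in for the lab data (an assumption).
import numpy as np

rng = np.random.default_rng(1)
RM = rng.normal(6.3, 0.7, 506)
MEDV = -34.7 + 9.1 * RM + rng.normal(0, 6.6, 506)

# beta1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2); beta0 = ybar - beta1*xbar
x_bar, y_bar = RM.mean(), MEDV.mean()
beta1 = np.sum((RM - x_bar) * (MEDV - y_bar)) / np.sum((RM - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

y_hat = beta0 + beta1 * RM
r2 = 1 - np.sum((MEDV - y_hat) ** 2) / np.sum((MEDV - y_bar) ** 2)
print(f"beta0={beta0:.2f}  beta1={beta1:.2f}  R^2={r2:.3f}")

# Cross-check: np.polyfit(deg=1) returns [slope, intercept] from the same
# least-squares problem, so it must agree with the hand-computed fit.
b1_np, b0_np = np.polyfit(RM, MEDV, 1)
assert np.allclose([beta0, beta1], [b0_np, b1_np])
```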

📉 Scatter Plot: RM vs. MEDV


Task 3: Multiple Regression + Multicollinearity Check

Add more features to improve prediction. But beware: when predictors are correlated with each other (multicollinearity), coefficient estimates become unstable.

# task3_multiple_lr.py
Hint: Adding LSTAT and PTRATIO should push R² from ~0.48 to ~0.72. VIF > 5 signals multicollinearity. In this dataset, AGE and LSTAT can be somewhat correlated (older neighborhoods often have more lower-income residents). Solutions: drop correlated features, use PCA, or apply Ridge regression (Task 4).
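The following sketch shows both halves of the task — a multiple OLS fit and a VIF check — on synthetic stand-in data (an assumption). AGE is deliberately generated to be correlated with LSTAT, as the hint describes, so a high VIF actually shows up.

```python
# task3_multiple_lr.py -- sketch: multiple OLS plus a VIF multicollinearity check.
# Synthetic stand-in data (an assumption); AGE is built from LSTAT on purpose.
import numpy as np

rng = np.random.default_rng(2)
n = 506
RM = rng.normal(6.3, 0.7, n)
LSTAT = np.clip(rng.normal(12, 7, n), 1, 38)
AGE = 30 + 3.0 * LSTAT + rng.normal(0, 7, n)   # correlated with LSTAT
PTRATIO = rng.normal(18.5, 2.1, n)
MEDV = 22.5 + 5.0 * RM - 0.6 * LSTAT - 1.0 * PTRATIO + rng.normal(0, 4, n)

feats = np.column_stack([RM, LSTAT, AGE, PTRATIO])
X = np.column_stack([np.ones(n), feats])           # prepend intercept column
beta, *_ = np.linalg.lstsq(X, MEDV, rcond=None)

y_hat = X @ beta
r2 = 1 - np.sum((MEDV - y_hat) ** 2) / np.sum((MEDV - MEDV.mean()) ** 2)
print(f"R^2 = {r2:.3f}")

def vif(F, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    feature j on all the other features (plus an intercept)."""
    A = np.column_stack([np.ones(len(F)), np.delete(F, j, axis=1)])
    coef, *_ = np.linalg.lstsq(A, F[:, j], rcond=None)
    resid = F[:, j] - A @ coef
    return 1.0 / (resid.var() / F[:, j].var())

for name, j in [("RM", 0), ("LSTAT", 1), ("AGE", 2), ("PTRATIO", 3)]:
    print(f"VIF({name}) = {vif(feats, j):.2f}")
```

Here LSTAT and AGE exceed the VIF > 5 threshold while RM and PTRATIO stay near 1, matching the hint's warning about that feature pair.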

🎛️ Interactive: Add/Remove Features

Select features to include in the multiple regression model.

Task 4: Regularization Preview (Ridge vs. Lasso)

Regularization adds a penalty term to prevent overfitting and handle multicollinearity. Ridge (L2) shrinks all coefficients. Lasso (L1) can set some to exactly zero, performing feature selection.

# task4_regularization.py
Hint: The penalty term modifies the loss function: Ridge minimizes Σ(yᵢ−ŷᵢ)² + α·Σβⱼ² while Lasso minimizes Σ(yᵢ−ŷᵢ)² + α·Σ|βⱼ|. The L1 norm in Lasso creates sparsity. Try increasing α to see more shrinkage. At α=10+, Lasso will zero out several features.
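A sketch of both penalties on synthetic data (an assumption: centered features, no intercept, and two truly-zero coefficients so Lasso's sparsity is visible). Ridge has the closed form β = (XᵀX + αI)⁻¹Xᵀy for the loss in the hint; Lasso has no closed form, so this uses plain coordinate descent with soft-thresholding — the lab presumably uses sklearn's solvers instead.

```python
# task4_regularization.py -- sketch: Ridge (closed form) vs Lasso (coordinate descent).
# Synthetic data (an assumption): two of the five true coefficients are zero.
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 1.5])
y = X @ true_beta + rng.normal(0, 1, n)

def ridge(X, y, alpha):
    """Closed form: beta = (X^T X + alpha*I)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def lasso(X, y, alpha, n_iter=200):
    """Coordinate descent: each coefficient gets a soft-threshold update."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]          # partial residual
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0) / (X[:, j] @ X[:, j])
    return beta

print("ridge alpha=10:", np.round(ridge(X, y, 10.0), 2))
print("lasso alpha=50:", np.round(lasso(X, y, 50.0), 2))
```

Note the α scales differ from sklearn's (which normalizes the squared-error term by the sample count), but the qualitative behavior is the same: raising α shrinks Ridge coefficients toward zero and drives the weakest Lasso coefficients to exactly zero.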

🎛️ Interactive: Regularization Strength Demo


Task 5: Model Evaluation — Residual Analysis

A good regression model has residuals that look like random noise. If you see patterns in residual plots, your model is misspecified (missing features or the wrong functional form) or the error variance is non-constant (heteroscedasticity).

# task5_evaluation.py
Hint: Look for three things in residual plots: (1) No pattern in Residuals vs. Fitted (random cloud = good), (2) Roughly linear Q-Q plot (normality assumption holds), (3) No funnel shape (homoscedasticity). If you see a curved pattern in residuals vs. fitted, you may need polynomial features.
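The lab draws the three diagnostic plots; the sketch below (on synthetic stand-in data, an assumption) computes numeric versions of the same checks. One subtlety: OLS with an intercept makes corr(fitted, residuals) exactly zero by construction, so the pattern check here correlates residuals with *squared* fitted values to look for curvature instead.

```python
# task5_evaluation.py -- sketch: numeric residual diagnostics (no plotting).
# Synthetic stand-in data (an assumption) fit with 2 predictors + intercept.
import numpy as np

rng = np.random.default_rng(5)
n = 506
RM = rng.normal(6.3, 0.7, n)
LSTAT = np.clip(rng.normal(12, 7, n), 1, 38)
X = np.column_stack([np.ones(n), RM, LSTAT])
y = X @ np.array([20.0, 4.0, -0.5]) + rng.normal(0, 4, n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

rmse = np.sqrt(np.mean(resid ** 2))
mae = np.mean(np.abs(resid))

# (1) Curvature check: residuals vs squared fitted values should be
#     uncorrelated if the functional form is right.
curv = np.corrcoef((fitted - fitted.mean()) ** 2, resid)[0, 1]

# (2) Funnel check: residual spread in the low vs high half of fitted
#     values should be similar if errors are homoscedastic.
mid = np.median(fitted)
ratio = resid[fitted >= mid].std() / resid[fitted < mid].std()

print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  curvature={curv:.3f}  spread ratio={ratio:.2f}")
```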

📊 Residual Diagnostic Plots

Panels: Residuals vs. Fitted · Q-Q Plot · Residual Histogram

Model metrics (3 features):

- R²: 0.731
- RMSE: $4.47K
- MAE: $3.20K
- CV R² (5-fold): 0.714