Chapter 4: NumPy

Arrays and Numerical Computing

4.1 Why NumPy?

NumPy (Numerical Python) is the foundation of the Python data science stack. It provides the ndarray, a fast, memory-efficient multidimensional array that supports vectorized operations. Pandas, scikit-learn, and matplotlib all build on NumPy arrays internally.

import numpy as np

4.2 Creating Arrays

# From a list
sales = np.array([120, 135, 150, 142, 160])

# Common constructors
zeros  = np.zeros((3, 4))           # 3x4 matrix of zeros
ones   = np.ones(5)                  # [1. 1. 1. 1. 1.]
rng    = np.arange(0, 100, 10)      # [0, 10, 20, ..., 90]
grid   = np.linspace(0, 1, 5)       # [0. , 0.25, 0.5, 0.75, 1.]
ident  = np.eye(3)                   # 3x3 identity matrix

print(sales.shape)   # (5,)
print(zeros.shape)   # (3, 4)
print(sales.dtype)   # int64 (int32 on some platforms)
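The dtype is inferred from the input by default, but it can also be set explicitly, and astype() converts an existing array (returning a copy). A quick sketch:

```python
import numpy as np

# Explicit dtype at construction time
counts = np.array([1, 2, 3], dtype=np.float64)
print(counts.dtype)   # float64

# Convert an existing integer array with astype (returns a copy)
sales = np.array([120, 135, 150])
sales_f = sales.astype(np.float64)
print(sales_f)        # [120. 135. 150.]
```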

4.3 Indexing and Slicing

data = np.array([[10, 20, 30],
                 [40, 50, 60],
                 [70, 80, 90]])

print(data[0, 2])     # 30  (row 0, col 2)
print(data[1, :])     # [40, 50, 60]  (entire row 1)
print(data[:, 0])     # [10, 40, 70]  (entire col 0)
print(data[:2, 1:])  # [[20, 30], [50, 60]]  (submatrix)

# Boolean indexing
print(data[data > 50])  # [60, 70, 80, 90]

Boolean Indexing in Depth

Boolean indexing is one of NumPy's most powerful features. You create a boolean array (True/False mask) using a comparison, then pass that mask as an index. This is analogous to a WHERE clause in SQL. You can combine multiple conditions using & (and), | (or), and ~ (not), but you must wrap each condition in parentheses.

sales = np.array([120, 85, 200, 45, 310, 95, 150])

# Single condition
high_sales = sales[sales > 100]
print(high_sales)  # [120 200 310 150]

# Multiple conditions — parentheses are required
mid_range = sales[(sales >= 80) & (sales <= 200)]
print(mid_range)   # [120  85 200  95 150]

# Negate a condition
not_low = sales[~(sales < 100)]
print(not_low)     # [120 200 310 150]

# Count how many elements satisfy a condition
print(f"Days above target: {(sales > 100).sum()}")  # 4
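A boolean mask also works on the left-hand side of an assignment, modifying the matched elements in place. A small sketch (working on a copy so the original stays intact):

```python
import numpy as np

sales = np.array([120, 85, 200, 45, 310, 95, 150])

# Masked assignment: zero out every below-target day
adjusted = sales.copy()
adjusted[adjusted < 100] = 0
print(adjusted)  # [120   0 200   0 310   0 150]
```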

Fancy Indexing

Fancy indexing uses an array of integers to select specific rows or columns by position. Unlike boolean indexing which selects by condition, fancy indexing selects by explicit position numbers.

products = np.array(["Widget", "Gadget", "Bracket", "Bolt", "Nut"])
revenues = np.array([15000, 4600, 5400, 18500, 2200])

# Select the 1st, 3rd, and 4th products
idx = np.array([0, 2, 3])
print(products[idx])  # ['Widget' 'Bracket' 'Bolt']
print(revenues[idx])  # [15000  5400 18500]

# Combine with np.argsort to get top-N items
top3_idx = np.argsort(revenues)[-3:][::-1]  # indices of 3 largest
print(products[top3_idx])  # ['Bolt' 'Widget' 'Bracket']

4.4 np.where: Conditional Selection

np.where(condition, x, y) returns x where the condition is True and y where it is False. Think of it as a vectorized if-else statement. This is dramatically faster than looping through each element.

demand = np.array([120, 85, 200, 45, 310])

# Classify each day
labels = np.where(demand > 100, "High", "Low")
print(labels)  # ['High' 'Low' 'High' 'Low' 'High']

# Cap demand at 200 (winsorize)
capped = np.where(demand > 200, 200, demand)
print(capped)  # [120  85 200  45 200]

# Find indices where condition is True
high_idx = np.where(demand > 100)[0]
print(high_idx)  # [0 2 4]
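When there are more than two outcomes, np.select extends this idea: it takes a list of conditions and a matching list of choices, evaluated in order, with a default for everything else. A sketch using the same demand data:

```python
import numpy as np

demand = np.array([120, 85, 200, 45, 310])

# Conditions are checked in order; `default` covers the rest
labels = np.select(
    [demand > 250, demand > 100],
    ["Very High", "High"],
    default="Low",
)
print(labels)  # ['High' 'Low' 'High' 'Low' 'Very High']
```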

4.5 Vectorized Operations

NumPy operations apply element-wise without explicit loops. This is orders of magnitude faster than Python for-loops on large arrays.

units  = np.array([500, 320, 750, 410])
prices = np.array([29.99, 14.50, 7.25, 45.00])

# Element-wise operations
revenue = units * prices
print(revenue)  # [14995. 4640. 5437.5 18450.]

# Aggregations
print(f"Total revenue: ${revenue.sum():,.2f}")
print(f"Average:       ${revenue.mean():,.2f}")
print(f"Std dev:       ${revenue.std():,.2f}")
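On 2-D arrays, aggregations take an axis argument: axis=0 collapses the rows (one value per column), while axis=1 collapses the columns (one value per row). A quick illustration:

```python
import numpy as np

# Rows are stores, columns are months
monthly = np.array([[100, 120, 110],
                    [200, 190, 210]])

print(monthly.sum(axis=0))   # [300 310 320]  totals per month
print(monthly.sum(axis=1))   # [330 600]      totals per store
print(monthly.mean(axis=1))  # [110. 200.]    average per store
```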

Broadcasting

Broadcasting lets NumPy perform operations on arrays of different shapes. A scalar gets "stretched" to match the array dimension.

# Apply a 10% price increase to all products
new_prices = prices * 1.10
print(new_prices)  # [32.989, 15.95, 7.975, 49.5]

# 2D example: each row is a store, columns are months
monthly = np.array([[100, 120, 110],
                    [200, 190, 210]])
weights = np.array([0.3, 0.3, 0.4])  # month weights
weighted_avg = (monthly * weights).sum(axis=1)
print(weighted_avg)  # [110. 201.]
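Broadcasting also works with column vectors: an array of shape (2, 1) stretches across the columns of a (2, 3) array. A sketch that normalizes each store's row by its own total (keepdims=True preserves the summed axis so the shapes line up):

```python
import numpy as np

monthly = np.array([[100, 120, 110],
                    [200, 190, 210]])

# Row totals as a (2, 1) column vector
row_totals = monthly.sum(axis=1, keepdims=True)

# (2, 3) / (2, 1) broadcasts the divisor across each row
shares = monthly / row_totals
print(shares.round(3))  # each row sums to 1.0
```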

4.6 Random Number Generation

rng = np.random.default_rng(seed=42)

# Simulate daily demand (normal distribution)
demand = rng.normal(loc=100, scale=15, size=365)
print(f"Simulated year: mean={demand.mean():.1f}, std={demand.std():.1f}")

# Uniform, integers, choice
lead_times = rng.uniform(low=3, high=10, size=50)
dice_rolls = rng.integers(low=1, high=7, size=100)
sample     = rng.choice(["A", "B", "C"], size=10, p=[0.5, 0.3, 0.2])
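Seeding matters for reproducibility: two generators created with the same seed produce identical streams, which makes simulation results repeatable and debuggable. A quick check:

```python
import numpy as np

# Same seed, same stream of numbers
a = np.random.default_rng(7).normal(size=3)
b = np.random.default_rng(7).normal(size=3)
print(np.array_equal(a, b))  # True
```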

4.7 Concatenating and Stacking Arrays

When combining data from multiple sources, you often need to join arrays together. np.concatenate joins along an existing axis, while np.vstack and np.hstack provide convenient vertical and horizontal stacking.

# 1D concatenation
q1 = np.array([100, 120, 110])
q2 = np.array([130, 125, 140])
full_year = np.concatenate([q1, q2])
print(full_year)  # [100 120 110 130 125 140]

# 2D: vertical stack (add rows)
east = np.array([[10, 20], [30, 40]])
west = np.array([[50, 60]])
combined = np.vstack([east, west])
print(combined)
# [[10 20]
#  [30 40]
#  [50 60]]

# 2D: horizontal stack (add columns)
prices = np.array([[29.99], [14.50], [7.25]])
quantities = np.array([[500], [320], [750]])
product_data = np.hstack([prices, quantities])
print(product_data)
# [[ 29.99 500.  ]
#  [ 14.50 320.  ]
#  [  7.25 750.  ]]

concatenate vs vstack/hstack: np.vstack is shorthand for np.concatenate(..., axis=0), and np.hstack for axis=1 on 2D arrays (on 1D arrays, hstack simply joins end-to-end, like concatenate). Use whichever reads more clearly in context. For building a 2D table from 1D arrays, np.column_stack is handy because it treats each 1D array as a column.
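A quick sketch of np.column_stack building a 2D table directly from 1D arrays:

```python
import numpy as np

prices     = np.array([29.99, 14.50, 7.25])
quantities = np.array([500, 320, 750])

# Each 1D array becomes one column of the result
table = np.column_stack([prices, quantities])
print(table.shape)  # (3, 2)
print(table)
```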

4.8 Linear Algebra with np.linalg

NumPy's linalg module provides all the standard linear algebra operations. These are essential for optimization, regression, and simulation models.

A = np.array([[2, 1], [5, 3]])
b = np.array([8, 19])

# Solve Ax = b
x = np.linalg.solve(A, b)
print(f"Solution: x = {x}")  # [ 5. -2.]

# Determinant and inverse
print(f"Determinant: {np.linalg.det(A):.0f}")  # 1
print(f"Inverse:\n{np.linalg.inv(A)}")

# Matrix multiplication
C = A @ A  # or np.dot(A, A)
print(C)

# Eigenvalues and eigenvectors
cov_matrix = np.array([[2.0, 0.8],
                        [0.8, 1.5]])
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print(f"Eigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")

# Matrix norm (useful for measuring "size" of errors)
error = np.array([0.5, -0.3, 0.8])
print(f"L2 norm: {np.linalg.norm(error):.4f}")

Why linalg matters for analytics: Linear regression is fundamentally a linear algebra operation: the OLS estimator is (X'X)^(-1) X'y. Principal Component Analysis uses eigendecomposition. Understanding these operations helps you debug model fitting issues and work with optimization libraries like scipy.optimize.
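As a sketch (the data here is synthetic, invented for illustration), np.linalg.lstsq fits the OLS coefficients directly from a design matrix:

```python
import numpy as np

# Synthetic data: true intercept 2.0, true slope 3.0, plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])
coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # close to [2., 3.]
```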

4.9 Performance: Loops vs. Vectorized Operations

The single most important concept in NumPy is avoiding Python-level loops over array elements. NumPy's internal loops are written in C, so they run orders of magnitude faster. The following example demonstrates the difference.

import time

n = 1_000_000
a = np.random.default_rng(42).normal(size=n)
b = np.random.default_rng(99).normal(size=n)

# Slow: Python for-loop
start = time.time()
result_loop = [a[i] * b[i] for i in range(n)]
loop_time = time.time() - start

# Fast: vectorized NumPy
start = time.time()
result_vec = a * b
vec_time = time.time() - start

print(f"Loop:       {loop_time:.4f}s")
print(f"Vectorized: {vec_time:.6f}s")
print(f"Speedup:    {loop_time / vec_time:.0f}x")
Typical output:

Loop:       0.2500s
Vectorized: 0.001200s
Speedup:    208x

Performance tip: Avoid looping over array elements in Python. Use vectorized NumPy operations instead. A vectorized operation on a million-element array can be 100-200x faster than a for-loop. In Jupyter, use %timeit for reliable benchmarks.

4.10 Structured Arrays

Structured arrays let you store heterogeneous data (like a database record) in a NumPy array, where each column has a name and type. They are a lightweight alternative to pandas when you need typed tabular data without the overhead of a DataFrame.

# Define a custom data type
dt = np.dtype([
    ("name", "U10"),       # Unicode string, max 10 chars
    ("price", "f8"),        # 64-bit float
    ("quantity", "i4")      # 32-bit integer
])

products = np.array([
    ("Widget",  29.99, 500),
    ("Gadget",  14.50, 320),
    ("Bracket", 7.25,  750)
], dtype=dt)

# Access by field name (like a DataFrame column)
print(products["name"])      # ['Widget' 'Gadget' 'Bracket']
print(products["price"])     # [29.99 14.5   7.25]

# Vectorized computation on a field
revenue = products["price"] * products["quantity"]
print(revenue)  # [14995. 4640. 5437.5]
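One convenience worth knowing: np.sort accepts an order= argument for structured arrays, sorting whole records by a named field. A sketch using the same products array:

```python
import numpy as np

dt = np.dtype([("name", "U10"), ("price", "f8"), ("quantity", "i4")])
products = np.array([
    ("Widget",  29.99, 500),
    ("Gadget",  14.50, 320),
    ("Bracket",  7.25, 750),
], dtype=dt)

# Sort records by the "price" field (ascending)
by_price = np.sort(products, order="price")
print(by_price["name"])  # ['Bracket' 'Gadget' 'Widget']
```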

4.11 Saving and Loading Arrays

NumPy provides efficient binary formats for persisting arrays to disk. Use .npy for single arrays and .npz for multiple arrays. These formats preserve data types and are much faster to load than CSV files.

# Save a single array
data = np.array([1.5, 2.3, 4.1, 3.7])
np.save("demand.npy", data)
loaded = np.load("demand.npy")
print(loaded)  # [1.5 2.3 4.1 3.7]

# Save multiple arrays in a single compressed file
prices = np.array([29.99, 14.50, 7.25])
quantities = np.array([500, 320, 750])
np.savez_compressed("products.npz", prices=prices, quantities=quantities)

# Load the .npz file
archive = np.load("products.npz")
print(archive["prices"])      # [29.99 14.5   7.25]
print(archive["quantities"])  # [500 320 750]

# For human-readable format (slower, larger files)
np.savetxt("data.csv", data, delimiter=",")
loaded_csv = np.loadtxt("data.csv", delimiter=",")

When to use .npy vs CSV: Use .npy/.npz for intermediate results in a pipeline where you need fast I/O and exact type preservation. Use CSV when you need to share data with non-Python tools (Excel, R, Stata) or want human-readable files for version control.

Exercise 4.1

Simulate 10,000 days of demand from a normal distribution (mean=500, std=80). Compute the probability that daily demand exceeds 600 units. Then compute the 95th percentile of demand using np.percentile().

Exercise 4.2

Create a 5x3 matrix where each row represents a product and columns are [cost, revenue, units]. Compute a profit column (revenue - cost) using vectorized operations, then find which product has the highest profit.

Exercise 4.3

Create two arrays: actual = np.array([100, 120, 95, 110, 130]) and forecast = np.array([105, 115, 100, 108, 125]). Compute the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) using vectorized operations. Then use np.where() to classify each period as "Over" (forecast > actual) or "Under".

Exercise 4.4

Write a timing experiment comparing a Python for-loop vs. a vectorized NumPy operation for computing the dot product of two 500,000-element arrays. Use time.time() or %timeit to measure both. Report the speedup factor.

