Arrays and Numerical Computing
NumPy (Numerical Python) is the foundation of the Python data science stack. It provides the ndarray, a fast, memory-efficient multidimensional array that supports vectorized operations. Pandas, scikit-learn, and matplotlib all build on NumPy arrays internally.
import numpy as np
# From a list
sales = np.array([120, 135, 150, 142, 160])

# Common constructors
zeros = np.zeros((3, 4))      # 3x4 matrix of zeros
ones = np.ones(5)             # [1. 1. 1. 1. 1.]
rng = np.arange(0, 100, 10)   # [ 0 10 20 ... 90]
grid = np.linspace(0, 1, 5)   # [0.   0.25 0.5  0.75 1.  ]
ident = np.eye(3)             # 3x3 identity matrix

print(sales.shape)  # (5,)
print(zeros.shape)  # (3, 4)
print(sales.dtype)  # int64
data = np.array([[10, 20, 30],
                 [40, 50, 60],
                 [70, 80, 90]])

print(data[0, 2])    # 30 (row 0, col 2)
print(data[1, :])    # [40 50 60] (entire row 1)
print(data[:, 0])    # [10 40 70] (entire col 0)
print(data[:2, 1:])  # [[20 30] [50 60]] (submatrix)

# Boolean indexing
print(data[data > 50])  # [60 70 80 90]
Boolean indexing is one of NumPy's most powerful features. You create a boolean array (True/False mask) using a comparison, then pass that mask as an index. This is analogous to a WHERE clause in SQL. You can combine multiple conditions using & (and), | (or), and ~ (not), but you must wrap each condition in parentheses.
sales = np.array([120, 85, 200, 45, 310, 95, 150])

# Single condition
high_sales = sales[sales > 100]
print(high_sales)  # [120 200 310 150]

# Multiple conditions: parentheses are required
mid_range = sales[(sales >= 80) & (sales <= 200)]
print(mid_range)  # [120  85 200  95 150]

# Negate a condition
not_low = sales[~(sales < 100)]
print(not_low)  # [120 200 310 150]

# Count how many elements satisfy a condition
print(f"Days above target: {(sales > 100).sum()}")  # 4
Fancy indexing uses an array of integers to select specific rows or columns by position. Unlike boolean indexing which selects by condition, fancy indexing selects by explicit position numbers.
products = np.array(["Widget", "Gadget", "Bracket", "Bolt", "Nut"])
revenues = np.array([15000, 4600, 5400, 18500, 2200])

# Select the 1st, 3rd, and 4th products
idx = np.array([0, 2, 3])
print(products[idx])   # ['Widget' 'Bracket' 'Bolt']
print(revenues[idx])   # [15000  5400 18500]

# Combine with np.argsort to get top-N items
top3_idx = np.argsort(revenues)[-3:][::-1]  # indices of the 3 largest
print(products[top3_idx])  # ['Bolt' 'Widget' 'Bracket']
np.where(condition, x, y) returns x where the condition is True and y where it is False. Think of it as a vectorized if-else statement. This is dramatically faster than looping through each element.
demand = np.array([120, 85, 200, 45, 310])

# Classify each day
labels = np.where(demand > 100, "High", "Low")
print(labels)  # ['High' 'Low' 'High' 'Low' 'High']

# Cap demand at 200 (winsorize)
capped = np.where(demand > 200, 200, demand)
print(capped)  # [120  85 200  45 200]

# Find indices where the condition is True
high_idx = np.where(demand > 100)[0]
print(high_idx)  # [0 2 4]
NumPy operations apply element-wise without explicit loops. This is orders of magnitude faster than Python for-loops on large arrays.
units = np.array([500, 320, 750, 410])
prices = np.array([29.99, 14.50, 7.25, 45.00])

# Element-wise operations
revenue = units * prices
print(revenue)  # [14995.   4640.   5437.5 18450. ]

# Aggregations
print(f"Total revenue: ${revenue.sum():,.2f}")
print(f"Average: ${revenue.mean():,.2f}")
print(f"Std dev: ${revenue.std():,.2f}")
Broadcasting lets NumPy perform operations on arrays of different shapes. A scalar gets "stretched" to match the array dimension.
# Apply a 10% price increase to all products
new_prices = prices * 1.10
print(new_prices)  # [32.989 15.95   7.975 49.5  ]

# 2D example: each row is a store, columns are months
monthly = np.array([[100, 120, 110],
                    [200, 190, 210]])
weights = np.array([0.3, 0.3, 0.4])  # month weights
weighted_avg = (monthly * weights).sum(axis=1)
print(weighted_avg)  # [110. 201.]
NumPy's random module supports simulation work; np.random.default_rng(seed) returns a generator whose draws are reproducible across runs.

rng = np.random.default_rng(seed=42)

# Simulate daily demand (normal distribution)
demand = rng.normal(loc=100, scale=15, size=365)
print(f"Simulated year: mean={demand.mean():.1f}, std={demand.std():.1f}")

# Uniform, integers, choice
lead_times = rng.uniform(low=3, high=10, size=50)
dice_rolls = rng.integers(low=1, high=7, size=100)  # high is exclusive
sample = rng.choice(["A", "B", "C"], size=10, p=[0.5, 0.3, 0.2])
When combining data from multiple sources, you often need to join arrays together. np.concatenate joins along an existing axis, while np.vstack and np.hstack provide convenient vertical and horizontal stacking.
# 1D concatenation
q1 = np.array([100, 120, 110])
q2 = np.array([130, 125, 140])
full_year = np.concatenate([q1, q2])
print(full_year)  # [100 120 110 130 125 140]

# 2D: vertical stack (add rows)
east = np.array([[10, 20], [30, 40]])
west = np.array([[50, 60]])
combined = np.vstack([east, west])
print(combined)
# [[10 20]
#  [30 40]
#  [50 60]]

# 2D: horizontal stack (add columns)
prices = np.array([[29.99], [14.50], [7.25]])
quantities = np.array([[500], [320], [750]])
product_data = np.hstack([prices, quantities])
print(product_data)
# [[ 29.99 500.  ]
#  [ 14.5  320.  ]
#  [  7.25 750.  ]]
np.vstack is shorthand for np.concatenate(..., axis=0), and for 2D arrays np.hstack corresponds to axis=1 (for 1D arrays it simply concatenates end to end). Use whichever reads more clearly in context. For 1D arrays, np.column_stack is handy because it treats each 1D array as a column.
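As a quick sketch (reusing the product data from above), np.column_stack builds a 2D table directly from 1D arrays, without the manual reshaping that np.hstack would require:

```python
import numpy as np

prices = np.array([29.99, 14.50, 7.25])
quantities = np.array([500, 320, 750])

# Each 1D input becomes one column of the 2D result
table = np.column_stack([prices, quantities])
print(table.shape)  # (3, 2)
print(table)
# [[ 29.99 500.  ]
#  [ 14.5  320.  ]
#  [  7.25 750.  ]]
```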
NumPy's linalg module provides all the standard linear algebra operations. These are essential for optimization, regression, and simulation models.
A = np.array([[2, 1], [5, 3]])
b = np.array([8, 19])

# Solve Ax = b
x = np.linalg.solve(A, b)
print(f"Solution: x = {x}")  # [ 5. -2.]

# Determinant and inverse
print(f"Determinant: {np.linalg.det(A):.0f}")  # 1
print(f"Inverse:\n{np.linalg.inv(A)}")

# Matrix multiplication
C = A @ A  # or np.dot(A, A)
print(C)

# Eigenvalues and eigenvectors
cov_matrix = np.array([[2.0, 0.8], [0.8, 1.5]])
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print(f"Eigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")

# Matrix norm (useful for measuring the "size" of errors)
error = np.array([0.5, -0.3, 0.8])
print(f"L2 norm: {np.linalg.norm(error):.4f}")  # 0.9899
These operations underpin many analytics methods: ordinary least squares regression has the closed-form solution (X'X)^(-1) X'y, and Principal Component Analysis uses eigendecomposition. Understanding these operations helps you debug model-fitting issues and work with optimization libraries like scipy.optimize.
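As an illustration on simulated data (the intercept 2.0 and slope 0.5 below are made up), the normal-equations formula can be evaluated directly with np.linalg.solve, though np.linalg.lstsq is the numerically safer choice in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, size=50)])  # intercept + one feature
beta_true = np.array([2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=50)  # add small noise

# Normal equations: beta = (X'X)^(-1) X'y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Preferred in practice: least squares via SVD
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_ne)  # both estimates are close to [2.0, 0.5]
print(beta_ls)
```

Both routes give the same answer here, but lstsq stays stable even when X'X is ill-conditioned.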
The single most important concept in NumPy is avoiding Python-level loops over array elements. NumPy's internal loops are written in C, so they run orders of magnitude faster. The following example demonstrates the difference.
import time

n = 1_000_000
a = np.random.default_rng(42).normal(size=n)
b = np.random.default_rng(99).normal(size=n)

# Slow: Python for-loop
start = time.time()
result_loop = [a[i] * b[i] for i in range(n)]
loop_time = time.time() - start

# Fast: vectorized NumPy
start = time.time()
result_vec = a * b
vec_time = time.time() - start

print(f"Loop: {loop_time:.4f}s")
print(f"Vectorized: {vec_time:.6f}s")
print(f"Speedup: {loop_time / vec_time:.0f}x")
A single time.time() measurement is noisy; in a Jupyter notebook, prefer the %timeit magic for reliable benchmarks.
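Outside a notebook, the standard-library timeit module offers the same protection against one-off timing noise by running the statement many times. A minimal sketch (array sizes and run counts are arbitrary):

```python
import timeit
import numpy as np

a = np.random.default_rng(42).normal(size=100_000)
b = np.random.default_rng(99).normal(size=100_000)

# timeit.timeit runs the callable `number` times and returns the total
# elapsed seconds, smoothing over jitter a single measurement would miss
vec_time = timeit.timeit(lambda: a * b, number=100)
loop_time = timeit.timeit(lambda: [x * y for x, y in zip(a, b)], number=5)

print(f"Vectorized (100 runs): {vec_time:.4f}s")
print(f"Loop       (5 runs):   {loop_time:.4f}s")
```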
Structured arrays let you store heterogeneous data (like a database record) in a NumPy array, where each column has a name and type. They are a lightweight alternative to pandas when you need typed tabular data without the overhead of a DataFrame.
# Define a custom data type
dt = np.dtype([
    ("name", "U10"),      # Unicode string, max 10 chars
    ("price", "f8"),      # 64-bit float
    ("quantity", "i4")    # 32-bit integer
])

products = np.array([
    ("Widget", 29.99, 500),
    ("Gadget", 14.50, 320),
    ("Bracket", 7.25, 750)
], dtype=dt)

# Access by field name (like a DataFrame column)
print(products["name"])   # ['Widget' 'Gadget' 'Bracket']
print(products["price"])  # [29.99 14.5   7.25]

# Vectorized computation on a field
revenue = products["price"] * products["quantity"]
print(revenue)  # [14995.   4640.   5437.5]
NumPy provides efficient binary formats for persisting arrays to disk. Use .npy for single arrays and .npz for multiple arrays. These formats preserve data types and are much faster to load than CSV files.
# Save a single array
data = np.array([1.5, 2.3, 4.1, 3.7])
np.save("demand.npy", data)
loaded = np.load("demand.npy")
print(loaded)  # [1.5 2.3 4.1 3.7]

# Save multiple arrays in a single compressed file
prices = np.array([29.99, 14.50, 7.25])
quantities = np.array([500, 320, 750])
np.savez_compressed("products.npz", prices=prices, quantities=quantities)

# Load the .npz file
archive = np.load("products.npz")
print(archive["prices"])      # [29.99 14.5   7.25]
print(archive["quantities"])  # [500 320 750]

# For a human-readable format (slower, larger files)
np.savetxt("data.csv", data, delimiter=",")
loaded_csv = np.loadtxt("data.csv", delimiter=",")
Use .npy/.npz for intermediate results in a pipeline where you need fast I/O and exact type preservation. Use CSV when you need to share data with non-Python tools (Excel, R, Stata) or want human-readable files for version control.
Simulate 10,000 days of demand from a normal distribution (mean=500, std=80). Compute the probability that daily demand exceeds 600 units. Then compute the 95th percentile of demand using np.percentile().
Create a 5x3 matrix where each row represents a product and columns are [cost, revenue, units]. Compute a profit column (revenue - cost) using vectorized operations, then find which product has the highest profit.
Create two arrays: actual = np.array([100, 120, 95, 110, 130]) and forecast = np.array([105, 115, 100, 108, 125]). Compute the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) using vectorized operations. Then use np.where() to classify each period as "Over" (forecast > actual) or "Under".
Write a timing experiment comparing a Python for-loop vs. a vectorized NumPy operation for computing the dot product of two 500,000-element arrays. Use time.time() or %timeit to measure both. Report the speedup factor.
- np.where() is a vectorized if-else.
- Use np.concatenate, np.vstack, and np.hstack to combine arrays from multiple sources.
- np.linalg provides solve, inverse, eigendecomposition, and norms for matrix computation.
- Use np.save()/np.savez_compressed() for fast binary I/O; use CSV only for interoperability.