Chapter 5: Statistical Testing & Inference

Hypothesis tests, confidence intervals, and assumption checks in R.

5.1 The Testing Framework

Most R test functions return an "htest" object with components such as $statistic, $p.value, and $conf.int. You can print the full result or extract individual components programmatically.

# Generate sample data
set.seed(42)
group_a <- rnorm(30, mean = 100, sd = 15)
group_b <- rnorm(30, mean = 108, sd = 15)

result <- t.test(group_a, group_b)
result               # full output
result$p.value       # just the p-value
result$conf.int      # 95% CI for difference in means

5.2 t-Tests

Two-Sample t-Test

# Default: Welch's t-test (does not assume equal variance)
t.test(group_a, group_b)

# Equal-variance (Student's) t-test
t.test(group_a, group_b, var.equal = TRUE)

One-Sample t-Test

# Test if group_a mean differs from 100
t.test(group_a, mu = 100)

Paired t-Test

# Before-after design
before <- c(82, 76, 91, 88, 73)
after  <- c(88, 80, 95, 90, 78)
t.test(before, after, paired = TRUE)

5.3 Chi-Squared Test

Tests for independence between two categorical variables.

# Contingency table
survey <- matrix(c(30, 10, 20, 40), nrow = 2,
                  dimnames = list(Gender = c("M","F"),
                                  Pref = c("A","B")))
chisq.test(survey)

5.4 ANOVA

Analysis of Variance (ANOVA) tests whether the means of three or more groups differ.

# One-way ANOVA using the mpg dataset
library(ggplot2)   # mpg ships with ggplot2, not dplyr
model_aov <- aov(hwy ~ drv, data = mpg)
summary(model_aov)

# Post-hoc pairwise comparisons
TukeyHSD(model_aov)

ANOVA assumptions

One-way ANOVA assumes normality within groups and homogeneity of variance. Check normality with shapiro.test() on the model residuals and variance equality with bartlett.test() or car::leveneTest(). If the assumptions fail, consider kruskal.test() (non-parametric).
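
Those checks can be run directly on the fitted model. A minimal sketch (the model is refit here so the block is self-contained):

```r
library(ggplot2)                          # provides the mpg dataset

model_aov <- aov(hwy ~ drv, data = mpg)   # same model as above

# Normality of residuals (a small p-value is evidence against normality)
shapiro.test(residuals(model_aov))

# Homogeneity of variance across drive-train groups
bartlett.test(hwy ~ drv, data = mpg)

# Rank-based fallback if either assumption fails
kruskal.test(hwy ~ drv, data = mpg)
```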

5.5 Correlation Tests

# Pearson correlation (default)
cor.test(mpg$displ, mpg$hwy)

# Spearman rank correlation (non-parametric)
# (ties in the data trigger a warning; R falls back to an asymptotic p-value)
cor.test(mpg$displ, mpg$hwy, method = "spearman")

# Correlation matrix for multiple variables
mpg |>
  dplyr::select(displ, cty, hwy) |>
  cor()

5.6 Proportion Tests and Normality Checks

# Test if a proportion differs from 0.5
prop.test(x = 58, n = 100, p = 0.5)

# Shapiro-Wilk normality test
shapiro.test(group_a)  # p > 0.05: no evidence against normality
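
In large samples shapiro.test() rejects normality even for trivial departures, so a visual check is a useful complement. A quick sketch using the same simulated data as above:

```r
# Q-Q plot: points close to the reference line suggest approximate normality
set.seed(42)
group_a <- rnorm(30, mean = 100, sd = 15)   # same data as above

qqnorm(group_a)
qqline(group_a, col = "red")
```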

5.7 Confidence Intervals

# The t.test result already includes a CI
t.test(group_a)$conf.int

# Manual CI for a mean
x_bar <- mean(group_a)
se    <- sd(group_a) / sqrt(length(group_a))
t_crit <- qt(0.975, df = length(group_a) - 1)

c(x_bar - t_crit * se, x_bar + t_crit * se)

5.8 Effect Size: Cohen's d

A p-value tells you whether an effect exists, but not how large it is. Effect sizes quantify the magnitude of a difference, making results comparable across studies. Cohen's d is the most common measure for comparing two means.

# Manual calculation of Cohen's d
cohens_d <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  pooled_sd <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  (mean(x) - mean(y)) / pooled_sd
}
cohens_d(group_a, group_b)

# Using the effectsize package (more convenient)
# install.packages("effectsize")
library(effectsize)
# Our manual cohens_d() above masks the package function of the same
# name, so call it with an explicit namespace prefix:
effectsize::cohens_d(group_a, group_b)     # includes a CI for d

Interpreting Cohen's d

Cohen (1988) proposed rough benchmarks: |d| = 0.2 is "small," |d| = 0.5 is "medium," and |d| = 0.8 is "large." These are only guidelines. In some research contexts (e.g., clinical trials, educational interventions), even a small d can be practically important.
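
These benchmarks are easy to encode in a small helper. interpret_d() below is a hypothetical convenience function, not part of any package (the effectsize package ships similar interpret_* helpers):

```r
# Hypothetical helper encoding Cohen's (1988) benchmarks
interpret_d <- function(d) {
  ad <- abs(d)
  if (ad < 0.2)      "negligible"
  else if (ad < 0.5) "small"
  else if (ad < 0.8) "medium"
  else               "large"
}

interpret_d(0.45)   # "small"
interpret_d(-0.9)   # "large"
```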

5.9 Power Analysis with the pwr Package

Statistical power is the probability of detecting a real effect. Before collecting data, a power analysis tells you how many observations you need. The conventional target is 80% power.

# install.packages("pwr")
library(pwr)

# Two-sample t-test: how many per group to detect d = 0.5?
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80,
           type = "two.sample")
# n = 63.77 per group — round up to 64

# What power do we have with n = 30 per group?
pwr.t.test(n = 30, d = 0.5, sig.level = 0.05,
           type = "two.sample")
# power = 0.478 — underpowered!

# Power for ANOVA (k groups)
pwr.anova.test(k = 3, f = 0.25, sig.level = 0.05, power = 0.80)

# Power for correlation test
pwr.r.test(r = 0.3, sig.level = 0.05, power = 0.80)

Always do power analysis before data collection

Post-hoc power analysis (computing power after seeing the results) is widely considered uninformative because it is a direct function of the p-value. Plan your sample size in advance using expected effect sizes from pilot data or prior literature.

5.10 Non-Parametric Alternatives

When your data violate normality assumptions (small samples, skewed distributions, ordinal data), non-parametric tests make fewer distributional assumptions.

# Wilcoxon rank-sum test (= Mann-Whitney U)
# Non-parametric alternative to two-sample t-test
wilcox.test(group_a, group_b)

# Wilcoxon signed-rank test (paired)
wilcox.test(before, after, paired = TRUE)

# Kruskal-Wallis test (non-parametric ANOVA)
kruskal.test(hwy ~ drv, data = mpg)

# Post-hoc pairwise comparisons for Kruskal-Wallis
pairwise.wilcox.test(mpg$hwy, mpg$drv, p.adjust.method = "BH")

When to use non-parametric tests

Use non-parametric tests when: (1) the sample size is very small (n < 15 per group), (2) the Shapiro-Wilk test rejects normality, (3) the data are ordinal (e.g., Likert scales), or (4) there are heavy outliers you cannot remove. The tradeoff: non-parametric tests have slightly less power than their parametric counterparts when the parametric assumptions are met.

5.11 Multiple Comparison Corrections

When you perform many tests simultaneously, the chance of at least one false positive inflates rapidly. If you run 20 tests at alpha = 0.05, you expect one false positive on average even if all null hypotheses are true.
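
The inflation is easy to compute directly: for m independent tests at level alpha, the probability of at least one false positive is 1 - (1 - alpha)^m.

```r
# Family-wise error rate for m independent tests at alpha = 0.05
alpha <- 0.05
m     <- 1:20
fwer  <- 1 - (1 - alpha)^m

fwer[1]    # 0.05  (a single test)
fwer[20]   # ~0.64 (twenty tests: a 64% chance of at least one false positive)
```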

# Suppose we have p-values from multiple tests
p_vals <- c(0.003, 0.012, 0.048, 0.07, 0.15, 0.42)

# Bonferroni correction (conservative — controls FWER)
p.adjust(p_vals, method = "bonferroni")
# 0.018  0.072  0.288  0.420  0.900  1.000

# Benjamini-Hochberg (FDR — less conservative, more powerful)
p.adjust(p_vals, method = "BH")
# 0.018  0.036  0.096  0.105  0.180  0.420

# Holm method (step-down Bonferroni — better than plain Bonferroni)
p.adjust(p_vals, method = "holm")
Method               Controls                 Conservatism        When to Use
Bonferroni           Family-wise error rate   Very conservative   Small number of pre-planned comparisons
Holm                 Family-wise error rate   Less conservative   General-purpose, uniformly better than Bonferroni
Benjamini-Hochberg   False discovery rate     Moderate            Exploratory analysis, genomics, many tests

5.12 Bootstrap Confidence Intervals

Bootstrapping constructs confidence intervals by resampling from your data. This is useful when the sampling distribution is unknown or when your statistic is not a simple mean.

# install.packages("boot")
library(boot)

# Define the statistic function: must take data and index arguments
boot_mean <- function(data, indices) {
  mean(data[indices])
}

# Run 10,000 bootstrap replications
set.seed(42)
boot_result <- boot(group_a, boot_mean, R = 10000)
boot_result

# Get confidence intervals (multiple methods)
boot.ci(boot_result, type = c("norm", "perc", "bca"))

# Bootstrap a median (no formula for its SE)
boot_median <- function(data, indices) median(data[indices])
med_result <- boot(group_a, boot_median, R = 10000)
boot.ci(med_result, type = "perc")

# Visualize the bootstrap distribution
hist(boot_result$t, breaks = 50, col = "#276DC3",
     main = "Bootstrap Distribution of the Mean",
     xlab = "Sample Mean")

Tip: Why use the bootstrap?

Bootstrap CIs are especially useful for medians, trimmed means, ratios, correlation coefficients, regression coefficients in small samples, or any statistic whose theoretical sampling distribution is hard to derive. With 10,000 replications, the BCa (bias-corrected and accelerated) interval is generally the most accurate.

5.13 Reporting Results in APA Format

Academic papers in psychology, education, and many social sciences follow APA (American Psychological Association) reporting guidelines. Even if your field uses a different style, these principles promote clear communication of statistical results.

# Example: reporting a two-sample t-test
result <- t.test(group_a, group_b)

# APA format: t(df) = statistic, p = value
cat(sprintf(
  "t(%0.1f) = %0.2f, p = %0.3f",
  result$parameter,
  result$statistic,
  result$p.value
))
# Example output: t(57.4) = -2.15, p = 0.036

# For very small p-values, report as p < .001
format_p <- function(p) {
  if (p < 0.001) return("p < .001")
  sprintf("p = %0.3f", p)
}

# The report package automates APA formatting
# install.packages("report")
library(report)
report(t.test(group_a, group_b))

APA reporting checklist

When reporting statistical results, always include: (1) the test statistic with its degrees of freedom, (2) the exact p-value (unless p < .001), (3) an effect size measure (Cohen's d, eta-squared, etc.), (4) a confidence interval, and (5) sample sizes. Round test statistics to two decimal places and p-values to three.
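
The checklist items can be folded into a single formatter. report_t() below is a hypothetical helper (the name and layout are illustrative, not from any package) that assembles the statistic, df, p-value, Cohen's d, and CI into one string:

```r
# Hypothetical helper assembling the checklist items into one APA string
report_t <- function(x, y) {
  tt <- t.test(x, y)
  # Pooled-SD Cohen's d, as defined earlier in the chapter
  pooled_sd <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                      (length(x) + length(y) - 2))
  d <- (mean(x) - mean(y)) / pooled_sd
  p_txt <- if (tt$p.value < 0.001) "p < .001" else
    sprintf("p = %.3f", tt$p.value)
  sprintf("t(%.1f) = %.2f, %s, d = %.2f, 95%% CI [%.2f, %.2f]",
          tt$parameter, tt$statistic, p_txt,
          d, tt$conf.int[1], tt$conf.int[2])
}

set.seed(42)
report_t(rnorm(30, 100, 15), rnorm(30, 108, 15))
```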

5.14 Quick Reference Table

Scenario                   Function                      Key Argument / Note
Two-sample means           t.test(x, y)                  var.equal
Paired means               t.test(x, y, paired = TRUE)
3+ group means             aov(y ~ group)                TukeyHSD() for post-hoc
Categorical independence   chisq.test(table)
Correlation                cor.test(x, y)                method
Proportion                 prop.test(x, n)               p
Normality                  shapiro.test(x)
Non-parametric 2-sample    wilcox.test(x, y)             Mann-Whitney U
Non-parametric 3+ groups   kruskal.test(y ~ g)
Effect size                effectsize::cohens_d(x, y)
Power analysis             pwr::pwr.t.test()             d, power, n
Bootstrap CI               boot::boot()                  R (replications)
p-value correction         p.adjust(p, method)           "BH", "bonferroni"

Exercises

Exercise 5.1

Using the mtcars dataset, test whether cars with automatic transmission (am == 0) have different mpg than manual cars (am == 1). Report the t-statistic, p-value, and 95% CI. Is the difference statistically significant?

Exercise 5.2

Run a one-way ANOVA to test whether mpg differs by number of cylinders (cyl) in mtcars. If significant, perform Tukey's HSD. Which pairs of cylinder counts differ?

Exercise 5.3

You are designing an experiment to test whether a new training program improves exam scores. Based on a pilot study, you expect a medium effect size (d = 0.5). Using the pwr package, determine how many participants you need per group to achieve 80% power at alpha = 0.05. What happens to the required sample size if you want 90% power instead?

Exercise 5.4

Using the mtcars dataset: (a) Perform a Kruskal-Wallis test on mpg by cyl (the non-parametric alternative to the ANOVA in Exercise 5.2). (b) If significant, run pairwise Wilcoxon tests with Benjamini-Hochberg correction. (c) Compute Cohen's d for the 4-cyl vs. 8-cyl comparison. (d) Use the boot package to construct a 95% bootstrap percentile CI for the median mpg of 4-cylinder cars. Report all results in APA format.
