Hypothesis tests, confidence intervals, and assumption checks in R.
Most R test functions return an object with $statistic, $p.value, and $conf.int components. You can print the whole result or extract individual components programmatically.
```r
# Generate sample data
set.seed(42)
group_a <- rnorm(30, mean = 100, sd = 15)
group_b <- rnorm(30, mean = 108, sd = 15)

result <- t.test(group_a, group_b)
result           # full output
result$p.value   # just the p-value
result$conf.int  # 95% CI for difference in means
```
```r
# Default: Welch's t-test (does not assume equal variances)
t.test(group_a, group_b)

# Equal-variance (Student's) t-test
t.test(group_a, group_b, var.equal = TRUE)
```
```r
# Test if group_a mean differs from 100
t.test(group_a, mu = 100)
```
```r
# Before-after design
before <- c(82, 76, 91, 88, 73)
after  <- c(88, 80, 95, 90, 78)
t.test(before, after, paired = TRUE)
```
Tests for independence between two categorical variables.
```r
# Contingency table
survey <- matrix(c(30, 10, 20, 40), nrow = 2,
                 dimnames = list(Gender = c("M", "F"), Pref = c("A", "B")))
chisq.test(survey)
```
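The chi-square test relies on a large-sample approximation that becomes unreliable when expected cell counts are small (a common rule of thumb is any expected count below 5). As a sketch, the returned object stores the expected counts in $expected, and fisher.test() is an exact alternative that avoids the approximation entirely:

```r
# Same 2x2 table as above, rebuilt so this block is self-contained
survey <- matrix(c(30, 10, 20, 40), nrow = 2,
                 dimnames = list(Gender = c("M", "F"), Pref = c("A", "B")))

res <- chisq.test(survey)
res$expected   # inspect: the approximation is dubious if any cell < 5

# Fisher's exact test -- no large-sample approximation needed
fisher.test(survey)
```

Here all expected counts are well above 5, so chisq.test() is fine; fisher.test() is the usual fallback for small tables.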
Analysis of Variance tests whether group means differ across three or more groups.
```r
# One-way ANOVA using the mpg dataset (from ggplot2)
library(ggplot2)

model_aov <- aov(hwy ~ drv, data = mpg)
summary(model_aov)

# Post-hoc pairwise comparisons
TukeyHSD(model_aov)
```
Before trusting ANOVA results, check normality with shapiro.test() on the residuals and variance equality with bartlett.test() or car::leveneTest(). If the assumptions fail, consider kruskal.test() (non-parametric).
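A minimal sketch of those checks, applied to the ANOVA model above (using bartlett.test() from base stats so no extra package beyond ggplot2 is needed):

```r
library(ggplot2)  # provides the mpg dataset

model_aov <- aov(hwy ~ drv, data = mpg)

# Normality of residuals: H0 = residuals are normally distributed
shapiro.test(residuals(model_aov))

# Homogeneity of variance across groups: H0 = equal variances
bartlett.test(hwy ~ drv, data = mpg)

# If either check fails, fall back to the rank-based alternative
kruskal.test(hwy ~ drv, data = mpg)
```

Small p-values in either check are evidence against the corresponding assumption, which is the cue to switch to the Kruskal-Wallis test.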
```r
library(ggplot2)  # provides mpg
library(dplyr)    # provides select()

# Pearson correlation (default)
cor.test(mpg$displ, mpg$hwy)

# Spearman rank correlation (non-parametric)
cor.test(mpg$displ, mpg$hwy, method = "spearman")

# Correlation matrix for multiple variables
mpg |> select(displ, cty, hwy) |> cor()
```
```r
# Test if a proportion differs from 0.5
prop.test(x = 58, n = 100, p = 0.5)

# Shapiro-Wilk normality test
shapiro.test(group_a)  # p > 0.05: no evidence against normality
```
```r
# The t.test result already includes a CI
t.test(group_a)$conf.int

# Manual CI for a mean
x_bar  <- mean(group_a)
se     <- sd(group_a) / sqrt(length(group_a))
t_crit <- qt(0.975, df = length(group_a) - 1)
c(x_bar - t_crit * se, x_bar + t_crit * se)
```
A p-value tells you whether an observed effect is statistically distinguishable from chance, but not how large it is. Effect sizes quantify the magnitude of a difference, making results comparable across studies. Cohen's d is the most common measure for comparing two means.
```r
# Manual calculation of Cohen's d (pooled-SD version).
# Named cohens_d_manual so it does not mask effectsize::cohens_d below.
cohens_d_manual <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  pooled_sd <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  (mean(x) - mean(y)) / pooled_sd
}
cohens_d_manual(group_a, group_b)

# Using the effectsize package (more convenient)
# install.packages("effectsize")
library(effectsize)
cohens_d(group_a, group_b)  # includes a CI for d
```
Statistical power is the probability of detecting a real effect. Before collecting data, a power analysis tells you how many observations you need. The conventional target is 80% power.
```r
# install.packages("pwr")
library(pwr)

# Two-sample t-test: how many per group to detect d = 0.5?
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, type = "two.sample")
# n = 63.77 per group -- round up to 64

# What power do we have with n = 30 per group?
pwr.t.test(n = 30, d = 0.5, sig.level = 0.05, type = "two.sample")
# power = 0.478 -- underpowered!

# Power for ANOVA (k groups)
pwr.anova.test(k = 3, f = 0.25, sig.level = 0.05, power = 0.80)

# Power for a correlation test
pwr.r.test(r = 0.3, sig.level = 0.05, power = 0.80)
```
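It can also help to see how power scales with sample size rather than solving for a single n. A sketch using the same pwr.t.test() call over a range of per-group sizes:

```r
library(pwr)

# Power as a function of per-group sample size for a medium effect (d = 0.5)
ns <- seq(10, 100, by = 10)
powers <- sapply(ns, function(n)
  pwr.t.test(n = n, d = 0.5, sig.level = 0.05, type = "two.sample")$power)

round(setNames(powers, ns), 3)

# Power climbs steeply at first, then flattens: adding observations past
# the 80% threshold buys progressively less
plot(ns, powers, type = "b", xlab = "n per group", ylab = "Power")
abline(h = 0.80, lty = 2)
```

The curve makes the earlier numbers concrete: n = 30 sits well below the dashed 80% line, while the curve crosses it just past n = 64.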
When your data violate normality assumptions (small samples, skewed distributions, ordinal data), non-parametric tests make fewer distributional assumptions.
```r
# Wilcoxon rank-sum test (= Mann-Whitney U)
# Non-parametric alternative to the two-sample t-test
wilcox.test(group_a, group_b)

# Wilcoxon signed-rank test (paired)
wilcox.test(before, after, paired = TRUE)

# Kruskal-Wallis test (non-parametric ANOVA)
kruskal.test(hwy ~ drv, data = mpg)

# Post-hoc pairwise comparisons for Kruskal-Wallis
pairwise.wilcox.test(mpg$hwy, mpg$drv, p.adjust.method = "BH")
```
When you perform many tests simultaneously, the chance of at least one false positive inflates rapidly. If you run 20 tests at alpha = 0.05, you expect one false positive on average even if all null hypotheses are true.
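The inflation follows directly from the probability of at least one false positive among m independent tests: 1 - (1 - alpha)^m. A quick sketch computing this and confirming it by simulating t-tests on pure noise:

```r
# Family-wise error rate for m independent tests at alpha = 0.05
alpha <- 0.05
m <- c(1, 5, 10, 20, 50)
fwer <- 1 - (1 - alpha)^m
round(setNames(fwer, m), 3)
#     1     5    10    20    50
# 0.050 0.226 0.401 0.642 0.923

# Simulation check: 20 t-tests on pure noise, repeated many times
set.seed(1)
any_false_pos <- replicate(2000, {
  p <- replicate(20, t.test(rnorm(15), rnorm(15))$p.value)
  any(p < 0.05)
})
mean(any_false_pos)  # should land near 0.64
```

With 20 tests there is roughly a 64% chance of at least one spurious "significant" result even though every null hypothesis is true, which is exactly what the correction methods below are designed to control.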
```r
# Suppose we have p-values from multiple tests
p_vals <- c(0.003, 0.012, 0.048, 0.07, 0.15, 0.42)

# Bonferroni correction (conservative -- controls FWER)
p.adjust(p_vals, method = "bonferroni")
# 0.018 0.072 0.288 0.420 0.900 1.000

# Benjamini-Hochberg (FDR -- less conservative, more powerful)
p.adjust(p_vals, method = "BH")
# 0.018 0.036 0.096 0.105 0.180 0.420

# Holm method (step-down Bonferroni -- uniformly better than plain Bonferroni)
p.adjust(p_vals, method = "holm")
```
| Method | Controls | Conservatism | When to Use |
|---|---|---|---|
| Bonferroni | Family-wise error rate | Very conservative | Small number of pre-planned comparisons |
| Holm | Family-wise error rate | Less conservative | General-purpose, uniformly better than Bonferroni |
| Benjamini-Hochberg | False discovery rate | Moderate | Exploratory analysis, genomics, many tests |
Bootstrapping constructs confidence intervals by resampling from your data. This is useful when the sampling distribution is unknown or when your statistic is not a simple mean.
```r
# install.packages("boot")
library(boot)

# Define the statistic function: must take data and index arguments
boot_mean <- function(data, indices) {
  mean(data[indices])
}

# Run 10,000 bootstrap replications
set.seed(42)
boot_result <- boot(group_a, boot_mean, R = 10000)
boot_result

# Get confidence intervals (multiple methods)
boot.ci(boot_result, type = c("norm", "perc", "bca"))

# Bootstrap a median (no simple formula for its SE)
boot_median <- function(data, indices) median(data[indices])
med_result <- boot(group_a, boot_median, R = 10000)
boot.ci(med_result, type = "perc")

# Visualize the bootstrap distribution
hist(boot_result$t, breaks = 50, col = "#276DC3",
     main = "Bootstrap Distribution of the Mean",
     xlab = "Sample Mean")
```
Academic papers in psychology, education, and many social sciences follow APA (American Psychological Association) reporting guidelines. Even if your field uses a different style, these principles promote clear communication of statistical results.
```r
# Example: reporting a two-sample t-test
result <- t.test(group_a, group_b)

# APA format: t(df) = value, p = value
cat(sprintf(
  "t(%0.1f) = %0.2f, p = %0.3f",
  result$parameter, result$statistic, result$p.value
))
# Example output: t(57.4) = -2.15, p = 0.036

# For very small p-values, report as p < .001
format_p <- function(p) {
  if (p < 0.001) return("p < .001")
  sprintf("p = %0.3f", p)
}

# The report package automates APA formatting
# install.packages("report")
library(report)
report(t.test(group_a, group_b))
```
| Scenario | Function | Key Argument / Note |
|---|---|---|
| Two-sample means | t.test(x, y) | var.equal |
| Paired means | t.test(x, y, paired=TRUE) | |
| 3+ group means | aov(y ~ group) | TukeyHSD() |
| Categorical independence | chisq.test(table) | |
| Correlation | cor.test(x, y) | method |
| Proportion | prop.test(x, n) | p |
| Normality | shapiro.test(x) | |
| Non-parametric 2-sample | wilcox.test(x, y) | Mann-Whitney U |
| Non-parametric 3+ groups | kruskal.test(y ~ g) | |
| Effect size | effectsize::cohens_d(x, y) | |
| Power analysis | pwr::pwr.t.test() | d, power, n |
| Bootstrap CI | boot::boot() | R (replications) |
| p-value correction | p.adjust(p, method) | "BH", "bonferroni" |
Using the mtcars dataset, test whether cars with automatic transmission (am == 0) have different mpg than manual cars (am == 1). Report the t-statistic, p-value, and 95% CI. Is the difference statistically significant?
Run a one-way ANOVA to test whether mpg differs by number of cylinders (cyl) in mtcars. If significant, perform Tukey's HSD. Which pairs of cylinder counts differ?
You are designing an experiment to test whether a new training program improves exam scores. Based on a pilot study, you expect a medium effect size (d = 0.5). Using the pwr package, determine how many participants you need per group to achieve 80% power at alpha = 0.05. What happens to the required sample size if you want 90% power instead?
Using the mtcars dataset: (a) Perform a Kruskal-Wallis test on mpg by cyl (the non-parametric alternative to the ANOVA in Exercise 5.2). (b) If significant, run pairwise Wilcoxon tests with Benjamini-Hochberg correction. (c) Compute Cohen's d for the 4-cyl vs. 8-cyl comparison. (d) Use the boot package to construct a 95% bootstrap percentile CI for the median mpg of 4-cylinder cars. Report all results in APA format.
- Most R test objects expose $p.value, $conf.int, and $statistic.
- Welch's t-test (var.equal = FALSE) is the safe default for comparing two means.
- Check residual normality (shapiro.test) and variance homogeneity before ANOVA.
- Use TukeyHSD() for post-hoc pairwise comparisons after a significant ANOVA.
- cor.test() gives both the correlation coefficient and a significance test.