Chapter 2: Data Structures

Vectors, matrices, data frames, lists, and factors: the building blocks of R.

2.1 Vectors

A vector is the most fundamental R data structure. Every element must share the same type (homogeneous).

# Create vectors with c()
nums <- c(10, 20, 30, 40)
chars <- c("a", "b", "c")
logicals <- c(TRUE, FALSE, TRUE)

# Sequence shortcuts
1:10                       # 1 2 3 ... 10
seq(0, 1, by = 0.25)      # 0.00 0.25 0.50 0.75 1.00
rep("x", 5)                 # "x" "x" "x" "x" "x"

# Vectorized operations (no loops needed)
nums * 2       # 20 40 60 80
nums > 25      # FALSE FALSE TRUE TRUE
sum(nums)      # 100
length(nums)   # 4

Indexing Vectors

x <- c(5, 12, 8, 3, 17)

x[1]           # 5  — R is 1-indexed!
x[c(2, 4)]     # 12  3
x[-1]          # 12 8 3 17  — exclude first
x[x > 10]      # 12 17  — logical indexing

# Named vectors
scores <- c(math = 92, eng = 85, sci = 88)
scores["math"]   # 92

2.2 Matrices

A matrix is a 2D vector: all elements share one type.

m <- matrix(1:12, nrow = 3, ncol = 4)
m
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12

m[2, 3]        # 8  — row 2, col 3
m[1, ]         # entire row 1
m[, 2]         # entire col 2
dim(m)         # 3 4
t(m)           # transpose

2.3 Data Frames

The data frame is R's workhorse for tabular data. Each column is a vector, but columns can differ in type.

df <- data.frame(
  name   = c("Alice", "Bob", "Carol"),
  age    = c(28, 34, 25),
  score  = c(91.5, 87.2, 94.0),
  passed = c(TRUE, TRUE, TRUE)
)

# Inspection
str(df)          # structure — types, dimensions
head(df, 2)      # first 2 rows
nrow(df)         # 3
ncol(df)         # 4
names(df)        # column names

# Access columns
df$name          # "Alice" "Bob" "Carol"
df[, "score"]   # same as df$score
df[1, ]          # first row
df[df$age > 26, ]  # rows where age > 26

data.frame vs. tibble: The tidyverse uses the tibble, a modern data frame variant that prints more compactly and never converts strings to factors. Create one with tibble::tibble(), or convert an existing data frame with tibble::as_tibble(df).

2.4 Lists

Lists can hold elements of any type, including other lists. They are the most flexible R structure.

my_list <- list(
  title  = "Experiment 1",
  data   = data.frame(x = 1:3, y = c(4,5,6)),
  params = c(0.05, 100)
)

# Access list elements
my_list[[1]]          # "Experiment 1"  — extracts the element
my_list[1]            # a sub-list of length 1
my_list$data          # the data frame
my_list[["params"]]   # c(0.05, 100)

str(my_list)           # see the nested structure

[ ] vs. [[ ]]: Single brackets [ ] return a sub-list. Double brackets [[ ]] extract the element itself. Think of [ ] as handing you a boxcar with its cargo still inside, and [[ ]] as pulling the cargo out of the boxcar.
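
To see the difference concretely, the sketch below checks the class and length of each result. The list exp_info and its element names are made up for illustration.

```r
# Hypothetical list, used only for illustration
exp_info <- list(title = "Experiment 1", params = c(0.05, 100))

sub_list <- exp_info["params"]     # single bracket: still a list
element  <- exp_info[["params"]]   # double bracket: the numeric vector itself

class(sub_list)    # "list"
class(element)     # "numeric"
length(sub_list)   # 1 (one list element)
length(element)    # 2 (two numbers)

# Arithmetic works on the extracted element, not on the sub-list
element * 2        # 0.1 200.0
# sub_list * 2     # error: non-numeric argument to binary operator
```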

2.5 Factors

Factors represent categorical variables. They store levels (categories) and are used by many statistical functions.

status <- factor(c("low", "mid", "high", "low", "mid"))
levels(status)   # "high" "low" "mid"  — alphabetical by default

# Ordered factor (ordinal)
status_ord <- factor(
  c("low", "mid", "high", "low"),
  levels  = c("low", "mid", "high"),
  ordered = TRUE
)
status_ord[1] < status_ord[3]  # TRUE

# Convert factor to numeric (careful!)
as.numeric(status)  # gives underlying integers 2 3 1 2 3
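
The warning matters most when factor levels look like numbers. A minimal sketch, assuming a hypothetical factor years_f built from numeric-looking strings: as.numeric() returns the level codes, so convert through character (or through the levels) to recover the original values.

```r
# Hypothetical factor whose labels happen to be numbers
years_f <- factor(c("2020", "2018", "2020", "2019"))

levels(years_f)                       # "2018" "2019" "2020"
as.numeric(years_f)                   # 3 1 3 2  (level codes, not years)

# Two ways to recover the actual values
as.numeric(as.character(years_f))     # 2020 2018 2020 2019
as.numeric(levels(years_f))[years_f]  # 2020 2018 2020 2019
```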

2.6 names() and Naming Conventions

Most R objects can carry names: vectors, lists, and data frames all support them. For vectors and lists, names() gets or sets the element names; for data frames, names() returns the column names (the same as colnames()). Consistent naming makes code easier to read and maintain.

# Name a vector after creation
temps <- c(72, 68, 75, 80, 77)
names(temps) <- c("Mon", "Tue", "Wed", "Thu", "Fri")
temps["Wed"]   # 75

# Rename data frame columns
df <- data.frame(x = 1:3, y = c(10, 20, 30))
names(df) <- c("id", "revenue")
names(df)   # "id" "revenue"

Naming conventions in R: R allows dots in variable names (my.var), but a dotted function name can be mistaken for an S3 method, because S3 dispatch looks for functions named generic.class (for example, print.data.frame). A common convention is snake_case for variables and functions (my_variable, calc_mean) and PascalCase for S4 classes. Avoid spaces and special characters in names.
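
The S3 issue arises because R looks up methods by the name pattern generic.class. In the sketch below, the class name "celsius" and both function bodies are invented for illustration: a function named format.celsius is picked up automatically whenever format() is called on an object of class "celsius", which is why a dotted name on an ordinary helper can be misread as a method.

```r
# Hypothetical class and method, for illustration only
temp <- structure(21.5, class = "celsius")

# Because of its name, this function acts as the format() method for "celsius"
format.celsius <- function(x, ...) {
  paste0(unclass(x), " deg C")
}

format(temp)            # "21.5 deg C"

# A snake_case helper carries no such implied meaning
format_celsius <- function(x) paste0(x, " deg C")
format_celsius(21.5)    # "21.5 deg C"
```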

2.7 which() for Logical Indexing

The which() function returns the positions (indices) where a logical condition is TRUE. This is useful when you need index numbers rather than the values themselves.

scores <- c(45, 82, 67, 91, 53, 78, 95)

# Which positions have scores above 80?
which(scores > 80)       # 2 4 7

# Find the position of the maximum
which.max(scores)         # 7
which.min(scores)         # 1

# Practical use: find rows in a data frame
df <- data.frame(
  student = c("Ana", "Ben", "Cal", "Dee"),
  grade = c(88, 72, 95, 60)
)
failing_rows <- which(df$grade < 70)
df[failing_rows, ]         # Dee, 60

2.8 NULL vs. NA vs. NaN

These three special values are frequently confused. Understanding the differences is essential for data cleaning.

Value   Meaning                                  Test function   Example
NA      Missing value (exists but unknown)       is.na()         A survey question left blank
NULL    Absence of an object (nothing exists)    is.null()       An optional function argument not provided
NaN     Not a Number (undefined math result)     is.nan()        0/0 produces NaN

# NA propagates through computations
sum(c(1, 2, NA, 4))            # NA
sum(c(1, 2, NA, 4), na.rm = TRUE)  # 7

# NULL disappears from vectors
c(1, NULL, 3)                   # 1 3 (length 2, not 3)

# is.na() is TRUE for NaN as well, but is.nan() is TRUE only for NaN
is.na(NaN)                      # TRUE
is.nan(NA)                      # FALSE

Always handle NA explicitly: Many summary functions return NA if the input contains any NA. Pass na.rm = TRUE to functions such as mean(), sum(), and sd(), or drop incomplete observations beforehand with na.omit() or complete.cases().
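
A short sketch of these options on a hypothetical data frame survey (the column names and values are invented for illustration):

```r
# Hypothetical survey data with one missing value per column
survey <- data.frame(
  age    = c(25, NA, 31, 40),
  income = c(52000, 48000, NA, 61000)
)

mean(survey$age)                   # NA
mean(survey$age, na.rm = TRUE)     # 32

complete.cases(survey)             # TRUE FALSE FALSE TRUE
survey[complete.cases(survey), ]   # keeps rows 1 and 4
nrow(na.omit(survey))              # 2
colSums(is.na(survey))             # missing count per column: age 1, income 1
```

Counting missing values per column first, as in the last line, is a cheap way to decide whether dropping rows would discard too much data.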

2.9 Type Coercion Rules

When you mix types in a vector, R silently coerces everything to the most flexible type. The hierarchy is: logical < integer < double < character.

# Mixing types triggers automatic coercion
c(TRUE, 1L, 3.14)         # 1.00 1.00 3.14  (all double)
c(1, 2, "three")          # "1" "2" "three" (all character)
c(TRUE, FALSE, 1)         # 1 0 1            (all double)

# Explicit coercion
as.numeric("42")           # 42
as.character(100)          # "100"
as.integer(3.7)            # 3  (truncates, does not round)
as.logical(0)              # FALSE
as.numeric("hello")        # NA (with warning)

Why logical-to-numeric coercion matters: Because TRUE coerces to 1 and FALSE to 0, you can call sum() and mean() on logical vectors. For example, sum(x > 10) counts the values exceeding 10, and mean(x > 10) gives their proportion. This pattern is used constantly in data analysis.
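
For instance, with a hypothetical vector of response times (values invented for illustration), the count and the proportion both fall out of the same comparison:

```r
# Hypothetical measurements
response_ms <- c(8, 14, 22, 9, 31, 12, 7, 18)

over_10 <- response_ms > 10
over_10         # FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
sum(over_10)    # 5      (how many exceed 10)
mean(over_10)   # 0.625  (what share exceed 10)
```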

2.10 The apply Family

Before the tidyverse, the apply family of functions was the primary way to iterate over data structures. These functions remain widely used and appear frequently in existing R code.

# apply() — works on matrices/data frames, by row (1) or column (2)
m <- matrix(1:12, nrow = 3)
apply(m, 1, sum)        # row sums: 22 26 30
apply(m, 2, mean)       # column means: 2 5 8 11

# sapply() — apply a function to each element, return a vector
words <- c("hello", "world", "R")
sapply(words, nchar)     # 5 5 1

# lapply() — same as sapply but always returns a list
lapply(1:3, function(x) x^2)  # list(1, 4, 9)

# tapply() — apply a function by groups
tapply(mtcars$mpg, mtcars$cyl, mean)
#     4      6      8
# 26.66  19.74  15.10

apply vs. tidyverse: The tidyverse equivalents (e.g., purrr::map(), dplyr::summarize()) are generally more readable and consistent. However, apply() is useful for matrix operations, and tapply() is a quick way to get grouped summaries without loading any packages.
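
As a rough illustration of package-free grouped summaries, the sketch below computes a grouped mean with tapply() and then reproduces it with split() plus sapply(), both in base R. The data frame sales is invented for the example.

```r
# Hypothetical sales records
sales <- data.frame(
  region  = c("north", "south", "north", "south", "north"),
  revenue = c(100, 80, 120, 60, 140)
)

# Grouped mean in one call
by_region <- tapply(sales$revenue, sales$region, mean)
by_region
# north south
#   120    70

# Same idea in two steps: split into groups, then summarize each group
by_region_split <- sapply(split(sales$revenue, sales$region), mean)
by_region_split
# north south
#   120    70
```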

2.11 Environment Inspection

Your R environment is the collection of all objects (variables, functions, data) currently in memory. Knowing how to inspect and clean it prevents confusion, especially in long analysis sessions.

# List all objects in the current environment
ls()

# Remove a specific object
rm(x)

# Remove multiple objects
rm(x, y, z)

# Remove ALL objects (clean slate)
rm(list = ls())

# Check if an object exists
exists("my_data")   # TRUE or FALSE

# Check memory usage
object.size(mtcars) |> format(units = "KB")  # "7 Kb"
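
Because rm(list = ls()) wipes the whole workspace, it can help to practice on a scratch environment first. The sketch below is one way to do that; the environment scratch and the object names in it are hypothetical. ls(), rm(), and exists() all accept an environment argument, so nothing in your global workspace is touched.

```r
# Hypothetical scratch environment, separate from the global workspace
scratch <- new.env()

assign("raw_data", 1:10, envir = scratch)
assign("tmp", "delete me", envir = scratch)

ls(scratch)                                             # "raw_data" "tmp"

rm("tmp", envir = scratch)                              # remove one object
exists("tmp", envir = scratch, inherits = FALSE)        # FALSE
exists("raw_data", envir = scratch, inherits = FALSE)   # TRUE

rm(list = ls(scratch), envir = scratch)                 # clear everything in scratch
length(ls(scratch))                                     # 0
```

Setting inherits = FALSE keeps exists() from searching parent environments, so the answer reflects only what is stored in scratch.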

2.12 tibble vs. data.frame: A Deeper Look

Tibbles (from the tibble package, part of the tidyverse) are a stricter, more predictable variant of the data frame. The differences are subtle but matter in everyday work.

library(tibble)

# Create a tibble
tbl <- tibble(
  name  = c("Alice", "Bob", "Carol"),
  score = c(91.5, 87.2, 94.0)
)

# Key differences from data.frame:
# 1. Never converts strings to factors
class(tbl$name)       # "character" (data.frame() returned "factor" by default before R 4.0.0)

# 2. Prints nicely — shows dimensions, column types, only first 10 rows
tbl                    # compact display with type annotations

# 3. Stricter subsetting — [ always returns a tibble
tbl[, "name"]         # still a tibble (data.frame would drop to vector)

# 4. No partial matching on column names
# tbl$na   # returns NULL with a warning (a data.frame would partially match "name")

# Convert between the two
as_tibble(mtcars)      # data.frame to tibble
as.data.frame(tbl)     # tibble to data.frame

When to use which? Use tibbles for interactive analysis and tidyverse pipelines. Use base data.frame when writing packages or code that must run without extra dependencies. Many functions accept either one.
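
If you stay with base data frames, the subsetting difference in point 3 can be handled with drop = FALSE. A small sketch, using a hypothetical data frame people (stringsAsFactors = FALSE is set explicitly so the result is the same on older R versions):

```r
# Hypothetical data frame, base R only
people <- data.frame(
  name  = c("Alice", "Bob", "Carol"),
  score = c(91.5, 87.2, 94.0),
  stringsAsFactors = FALSE
)

class(people[, "name"])                 # "character"  (dropped to a vector)
class(people[, "name", drop = FALSE])   # "data.frame" (stays a data frame)
class(people["name"])                   # "data.frame" (list-style subsetting never drops)
```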

2.13 Useful Inspection Functions

Function      Purpose
str(x)        Compact display of structure
class(x)      High-level type ("data.frame", "list", etc.)
typeof(x)     Internal storage type ("double", "character")
length(x)     Number of elements (or columns for a data frame)
dim(x)        Rows and columns
summary(x)    Quick statistical summary
head(x, n)    First n elements/rows
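
To make these functions concrete, here is what several of them return for the built-in mtcars data set (32 rows, 11 columns). This is an illustrative sketch, not an exhaustive tour.

```r
class(mtcars)           # "data.frame"
typeof(mtcars)          # "list"   (a data frame is stored as a list of columns)
length(mtcars)          # 11       (number of columns)
dim(mtcars)             # 32 11
typeof(mtcars$mpg)      # "double"
class(1:3)              # "integer"
nrow(head(mtcars, 3))   # 3
```

The class() versus typeof() contrast on mtcars shows why both exist: class() describes how an object behaves, while typeof() describes how it is stored.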

Exercises

Exercise 2.1

Create a named vector temps with temperatures for Monday through Friday: 72, 68, 75, 80, 77. Extract only the days where temperature exceeds 74 using logical indexing.

Exercise 2.2

Build a data frame called products with columns: name (3 product names), price (numeric), and category (factor with levels "electronics", "clothing", "food"). Use str() to verify the types. Then extract only rows where price is above the median price.

Exercise 2.3

Create a matrix of exam scores: 4 students (rows) by 3 subjects (columns). Use apply() to compute each student's average score across subjects and each subject's average across students. Which student performed best overall? Which subject had the highest average?

Exercise 2.4

Create a vector x <- c(3, NA, 7, NaN, 12, NA, 5). (a) How many missing values does it contain? (Use sum(is.na(x)), and note that is.na() also counts NaN.) (b) Compute the mean, ignoring missing values. (c) Use which() to find the positions of the missing values. (d) Replace all missing values with the mean of the non-missing values (this is called imputation).
