Vectors, matrices, data frames, lists, and factors: the building blocks of R.
A vector is the most fundamental R data structure. Every element must share the same type (homogeneous).
    # Create vectors with c()
    nums <- c(10, 20, 30, 40)
    chars <- c("a", "b", "c")
    logicals <- c(TRUE, FALSE, TRUE)

    # Sequence shortcuts
    1:10                   # 1 2 3 ... 10
    seq(0, 1, by = 0.25)   # 0.00 0.25 0.50 0.75 1.00
    rep("x", 5)            # "x" "x" "x" "x" "x"

    # Vectorized operations (no loops needed)
    nums * 2       # 20 40 60 80
    nums > 25      # FALSE FALSE TRUE TRUE
    sum(nums)      # 100
    length(nums)   # 4

    # Indexing
    x <- c(5, 12, 8, 3, 17)
    x[1]          # 5 (R is 1-indexed!)
    x[c(2, 4)]    # 12 3
    x[-1]         # 12 8 3 17 (exclude first)
    x[x > 10]     # 12 17 (logical indexing)

    # Named vectors
    scores <- c(math = 92, eng = 85, sci = 88)
    scores["math"]   # 92
A matrix is a 2D vector: all elements share one type.
    m <- matrix(1:12, nrow = 3, ncol = 4)
    m
    #      [,1] [,2] [,3] [,4]
    # [1,]    1    4    7   10
    # [2,]    2    5    8   11
    # [3,]    3    6    9   12

    m[2, 3]   # 8 (row 2, col 3)
    m[1, ]    # entire row 1
    m[, 2]    # entire col 2
    dim(m)    # 3 4
    t(m)      # transpose
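The "2D vector" description can be checked directly: a matrix is stored as an ordinary vector plus a dim attribute. The sketch below is a minimal illustration with made-up values, not part of the original example.

```r
# Start from a plain integer vector (illustrative values)
v <- 1:6
dim(v) <- c(2, 3)    # v is now a 2 x 3 matrix, filled column by column

is.matrix(v)         # TRUE
v[2, 3]              # 6
v[6]                 # 6 (vector-style indexing still works)

# byrow = TRUE fills across rows instead of down columns
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row[1, ]          # 1 2 3
```

Because the data are stored column by column, the last element of the vector is also the bottom-right cell of the matrix.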
The data frame is R's workhorse for tabular data. Each column is a vector, but columns can differ in type.
    df <- data.frame(
      name   = c("Alice", "Bob", "Carol"),
      age    = c(28, 34, 25),
      score  = c(91.5, 87.2, 94.0),
      passed = c(TRUE, TRUE, TRUE)
    )

    # Inspection
    str(df)       # structure: types, dimensions
    head(df, 2)   # first 2 rows
    nrow(df)      # 3
    ncol(df)      # 4
    names(df)     # column names

    # Access columns
    df$name             # "Alice" "Bob" "Carol"
    df[, "score"]       # same as df$score
    df[1, ]             # first row
    df[df$age > 26, ]   # rows where age > 26
The tidyverse offers the tibble, a modern data frame variant that prints more cleanly and never converts strings to factors. Create one with tibble::tibble() or tibble::as_tibble(df).
Lists can hold elements of any type, including other lists. They are the most flexible R structure.
    my_list <- list(
      title  = "Experiment 1",
      data   = data.frame(x = 1:3, y = c(4, 5, 6)),
      params = c(0.05, 100)
    )

    # Access list elements
    my_list[[1]]          # "Experiment 1" (extracts the element)
    my_list[1]            # a sub-list of length 1
    my_list$data          # the data frame
    my_list[["params"]]   # c(0.05, 100)
    str(my_list)          # see the nested structure
Single brackets [ ] return a sub-list. Double brackets [[ ]] extract the actual element. Think of [ ] as selecting a boxcar and [[ ]] as pulling the item out of the boxcar.
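A quick way to see the difference is to compare the class and length of each result. The list below is a hypothetical one built only for this sketch.

```r
# Hypothetical list used only for this illustration
box <- list(title = "Experiment 1", params = c(0.05, 100))

single <- box["params"]     # still a list (the boxcar)
double <- box[["params"]]   # the numeric vector inside it

class(single)    # "list"
class(double)    # "numeric"
length(single)   # 1
length(double)   # 2

# Arithmetic works on the extracted element, not on the sub-list
double * 2       # 0.1 200.0
```

Calling single * 2 would fail with an error, because a list is not a numeric vector.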
Factors represent categorical variables. They store levels (categories) and are used by many statistical functions.
    status <- factor(c("low", "mid", "high", "low", "mid"))
    levels(status)   # "high" "low" "mid" (alphabetical by default)

    # Ordered factor (ordinal)
    status_ord <- factor(
      c("low", "mid", "high", "low"),
      levels  = c("low", "mid", "high"),
      ordered = TRUE
    )
    status_ord[1] < status_ord[3]   # TRUE

    # Convert factor to numeric (careful!)
    as.numeric(status)   # gives underlying integers 2 3 1 2 3
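The "careful!" warning matters most when the factor labels look like numbers. The sketch below uses a hypothetical factor to show the pitfall and two common ways around it.

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "10", "30"))

as.numeric(f)                 # 1 2 1 3 (level codes, not the values)
as.numeric(as.character(f))   # 10 20 10 30
as.numeric(levels(f))[f]      # 10 20 10 30 (same result via the levels)
```

Converting through as.character() first recovers the original labels; converting directly returns only the internal integer codes.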
Most R objects can carry names. For vectors, use names(); for data frames, names() returns column names (same as colnames()). Consistent naming makes code easier to read and maintain.
    # Name a vector after creation
    temps <- c(72, 68, 75, 80, 77)
    names(temps) <- c("Mon", "Tue", "Wed", "Thu", "Fri")
    temps["Wed"]   # 75

    # Rename data frame columns
    df <- data.frame(x = 1:3, y = c(10, 20, 30))
    names(df) <- c("id", "revenue")
    names(df)   # "id" "revenue"
Older R code often uses dots in names (my.var), but this conflicts with S3 method dispatch, which reads names of the form generic.class as methods. Best practice is snake_case for variables and functions (my_variable, calc_mean). Use PascalCase for S4 classes. Avoid spaces and special characters in names.
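To make the S3 point concrete, here is a minimal sketch. The class name "reading" and both function names are hypothetical, chosen only for this illustration.

```r
# A function named generic.class is treated as an S3 method:
# print() will call this for any object of class "reading".
print.reading <- function(x, ...) {
  cat("Reading:", unclass(x), "\n")
  invisible(x)
}

r <- structure(21.5, class = "reading")
print(r)            # dispatches to print.reading()

# A snake_case helper cannot be mistaken for a method
format_reading <- function(x) paste0(unclass(x), " units")
format_reading(r)   # "21.5 units"
```

A dotted name for an ordinary helper invites the same dispatch by accident, which is why snake_case is the safer default.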
The which() function returns the positions (indices) where a logical condition is TRUE. This is useful when you need index numbers rather than the values themselves.
    scores <- c(45, 82, 67, 91, 53, 78, 95)

    # Which positions have scores above 80?
    which(scores > 80)   # 2 4 7

    # Find the position of the maximum and minimum
    which.max(scores)    # 7
    which.min(scores)    # 1

    # Practical use: find rows in a data frame
    df <- data.frame(
      student = c("Ana", "Ben", "Cal", "Dee"),
      grade   = c(88, 72, 95, 60)
    )
    failing_rows <- which(df$grade < 70)
    df[failing_rows, ]   # Dee, 60
NA, NULL, and NaN are three special values that are frequently confused. Understanding the differences is essential for data cleaning.
| Value | Meaning | Test Function | Example |
|---|---|---|---|
| NA | Missing value (exists but unknown) | is.na() | A survey question left blank |
| NULL | Absence of an object (nothing exists) | is.null() | An optional function argument not provided |
| NaN | Not a Number (undefined math result) | is.nan() | 0/0 produces NaN |
    # NA propagates through computations
    sum(c(1, 2, NA, 4))                 # NA
    sum(c(1, 2, NA, 4), na.rm = TRUE)   # 7

    # NULL disappears from vectors
    c(1, NULL, 3)   # 1 3 (length 2, not 3)

    # is.na() counts NaN as missing, but is.nan() does not count NA as NaN
    is.na(NaN)   # TRUE
    is.nan(NA)   # FALSE
Most summary functions return NA if the input contains any NA values. Use na.rm = TRUE in summary functions (mean(), sum(), sd()), or remove NAs beforehand with na.omit() or complete.cases().
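As a minimal sketch of those options, the example below uses a hypothetical data frame with a missing value in each column.

```r
# Hypothetical data with missing values in both columns
survey <- data.frame(
  age   = c(28, NA, 25, 41),
  score = c(91.5, 87.2, NA, 78.0)
)

mean(survey$age)                 # NA
mean(survey$age, na.rm = TRUE)   # 31.33333

complete.cases(survey)           # TRUE FALSE FALSE TRUE
clean <- na.omit(survey)         # keeps only rows with no NA
nrow(clean)                      # 2
```

Note that na.omit() drops a whole row if any column is missing, so it can discard more data than na.rm = TRUE applied to a single column.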
When you mix types in a vector, R silently coerces everything to the most flexible type. The hierarchy is: logical < integer < double < character.
    # Mixing types triggers automatic coercion
    c(TRUE, 1L, 3.14)    # 1.00 1.00 3.14 (all double)
    c(1, 2, "three")     # "1" "2" "three" (all character)
    c(TRUE, FALSE, 1)    # 1 0 1 (all double)

    # Explicit coercion
    as.numeric("42")       # 42
    as.character(100)      # "100"
    as.integer(3.7)        # 3 (truncates, does not round)
    as.logical(0)          # FALSE
    as.numeric("hello")    # NA (with warning)
Because TRUE coerces to 1 and FALSE to 0, you can use sum() and mean() on logical vectors. For example, mean(x > 10) gives the proportion of values exceeding 10. This pattern is used constantly in data analysis.
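A short sketch of this counting pattern, using illustrative values:

```r
x <- c(5, 12, 8, 3, 17)   # illustrative values

above <- x > 10   # FALSE TRUE FALSE FALSE TRUE
sum(above)        # 2   (how many values exceed 10)
mean(above)       # 0.4 (what proportion exceed 10)
```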
Before the tidyverse, the apply family of functions was the primary way to iterate over data structures. These functions remain widely used and appear frequently in existing R code.
    # apply(): works on matrices/data frames, by row (1) or column (2)
    m <- matrix(1:12, nrow = 3)
    apply(m, 1, sum)    # row sums: 22 26 30
    apply(m, 2, mean)   # column means: 2 5 8 11

    # sapply(): apply a function to each element, return a vector
    words <- c("hello", "world", "R")
    sapply(words, nchar)   # 5 5 1

    # lapply(): same as sapply but always returns a list
    lapply(1:3, function(x) x^2)   # list(1, 4, 9)

    # tapply(): apply a function by groups
    tapply(mtcars$mpg, mtcars$cyl, mean)
    #     4     6     8
    # 26.66 19.74 15.10
Tidyverse alternatives (purrr::map(), dplyr::summarize()) are generally more readable and consistent. However, apply() is useful for matrix operations, and tapply() is a quick way to get grouped summaries without loading any packages.
Your R environment is the collection of all objects (variables, functions, data) currently in memory. Knowing how to inspect and clean it prevents confusion, especially in long analysis sessions.
    # List all objects in the current environment
    ls()

    # Remove a specific object
    rm(x)

    # Remove multiple objects
    rm(x, y, z)

    # Remove ALL objects (clean slate)
    rm(list = ls())

    # Check if an object exists
    exists("my_data")   # TRUE or FALSE

    # Check memory usage
    object.size(mtcars) |> format(units = "KB")   # roughly 7 KB
Tibbles (from the tibble package, part of the tidyverse) are an improved version of data frames. The differences are subtle but important for everyday work.
    library(tibble)

    # Create a tibble
    tbl <- tibble(
      name  = c("Alice", "Bob", "Carol"),
      score = c(91.5, 87.2, 94.0)
    )

    # Key differences from data.frame:

    # 1. Never converts strings to factors
    class(tbl$name)   # "character" (data.frame() defaulted to factors before R 4.0)

    # 2. Prints nicely: shows dimensions, column types, only first 10 rows
    tbl               # compact display with type annotations

    # 3. Stricter subsetting: [ always returns a tibble
    tbl[, "name"]     # still a tibble (data.frame would drop to vector)

    # 4. No partial matching on column names
    # tbl$na          # would give NULL with warning (data.frame might match "name")

    # Convert between the two
    as_tibble(mtcars)     # data.frame to tibble
    as.data.frame(tbl)    # tibble to data.frame
Stick with data.frame when writing packages or code that must work without any dependencies. Many functions accept both interchangeably.
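When staying with base data frames, the single-column drop behaviour is the difference most likely to cause surprises. The sketch below uses a hypothetical two-row data frame and only base R; drop = FALSE is one way to keep the result a data frame.

```r
# Hypothetical data frame for illustration
df <- data.frame(name = c("Alice", "Bob"), score = c(91.5, 87.2))

class(df[, "name"])                 # "character" (dropped to a vector)
class(df[, "name", drop = FALSE])   # "data.frame"
class(df["name"])                   # "data.frame" (list-style subsetting)
```

Writing drop = FALSE explicitly keeps the result type predictable regardless of how many columns are selected.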
| Function | Purpose |
|---|---|
| str(x) | Compact display of structure |
| class(x) | High-level type ("data.frame", "list", etc.) |
| typeof(x) | Internal storage type ("double", "character") |
| length(x) | Number of elements (or columns for a data frame) |
| dim(x) | Rows and columns |
| summary(x) | Quick statistical summary |
| head(x, n) | First n elements/rows |
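As a brief illustration of how these functions differ, the sketch below applies several of them to the built-in mtcars data frame and to a small numeric vector.

```r
# A data frame is a list of columns under the hood
class(mtcars)    # "data.frame"
typeof(mtcars)   # "list"
length(mtcars)   # 11 (columns, not rows)
dim(mtcars)      # 32 11

# A plain numeric vector (illustrative values)
v <- c(1.5, 2.5)
class(v)         # "numeric"
typeof(v)        # "double"
dim(v)           # NULL (plain vectors have no dimensions)
```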
Create a named vector temps with temperatures for Monday through Friday: 72, 68, 75, 80, 77. Extract only the days where temperature exceeds 74 using logical indexing.
Build a data frame called products with columns: name (3 product names), price (numeric), and category (factor with levels "electronics", "clothing", "food"). Use str() to verify the types. Then extract only rows where price is above the median price.
Create a matrix of exam scores: 4 students (rows) by 3 subjects (columns). Use apply() to compute each student's average score across subjects and each subject's average across students. Which student performed best overall? Which subject had the highest average?
Create a vector x <- c(3, NA, 7, NaN, 12, NA, 5). (a) How many NA values does it contain? (Use sum(is.na(x)).) (b) Compute the mean, ignoring missing values. (c) Use which() to find the positions of the missing values. (d) Replace all NA values with the mean of the non-missing values (this is called imputation).
Vectors (c()) are homogeneous; lists (list()) are heterogeneous. Use [ for subsetting and [[ for element extraction. str() is your best friend for understanding any R object.