R Data Science
Overview
Generate high-quality R code following tidyverse conventions and modern best practices. This skill covers data manipulation, visualization, statistical analysis, and reproducible research workflows commonly used in public health, epidemiology, and data science.
Core Principles
- Tidyverse-first: Use tidyverse packages (dplyr, tidyr, ggplot2, purrr, readr) as the default approach
- Pipe-forward: Use the native pipe |> for chains (R 4.1+); fall back to %>% for older versions
- Reproducibility: Structure all work for reproducibility using Quarto, renv, and clear documentation
- Defensive coding: Validate inputs, handle missing data explicitly, and fail informatively
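As a minimal sketch of the pipe-forward principle, the base R snippet below chains calls with the native pipe. The case counts are made up for illustration, and the `_` placeholder line assumes R 4.2 or later.

```r
# Hypothetical weekly case counts, invented for illustration
weekly_cases <- c(12, 15, NA, 22, 30)

# Each step feeds the next; missing data is dropped explicitly
mean_cases <- weekly_cases |>
  na.omit() |>
  mean()

# The `_` placeholder (R >= 4.2) passes the left-hand side to a named argument
fit <- data.frame(week = 1:5, cases = weekly_cases) |>
  lm(cases ~ week, data = _)
```

Without the placeholder, the native pipe always fills the first argument, which is why `lm()` needs `data = _` here.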
Quick Reference: Common Patterns
Data Import
library(tidyverse)

# CSV (most common)
df <- read_csv("data/raw/dataset.csv")

# Excel
df <- readxl::read_excel("data/raw/dataset.xlsx", sheet = "Sheet1")

# Clean column names immediately
df <- df |> janitor::clean_names()
Data Wrangling Pipeline
analysis_data <- raw_data |>
  # Clean and filter
  filter(!is.na(key_variable)) |>
  # Transform variables
  mutate(
    date = as.Date(date_string, format = "%Y-%m-%d"),
    # right = FALSE gives [0, 18), [18, 45), ... so the labels match the bins
    age_group = cut(
      age,
      breaks = c(0, 18, 45, 65, Inf),
      labels = c("0-17", "18-44", "45-64", "65+"),
      right = FALSE
    )
  ) |>
  # Summarize
  group_by(region, age_group) |>
  summarize(
    n = n(),
    mean_value = mean(outcome, na.rm = TRUE),
    .groups = "drop"
  )
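Labels such as "0-17" and "18-44" describe left-closed bins, which `cut()` only produces with `right = FALSE`. The sketch below compares both settings on hypothetical ages chosen to sit on the bin boundaries; it uses base R only.

```r
# Hypothetical ages chosen to sit on the bin boundaries
ages <- c(0, 17, 18, 44, 45, 64, 65, 90)

breaks <- c(0, 18, 45, 65, Inf)
labels <- c("0-17", "18-44", "45-64", "65+")

# Default (right = TRUE): intervals are (0, 18], (18, 45], ...
default_bins <- cut(ages, breaks = breaks, labels = labels)

# right = FALSE: intervals are [0, 18), [18, 45), ... matching the labels
left_closed_bins <- cut(ages, breaks = breaks, labels = labels, right = FALSE)

# Side-by-side comparison of the two binnings
data.frame(age = ages, default = default_bins, left_closed = left_closed_bins)
```

With the default setting, age 0 falls outside every bin and becomes NA, and ages 18, 45, and 65 land one group too low.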
Basic ggplot2 Visualization
ggplot(df, aes(x = date, y = count, color = category)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Trend Over Time",
    subtitle = "By category",
    x = "Date",
    y = "Count",
    color = "Category",
    caption = "Source: Dataset Name"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )
Tidyverse Style Guide Essentials
Naming Conventions
- snake_case for objects and functions: case_counts, calculate_rate()
- Verbs for functions: filter_outliers(), compute_summary()
- Nouns for data: patient_data, surveillance_df
- Avoid: dots in names (dots are used for S3 method dispatch), single letters except in lambdas
Code Formatting
- Indentation: 2 spaces (never tabs)
- Line length: 80 characters maximum
- Operators: Spaces around <-, =, +, |>, but not :, ::, $
- Commas: Space after, never before
- Pipes: New line after each |>
# Good
result <- data |>
  filter(year >= 2020) |>
  group_by(county) |>
  summarize(total = sum(cases))

# Bad
result<-data|>filter(year>=2020)|>group_by(county)|>summarize(total=sum(cases))
Assignment
- Use <- for assignment, never = or ->
- Use = only for function arguments
Comments
# Load and clean surveillance data ------------------------------------------

# Calculate age-adjusted rates
# Using direct standardization method per CDC guidelines
adjusted_rate <- calculate_adjusted_rate(df, standard_pop)
Package Ecosystem
Core Tidyverse (Always Load)
library(tidyverse) # Loads: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats (plus lubridate as of tidyverse 2.0.0)
Data Import/Export
| Task | Package | Key Functions |
|------|---------|---------------|
| CSV/TSV | readr | read_csv(), write_csv() |
| Excel | readxl, writexl | read_excel(), write_xlsx() |
| SAS/SPSS/Stata | haven | read_sas(), read_spss(), read_stata() |
| JSON | jsonlite | read_json(), fromJSON() |
| Databases | DBI, dbplyr | dbConnect(), tbl() |
Data Manipulation
| Task | Package | Key Functions |
|------|---------|---------------|
| Column cleaning | janitor | clean_names(), tabyl() |
| Date handling | lubridate | ymd(), mdy(), floor_date() |
| String operations | stringr | str_detect(), str_extract() |
| Missing data | naniar | vis_miss(), replace_with_na() |
Visualization
| Task | Package | Key Functions |
|------|---------|---------------|
| Core plotting | ggplot2 | ggplot(), geom_*() |
| Extensions | ggrepel, patchwork | geom_text_repel(), + operator |
| Interactive | plotly | ggplotly() |
| Tables | gt, kableExtra | gt(), kable() |
Statistical Analysis
| Task | Package | Key Functions |
|------|---------|---------------|
| Model summaries | broom | tidy(), glance(), augment() |
| Regression | stats, lme4 | lm(), glm(), lmer() |
| Survival | survival | Surv(), survfit(), coxph() |
| Survey data | survey | svydesign(), svymean() |
Epidemiology & Public Health
| Task | Package | Key Functions |
|------|---------|---------------|
| Epi calculations | epiR | epi.2by2(), epi.conf() |
| Outbreak tools | incidence2, epicontacts | incidence(), make_epicontacts() |
| Disease mapping | SpatialEpi, spdep | expected(), EBlocal() |
| Surveillance | surveillance | sts(), farrington() |
| Rate calculations | epitools | riskratio(), oddsratio(), ageadjust.direct() |
Reproducibility Standards
Project Structure
project/
├── project.Rproj
├── renv.lock
├── CLAUDE.md            # Claude Code configuration
├── README.md
├── data/
│   ├── raw/             # Never modify
│   └── processed/       # Analysis-ready
├── R/                   # Custom functions
├── scripts/             # Pipeline scripts
├── analysis/            # Quarto documents
└── output/
    ├── figures/
    └── tables/
Quarto Document Header
---
title: "Analysis Title"
author: "Your Name"
date: today
format:
  html:
    toc: true
    code-fold: true
    embed-resources: true
execute:
  warning: false
  message: false
---
Package Management with renv
# Initialize (once per project)
renv::init()

# Snapshot dependencies after installing packages
renv::snapshot()

# Restore environment (for collaborators)
renv::restore()
Workflow Documentation
Always include at the top of scripts:
# ============================================================================
# Title: Analysis of [Subject]
# Author: [Name]
# Date: [Date]
# Purpose: [One-sentence description]
# Input: data/processed/clean_data.csv
# Output: output/figures/trend_plot.png
# ============================================================================
Common Analysis Patterns
Descriptive Statistics Table
df |>
  group_by(category) |>
  summarize(
    n = n(),
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    q25 = quantile(value, 0.25, na.rm = TRUE),
    q75 = quantile(value, 0.75, na.rm = TRUE)
  ) |>
  gt::gt() |>
  # Format only the summary statistics so n stays a whole number
  gt::fmt_number(columns = c(mean, sd, median, q25, q75), decimals = 2)
Regression with Tidy Output
model <- glm(outcome ~ exposure + age + sex, data = df, family = binomial)

# Tidy coefficients (exponentiate = TRUE reports odds ratios)
tidy_results <- broom::tidy(model, conf.int = TRUE, exponentiate = TRUE) |>
  select(term, estimate, conf.low, conf.high, p.value)

# Model-level fit statistics such as AIC and deviance
glance_results <- broom::glance(model)
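To make the effect of `exponentiate = TRUE` concrete, here is a base R sketch that converts logistic regression coefficients to odds ratios by hand. The data are simulated, and every variable name and effect size is invented for illustration. Wald intervals are used for simplicity, so they need not match the intervals `broom::tidy()` reports.

```r
set.seed(42)  # reproducible simulated data

# Simulated data: all variable names and effect sizes are invented
n <- 500
sim <- data.frame(
  exposure = rbinom(n, 1, 0.4),
  age = round(rnorm(n, 50, 10))
)
log_odds <- -1 + 0.8 * sim$exposure + 0.01 * (sim$age - 50)
sim$outcome <- rbinom(n, 1, plogis(log_odds))

model <- glm(outcome ~ exposure + age, data = sim, family = binomial)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(model))

# Wald confidence intervals, exponentiated the same way
or_ci <- exp(confint.default(model))

cbind(odds_ratio = odds_ratios, or_ci)
```

An odds ratio above 1 indicates higher odds of the outcome per unit increase in the predictor, holding the other terms fixed.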
Epi Curve (Epidemic Curve)
library(incidence2)

# Create incidence object
inc <- incidence(
  df,
  date_index = "onset_date",
  interval = "week",
  groups = "outcome_category"
)

# Plot
plot(inc) +
  labs(
    title = "Epidemic Curve",
    x = "Week of Onset",
    y = "Number of Cases"
  ) +
  theme_minimal()
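For intuition about what a weekly epidemic curve counts, the following base R sketch bins a hypothetical line list into weeks starting on Monday. It only illustrates the aggregation idea and does not describe how incidence2 is implemented; the dates are invented.

```r
# Hypothetical line list: one onset date per case, invented for illustration
onset_date <- as.Date(c(
  "2024-01-01", "2024-01-03", "2024-01-07",
  "2024-01-08", "2024-01-10", "2024-01-22"
))

# Assign each case to the Monday that starts its week
week_start <- as.Date(cut(onset_date, breaks = "week"))

# Count cases per week; weeks with no cases do not appear in this table
weekly_counts <- as.data.frame(table(week_start = week_start))
```

Because empty weeks are dropped here, a plot built from `weekly_counts` would need those weeks filled with zeros to show gaps in onset correctly.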
Rate Calculation
# Age-adjusted rates using direct standardization
library(epitools)

# Stratum-specific counts and populations
result <- ageadjust.direct(
  count = df$cases,
  pop = df$population,
  stdpop = standard_population$pop  # e.g., US 2000 standard
)
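Direct standardization weights each stratum-specific rate by that stratum's share of a standard population. The base R sketch below works through the arithmetic on hypothetical strata; all counts and the standard population are invented for illustration.

```r
# Hypothetical stratum-level data, invented for illustration
strata <- data.frame(
  age_group = c("0-17", "18-44", "45-64", "65+"),
  cases = c(10, 40, 90, 160),
  population = c(20000, 40000, 30000, 10000),
  std_pop = c(25000, 40000, 25000, 10000)  # assumed standard population
)

# Stratum-specific rates
strata$rate <- strata$cases / strata$population

# Crude rate: total cases over total population
crude_rate <- sum(strata$cases) / sum(strata$population)

# Directly adjusted rate: stratum rates weighted by the standard population
weights <- strata$std_pop / sum(strata$std_pop)
adjusted_rate <- sum(strata$rate * weights)

# Express both per 100,000
c(crude = crude_rate, adjusted = adjusted_rate) * 100000
```

The adjusted rate is lower than the crude rate in this example because the study population has a larger share of people aged 45-64 than the assumed standard.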
Error Handling
Defensive Data Checks
# Validate data before analysis
stopifnot(
  "Data frame is empty" = nrow(df) > 0,
  "Missing required columns" = all(c("id", "date", "value") %in% names(df)),
  "Duplicate IDs found" = !any(duplicated(df$id))
)

# Informative warnings for data quality issues
if (sum(is.na(df$key_var)) > 0) {
  warning(sprintf(
    "%d missing values in key_var (%.1f%%)",
    sum(is.na(df$key_var)),
    100 * mean(is.na(df$key_var))
  ))
}
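These checks can be wrapped in a reusable function so that every script fails with the same messages. The sketch below is one possible shape; the function name `validate_surveillance_data()` and its default required columns are hypothetical.

```r
# Hypothetical helper: the name and the required columns are illustrative
validate_surveillance_data <- function(df,
                                       required = c("id", "date", "value")) {
  if (!is.data.frame(df)) {
    stop("`df` must be a data frame, not ", class(df)[1], call. = FALSE)
  }
  if (nrow(df) == 0) {
    stop("Data frame is empty", call. = FALSE)
  }
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop(
      "Missing required columns: ", paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  if (anyDuplicated(df$id) > 0) {
    stop("Duplicate IDs found", call. = FALSE)
  }
  invisible(df)  # return the input so the check can sit inside a pipeline
}

# Usage: a data frame missing the `value` column fails with a specific message
bad <- data.frame(id = 1:3, date = Sys.Date())
result <- tryCatch(validate_surveillance_data(bad), error = conditionMessage)
```

Naming the missing columns in the message, rather than only reporting that some are missing, saves a round of debugging.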
Safe File Operations
# Check file exists before reading
if (!file.exists(filepath)) {
  stop(sprintf("File not found: %s", filepath))
}

# Create directories if needed
dir.create("output/figures", recursive = TRUE, showWarnings = FALSE)
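The two checks combine naturally into small read and write helpers. The sketch below shows one way to do that; `read_required_csv()` and `write_output_csv()` are hypothetical names, and base `utils::read.csv()` is used only to keep the example dependency-free (readr's `read_csv()` could be substituted).

```r
# Hypothetical helper: fail early with a clear message if the file is absent
read_required_csv <- function(filepath) {
  if (!file.exists(filepath)) {
    stop(sprintf("File not found: %s", filepath), call. = FALSE)
  }
  utils::read.csv(filepath, stringsAsFactors = FALSE)
}

# Hypothetical helper: create the parent directory, then write the file
write_output_csv <- function(df, filepath) {
  dir.create(dirname(filepath), recursive = TRUE, showWarnings = FALSE)
  utils::write.csv(df, filepath, row.names = FALSE)
  invisible(filepath)
}

# Round trip in a temporary directory so nothing in the project is touched
out_path <- file.path(tempdir(), "output", "tables", "example.csv")
write_output_csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)), out_path)
round_trip <- read_required_csv(out_path)
```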
Performance Tips
For Large Datasets
# Use data.table for >1M rows
library(data.table)
dt <- fread("large_file.csv")

# Or use arrow for very large/parquet files
library(arrow)
df <- read_parquet("data.parquet")

# Lazy evaluation with duckdb: query the CSV in place without loading it
library(duckdb)
library(dplyr)
con <- dbConnect(duckdb())
df_lazy <- tbl(con, sql("SELECT * FROM read_csv_auto('data.csv')"))
Vectorization Over Loops
# Good: vectorized
df$rate <- df$cases / df$population * 100000

# Avoid: row-by-row loop
for (i in 1:nrow(df)) {
  df$rate[i] <- df$cases[i] / df$population[i] * 100000
}
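A quick way to confirm that the two approaches agree is to run both on a small data frame. The sketch below uses hypothetical county counts; when a loop is unavoidable, `seq_len()` is the safer index because `1:nrow(df)` iterates over `c(1, 0)` when the data frame has no rows.

```r
# Hypothetical county-level counts, invented for illustration
counties <- data.frame(
  cases = c(12, 0, 45, 7),
  population = c(15000, 8000, 120000, 3500)
)

# Vectorized: one expression over whole columns
rate_vectorized <- counties$cases / counties$population * 100000

# Loop version for comparison, preallocated and indexed with seq_len()
rate_loop <- numeric(nrow(counties))
for (i in seq_len(nrow(counties))) {
  rate_loop[i] <- counties$cases[i] / counties$population[i] * 100000
}

# TRUE when both approaches give the same rates
all.equal(rate_vectorized, rate_loop)
```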
Additional Resources
For detailed patterns, consult:
- Tidyverse Style Guide: https://style.tidyverse.org/
- R for Data Science (2e): https://r4ds.hadley.nz/
- The Epidemiologist R Handbook: https://epirhandbook.com/
- Quarto Documentation: https://quarto.org/
Version History
- v1.0.0 (2025-12-04): Initial release for PubHealthAI community