R Data Science

Overview

Generate high-quality R code following tidyverse conventions and modern best practices. This skill covers data manipulation, visualization, statistical analysis, and reproducible research workflows commonly used in public health, epidemiology, and data science.

Core Principles

Tidyverse-first: Use tidyverse packages (dplyr, tidyr, ggplot2, purrr, readr) as the default approach
Pipe-forward: Use the native pipe |> for chains (R 4.1+); fall back to %>% for older versions
Reproducibility: Structure all work for reproducibility using Quarto, renv, and clear documentation
Defensive coding: Validate inputs, handle missing data explicitly, and fail informatively

Quick Reference: Common Patterns

Data Import

library(tidyverse)

# CSV (most common)
df <- read_csv("data/raw/dataset.csv")

# Excel
df <- readxl::read_excel("data/raw/dataset.xlsx", sheet = "Sheet1")

# Clean column names immediately
df <- df |> janitor::clean_names()
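A minimal sketch of a more defensive import, in line with the "fail informatively" principle. The toy file, its column names (`id`, `onset_date`, `cases`), and the `-99` missing-value code are assumptions made for illustration, not part of the original workflow. Declaring `col_types` and `na` up front means a changed file layout shows up as parse problems instead of silently changing column types.

```r
library(readr)

# Hypothetical toy file standing in for data/raw/dataset.csv
path <- tempfile(fileext = ".csv")
writeLines(
  c("id,onset_date,cases", "1,2024-01-03,5", "2,2024-01-10,-99", "3,,7"),
  path
)

df <- read_csv(
  path,
  col_types = cols(
    id = col_integer(),
    onset_date = col_date(format = "%Y-%m-%d"),
    cases = col_integer()
  ),
  na = c("", "NA", "-99") # assumed missing-value codes
)

# Report any values that could not be parsed with the declared types
problems(df)
```

If `problems(df)` returns rows, the file no longer matches the declared types and is worth inspecting before any analysis runs.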

Data Wrangling Pipeline

analysis_data <- raw_data |>
  # Clean and filter
  filter(!is.na(key_variable)) |>
  # Transform variables
  mutate(
    date = as.Date(date_string, format = "%Y-%m-%d"),
    age_group = cut(
      age,
      breaks = c(0, 18, 45, 65, Inf),
      labels = c("0-17", "18-44", "45-64", "65+"),
      right = FALSE # left-closed bins, so age 18 falls in "18-44"
    )
  ) |>
  # Summarize
  group_by(region, age_group) |>
  summarize(
    n = n(),
    mean_value = mean(outcome, na.rm = TRUE),
    .groups = "drop"
  )
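The bin edges passed to `cut()` decide which group a boundary age lands in, so it is worth checking them on a few values before trusting the labels. The sketch below uses made-up ages and compares the default right-closed bins with left-closed bins (`right = FALSE`).

```r
ages <- c(0, 17, 18, 44, 45, 64, 65, 90)
breaks <- c(0, 18, 45, 65, Inf)
labels <- c("0-17", "18-44", "45-64", "65+")

# Default (right = TRUE): bins are (0, 18], (18, 45], (45, 65], (65, Inf]
default_bins <- cut(ages, breaks = breaks, labels = labels)

# right = FALSE: bins are [0, 18), [18, 45), [45, 65), [65, Inf)
left_closed_bins <- cut(ages, breaks = breaks, labels = labels, right = FALSE)

data.frame(age = ages, default_bins, left_closed_bins)
```

With the default, age 0 becomes `NA` and age 18 is labelled "0-17"; with `right = FALSE` every age matches its label.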

Basic ggplot2 Visualization

ggplot(df, aes(x = date, y = count, color = category)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Trend Over Time",
    subtitle = "By category",
    x = "Date",
    y = "Count",
    color = "Category",
    caption = "Source: Dataset Name"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )
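The plot above expects long data: one row per date and category, with the value in a single `count` column. A minimal sketch of getting there from a wide table follows; the column names `confirmed` and `probable` and all values are made up for illustration.

```r
library(tidyr)
library(ggplot2)

# Hypothetical wide table: one column of counts per category
wide <- data.frame(
  date = as.Date(c("2024-01-01", "2024-01-08", "2024-01-15")),
  confirmed = c(4, 9, 6),
  probable = c(2, 3, 5)
)

# Reshape to one row per date and category
long <- wide |>
  pivot_longer(
    cols = c(confirmed, probable),
    names_to = "category",
    values_to = "count"
  )

# Same aesthetic mapping as the pattern above, trimmed to the essentials
p <- ggplot(long, aes(x = date, y = count, color = category)) +
  geom_line(linewidth = 1) +
  theme_minimal(base_size = 12)
```

Printing `p` draws the chart; `ggsave()` can then write it to a path such as output/figures/trend_plot.png.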

Tidyverse Style Guide Essentials

Naming Conventions

snake_case for objects and functions: case_counts, calculate_rate()
Verbs for functions: filter_outliers(), compute_summary()
Nouns for data: patient_data, surveillance_df
Avoid: dots in names (they read like S3 method dispatch, e.g. print.data.frame), single letters except in lambdas

Code Formatting

Indentation: 2 spaces (never tabs)
Line length: 80 characters maximum
Operators: Spaces around <-, =, +, |>, but not :, ::, $
Commas: Space after, never before
Pipes: New line after each |>

# Good
result <- data |>
  filter(year >= 2020) |>
  group_by(county) |>
  summarize(total = sum(cases))

# Bad
result<-data|>filter(year>=2020)|>group_by(county)|>summarize(total=sum(cases))

Assignment

Use <- for assignment, never = or ->
Use = only for function arguments

Comments

# Load and clean surveillance data ------------------------------------------

# Calculate age-adjusted rates
# Using direct standardization method per CDC guidelines
adjusted_rate <- calculate_adjusted_rate(df, standard_pop)

Package Ecosystem

Core Tidyverse (Always Load)

library(tidyverse)
# Loads: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
# (tidyverse 2.0.0 and later also attach lubridate)

Data Import/Export

| Task | Package | Key Functions |
|------|---------|---------------|
| CSV/TSV | readr | read_csv(), write_csv() |
| Excel | readxl, writexl | read_excel(), write_xlsx() |
| SAS/SPSS/Stata | haven | read_sas(), read_spss(), read_stata() |
| JSON | jsonlite | read_json(), fromJSON() |
| Databases | DBI, dbplyr | dbConnect(), tbl() |

Data Manipulation

| Task | Package | Key Functions |
|------|---------|---------------|
| Column cleaning | janitor | clean_names(), tabyl() |
| Date handling | lubridate | ymd(), mdy(), floor_date() |
| String operations | stringr | str_detect(), str_extract() |
| Missing data | naniar | vis_miss(), replace_with_na() |
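A short sketch of the date and string helpers listed above, using made-up values. The `notes` field and the ID pattern are hypothetical; `week_start = 1` is chosen here so that weeks begin on Monday.

```r
library(lubridate)
library(stringr)

# Parse each date format with the matching helper
d1 <- ymd("2024-03-15")
d2 <- mdy("03/15/2024")

# Round down to the start of the week (Monday, because week_start = 1)
week_of <- floor_date(d1, unit = "week", week_start = 1)

# Hypothetical free-text field
notes <- c("Case ID: A102, confirmed", "no id recorded", "Case ID: B7")
has_id <- str_detect(notes, "Case ID")
case_id <- str_extract(notes, "[A-Z][0-9]+")
```

Here `week_of` is 2024-03-11, and `case_id` is `NA` for the note without an ID, which keeps missing values explicit rather than dropping the row.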

Visualization

| Task | Package | Key Functions |
|------|---------|---------------|
| Core plotting | ggplot2 | ggplot(), geom_*() |
| Extensions | ggrepel, patchwork | geom_text_repel(), + operator |
| Interactive | plotly | ggplotly() |
| Tables | gt, kableExtra | gt(), kbl() |

Statistical Analysis

| Task | Package | Key Functions |
|------|---------|---------------|
| Model summaries | broom | tidy(), glance(), augment() |
| Regression | stats, lme4 | lm(), glm(), lmer() |
| Survival | survival | Surv(), survfit(), coxph() |
| Survey data | survey | svydesign(), svymean() |

Epidemiology & Public Health

| Task | Package | Key Functions |
|------|---------|---------------|
| Epi calculations | epiR | epi.2by2(), epi.conf() |
| Outbreak tools | incidence2, epicontacts | incidence(), make_epicontacts() |
| Disease mapping | SpatialEpi | expected(), eBayes() |
| Surveillance | surveillance | sts(), algo.farrington() |
| Rate calculations | epitools | riskratio(), oddsratio(), ageadjust.direct() |

Reproducibility Standards

Project Structure

project/
├── project.Rproj
├── renv.lock
├── CLAUDE.md          # Claude Code configuration
├── README.md
├── data/
│   ├── raw/           # Never modify
│   └── processed/     # Analysis-ready
├── R/                 # Custom functions
├── scripts/           # Pipeline scripts
├── analysis/          # Quarto documents
└── output/
    ├── figures/
    └── tables/

Quarto Document Header

---
title: "Analysis Title"
author: "Your Name"
date: today
format:
  html:
    toc: true
    code-fold: true
    embed-resources: true
execute:
  warning: false
  message: false
---

Package Management with renv

# Initialize (once per project)
renv::init()

# Snapshot dependencies after installing packages
renv::snapshot()

# Restore environment (for collaborators)
renv::restore()

Workflow Documentation

Always include at the top of scripts:

# ============================================================================
# Title: Analysis of [Subject]
# Author: [Name]
# Date: [Date]
# Purpose: [One-sentence description]
# Input: data/processed/clean_data.csv
# Output: output/figures/trend_plot.png
# ============================================================================

Common Analysis Patterns

Descriptive Statistics Table

df |>
  group_by(category) |>
  summarize(
    n = n(),
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    q25 = quantile(value, 0.25, na.rm = TRUE),
    q75 = quantile(value, 0.75, na.rm = TRUE)
  ) |>
  gt::gt() |>
  gt::fmt_number(columns = where(is.numeric), decimals = 2)
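One detail in a summary like this is that `n()` counts rows, including rows where the value is missing, while `na.rm = TRUE` drops them from the statistics. A minimal sketch on made-up data, adding a hypothetical `n_missing` column so the two can be read side by side:

```r
library(dplyr)

# Made-up data with one missing value in group "a"
toy <- tibble(
  category = c("a", "a", "a", "b", "b"),
  value = c(1, 3, NA, 10, 20)
)

toy_summary <- toy |>
  group_by(category) |>
  summarize(
    n = n(),                       # counts rows, including the NA
    n_missing = sum(is.na(value)), # so report missingness alongside it
    mean = mean(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    .groups = "drop"
  )
```

Group "a" reports `n = 3` but its mean of 2 is based on two observations, which `n_missing = 1` makes visible.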

Regression with Tidy Output

model <- glm(outcome ~ exposure + age + sex, data = df, family = binomial)

# Tidy coefficients (exponentiated, so estimates are odds ratios)
tidy_results <- broom::tidy(model, conf.int = TRUE, exponentiate = TRUE) |>
  select(term, estimate, conf.low, conf.high, p.value)

# Model-level fit statistics (AIC, BIC, deviance)
glance_results <- broom::glance(model)
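To make explicit what exponentiated logistic coefficients mean, here is a base R sketch on simulated data. The simulated values and effect sizes are assumptions for illustration only. It computes odds ratios with Wald intervals by exponentiating the estimates and their limits; intervals reported by `broom::tidy()` may differ slightly because they can be computed by a different method.

```r
set.seed(42)

# Simulated data; variable names mirror the model formula, values are made up
n <- 500
sim <- data.frame(
  exposure = rbinom(n, 1, 0.4),
  age = round(runif(n, 20, 80)),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
log_odds <- -2 + 0.8 * sim$exposure + 0.02 * sim$age
sim$outcome <- rbinom(n, 1, plogis(log_odds))

model <- glm(outcome ~ exposure + age + sex, data = sim, family = binomial)

# Wald 95% limits on the log-odds scale, then exponentiated to odds ratios
wald_ci <- confint.default(model)
or_table <- data.frame(
  term = names(coef(model)),
  odds_ratio = exp(coef(model)),
  conf_low = exp(wald_ci[, 1]),
  conf_high = exp(wald_ci[, 2]),
  row.names = NULL
)
```

An odds ratio above 1 for `exposure` indicates higher odds of the outcome among the exposed, holding age and sex fixed.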

Epi Curve (Epidemic Curve)

library(incidence2)

# Create incidence object
inc <- incidence(
  df,
  date_index = "onset_date",
  interval = "week",
  groups = "outcome_category"
)

# Plot
plot(inc) +
  labs(
    title = "Epidemic Curve",
    x = "Week of Onset",
    y = "Number of Cases"
  ) +
  theme_minimal()
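To see what the weekly aggregation behind an epi curve amounts to, the sketch below counts cases per week with dplyr and lubridate on a made-up line list. It is an illustration of the idea, not a description of how incidence2 works internally; the Monday week start is an assumption chosen here.

```r
library(dplyr)
library(lubridate)

# Made-up line list; onset_date mirrors the column name used above
linelist <- tibble(
  onset_date = as.Date(c(
    "2024-01-01", "2024-01-03", "2024-01-07",
    "2024-01-08", "2024-01-16"
  ))
)

# Assign each case to the Monday that starts its week, then count per week
weekly <- linelist |>
  mutate(week = floor_date(onset_date, unit = "week", week_start = 1)) |>
  count(week, name = "cases")
```

Note that `count()` only returns weeks that have at least one case, so weeks with zero cases need to be filled in (for example with `tidyr::complete()`) before plotting a continuous curve.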

Rate Calculation

# Age-adjusted rates using direct standardization
library(epitools)

# Stratum-specific counts and populations
result <- ageadjust.direct(
  count = df$cases,
  pop = df$population,
  stdpop = standard_population$pop # e.g., US 2000 standard
)
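Direct standardization is a weighted average of stratum-specific rates, with weights taken from the standard population's age distribution. The base R sketch below works through the arithmetic on made-up numbers for three age groups; the standard population shown is hypothetical, and no confidence interval is computed.

```r
# Made-up stratum data for three age groups
cases <- c(10, 20, 60)
population <- c(10000, 10000, 10000)
std_pop <- c(5000, 3000, 2000) # hypothetical standard population

# Rate within each stratum, and each stratum's share of the standard population
stratum_rates <- cases / population
weights <- std_pop / sum(std_pop)

crude_rate <- sum(cases) / sum(population)
adjusted_rate <- sum(weights * stratum_rates)

# Per 100,000 population
round(c(crude = crude_rate, adjusted = adjusted_rate) * 1e5, 1)
```

The crude rate is 300 per 100,000 and the adjusted rate is 230 per 100,000: the standard population puts less weight on the oldest, highest-rate stratum than the study population does.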

Error Handling

Defensive Data Checks

# Validate data before analysis
stopifnot(
  "Data frame is empty" = nrow(df) > 0,
  "Missing required columns" = all(c("id", "date", "value") %in% names(df)),
  "Duplicate IDs found" = !any(duplicated(df$id))
)

# Informative warnings for data quality issues
if (sum(is.na(df$key_var)) > 0) {
  warning(sprintf(
    "%d missing values in key_var (%.1f%%)",
    sum(is.na(df$key_var)),
    100 * mean(is.na(df$key_var))
  ))
}
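When the same checks run in several scripts, they can be collected in one function. The sketch below is one possible shape: `validate_input()` is a hypothetical helper, not part of any package, and it checks column presence first so the later checks never touch a column that does not exist. It also names the missing columns in the message.

```r
# Hypothetical helper: collects the checks into one reusable function
validate_input <- function(df, required = c("id", "date", "value")) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop(
      "Missing required columns: ", paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  if (nrow(df) == 0) {
    stop("Data frame is empty", call. = FALSE)
  }
  if (anyDuplicated(df$id) > 0) {
    stop("Duplicate IDs found", call. = FALSE)
  }
  invisible(df)
}

# Made-up input that lacks the date column
bad <- data.frame(id = c(1, 1), value = c(3.2, 4.1))

# Capture the error message instead of halting the script
msg <- tryCatch(
  {
    validate_input(bad)
    "ok"
  },
  error = function(e) conditionMessage(e)
)
msg
```

Because the function returns its input invisibly, it can also sit at the top of a pipeline: `raw_data |> validate_input() |> filter(...)`.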

Safe File Operations

# Check file exists before reading
if (!file.exists(filepath)) {
  stop(sprintf("File not found: %s", filepath))
}

# Create directories if needed
dir.create("output/figures", recursive = TRUE, showWarnings = FALSE)
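The two checks above combine naturally into a write-then-read round trip. The sketch below uses a temporary directory and a hypothetical `read_checked()` helper. It relies on base `read.csv()` only to stay dependency-free; `readr::read_csv()` could be swapped in.

```r
# Hypothetical helper: read a CSV only if the file exists
read_checked <- function(filepath) {
  if (!file.exists(filepath)) {
    stop(sprintf("File not found: %s", filepath), call. = FALSE)
  }
  utils::read.csv(filepath, stringsAsFactors = FALSE)
}

# Work inside a temporary directory so nothing in the project is touched
work_dir <- file.path(tempdir(), "safe-io-demo")
out_dir <- file.path(work_dir, "output", "tables")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Write a small made-up table, then read it back through the checked reader
out_file <- file.path(out_dir, "summary.csv")
utils::write.csv(
  data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)),
  out_file,
  row.names = FALSE
)
round_trip <- read_checked(out_file)

# A missing file produces a clear message instead of a cryptic failure
missing_msg <- tryCatch(
  read_checked(file.path(work_dir, "does_not_exist.csv")),
  error = function(e) conditionMessage(e)
)
```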

Performance Tips

For Large Datasets

# Use data.table for >1M rows
library(data.table)
dt <- fread("large_file.csv")

# Or use arrow for very large/parquet files
library(arrow)
df <- read_parquet("data.parquet")

# Lazy evaluation with duckdb
# tbl() comes from dplyr, which library(tidyverse) attaches
library(duckdb)
con <- dbConnect(duckdb())
df_lazy <- tbl(con, "data.csv")
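Once data is in a data.table, filtering and grouped aggregation happen in a single bracket call of the form `dt[i, j, by]`. A minimal sketch on made-up data, mirroring the county totals used earlier in the style examples:

```r
library(data.table)

# Made-up data standing in for a large file read with fread()
dt <- data.table(
  county = c("A", "A", "B", "B", "B"),
  year = c(2019, 2021, 2020, 2021, 2022),
  cases = c(5, 7, 2, 4, 6)
)

# dt[i, j, by]: filter rows in i, compute in j, group with by
totals <- dt[year >= 2020, .(total = sum(cases)), by = county]
```

This is the data.table counterpart of `filter() |> group_by() |> summarize()`; which one to use is a trade-off between speed on large inputs and consistency with the rest of a tidyverse pipeline.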

Vectorization Over Loops

# Good: vectorized
df$rate <- df$cases / df$population * 100000

# Avoid: row-by-row loop
for (i in seq_len(nrow(df))) {
  df$rate[i] <- df$cases[i] / df$population[i] * 100000
}
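A quick check on made-up numbers that the two approaches give the same result, plus a note on loop indices for the cases where a loop is unavoidable: `seq_len()` is safe for zero-row data, while `1:nrow(df)` is not.

```r
# Made-up counts and populations for three areas
df <- data.frame(
  cases = c(12, 30, 7),
  population = c(48000, 150000, 9500)
)

# Vectorized: one expression over whole columns
rate_vectorized <- df$cases / df$population * 100000

# Loop version, shown only for comparison
rate_loop <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  rate_loop[i] <- df$cases[i] / df$population[i] * 100000
}

all.equal(rate_vectorized, rate_loop)

# seq_len() yields an empty sequence for zero rows, whereas 1:0 is c(1, 0)
length(seq_len(0))
length(1:0)
```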

Additional Resources

For detailed patterns, consult:

Tidyverse Style Guide: https://style.tidyverse.org/
R for Data Science (2e): https://r4ds.hadley.nz/
The Epidemiologist R Handbook: https://epirhandbook.com/
Quarto Documentation: https://quarto.org/

Version History

v1.0.0 (2025-12-04): Initial release for PubHealthAI community
