Data Visualization
Decision framework for translating data into effective visual form. Synthesizes Bertin, Cleveland, Tufte, Cairo, Wilke, and Knaflic — optimized for scientific work with cosmology-specific conventions.
The Intake Protocol
Before plotting, establish two dimensions:
1. Data Structure Analysis
Identify what you're visualizing:
| Data Type | Description | Likely Forms |
|---|---|---|
| Amounts | Values across categories | Bar, dot plot, heatmap |
| Distributions | Spread/shape of values | Histogram, KDE, violin, ridgeline |
| X-Y Relationships | Continuous variables | Scatter, line, confidence bands |
| Uncertainty | Error on measurements | Error bars, bands, gradient ribbons |
| Proportions | Parts of whole | Stacked bar, pie (rarely) |
| Spatial/Maps | Geographic or sky data | Mollweide, healpix, choropleth |
| Correlations | Variable relationships | Covariance matrix, triangle plot |
2. Communication Mode
Determine the venue—this switches the entire rule set:
Mode A: Analytical/Paper
- Audience: Expert peers, reviewers
- Optimize for: Precision, black/white printing, convention
- Philosophy: Tufte/Cleveland/Wilke—density is permitted, accuracy is paramount
- Color: Restrained, colorblind-safe, grayscale-compatible
- Default: Assume this mode unless another is specified
Mode B: Presentation/Outreach
- Audience: Mixed expertise, attention-competitive
- Optimize for: Impact, engagement, narrative clarity
- Philosophy: Cairo/McCandless/Knaflic—preattentive pop, visual hierarchy
- Color: Bold accent colors, clear entry points
- Use when: Talks, posters, press releases, social media
The Decision Framework
Route from data to visualization form:
Step 1: Analyze Variables (Bertin)
For each variable, classify:
- Quantitative: Continuous numeric (position, intensity, redshift)
- Ordered: Categorical with sequence (low/med/high, redshift bins)
- Categorical: Nominal groups (experiments, instruments, sky regions)
Check for uncertainty: Is it the error on a mean (draw discrete error bars) or the intrinsic spread of the population (draw a continuous band)?
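The two kinds of uncertainty differ numerically as well as visually. A minimal sketch using only the standard library; the sample values are invented for illustration:

```python
import math
import statistics

# Hypothetical repeated measurements of one quantity (made-up values).
samples = [9.8, 10.4, 10.1, 9.6, 10.3, 9.9, 10.2, 10.5]

n = len(samples)
mean = statistics.mean(samples)
spread = statistics.stdev(samples)  # intrinsic spread: sample standard deviation
sem = spread / math.sqrt(n)         # error on the mean: shrinks as n grows

print(f"mean          = {mean:.2f}")
print(f"spread (std)  = {spread:.2f}  (width of a band around the population)")
print(f"error on mean = {sem:.2f}  (length of an error bar on the mean)")
```

Plotting the spread where the error on the mean is intended (or the reverse) changes the apparent significance by a factor of sqrt(n), so the choice should be stated in the caption.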
Step 2: Select Encoding (Cleveland)
Match importance to perceptual accuracy:
| Rank | Encoding | Use For |
|---|---|---|
| 1 | Position on common scale | Primary comparisons, precise values |
| 2 | Position on non-aligned scales | Secondary comparisons |
| 3 | Length | Bar charts (amounts only) |
| 4 | Angle/Slope | Avoid for precise reading |
| 5 | Area | Gestalt impressions, bubble charts |
| 6 | Color saturation | Tertiary encoding, density |
Rule: If precise comparison is needed, use position. If gestalt impression is needed, use color/area.
Step 3: Select Form (Wilke)
Consult viz-catalog.md for the specific form. Key mappings:
| You Have | Consider |
|---|---|
| Spectrum (continuous x, continuous y, uncertainty) | Line + confidence band, residual subplot |
| Correlation/covariance matrix | Heatmap, diverging colormap, white at zero |
| Parameter posteriors | Triangle plot, ridgeline, violin |
| Comparison across groups | Small multiples rather than overlay when there are more than 4 groups |
| Time series | Line, banking to 45 degrees |
| Amounts across categories | Dot plot (Cleveland) rather than bar chart |
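"Banking to 45 degrees" means choosing the plot's aspect ratio so that typical line segments sit near 45 degrees, where slope differences are easiest to judge. One common variant is median-absolute-slope banking, sketched below. The function name `bank_to_45` is hypothetical, and the sketch assumes the series has at least one non-flat segment.

```python
import statistics

def bank_to_45(x, y):
    """Return height/width of the plot area that banks the median slope to 45 degrees.

    A segment with data slope s is drawn with physical slope
    s * (height / y_range) / (width / x_range). Setting the median of the
    absolute physical slopes to 1 gives the ratio returned here.
    """
    points = list(zip(x, y))
    slopes = [
        abs((y1 - y0) / (x1 - x0))
        for (x0, y0), (x1, y1) in zip(points, points[1:])
        if x1 != x0
    ]
    slopes = [s for s in slopes if s > 0]  # flat segments carry no slope information
    x_range = max(x) - min(x)
    y_range = max(y) - min(y)
    return y_range / (x_range * statistics.median(slopes))

# A zigzag spanning 4 units in x and 1 unit in y wants a wide, short panel.
aspect = bank_to_45([0, 1, 2, 3, 4], [0, 1, 0, 1, 0])
print(aspect)
```

In matplotlib, the result can be applied with `ax.set_box_aspect(aspect)`, which is available from version 3.3.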
Step 4: Apply Mode-Specific Rules
If Mode A (Paper):
- Enforce strict linear/log scaling
- No bubble charts for precise quantities
- No dual y-axes
- Redundant encoding (shape + color) for colorblind safety
- Direct labeling over legends when <=4 series
- Light grid lines, subordinate to data
If Mode B (Outreach):
- Establish visual hierarchy—most important data most salient
- One clear entry point (where does eye go first?)
- Bolder colors, but maintain accuracy
- Annotations that guide reading
- Title states the takeaway, not the topic
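The mode switch can be kept out of the plotting code by holding each rule set as a dictionary of matplotlib rc settings. This is a sketch under assumptions: the preset names, the `plot_in_mode` helper, and the specific numbers are illustrative starting points, not values prescribed by this framework.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical presets; tune the numbers to the venue.
MODE_STYLES = {
    "paper": {  # Mode A: restrained, grid subordinate to data
        "font.size": 9,
        "lines.linewidth": 1.0,
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linewidth": 0.5,
    },
    "outreach": {  # Mode B: larger type, heavier lines, no grid clutter
        "font.size": 16,
        "lines.linewidth": 2.5,
        "axes.grid": False,
        "axes.titleweight": "bold",
    },
}

def plot_in_mode(x, y, mode="paper"):
    """Draw one line under the rc settings for `mode` (Mode A is the default)."""
    with plt.rc_context(MODE_STYLES[mode]):
        fig, ax = plt.subplots()
        (line,) = ax.plot(x, y)
    return fig, ax, line

fig, ax, line = plot_in_mode([1, 2, 3], [1, 4, 9], mode="outreach")
```

The artists are created inside the `rc_context` block because most rc settings are read when an artist is constructed, not when the figure is saved.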
Cosmology-Specific Overrides
These conventions override general principles for domain consistency:
Power Spectra
- Flatten steeply falling spectra: Multiply by a power of the x-axis variable to reveal percent-level features
  - Angular: Plot ell^n C_ell (commonly D_ell = ell(ell+1) C_ell / (2 pi), but the factor varies)
  - Matter: Plot k^3 P(k) or Delta^2(k) to flatten
  - Correlation functions: Plot theta xi(theta) or similar
- Log-linear preferred: Log scale on x (multipole/k), linear on y after flattening
  - Reveals small differences hidden by log-log compression
  - Reserve log-log for cases where dynamic range is the message
  - Label x-axis with actual values (10, 100, 1000), not exponents
- Residual panel: Show (data - model)/sigma or data/model below main panel
- Uncertainty: Confidence bands if dense sampling, error bars if sparse
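The power-spectrum conventions above combine into a two-panel figure. The sketch below uses a toy power law with a small wiggle as a stand-in for a spectrum; it is not a physical model, and the amplitudes and noise level are invented.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter

rng = np.random.default_rng(0)

# Toy spectrum (NOT physical): steep power law with a percent-level wiggle.
ell = np.unique(np.round(np.logspace(np.log10(2), 3, 40)))
cl_model = 1e3 * ell**-2.0 * (1 + 0.05 * np.sin(ell / 60.0))
sigma = 0.03 * cl_model
cl_data = cl_model + rng.normal(0.0, sigma)

def to_dl(ell, cl):
    """Flatten C_ell into D_ell = ell (ell + 1) C_ell / (2 pi)."""
    return ell * (ell + 1) * cl / (2 * np.pi)

fig, (ax, ax_res) = plt.subplots(
    2, 1, sharex=True, figsize=(5, 4),
    gridspec_kw={"height_ratios": [3, 1], "hspace": 0.05},
)

# Main panel: log x, linear y after flattening; sparse points get error bars.
ax.plot(ell, to_dl(ell, cl_model), color="0.3", lw=1)
ax.errorbar(ell, to_dl(ell, cl_data), yerr=to_dl(ell, sigma), fmt="o", ms=3, color="k")
ax.set_xscale("log")
ax.set_ylabel(r"$D_\ell$")

# Residual panel in units of sigma.
ax_res.errorbar(ell, (cl_data - cl_model) / sigma, yerr=1.0, fmt="o", ms=3, color="k")
ax_res.axhline(0.0, color="0.3", lw=1)
ax_res.set_xlabel(r"multipole $\ell$")
ax_res.set_ylabel(r"$(\mathrm{data} - \mathrm{model}) / \sigma$")
ax_res.xaxis.set_major_formatter(ScalarFormatter())  # ticks read 10, 100, 1000
```

For densely sampled spectra, the `errorbar` call in the main panel would be swapped for `fill_between` to draw a band instead.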
Covariance Matrices
- Diverging colormap required (RdBu, coolwarm)
- White/neutral at zero, with color limits symmetric about zero ([-1, 1] for correlation matrices); center at 1 only when plotting a ratio of matrices
- Explicit colorbar with position-based lookup for precise values
- Consider: Showing only upper/lower triangle for symmetry
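A minimal sketch of a covariance heatmap with the neutral color pinned at zero and the redundant triangle masked. The covariance here is estimated from random synthetic samples and is purely illustrative.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Synthetic covariance from random correlated samples (illustrative only).
samples = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
cov = np.cov(samples, rowvar=False)

# Symmetric limits about zero put the neutral color of the diverging map at zero.
vlim = np.abs(cov).max()

# Mask the redundant upper triangle; the diagonal is kept.
upper = np.triu(np.ones_like(cov, dtype=bool), k=1)
masked = np.ma.masked_where(upper, cov)

fig, ax = plt.subplots(figsize=(4, 3.5))
im = ax.imshow(masked, cmap="RdBu_r", vmin=-vlim, vmax=vlim)
fig.colorbar(im, ax=ax, label="covariance")
```

When the positive and negative ranges are very unequal, `matplotlib.colors.TwoSlopeNorm(vcenter=0)` is an alternative that keeps zero neutral without forcing symmetric limits.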
Triangle/Corner Plots
- Standard layout: 1D posteriors on diagonal, 2D contours off-diagonal
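- Contour levels: 68% and 95% credible regions (loosely called 1sigma and 2sigma; for a 2D Gaussian a true 1sigma contour encloses only about 39%, so state the convention)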
- Consistent axis ranges across all panels showing same parameter
- Direct parameter labels on axes, not legend
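Contours that enclose a given probability mass are usually found by sorting the binned density. The sketch below shows that approach on a toy unit Gaussian; the helper name `density_levels` is hypothetical, and dedicated corner-plot packages handle this internally.

```python
import numpy as np

def density_levels(hist, fractions=(0.68, 0.95)):
    """Density thresholds whose super-level sets enclose the given mass fractions."""
    flat = np.sort(hist.ravel())[::-1]  # densest bins first
    cum = np.cumsum(flat) / flat.sum()  # enclosed mass as bins are added
    return [flat[np.searchsorted(cum, f)] for f in fractions]

rng = np.random.default_rng(2)
x, y = rng.normal(size=(2, 200_000))    # toy posterior samples: unit 2D Gaussian
hist, xedges, yedges = np.histogram2d(x, y, bins=60)

lvl68, lvl95 = density_levels(hist)

# Check: mass in bins at or above the 68% threshold.
enclosed68 = hist[hist >= lvl68].sum() / hist.sum()
print(f"enclosed mass at 68% level: {enclosed68:.3f}")
```

`ax.contour` expects increasing levels, so the thresholds would be passed as `levels=[lvl95, lvl68]`, with the histogram transposed to match matplotlib's row/column convention.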
Sky Maps (Healpix/Mollweide)
- Projection matters: Mollweide for full-sky, orthographic for regions
- Graticule: RA/Dec grid, labeled at edges
- Sequential colormap for intensity, diverging for residuals
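A full-sky Mollweide view can be sketched with matplotlib's built-in projection. This assumes a toy field on a regular longitude/latitude grid; a real HEALPix map would normally go through a dedicated library instead.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Cell edges in radians: matplotlib's Mollweide axes expect lon in [-pi, pi]
# and lat in [-pi/2, pi/2].
lon_edges = np.linspace(-np.pi, np.pi, 181)
lat_edges = np.linspace(-np.pi / 2, np.pi / 2, 91)
lon_c = 0.5 * (lon_edges[:-1] + lon_edges[1:])
lat_c = 0.5 * (lat_edges[:-1] + lat_edges[1:])
lon2d, lat2d = np.meshgrid(lon_c, lat_c)

# Toy non-negative intensity field (illustrative only).
intensity = np.cos(lat2d) ** 2 * (1 + 0.5 * np.cos(3 * lon2d))

fig = plt.figure(figsize=(6, 3.2))
ax = fig.add_subplot(111, projection="mollweide")
# Sequential colormap because the quantity is an intensity, not a residual.
im = ax.pcolormesh(lon_edges, lat_edges, intensity, cmap="viridis", shading="flat")
ax.grid(True, color="white", alpha=0.4, lw=0.5)  # graticule
fig.colorbar(im, ax=ax, orientation="horizontal", pad=0.08,
             label="intensity (arbitrary units)")
```

For a residual map, the same call would take a diverging colormap with limits symmetric about zero.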
Error Representation
- Asymmetric errors: Make asymmetry visually obvious
- Bands vs bars: Use bands for continuous functions, bars for discrete points
- Multiple sigma levels: Gradient opacity (dark = 1sigma, light = 2sigma)
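The three error conventions above can be shown in one matplotlib sketch. The curve, band widths, and point errors are toy values chosen only to make the encodings visible.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)
mean = np.sin(x)          # toy continuous prediction
sigma = 0.1 + 0.02 * x    # toy 1-sigma width

fig, ax = plt.subplots()
# Wide band first and light; the narrow band overlaps it, so the 1-sigma
# region ends up darker than the 2-sigma region.
ax.fill_between(x, mean - 2 * sigma, mean + 2 * sigma, color="C0", alpha=0.2, lw=0)
ax.fill_between(x, mean - sigma, mean + sigma, color="C0", alpha=0.4, lw=0)
ax.plot(x, mean, color="C0", lw=1)

# Discrete points with asymmetric errors: yerr takes (lower, upper) rows.
xp = np.array([1.0, 4.0, 7.0])
yp = np.sin(xp)
err_lo = np.array([0.10, 0.30, 0.15])
err_hi = np.array([0.35, 0.10, 0.15])
yerr = np.vstack([err_lo, err_hi])
ax.errorbar(xp, yp, yerr=yerr, fmt="o", color="k", capsize=3)
```

Caps on the error bars make the unequal lower and upper extents easier to see than bare lines.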
Encoding Principles
Brief rules from perceptual science:
Preattentive Attributes (Cairo)
These "pop out" in <250ms—use for key distinctions:
- Color (hue)
- Size
- Position
- Orientation
If your main finding should be visible at a glance, encode it preattentively.
Working Memory Limits
Humans hold ~4 chunks in working memory:
- Legends with >4 items require constant back-and-forth
- Direct labeling dramatically reduces cognitive load
- Group by meaningful categories to chunk (8 items -> 2 groups of 4)
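A small sketch of the chunking rule: group series by a meaningful category and flag any group that still exceeds the limit. The helper name, the limit constant, and the survey labels are hypothetical.

```python
from collections import defaultdict

WORKING_MEMORY_LIMIT = 4  # approximate chunk limit quoted above

def chunk_by_category(series):
    """Group (label, category) pairs by category and flag oversized groups."""
    groups = defaultdict(list)
    for label, category in series:
        groups[category].append(label)
    too_big = [c for c, labels in groups.items() if len(labels) > WORKING_MEMORY_LIMIT]
    return dict(groups), too_big

# Hypothetical example: eight curves that fall into two natural families.
series = [(f"survey {i}", "space") for i in range(1, 5)]
series += [(f"survey {i}", "ground") for i in range(5, 9)]

groups, too_big = chunk_by_category(series)
for category, labels in groups.items():
    print(category, labels)
```

Each group can then become one panel of a small-multiples figure, or one color family with direct labels.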
Redundant Encoding (Wilke)
Never rely on color alone:
- Shape + color for categories
- Position + color for emphasis
- Keeps the figure readable for colorblind viewers and on washed-out projectors
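Shape-plus-color encoding is straightforward to set up once each category carries its own style dictionary. In this sketch the experiment names and data are invented, and the hex colors are taken from the Okabe-Ito colorblind-safe palette.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# Each category gets its own marker AND its own color.
STYLES = {
    "experiment A": {"marker": "o", "color": "#0072B2"},
    "experiment B": {"marker": "s", "color": "#D55E00"},
    "experiment C": {"marker": "^", "color": "#009E73"},
}

fig, ax = plt.subplots()
for i, (name, style) in enumerate(STYLES.items()):
    # Toy clusters, offset so the categories are visible.
    x = rng.normal(loc=i, scale=0.3, size=30)
    y = rng.normal(loc=i, scale=0.3, size=30)
    ax.scatter(x, y, s=20, label=name, **style)
ax.legend(frameon=False)
```

If the colors collapse in print or projection, the marker shapes still separate the categories.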
The Refinement Loop
After generating the plot, inspect against:
The Squint Test (Knaflic)
Squint at the figure. What stands out? If it's not your main finding, you have:
- Clutter competing with signal
- Wrong visual hierarchy
- Preattentive attributes on wrong elements
Data-Ink Ratio (Tufte)
For each element, ask: "Does this earn its ink?"
- Remove chart frames if not essential
- Lighten or remove gridlines
- Replace legends with direct labels
- Remove redundant axis lines
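The checklist above translates into a few matplotlib calls. A minimal sketch; the helper names `declutter` and `label_directly` are hypothetical, and the gray levels are a matter of taste.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

def declutter(ax):
    """Drop the top/right frame and push a light grid behind the data."""
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)
    ax.grid(True, color="0.9", linewidth=0.5)
    ax.set_axisbelow(True)  # grid is drawn under the data
    return ax

def label_directly(ax, line, text):
    """Label a line at its right-hand end instead of using a legend entry."""
    x, y = line.get_xdata()[-1], line.get_ydata()[-1]
    return ax.annotate(text, xy=(x, y), xytext=(4, 0), textcoords="offset points",
                       va="center", color=line.get_color())

x = np.linspace(0, 1, 50)
fig, ax = plt.subplots()
for name, power in (("linear", 1), ("quadratic", 2)):
    (line,) = ax.plot(x, x**power)
    label_directly(ax, line, name)
declutter(ax)
```

Matching the label color to the line color lets the label do the work of the legend without an extra lookup.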
The 1+1=3 Principle (Tufte)
Two adjacent elements create a third: the shape of the space between them. Check for:
- Dense grids creating moire
- Grouped bars creating unintended rhythms
- Close parallel lines creating "third" shapes
Colorblind Check
Verify with a color-vision-deficiency (CVD) simulator; viridis is designed to be CVD-safe. Test: Would the message survive grayscale printing?
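The grayscale half of this check can be automated by comparing relative luminance. The sketch below uses the standard sRGB linearization and Rec. 709 luminance weights; the 0.1 minimum gap is an assumed threshold, not an established standard, and the check does not replace a CVD simulation.

```python
def relative_luminance(hex_color):
    """Relative luminance of an sRGB hex color, from 0 (black) to 1 (white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) / 255 for i in (0, 2, 4))

    def linearize(c):
        # Inverse of the sRGB transfer function.
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)

def survives_grayscale(colors, min_gap=0.1):
    """True if neighboring luminances all differ by at least min_gap (assumed threshold)."""
    lum = sorted(relative_luminance(c) for c in colors)
    return all(hi - lo >= min_gap for lo, hi in zip(lum, lum[1:]))

print(survives_grayscale(["#000000", "#808080", "#ffffff"]))
print(survives_grayscale(["#777777", "#808080"]))
```

Palettes that fail this check can still work in a grayscale-printed paper if a second channel such as marker shape or line style carries the distinction.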
Reference Files
Consult as needed:
- viz-catalog.md — Form directory organized by visualization function
- color-palettes.md — Colormaps, categorical palettes, Porch Morning theme
- design-system.md — Typography, decluttering checklist, styling
Library preference: Use seaborn over raw matplotlib when possible. Seaborn provides cleaner defaults and better statistical visualization primitives.
Quick Reference: Common Mistakes
| Mistake | Fix |
|---|---|
| Jet/rainbow colormap | Use forestdawn (diverging) or mako/rocket (sequential) |
| >4 colors in legend | Small multiples or direct labeling |
| Dual y-axes | Two separate plots or faceting |
| 3D effects | Never. Use 2D with color/facets |
| Pie charts for comparison | Dot plot or bar chart |
| Bar chart not starting at zero | Start at zero (length encoding) or use dot plot |
| Truncated axis exaggerating effect | Show full range or use log scale |
| Heavy matplotlib defaults | Apply decluttering checklist |