
Data Analysis Patterns

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the command below and send it to your AI assistant to install this skill:

Install skill "data-analysis-patterns" with this command: npx skills add delphine-l/claude_global/delphine-l-claude-global-data-analysis-patterns

Expert guidance for making critical decisions in data analysis workflows, particularly around aggregation, recalculation, and maintaining analytical integrity.

When to Use This Skill

  • Deciding whether to recalculate from raw data vs reuse aggregated data

  • Changing category definitions in existing analyses

  • Ensuring accuracy in publication-quality analyses

  • Handling conflated features that need separation

  • Optimizing analysis pipelines without sacrificing correctness

  • Merging multi-source datasets with composite keys

  • Handling DataFrame type conversion issues during enrichment

Core Patterns

  1. Recalculating vs Reusing Aggregated Data

When you have pre-aggregated data but need different categories or groupings:

  • Recalculate from raw data when category definitions fundamentally change, previously conflated features need separation, aggregation criteria change, or publication accuracy is critical

  • Approximation may be acceptable for exploratory analysis, when categories align closely, or when raw data is unavailable

  • Rule: If you can't confidently map old to new without information loss, recalculate

For detailed patterns and code examples, see data-manipulation-recipes.md
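As a minimal sketch of the recalculation rule (all column and category names here are hypothetical, not taken from the recipe file), rebuild categories directly from the per-scaffold table instead of remapping old aggregates:

```python
import pandas as pd

# Hypothetical raw data: one row per scaffold with telomere observations.
raw = pd.DataFrame({
    "scaffold": ["s1", "s2", "s3", "s4"],
    "terminal_telomere": [True, False, True, False],
    "interstitial_telomere": [False, True, True, False],
})

# The old aggregation conflated terminal and interstitial presence;
# recalculating from raw rows yields cleanly separated categories.
def categorize(row):
    if row["terminal_telomere"] and row["interstitial_telomere"]:
        return "both"
    if row["terminal_telomere"]:
        return "terminal_only"
    if row["interstitial_telomere"]:
        return "interstitial_only"
    return "none"

raw["telomere_class"] = raw.apply(categorize, axis=1)
counts = raw["telomere_class"].value_counts()
```

Because each row is recategorized from raw observations, no information is lost in the old-to-new mapping.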

  2. Composite Keys for Multi-Source Data Merging

When merging datasets from multiple sources, a single identifier often isn't unique enough:

  • Create composite keys by concatenating multiple fields with a delimiter (| or ::)

  • Always verify uniqueness after creating the composite key

  • Handle duplicates explicitly before merging (latest date, then most complete record)

  • Remove composite key before final save (temporary working column)

For implementation details, see data-manipulation-recipes.md
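A hedged sketch of the composite-key workflow above (field names invented for illustration):

```python
import pandas as pd

left = pd.DataFrame({
    "species": ["A", "A", "B"],
    "assembly": ["v1", "v2", "v1"],
    "length": [10, 12, 20],
})
right = pd.DataFrame({
    "species": ["A", "A", "B"],
    "assembly": ["v1", "v2", "v1"],
    "coverage": [0.9, 0.95, 0.8],
})

# Build the composite key with an unambiguous delimiter.
for df in (left, right):
    df["_key"] = df["species"] + "|" + df["assembly"]

# Verify uniqueness before merging; fail loudly if it does not hold.
assert left["_key"].is_unique and right["_key"].is_unique, "composite key not unique"

merged = left.merge(right[["_key", "coverage"]], on="_key", how="left")
merged = merged.drop(columns="_key")  # temporary working column
```

If the uniqueness assertion fails, deduplicate explicitly (e.g. sort by date and keep the latest, most complete record) before merging.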

  3. Separating Conflated Features

When one metric combines multiple independent features, separate into independent analyses:

  • Identify which features are mixed in each category

  • Create separate category systems for each independent feature

  • Enables clear interpretation and future independent analysis

For examples, see data-manipulation-recipes.md
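For instance, a single label that mixes two independent features (hypothetical values combining topology and compartment) can be split into two category columns:

```python
import pandas as pd

df = pd.DataFrame({"genome_type": ["circular_plastid", "linear_plastid",
                                   "circular_mito", "linear_mito"]})

# One label conflates topology and compartment; split it into two
# independent category systems that can be analyzed separately.
df[["topology", "compartment"]] = df["genome_type"].str.split("_", expand=True)
```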

  4. DataFrame Type Conversion During Enrichment

Type mismatches are common when enriching DataFrames from external sources:

  • Check target column dtype before assignment

  • Convert values to match target dtype (easier than converting whole column)

  • Use helper functions to encapsulate type checking logic

  • Handle NaN explicitly with pd.notna() checks

For the type-safe assignment pattern and examples, see data-manipulation-recipes.md
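One possible shape for the type-safe assignment helper (a sketch under the assumptions above, not the recipe file's exact implementation):

```python
import pandas as pd

def assign_typed(df, idx, col, value):
    """Convert a single value to the target column's dtype before
    assignment, skipping missing values explicitly."""
    if not pd.notna(value):
        return  # handle NaN/None explicitly: leave the cell untouched
    # Converting one value is easier than converting the whole column.
    df.loc[idx, col] = df[col].dtype.type(value)

df = pd.DataFrame({"length": [100, 200]})      # int64 column
assign_typed(df, 0, "length", "150")           # string from an external source
assign_typed(df, 1, "length", None)            # no-op, NaN handled explicitly
```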

  5. AWS Data Enrichment Patterns

When enriching tabular data from AWS S3 or external repositories:

  • Use multi-source path resolution (direct lookup + path inference)

  • Auto-detect most complete input file for idempotent re-runs

  • Add columns idempotently (check before adding)

  • Use TEST_MODE for initial validation before full enrichment

For implementation patterns, see enrichment-patterns.md
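The idempotent-column and TEST_MODE ideas might look like this in plain pandas (bucket path and column names are hypothetical; no AWS calls shown):

```python
import pandas as pd

TEST_MODE = True  # validate on a small slice before a full enrichment run

def add_column_idempotent(df, col, default=pd.NA):
    # Only create the column if a previous run has not already added it,
    # so re-running the enrichment never clobbers filled-in values.
    if col not in df.columns:
        df[col] = default
    return df

df = pd.DataFrame({"sample": ["a", "b", "c"]})
add_column_idempotent(df, "s3_path")
df.loc[0, "s3_path"] = "s3://bucket/a.bam"  # hypothetical enrichment result
add_column_idempotent(df, "s3_path")        # second call is a no-op

work = df.head(2) if TEST_MODE else df      # small slice for validation
```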

  6. Critical Data Validation

Column names don't always match their content. Always verify against source code before using categorical columns:

  • Locate source script and verify assignment logic

  • Add assertions for biological plausibility

  • Test with known control samples

  • Document and fix any mismatches found

For the full verification workflow and prevention checklist, see validation-patterns.md
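Plausibility assertions and a known-control check could be sketched like this (species, values, and ranges are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["mouse", "human"],
    "chromosome_count": [40, 46],
    "sex": ["F", "M"],
})

# Assert biological plausibility rather than trusting the column name.
assert df["chromosome_count"].between(2, 200).all(), "implausible karyotype"
assert set(df["sex"]).issubset({"F", "M", "unknown"}), "unexpected sex labels"

# Test with a known control sample: human should be 46.
control = df.loc[df["species"] == "human", "chromosome_count"].item()
assert control == 46, "control sample failed; check the source script"
```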

  7. Data Provenance Verification

Derived columns may use inferior sources, causing silent data loss:

  • Compare derived column coverage against likely source columns

  • Cross-tabulate to verify mapping consistency

  • Prefer reclassifying from rich source columns over merging sparse external files

For diagnostic patterns, see validation-patterns.md
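A quick diagnostic for the coverage and consistency checks above (column names hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "habitat_detailed": ["marine", "freshwater", "marine", "terrestrial"],
    "habitat_derived":  ["marine", None,         "marine", None],
})

# Compare coverage: here the derived column silently lost half the data.
derived_cov = df["habitat_derived"].notna().mean()
source_cov = df["habitat_detailed"].notna().mean()

# Cross-tabulate to check that, where both exist, the mapping is consistent.
xtab = pd.crosstab(df["habitat_detailed"], df["habitat_derived"])
```

A large coverage gap is the signal to reclassify from the rich source column instead of keeping the sparse derived one.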

  8. Organizing Analysis Text for Token Efficiency

Separate computation (notebooks) from interpretation (markdown files):

  • Create analysis_files/ directory with per-figure markdown files

  • Keep notebooks for code, analysis files for interpretation

  • Token reduction: 98% (1.1M tokens notebook vs 22K tokens analysis files)

For directory structure and writing guidelines, see analysis-organization.md

  9. Multi-Factor Experimental Design Analysis

When experimental design has multiple factors:

  • Use three-category design to isolate individual factor effects

  • Compare pairs controlling for one factor at a time

  • Identify synergistic, dominant, or antagonistic interactions

For interpretation framework and examples, see analysis-interpretation.md

  10. Interpreting Paradoxical Results

When one category performs better on metric X but worse on related metrics Y and Z:

  • Apply the trade-off hypothesis framework

  • Document counter-intuitive results transparently

  • Explore mechanistic explanations rather than dismissing findings

For documentation patterns, see analysis-interpretation.md

  11. Species Name Reconciliation

When external services use different species names than your metadata:

  • Classify mismatches into systematic replacements vs name variants

  • Use fuzzy matching for variant detection

  • Propagate corrections to ALL related files

  • Version files to track correction stages

For reconciliation workflow and code, see species-reconciliation.md
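Variant detection can be sketched with the standard library's difflib (names and the similarity cutoff are illustrative):

```python
from difflib import get_close_matches

metadata_names = ["Canis lupus", "Felis catus", "Mus musculus"]
external_names = ["Canis lupis", "Felis cattus", "Rattus norvegicus"]

# Near matches are likely name variants (typos, spelling differences);
# everything else goes to a list of systematic replacements to review.
variants, unmatched = {}, []
for name in external_names:
    hits = get_close_matches(name, metadata_names, n=1, cutoff=0.85)
    if hits:
        variants[name] = hits[0]
    else:
        unmatched.append(name)
```

Once reconciled, the same mapping must be applied to every related file, not just the one where the mismatch was found.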

  12. Phylogenetic Tree Coverage Analysis

Track what percentage of your phylogenetic tree has data available:

  • Calculate coverage metric and identify missing species

  • Categorize missing as recoverable, phylogenetic context, or unknown

  • Recover TimeTree proxy replacements from deprecated datasets

  • Document expected vs unexpected missing data

For coverage analysis workflow, see species-reconciliation.md
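The coverage metric itself is simple set arithmetic (species IDs hypothetical):

```python
# Species present in the phylogenetic tree vs species with data available.
tree_species = {"sp1", "sp2", "sp3", "sp4", "sp5"}
data_species = {"sp1", "sp2", "sp4"}

coverage = len(tree_species & data_species) / len(tree_species)
missing = sorted(tree_species - data_species)  # categorize these next
```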

  13. Distinguishing True Variation from Power Limitations

When analyzing multiple groups, determine if lack of effect is real or insufficient power:

  • Power limitation indicators: Small sample, trend in expected direction, category imbalance, wide CIs

  • True null indicators: Large sample with narrow CIs, opposite direction from other groups, significant in some metrics but not others

  • Report appropriately: "insufficient power" vs "no effect despite adequate power"

For reporting recommendations and examples, see analysis-interpretation.md

  14. Technology Confounding Analysis

Temporal trends may reflect technology adoption rather than methodology improvements:

  • Use three-stage approach: mixed-technology baseline, technology-controlled subset, comparison

  • Test orthogonality, persistence, and temporal patterns

  • Decision matrix for whether to pool across technologies

For the systematic testing approach, see analysis-interpretation.md

  15. Data Consolidation and Enrichment Workflows

When working with multiple intermediate dataset versions:

  • Follow Consolidate -> Enrich -> Verify pattern

  • Always rebuild filtered subsets from enriched master (don't manually merge)

  • Extract accurate dates from repository filenames when release dates are unreliable

For workflow details, see enrichment-patterns.md

  16. Data File Compression Strategies

For large data files, compress instead of delete:

  • Decision tree: active (keep) / regenerable (delete) / archive (compress)

  • BED/VCF/FASTA compress 70-90% with gzip

  • Update scripts to read compressed files directly

  • Document compression in READMEs

For compression benchmarks and workflows, see compression-strategies.md
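Reading compressed files directly is straightforward in pandas, which infers gzip compression from the .gz suffix (the file name and contents below are a throwaway demo):

```python
import gzip
from pathlib import Path

import pandas as pd

# Write a small BED-like file compressed with gzip.
path = Path("regions.bed.gz")
with gzip.open(path, "wt") as fh:
    fh.write("chr1\t100\t200\nchr1\t500\t800\n")

# pandas reads the archive directly, so scripts do not need a separate
# decompression step after archiving.
bed = pd.read_csv(path, sep="\t", header=None,
                  names=["chrom", "start", "end"])
path.unlink()  # clean up the demo file
```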

Key Principles

  • Default to recalculation when category definitions change, features were conflated, or publication accuracy is needed

  • Document approximations when used, and validate against subsets of recalculated data

  • Separate conflated features into independent analyses for clarity

  • Always verify column names against source code before analysis

  • Check dtype before assignment when enriching DataFrames

  • Rebuild filtered subsets from master rather than manually merging new columns

  • Test for technology confounding before pooling across technology generations

  • Compress rather than delete data files that may be needed later

Best Practices

Assess Information Loss

Before deciding to reuse aggregated data, check: Can you perfectly reconstruct raw data from aggregates? If NO, recalculate.

Document Your Decision

""" Data source: scaffold_telomere_data.csv (n=6,356 scaffolds) Recalculated: 2026-01-29 Reason: Previous aggregation conflated terminal and interstitial presence Method: [describe categorization logic] """

Validate Against Original if Possible

```python
original_total = df['cat1'] + df['cat2'] + df['cat3'] + df['cat4']
new_total = df['new_cat1'] + df['new_cat2'] + df['new_cat3']
assert (original_total == new_total).all(), "Category totals don't match!"
```

Time vs Accuracy Trade-off

  • Exploration phase: Approximations okay, clearly documented

  • Publication phase: Always recalculate for accuracy

  • Intermediate: Recalculate once, save results, reuse those

Performance Considerations

Recalculation is often faster than you think:

```python
# Modern pandas on 10,000+ rows:
df['new_cat'] = df.apply(categorize_func, axis=1)
result = df.groupby('species').agg({'new_cat': 'value_counts'})
# Often < 1 second
```

Optimize: use vectorized operations, filter to relevant columns, cache intermediate results.
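As an example of the vectorized route, pd.cut replaces a per-row apply for binning (thresholds and labels are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"length": [5, 50, 500, 5000]})

# Vectorized binning instead of df.apply(..., axis=1); same result,
# much better scaling on large tables.
df["size_cat"] = pd.cut(df["length"], bins=[0, 10, 100, np.inf],
                        labels=["small", "medium", "large"])
```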

Supporting Files

| File | Content |
| --- | --- |
| data-manipulation-recipes.md | Recalculation patterns, composite keys, conflated features, type conversion |
| enrichment-patterns.md | AWS enrichment, data consolidation, date extraction, filtered dataset rebuilding |
| validation-patterns.md | Column name verification, data quality checks, data provenance |
| analysis-interpretation.md | Multi-factor design, paradoxical results, power limitations, technology confounding |
| species-reconciliation.md | Species name reconciliation, phylogenetic tree coverage |
| analysis-organization.md | Token-efficient analysis text organization, statistical results population |
| compression-strategies.md | File compression decision tree, benchmarks, script updates |

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals. All entries below are repository-sourced, flagged "Needs Review" by the index, and have no summary provided by the upstream source.

  • jupyter-notebook-analysis (Research)

  • token-efficiency (General)

  • bioinformatics-fundamentals (General)

  • folder-organization (General)