
Data Analysis Patterns

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the command below and send it to your AI assistant to install this skill:

Install skill "data-analysis-patterns" with this command: npx skills add delphine-l/claude_global/delphine-l-claude-global-data-analysis-patterns

Expert guidance for making critical decisions in data analysis workflows, particularly around aggregation, recalculation, and maintaining analytical integrity.

When to Use This Skill

  • Deciding whether to recalculate from raw data vs reuse aggregated data

  • Changing category definitions in existing analyses

  • Ensuring accuracy in publication-quality analyses

  • Handling conflated features that need separation

  • Optimizing analysis pipelines without sacrificing correctness

  • Merging multi-source datasets with composite keys

  • Handling DataFrame type conversion issues during enrichment

Core Patterns

  1. Recalculating vs Reusing Aggregated Data

When you have pre-aggregated data but need different categories or groupings:

  • Recalculate from raw data when category definitions fundamentally change, previously conflated features need separation, aggregation criteria change, or publication accuracy is critical

  • Approximation may be acceptable for exploratory analysis, when categories align closely, or when raw data is unavailable

  • Rule: If you can't confidently map old to new without information loss, recalculate

For detailed patterns and code examples, see data-manipulation-recipes.md
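As a minimal sketch of the recalculation rule (all column and category names here are hypothetical, not taken from the recipe file), rebuild categories directly from the per-scaffold table instead of remapping old aggregates:

```python
import pandas as pd

# Hypothetical raw data: one row per scaffold with telomere observations.
raw = pd.DataFrame({
    "scaffold": ["s1", "s2", "s3", "s4"],
    "terminal_telomere": [True, False, True, False],
    "interstitial_telomere": [False, True, True, False],
})

# The old aggregation conflated terminal and interstitial presence;
# recalculating from raw rows yields cleanly separated categories.
def categorize(row):
    if row["terminal_telomere"] and row["interstitial_telomere"]:
        return "both"
    if row["terminal_telomere"]:
        return "terminal_only"
    if row["interstitial_telomere"]:
        return "interstitial_only"
    return "none"

raw["telomere_class"] = raw.apply(categorize, axis=1)
counts = raw["telomere_class"].value_counts()
```

Because each row is recategorized from raw observations, no information is lost in the old-to-new mapping.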

  2. Composite Keys for Multi-Source Data Merging

When merging datasets from multiple sources, a single identifier often isn't unique enough:

  • Create composite keys by concatenating multiple fields with a delimiter (| or ::)

  • Always verify uniqueness after creating the composite key

  • Handle duplicates explicitly before merging (latest date, then most complete record)

  • Remove composite key before final save (temporary working column)

For implementation details, see data-manipulation-recipes.md
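A hedged sketch of the composite-key workflow above (field names invented for illustration):

```python
import pandas as pd

left = pd.DataFrame({
    "species": ["A", "A", "B"],
    "assembly": ["v1", "v2", "v1"],
    "length": [10, 12, 20],
})
right = pd.DataFrame({
    "species": ["A", "A", "B"],
    "assembly": ["v1", "v2", "v1"],
    "coverage": [0.9, 0.95, 0.8],
})

# Build the composite key with an unambiguous delimiter.
for df in (left, right):
    df["_key"] = df["species"] + "|" + df["assembly"]

# Verify uniqueness before merging; fail loudly if it does not hold.
assert left["_key"].is_unique and right["_key"].is_unique, "composite key not unique"

merged = left.merge(right[["_key", "coverage"]], on="_key", how="left")
merged = merged.drop(columns="_key")  # temporary working column
```

If the uniqueness assertion fails, deduplicate explicitly (e.g. sort by date and keep the latest, most complete record) before merging.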

  3. Separating Conflated Features

When one metric combines multiple independent features, separate into independent analyses:

  • Identify which features are mixed in each category

  • Create separate category systems for each independent feature

  • Enables clear interpretation and future independent analysis

For examples, see data-manipulation-recipes.md
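For instance, a single label that mixes two independent features (hypothetical values combining topology and compartment) can be split into two category columns:

```python
import pandas as pd

df = pd.DataFrame({"genome_type": ["circular_plastid", "linear_plastid",
                                   "circular_mito", "linear_mito"]})

# One label conflates topology and compartment; split it into two
# independent category systems that can be analyzed separately.
df[["topology", "compartment"]] = df["genome_type"].str.split("_", expand=True)
```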

  4. DataFrame Type Conversion During Enrichment

Type mismatches are common when enriching DataFrames from external sources:

  • Check target column dtype before assignment

  • Convert values to match target dtype (easier than converting whole column)

  • Use helper functions to encapsulate type checking logic

  • Handle NaN explicitly with pd.notna() checks

For the type-safe assignment pattern and examples, see data-manipulation-recipes.md
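One possible shape for the type-safe assignment helper (a sketch under the assumptions above, not the recipe file's exact implementation):

```python
import pandas as pd

def assign_typed(df, idx, col, value):
    """Convert a single value to the target column's dtype before
    assignment, skipping missing values explicitly."""
    if not pd.notna(value):
        return  # handle NaN/None explicitly: leave the cell untouched
    # Converting one value is easier than converting the whole column.
    df.loc[idx, col] = df[col].dtype.type(value)

df = pd.DataFrame({"length": [100, 200]})      # int64 column
assign_typed(df, 0, "length", "150")           # string from an external source
assign_typed(df, 1, "length", None)            # no-op, NaN handled explicitly
```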

  5. AWS Data Enrichment Patterns

When enriching tabular data from AWS S3 or external repositories:

  • Use multi-source path resolution (direct lookup + path inference)

  • Auto-detect most complete input file for idempotent re-runs

  • Add columns idempotently (check before adding)

  • Use TEST_MODE for initial validation before full enrichment

For implementation patterns, see enrichment-patterns.md
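The idempotent-column and TEST_MODE ideas might look like this in plain pandas (bucket path and column names are hypothetical; no AWS calls shown):

```python
import pandas as pd

TEST_MODE = True  # validate on a small slice before a full enrichment run

def add_column_idempotent(df, col, default=pd.NA):
    # Only create the column if a previous run has not already added it,
    # so re-running the enrichment never clobbers filled-in values.
    if col not in df.columns:
        df[col] = default
    return df

df = pd.DataFrame({"sample": ["a", "b", "c"]})
add_column_idempotent(df, "s3_path")
df.loc[0, "s3_path"] = "s3://bucket/a.bam"  # hypothetical enrichment result
add_column_idempotent(df, "s3_path")        # second call is a no-op

work = df.head(2) if TEST_MODE else df      # small slice for validation
```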

  6. Critical Data Validation

Column names don't always match their content. Always verify against source code before using categorical columns:

  • Locate source script and verify assignment logic

  • Add assertions for biological plausibility

  • Test with known control samples

  • Document and fix any mismatches found

For the full verification workflow and prevention checklist, see validation-patterns.md
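Plausibility assertions and a known-control check could be sketched like this (species, values, and ranges are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["mouse", "human"],
    "chromosome_count": [40, 46],
    "sex": ["F", "M"],
})

# Assert biological plausibility rather than trusting the column name.
assert df["chromosome_count"].between(2, 200).all(), "implausible karyotype"
assert set(df["sex"]).issubset({"F", "M", "unknown"}), "unexpected sex labels"

# Test with a known control sample: human should be 46.
control = df.loc[df["species"] == "human", "chromosome_count"].item()
assert control == 46, "control sample failed; check the source script"
```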

  7. Data Provenance Verification

Derived columns may use inferior sources, causing silent data loss:

  • Compare derived column coverage against likely source columns

  • Cross-tabulate to verify mapping consistency

  • Prefer reclassifying from rich source columns over merging sparse external files

For diagnostic patterns, see validation-patterns.md
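A quick diagnostic for the coverage and consistency checks above (column names hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "habitat_detailed": ["marine", "freshwater", "marine", "terrestrial"],
    "habitat_derived":  ["marine", None,         "marine", None],
})

# Compare coverage: here the derived column silently lost half the data.
derived_cov = df["habitat_derived"].notna().mean()
source_cov = df["habitat_detailed"].notna().mean()

# Cross-tabulate to check that, where both exist, the mapping is consistent.
xtab = pd.crosstab(df["habitat_detailed"], df["habitat_derived"])
```

A large coverage gap is the signal to reclassify from the rich source column instead of keeping the sparse derived one.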

  8. Organizing Analysis Text for Token Efficiency

Separate computation (notebooks) from interpretation (markdown files):

  • Create analysis_files/ directory with per-figure markdown files

  • Keep notebooks for code, analysis files for interpretation

  • Token reduction: 98% (1.1M tokens notebook vs 22K tokens analysis files)

For directory structure and writing guidelines, see analysis-organization.md

  9. Multi-Factor Experimental Design Analysis

When experimental design has multiple factors:

  • Use three-category design to isolate individual factor effects

  • Compare pairs controlling for one factor at a time

  • Identify synergistic, dominant, or antagonistic interactions

For interpretation framework and examples, see analysis-interpretation.md

  10. Interpreting Paradoxical Results

When one category performs better on metric X but worse on related metrics Y and Z:

  • Apply the trade-off hypothesis framework

  • Document counter-intuitive results transparently

  • Explore mechanistic explanations rather than dismissing findings

For documentation patterns, see analysis-interpretation.md

  11. Species Name Reconciliation

When external services use different species names than your metadata:

  • Classify mismatches into systematic replacements vs name variants

  • Use fuzzy matching for variant detection

  • Propagate corrections to ALL related files

  • Version files to track correction stages

For reconciliation workflow and code, see species-reconciliation.md
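Variant detection can be sketched with the standard library's difflib (names and the similarity cutoff are illustrative):

```python
from difflib import get_close_matches

metadata_names = ["Canis lupus", "Felis catus", "Mus musculus"]
external_names = ["Canis lupis", "Felis cattus", "Rattus norvegicus"]

# Near matches are likely name variants (typos, spelling differences);
# everything else goes to a list of systematic replacements to review.
variants, unmatched = {}, []
for name in external_names:
    hits = get_close_matches(name, metadata_names, n=1, cutoff=0.85)
    if hits:
        variants[name] = hits[0]
    else:
        unmatched.append(name)
```

Once reconciled, the same mapping must be applied to every related file, not just the one where the mismatch was found.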

  12. Phylogenetic Tree Coverage Analysis

Track what percentage of your phylogenetic tree has data available:

  • Calculate coverage metric and identify missing species

  • Categorize missing as recoverable, phylogenetic context, or unknown

  • Recover TimeTree proxy replacements from deprecated datasets

  • Document expected vs unexpected missing data

For coverage analysis workflow, see species-reconciliation.md
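The coverage metric itself is simple set arithmetic (species IDs hypothetical):

```python
# Species present in the phylogenetic tree vs species with data available.
tree_species = {"sp1", "sp2", "sp3", "sp4", "sp5"}
data_species = {"sp1", "sp2", "sp4"}

coverage = len(tree_species & data_species) / len(tree_species)
missing = sorted(tree_species - data_species)  # categorize these next
```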

  13. Distinguishing True Variation from Power Limitations

When analyzing multiple groups, determine if lack of effect is real or insufficient power:

  • Power limitation indicators: Small sample, trend in expected direction, category imbalance, wide CIs

  • True null indicators: Large sample with narrow CIs, opposite direction from other groups, significant in some metrics but not others

  • Report appropriately: "insufficient power" vs "no effect despite adequate power"

For reporting recommendations and examples, see analysis-interpretation.md

  14. Technology Confounding Analysis

Temporal trends may reflect technology adoption rather than methodology improvements:

  • Use three-stage approach: mixed-technology baseline, technology-controlled subset, comparison

  • Test orthogonality, persistence, and temporal patterns

  • Decision matrix for whether to pool across technologies

For the systematic testing approach, see analysis-interpretation.md

  15. Data Consolidation and Enrichment Workflows

When working with multiple intermediate dataset versions:

  • Follow Consolidate -> Enrich -> Verify pattern

  • Always rebuild filtered subsets from enriched master (don't manually merge)

  • Extract accurate dates from repository filenames when release dates are unreliable

For workflow details, see enrichment-patterns.md

  16. Data File Compression Strategies

For large data files, compress instead of delete:

  • Decision tree: active (keep) / regenerable (delete) / archive (compress)

  • BED/VCF/FASTA compress 70-90% with gzip

  • Update scripts to read compressed files directly

  • Document compression in READMEs

For compression benchmarks and workflows, see compression-strategies.md
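Reading compressed files directly is straightforward in pandas, which infers gzip compression from the .gz suffix (the file name and contents below are a throwaway demo):

```python
import gzip
from pathlib import Path

import pandas as pd

# Write a small BED-like file compressed with gzip.
path = Path("regions.bed.gz")
with gzip.open(path, "wt") as fh:
    fh.write("chr1\t100\t200\nchr1\t500\t800\n")

# pandas reads the archive directly, so scripts do not need a separate
# decompression step after archiving.
bed = pd.read_csv(path, sep="\t", header=None,
                  names=["chrom", "start", "end"])
path.unlink()  # clean up the demo file
```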

Key Principles

  • Default to recalculation when category definitions change, features were conflated, or publication accuracy is needed

  • Document approximations when used, and validate against subsets of recalculated data

  • Separate conflated features into independent analyses for clarity

  • Always verify column names against source code before analysis

  • Check dtype before assignment when enriching DataFrames

  • Rebuild filtered subsets from master rather than manually merging new columns

  • Test for technology confounding before pooling across technology generations

  • Compress rather than delete data files that may be needed later

Best Practices

Assess Information Loss

Before deciding to reuse aggregated data, check: Can you perfectly reconstruct raw data from aggregates? If NO, recalculate.

Document Your Decision

""" Data source: scaffold_telomere_data.csv (n=6,356 scaffolds) Recalculated: 2026-01-29 Reason: Previous aggregation conflated terminal and interstitial presence Method: [describe categorization logic] """

Validate Against Original if Possible

```python
original_total = df['cat1'] + df['cat2'] + df['cat3'] + df['cat4']
new_total = df['new_cat1'] + df['new_cat2'] + df['new_cat3']
assert (original_total == new_total).all(), "Category totals don't match!"
```

Time vs Accuracy Trade-off

  • Exploration phase: Approximations okay, clearly documented

  • Publication phase: Always recalculate for accuracy

  • Intermediate: Recalculate once, save results, reuse those

Performance Considerations

Recalculation is often faster than you think:

```python
# Modern pandas on 10,000+ rows:
df['new_cat'] = df.apply(categorize_func, axis=1)
result = df.groupby('species').agg({'new_cat': 'value_counts'})
# Often < 1 second
```

Optimize: use vectorized operations, filter to relevant columns, cache intermediate results.
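As an example of the vectorized route, pd.cut replaces a per-row apply for binning (thresholds and labels are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"length": [5, 50, 500, 5000]})

# Vectorized binning instead of df.apply(..., axis=1); same result,
# much better scaling on large tables.
df["size_cat"] = pd.cut(df["length"], bins=[0, 10, 100, np.inf],
                        labels=["small", "medium", "large"])
```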

Supporting Files

| File | Content |
| --- | --- |
| data-manipulation-recipes.md | Recalculation patterns, composite keys, conflated features, type conversion |
| enrichment-patterns.md | AWS enrichment, data consolidation, date extraction, filtered dataset rebuilding |
| validation-patterns.md | Column name verification, data quality checks, data provenance |
| analysis-interpretation.md | Multi-factor design, paradoxical results, power limitations, technology confounding |
| species-reconciliation.md | Species name reconciliation, phylogenetic tree coverage |
| analysis-organization.md | Token-efficient analysis text organization, statistical results population |
| compression-strategies.md | File compression decision tree, benchmarks, script updates |

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals. All entries below are repository-sourced, flagged "Needs Review" by the index, and have no summary provided by the upstream source.

  • jupyter-notebook-analysis (Research)

  • token-efficiency (General)

  • bioinformatics-fundamentals (General)

  • folder-organization (General)