jupyter-notebook-analysis

Jupyter Notebook Analysis Patterns

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "jupyter-notebook-analysis" with this command: npx skills add delphine-l/claude_global/delphine-l-claude-global-jupyter-notebook-analysis

Expert knowledge for creating comprehensive, statistically rigorous Jupyter notebook analyses.

When to Use This Skill

  • Creating multi-cell Jupyter notebooks for data analysis

  • Adding correlation analyses with statistical testing

  • Implementing outlier removal strategies

  • Building series of related visualizations (10+ figures)

  • Analyzing large datasets with multiple characteristics

  • Building data update/enrichment notebooks with multi-source merging

  • Generating figures for sharing with Claude or other AI tools

Important: Image Size Constraints

When generating images to share with Claude, images must not exceed 8000 pixels in either dimension. Add this helper to your notebook imports:

# Standard imports with Claude size checking
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image

MAX_CLAUDE_DIM = 7999  # Claude API limit with safety margin


def save_figure(filename, dpi=300, **kwargs):
    """Save figure with automatic Claude size constraint check."""
    plt.savefig(filename, dpi=dpi, bbox_inches='tight', **kwargs)

    # Verify and auto-resize if needed
    img = Image.open(filename)
    if img.width > MAX_CLAUDE_DIM or img.height > MAX_CLAUDE_DIM:
        print(f"Auto-resizing {filename} for Claude compatibility")
        print(f"   Original: {img.width}x{img.height}")
        img.thumbnail((MAX_CLAUDE_DIM, MAX_CLAUDE_DIM), Image.Resampling.LANCZOS)
        img.save(filename)
        print(f"   Resized: {img.width}x{img.height}")
    else:
        print(f"OK {filename}: {img.width}x{img.height}")


# Safe figure sizes for Claude (300 DPI)
FIG_SIZES = {
    'small': (7, 5),     # 2100x1500 px
    'medium': (12, 9),   # 3600x2700 px
    'large': (20, 15),   # 6000x4500 px
    'max': (26, 26),     # 7800x7800 px - maximum safe
}

# Use in notebook
fig, ax = plt.subplots(figsize=FIG_SIZES['medium'])
# ... plotting code ...
save_figure('figure.png')

For complete image size guidance, see the data-visualization skill.

Core Notebook Patterns

Data Update/Enrichment Notebooks

Use structured notebook patterns for multi-source data merging and enrichment. Key principles:

  • Configuration section at top with safety defaults (ENABLE_AWS_FETCH = False , TEST_MODE = True )

  • Composite keys for complex merge uniqueness requirements

  • Conflict resolution with configurable strategy (NEW vs OLD priority)

  • Idempotent column addition -- check if columns exist before adding

  • Enrichment tracking -- count what was actually saved, not just fetched

  • Two-stage file workflow -- input file -> distinct output file (never in-place)

For detailed patterns including data update, enrichment, and AWS GenomeArk workflows, see notebook-patterns.md.
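The principles above can be sketched on a list of dicts, the same record shape the template section of this skill iterates over. This is a minimal illustration, not the workflow from notebook-patterns.md: `CONFLICT_PRIORITY`, `composite_key`, `enrich`, and the column names are hypothetical.

```python
from typing import Dict, List, Tuple

# Safety defaults at the top of the notebook
ENABLE_AWS_FETCH = False
TEST_MODE = True
CONFLICT_PRIORITY = "NEW"  # hypothetical setting name: "NEW" or "OLD"

Row = Dict[str, str]


def composite_key(row: Row, fields: Tuple[str, ...]) -> Tuple[str, ...]:
    """Build a merge key from several columns."""
    return tuple(row.get(field, "") for field in fields)


def enrich(rows: List[Row], updates: List[Row], key_fields: Tuple[str, ...],
           column: str, priority: str = CONFLICT_PRIORITY) -> Tuple[List[Row], int]:
    """Return (new rows, number of values actually saved) for one column."""
    lookup = {composite_key(u, key_fields): u for u in updates}
    enriched, saved = [], 0
    for row in rows:
        new_row = dict(row)                # never modify the input rows
        new_row.setdefault(column, "")     # idempotent: add the column only if missing
        update = lookup.get(composite_key(row, key_fields))
        new_value = update.get(column, "") if update else ""
        current = new_row[column]
        keep_old = bool(current) and priority == "OLD"  # conflict, old value wins
        if new_value and new_value != current and not keep_old:
            new_row[column] = new_value
            saved += 1                     # count what was written, not what was fetched
        enriched.append(new_row)
    return enriched, saved
```

Writing `enriched` to a new file rather than back to the input path keeps the two-stage file workflow intact.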

Notebook Editing

Always use the NotebookEdit tool to modify .ipynb files, never the Edit tool, which corrupts the notebook's JSON structure.

Three modes: replace (update cell content), insert (add new cell after target), delete (remove cell).

Key rules:

  • Always specify cell_type when inserting

  • Find cell IDs with jq or Python JSON parsing

  • After programmatic edits, instruct user to "Restart & Run All"

  • Update in dependency order when changing metrics across cells

For NotebookEdit usage, programmatic JSON manipulation, bulk operations, and cell newline handling, see notebook-editing.md.
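As an illustration of finding cell IDs with Python JSON parsing, the sketch below lists each cell's index, ID, type, and first source line. The function names are hypothetical. It assumes the standard notebook JSON layout, in which the `id` field is present from nbformat 4.5 onward and may be missing in older files.

```python
import json
from typing import Any, Dict, List, Tuple


def load_notebook(path: str) -> Dict[str, Any]:
    """Read a .ipynb file as plain JSON (read-only inspection)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def list_cells(notebook: Dict[str, Any]) -> List[Tuple[int, str, str, str]]:
    """Return (index, cell id, cell type, first source line) for every cell."""
    summary = []
    for index, cell in enumerate(notebook.get("cells", [])):
        source = cell.get("source", "")
        # The source is stored either as one string or as a list of lines
        text = "".join(source) if isinstance(source, list) else source
        first_line = text.splitlines()[0] if text.strip() else ""
        summary.append((index, cell.get("id", ""), cell.get("cell_type", ""), first_line))
    return summary
```

Calling `list_cells(load_notebook("analysis.ipynb"))` (the filename is a placeholder) gives the IDs to pass to NotebookEdit without loading the notebook into an editor.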

Statistical Methods

Required for All Correlation Analyses

  • Pearson correlation with p-values using scipy.stats.pearsonr

  • Report r, p-value, and n on every correlation plot

  • Mann-Whitney U test for group comparisons
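A minimal sketch of these requirements, assuming SciPy is installed. `scipy.stats.pearsonr` and `scipy.stats.mannwhitneyu` are the real SciPy functions; the helper names `correlation_label` and `compare_groups` and the label format are illustrative choices, not part of the skill.

```python
from scipy import stats


def correlation_label(x, y):
    """Return (r, p, n, label) for a Pearson correlation of paired values."""
    # Drop pairs with a missing value so that n matches what was tested
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    r, p = stats.pearsonr(xs, ys)
    n = len(pairs)
    return float(r), float(p), n, f"r = {r:.2f}, p = {p:.2e}, n = {n}"


def compare_groups(group_a, group_b):
    """Two-sided Mann-Whitney U test; returns (U statistic, p-value)."""
    result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    return float(result.statistic), float(result.pvalue)
```

The label can be placed on a plot with `ax.text(0.05, 0.95, label, transform=ax.transAxes, va="top")`, so r, p, and n appear on every correlation figure.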

Outlier Handling

  • Stage 1: Count-based outliers (IQR method) -- remove before analysis

  • Stage 2: Value-based outliers (percentile) -- apply only to visualization, not statistics

  • Apply characteristic-specific outlier removal separately per analysis

  • Always report number of outliers removed
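The two stages can be sketched with the standard library. This is an interpretation rather than the code from statistical-methods.md: it assumes Stage 1 applies Tukey's 1.5 x IQR fences to a numeric column, and that Stage 2 only sets plot limits. The function names and the 1st/99th percentile defaults are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def iqr_bounds(values: Sequence[float], k: float = 1.5) -> Tuple[float, float]:
    """Tukey fences: (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_iqr_outliers(values: Sequence[float], k: float = 1.5) -> Tuple[List[float], int]:
    """Stage 1: drop values outside the fences and report how many were dropped."""
    low, high = iqr_bounds(values, k)
    kept = [v for v in values if low <= v <= high]
    return kept, len(values) - len(kept)


def percentile_window(values: Sequence[float], lower_pct: int = 1,
                      upper_pct: int = 99) -> Tuple[float, float]:
    """Stage 2: axis limits for plotting only; statistics keep every Stage 1 value."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[lower_pct - 1], cuts[upper_pct - 1]
```

A typical use is `kept, n_removed = remove_iqr_outliers(values)`, followed by `ax.set_ylim(*percentile_window(kept))`, with `n_removed` written into the figure title.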

Statistical Claim Verification (CRITICAL)

BEFORE finalizing any analysis notebook, verify ALL statistical claims against actual computed values. Text claims can become stale after data/code updates. Extract claims, rerun tests, create verification table.
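One way to build the verification table is to pair each written claim with a function that recomputes it. The sketch below is an assumption about how such a check could look, not the workflow from statistical-methods.md: the dict keys, the tolerance, and the OK/STALE labels are hypothetical.

```python
from typing import Dict, List


def verify_claims(claims: List[Dict], tolerance: float = 0.005) -> List[Dict]:
    """Recompute each claimed value and flag differences beyond `tolerance`."""
    table = []
    for claim in claims:
        computed = claim["compute"]()  # rerun the test behind the claim
        ok = abs(computed - claim["claimed"]) <= tolerance
        table.append({
            "claim": claim["text"],
            "claimed": claim["claimed"],
            "computed": round(computed, 4),
            "status": "OK" if ok else "STALE",
        })
    return table


def format_table(rows: List[Dict]) -> str:
    """Render the verification table as aligned plain text."""
    lines = [f"{'Claim':<40} {'Claimed':>8} {'Computed':>9}  Status"]
    for row in rows:
        lines.append(
            f"{row['claim']:<40} {row['claimed']:>8} {row['computed']:>9}  {row['status']}"
        )
    return "\n".join(lines)
```

Each `compute` entry would normally call the same notebook function that produced the figure, so a STALE row points directly at text that needs updating.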

For detailed statistical methods, outlier removal code, claim verification workflow, and confounding analysis, see statistical-methods.md.

Publication-Quality Figures

Key Standards

  • DPI: 300 for publication, 150 for digital viewing

  • Font sizes: Title 18pt bold, axis labels 16pt bold, ticks 14pt, legend 12pt

  • Colors: Use colorblind-safe palettes (IBM/Okabe-Ito). Blue #0173B2 / Orange #DE8F05 for two-group comparisons

  • Data imbalance: Add prominent warnings when sample size ratio > 5x
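These standards can be collected into one settings dict plus a small imbalance check. The keys below are standard matplotlib rcParams names; `PUBLICATION_RC`, `TWO_GROUP_COLORS`, `imbalance_warning`, and the warning wording are hypothetical. The block deliberately does not import matplotlib, so it only defines the values.

```python
from typing import Optional

# Apply in a notebook with: plt.rcParams.update(PUBLICATION_RC)
PUBLICATION_RC = {
    "axes.titlesize": 18,
    "axes.titleweight": "bold",
    "axes.labelsize": 16,
    "axes.labelweight": "bold",
    "xtick.labelsize": 14,
    "ytick.labelsize": 14,
    "legend.fontsize": 12,
    "savefig.dpi": 300,  # publication; use 150 for digital viewing
}

TWO_GROUP_COLORS = {"group_a": "#0173B2", "group_b": "#DE8F05"}  # blue, orange


def imbalance_warning(n_a: int, n_b: int, max_ratio: float = 5.0) -> Optional[str]:
    """Return a warning when the larger group exceeds max_ratio times the smaller."""
    small, large = sorted((n_a, n_b))
    if small == 0:
        return f"WARNING: one group is empty (n = {n_a} vs {n_b})"
    ratio = large / small
    if ratio > max_ratio:
        return f"WARNING: sample sizes are imbalanced (n = {n_a} vs {n_b}, ratio {ratio:.1f}x)"
    return None
```

A returned warning can go into the figure subtitle or a printed cell output so the imbalance is visible next to the result.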

Image Display

  • Use HTML <img> tags in markdown cells for responsive SVG/PNG scaling

  • Crop SVGs by modifying viewBox attributes directly (no ImageMagick needed)

  • Manage DPI to prevent "Output too large" errors (use 150 DPI default)

For detailed font size tables, color palette code, imbalance handling, SVG manipulation, and DPI management, see visualization-guide.md.
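A sketch of viewBox cropping and of building an `<img>` tag, using only the standard library. The function names and the 80% default width are hypothetical. One caveat: `xml.etree.ElementTree` drops comments and the DOCTYPE when it re-serializes, which is usually harmless for display but means the output is not byte-identical to the input.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"


def crop_svg(svg_text: str, x: float, y: float, width: float, height: float) -> str:
    """Return the SVG with its viewBox set to the given rectangle."""
    # Register namespaces so output keeps plain <svg> tags instead of ns0: prefixes
    ET.register_namespace("", SVG_NS)
    ET.register_namespace("xlink", XLINK_NS)
    root = ET.fromstring(svg_text)
    root.set("viewBox", f"{x:g} {y:g} {width:g} {height:g}")
    return ET.tostring(root, encoding="unicode")


def img_tag(path: str, alt: str, width: str = "80%") -> str:
    """HTML for a markdown cell; a percentage width scales with the page."""
    return f'<img src="{path}" alt="{alt}" width="{width}">'
```

The viewBox selects which region of the drawing is shown; the displayed size still comes from the `width` given in the `<img>` tag.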

Notebook Organization

Large Notebooks (60+ cells)

  • Use markdown section headers with cell pairing pattern

  • Consistent naming for figures, variables, and functions

  • Progressive enhancement from basic to complex analyses

Dual-Notebook System

For analyses with 5+ figures preparing for publication:

  • Code notebook: Executable analysis, figure generation, statistical tests

  • Presentation notebook: Figure displays, captions, interpretations, methods

Splitting and Deprecation

When splitting notebooks, recreate all calculated columns and variable definitions in each split. When deprecating, create dated directories with documentation.

For figure usage analysis, splitting strategies, dual-notebook workflow, publication notebook structure, TOC generation, deprecation workflow, and migration guides, see notebook-organization.md.

Sharing and Export

Key Rules

  • Preserve outputs when preparing sharing packages (outputs ARE the documentation)

  • Use relative paths (never absolute) for portability

  • HTML export is best for sharing (self-contained, no software needed)

  • Update paths programmatically when moving notebooks to subdirectories

For path management, HTML/PDF/LaTeX export, sharing package structure, and output preservation guidelines, see sharing-and-export.md.
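Updating paths when a notebook moves can be reduced to one function that rewrites a relative path against the notebook's new directory. This is a minimal sketch with a hypothetical `rebase_path` helper. It assumes forward-slash paths and that both directories are given relative to the project root.

```python
import posixpath


def rebase_path(path: str, old_dir: str, new_dir: str) -> str:
    """Rewrite a relative path so it still resolves after the notebook moves.

    old_dir and new_dir are the notebook's directories relative to the project root.
    """
    if posixpath.isabs(path):
        return path  # absolute paths are left untouched for manual review
    # Path of the target relative to the project root
    target = posixpath.normpath(posixpath.join(old_dir, path))
    # Same target, expressed relative to the notebook's new location
    return posixpath.relpath(target, start=new_dir)
```

For a notebook moved from the project root into `notebooks/`, `rebase_path("data/genomes.csv", ".", "notebooks")` returns `../data/genomes.csv`. The filename here is only an example.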

Template and Helper Patterns

Template Generation

For creating multiple similar analysis cells:

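# The backslash is doubled so the generated code contains \n as an escape, not a raw line break
template = '''
if len(data_with_species) > 0:
    print('Analyzing {display} vs {metric}...\\n')
    species_data = {{}}
    for inv in data_with_species:
        {name} = safe_float_convert(inv.get('{name}'))
        if {name} is None:
            continue
        # ... analysis code
'''

characteristics = [
    {'name': 'genome_size', 'display': 'Genome Size', 'unit': 'Gb'},
    {'name': 'heterozygosity', 'display': 'Heterozygosity', 'unit': '%'},
]

# {metric} has no entry in the dicts above, so it is passed separately;
# 'metric_name' is a placeholder, not a value from the original skill
metric = 'metric_name'

cells = []
for char in characteristics:
    code = template.format(metric=metric, **char)
    cells.append(code)  # one generated cell per characteristic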

Helper Function Pattern

Define once, reuse throughout:

def safe_float_convert(value):
    """Convert string to float, handling comma separators"""
    # Test for None explicitly so that a numeric 0 is converted, not treated as missing
    if value is None or not str(value).strip():
        return None
    try:
        return float(str(value).replace(',', ''))
    except (ValueError, TypeError):
        return None

Troubleshooting

Key pitfalls to watch for:

  • Variable shadowing: Never use data as a loop variable (shadows global)

  • Column name mismatches: Always print df.columns.tolist() before processing

  • Cell execution order: After NotebookEdit inserts, "Restart & Run All"

  • Notebook size: Use jq for notebooks > 256 KB

For detailed troubleshooting, variable validation, debugging techniques, and environment setup, see troubleshooting.md.
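The column-name pitfall can be caught before any processing starts. The sketch below uses a hypothetical `check_columns` helper that reports which required names are missing and which of them differ from an existing column only by case or surrounding whitespace.

```python
from typing import Dict, Iterable, List, Tuple


def check_columns(available: Iterable[str],
                  required: Iterable[str]) -> Tuple[List[str], Dict[str, str]]:
    """Return (missing names, near matches) for the required columns."""
    available = list(available)
    normalized = {name.strip().lower(): name for name in available}
    missing: List[str] = []
    near: Dict[str, str] = {}
    for name in required:
        if name in available:
            continue
        missing.append(name)
        candidate = normalized.get(name.strip().lower())
        if candidate is not None:
            near[name] = candidate  # same name apart from case or spaces
    return missing, near
```

With pandas, the first argument would be `df.columns.tolist()`, so the check complements printing the column list.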

Best Practices Summary

  • Always check data availability before creating analyses

  • Document outlier removal clearly in titles and comments

  • Use consistent naming for variables and figures

  • Include statistical testing for all correlations

  • Separate visualization from statistics when filtering outliers

  • Create templates for repetitive analyses

  • Use helper functions consistently across cells

  • Organize with markdown headers for navigation

  • Test with small datasets before running full analyses

  • Save intermediate results for expensive computations

  • Use NotebookEdit tool for all .ipynb file modifications
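For the point about saving intermediate results, one simple option is a JSON file cache. This is an illustrative sketch, not a pattern taken from the supporting files: `cached_json` is a hypothetical helper, and it assumes the result is JSON-serializable.

```python
import json
import os
from typing import Any, Callable


def cached_json(path: str, compute: Callable[[], Any], refresh: bool = False) -> Any:
    """Load a JSON-serializable result from `path`, or compute and save it."""
    if not refresh and os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    result = compute()  # the expensive step runs only on a cache miss or refresh
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(result, handle)
    return result
```

Passing `refresh=True` after the input data changes avoids reading a stale cache.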

Supporting Files Reference

| File | Contents |
|---|---|
| notebook-patterns.md | Data update, enrichment, AWS GenomeArk patterns |
| notebook-editing.md | NotebookEdit tool, programmatic manipulation, metrics updates |
| visualization-guide.md | Publication figures, colors, image display, SVG, DPI |
| statistical-methods.md | Outlier handling, statistical rigor, claim verification |
| notebook-organization.md | Splitting, dual-notebook, deprecation, figure analysis |
| sharing-and-export.md | Paths, HTML/PDF export, sharing packages |
| troubleshooting.md | Common pitfalls, debugging, validation, environment |
