# Bioinformatician Skill

## Purpose

Implement computational analyses of biological data, including:

- Data loading and quality control
- Statistical analysis
- Bioinformatics pipelines
- Visualization
- Integration with domain-specific tools
## When to Use This Skill

Use this skill when you need to:

- Implement an analysis plan in code (from the PI)
- Process genomics/transcriptomics/proteomics data
- Perform statistical tests on biological data
- Create publication-quality visualizations
- Build reproducible analysis pipelines
- Integrate multiple bioinformatics tools
## Workflow Integration

### Primary Pattern: Receive Plan → Implement → Deliver Notebook

```
Receive analysis_plan.md from PI
    ↓
Implement in Jupyter notebook
    ↓  (copilot reviews continuously)
Deliver completed notebook to PI for interpretation
```

**Integration Points:**

- RECEIVES: Analysis plan from principal-investigator
- WORKS WITH: copilot (adversarial code review during implementation)
- CALLS: Domain-specific skills (scanpy, pydeseq2, biopython, etc.)
- OUTPUTS: Jupyter notebooks with analysis code + results
## Core Capabilities

### Data Loading and Validation

- Read common formats (CSV, TSV, HDF5, Parquet, FASTQ, BAM, VCF)
- Validate data integrity and format
- Handle compressed files
- Memory-efficient loading for large datasets (see the loading sketch after this section)

### Quality Control

- Sample quality metrics
- Outlier detection
- Batch effect assessment
- Positive/negative control validation

### Statistical Analysis

- Differential expression/abundance
- Enrichment analysis
- Clustering and dimensionality reduction
- Correlation and regression
- Multiple testing correction

### Visualization

- Publication-quality plots (matplotlib, seaborn, plotly)
- Interactive visualizations
- Consistent styling
- Proper labeling and legends

### Pipeline Development

- Modular, reusable code
- Parameter documentation
- Progress logging
- Error handling
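As a minimal illustration of memory-efficient loading with validation, the sketch below streams a large counts table in chunks and runs basic integrity checks before analysis. The file name, separator, and chunk size are placeholder assumptions, not part of this skill's assets.

```python
from pathlib import Path

import pandas as pd

# Placeholder input; adjust path and separator to your data (pandas reads gzip transparently)
counts_path = Path("data/raw/counts_matrix.tsv.gz")

# Memory-efficient loading: stream the file in chunks instead of reading it all at once
chunks = pd.read_csv(counts_path, sep="\t", index_col=0, chunksize=10_000)
counts = pd.concat(chunks)

# Basic integrity checks before any analysis
assert counts.index.is_unique, "Duplicate gene identifiers found"
assert not counts.isna().any().any(), "Missing values in counts matrix"
assert (counts >= 0).all().all(), "Negative values are not valid counts"

print(f"Loaded {counts.shape[0]} genes x {counts.shape[1]} samples")
```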
## Standard Notebook Structure

Use the template in `assets/notebook-structure-template.ipynb`:

1. **Title and Description**
   - Research question
   - Date, author
   - Reference to analysis plan
2. **Setup**
   - Imports
   - Configuration parameters
   - Random seeds for reproducibility
3. **Data Loading**
   - Read data files
   - Initial inspection
   - Data structure validation
4. **Quality Control**
   - Sample metrics
   - Filtering criteria
   - QC visualizations
5. **Analysis**
   - Statistical tests
   - Transformations
   - Model fitting
6. **Visualization**
   - Main figures
   - Supplementary plots
7. **Export Results**
   - Save processed data
   - Export figures
   - Summary statistics
8. **Session Info**
   - Package versions
   - Execution time
## Biological Literacy Framework

### Writing Style for Biological Context

All biological context in notebooks should follow concise scientific prose.

**Principles:**

- ✅ Brief: 1-3 sentences per section, not paragraphs
- ✅ Clear: Use precise biological terminology
- ✅ Factual: State what/why without excessive detail
- ✅ Publication-ready: Like Methods/Results sections in papers

**Example - Good (Concise):**

> **Biological Context**
> Differential expression analysis comparing wild-type and mutant neurons identifies genes affected by loss of transcription factor X. Expected upregulation of target genes based on ChIP-seq data (Smith et al. 2020).

**Example - Avoid (Too Verbose):**

> **Biological Context**
> In this analysis, we will perform differential expression analysis to compare gene expression between wild-type neurons and neurons with a mutation in transcription factor X. Previous research has shown that transcription factor X plays a critical role in neuronal development by binding to the promoters of many developmentally important genes...
### When to Provide Interpretation vs Handoff

**Bioinformatician handles (routine interpretation):**

- Standard results following known biology
- Positive/negative controls behaving as expected
- Results matching literature precedents
- Technical QC assessments with biological implications
- Magnitude/direction sanity checks

**Handoff to biologist-commentator (expert needed):**

- Novel or unexpected findings
- Results contradicting established biology
- Unclear biological mechanisms
- Publication-critical interpretations
- Proposing new hypotheses or models
## Enhanced Notebook Structure

Use this structure for biologically literate notebooks:

1. **Title and Scientific Context**
   - Research question (biological, not just technical)
   - Biological hypothesis
   - Expected outcome and why it matters
   - Relevant background (1-2 sentences)
2. **Setup (code)**
   - Imports, parameters, seeds
3. **Data Loading**
   - Code: Load data
   - Biological description of dataset (markdown):
     - What organism/tissue/condition
     - What genes/features were measured
     - What biological question the dataset addresses
4. **Quality Control**
   - Code: QC metrics, filtering
   - Biological interpretation of QC (markdown):
     - Are pass rates expected for this data type?
     - Do failed samples have biological meaning?
     - Any red flags from a biological perspective?
5. **Analysis**
   - Code: Statistical tests, transformations
   - Biological reasoning for each step (markdown):
     - Why this method for this question?
     - What biological assumption is being tested?
     - Positive/negative controls?
6. **Results**
   - Code: Generate results
   - Biological sanity checks (markdown):
     - Do magnitudes make sense?
     - Do directions align with known biology?
     - Is any known biology violated?
7. **Visualization**
   - Code: Plots
   - Biological interpretation scaffolding (markdown):
     - What biological pattern does this show?
     - Is this expected or surprising?
     - What follow-up questions does this raise?
8. **Preliminary Interpretation**
   - Bioinformatician's biological assessment (markdown):
     - Main findings in biological terms
     - Caveats and limitations
     - Questions for biologist-commentator
9. **Handoff to Expert (if needed)**
   - Structured questions for biologist-commentator (markdown):
     - Specific results needing interpretation
     - Unexpected findings to validate
     - Biological mechanisms to explore
10. **Export (code)**
    - Save data, figures, session info
## Biological Sanity Check Framework

Run these checks before accepting results:

### Expression/Abundance Checks

- Order of magnitude reasonable? (|log2FC| > 10 is suspicious)
- Direction matches known biology? (check a few known genes)
- Positive controls behave as expected?
- Negative controls show no signal?

### Statistical Checks with Biological Lens

- Top hits include known biology? (literature validation)
- Results robust to threshold changes?
- Batch effects separated from real biology?
- Multiple testing correction appropriate for the goal? (discovery vs validation)

### Genomics-Specific

- Chromosome names consistent? (chr1 vs 1)
- Coordinates sensible? (within chromosome bounds)
- Strand orientation correct for gene features?
- Genome build consistent throughout?

### Experimental Design

- Sample size adequate for this effect size?
- Replicates biological or technical?
- Confounders identified and addressed?
- Controls appropriate for this experiment type?

**If any check fails:** Document it in the notebook and flag for biologist-commentator review.
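A minimal sketch of how the expression/abundance checks might be automated, assuming DESeq2-style result columns (`log2FoldChange`, `padj`) and hypothetical control gene lists; adapt the file path and gene names to your own results:

```python
import pandas as pd

# Hypothetical inputs: DE results indexed by gene symbol, plus expected-behaviour gene sets
results = pd.read_csv("results/analysis_2026-01-29/de_results.csv", index_col=0)
positive_controls = {"GENE1": "up", "GENE2": "down"}   # directions expected from literature
negative_controls = ["ACTB", "GAPDH"]                  # expected unchanged

# Magnitude check: very large fold changes are suspicious
extreme = results[results["log2FoldChange"].abs() > 10]
print(f"Genes with |log2FC| > 10 (suspicious): {len(extreme)}")

# Direction check for positive controls
for gene, expected in positive_controls.items():
    if gene in results.index:
        lfc = results.loc[gene, "log2FoldChange"]
        observed = "up" if lfc > 0 else "down"
        flag = "OK" if observed == expected else "FLAG for biologist-commentator"
        print(f"{gene}: log2FC={lfc:.2f} ({observed}, expected {expected}) -> {flag}")

# Negative controls should not be significant
for gene in negative_controls:
    if gene in results.index:
        padj = results.loc[gene, "padj"]
        print(f"{gene}: padj={padj:.3g} -> {'OK' if padj > 0.05 else 'FLAG'}")
```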
## Biological Context Templates

### Template: Differential Expression Analysis

**Biological Context**

Comparing [condition A] vs [condition B] to identify genes involved in [biological process]. Expected upregulation of [pathway X] genes based on [mechanism/literature]. Positive controls: [gene1, gene2]. Expected log2FC range: [X-Y] based on [citation].

**Biological Sanity Checks**

- Known pathway genes show expected direction (e.g., gene1 ↑, gene2 ↓)
- Housekeeping genes unchanged (actb, gapdh)
- Magnitudes reasonable (|log2FC| < 10 for transcriptional regulation)

**Preliminary Interpretation**

Top hits include [gene X, Y, Z] involved in [biological process], consistent with [hypothesis/literature]. [Gene W] unexpected - requires expert validation.

**Handoff:** Unexpected downregulation of [gene W] contradicts its known role in [process]. Biologist-commentator needed for mechanism assessment.

### Template: Single-Cell Clustering

**Biological Context**

Clustering [tissue] cells to identify cell types. Expected populations: [celltype1 (markers: a, b, c), celltype2 (markers: d, e, f)]. Reference atlas: [citation if available].

**Cluster Validation**

- Cluster 1: [celltype] - markers: [genes] ✓
- Cluster 2: [celltype] - markers: [genes] ✓
- Cluster 3: Novel population - markers: [genes] - needs expert review

**Handoff:** Cluster 3 shows unexpected marker combination [X+Y+Z-]. Biologist-commentator needed for cell type identification and biological significance.

### Template: Expert Handoff Format

Use this concise format when escalating to biologist-commentator:

**Expert Interpretation Needed**

- Finding: [Specific result with statistics]
- Context: [1-2 sentence background]
- Issue: [What's unexpected/unclear and why]
- Question: [Specific question for expert]
- Validation Done: [Positive controls: ✓/✗, Literature: consistent/contradicts]

**Example:**

- Finding: Gene X shows 8-fold upregulation (padj < 0.001) in mutant vs WT
- Context: Gene X is a transcriptional repressor; expected downregulation of its targets
- Issue: Target genes are also upregulated (contradicts repressor function)
- Question: Alternative mechanism? Post-transcriptional regulation? Data artifact?
- Validation Done: Positive controls ✓, replicates consistent ✓, literature shows conflicting results
## Biologist-Commentator Integration Pattern

### When to Invoke Biologist-Commentator

**Pre-Analysis (Method Validation):**

```
Skill(skill="biologist-commentator", args="Validate that DESeq2 is appropriate for [specific experiment design]. Confirm controls adequate and confounders addressed.")
```

**During Analysis (Quick Check):**

- Use the biological sanity check framework (above)
- Document any red flags
- Continue if checks pass, escalate if they fail

**Post-Analysis (Expert Interpretation):**

```
Skill(skill="biologist-commentator", args="Interpret biological significance of [specific finding]. Results show [X], which is [expected/unexpected]. Known biology suggests [Y]. Please validate interpretation and suggest mechanisms.")
```

### Handoff Workflow

1. **Bioinformatician:** Run analysis, perform sanity checks, document findings
2. **Handoff:** Create a structured handoff section in the notebook (see templates above)
3. **Biologist-Commentator:** Provides expert interpretation, mechanism insights, validation
4. **Bioinformatician:** Incorporate the interpretation into the notebook, flag needed validations
## Pre-Flight Checklist

Before starting implementation, verify:

- Analysis plan clearly defines objectives
- Data files exist and paths are correct
- Required packages installed
- Expected output format understood
- Random seeds set for reproducibility

Use `assets/analysis-checklist.md` for the complete list.
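A few of these checks can be automated at the top of the notebook. A minimal sketch, with placeholder file paths and package names:

```python
import importlib
from pathlib import Path

# Placeholder inputs for this analysis
required_files = [Path("analysis_plan.md"), Path("data/raw/counts_matrix.h5ad")]
required_packages = ["numpy", "pandas", "scanpy"]

# Data files exist and paths are correct
for path in required_files:
    assert path.exists(), f"Missing input: {path}"

# Required packages are installed and importable
for pkg in required_packages:
    importlib.import_module(pkg)

print("Pre-flight checks passed")
```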
## Reproducibility Standards

**Critical:** Every bioinformatics analysis must be fully reproducible. Another researcher should be able to recreate your computational environment and obtain identical results.

### Environment Documentation (Mandatory)

Start every notebook with environment documentation:

```python
# %% Computational Environment
import sys
import numpy as np
import pandas as pd
import scanpy as sc  # or relevant packages

print("=" * 60)
print("COMPUTATIONAL ENVIRONMENT")
print("=" * 60)
print(f"Python: {sys.version}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Scanpy: {sc.__version__}")  # Replace with your key packages
print("=" * 60)
print("\nFor full environment, see requirements.txt")
```
Create environment files before starting the analysis.

**For micromamba users (recommended for bioinformatics):**

```bash
# Export micromamba packages
micromamba env export > environment.yml

# Export pip-installed packages separately (micromamba export does not include pip packages)
pip freeze > pip-requirements.txt
```

**For pip users:**

```bash
pip freeze > requirements.txt
```

**Document which file to use** in a notebook markdown cell:

```markdown
## Computational Environment
- Kernel: Python 3.11 (bio-analysis-env)
- Environment file: `environment.yml` (recreate with `micromamba env create -f environment.yml`)
- Key packages: scanpy==1.10.0, numpy==1.26.3, pandas==2.2.0, scipy==1.12.0
- Execution date: 2026-01-29
```
### Random Seed Setting (Mandatory for Stochastic Processes)

Set seeds in the setup cell:

```python
# %% Random seeds for reproducibility
import numpy as np
import random

RANDOM_SEED = 42  # Document the choice (convention, replicating a published analysis, etc.)

# Core Python/NumPy
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# Scanpy (single-cell analysis)
import scanpy as sc
sc.settings.seed = RANDOM_SEED

# PyTorch (if using deep learning)
import torch
torch.manual_seed(RANDOM_SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(RANDOM_SEED)

# TensorFlow (if using)
import tensorflow as tf
tf.random.set_seed(RANDOM_SEED)

print(f"Random seed set to {RANDOM_SEED} for reproducibility")
```
**Bioinformatics operations requiring seeds:**

- Dimensionality reduction: UMAP, t-SNE, PCA with randomized SVD
- Clustering: Leiden, Louvain (graph-based)
- Sampling: Random subsampling, bootstrap, cross-validation
- Imputation: Stochastic imputation methods
- Simulation: Monte Carlo, permutation tests
- Machine learning: Random forests, neural networks, k-means initialization

**Document in the notebook:**

```markdown
## Stochastic Operations
This analysis uses:
- UMAP (random initialization, seed=42)
- Leiden clustering (random walk, seed=42)
- 1000-iteration permutation test (seed=42)

All seeds set to 42 for reproducibility.
```
### Session Info Output (Mandatory)

End every notebook with comprehensive session info:

```python
# %% Session Information for Reproducibility
import session_info

session_info.show(dependencies=True, html=False)
```

Alternative for single-cell workflows:

```python
import scanpy as sc

sc.logging.print_versions()
```

Alternative for base Python:

```python
import sys
import pkg_resources

print(f"Python: {sys.version}")
for pkg in ['numpy', 'pandas', 'scipy', 'matplotlib', 'seaborn']:
    print(f"{pkg}: {pkg_resources.get_distribution(pkg).version}")
```
**What this captures:**

- Python version
- Operating system
- All package versions (including dependencies)
- Execution timestamp

**Why this matters:**

- APIs change between package versions
- Statistical method implementations evolve
- Bugs get fixed (results may change)
- Reviewers need to verify methods
### File Path Best Practices

Use relative paths and variables:

```python
# %% File paths
from pathlib import Path

# Define all paths at the top of the notebook
DATA_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")
RESULTS_DIR = Path("results/analysis_2026-01-29")
FIGURES_DIR = RESULTS_DIR / "figures"

# Create output directories
for directory in [PROCESSED_DIR, RESULTS_DIR, FIGURES_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

# Use variables throughout
counts_file = DATA_DIR / "counts_matrix.h5ad"
metadata_file = DATA_DIR / "sample_metadata.csv"
output_file = PROCESSED_DIR / "normalized_counts.h5ad"
figure_file = FIGURES_DIR / "umap_clusters.pdf"

print(f"Data directory: {DATA_DIR.resolve()}")
print(f"Results directory: {RESULTS_DIR.resolve()}")
```

**Never use hardcoded absolute paths:**

```python
# ❌ BAD (non-reproducible)
adata = sc.read_h5ad("/Users/yourname/project/data/counts.h5ad")
plt.savefig("/Users/yourname/Desktop/figure.pdf")

# ✅ GOOD (reproducible)
adata = sc.read_h5ad(DATA_DIR / "counts.h5ad")
plt.savefig(FIGURES_DIR / "umap_clusters.pdf")
```
### Data Provenance Documentation

Document data sources in the notebook:

```markdown
## Data Sources

### Input Data
- File: `data/raw/GSE123456_counts.h5ad`
- Source: GEO accession GSE123456
- Download date: 2026-01-15
- Download command: `wget https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE123456`
- Original publication: Smith et al. (2025) Nature 600:123-130
- Organism: Homo sapiens
- Tissue: Primary cortical neurons
- n samples: 50 (25 control, 25 treatment)
- n features: 20,000 genes

### Reference Data
- Genome build: GRCh38 (hg38)
- Gene annotations: GENCODE v42
- Downloaded: 2026-01-10 from https://www.gencodegenes.org/
```

**Why this matters:**

- Data can be updated or removed from repositories
- Genome builds affect coordinate-based analyses
- Sample metadata clarifies experimental design
- Enables others to download identical data
### Reproducibility Pre-Flight Checklist

Before starting the analysis, verify:

- Environment documented (environment.yml or requirements.txt exists)
- Environment creation documented in the notebook
- Random seeds will be set for all stochastic operations
- File paths use variables (no hardcoded absolute paths)
- Data sources documented (where to download, version, date)
- Genome build / reference database versions specified
- Session info cell will be added at the end

Before handoff to the PI, verify:

- Notebook runs end-to-end without errors (Restart Kernel & Run All)
- Results reproducible (run twice, identical outputs)
- All figures saved to FIGURES_DIR with descriptive names
- All processed data saved to PROCESSED_DIR
- Session info cell executed and output visible
- Execution time reasonable (< 2 hours for routine analyses)
## Integration with notebook-writer Skill

When creating notebooks programmatically, use the notebook-writer skill with these reproducibility standards:

```python
from pathlib import Path

# Use notebook-writer to create the template
cells = [
    {'type': 'markdown', 'content': '## Computational Environment\n...'},
    {'type': 'code', 'content': 'import sys\nprint(f"Python: {sys.version}")'},
    {'type': 'markdown', 'content': '## Data Loading\n...'},
    # ... analysis cells ...
    {'type': 'markdown', 'content': '## Session Info'},
    {'type': 'code', 'content': 'import session_info\nsession_info.show()'},
]

# Create reproducible notebook
notebook_path = create_notebook_markdown(
    title="Reproducible RNA-seq Analysis",
    cells=cells,
    output_path=Path("analysis/rnaseq_analysis.md"),
)
```
## Common Reproducibility Failures and Fixes

| Issue | Problem | Fix |
|-------|---------|-----|
| Different results on rerun | No random seed set | Set seeds for numpy, random, scanpy, torch |
| Import errors | Missing package versions | Create requirements.txt or environment.yml |
| File not found | Hardcoded paths | Use Path variables defined at top |
| Old package behavior | Package version mismatch | Document versions with session_info.show() |
| Data source vanished | URL changed or removed | Document download date, accession, mirror sites |
| Genome coordinate mismatch | Different genome build | Specify build (GRCh38 vs GRCh37) in notebook |
## Bioinformatics-Specific Reproducibility Considerations

**Organism and reference versions:**

```python
# Document in a code cell
ORGANISM = "Homo sapiens"
GENOME_BUILD = "GRCh38"             # or "mm39" for mouse, "dm6" for fly, etc.
ANNOTATION_VERSION = "GENCODE v42"  # or "Ensembl 110"
ANNOTATION_DATE = "2026-01-10"

print("Analysis configuration:")
print(f"  Organism: {ORGANISM}")
print(f"  Genome: {GENOME_BUILD}")
print(f"  Annotations: {ANNOTATION_VERSION} ({ANNOTATION_DATE})")
```

**Bioinformatics tools (if used):**

```markdown
## External Tools
- STAR aligner: v2.7.11a (for read mapping)
- MACS2: v2.2.9.1 (for peak calling)
- bedtools: v2.31.0 (for interval operations)

All tools available in the micromamba environment (see environment.yml).
```

**Data processing parameters:**

```python
# Document all filtering/QC thresholds
QC_PARAMS = {
    'min_genes_per_cell': 200,
    'min_cells_per_gene': 3,
    'max_pct_mt': 15,               # percent mitochondrial reads
    'min_counts': 1000,
    'highly_variable_genes': 2000,
    'n_pcs': 50,                    # principal components
    'umap_neighbors': 15,
    'leiden_resolution': 0.8,
}

print("Quality control parameters:")
for param, value in QC_PARAMS.items():
    print(f"  {param}: {value}")
```
## Code Quality Standards

### During Implementation

- Copilot reviews continuously - expect adversarial feedback
- Write clear comments explaining biological context
- Use descriptive variable names
- Modularize repeated operations into functions
- Log progress for long-running analyses

### Testing

- Validate on small test data first
- Check edge cases (empty data, single sample, all zeros)
- Compare to expected results (positive controls)
- Verify reproducibility (run twice, same results; see the sketch below)
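For the reproducibility check, one option is to run a seeded stochastic step twice and assert that the outputs are identical. A minimal sketch with a synthetic stand-in operation (the helper below is hypothetical, not part of the skill's scripts):

```python
import numpy as np

def stochastic_step(data: np.ndarray, seed: int) -> np.ndarray:
    """Stand-in for any seeded operation (subsampling, clustering, UMAP, ...)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(data.shape[0], size=min(100, data.shape[0]), replace=False)
    return data[np.sort(idx)]

data = np.random.default_rng(0).normal(size=(1_000, 20))  # small synthetic test data

# Run twice with the same seed: results must be identical
run1 = stochastic_step(data, seed=42)
run2 = stochastic_step(data, seed=42)
assert np.array_equal(run1, run2), "Results are not reproducible with a fixed seed"
print("Reproducibility check passed")
```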
## Common Analysis Patterns

### Pattern 1: Differential Expression (RNA-seq)

1. Load counts
2. Filter low-abundance genes
3. Normalize (DESeq2, TMM, or library size)
4. Statistical test (DESeq2, edgeR, limma)
5. Multiple testing correction
6. Volcano plot + heatmap

→ Use the pydeseq2 skill for implementation details
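A condensed sketch of this pattern using PyDESeq2; the file names, the `condition` design factor, the contrast levels, and the thresholds are assumptions, and the constructor arguments may differ between pydeseq2 versions (see the pydeseq2 skill for authoritative usage):

```python
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# Hypothetical inputs: counts (samples x genes) and sample metadata with a 'condition' column
counts = pd.read_csv("data/raw/counts_matrix.csv", index_col=0)
metadata = pd.read_csv("data/raw/sample_metadata.csv", index_col=0)

# Filter low-abundance genes before modelling
counts = counts.loc[:, counts.sum(axis=0) >= 10]

# Fit the DESeq2 model and test treatment vs control
dds = DeseqDataSet(counts=counts, metadata=metadata, design_factors="condition")
dds.deseq2()
stats = DeseqStats(dds, contrast=["condition", "treatment", "control"])
stats.summary()

# Results with Benjamini-Hochberg adjusted p-values
results = stats.results_df
significant = results[(results["padj"] < 0.05) & (results["log2FoldChange"].abs() > 1)]
print(f"{len(significant)} significant genes")
```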
### Pattern 2: Single-Cell Analysis

1. Load AnnData object
2. QC filtering (cells and genes)
3. Normalization and log-transform
4. Feature selection (highly variable genes)
5. Dimensionality reduction (PCA, UMAP)
6. Clustering
7. Marker gene identification
8. Visualization

→ Use the scanpy skill for implementation details
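A condensed sketch of this pattern with scanpy; the input file is a placeholder and the thresholds mirror the QC_PARAMS example above but are otherwise assumptions (see the scanpy skill for parameter guidance):

```python
import scanpy as sc

adata = sc.read_h5ad("data/raw/counts_matrix.h5ad")  # placeholder input

# QC filtering of cells and genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Normalization, log-transform, and feature selection
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction, clustering, and marker genes (seeded for reproducibility)
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata, random_state=42)
sc.tl.leiden(adata, resolution=0.8, random_state=42)
sc.tl.rank_genes_groups(adata, groupby="leiden")

# Visualization
sc.pl.umap(adata, color="leiden", save="_clusters.pdf")
```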
### Pattern 3: Sequence Analysis

1. Read FASTA/FASTQ
2. Quality filtering
3. Alignment or motif search
4. Feature extraction
5. Statistical summary

→ Use the biopython skill for implementation details
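A minimal Biopython sketch of the read-and-filter steps; the file paths and the Q20 threshold are assumptions (alignment and motif search are covered by the biopython skill):

```python
from Bio import SeqIO

# Placeholder input/output paths
input_fastq = "data/raw/reads.fastq"
output_fastq = "data/processed/reads_filtered.fastq"

# Read records and apply a simple per-read quality filter
records = SeqIO.parse(input_fastq, "fastq")
filtered = (
    rec for rec in records
    if min(rec.letter_annotations["phred_quality"]) >= 20  # drop reads with any base below Q20
)

n_written = SeqIO.write(filtered, output_fastq, "fastq")
print(f"Wrote {n_written} reads passing the quality filter")

# Simple feature extraction / statistical summary on the filtered reads
lengths = [len(rec) for rec in SeqIO.parse(output_fastq, "fastq")]
print(f"Mean read length: {sum(lengths) / len(lengths):.1f} nt")
```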
## References

For detailed guidance:

- `references/analysis_workflows.md` - Step-by-step workflows for common analyses
- `references/data_structures.md` - When to use pandas/anndata/Bioconductor
- `references/statistical_methods.md` - Which test for which data
- `references/visualization_best_practices.md` - Plot selection and styling
## Helper Scripts

Available in `scripts/`:

- `qc_pipeline.py` - Automated QC for RNA-seq data
- `differential_expression_template.py` - Complete DESeq2 pipeline
- `data_loader_helpers.py` - Functions for common file formats

**Usage:** Read these scripts as reference implementations, copy/adapt them for your specific analysis, or call them directly via Bash if appropriate.
## Integration with Domain Skills

When analysis requires specialized knowledge:

| Data Type | Primary Skill | When to Use |
|-----------|---------------|-------------|
| Single-cell RNA-seq | scanpy | Cell type identification, clustering, trajectory |
| Bulk RNA-seq | pydeseq2 | Differential gene expression |
| Sequences | biopython | Alignment, motif search, format conversion |
| Statistical modeling | statsmodels | Regression, time series, GLMs |
| Pathway analysis | gseapy or manual | Gene set enrichment |
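For the pathway-analysis row, a minimal gseapy over-representation sketch; the gene list and library name are placeholders, and option names should be checked against your installed gseapy version:

```python
import gseapy as gp

# Hypothetical input: significant genes from a differential expression analysis
gene_list = ["TP53", "CDKN1A", "MDM2", "BAX", "GADD45A"]

# Over-representation analysis against a named Enrichr library
enr = gp.enrichr(
    gene_list=gene_list,
    gene_sets=["GO_Biological_Process_2021"],
    organism="human",
    outdir=None,  # keep results in memory instead of writing files
)

# Top enriched terms with adjusted p-values
print(enr.results[["Term", "Adjusted P-value", "Genes"]].head())
```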
**Pattern:**

1. Use bioinformatician for the overall workflow
2. Invoke the specialized skill for domain-specific steps
3. Integrate results back into the main analysis
## Copilot Review Integration

During implementation, the copilot skill reviews your code:

- Expect critical feedback (adversarial but constructive)
- Fix issues immediately before proceeding
- Iterate until the code is robust
- Don't take criticism personally - it catches bugs early
## Deliverables

Complete notebook should include:

**Technical Components (existing):**

1. **Code cells:** Well-commented, modular analysis
2. **Visualizations:** Publication-ready figures
3. **Statistics:** Complete reporting (test, p-value, effect size, n)
4. **Exports:** Processed data files, figure files
5. **Session info:** Package versions for reproducibility

**Biological Components (new):**

6. **Biological Context Cells (markdown):**
   - Research question in biological terms
   - Hypothesis and expected outcomes
   - Biological description of each analysis step
   - Relevance to the biological question
7. **Sanity Check Documentation (markdown):**
   - Results of biological plausibility checks
   - Positive/negative control validation
   - Known biology comparison
   - Red flags or concerns
8. **Preliminary Interpretation (markdown):**
   - Main findings in biological language
   - Consistency with expectations
   - Novel or surprising results
   - Biological implications
9. **Expert Handoff Section (markdown, if needed):**
   - Structured questions for biologist-commentator
   - Specific findings needing interpretation
   - Recommended follow-up analyses
   - Caveats and limitations

**Quality Indicator:** The notebook should be readable by a biologist who doesn't code.
## Quality Indicators

Your notebook is ready when:

**Technical Quality:**

- All code executes without errors
- Random seed set, results reproducible
- QC checks passed (positive controls work)
- Visualizations properly labeled
- Statistics completely reported
- Copilot approved code (no outstanding critical issues)

**Biological Quality:**

- Biological context provided for all major sections (concise, 1-3 sentences)
- Biological sanity checks completed and documented
- Positive/negative controls validated against biological expectations
- Preliminary interpretation written in biological terms
- Handoff to biologist-commentator structured (if unexpected findings)
- Notebook readable by a non-coding biologist

**Integration Ready:**

- Ready for the PI to expand interpretations for publication
- Clear which findings are routine vs which need expert review