# Bioinformatician Skill

## Purpose

Implement computational analyses of biological data, including:

- Data loading and quality control
- Statistical analysis
- Bioinformatics pipelines
- Visualization
- Integration with domain-specific tools
## When to Use This Skill

Use this skill when you need to:

- Implement an analysis plan in code (from the PI)
- Process genomics/transcriptomics/proteomics data
- Perform statistical tests on biological data
- Create publication-quality visualizations
- Build reproducible analysis pipelines
- Integrate multiple bioinformatics tools
## Workflow Integration

### Primary Pattern: Receive Plan → Implement → Deliver Notebook

```
Receive analysis_plan.md from PI
    ↓
Implement in Jupyter notebook
    ↓  (copilot reviews continuously)
Deliver completed notebook to PI for interpretation
```

**Integration Points:**

- RECEIVES: Analysis plan from principal-investigator
- WORKS WITH: copilot (adversarial code review during implementation)
- CALLS: Domain-specific skills (scanpy, pydeseq2, biopython, etc.)
- OUTPUTS: Jupyter notebooks with analysis code + results
## Core Capabilities

### Data Loading and Validation

- Read common formats (CSV, TSV, HDF5, Parquet, FASTQ, BAM, VCF)
- Validate data integrity and format
- Handle compressed files
- Memory-efficient loading for large datasets (see the loading sketch after this section)

### Quality Control

- Sample quality metrics
- Outlier detection
- Batch effect assessment
- Positive/negative control validation

### Statistical Analysis

- Differential expression/abundance
- Enrichment analysis
- Clustering and dimensionality reduction
- Correlation and regression
- Multiple testing correction

### Visualization

- Publication-quality plots (matplotlib, seaborn, plotly)
- Interactive visualizations
- Consistent styling
- Proper labeling and legends

### Pipeline Development

- Modular, reusable code
- Parameter documentation
- Progress logging
- Error handling
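As a minimal illustration of memory-efficient loading with validation, the sketch below streams a large counts table in chunks and runs basic integrity checks before analysis. The file name, separator, and chunk size are placeholder assumptions, not part of this skill's assets.

```python
from pathlib import Path

import pandas as pd

# Placeholder input; adjust path and separator to your data (pandas reads gzip transparently)
counts_path = Path("data/raw/counts_matrix.tsv.gz")

# Memory-efficient loading: stream the file in chunks instead of reading it all at once
chunks = pd.read_csv(counts_path, sep="\t", index_col=0, chunksize=10_000)
counts = pd.concat(chunks)

# Basic integrity checks before any analysis
assert counts.index.is_unique, "Duplicate gene identifiers found"
assert not counts.isna().any().any(), "Missing values in counts matrix"
assert (counts >= 0).all().all(), "Negative values are not valid counts"

print(f"Loaded {counts.shape[0]} genes x {counts.shape[1]} samples")
```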
## Standard Notebook Structure

Use the template in `assets/notebook-structure-template.ipynb`:

1. **Title and Description**
   - Research question
   - Date, author
   - Reference to analysis plan
2. **Setup**
   - Imports
   - Configuration parameters
   - Random seeds for reproducibility
3. **Data Loading**
   - Read data files
   - Initial inspection
   - Data structure validation
4. **Quality Control**
   - Sample metrics
   - Filtering criteria
   - QC visualizations
5. **Analysis**
   - Statistical tests
   - Transformations
   - Model fitting
6. **Visualization**
   - Main figures
   - Supplementary plots
7. **Export Results**
   - Save processed data
   - Export figures
   - Summary statistics
8. **Session Info**
   - Package versions
   - Execution time
## Biological Literacy Framework

### Writing Style for Biological Context

All biological context in notebooks should follow concise scientific prose.

**Principles:**

- ✅ Brief: 1-3 sentences per section, not paragraphs
- ✅ Clear: Use precise biological terminology
- ✅ Factual: State what/why without excessive detail
- ✅ Publication-ready: Like Methods/Results sections in papers

**Example - Good (Concise):**

> **Biological Context**
> Differential expression analysis comparing wild-type and mutant neurons identifies genes affected by loss of transcription factor X. Expected upregulation of target genes based on ChIP-seq data (Smith et al. 2020).

**Example - Avoid (Too Verbose):**

> **Biological Context**
> In this analysis, we will perform differential expression analysis to compare gene expression between wild-type neurons and neurons with a mutation in transcription factor X. Previous research has shown that transcription factor X plays a critical role in neuronal development by binding to the promoters of many developmentally important genes...
### When to Provide Interpretation vs Handoff

**Bioinformatician handles (routine interpretation):**

- Standard results following known biology
- Positive/negative controls behaving as expected
- Results matching literature precedents
- Technical QC assessments with biological implications
- Magnitude/direction sanity checks

**Handoff to biologist-commentator (expert needed):**

- Novel or unexpected findings
- Results contradicting established biology
- Unclear biological mechanisms
- Publication-critical interpretations
- Proposing new hypotheses or models
## Enhanced Notebook Structure

Use this structure for biologically literate notebooks:

1. **Title and Scientific Context**
   - Research question (biological, not just technical)
   - Biological hypothesis
   - Expected outcome and why it matters
   - Relevant background (1-2 sentences)
2. **Setup (code)**
   - Imports, parameters, seeds
3. **Data Loading**
   - Code: Load data
   - Biological description of dataset (markdown):
     - What organism/tissue/condition
     - What genes/features were measured
     - What biological question the dataset addresses
4. **Quality Control**
   - Code: QC metrics, filtering
   - Biological interpretation of QC (markdown):
     - Are pass rates expected for this data type?
     - Do failed samples have biological meaning?
     - Any red flags from a biological perspective?
5. **Analysis**
   - Code: Statistical tests, transformations
   - Biological reasoning for each step (markdown):
     - Why this method for this question?
     - What biological assumption is being tested?
     - Positive/negative controls?
6. **Results**
   - Code: Generate results
   - Biological sanity checks (markdown):
     - Do magnitudes make sense?
     - Do directions align with known biology?
     - Is any known biology violated?
7. **Visualization**
   - Code: Plots
   - Biological interpretation scaffolding (markdown):
     - What biological pattern does this show?
     - Is this expected or surprising?
     - What follow-up questions does this raise?
8. **Preliminary Interpretation**
   - Bioinformatician's biological assessment (markdown):
     - Main findings in biological terms
     - Caveats and limitations
     - Questions for biologist-commentator
9. **Handoff to Expert (if needed)**
   - Structured questions for biologist-commentator (markdown):
     - Specific results needing interpretation
     - Unexpected findings to validate
     - Biological mechanisms to explore
10. **Export (code)**
    - Save data, figures, session info
## Biological Sanity Check Framework

Run these checks before accepting results:

### Expression/Abundance Checks

- Order of magnitude reasonable? (|log2FC| > 10 is suspicious)
- Direction matches known biology? (check a few known genes)
- Positive controls behave as expected?
- Negative controls show no signal?

### Statistical Checks with Biological Lens

- Top hits include known biology? (literature validation)
- Results robust to threshold changes?
- Batch effects separated from real biology?
- Multiple testing correction appropriate for the goal? (discovery vs validation)

### Genomics-Specific

- Chromosome names consistent? (chr1 vs 1)
- Coordinates sensible? (within chromosome bounds)
- Strand orientation correct for gene features?
- Genome build consistent throughout?

### Experimental Design

- Sample size adequate for this effect size?
- Replicates biological or technical?
- Confounders identified and addressed?
- Controls appropriate for this experiment type?

**If any check fails:** Document it in the notebook and flag for biologist-commentator review.
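A minimal sketch of how the expression/abundance checks might be automated, assuming DESeq2-style result columns (`log2FoldChange`, `padj`) and hypothetical control gene lists; adapt the file path and gene names to your own results:

```python
import pandas as pd

# Hypothetical inputs: DE results indexed by gene symbol, plus expected-behaviour gene sets
results = pd.read_csv("results/analysis_2026-01-29/de_results.csv", index_col=0)
positive_controls = {"GENE1": "up", "GENE2": "down"}   # directions expected from literature
negative_controls = ["ACTB", "GAPDH"]                  # expected unchanged

# Magnitude check: very large fold changes are suspicious
extreme = results[results["log2FoldChange"].abs() > 10]
print(f"Genes with |log2FC| > 10 (suspicious): {len(extreme)}")

# Direction check for positive controls
for gene, expected in positive_controls.items():
    if gene in results.index:
        lfc = results.loc[gene, "log2FoldChange"]
        observed = "up" if lfc > 0 else "down"
        flag = "OK" if observed == expected else "FLAG for biologist-commentator"
        print(f"{gene}: log2FC={lfc:.2f} ({observed}, expected {expected}) -> {flag}")

# Negative controls should not be significant
for gene in negative_controls:
    if gene in results.index:
        padj = results.loc[gene, "padj"]
        print(f"{gene}: padj={padj:.3g} -> {'OK' if padj > 0.05 else 'FLAG'}")
```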
## Biological Context Templates

### Template: Differential Expression Analysis

**Biological Context**

Comparing [condition A] vs [condition B] to identify genes involved in [biological process]. Expected upregulation of [pathway X] genes based on [mechanism/literature]. Positive controls: [gene1, gene2]. Expected log2FC range: [X-Y] based on [citation].

**Biological Sanity Checks**

- Known pathway genes show expected direction (e.g., gene1 ↑, gene2 ↓)
- Housekeeping genes unchanged (actb, gapdh)
- Magnitudes reasonable (|log2FC| < 10 for transcriptional regulation)

**Preliminary Interpretation**

Top hits include [gene X, Y, Z] involved in [biological process], consistent with [hypothesis/literature]. [Gene W] unexpected - requires expert validation.

**Handoff:** Unexpected downregulation of [gene W] contradicts its known role in [process]. Biologist-commentator needed for mechanism assessment.

### Template: Single-Cell Clustering

**Biological Context**

Clustering [tissue] cells to identify cell types. Expected populations: [celltype1 (markers: a, b, c), celltype2 (markers: d, e, f)]. Reference atlas: [citation if available].

**Cluster Validation**

- Cluster 1: [celltype] - markers: [genes] ✓
- Cluster 2: [celltype] - markers: [genes] ✓
- Cluster 3: Novel population - markers: [genes] - needs expert review

**Handoff:** Cluster 3 shows unexpected marker combination [X+Y+Z-]. Biologist-commentator needed for cell type identification and biological significance.

### Template: Expert Handoff Format

Use this concise format when escalating to biologist-commentator:

**Expert Interpretation Needed**

- Finding: [Specific result with statistics]
- Context: [1-2 sentence background]
- Issue: [What's unexpected/unclear and why]
- Question: [Specific question for expert]
- Validation Done: [Positive controls: ✓/✗, Literature: consistent/contradicts]

**Example:**

- Finding: Gene X shows 8-fold upregulation (padj < 0.001) in mutant vs WT
- Context: Gene X is a transcriptional repressor; expected downregulation of its targets
- Issue: Target genes are also upregulated (contradicts repressor function)
- Question: Alternative mechanism? Post-transcriptional regulation? Data artifact?
- Validation Done: Positive controls ✓, replicates consistent ✓, literature shows conflicting results
## Biologist-Commentator Integration Pattern

### When to Invoke Biologist-Commentator

**Pre-Analysis (Method Validation):**

```
Skill(skill="biologist-commentator", args="Validate that DESeq2 is appropriate for [specific experiment design]. Confirm controls adequate and confounders addressed.")
```

**During Analysis (Quick Check):**

- Use the biological sanity check framework (above)
- Document any red flags
- Continue if checks pass, escalate if they fail

**Post-Analysis (Expert Interpretation):**

```
Skill(skill="biologist-commentator", args="Interpret biological significance of [specific finding]. Results show [X], which is [expected/unexpected]. Known biology suggests [Y]. Please validate interpretation and suggest mechanisms.")
```

### Handoff Workflow

1. **Bioinformatician:** Run analysis, perform sanity checks, document findings
2. **Handoff:** Create a structured handoff section in the notebook (see templates above)
3. **Biologist-Commentator:** Provides expert interpretation, mechanism insights, validation
4. **Bioinformatician:** Incorporate the interpretation into the notebook, flag needed validations
## Pre-Flight Checklist

Before starting implementation, verify:

- Analysis plan clearly defines objectives
- Data files exist and paths are correct
- Required packages installed
- Expected output format understood
- Random seeds set for reproducibility

Use `assets/analysis-checklist.md` for the complete list.
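A few of these checks can be automated at the top of the notebook. A minimal sketch, with placeholder file paths and package names:

```python
import importlib
from pathlib import Path

# Placeholder inputs for this analysis
required_files = [Path("analysis_plan.md"), Path("data/raw/counts_matrix.h5ad")]
required_packages = ["numpy", "pandas", "scanpy"]

# Data files exist and paths are correct
for path in required_files:
    assert path.exists(), f"Missing input: {path}"

# Required packages are installed and importable
for pkg in required_packages:
    importlib.import_module(pkg)

print("Pre-flight checks passed")
```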
## Reproducibility Standards

**Critical:** Every bioinformatics analysis must be fully reproducible. Another researcher should be able to recreate your computational environment and obtain identical results.

### Environment Documentation (Mandatory)

Start every notebook with environment documentation:

```python
# %% Computational Environment
import sys
import numpy as np
import pandas as pd
import scanpy as sc  # or relevant packages

print("=" * 60)
print("COMPUTATIONAL ENVIRONMENT")
print("=" * 60)
print(f"Python: {sys.version}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Scanpy: {sc.__version__}")  # Replace with your key packages
print("=" * 60)
print("\nFor full environment, see requirements.txt")
```
Create environment files before starting the analysis.

**For micromamba users (recommended for bioinformatics):**

```bash
# Export micromamba packages
micromamba env export > environment.yml

# Export pip-installed packages separately (micromamba export does not include pip packages)
pip freeze > pip-requirements.txt
```

**For pip users:**

```bash
pip freeze > requirements.txt
```

**Document which file to use** in a notebook markdown cell:

```markdown
## Computational Environment
- Kernel: Python 3.11 (bio-analysis-env)
- Environment file: `environment.yml` (recreate with `micromamba env create -f environment.yml`)
- Key packages: scanpy==1.10.0, numpy==1.26.3, pandas==2.2.0, scipy==1.12.0
- Execution date: 2026-01-29
```
### Random Seed Setting (Mandatory for Stochastic Processes)

Set seeds in the setup cell:

```python
# %% Random seeds for reproducibility
import numpy as np
import random

RANDOM_SEED = 42  # Document the choice (convention, replicating a published analysis, etc.)

# Core Python/NumPy
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# Scanpy (single-cell analysis)
import scanpy as sc
sc.settings.seed = RANDOM_SEED

# PyTorch (if using deep learning)
import torch
torch.manual_seed(RANDOM_SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(RANDOM_SEED)

# TensorFlow (if using)
import tensorflow as tf
tf.random.set_seed(RANDOM_SEED)

print(f"Random seed set to {RANDOM_SEED} for reproducibility")
```
**Bioinformatics operations requiring seeds:**

- Dimensionality reduction: UMAP, t-SNE, PCA with randomized SVD
- Clustering: Leiden, Louvain (graph-based)
- Sampling: Random subsampling, bootstrap, cross-validation
- Imputation: Stochastic imputation methods
- Simulation: Monte Carlo, permutation tests
- Machine learning: Random forests, neural networks, k-means initialization

**Document in the notebook:**

```markdown
## Stochastic Operations
This analysis uses:
- UMAP (random initialization, seed=42)
- Leiden clustering (random walk, seed=42)
- 1000-iteration permutation test (seed=42)

All seeds set to 42 for reproducibility.
```
### Session Info Output (Mandatory)

End every notebook with comprehensive session info:

```python
# %% Session Information for Reproducibility
import session_info

session_info.show(dependencies=True, html=False)
```

Alternative for single-cell workflows:

```python
import scanpy as sc

sc.logging.print_versions()
```

Alternative for base Python:

```python
import sys
import pkg_resources

print(f"Python: {sys.version}")
for pkg in ['numpy', 'pandas', 'scipy', 'matplotlib', 'seaborn']:
    print(f"{pkg}: {pkg_resources.get_distribution(pkg).version}")
```
**What this captures:**

- Python version
- Operating system
- All package versions (including dependencies)
- Execution timestamp

**Why this matters:**

- APIs change between package versions
- Statistical method implementations evolve
- Bugs get fixed (results may change)
- Reviewers need to verify methods
### File Path Best Practices

Use relative paths and variables:

```python
# %% File paths
from pathlib import Path

# Define all paths at the top of the notebook
DATA_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")
RESULTS_DIR = Path("results/analysis_2026-01-29")
FIGURES_DIR = RESULTS_DIR / "figures"

# Create output directories
for directory in [PROCESSED_DIR, RESULTS_DIR, FIGURES_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

# Use variables throughout
counts_file = DATA_DIR / "counts_matrix.h5ad"
metadata_file = DATA_DIR / "sample_metadata.csv"
output_file = PROCESSED_DIR / "normalized_counts.h5ad"
figure_file = FIGURES_DIR / "umap_clusters.pdf"

print(f"Data directory: {DATA_DIR.resolve()}")
print(f"Results directory: {RESULTS_DIR.resolve()}")
```

**Never use hardcoded absolute paths:**

```python
# ❌ BAD (non-reproducible)
adata = sc.read_h5ad("/Users/yourname/project/data/counts.h5ad")
plt.savefig("/Users/yourname/Desktop/figure.pdf")

# ✅ GOOD (reproducible)
adata = sc.read_h5ad(DATA_DIR / "counts.h5ad")
plt.savefig(FIGURES_DIR / "umap_clusters.pdf")
```
### Data Provenance Documentation

Document data sources in the notebook:

```markdown
## Data Sources

### Input Data
- File: `data/raw/GSE123456_counts.h5ad`
- Source: GEO accession GSE123456
- Download date: 2026-01-15
- Download command: `wget https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE123456`
- Original publication: Smith et al. (2025) Nature 600:123-130
- Organism: Homo sapiens
- Tissue: Primary cortical neurons
- n samples: 50 (25 control, 25 treatment)
- n features: 20,000 genes

### Reference Data
- Genome build: GRCh38 (hg38)
- Gene annotations: GENCODE v42
- Downloaded: 2026-01-10 from https://www.gencodegenes.org/
```

**Why this matters:**

- Data can be updated or removed from repositories
- Genome builds affect coordinate-based analyses
- Sample metadata clarifies experimental design
- Enables others to download identical data
### Reproducibility Pre-Flight Checklist

Before starting the analysis, verify:

- Environment documented (environment.yml or requirements.txt exists)
- Environment creation documented in the notebook
- Random seeds will be set for all stochastic operations
- File paths use variables (no hardcoded absolute paths)
- Data sources documented (where to download, version, date)
- Genome build / reference database versions specified
- Session info cell will be added at the end

Before handoff to the PI, verify:

- Notebook runs end-to-end without errors (Restart Kernel & Run All)
- Results reproducible (run twice, identical outputs)
- All figures saved to FIGURES_DIR with descriptive names
- All processed data saved to PROCESSED_DIR
- Session info cell executed and output visible
- Execution time reasonable (< 2 hours for routine analyses)
## Integration with notebook-writer Skill

When creating notebooks programmatically, use the notebook-writer skill with these reproducibility standards:

```python
from pathlib import Path

# Use notebook-writer to create the template
cells = [
    {'type': 'markdown', 'content': '## Computational Environment\n...'},
    {'type': 'code', 'content': 'import sys\nprint(f"Python: {sys.version}")'},
    {'type': 'markdown', 'content': '## Data Loading\n...'},
    # ... analysis cells ...
    {'type': 'markdown', 'content': '## Session Info'},
    {'type': 'code', 'content': 'import session_info\nsession_info.show()'},
]

# Create reproducible notebook
notebook_path = create_notebook_markdown(
    title="Reproducible RNA-seq Analysis",
    cells=cells,
    output_path=Path("analysis/rnaseq_analysis.md"),
)
```
## Common Reproducibility Failures and Fixes

| Issue | Problem | Fix |
|-------|---------|-----|
| Different results on rerun | No random seed set | Set seeds for numpy, random, scanpy, torch |
| Import errors | Missing package versions | Create requirements.txt or environment.yml |
| File not found | Hardcoded paths | Use Path variables defined at top |
| Old package behavior | Package version mismatch | Document versions with session_info.show() |
| Data source vanished | URL changed or removed | Document download date, accession, mirror sites |
| Genome coordinate mismatch | Different genome build | Specify build (GRCh38 vs GRCh37) in notebook |
## Bioinformatics-Specific Reproducibility Considerations

**Organism and reference versions:**

```python
# Document in a code cell
ORGANISM = "Homo sapiens"
GENOME_BUILD = "GRCh38"             # or "mm39" for mouse, "dm6" for fly, etc.
ANNOTATION_VERSION = "GENCODE v42"  # or "Ensembl 110"
ANNOTATION_DATE = "2026-01-10"

print("Analysis configuration:")
print(f"  Organism: {ORGANISM}")
print(f"  Genome: {GENOME_BUILD}")
print(f"  Annotations: {ANNOTATION_VERSION} ({ANNOTATION_DATE})")
```

**Bioinformatics tools (if used):**

```markdown
## External Tools
- STAR aligner: v2.7.11a (for read mapping)
- MACS2: v2.2.9.1 (for peak calling)
- bedtools: v2.31.0 (for interval operations)

All tools available in the micromamba environment (see environment.yml).
```

**Data processing parameters:**

```python
# Document all filtering/QC thresholds
QC_PARAMS = {
    'min_genes_per_cell': 200,
    'min_cells_per_gene': 3,
    'max_pct_mt': 15,               # percent mitochondrial reads
    'min_counts': 1000,
    'highly_variable_genes': 2000,
    'n_pcs': 50,                    # principal components
    'umap_neighbors': 15,
    'leiden_resolution': 0.8,
}

print("Quality control parameters:")
for param, value in QC_PARAMS.items():
    print(f"  {param}: {value}")
```
## Code Quality Standards

### During Implementation

- Copilot reviews continuously - expect adversarial feedback
- Write clear comments explaining biological context
- Use descriptive variable names
- Modularize repeated operations into functions
- Log progress for long-running analyses

### Testing

- Validate on small test data first
- Check edge cases (empty data, single sample, all zeros)
- Compare to expected results (positive controls)
- Verify reproducibility (run twice, same results; see the sketch below)
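For the reproducibility check, one option is to run a seeded stochastic step twice and assert that the outputs are identical. A minimal sketch with a synthetic stand-in operation (the helper below is hypothetical, not part of the skill's scripts):

```python
import numpy as np

def stochastic_step(data: np.ndarray, seed: int) -> np.ndarray:
    """Stand-in for any seeded operation (subsampling, clustering, UMAP, ...)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(data.shape[0], size=min(100, data.shape[0]), replace=False)
    return data[np.sort(idx)]

data = np.random.default_rng(0).normal(size=(1_000, 20))  # small synthetic test data

# Run twice with the same seed: results must be identical
run1 = stochastic_step(data, seed=42)
run2 = stochastic_step(data, seed=42)
assert np.array_equal(run1, run2), "Results are not reproducible with a fixed seed"
print("Reproducibility check passed")
```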
## Common Analysis Patterns

### Pattern 1: Differential Expression (RNA-seq)

1. Load counts
2. Filter low-abundance genes
3. Normalize (DESeq2, TMM, or library size)
4. Statistical test (DESeq2, edgeR, limma)
5. Multiple testing correction
6. Volcano plot + heatmap

→ Use the pydeseq2 skill for implementation details
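A condensed sketch of this pattern using PyDESeq2; the file names, the `condition` design factor, the contrast levels, and the thresholds are assumptions, and the constructor arguments may differ between pydeseq2 versions (see the pydeseq2 skill for authoritative usage):

```python
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# Hypothetical inputs: counts (samples x genes) and sample metadata with a 'condition' column
counts = pd.read_csv("data/raw/counts_matrix.csv", index_col=0)
metadata = pd.read_csv("data/raw/sample_metadata.csv", index_col=0)

# Filter low-abundance genes before modelling
counts = counts.loc[:, counts.sum(axis=0) >= 10]

# Fit the DESeq2 model and test treatment vs control
dds = DeseqDataSet(counts=counts, metadata=metadata, design_factors="condition")
dds.deseq2()
stats = DeseqStats(dds, contrast=["condition", "treatment", "control"])
stats.summary()

# Results with Benjamini-Hochberg adjusted p-values
results = stats.results_df
significant = results[(results["padj"] < 0.05) & (results["log2FoldChange"].abs() > 1)]
print(f"{len(significant)} significant genes")
```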
### Pattern 2: Single-Cell Analysis

1. Load AnnData object
2. QC filtering (cells and genes)
3. Normalization and log-transform
4. Feature selection (highly variable genes)
5. Dimensionality reduction (PCA, UMAP)
6. Clustering
7. Marker gene identification
8. Visualization

→ Use the scanpy skill for implementation details
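A condensed sketch of this pattern with scanpy; the input file is a placeholder and the thresholds mirror the QC_PARAMS example above but are otherwise assumptions (see the scanpy skill for parameter guidance):

```python
import scanpy as sc

adata = sc.read_h5ad("data/raw/counts_matrix.h5ad")  # placeholder input

# QC filtering of cells and genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Normalization, log-transform, and feature selection
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction, clustering, and marker genes (seeded for reproducibility)
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata, random_state=42)
sc.tl.leiden(adata, resolution=0.8, random_state=42)
sc.tl.rank_genes_groups(adata, groupby="leiden")

# Visualization
sc.pl.umap(adata, color="leiden", save="_clusters.pdf")
```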
### Pattern 3: Sequence Analysis

1. Read FASTA/FASTQ
2. Quality filtering
3. Alignment or motif search
4. Feature extraction
5. Statistical summary

→ Use the biopython skill for implementation details
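A minimal Biopython sketch of the read-and-filter steps; the file paths and the Q20 threshold are assumptions (alignment and motif search are covered by the biopython skill):

```python
from Bio import SeqIO

# Placeholder input/output paths
input_fastq = "data/raw/reads.fastq"
output_fastq = "data/processed/reads_filtered.fastq"

# Read records and apply a simple per-read quality filter
records = SeqIO.parse(input_fastq, "fastq")
filtered = (
    rec for rec in records
    if min(rec.letter_annotations["phred_quality"]) >= 20  # drop reads with any base below Q20
)

n_written = SeqIO.write(filtered, output_fastq, "fastq")
print(f"Wrote {n_written} reads passing the quality filter")

# Simple feature extraction / statistical summary on the filtered reads
lengths = [len(rec) for rec in SeqIO.parse(output_fastq, "fastq")]
print(f"Mean read length: {sum(lengths) / len(lengths):.1f} nt")
```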
## References

For detailed guidance:

- `references/analysis_workflows.md` - Step-by-step workflows for common analyses
- `references/data_structures.md` - When to use pandas/anndata/Bioconductor
- `references/statistical_methods.md` - Which test for which data
- `references/visualization_best_practices.md` - Plot selection and styling
## Helper Scripts

Available in `scripts/`:

- `qc_pipeline.py` - Automated QC for RNA-seq data
- `differential_expression_template.py` - Complete DESeq2 pipeline
- `data_loader_helpers.py` - Functions for common file formats

**Usage:** Read these scripts as reference implementations, copy/adapt them for your specific analysis, or call them directly via Bash if appropriate.
## Integration with Domain Skills

When analysis requires specialized knowledge:

| Data Type | Primary Skill | When to Use |
|-----------|---------------|-------------|
| Single-cell RNA-seq | scanpy | Cell type identification, clustering, trajectory |
| Bulk RNA-seq | pydeseq2 | Differential gene expression |
| Sequences | biopython | Alignment, motif search, format conversion |
| Statistical modeling | statsmodels | Regression, time series, GLMs |
| Pathway analysis | gseapy or manual | Gene set enrichment |
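For the pathway-analysis row, a minimal gseapy over-representation sketch; the gene list and library name are placeholders, and option names should be checked against your installed gseapy version:

```python
import gseapy as gp

# Hypothetical input: significant genes from a differential expression analysis
gene_list = ["TP53", "CDKN1A", "MDM2", "BAX", "GADD45A"]

# Over-representation analysis against a named Enrichr library
enr = gp.enrichr(
    gene_list=gene_list,
    gene_sets=["GO_Biological_Process_2021"],
    organism="human",
    outdir=None,  # keep results in memory instead of writing files
)

# Top enriched terms with adjusted p-values
print(enr.results[["Term", "Adjusted P-value", "Genes"]].head())
```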
**Pattern:**

1. Use bioinformatician for the overall workflow
2. Invoke the specialized skill for domain-specific steps
3. Integrate results back into the main analysis
## Copilot Review Integration

During implementation, the copilot skill reviews your code:

- Expect critical feedback (adversarial but constructive)
- Fix issues immediately before proceeding
- Iterate until the code is robust
- Don't take criticism personally - it catches bugs early
## Deliverables

Complete notebook should include:

**Technical Components (existing):**

1. **Code cells:** Well-commented, modular analysis
2. **Visualizations:** Publication-ready figures
3. **Statistics:** Complete reporting (test, p-value, effect size, n)
4. **Exports:** Processed data files, figure files
5. **Session info:** Package versions for reproducibility

**Biological Components (new):**

6. **Biological Context Cells (markdown):**
   - Research question in biological terms
   - Hypothesis and expected outcomes
   - Biological description of each analysis step
   - Relevance to the biological question
7. **Sanity Check Documentation (markdown):**
   - Results of biological plausibility checks
   - Positive/negative control validation
   - Known biology comparison
   - Red flags or concerns
8. **Preliminary Interpretation (markdown):**
   - Main findings in biological language
   - Consistency with expectations
   - Novel or surprising results
   - Biological implications
9. **Expert Handoff Section (markdown, if needed):**
   - Structured questions for biologist-commentator
   - Specific findings needing interpretation
   - Recommended follow-up analyses
   - Caveats and limitations

**Quality Indicator:** The notebook should be readable by a biologist who doesn't code.
## Quality Indicators

Your notebook is ready when:

**Technical Quality:**

- All code executes without errors
- Random seed set, results reproducible
- QC checks passed (positive controls work)
- Visualizations properly labeled
- Statistics completely reported
- Copilot approved code (no outstanding critical issues)

**Biological Quality:**

- Biological context provided for all major sections (concise, 1-3 sentences)
- Biological sanity checks completed and documented
- Positive/negative controls validated against biological expectations
- Preliminary interpretation written in biological terms
- Handoff to biologist-commentator structured (if unexpected findings)
- Notebook readable by a non-coding biologist

**Integration Ready:**

- Ready for the PI to expand interpretations for publication
- Clear which findings are routine vs which need expert review