# autoresearch

> "The researcher's job shifts from writing Python to writing Markdown." — Andrej Karpathy

Autoresearch is an autonomous ML experimentation framework. An AI agent iteratively modifies `train.py`, runs fixed 5-minute GPU experiments, evaluates with a single metric (`val_bpb`), and commits only improvements via git ratcheting. The result: wake up to 100+ experiments logged and a monotonically better model.
## When to use this skill

- Setting up autoresearch on a GPU machine for the first time
- Writing or refining `program.md` research directives for the agent
- Launching an overnight autonomous experiment loop
- Interpreting `results.tsv` to understand what the agent found
- Configuring the system for constrained hardware (limited VRAM)
- Understanding the ratcheting mechanism and git workflow
- Porting to Apple Silicon (MLX) or Windows RTX
## Core Architecture

```
Human authors program.md
            │
            ▼
Agent reads program.md + train.py
            │
            ▼
Agent modifies train.py → git commit
            │
            ▼
uv run train.py (exactly 300 seconds)
            │
            ▼
Extract val_bpb + peak_vram_mb
            │
       ┌────┴────┐
  improved?   no improvement
       │           │
 keep commit   git reset HEAD~1
       │           │
       └─────┬─────┘
             │
    log to results.tsv
             │
             ▼
         repeat ∞
```
### Mutable vs. Immutable Files

| File | Agent access | Purpose |
|------|--------------|---------|
| `train.py` | Read + Write | Model, optimizer, training loop (~630 lines) |
| `program.md` | Read-only | Human research directives |
| `prepare.py` | Read-only | Data pipeline + `evaluate_bpb()` harness |
| `constants.py` | Read-only | `TIME_BUDGET=300`, `MAX_SEQ_LEN`, `EVAL_TOKENS` |
| `pyproject.toml` | Read-only | Locked dependencies (no new packages) |
| `results.tsv` | Append | All experiments: kept and discarded |
## Instructions

### Step 1: Install Prerequisites

```shell
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/karpathy/autoresearch
cd autoresearch

# Install locked dependencies
uv sync
```
### Step 2: Prepare Data (One-Time, ~2 Minutes)

```shell
# Downloads FineWeb-Edu parquet shards, trains BPE tokenizer.
# The last shard is reserved for validation — never seen during training.
uv run prepare.py
```

For constrained hardware, edit `prepare.py` before running. Lower `MAX_SEQ_LEN` for GPUs with limited VRAM:

```python
MAX_SEQ_LEN = 256  # default: 2048
```
### Step 3: Run a Baseline Experiment

```shell
# Single 5-minute experiment to verify setup
uv run train.py > run.log 2>&1

# Extract key metrics (-E enables the | alternation)
grep -E "^(val_bpb|peak_vram_mb):" run.log
```

Expected output:

```
val_bpb: 0.9979
peak_vram_mb: 38420
```
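The same extraction can be done in Python when scripting around the log. A minimal sketch, assuming the `key: value` log lines shown above; `parse_metrics` is a hypothetical helper, not part of the repository:

```python
import re

def parse_metrics(log_text: str) -> dict[str, float]:
    """Pull val_bpb and peak_vram_mb out of a train.py run log."""
    metrics = {}
    for key in ("val_bpb", "peak_vram_mb"):
        # match e.g. "val_bpb: 0.9979" at the start of a line
        m = re.search(rf"^{key}:\s*([\d.]+)", log_text, re.MULTILINE)
        if m:
            metrics[key] = float(m.group(1))
    return metrics
```

A crashed run yields an empty dict, which the loop can treat as a `crash` status.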
### Step 4: Author program.md

`program.md` is the human-written research charter the agent reads at the start of every loop iteration. Write it as precise Markdown instructions:

```markdown
# Research Program

## Goal
Minimize val_bpb on the FineWeb-Edu validation set within the 300-second budget.

## Current Baseline
val_bpb: 0.9979 (depth-12 GPT, Muon + AdamW optimizer)

## Directions to Explore
- Attention variants: MLA, GQA, sliding window, local-global hybrid
- Layer types: MoE FFN layers, SwiGLU activations
- Optimizer tuning: Muon momentum, AdamW β values, learning rate schedule
- Architectural depth/width tradeoffs within VRAM budget

## Constraints
- Must complete within 300 seconds
- Peak VRAM must stay under 39GB
- No new packages (use only what is in pyproject.toml)
- Do not modify prepare.py or constants.py

## Notes from Previous Runs
- Depth-12 improvements transfer to depth-24 (scale-invariant gains)
- RoPE positional encoding outperformed learned embeddings (-0.008 val_bpb)
```
Effective program.md principles:

- Be specific about what to explore — vague directives waste experiments
- Record what has already been tried (prevents redundant experiments)
- Note hardware constraints explicitly
- Use the current best val_bpb as a reference point
### Step 5: Run the Autonomous Agent Loop

Point your AI agent (Claude Code, Codex, etc.) at the repository with `program.md` as its research context. The agent will:

1. Read `program.md` and the current `train.py`
2. Hypothesize an improvement
3. Modify `train.py` and commit
4. Execute `uv run train.py` (300 seconds)
5. Extract `val_bpb`; keep or revert via git
6. Append to `results.tsv`
7. Repeat
With Claude Code (OMC), from inside `autoresearch/`, give Claude the context: "Run the autoresearch loop following program.md"

With the Claude Code CLI directly:

```shell
claude "Follow program.md. Run autonomous research loop on train.py. Execute: uv run train.py, extract val_bpb, keep improvements, revert failures. Log everything to results.tsv. Do not stop until I say so."
```
### Step 6: Monitor Results

```shell
# Live monitoring during a run
watch -n 30 "tail -20 results.tsv"

# Count kept vs. discarded
awk -F'\t' '{print $4}' results.tsv | sort | uniq -c

# Find the best experiment
sort -t$'\t' -k2 -n results.tsv | head -5

# Check current best val_bpb
git log --oneline -5
```
### Step 7: Interpret results.tsv

```
commit   val_bpb  memory_gb  status   description
a3f2c91  0.9697   37.2       keep     SwiGLU activation + depth-12
b8e1d04  0.9821   38.1       discard  MoE 4-expert: marginal gain
c1a5f30  crash    —          crash    OOM: sequence length 4096
```

| Status | Meaning |
|--------|---------|
| `keep` | `val_bpb` improved; commit retained on branch |
| `discard` | No improvement; `git reset HEAD~1` applied |
| `crash` | OOM, syntax error, or timeout; always reverted |
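For quick analysis beyond one-liners, the log can be summarized in a few lines of Python. A sketch assuming the tab-separated five-column layout shown above; `summarize` is a hypothetical helper, not a repository script:

```python
import csv
import io

def summarize(tsv_text: str) -> dict:
    """Status counts and best val_bpb from a results.tsv dump."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts: dict[str, int] = {}
    best = None
    for row in rows:
        counts[row["status"]] = counts.get(row["status"], 0) + 1
        try:
            bpb = float(row["val_bpb"])
        except ValueError:
            continue  # crashed runs have no numeric val_bpb
        if best is None or bpb < best:
            best = bpb
    return {"counts": counts, "best_val_bpb": best}
```

Feeding it the three rows above would report one `keep`, one `discard`, one `crash`, and a best `val_bpb` of 0.9697.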
## Examples

### Example 1: Overnight Run Summary

```
Session summary: 126 experiments, 18 improvements
Best val_bpb: 0.9697 (started: 0.9979)
Top improvements:
- SwiGLU activation: -0.012 val_bpb
- GQA with 4 KV heads: -0.009 val_bpb
- Muon momentum 0.92→0.95: -0.006 val_bpb
```
### Example 2: Low-VRAM Configuration (6GB GPU)

```python
# In prepare.py — edit before `uv run prepare.py`
MAX_SEQ_LEN = 256          # was 2048
EVAL_TOKENS = 2_097_152    # was 20_971_520 (scale down proportionally)
```
### Example 3: Extract Experiments by Category

```shell
# Find all attention-related experiments (-E enables the | alternation)
grep -iE "attention|GQA|MLA|MHA" results.tsv

# List only improvements sorted by gain
awk -F'\t' '$4=="keep"' results.tsv | sort -t$'\t' -k2 -n
```
## Available scripts

Run from inside the autoresearch repository directory:

| Script | Purpose | Usage |
|--------|---------|-------|
| `setup.sh` | One-time environment setup | `bash scripts/setup.sh [--seq-len 512]` |
| `run-experiment.sh` | Single 5-min experiment + metric extraction | `bash scripts/run-experiment.sh` |
| `run-loop.sh` | Autonomous loop: run → keep/revert → repeat | `bash scripts/run-loop.sh [--max 20]` |
| `show-results.sh` | Human-readable results.tsv report | `bash scripts/show-results.sh [--top 10]` |
| `check-hardware.sh` | GPU/CUDA/uv availability check (JSON output) | `bash scripts/check-hardware.sh` |
### Typical overnight session

```shell
bash scripts/check-hardware.sh
bash scripts/setup.sh --seq-len 512   # adjust for your VRAM

# Edit program.md with your research directives

bash scripts/run-loop.sh --max 100 --desc "session-1"
bash scripts/show-results.sh --kept-only
```
## References

Detailed documentation in `references/`:

| File | Contents |
|------|----------|
| `references/architecture.md` | System design, immutability contract, git ratcheting, key design decisions |
| `references/program-md-guide.md` | How to write effective program.md directives; full template + principles |
| `references/hardware-config.md` | VRAM settings by GPU, memory optimization techniques, troubleshooting |
## Best practices

- Write program.md before running — the agent is only as good as its directives; vague programs waste compute
- Start with the baseline first — always run `uv run train.py` manually before launching the loop to confirm the setup works
- Keep MAX_SEQ_LEN in prepare.py consistent — changing it mid-run invalidates val_bpb comparisons
- Never modify prepare.py or constants.py — the evaluation harness must stay fixed for results to be meaningful
- Scale improvements before committing — test that a depth-12 improvement also holds at depth-24 before treating it as a fundamental gain
- Commit program.md updates — version-control your research directives alongside results.tsv for reproducibility
- Monitor VRAM — add peak_vram_mb constraints in program.md for your GPU's headroom
- No new dependencies — the agent cannot `pip install`; it can only use what is in pyproject.toml
## Hardware Requirements

| Hardware | Status | Notes |
|----------|--------|-------|
| H100 80GB | Recommended | Default config, full MAX_SEQ_LEN=2048 |
| A100 40GB | Supported | Lower MAX_SEQ_LEN if needed |
| RTX 4090 24GB | Community | Reduce MAX_SEQ_LEN to 512 |
| GTX 1660 Ti 6GB | Community fork | MAX_SEQ_LEN=256, reduced EVAL_TOKENS |
| Apple Silicon (M-series) | MLX port | Community fork; different optimizer API |
| Windows RTX | Community | WSL2 + CUDA recommended |
## Key Metrics Reference

| Metric | Direction | Description |
|--------|-----------|-------------|
| `val_bpb` | Lower = better | Validation bits-per-byte; vocabulary-size-independent |
| `peak_vram_mb` | Lower = more headroom | Peak GPU memory during the training run |
| Experiments/hour | Higher = faster search | ~12 at TIME_BUDGET=300 |
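The ~12 experiments/hour figure follows directly from the fixed budget: 3600 s / 300 s per run. A sketch of the arithmetic, where `overhead_s` is an assumed per-iteration cost (agent reasoning, git operations, evaluation) and not a repository constant:

```python
TIME_BUDGET = 300  # seconds per training run, from constants.py

def experiments_per_hour(overhead_s: float = 0.0) -> float:
    """Upper bound on search throughput at a fixed per-run budget."""
    return 3600 / (TIME_BUDGET + overhead_s)
```

With zero overhead this gives 12.0 per hour; a 60-second overhead per iteration would drop it to 10.0, so keeping the agent's non-training work short matters for search throughput.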
## External references

- GitHub — karpathy/autoresearch
- nanochat — the underlying LLM training framework
- Karpathy's original announcement (X/Twitter)
- DeepWiki — autoresearch architecture
- MIT License