autoresearch

Set up and run Karpathy's autoresearch — an autonomous AI research loop that trains a small language model overnight. An AI agent modifies train.py, runs 5-minute experiments, keeps improvements, discards failures, and repeats (~12 experiments/hour, ~100 overnight). Use when the user says "autoresearch", "run autoresearch", "set up autoresearch", or wants to run autonomous ML research experiments while AFK.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install the skill with:

```sh
npx skills add jonmumm/skills/jonmumm-skills-autoresearch
```

Autoresearch

Set up and run Andrej Karpathy's autoresearch — an autonomous AI research loop where an AI agent iterates on a tiny language model's training code overnight, running ~100 experiments while you sleep.

Concept

Human writes program.md (the "research org" instructions)
    ↓
AI agent reads program.md
    ↓
Agent modifies train.py (model, optimizer, hyperparameters)
    ↓
Runs 5-minute training on GPU → measures val_bpb
    ↓
If improved → git commit (keep). If not → git revert (discard).
    ↓
Repeat (~12 experiments/hour, ~100 overnight)

Key insight: You don't touch any Python files. Instead, you program program.md — the Markdown instructions that guide the AI agent. You're programming the research organization, not running individual experiments.
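The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's actual code: `run_training`, `keep`, and `discard` are hypothetical stand-ins for the real run/commit/revert steps.

```python
def experiment_loop(run_training, keep, discard, best_bpb, rounds):
    """Sketch of the commit/revert policy: keep a change only if it
    lowers val_bpb; otherwise roll train.py back to the last commit."""
    for _ in range(rounds):
        # (here the agent would edit train.py with one experimental change)
        score = run_training()      # e.g. `uv run train.py`, then parse val_bpb
        if score < best_bpb:        # lower bits per byte = better model
            keep(score)             # e.g. `git commit -am "..."`
            best_bpb = score
        else:
            discard()               # e.g. `git checkout -- train.py`
    return best_bpb
```

Because only improvements are committed, the git history doubles as a log of the winning experiments.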

Prerequisites

Hardware

| Platform | Requirement |
| --- | --- |
| Mac (recommended for this skill) | Apple Silicon (M1/M2/M3/M4), 16 GB RAM minimum (32 GB+ better) |
| Linux/Windows | NVIDIA GPU (RTX 3060+), CUDA toolkit installed |

Check your Mac chip: Apple menu → About This Mac → look for "Chip: M1/M2/M3/M4". Any Mac bought since late 2020 has Apple Silicon.

Software

| Tool | Purpose | Check |
| --- | --- | --- |
| Git | Experiment tracking (save points) | `git --version` |
| uv | Python + dependency manager | `uv --version` |
| Claude Code (or Cursor/Codex) | The AI agent brain | `claude --version` |

Setup

Step 1: Install prerequisites

```sh
# Install uv (handles Python + all dependencies automatically)
curl -LsSf https://astral.sh/uv/install.sh | sh

# IMPORTANT: Close and reopen your terminal after installing uv

# Verify
uv --version
git --version
```

Step 2: Clone the repo

Mac (Apple Silicon):

```sh
cd ~/Desktop
git clone https://github.com/miolini/autoresearch-macos.git
cd autoresearch-macos
```

Linux/Windows (NVIDIA GPU):

```sh
cd ~/Desktop
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
```

About the Mac fork: Karpathy himself links to miolini/autoresearch-macos from his README. The developer (Artem Andreenko) has 167 public projects on GitHub and a years-long track record. The fork swaps FlashAttention-3 for PyTorch's built-in SDPA and adds Apple Metal/MPS optimizations. The entire codebase is ~630 lines — fully auditable in 20 minutes.

Step 3: Install dependencies and prepare data

```sh
# Install Python + all packages
uv sync

# Download training data + build tokenizer (one-time, ~2 min)
uv run prepare.py

# Run one test training to verify setup (~5 min)
uv run train.py
```

If the test training finishes and shows a val_bpb score — you're ready.

Step 4: Launch the autonomous loop

```sh
# Navigate to the project
cd ~/Desktop/autoresearch-macos  # or autoresearch on Linux

# Launch Claude Code
claude
```

Then type this prompt:

```
Hi have a look at program.md and let's kick off a new experiment! Let's do the setup first.
```

That's it. Minimize the window and go to sleep. The agent will:

  1. Read program.md
  2. Modify train.py with an experimental change
  3. Run a 5-minute training
  4. Check val_bpb — if improved, git commit; if not, git revert
  5. Repeat all night

Pro tip: To make it fully autonomous, tell the agent upfront: "Run fully autonomously. Don't ask for confirmation between experiments. Keep going until I come back."

Project Structure

```
autoresearch/
├── prepare.py      ← Constants, data prep, runtime utilities (DO NOT modify)
├── train.py        ← Model + optimizer + training loop (agent modifies this)
├── program.md      ← Agent instructions (human modifies this)
├── pyproject.toml  ← Dependencies
├── results.tsv     ← Experiment log (score, memory, kept/discarded)
└── analysis.ipynb  ← Graphs showing progress over time
```

The three files that matter

| File | Modified by | Purpose |
| --- | --- | --- |
| `prepare.py` | Nobody | Fixed constants, one-time data prep, runtime utilities |
| `train.py` | AI agent | GPT model, Muon + AdamW optimizer, training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc. |
| `program.md` | Human | Instructions for the AI agent. This is your leverage point — better instructions → faster research progress. |

Key Terminology

| Term | Meaning |
| --- | --- |
| `val_bpb` | Validation bits per byte — the score measuring model quality. Lower = better. Vocab-size-independent, so architectural changes are fairly compared. |
| `train.py` | The single Python file containing all training code. The AI agent modifies only this file during experiments. |
| `program.md` | Your instruction file for the AI agent. The only file you (the human) should edit. Think of it as a mission briefing for your tireless lab assistant. |
| 5-minute budget | Every experiment gets exactly 5 minutes of training time. Makes experiments directly comparable regardless of what the agent changes. ~12 experiments/hour. |
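For intuition, bits per byte is just cross-entropy loss converted from nats to bits and normalized per byte of text. A minimal sketch (the `tokens_per_byte` scaling is an assumption for non-byte-level tokenizers; with a byte-level vocabulary it is 1.0):

```python
import math

def val_bpb(loss_nats_per_token: float, tokens_per_byte: float = 1.0) -> float:
    """Convert mean cross-entropy loss (nats/token) to bits per byte.
    With a byte-level tokenizer, tokens_per_byte is 1.0."""
    return loss_nats_per_token * tokens_per_byte / math.log(2)
```

A loss of ln 2 ≈ 0.693 nats per byte corresponds to exactly 1.0 bpb, which is why baseline scores sit just below 1.0.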

Design Choices

  • Single file to modify. The agent only touches train.py. Keeps scope manageable and diffs reviewable.
  • Fixed time budget. Training always runs for exactly 5 minutes, regardless of platform. This makes experiments directly comparable and means autoresearch finds the best model for your specific hardware.
  • Self-contained. No external dependencies beyond PyTorch. No distributed training, no complex configs. One GPU, one file, one metric.

Tips for Best Results

  1. Start simple. Get one manual uv run train.py working first. If that doesn't work, the autonomous loop won't either.

  2. Your one job is to improve program.md. Add instructions like:

    • "Try small improvements first"
    • "Focus on making val_bpb go down"
    • "Think step by step and explain every change before making it"
    • "If an experiment direction hasn't worked after 3 attempts, try something completely different"
  3. Don't panic when experiments fail. Most will not improve the score. Out of 100 overnight experiments, maybe 10–20 are keepers. This is normal — the agent automatically keeps wins and discards losses.

  4. Check in periodically at first. Watch the first 3–4 experiments to make sure the loop is working before going AFK.

  5. More memory helps. 32 GB+ unified memory on Mac lets the agent explore larger models and more complex architectures.
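Concretely, the instructions from tip 2 might look like this inside program.md (an illustrative excerpt, not the file's actual contents):

```markdown
## Experiment policy
- Try small improvements first; one change per experiment.
- Focus on making val_bpb go down; never touch the 5-minute budget.
- Think step by step and explain every change before making it.
- If a direction hasn't worked after 3 attempts, try something completely different.
```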

Tuning for Smaller Hardware

If running on a Mac with limited memory, Karpathy recommends these adjustments (ask the agent to make them, or edit train.py/prepare.py yourself):

| Setting | Location | Default | Smaller Hardware |
| --- | --- | --- | --- |
| Dataset | `prepare.py` | FineWeb-Edu | Use TinyStories for better results at small scale |
| `vocab_size` | `prepare.py` | 8192 | Try 4096, 2048, 1024, or even 256 (byte-level) |
| `MAX_SEQ_LEN` | `prepare.py` | Large | Lower significantly, even down to 256 |
| `DEVICE_BATCH_SIZE` | `train.py` | Default | Increase slightly as you lower `MAX_SEQ_LEN` |
| `EVAL_TOKENS` | `prepare.py` | Default | Decrease so validation runs faster |
| `DEPTH` | `train.py` | 8 | Lower to 4 for smaller models |
| `WINDOW_PATTERN` | `train.py` | `"SSSL"` | Use just `"L"` (alternating banded attention may be inefficient) |
| `TOTAL_BATCH_SIZE` | `train.py` | Default | Lower to `2**14` (~16K) or smaller |
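Put together, a low-memory configuration from the table might look like this. This is a hypothetical excerpt — the constant names follow the table above, but verify them against your actual copies of prepare.py and train.py before editing:

```python
# prepare.py (hypothetical excerpt for a low-memory Mac)
vocab_size = 2048         # down from 8192; try 256 for byte-level
MAX_SEQ_LEN = 256         # much shorter context window

# train.py (hypothetical excerpt)
DEPTH = 4                 # fewer transformer blocks (default 8)
WINDOW_PATTERN = "L"      # long attention only, instead of "SSSL"
TOTAL_BATCH_SIZE = 2**14  # ~16K tokens per optimizer step
```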

Checking Results

After a night of experiments:

```sh
# See the git history of successful experiments
git log --oneline

# Check the results log
cat results.tsv

# Open the analysis notebook (optional)
# Use Jupyter or Cursor to view analysis.ipynb
```

You'll find:

  • Git history — each commit is a successful experiment that improved val_bpb
  • Lower val_bpb — the model is genuinely smarter (baseline starts around 0.9979)
  • Modified train.py — architecture tweaks, optimizer changes, hyperparameter adjustments
  • results.tsv — every experiment with score, memory usage, and keep/discard status

Alternative Agent Options

| Agent | Cost | Best For |
| --- | --- | --- |
| Claude Code | $20/mo (Pro) or $100/mo (Max) | Full autopilot — runs entirely in Terminal |
| Cursor | Free tier available, $20/mo Pro | Visual learners — AI chat panel + file editor |
| Codex CLI | Varies | Alternative to Claude Code |
| Claude.ai chat | Free/$20/mo | Manual only — copy-paste results back and forth |

Troubleshooting

| Problem | Fix |
| --- | --- |
| `command not found: uv` | Close terminal, open a new one |
| `command not found: git` | Mac: install Xcode CLI tools. Linux: `sudo apt install git` |
| CUDA / GPU error (Linux/Windows) | Search "install CUDA toolkit [your GPU]" |
| MPS / Metal error (Mac) | Make sure you cloned `miolini/autoresearch-macos`, not the original |
| Out of memory | GPU needs more VRAM. The agent usually adapts automatically. See the tuning table above. |
| Claude Code auth error | Requires a paid Claude subscription ($20/mo minimum) |
| Test training works but loop doesn't start | Make sure you're in the right folder when launching `claude`. Be explicit in your prompt. |

References

| Resource | Link |
| --- | --- |
| Original repo (NVIDIA) | karpathy/autoresearch |
| Mac fork (Apple Silicon) | miolini/autoresearch-macos |
| MLX fork (Mac) | trevin-creator/autoresearch-mlx |
| Windows fork (RTX) | jsegov/autoresearch-win-rtx |
| Karpathy's announcement | Tweet |
| Karpathy's update | Tweet |
| TinyStories dataset | HuggingFace |
| uv package manager | astral.sh/uv |

