ARIS — Autonomous ML Research In Sleep
Skill by ara.so — Daily 2026 Skills collection
ARIS (Auto-Research-In-Sleep) turns Claude Code into an autonomous ML research engine. It chains idea discovery → cross-model review loops → paper writing → compiled PDF into hands-off overnight pipelines. Claude Code drives execution while an external model (Codex/GPT-5.4, GLM, DeepSeek, Kimi, etc.) acts as adversarial reviewer — breaking self-play blind spots that single-model review cannot escape.
What It Does
| Workflow | Trigger | What Runs |
|---|---|---|
| Idea Discovery | /idea-discovery | Literature survey → 8–12 ideas → novelty check → pilot GPU runs → ranked report |
| Auto Review Loop | /auto-review-loop | 4-round review/fix cycle, score tracked per round (e.g. 5/10 → 7.5/10) |
| Paper Writing | /paper-writing | Narrative → outline → figures → LaTeX → PDF → 2-round auto-improvement |
| Full Pipeline | /research-pipeline | Chains all three end-to-end from a single prompt |
Installation
# 1. Clone and install skills
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep.git
cp -r Auto-claude-code-research-in-sleep/skills/* ~/.claude/skills/
# 2. Install Codex MCP (cross-model reviewer)
npm install -g @openai/codex
codex setup # set model to gpt-5.4 when prompted
claude mcp add codex -s user -- codex mcp-server
# 3. Verify MCP is connected
claude mcp list # should show "codex" in the list
Codex Model Configuration
The reviewer model is read from ~/.codex/config.toml, not from skill files. Edit directly if needed:
# ~/.codex/config.toml
model = "gpt-5.4" # recommended — most rigorous reviewer
# model = "gpt-5.3-codex"
# model = "gpt-5.2-codex"
# model = "o3"
Core Workflows
Workflow 1 — Idea Discovery
/idea-discovery "factorized gap in discrete diffusion language models"
Be specific — "NLP" produces weak ideas; "factorized gap in discrete diffusion LMs" targets a real research gap.
What runs:
- Multi-source literature search (arXiv, Scholar, Zotero, Obsidian, local PDFs)
- Claude brainstorms 8–12 candidate ideas
- Codex reviewer cross-checks novelty against literature
- Pilot GPU experiments on top candidates
- Ranked idea report saved to
idea_discovery_report.md
Workflow 2 — Auto Review Loop
/auto-review-loop
Run from a directory containing your paper draft or experiment results.
What runs:
- Claude submits current work to Codex reviewer
- Codex returns structured critique with score
/10 - Claude implements fixes (experiments, writing, ablations)
- Repeat up to 4 rounds or until score threshold met
- Score curve saved to
docs/auto_review_score_curve.png
Workflow 3 — Paper Writing
/paper-writing "NARRATIVE_REPORT.md"
Point at a narrative markdown file describing your findings.
What runs:
- Outline generation (sections, figures, tables)
- Figure generation from experiment results
- LaTeX source assembly
pdflatexcompilation- 2-round auto-review-and-improve cycle
- Final PDF + anti-hallucination BibTeX (fetched from DBLP/CrossRef)
Full Pipeline
/research-pipeline "your research direction"
Chains Workflows 1 → 2 → 3 from a single prompt. Wake up to a scored, compiled paper.
Inline Configuration Overrides
Append — key: value to any command:
/research-pipeline "topic" — AUTO_PROCEED: false
/research-pipeline "topic" — human checkpoint: true
/research-pipeline "topic" — arxiv download: true
/research-pipeline "topic" — AUTO_PROCEED: false, human checkpoint: true
| Parameter | Default | Effect |
|---|---|---|
AUTO_PROCEED | true | false = pause at idea selection gate before committing GPU time |
human checkpoint | false | true = pause after each review round for manual feedback |
arxiv download | false | true = download full PDFs during literature survey (vs metadata only) |
DBLP_BIBTEX | true | false = use LLM-generated BibTeX (not recommended — hallucination risk) |
Alternative Model Combinations
No Claude or OpenAI API required — swap any OpenAI-compatible endpoint via the llm-chat MCP server:
# Install the bundled llm-chat MCP server
cd Auto-claude-code-research-in-sleep/mcp-servers/llm-chat
pip install -r requirements.txt
# Configure your provider
export LLM_CHAT_BASE_URL="https://open.bigmodel.cn/api/paas/v4" # GLM-4
export LLM_CHAT_API_KEY="your-key"
export LLM_CHAT_MODEL="glm-4-plus"
# Add to Claude Code
claude mcp add llm-chat -s user -- python server.py
Tested reviewer models:
| Provider | Model | Notes |
|---|---|---|
| OpenAI | gpt-5.4 | Recommended — most rigorous |
| Zhipu AI | glm-4-plus | Strong Chinese-language papers |
| MiniMax | abab6.5s-chat | Fast, cost-effective |
| Moonshot | moonshot-v1-128k | Kimi — long-context papers |
| DeepSeek | deepseek-chat | Code-heavy experiments |
| 01.ai | yi-large | LongCat — long context |
Anti-Hallucination Citations
BibTeX is fetched from real databases by default — no manual flag needed:
# skills/paper-writing/citation_fetcher.py pattern used internally
import requests
def fetch_bibtex_dblp(title: str) -> str | None:
"""Fetch real BibTeX from DBLP by paper title."""
resp = requests.get(
"https://dblp.org/search/publ/api",
params={"q": title, "format": "json", "h": 1}
)
hits = resp.json().get("result", {}).get("hits", {}).get("hit", [])
if not hits:
return None
key = hits[0]["info"].get("key", "")
bib_resp = requests.get(f"https://dblp.org/rec/{key}.bib")
return bib_resp.text if bib_resp.ok else None
def fetch_bibtex_crossref(doi: str) -> str | None:
"""Fallback: fetch BibTeX from CrossRef by DOI."""
resp = requests.get(
f"https://api.crossref.org/works/{doi}/transform/application/x-bibtex"
)
return resp.text if resp.ok else None
Disable with — DBLP_BIBTEX: false if working fully offline.
Optional Integrations
Zotero
# Install Zotero Better BibTeX plugin, then:
export ZOTERO_API_KEY="your-zotero-web-api-key"
export ZOTERO_LIBRARY_ID="your-library-id"
export ZOTERO_LIBRARY_TYPE="user" # or "group"
Literature search will query your Zotero library before hitting arXiv.
Obsidian
export OBSIDIAN_VAULT_PATH="/path/to/your/vault"
Skill will search markdown notes in the vault for related work before external queries.
Feishu / Lark Notifications
export FEISHU_WEBHOOK_URL="https://open.feishu.cn/open-apis/bot/v2/hook/your-token"
export FEISHU_MODE="push" # off | push | interactive
| Mode | Behaviour |
|---|---|
off | No notifications |
push | One-way alerts: review scores, experiment completions, checkpoints |
interactive | Mobile approval buttons at AUTO_PROCEED: false gates |
Directory Layout After a Pipeline Run
your-project/
├── idea_discovery_report.md # ranked ideas with novelty scores
├── NARRATIVE_REPORT.md # auto-generated findings narrative
├── paper/
│ ├── main.tex # assembled LaTeX
│ ├── main.pdf # compiled output
│ ├── figures/ # auto-generated plots
│ └── references.bib # real BibTeX from DBLP/CrossRef
├── experiments/
│ ├── pilot_runs/ # idea-discovery GPU pilots
│ └── review_round_*/ # per-round experiment results
└── docs/
└── auto_review_score_curve.png
Python Integration Pattern
Trigger ARIS workflows programmatically from a Python script (e.g. a cron job or CI step):
import subprocess
import json
from pathlib import Path
def run_aris_pipeline(
research_direction: str,
output_dir: str = ".",
auto_proceed: bool = True,
human_checkpoint: bool = False,
arxiv_download: bool = False,
) -> dict:
"""
Launch ARIS full pipeline via Claude Code CLI.
Returns parsed score progression from the review curve JSON.
"""
overrides = ", ".join([
f"AUTO_PROCEED: {str(auto_proceed).lower()}",
f"human checkpoint: {str(human_checkpoint).lower()}",
f"arxiv download: {str(arxiv_download).lower()}",
])
command = f'/research-pipeline "{research_direction}" — {overrides}'
result = subprocess.run(
["claude", "--print", command],
cwd=output_dir,
capture_output=True,
text=True,
)
if result.returncode != 0:
raise RuntimeError(f"ARIS pipeline failed:\n{result.stderr}")
# Parse score progression if available
score_json = Path(output_dir) / "docs" / "review_scores.json"
if score_json.exists():
return json.loads(score_json.read_text())
return {"stdout": result.stdout}
# Example: nightly research job
if __name__ == "__main__":
scores = run_aris_pipeline(
research_direction="token-level uncertainty calibration in autoregressive LMs",
output_dir="./nightly_research",
auto_proceed=True,
human_checkpoint=False,
)
print(f"Final review score: {scores.get('rounds', [{}])[-1].get('score')}/10")
Skill Composition
ARIS ships 20 composable sub-skills. Chain them manually for custom workflows:
# Literature only
/literature-survey "topic"
# Brainstorm without pilot experiments
/idea-brainstorm "topic" — pilot experiments: false
# Single review round (no loop)
/single-review "path/to/draft.md"
# Proof-writing (community skill)
/proof-writer "theorem statement"
# Write paper from existing narrative, skip review
/paper-writing "NARRATIVE.md" — auto-review: false
Troubleshooting
Codex MCP not found
claude mcp list # verify "codex" appears
codex setup # re-run setup if missing
claude mcp remove codex && \
claude mcp add codex -s user -- codex mcp-server # re-add
Skills not loading in Claude Code
ls ~/.claude/skills/ # verify files copied
# Each skill must be a directory with SKILL.md inside
ls ~/.claude/skills/auto-review-loop/SKILL.md
pdflatex not found during paper writing
# macOS
brew install --cask mactex-no-gui
# Ubuntu/Debian
sudo apt install texlive-full
# Then retry — skill auto-detects pdflatex on PATH
Reviewer returns empty critique
Check ~/.codex/config.toml — ensure model is set and your API key is valid:
codex "say hello" # quick smoke test outside Claude Code
GLM/DeepSeek reviewer not triggering
Verify llm-chat MCP server is listed:
claude mcp list # should show "llm-chat"
echo $LLM_CHAT_BASE_URL # must be set in the shell that launches claude
Score not improving after 4 rounds
- Add
— human checkpoint: trueand inspect each round's critique file inexperiments/review_round_*/ - Consider switching reviewer model — a different architecture surfaces different weaknesses
- Lower-level issues (bad data, flawed baseline) need manual intervention before another loop
Community Skills
| Skill | Description |
|---|---|
proof-writer | Rigorous theorem proof drafting with anti-hallucination citations |
Add your own skill: create skills/your-skill-name/SKILL.md and open a PR.
Cross-Model Review — Why It Works
Claude Code (executor) Codex / external LLM (reviewer)
───────────────────── ───────────────────────────────
Fast, fluid code execution ←→ Deliberate, rigorous critique
Broad context retention Adversarial probing of blind spots
Narrative generation Structural weakness detection
Single-model self-review falls into local minima — the same pattern-matching that generated the work also evaluates it. Cross-model review is adversarial: the reviewer actively probes weaknesses the executor didn't anticipate. The 1→2 model jump produces the largest quality gain; adding more reviewers yields diminishing returns.