research-harness

Cognitive discipline for AI-native scientific experimentation. Trigger when setting up controlled experiments with LLM agents, designing reproducible evaluation pipelines, or structuring research workspaces for long-running agent collaboration. Provides guardrails, not recipes — teaches agents how to reason about experiments, not which commands to run.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "research-harness" with this command: npx skills add zhelunsun/agent-research-harness

Version: 1.0.0

When to Use

Trigger this skill when the user:

  • Sets up a new AI-native research experiment repo
  • Designs controlled experiments with LLM agents
  • Needs reproducible evaluation, statistics, and error analysis
  • Wants to structure a research workspace for long-running agent collaboration
  • Wants agent-safe research governance that prevents overclaiming
  • Says anything like: "research harness", "experiment framework", "AI科研" (AI research), "对照实验" (controlled experiment), "评分体系" (scoring system), "可复现性" (reproducibility), "科研workflow" (research workflow), "agent协作科研" (agent-collaborative research), "可控实验" (controllable experiment), "效应量" (effect size)

Core Philosophy

This skill does not prescribe what experiment to run. It prescribes how to think while running it.

Research agents fail not because they lack capability, but because they:

  • Scale before validating the minimum loop
  • Overclaim what the data proves
  • Treat surprising results as methodology failures before checking the execution chain
  • Delete failed runs to make progress look cleaner
  • Change baselines or rubrics silently

The antidote is cognitive discipline — a set of non-negotiable mental habits enforced by repo structure, not by prompt reminders. Detailed reasoning for each discipline is in references/scientific-thinking.md.


Five Cognitive Disciplines

| # | Discipline | Core Question | Deep dive |
|---|------------|---------------|-----------|
| 1 | Minimum Closed Loop Before Scale | Can the smallest version produce distinguishable signals? | references/experiment-design.md |
| 2 | Isolated Variables & Attributable Baselines | Does each group add exactly one variable? | references/experiment-design.md |
| 3 | Dual-Track Validation | Do two independent scoring systems agree? | references/scoring-statistics.md |
| 4 | Effect Size Over Significance | What is the magnitude, not just the p-value? | references/scoring-statistics.md |
| 5 | Pipeline Before Interpretation | Was the execution chain verified before the hypothesis was questioned? | references/scientific-thinking.md |

Disciplines 1-2: experiment design. 3-4: scoring & statistics. 5: critical reasoning.


Five Governance Rules

| # | Rule | Principle |
|---|------|-----------|
| 1 | Human Owns Direction; Agent Owns Execution | Agent cannot change research questions, promote evidence without review, or make academic decisions |
| 2 | Evidence Has Status; AI Output Is Not Fact | All AI-generated evidence starts as candidate; only back-to-source verification promotes it to verified |
| 3 | Failed Runs Are Data, Not Trash | Register every run in the manifest; failures are process evidence against survivorship bias |
| 4 | Protected Surfaces Change Only By Proposal | Baselines, rubrics, raw results, and schema require a version bump plus a documented proposal |
| 5 | Every Handoff Needs an Alignment Doc | A short doc replaces long chat history for agent onboarding |

Details in references/agent-collaboration.md.
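
Rule 2 can be enforced by structure rather than by prompt reminders. The sketch below is a hypothetical illustration (the class, status names, and method are assumptions, not part of the skill): evidence enters as candidate and reaches verified only through an explicit back-to-source check.

```python
from dataclasses import dataclass

# Hypothetical sketch of Rule 2: AI output starts as "candidate" and is
# promoted to "verified" only via an explicit back-to-source check.
ALLOWED = {"candidate": {"verified", "rejected"}, "verified": set(), "rejected": set()}

@dataclass
class Evidence:
    claim: str
    source: str
    status: str = "candidate"

    def promote(self, new_status: str, checked_against_source: bool) -> None:
        # Reject any transition the state machine does not allow.
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        if new_status == "verified" and not checked_against_source:
            raise ValueError("verification requires a back-to-source check")
        self.status = new_status

e = Evidence("Effect replicates at n=50", "experiments/results/wave1.csv")
e.promote("verified", checked_against_source=True)
```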


Phase Workflow

Phase 0 · Scaffold

Goal: Set up the three-layer repo and root entry files.

  • thinking-space/ — research direction, claims, decisions (human)
  • execution-layer/ — briefs, logs, results, drafts (agent)
  • code-workshop/ — runnable artifacts, packages

Root files: AGENTS.md (workspace map), PLAN.md (phase panel), WORKFLOW.md (procedure), harness/README.md (governance).

Directory skeleton and rationale: references/repo-architecture.md.

Phase 1 · Harden

Goal: Make the repo self-checking before formal execution.

  1. Module contracts — Each core module gets a CONTRACT.md (purpose, inputs, outputs, invariants, local validator). Template in references/repo-architecture.md.
  2. Local validators — scripts/validate_<module>.py per module; scripts/validate_repo_state.py as aggregator. Gate rule: 0 FAIL before any formal run.
  3. Experiment manifest — experiments/results/manifest.csv as run-level provenance ledger (run_id, wave, task_id, group, model, version metadata, status, retry_of, git_commit).
  4. Protected surfaces — Baselines, rubrics, raw results, scoring config, schema. Require version bump + proposal to change.
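
As a concrete illustration of steps 2-3, a validate_repo_state.py-style check over the manifest might look like the sketch below. The column names come from the list above; the status vocabulary and function name are assumptions for illustration.

```python
import csv
import io

# Required provenance columns from the manifest spec above; the allowed
# status values are an illustrative assumption.
REQUIRED = ["run_id", "wave", "task_id", "group", "model", "status", "retry_of", "git_commit"]
STATUSES = {"planned", "running", "done", "failed"}

def validate_manifest(text: str) -> list[str]:
    """Return one FAIL message per malformed row; an empty list means the gate may open."""
    rows = csv.DictReader(io.StringIO(text))
    failures = []
    for i, row in enumerate(rows, start=2):  # header is line 1
        # retry_of is legitimately empty for first attempts.
        missing = [c for c in REQUIRED if c != "retry_of" and not row.get(c)]
        if missing:
            failures.append(f"line {i}: missing {missing}")
        if row.get("status") not in STATUSES:
            failures.append(f"line {i}: bad status {row.get('status')!r}")
    return failures
```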

Phase 2 · Design

Goal: Design attributable controlled experiments.

  • Progressive building: minimum artifacts → schema validation → small task set → dry run → scoring → expand. Design details in references/experiment-design.md.
  • Controlled groups: Baseline → incremental treatments. Adjacent groups differ by exactly one variable.
  • Gold checklists: Every task has must_include, forbidden, and scoring_notes.
  • Output contract: Agent output follows a strict schema (YAML/JSON). The scorer and analysis pipeline depend on this contract.
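
The gold checklist and output contract can feed a Track A-style rule-based scorer directly. The sketch below is illustrative only: the field names follow the checklist above, but the matching and pass criteria are assumptions.

```python
# Rule-based (Track A) scoring sketch: check agent output against a task's
# gold checklist. Substring matching here is a simplifying assumption.
def score_output(output: str, gold: dict) -> dict:
    text = output.lower()
    hits = [k for k in gold["must_include"] if k.lower() in text]
    violations = [k for k in gold["forbidden"] if k.lower() in text]
    return {
        "coverage": len(hits) / len(gold["must_include"]) if gold["must_include"] else 1.0,
        "violations": violations,
        "passed": len(hits) == len(gold["must_include"]) and not violations,
    }

gold = {"must_include": ["effect size", "confidence interval"],
        "forbidden": ["proves"],
        "scoring_notes": "magnitude before significance"}
result = score_output("Cohen's d = 0.8, 95% confidence interval [0.3, 1.3]; a large effect size.", gold)
```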

Phase 3 · Execute & Analyze

Goal: Run experiments, score, compute statistics, analyze errors.

Preflight gate: local validators must pass. Then:

  1. Dry run — print prompt, no API call
  2. Smoke run — 1 task × 2 groups, verify output parsing
  3. Wave 1 — small set × all groups, minimum viable data
  4. Scoring: Track A (rule-based) + Track B (semantic) cross-validation. Details in references/scoring-statistics.md.
  5. Statistics: Cohen's d primary, 95% CI, paired t, Wilcoxon. --reproduce flag for one-click reproducibility.
  6. Error analysis: hallucination, output depth, specificity, task appropriateness.

Phase 4 · Handoff & Writing

Goal: Package results for the next phase or agent.

  • Alignment doc: ~1 page with state, entry files, new surfaces, preflight commands, protected surfaces. Never pass chat history.
  • Upstream proposals: Any insight affecting direction goes to sync/upstream_proposals/ first. Template in references/agent-collaboration.md.
  • Writing markers: [REF-MISSING], [CRITICAL-CHECK], [TODO]. Never use AI numbers without verification.
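
A pre-handoff gate for the writing markers takes only a few lines; this sketch is an assumption about tooling (the skill names only the marker strings themselves):

```python
# Phase 4 gate sketch: a draft with unresolved writing markers must not ship.
MARKERS = ["[REF-MISSING]", "[CRITICAL-CHECK]", "[TODO]"]

def unresolved_markers(draft: str) -> dict[str, int]:
    """Count each unresolved marker still present in the draft."""
    counts = {m: draft.count(m) for m in MARKERS}
    return {m: n for m, n in counts.items() if n}

draft = "Effect holds across waves [CRITICAL-CHECK]; see prior work [REF-MISSING]."
```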

Non-Negotiables

  1. No unverified citation becomes a research fact
  2. No debug result becomes a formal result
  3. No agent changes baseline, rubric, or metric definitions without a proposal
  4. No raw result is overwritten
  5. No failed experiment is deleted
  6. No phase gate passes before validators report zero FAIL

References

  • references/repo-architecture.md — three-layer repo, module contracts, manifest, validators
  • references/experiment-design.md — progressive building, controlled groups, gold checklists
  • references/scoring-statistics.md — dual-track validation, effect size, reproducibility
  • references/scientific-thinking.md — cognitive disciplines for agent-led research
  • references/agent-collaboration.md — governance, evidence status, alignment docs

