research

Deep multi-source research with confidence scoring. Auto-classifies complexity. Use for technical investigation and fact-checking. NOT for code review or simple Q&A.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "research" with this command: npx skills add wyattowalsh/agents/wyattowalsh-agents-research

Deep Research

General-purpose deep research with multi-source synthesis, confidence scoring, and anti-hallucination verification. Adopts SOTA patterns from OpenAI Deep Research (multi-agent triage pipeline), Google Gemini Deep Research (user-reviewable plans), STORM (perspective-guided conversations), Perplexity (source confidence ratings), and LangChain ODR (supervisor-researcher with reflection).

Vocabulary

| Term | Definition |
| --- | --- |
| query | The user's research question or topic; the unit of investigation |
| claim | A discrete assertion to be verified; extracted from sources or user input |
| source | A specific origin of information: URL, document, database record, or API response |
| evidence | A source-backed datum supporting or contradicting a claim; always has provenance |
| provenance | The chain from evidence to source: tool used, URL, access timestamp, excerpt |
| confidence | Score 0.0-1.0 per claim; based on evidence strength and cross-validation |
| cross-validation | Verifying a claim across 2+ independent sources; the core anti-hallucination mechanism |
| triangulation | Confirming a finding using 3+ methodologically diverse sources |
| contradiction | When two credible sources assert incompatible claims; must be surfaced explicitly |
| synthesis | The final research product: not a summary but a novel integration of evidence with analysis |
| journal | The saved markdown record of a research session, stored in ~/.claude/research/ |
| sweep | Wave 1: broad parallel search across multiple tools and sources |
| deep dive | Wave 2: targeted follow-up on specific leads from the sweep |
| lead | A promising source or thread identified during the sweep, warranting deeper investigation |
| tier | Complexity classification: Quick (0-2), Standard (3-5), Deep (6-8), Exhaustive (9-10) |
| finding | A verified claim with evidence chain, confidence score, and provenance; the atomic unit of output |
| gap | An identified area where evidence is insufficient, contradictory, or absent |
| bias marker | An explicit flag on a finding indicating potential bias (recency, authority, LLM prior, etc.) |
| degraded mode | Operation when research tools are unavailable; confidence ceilings applied |

Dispatch

| $ARGUMENTS | Action |
| --- | --- |
| Question or topic text (has verb or ?) | Investigate — classify complexity, execute wave pipeline |
| Vague input (<5 words, no verb, no ?) | Intake — ask 2-3 clarifying questions, then classify |
| `check <claim>` or `verify <claim>` | Fact-check — verify claim against 3+ search engines |
| `compare <A> vs <B> [vs <C>...]` | Compare — structured comparison with decision matrix output |
| `survey <field or topic>` | Survey — landscape mapping, annotated bibliography |
| `track <topic>` | Track — load prior journal, search for updates since last session |
| `resume [number or keyword]` | Resume — resume a saved research session |
| `list [active\|domain\|tier]` | List — show journal metadata table |
| `archive` | Archive — move journals older than 90 days |
| `delete <N>` | Delete — delete journal N with confirmation |
| `export [N]` | Export — render HTML dashboard for journal N (default: current) |
| Empty | Gallery — show topic examples + "ask me anything" prompt |

Auto-Detection Heuristic

If no mode keyword matches:

  1. Ends with ? or starts with question word (who/what/when/where/why/how/is/are/can/does/should/will) → Investigate
  2. Contains vs, versus, compared to, or between noun phrases → Compare
  3. Declarative statement with factual claim, no question syntax → Fact-check
  4. Broad field name with no specific question → ask: "Investigate a specific question, or survey the entire field?"
  5. Ambiguous → ask: "Would you like me to investigate this question, verify this claim, or survey this field?"
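The ordered checks above can be expressed as a small fallback dispatcher. This is an illustrative sketch, not the skill's implementation: rule 3 (detecting a declarative factual claim) needs more than string matching, so the sketch falls through to asking the user, and the mode names and function name are assumptions.

```python
import re

QUESTION_WORDS = {"who", "what", "when", "where", "why", "how",
                  "is", "are", "can", "does", "should", "will"}

def detect_mode(args: str) -> str:
    """Apply the heuristic rules in order; earlier rules win."""
    text = args.strip()
    words = text.lower().split()
    if not words:
        return "gallery"
    # Rule 1: question syntax -> Investigate
    if text.endswith("?") or words[0] in QUESTION_WORDS:
        return "investigate"
    # Rule 2: comparison markers -> Compare
    if re.search(r"\b(vs\.?|versus|compared to|between)\b", text, re.IGNORECASE):
        return "compare"
    # Vague input (<5 words, no question syntax) -> Intake
    if len(words) < 5:
        return "intake"
    # Rules 3-5 need real claim detection; defer to the user
    return "ask-user"
```

Only the fallback path is modeled here; explicit mode keywords (`check`, `survey`, etc.) are matched by the Dispatch table before this heuristic runs.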

Gallery (Empty Arguments)

Present research examples spanning domains:

| # | Domain | Example | Likely Tier |
| --- | --- | --- | --- |
| 1 | Technology | "What are the current best practices for LLM agent architectures?" | Deep |
| 2 | Academic | "What is the state of evidence on intermittent fasting for longevity?" | Standard |
| 3 | Market | "How does the competitive landscape for vector databases compare?" | Deep |
| 4 | Fact-check | "Is it true that 90% of startups fail within the first year?" | Standard |
| 5 | Architecture | "When should you choose event sourcing over CRUD?" | Standard |
| 6 | Trends | "What emerging programming languages gained traction in 2025-2026?" | Standard |

Pick a number, paste your own question, or type guide me.

Skill Awareness

Before starting research, check if another skill is a better fit:

| Signal | Redirect |
| --- | --- |
| Code review, PR review, diff analysis | Suggest /honest-review |
| Strategic decision with adversaries, game theory | Suggest /wargame |
| Multi-perspective expert debate | Suggest /host-panel |
| Prompt optimization, model-specific prompting | Suggest /prompt-engineer |

If the user confirms they want general research, proceed.

Complexity Classification

Score the query on 5 dimensions (0-2 each, total 0-10):

| Dimension | 0 | 1 | 2 |
| --- | --- | --- | --- |
| Scope breadth | Single fact/definition | Multi-faceted, 2-3 domains | Cross-disciplinary, 4+ domains |
| Source difficulty | Top search results suffice | Specialized databases or multiple source types | Paywalled, fragmented, or conflicting sources |
| Temporal sensitivity | Stable/historical | Evolving field (months matter) | Fast-moving (days/weeks matter), active controversy |
| Verification complexity | Easily verifiable (official docs) | 2-3 independent sources needed | Contested claims, expert disagreement, no consensus |
| Synthesis demand | Answer is a fact or list | Compare/contrast viewpoints | Novel integration of conflicting threads |

| Total | Tier | Strategy |
| --- | --- | --- |
| 0-2 | Quick | Inline, 1-2 searches, fire-and-forget |
| 3-5 | Standard | Subagent wave, 3-5 parallel searchers, report delivered |
| 6-8 | Deep | Agent team (TeamCreate), 3-5 teammates, interactive session |
| 9-10 | Exhaustive | Agent team, 4-6 teammates + nested subagent waves, interactive |

Present the scoring to the user. User can override tier with --depth <tier>.
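Under this rubric, classification reduces to summing the five 0-2 dimension scores and bucketing the total. A minimal sketch (the function and key names are illustrative):

```python
def classify_tier(scores: dict[str, int]) -> tuple[int, str]:
    """Sum the five 0-2 dimension scores and map the 0-10 total to a tier."""
    total = sum(scores.values())
    if total <= 2:
        tier = "Quick"
    elif total <= 5:
        tier = "Standard"
    elif total <= 8:
        tier = "Deep"
    else:
        tier = "Exhaustive"
    return total, tier

classify_tier({
    "scope_breadth": 1, "source_difficulty": 1, "temporal_sensitivity": 2,
    "verification_complexity": 1, "synthesis_demand": 1,
})  # → (6, "Deep")
```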

Wave Pipeline

All non-Quick research follows this 5-wave pipeline. Quick merges Waves 0+1+4 inline.

Wave 0: Triage (always inline, never parallelized)

  1. Run `!uv run python skills/research/scripts/research-scanner.py "$ARGUMENTS"` for a deterministic pre-scan
  2. Decompose query into 2-5 sub-questions
  3. Score complexity on the 5-dimension rubric
  4. Check tool availability — probe key MCP tools; set degraded mode flags and confidence ceilings per references/source-selection.md
  5. Select tools per domain signals — read references/source-selection.md
  6. Check for existing journals — if track or resume, load prior state
  7. Present triage to user — show: complexity score, sub-questions, planned strategy, estimated tier. User may override.

Wave 1: Broad Sweep (parallel)

Scale by tier:

Quick (inline): 1-2 tool calls sequentially. No subagents.

Standard (subagent wave): Dispatch 3-5 parallel subagents via Task tool:

```
Subagent A → brave-search + duckduckgo-search for sub-question 1
Subagent B → exa + g-search for sub-question 2
Subagent C → context7 / deepwiki / arxiv / semantic-scholar for technical specifics
Subagent D → wikipedia / wikidata for factual grounding
[Subagent E → PubMed / openalex if academic domain detected]
```

Deep (agent team): TeamCreate "research-{slug}":

```
Lead: triage (Wave 0), orchestrate, judge/reconcile (Wave 3), synthesize (Wave 4)
  |-- web-researcher:       brave-search, duckduckgo-search, exa, g-search
  |-- tech-researcher:      context7, deepwiki, arxiv, semantic-scholar, package-version
  |-- content-extractor:    fetcher, trafilatura, docling, wikipedia, wayback
  |-- [academic-researcher: arxiv, semantic-scholar, openalex, crossref, PubMed]
  |-- [adversarial-reviewer: devil's advocate — counter-search all emerging findings]
```

Spawn academic-researcher if domain signals include academic/scientific. Spawn adversarial-reviewer for Exhaustive tier or if verification complexity >= 2.

Exhaustive: Deep team + each teammate runs nested subagent waves internally.

Each subagent/teammate returns structured findings:

```json
{
  "sub_question": "...",
  "findings": [{"claim": "...", "source_url": "...", "source_tool": "...", "excerpt": "...", "confidence_raw": 0.6}],
  "leads": ["url1", "url2"],
  "gaps": ["could not find data on X"]
}
```

Wave 1.5: Perspective Expansion (Deep/Exhaustive only)

STORM-style perspective-guided conversation. Spawn 2-4 perspective subagents:

| Perspective | Focus | Question Style |
| --- | --- | --- |
| Skeptic | What could be wrong? What's missing? | "What evidence would disprove this?" |
| Domain Expert | Technical depth, nuance, edge cases | "What do practitioners actually encounter?" |
| Practitioner | Real-world applicability, trade-offs | "What matters when you actually build this?" |
| Theorist | First principles, abstractions, frameworks | "What underlying model explains this?" |

Each perspective agent reviews Wave 1 findings and generates 2-3 additional sub-questions from their viewpoint. These sub-questions feed into Wave 2.

Wave 2: Deep Dive (parallel, targeted)

  1. Rank leads from Wave 1 by potential value (citation frequency, source authority, relevance)
  2. Dispatch deep-read subagents — use fetcher/trafilatura/docling to extract full content from top leads
  3. Follow citation chains — if a source cites another, fetch the original
  4. Fill gaps — for each gap identified in Wave 1, dispatch targeted searches
  5. Use thinking MCPs:
    • cascade-thinking for multi-perspective analysis of complex findings
    • structured-thinking for tracking evidence chains and contradictions
    • think-strategies for complex question decomposition (Standard+ only)

Wave 3: Cross-Validation (parallel)

The anti-hallucination wave. Read references/confidence-rubric.md and references/self-verification.md.

For every claim surviving Waves 1-2:

  1. Independence check — are supporting sources truly independent? Sources citing each other are NOT independent.
  2. Counter-search — explicitly search for evidence AGAINST each major claim using a different search engine
  3. Freshness check — verify sources are current (flag if >1 year old for time-sensitive topics)
  4. Contradiction scan — read references/contradiction-protocol.md, identify and classify disagreements
  5. Confidence scoring — assign 0.0-1.0 per references/confidence-rubric.md
  6. Bias sweep — check each finding against 10 bias categories (7 core + 3 LLM-specific) per references/bias-detection.md

Self-Verification (3+ findings survive): Spawn devil's advocate subagent per references/self-verification.md:

For each finding, attempt to disprove it. Search for counterarguments. Check if evidence is outdated. Verify claims actually follow from cited evidence. Flag LLM confabulations.

Adjust confidence: Survives +0.05, Weakened -0.10, Disproven set to 0.0. Adjustments are subject to hard caps — single-source claims remain capped at 0.60 even after survival adjustment.
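The adjustment-plus-hard-cap rule can be expressed as one function. A sketch, with verdict labels taken from the text above and the function name assumed:

```python
def adjust_confidence(score: float, verdict: str, n_sources: int) -> float:
    """Apply devil's-advocate adjustments, then re-apply the hard caps."""
    delta = {"survives": 0.05, "weakened": -0.10, "disproven": None}[verdict]
    if delta is None:
        return 0.0                 # disproven: confidence set to zero
    score = max(0.0, min(1.0, score + delta))
    if n_sources < 2:
        score = min(score, 0.60)   # single-source cap survives adjustment
    return score
```

For example, a single-source claim at 0.58 that survives the devil's advocate is capped back to 0.60, not raised to 0.63.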

Wave 4: Synthesis (always inline, lead only)

Produce the final research product. Read references/output-formats.md for templates.

The synthesis is NOT a summary. It must:

  1. Answer directly — answer the user's question clearly
  2. Map evidence — all verified findings with confidence and citations
  3. Surface contradictions — where sources disagree, with analysis of why
  4. Show confidence landscape — what is known confidently, what is uncertain, what is unknown
  5. Audit biases — biases detected during research
  6. Identify gaps — what evidence is missing, what further research would help
  7. Distill takeaways — 3-7 numbered key findings
  8. Cite sources — full bibliography with provenance

Output format adapts to mode:

  • Investigate → Research Brief (Standard) or Deep Report (Deep/Exhaustive)
  • Fact-check → Quick Answer with verdict + evidence
  • Compare → Decision Matrix
  • Survey → Annotated Bibliography
  • User can override with --format brief|deep|bib|matrix

Confidence Scoring

| Score | Basis |
| --- | --- |
| 0.9-1.0 | Official docs + 2 independent sources agree, no contradictions |
| 0.7-0.8 | 2+ independent sources agree, minor qualifications |
| 0.5-0.6 | Single authoritative source, or 2 sources with partial agreement |
| 0.3-0.4 | Single non-authoritative source, or conflicting evidence |
| 0.2-0.3 | Multiple non-authoritative sources with partial agreement, or single source with significant caveats |
| 0.1-0.2 | LLM reasoning only, no external evidence found |
| 0.0 | Actively contradicted by evidence |

Hard rules:

  • No claim reported at >= 0.7 unless supported by 2+ independent sources
  • Single-source claims cap at 0.6 regardless of source authority
  • Degraded mode (all research tools unavailable): max confidence 0.4, all findings labeled "unverified"

Merged confidence (for claims supported by multiple sources): c_merged = 1 - (1-c1)(1-c2)...(1-cN) capped at 0.99
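The merged-confidence formula is a noisy-OR combination of the per-source scores: each additional independent source shrinks the remaining doubt multiplicatively. As a sketch:

```python
def merged_confidence(scores: list[float]) -> float:
    """c_merged = 1 - (1-c1)(1-c2)...(1-cN), capped at 0.99."""
    remaining_doubt = 1.0
    for c in scores:
        remaining_doubt *= (1.0 - c)
    return min(1.0 - remaining_doubt, 0.99)

merged_confidence([0.6, 0.5])  # 1 - 0.4 * 0.5 ≈ 0.80
```

The 0.99 cap keeps even heavily corroborated claims short of certainty.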

Evidence Chain Structure

Every finding carries this structure:

```
FINDING RR-{seq:03d}: [claim statement]
  CONFIDENCE: [0.0-1.0]
  EVIDENCE:
    1. [source_tool] [url] [access_timestamp] — [relevant excerpt, max 100 words]
    2. [source_tool] [url] [access_timestamp] — [relevant excerpt, max 100 words]
  CROSS-VALIDATION: [agrees|contradicts|partial] across [N] independent sources
  BIAS MARKERS: [none | list of detected biases with category]
  GAPS: [none | what additional evidence would strengthen this finding]
```

Use `!uv run python skills/research/scripts/finding-formatter.py --format markdown` to normalize.
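finding-formatter.py is the canonical normalizer; a minimal illustration of rendering the template above (the function signature and evidence field names are assumptions, not the script's API):

```python
def format_finding(seq: int, claim: str, confidence: float,
                   evidence: list[dict]) -> str:
    """Render one finding in the evidence-chain template."""
    lines = [
        f"FINDING RR-{seq:03d}: {claim}",
        f"  CONFIDENCE: {confidence:.2f}",
        "  EVIDENCE:",
    ]
    for i, ev in enumerate(evidence, 1):
        lines.append(
            f"    {i}. [{ev['tool']}] [{ev['url']}] [{ev['ts']}] — {ev['excerpt']}"
        )
    return "\n".join(lines)
```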

Source Selection

Read references/source-selection.md during Wave 0 for the full tool-to-domain mapping. Summary:

| Domain Signal | Primary Tools | Secondary Tools |
| --- | --- | --- |
| Library/API docs | context7, deepwiki, package-version | brave-search |
| Academic/scientific | arxiv, semantic-scholar, PubMed, openalex | crossref, brave-search |
| Current events/trends | brave-search, exa, duckduckgo-search, g-search | fetcher, trafilatura |
| GitHub repos/OSS | deepwiki, repomix | brave-search |
| General knowledge | wikipedia, wikidata, brave-search | fetcher |
| Historical content | wayback, brave-search | fetcher |
| Fact-checking | 3+ search engines mandatory | wikidata for structured claims |
| PDF/document analysis | docling | trafilatura |

Multi-engine protocol: For any claim requiring verification, use a minimum of 2 different search engines. Different engines have different indices and biases; agreement across engines increases confidence.

Bias Detection

Check every finding against 10 bias categories. Read references/bias-detection.md for full detection signals and mitigation strategies.

| Bias | Detection Signal | Mitigation |
| --- | --- | --- |
| LLM prior | Matches common training patterns, lacks fresh evidence | Flag; require fresh source confirmation |
| Recency | Overweighting recent results, ignoring historical context | Search for historical perspective |
| Authority | Uncritically accepting prestigious sources | Cross-validate even authoritative claims |
| Confirmation | Queries constructed to confirm initial hypothesis | Use neutral queries; search for counterarguments |
| Survivorship | Only finding successful examples | Search for failures/counterexamples |
| Selection | Search engine bubble, English-only | Use multiple engines; note coverage limitations |
| Anchoring | First source disproportionately shapes interpretation | Document first source separately; seek contrast |

State Management

  • Journal path: ~/.claude/research/
  • Archive path: ~/.claude/research/archive/
  • Filename convention: {YYYY-MM-DD}-{domain}-{slug}.md
    • {domain}: tech, academic, market, policy, factcheck, compare, survey, track, general
    • {slug}: 3-5 word semantic summary, kebab-case
    • Collision: append -v2, -v3
  • Format: YAML frontmatter + markdown body + <!-- STATE --> blocks

Save protocol:

  • Quick: save once at end with status: Complete
  • Standard/Deep/Exhaustive: save after Wave 1 with status: In Progress, update after each wave, finalize after synthesis

Resume protocol:

  1. resume (no args): find status: In Progress journals. One → auto-resume. Multiple → show list.
  2. resume N: Nth journal from list output (reverse chronological).
  3. resume keyword: search frontmatter query and domain_tags for match.

Use `!uv run python skills/research/scripts/journal-store.py` for all journal operations.

State snapshot (appended after each wave save):

```
<!-- STATE
wave_completed: 2
findings_count: 12
leads_pending: ["url1", "url2"]
gaps: ["topic X needs more sources"]
contradictions: 1
next_action: "Wave 3: cross-validate top 8 findings"
-->
```
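On resume, the snapshot can be recovered by scanning for the last STATE block in the journal. A sketch that keeps all values as raw strings (parsing lists or ints is left to the caller):

```python
import re

def parse_state(journal_text: str) -> dict:
    """Extract the last <!-- STATE ... --> block as key/value string pairs."""
    blocks = re.findall(r"<!-- STATE\n(.*?)\n-->", journal_text, re.DOTALL)
    if not blocks:
        return {}
    state = {}
    for line in blocks[-1].splitlines():
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state
```

Taking the last block means a journal saved after every wave always resumes from the most recent snapshot.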

In-Session Commands (Deep/Exhaustive)

Available during active research sessions:

| Command | Effect |
| --- | --- |
| `drill <finding #>` | Deep dive into a specific finding with more sources |
| `pivot <new angle>` | Redirect research to a new sub-question |
| `counter <finding #>` | Explicitly search for evidence against a finding |
| `export` | Render HTML dashboard |
| `status` | Show current research state without advancing |
| `sources` | List all sources consulted so far |
| `confidence` | Show confidence distribution across findings |
| `gaps` | List identified knowledge gaps |
| `?` | Show command menu |

Read references/session-commands.md for full protocols.

Reference File Index

| File | Content | Read When |
| --- | --- | --- |
| references/source-selection.md | Tool-to-domain mapping, multi-engine protocol, degraded mode | Wave 0 (selecting tools) |
| references/confidence-rubric.md | Scoring rubric, cross-validation rules, independence checks | Wave 3 (assigning confidence) |
| references/evidence-chain.md | Finding template, provenance format, citation standards | Any wave (structuring evidence) |
| references/bias-detection.md | 10 bias categories (7 core + 3 LLM-specific), detection signals, mitigation strategies | Wave 3 (bias audit) |
| references/contradiction-protocol.md | 4 contradiction types, resolution framework | Wave 3 (contradiction detection) |
| references/self-verification.md | Devil's advocate protocol, hallucination detection | Wave 3 (self-verification) |
| references/output-formats.md | Templates for all 5 output formats | Wave 4 (formatting output) |
| references/team-templates.md | Team archetypes, subagent prompts, perspective agents | Wave 0 (designing team) |
| references/session-commands.md | In-session command protocols | When user issues in-session command |
| references/dashboard-schema.md | JSON data contract for HTML dashboard | export command |

Loading rule: Load ONE reference at a time per the "Read When" column. Do not preload.

Critical Rules

  1. No claim >= 0.7 unless supported by 2+ independent sources — single-source claims cap at 0.6
  2. Never fabricate citations — if URL, author, title, or date cannot be verified, use vague attribution ("a study in this tradition") rather than inventing specifics
  3. Always surface contradictions explicitly — never silently resolve disagreements; present both sides with evidence
  4. Always present triage scoring before executing research — user must see and can override complexity tier
  5. Save journal after every wave in Deep/Exhaustive mode — enables resume after interruption
  6. Never skip Wave 3 (cross-validation) for Standard/Deep/Exhaustive tiers — this is the anti-hallucination mechanism
  7. Multi-engine search is mandatory for fact-checking — use minimum 2 different search tools (e.g., brave-search + duckduckgo-search)
  8. Apply the Accounting Rule after every parallel dispatch — N dispatched = N accounted for before proceeding to next wave
  9. Distinguish facts from interpretations in all output — factual claims carry evidence; interpretive claims are explicitly labeled as analysis
  10. Flag all LLM-prior findings — claims matching common training data but lacking fresh evidence must be flagged with bias marker
  11. Max confidence 0.4 in degraded mode — when all research tools are unavailable, report all findings as "unverified — based on training knowledge"
  12. Load ONE reference file at a time — do not preload all references into context
  13. Track mode must load prior journal before searching — avoid re-researching what is already known
  14. The synthesis is not a summary — it must integrate findings into novel analysis, identify patterns across sources, and surface emergent insights not present in any single source
  15. PreToolUse Edit hook is non-negotiable — the research skill never modifies source files; it only creates/updates journals in ~/.claude/research/

