swain-search

Collect, normalize, and cache source materials into reusable troves that swain-design artifacts can reference.

Mode detection

Signal	Mode
No trove exists for the topic, or user says "research X" / "gather sources"	Create — new trove
Trove exists and user provides new sources or says "add to" / "extend"	Extend — add sources to existing trove
Trove exists and user says "refresh" or sources are past TTL	Refresh — re-fetch stale sources
User asks "what troves do we have" or "find sources about X"	Discover — search existing troves by tag
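The signal table above can be read as a small decision function. The sketch below is illustrative only: the function name, the boolean parameters, and the precedence between overlapping signals are assumptions, since the table does not say which signal wins when several apply.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    CREATE = "create"
    EXTEND = "extend"
    REFRESH = "refresh"
    DISCOVER = "discover"


def detect_mode(request: str, trove_exists: bool,
                has_new_sources: bool = False,
                has_stale_sources: bool = False) -> Optional[Mode]:
    """Map a user request to a mode. The precedence order is an assumption."""
    text = request.lower()
    # Discovery phrasing applies whether or not a trove already exists.
    if "what troves" in text or "find sources" in text:
        return Mode.DISCOVER
    if trove_exists and "refresh" in text:
        return Mode.REFRESH
    if trove_exists and (has_new_sources or "add to" in text or "extend" in text):
        return Mode.EXTEND
    if not trove_exists or "research" in text or "gather sources" in text:
        return Mode.CREATE
    # No explicit request, but sources are past their TTL.
    if has_stale_sources:
        return Mode.REFRESH
    # Ambiguous: the caller should ask the user.
    return None
```

Returning None for an unmatched request lets the caller ask the user instead of guessing.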

Create mode

Build a new trove from scratch.

Step 1 — Gather inputs

Ask the user (or infer from context) for:

Trove ID — a slug for the topic (e.g., websocket-vs-sse). Suggest one if the context is clear.
Tags — keywords for discovery (e.g., real-time, websocket, sse)
Sources — any combination of:
- Web search queries ("search for WebSocket vs SSE comparisons")
- URLs (web pages, forum threads, docs)
- Video/audio URLs
- Local file paths
Freshness TTL overrides — optional, defaults are fine for most troves

If invoked from swain-design (e.g., spike entering Active), the artifact context provides the topic, tags, and sometimes initial sources.

Step 2 — Collect and normalize

For each source, use the appropriate capability. Read skills/swain-search/references/normalization-formats.md for the exact markdown structure per source type.

Web search queries:

Use a web search capability to find relevant results
Select the top 3-5 most relevant results
For each: fetch the page, normalize to markdown per the web page format
If no web search capability is available, tell the user and skip

Web page URLs:

Fetch the page using a browser or page-fetching capability
Strip boilerplate (nav, ads, sidebars, cookie banners)
Normalize to markdown per the web page format
If fetch fails, record the URL in manifest with a failed: true flag and move on

Video/audio URLs:

Use a media transcription capability to get the transcript
Normalize to markdown per the media format (timestamps, speaker labels, key points)
If no transcription capability is available, tell the user, then either skip the source or accept a pre-made transcript

Local files:

Use a document conversion capability (PDF, DOCX, etc.) or read directly if already markdown
Normalize per the document format
For markdown files: add frontmatter only, preserve content

Forum threads / discussions:

Fetch and normalize per the forum format (chronological, author-attributed)
Flatten nested threads to chronological order with reply-to context
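Flattening a nested thread while keeping reply-to context might look like the sketch below. The Post structure and its field names are hypothetical, and ISO 8601 timestamps in one timezone are assumed so that string order matches time order. The actual output format is the one defined in normalization-formats.md.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    # Hypothetical in-memory shape of a fetched forum post.
    author: str
    posted_at: str  # ISO 8601, same timezone for every post (assumption)
    body: str
    replies: list["Post"] = field(default_factory=list)


def flatten_thread(root: Post) -> list[dict]:
    """Return all posts in chronological order, each with its reply-to author."""
    flat = []
    stack = [(root, None)]
    while stack:
        post, parent = stack.pop()
        flat.append({
            "author": post.author,
            "posted_at": post.posted_at,
            "reply_to": parent.author if parent else None,
            "body": post.body,
        })
        # Remember the parent so nesting survives as reply-to context.
        stack.extend((reply, post) for reply in post.replies)
    return sorted(flat, key=lambda item: item["posted_at"])
```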

Repositories:

Clone or read the repository contents
Mirror the original directory tree under sources/<source-id>/
Default: mirror the full tree. For large repositories (thousands of files), ingest selectively and set selective: true in the manifest entry
Populate the highlights array with paths to the most important files (relative to the source-id directory)

Documentation sites:

Crawl or fetch the documentation site
Mirror the section hierarchy under sources/<source-id>/
Default: mirror the full site. For large sites, ingest selectively and set selective: true
Populate the highlights array with paths to the most important pages
Preserve internal link structure where possible

Each normalized source gets a slug-based source ID and lives in a directory-per-source layout:

Flat sources (web, forum, media, document, local): sources/<source-id>/<source-id>.md
Hierarchical sources (repository, documentation-site): sources/<source-id>/ with the original tree mirrored inside
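As a rough illustration of the two layouts, the helper below resolves where a source lives inside a trove. The function name and the type strings are assumptions taken from the lists above. The docs/troves/ prefix is the one used in the Step 5 report.

```python
from pathlib import PurePosixPath

# Type names copied from the two layout lists above (assumed to be the manifest values).
FLAT_TYPES = {"web", "forum", "media", "document", "local"}
HIERARCHICAL_TYPES = {"repository", "documentation-site"}


def source_location(trove_id: str, source_id: str, source_type: str) -> PurePosixPath:
    """Return the file (flat) or directory (hierarchical) that holds a source."""
    source_dir = PurePosixPath("docs/troves") / trove_id / "sources" / source_id
    if source_type in FLAT_TYPES:
        return source_dir / f"{source_id}.md"
    if source_type in HIERARCHICAL_TYPES:
        return source_dir
    raise ValueError(f"unknown source type: {source_type}")
```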

Source ID generation:

Derive the source ID as a slug from the source title or URL (e.g., mdn-websocket-api, strangeloop-2025-realtime)
When a slug collides with an existing source ID: append __word1-word2 using two random words from skills/swain-search/references/wordlist.txt
If the wordlist is missing, append __ followed by 4 hex characters (e.g., __a3f8) as a fallback
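The ID rules above could be sketched as follows. The helper names are hypothetical, the wordlist is assumed to hold one word per line, and the "source" fallback for an empty slug is an addition of this sketch, not part of the rules.

```python
import re
import secrets
from pathlib import Path

WORDLIST = Path("skills/swain-search/references/wordlist.txt")


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "source"  # fallback for empty slugs (assumption)


def unique_source_id(title: str, existing: set[str], wordlist: Path = WORDLIST) -> str:
    """Return a slug, adding a __suffix only when the slug is already taken."""
    base = slugify(title)
    if base not in existing:
        return base
    words = []
    if wordlist.is_file():
        # Assumes one word per line.
        words = [w.strip() for w in wordlist.read_text().splitlines() if w.strip()]
    while True:
        if len(words) >= 2:
            suffix = f"{secrets.choice(words)}-{secrets.choice(words)}"
        else:
            suffix = secrets.token_hex(2)  # 4 hex characters, e.g. a3f8
        candidate = f"{base}__{suffix}"
        if candidate not in existing:
            return candidate
```

Retrying until the candidate is unused covers the unlikely case where the suffixed ID collides as well.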

Step 3 — Generate manifest

Create manifest.yaml following the schema in skills/swain-search/references/manifest-schema.md. Include:

Trove metadata (id, created date, tags)
Default freshness TTL per source type
One entry per source with provenance (URL/path, fetch date, content hash, type)

Compute content hashes as bare hex SHA-256 digests (no prefix) of the normalized markdown content:

shasum -a 256 sources/mdn-websocket-api/mdn-websocket-api.md | cut -d' ' -f1
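Where shelling out is inconvenient, the same bare hex digest can be computed with Python's standard library. The function names here are illustrative, and UTF-8 is assumed for in-memory text.

```python
import hashlib
from pathlib import Path


def hash_text(markdown: str) -> str:
    """Bare hex SHA-256 of normalized markdown held in memory (UTF-8 assumed)."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def hash_file(path: Path) -> str:
    """Bare hex SHA-256 of a file's bytes, matching the shasum output's first field."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()
```

Hashing the file bytes rather than decoded text keeps the result identical to the shasum command above.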

Step 4 — Generate synthesis

Create synthesis.md — a structured distillation of key findings across all sources.

Structure the synthesis by theme, not by source. Group related findings together, cite sources by ID, and surface:

Key findings — what the sources collectively say about the topic
Points of agreement — where sources converge
Points of disagreement — where sources conflict or present alternatives
Gaps — what the sources don't cover that might matter

Keep it concise. The synthesis is a starting point, not a comprehensive report — the user or artifact author will refine it.

Step 5 — Report

Tell the user what was created:

Trove <trove-id> created with N sources.

docs/troves/<trove-id>/manifest.yaml — provenance and metadata

docs/troves/<trove-id>/sources/ — N normalized source files

docs/troves/<trove-id>/synthesis.md — thematic distillation

Reference from artifacts with: trove: <trove-id>@<commit-hash>

Extend mode

Add new sources to an existing trove.

Read the existing manifest.yaml
Collect and normalize new sources (same as Create step 2)
Assign slug-based source IDs to new sources (following the same ID generation rules)
Append new entries to manifest.yaml
Update the refreshed date in the manifest
Regenerate synthesis.md incorporating all sources (old + new)
Report what was added

Refresh mode

Re-fetch stale sources and update changed content.

Read manifest.yaml
For each source, check whether its fetched date plus its freshness-ttl is now in the past
For stale sources:
- Re-fetch the raw content
- Re-normalize to markdown
- Compute new content hash
- If hash changed: replace the source file, update manifest entry
- If hash unchanged: update only fetched date
Update refreshed date in manifest
If any content changed, regenerate synthesis.md
Report: "Refreshed N sources. M had changed content, K were unchanged."

Skip sources marked freshness-ttl: never during refresh.
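A minimal sketch of the staleness check and the hash comparison follows. The real TTL format and field names are defined in manifest-schema.md. This example assumes the TTL is a whole number of days or the string "never", and that an entry is a dict with hypothetical hash and fetched keys.

```python
from datetime import date, timedelta
from typing import Union


def is_stale(fetched: date, ttl: Union[int, str], today: date) -> bool:
    """True once fetched date + TTL has elapsed. TTL in days is an assumption."""
    if ttl == "never":
        return False
    return fetched + timedelta(days=ttl) <= today


def apply_refresh(entry: dict, new_hash: str, today: date) -> bool:
    """Update a manifest entry after a re-fetch and report whether content changed."""
    changed = entry["hash"] != new_hash
    if changed:
        entry["hash"] = new_hash  # the caller also replaces the source file
    entry["fetched"] = today  # updated in both branches
    return changed
```

Counting the True and False results of apply_refresh gives the M and K values for the report line.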

Discover mode

Help the user find existing troves relevant to their topic.

Scan docs/troves/*/manifest.yaml for all troves
Match against the user's query by:
- Tag match — trove tags contain query keywords
- Title match — trove ID slug contains query keywords
For each match, show: trove ID, tags, source count, last refreshed date, referenced-by list
If no matches, suggest creating a new trove
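The matching step might be sketched as below, working on manifests that have already been parsed into dicts. The id and tags keys follow the trove metadata listed in Step 3, but their exact names are an assumption. Matching on whole tokens instead of substrings is a design choice of this sketch.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, so real-time yields real and time."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_troves(manifests: list[dict], query: str) -> list[dict]:
    """Return manifests whose tags or ID slug share a token with the query."""
    keywords = tokens(query)
    matches = []
    for manifest in manifests:
        tag_tokens = set()
        for tag in manifest.get("tags", []):
            tag_tokens |= tokens(tag)
        id_tokens = tokens(manifest["id"])
        if keywords & (tag_tokens | id_tokens):
            matches.append(manifest)
    return matches
```

An empty result is the cue to suggest creating a new trove.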

Graceful degradation

The skill refers to capabilities generically and does not name specific tools. When a capability isn't available:

Capability	Fallback
Web search	Skip search-based sources. Tell user: "No web search capability available — provide URLs directly or add a search MCP."
Browser / page fetcher	Try basic URL fetch. If that fails: "Can't fetch this URL — paste the content or provide a local file."
Media transcription	"No transcription capability available — provide a pre-made transcript file, or add a media conversion tool."
Document conversion	"Can't convert this file type — provide a markdown version, or add a document conversion tool."

Never fail the entire run because one capability is missing. Collect what you can, skip what you can't, and report clearly.
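One way to honor this rule is a collection loop that records a skip reason and carries on. The sketch below is illustrative: the handlers mapping, the type key on each source, and the return shape are all hypothetical.

```python
from typing import Callable


def collect_all(sources: list[dict],
                handlers: dict[str, Callable[[dict], str]]) -> tuple[list[str], list[tuple[dict, str]]]:
    """Collect what is possible and record a reason for every skipped source."""
    collected = []
    skipped = []
    for source in sources:
        handler = handlers.get(source["type"])
        if handler is None:
            # Missing capability: skip this source, keep the run going.
            skipped.append((source, "no capability available"))
            continue
        try:
            collected.append(handler(source))
        except Exception as exc:
            # One failed source must not abort the rest.
            skipped.append((source, str(exc)))
    return collected, skipped
```

The skipped list carries enough detail to report clearly which sources were missed and why.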

Capability detection

Before collecting sources, check what's available. Look for tools matching these patterns — the exact tool names vary by installation:

Web search: tools with "search" in the name (e.g., brave_web_search, bing-search-to-markdown)
Page fetching: tools with "fetch", "webpage", "browser" in the name (e.g., fetch_content, webpage-to-markdown, browser_navigate)
Media transcription: tools with "audio", "video", "youtube" in the name (e.g., audio-to-markdown, youtube-to-markdown)
Document conversion: tools with "pdf", "docx", "pptx", "xlsx" in the name (e.g., pdf-to-markdown, docx-to-markdown)

Report available capabilities at the start of collection so the user knows what will and won't work.
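The name patterns above lend themselves to a simple substring scan. In this sketch the capability keys and the function name are assumptions. How the list of installed tool names is obtained depends on the installation and is not shown.

```python
# Substring patterns copied from the list above.
CAPABILITY_PATTERNS = {
    "web_search": ("search",),
    "page_fetching": ("fetch", "webpage", "browser"),
    "media_transcription": ("audio", "video", "youtube"),
    "document_conversion": ("pdf", "docx", "pptx", "xlsx"),
}


def detect_capabilities(tool_names: list[str]) -> dict[str, list[str]]:
    """Group tool names by the capability whose pattern appears in the name."""
    found = {capability: [] for capability in CAPABILITY_PATTERNS}
    for name in tool_names:
        lowered = name.lower()
        for capability, patterns in CAPABILITY_PATTERNS.items():
            if any(pattern in lowered for pattern in patterns):
                found[capability].append(name)
    return found
```

A capability with an empty list is one to report as unavailable before collection starts. Note that a name containing two patterns is listed under both capabilities.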

Linking from artifacts

Artifacts reference troves in frontmatter:

trove: websocket-vs-sse@abc1234

The format is <trove-id>@<commit-hash>. The commit hash pins the trove to a specific version — troves evolve over time as sources are added or refreshed, and the hash ensures reproducibility.
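Parsing and validating a reference could look like the sketch below. The slug pattern and the 7 to 40 hex character range for the commit hash (an abbreviated or full Git SHA-1) are assumptions. The document only fixes the <trove-id>@<commit-hash> shape.

```python
import re
from typing import NamedTuple


class TroveRef(NamedTuple):
    trove_id: str
    commit: str


# Hyphenated lowercase slug, then @, then 7 to 40 hex characters (assumption).
_REF = re.compile(r"^(?P<id>[a-z0-9]+(?:-[a-z0-9]+)*)@(?P<commit>[0-9a-f]{7,40})$")


def parse_trove_ref(value: str) -> TroveRef:
    """Split a frontmatter trove value into its ID and pinned commit hash."""
    match = _REF.match(value.strip())
    if match is None:
        raise ValueError(f"not a valid trove reference: {value!r}")
    return TroveRef(match["id"], match["commit"])
```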

When creating or extending a trove, remind the user to commit and then update the referencing artifact's frontmatter with the new commit hash.
