Research Archival

Scrape AI research conversations (ChatGPT, Gemini, Claude) and web pages, archive them as markdown files with YAML frontmatter, and create cross-referenced GitHub Issues — with mandatory identity verification at every step.

FIRST - TodoWrite Task Templates

MANDATORY: Select and load the appropriate template before any archival work.

Template A - Full Archival (scrape + save + issue)

Identity preflight — verify GH_ACCOUNT or resolve via curl /user
Scrape URL — route to Firecrawl or Jina per url-routing.md
Save to file — YYYY-MM-DD-{slug}-{source_type}.md with frontmatter
Survey labels — gh label list, reuse existing, max 3-6
Create GitHub Issue — use --body with heredoc or --body-file
Update frontmatter — add github_issue_url and github_issue_number
Post canonical backlink comment on Issue

Template B - Save Only (no issue)

Identity preflight (still required for consistency)
Scrape URL — route to Firecrawl or Jina per url-routing.md
Save to file — YYYY-MM-DD-{slug}-{source_type}.md with frontmatter

Template C - Issue Only (file already exists)

Identity preflight
Read existing file frontmatter
Survey labels — gh label list, reuse existing, max 3-6
Create GitHub Issue — use --body with heredoc or --body-file
Update file frontmatter with issue cross-reference
Post canonical backlink comment on Issue

Identity Preflight (MANDATORY — Step 0)

MUST execute before any gh write command. Non-negotiable.

The gh-repo-identity-guard.mjs PreToolUse hook provides a safety net, but this skill performs its own check as defense-in-depth.

Resolution Order

Fast-path — GH_ACCOUNT env var (set by mise per-directory)
Token filename — scan ~/.claude/.secrets/gh-token-* for single base match
API call — curl -sH "Authorization: token $GH_TOKEN" https://api.github.com/user
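Step 2 of the resolution order is not spelled out elsewhere in this document, so the following is only a minimal sketch. It assumes tokens are stored as `gh-token-<account>` and that "single match" means exactly one such file exists. The function name `resolve_account_from_tokens` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of resolution step 2 (assumption: files are named gh-token-<account>).
# Prints the account name only when exactly one token file is present.
resolve_account_from_tokens() {
  local dir="${1:-$HOME/.claude/.secrets}"
  local count=0 match="" f name
  for f in "$dir"/gh-token-*; do
    [ -e "$f" ] || continue        # unmatched glob stays literal; skip it
    count=$((count + 1))
    match="$f"
  done
  # Zero or several tokens is ambiguous: refuse to guess.
  [ "$count" -eq 1 ] || return 1
  name=$(basename "$match")
  printf '%s\n' "${name#gh-token-}"
}
```

When the function returns non-zero, the caller would fall through to the API call in step 3.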

Verification

/usr/bin/env bash << 'IDENTITY_EOF'
# Resolve authenticated user
if [ -n "$GH_ACCOUNT" ]; then
  AUTH_USER="$GH_ACCOUNT"
  AUTH_SOURCE="GH_ACCOUNT"
else
  # The API pretty-prints its JSON, so allow optional spaces after the colon
  AUTH_USER=$(curl -sf --max-time 5 -H "Authorization: token $GH_TOKEN" \
    https://api.github.com/user 2>/dev/null | grep -o '"login": *"[^"]*"' | cut -d'"' -f4)
  AUTH_SOURCE="API /user"
fi

# Resolve target repo owner
REPO_OWNER=$(git remote get-url origin 2>/dev/null | sed -n 's|.*github\.com[:/]\([^/]*\)/.*|\1|p')

echo "Authenticated as: $AUTH_USER (via $AUTH_SOURCE)"
echo "Target repo owner: $REPO_OWNER"

# An empty AUTH_USER must never count as a match (both sides could be empty)
if [ -z "$AUTH_USER" ] || [ "$AUTH_USER" != "$REPO_OWNER" ]; then
  echo ""
  echo "MISMATCH - do NOT proceed with gh write commands"
  # The escaped \$ prints the command instead of executing it and leaking the token
  echo "Fix: export GH_TOKEN=\$(cat ~/.claude/.secrets/gh-token-$REPO_OWNER)"
  exit 1
fi
echo "Identity verified - safe to proceed"
IDENTITY_EOF

BLOCK if mismatch — display diagnostic and do NOT continue to any gh write operation.

Scraping Workflow

Route scrape requests based on URL pattern. See url-routing.md for full details.

Decision Tree

URL contains chatgpt.com/share/
  → Jina Reader (https://r.jina.ai/{URL})
  → Use curl (not WebFetch, which summarizes instead of returning raw content)

URL contains gemini.google.com/share/
  → Firecrawl (JS-heavy SPA)
  → Preflight: ping -c1 -W2 172.25.236.1

URL contains claude.ai/artifacts/ or is a static web page
  → Jina Reader (https://r.jina.ai/{URL})
  → Use WebFetch or curl
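The decision tree above can be sketched as a single `case` statement. The function name `route_url` and the labels it prints (`jina-curl`, `firecrawl`, `jina`) are illustrative assumptions, not names defined in url-routing.md.

```shell
#!/usr/bin/env bash
# Sketch: map a URL to the scraper named in the decision tree.
# Output labels are hypothetical; adapt them to whatever the caller expects.
route_url() {
  case "$1" in
    *chatgpt.com/share/*)       echo "jina-curl" ;;   # Jina Reader, fetched with curl only
    *gemini.google.com/share/*) echo "firecrawl" ;;   # JS-heavy SPA, needs Firecrawl
    *)                          echo "jina" ;;        # claude.ai artifacts and static pages
  esac
}

route_url "https://gemini.google.com/share/abc123"   # prints: firecrawl
```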

Firecrawl Scrape (with Health Check + Auto-Revival)

CRITICAL: Firecrawl containers can show "Up" in docker ps while internal processes are dead (RAM/CPU overload crashes the worker inside the container). Always perform a deep health check before scraping.

/usr/bin/env bash << 'SCRAPE_EOF'
set -euo pipefail

# URL and SLUG must be exported by the caller (set -u aborts on unset variables)
: "${URL:?set URL to the page to scrape}"
: "${SLUG:?set SLUG to the archive slug}"

# POST a real scrape request and print only the HTTP status ("000" if unreachable).
# No -f here: with -f, curl exits non-zero on 5xx and the status could not be compared.
firecrawl_http_code() {
  local code
  code=$(ssh littleblack 'curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
    -X POST http://localhost:3002/v1/scrape \
    -H "Content-Type: application/json" \
    -d "{\"url\":\"https://example.com\",\"formats\":[\"markdown\"]}"' 2>/dev/null) || true
  echo "${code:-000}"
}

# Step 1: Check ZeroTier connectivity
if ! ping -c1 -W2 172.25.236.1 >/dev/null 2>&1; then
  echo "ERROR: Firecrawl host unreachable. Check ZeroTier: zerotier-cli status"
  exit 1
fi

# Step 2: Deep health check: test the actual API response, not just container status
# Port 3003 (wrapper) may accept TCP but return empty if the Firecrawl API (3002) is dead inside
HTTP_CODE=$(firecrawl_http_code)

if [ "$HTTP_CODE" = "000" ] || [ "$HTTP_CODE" = "502" ] || [ "$HTTP_CODE" = "503" ]; then
  echo "WARNING: Firecrawl API unhealthy (HTTP $HTTP_CODE). Attempting revival..."

  # Step 2a: Check docker logs for WORKER STALLED (RAM/CPU overload)
  ssh littleblack 'docker logs firecrawl-api-1 --tail 20 2>&1 | grep -iE "stalled|error|exit" || true'

  # Step 2b: Restart the critical containers
  ssh littleblack 'docker restart firecrawl-api-1 firecrawl-playwright-service-1' 2>/dev/null
  echo "Containers restarted. Waiting 20s for API to initialize..."
  sleep 20

  # Step 2c: Verify recovery by re-running the health check
  HTTP_CODE=$(firecrawl_http_code)
  if [ "$HTTP_CODE" = "000" ] || [ "$HTTP_CODE" = "502" ] || [ "$HTTP_CODE" = "503" ]; then
    echo "ERROR: Firecrawl still unhealthy after restart (HTTP $HTTP_CODE)."
    echo "Manual intervention needed. Try: ssh littleblack 'cd ~/firecrawl && docker compose up -d --force-recreate'"
    echo "Falling back to Jina Reader: https://r.jina.ai/${URL}"
    exit 1
  fi
  echo "Firecrawl recovered successfully."
fi

# Step 3: Scrape via wrapper (|| true lets the empty-content check report a curl failure)
CONTENT=$(curl -s --max-time 120 "http://172.25.236.1:3003/scrape?url=${URL}&name=${SLUG}" || true)

if [ -z "$CONTENT" ]; then
  echo "ERROR: Scrape returned empty. Try Jina fallback: https://r.jina.ai/${URL}"
  exit 1
fi

echo "$CONTENT"
SCRAPE_EOF

Known Failure Mode: Container "Up" But Processes Dead

Symptom: docker ps shows containers with status "Up 4 days" but curl localhost:3002 returns connection reset.

Root cause: The Firecrawl worker exhausts RAM/CPU (observed: cpuUsage=0.998, memoryUsage=0.858). The internal Node.js processes exit, but the Docker container stays alive because the entrypoint shell is still running.

Diagnosis:

ssh littleblack 'docker logs firecrawl-api-1 --tail 50 2>&1 | grep -E "STALLED|cpuUsage|exit"'

Look for: WORKER STALLED {"cpuUsage":0.998,"memoryUsage":0.858}

Fix: Use docker restart rather than docker compose restart, which may require permissions on the compose directory:

ssh littleblack 'docker restart firecrawl-api-1 firecrawl-playwright-service-1'
sleep 20  # Wait for API initialization

Verify:

ssh littleblack 'curl -s -o /dev/null -w "%{http_code}" http://localhost:3002/v1/scrape'
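As an alternative to the fixed 20-second wait, the verification can be polled. The helper below is a generic sketch; the name `wait_until` and the attempt counts are assumptions rather than part of the original workflow.

```shell
#!/usr/bin/env bash
# Sketch: run a command up to N times, sleeping between attempts.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0                      # command succeeded: stop polling
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"                # no sleep after the final attempt
    fi
    i=$((i + 1))
  done
  return 1                          # never succeeded
}

# Hypothetical usage (not executed here; requires SSH access to the host):
#   api_reachable() {
#     [ "$(ssh littleblack 'curl -s -o /dev/null -w "%{http_code}" --max-time 10 http://localhost:3002/v1/scrape')" != "000" ]
#   }
#   wait_until 10 3 api_reachable || echo "Firecrawl did not come back"
```

Polling returns as soon as the API answers, and it still gives up after a bounded number of attempts.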

File Saving

Naming Convention

YYYY-MM-DD-{slug}-{source_type}.md

slug: kebab-case summary (max 50 chars)
source_type: one of chatgpt, gemini, claude, web

Default location: docs/research/ in the current project.
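The naming convention can be sketched as two small helpers. The names `slugify` and `archive_filename` are hypothetical, and the exact slug rules (lowercase ASCII, hyphen-separated, trimmed to 50 characters) are an assumption about what "kebab-case summary" means here.

```shell
#!/usr/bin/env bash
# Sketch: build YYYY-MM-DD-{slug}-{source_type}.md from a title.
slugify() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//' \
    | cut -c1-50 \
    | sed -E 's/-+$//'              # truncation may leave a trailing hyphen
}

archive_filename() {
  local date="$1" title="$2" source_type="$3"
  case "$source_type" in
    chatgpt|gemini|claude|web) ;;   # enum from the naming convention
    *) echo "unknown source_type: $source_type" >&2; return 1 ;;
  esac
  printf '%s-%s-%s.md\n' "$date" "$(slugify "$title")" "$source_type"
}

archive_filename "2026-02-09" "Attention Is All You Need: A Review!" chatgpt
# prints: 2026-02-09-attention-is-all-you-need-a-review-chatgpt.md
```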

YAML Frontmatter

See frontmatter-schema.md for the full field contract.

source_url: https://chatgpt.com/share/...
source_type: chatgpt-share
scraped_at: "2026-02-09T18:30:00Z"
model_name: gpt-4o
custom_gpt_name: Cosmo
claude_code_uuid: SESSION_UUID
github_issue_url: ""
github_issue_number: ""

Leave github_issue_url and github_issue_number empty — update after Issue creation.

GitHub Issue Creation

Label Survey

Survey existing labels first — reuse preferred, create only when concept is genuinely novel.

gh label list --repo owner/repo --limit 100

Policy: Max 3-6 labels per issue. Common labels: research, ai-output, chatgpt, gemini, archival.

Create Issue

Use --body with heredoc for inline composition, or --body-file for very large content.

/usr/bin/env bash << 'ISSUE_EOF'
# Write body to temp file
cat > "/tmp/issue-body-${SLUG}.md" << 'BODY_EOF'
## Summary

Brief description of the archived research content.

## Source

- URL: SOURCE_URL
- Type: source_type
- Model: model_name
- Scraped: scraped_at

## Key Findings

- Finding 1
- Finding 2

## Archived File

docs/research/FILENAME.md
BODY_EOF

# Create issue
gh issue create \
  --repo owner/repo \
  --title "Research: descriptive title here" \
  --body-file "/tmp/issue-body-${SLUG}.md" \
  --label "research,ai-output"

# Clean up
rm -f "/tmp/issue-body-${SLUG}.md"
ISSUE_EOF

Update Frontmatter

After issue creation, update the archived file's frontmatter with the issue URL and number.
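One way to do this update is a small `sed` rewrite limited to the frontmatter block. This is a sketch under two assumptions: the file starts with a `---` line and the frontmatter ends at the next `---`, and the issue number is written as a bare integer (check frontmatter-schema.md for the actual contract). The function name `update_issue_frontmatter` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: fill in github_issue_url and github_issue_number inside the frontmatter only.
# Assumes the URL contains no "|" or "&" (true for GitHub issue URLs).
update_issue_frontmatter() {
  local file="$1" url="$2" number="$3" tmp
  tmp=$(mktemp)
  # The range 1,/^---$/ spans the opening delimiter through the closing one,
  # so identical keys later in the document body are left untouched.
  sed -E \
    -e "1,/^---\$/s|^github_issue_url:.*|github_issue_url: \"${url}\"|" \
    -e "1,/^---\$/s|^github_issue_number:.*|github_issue_number: ${number}|" \
    "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Hypothetical usage:
#   update_issue_frontmatter docs/research/FILENAME.md \
#     "https://github.com/owner/repo/issues/42" 42
```

Writing through a temp file avoids the `sed -i` flag, whose syntax differs between GNU and BSD sed.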

Canonical Backlink Comment

Post a comment on the Issue linking back to the archived file:

Archived: docs/research/YYYY-MM-DD-slug-source_type.md
Scraped: 2026-02-09T18:30:00Z
Source: chatgpt-share
Session: SESSION_UUID
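A minimal sketch of posting that comment follows. The helper name `backlink_body` and the variables in the usage note are hypothetical; `gh issue comment` with `--body-file -` reads the body from standard input.

```shell
#!/usr/bin/env bash
# Sketch: render the four-line canonical backlink comment.
backlink_body() {
  local file="$1" scraped_at="$2" source_type="$3" session="$4"
  printf 'Archived: %s\nScraped: %s\nSource: %s\nSession: %s\n' \
    "$file" "$scraped_at" "$source_type" "$session"
}

# Hypothetical usage (not executed here; requires an authenticated gh
# and must only run after the identity preflight has passed):
#   backlink_body "docs/research/FILENAME.md" "$SCRAPED_AT" "chatgpt-share" "$SESSION_UUID" \
#     | gh issue comment "$ISSUE_NUMBER" --repo owner/repo --body-file -
```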

Post-Change Checklist

After modifying THIS skill:

YAML frontmatter valid (no colons in description)
Trigger keywords current in description
All ./references/ links resolve
Identity preflight section remains FIRST in workflow
Append changes to evolution-log.md
Validate: uv run plugins/plugin-dev/scripts/skill-creator/quick_validate.py plugins/gh-tools/skills/research-archival
Validate links: bun run plugins/plugin-dev/scripts/validate-links.ts plugins/gh-tools/skills/research-archival

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| Wrong account posting | GH_TOKEN mismatch | Check mise env \| grep GH_TOKEN, verify GH_ACCOUNT |
| Body exceeds 65536 chars | GitHub API limit | Split across issue body + first comment |
| Firecrawl unreachable | ZeroTier down | ping 172.25.236.1, check zerotier-cli status |
| Firecrawl "Up" but dead | Container alive, processes crashed | docker restart firecrawl-api-1 firecrawl-playwright-service-1, wait 20s |
| Firecrawl WORKER STALLED | RAM/CPU overload (>85% mem) | Same as above; check docker logs firecrawl-api-1 --tail 50 |
| Scrape returns empty | JS-heavy page timeout | Increase Firecrawl timeout, try Jina fallback |
| Jina returns login page shell | Gemini login wall (not rendered) | Must use Firecrawl for gemini.google.com/share/* URLs |
| mise parse error | Stale .mise.toml syntax | Run mise doctor, check [hooks.enter] syntax |
| Identity guard blocks | Non-owner account | export GH_TOKEN=$(cat ~/.claude/.secrets/gh-token-OWNER) |
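For the oversized-body case, one possible way to split the content is sketched below. The function name `split_issue_body` and the `.part1`/`.part2` suffixes are assumptions. The split is by bytes, which stays under the 65536-character limit but can cut a multi-byte character in half, so check the boundary for non-ASCII content.

```shell
#!/usr/bin/env bash
# Sketch: split a body file into an issue body (part1) and a first comment (part2).
split_issue_body() {
  local src="$1" limit="${2:-65536}"
  head -c "$limit" "$src" > "${src}.part1"            # first <limit> bytes
  tail -c "+$((limit + 1))" "$src" > "${src}.part2"   # everything after them
}

# Hypothetical usage (not executed here):
#   split_issue_body "/tmp/issue-body-${SLUG}.md"
#   gh issue create --repo owner/repo --title "Research: ..." \
#     --body-file "/tmp/issue-body-${SLUG}.md.part1"
#   gh issue comment "$ISSUE_NUMBER" --repo owner/repo \
#     --body-file "/tmp/issue-body-${SLUG}.md.part2"
```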

References

Frontmatter Schema — YAML field contract
URL Routing — Scraper routing table
Evolution Log — Change history
