anydocs - Generic Documentation Indexing & Search
A powerful, reusable skill for indexing and searching ANY documentation site.
What It Does
anydocs solves a real problem: accessing documentation from code or CLI. Instead of opening a browser every time, you can:
- Index any documentation site (Discord, OpenClaw, internal docs, etc.)
- Search instantly from the command line or Python API
- Cache pages locally to avoid repeated network calls
- Configure multiple profiles for different doc sites
When to Use It
Use anydocs when you need to:
- Quickly look up API documentation without leaving the terminal
- Build agents that need to reference docs
- Extract specific information from documentation
- Search across multiple documentation sites
- Integrate docs into your workflow
Key Features
🔍 Multi-Method Search
- Keyword search: Fast, term-based matching with BM25-style scoring
- Hybrid search: Keyword + phrase proximity for better relevance
- Regex search: Advanced pattern matching for power users
🌐 Works with Any Docs Site
- Sitemap-based discovery (standard XML sitemap)
- Fallback crawling from base URL
- HTML content extraction with smart selector detection
- Automatic rate limiting to be respectful
💾 Smart Caching
- Pages cached locally with 7-day TTL (configurable)
- Search indexes cached for instant second searches
- Cache statistics and cleanup commands
- Respects cache invalidation
⚙️ Profile-Based Configuration
- Support multiple doc sites simultaneously
- Per-profile search methods and cache TTLs
- Configuration stored in
~/.anydocs/config.json - Examples for Discord, OpenClaw, and custom sites
🌐 JavaScript Rendering (Optional)
- Uses Playwright to render client-side SPAs (Single Page Apps)
- Automatically discovers links on JS-heavy sites like Discord docs
- Gracefully falls back to standard HTTP if Playwright unavailable
- Configure per-discovery session or globally per profile
Installation
cd /path/to/skills/anydocs
pip install -r requirements.txt
chmod +x anydocs.py
Optional: Browser-based rendering (for JavaScript-heavy sites)
For sites like Discord that use client-side rendering, install Playwright:
pip install playwright==1.40.0
playwright install # Downloads Chromium
If Playwright is unavailable, anydocs gracefully falls back to standard HTTP fetching.
Quick Start
1. Configure a Documentation Site
python anydocs.py config vuejs \
https://vuejs.org \
https://vuejs.org/sitemap.xml
2. Build the Index
python anydocs.py index vuejs
This discovers all pages via sitemap, scrapes content, and builds a searchable index.
3. Search
python anydocs.py search "composition api" --profile vuejs
python anydocs.py search "reactivity" --profile vuejs --limit 5
4. Fetch a Specific Page
python anydocs.py fetch "guide/introduction" --profile vuejs
CLI Commands
Configuration
# Add or update a profile
anydocs config <profile> <base_url> <sitemap_url> [--search-method hybrid] [--ttl-days 7]
# List configured profiles
anydocs list-profiles
Indexing
# Build index for a profile
anydocs index <profile>
# Force re-index (skip cache)
anydocs index <profile> --force
Search
# Basic keyword search
anydocs search "query" --profile discord
# Limit results
anydocs search "query" --profile discord --limit 5
# Regex search
anydocs search "^API" --profile discord --regex
Fetch
# Fetch a specific page (URL or path)
anydocs fetch "https://discord.com/developers/docs/resources/webhook"
anydocs fetch "resources/webhook" --profile discord
Cache Management
# Show cache statistics
anydocs cache status
# Clear all cache
anydocs cache clear
# Clear specific profile's cache
anydocs cache clear --profile discord
Python API
For use in agents and scripts:
from lib.config import ConfigManager
from lib.scraper import DiscoveryEngine
from lib.indexer import SearchIndex
# Load configuration
config_mgr = ConfigManager()
config = config_mgr.get_profile("discord")
# Scrape documentation
scraper = DiscoveryEngine(config["base_url"], config["sitemap_url"])
pages = scraper.fetch_all()
# Build search index
index = SearchIndex()
index.build(pages)
# Search
results = index.search("webhooks", limit=10)
for result in results:
print(f"{result['title']} ({result['relevance_score']})")
print(f" {result['url']}")
Configuration File Format
Configuration is stored in ~/.anydocs/config.json:
{
"discord": {
"name": "discord",
"base_url": "https://discord.com/developers/docs",
"sitemap_url": "https://discord.com/developers/docs/sitemap.xml",
"search_method": "hybrid",
"cache_ttl_days": 7
},
"openclaw": {
"name": "openclaw",
"base_url": "https://docs.openclaw.ai",
"sitemap_url": "https://docs.openclaw.ai/sitemap.xml",
"search_method": "hybrid",
"cache_ttl_days": 7
}
}
Search Methods
Keyword Search
- Speed: Fast
- Best for: Common terms, exact matches
- How it works: Term matching with position weighting (title > tags > content)
- Example:
anydocs search "webhooks"
Hybrid Search (Default)
- Speed: Fast
- Best for: Natural language queries
- How it works: Keyword search + phrase proximity scoring
- Example:
anydocs search "how to set up webhooks"
Regex Search
- Speed: Medium
- Best for: Complex patterns
- How it works: Compiled regex pattern matching across all content
- Example:
anydocs search "^(GET|POST)" --regex
Caching Behavior
- Pages: Cached as JSON with 7-day TTL (configurable)
- Indexes: Cached after indexing, invalidated on TTL expiry
- Cache location:
~/.anydocs/cache/ - Manual refresh: Use
--forceflag or clear cache
Performance Notes
- First index build takes 2-10 minutes depending on site size
- Subsequent searches are instant (cached indexes)
- Rate limit: 0.5s per page to be respectful
- Typical search returns ~100 results in <100ms
Troubleshooting
"No index for 'profile'" error
Run anydocs index <profile> first to build the index.
Sitemap not found
Check the sitemap URL. Falls back to crawling from base_url if unavailable.
Slow indexing
This is normal for large sites. Rate limiting prevents overwhelming servers.
Cache grows too large
Run anydocs cache clear or set --ttl-days to a smaller value.
Examples
Vue.js Framework Docs (SPA Example)
anydocs config vuejs \
https://vuejs.org \
https://vuejs.org/sitemap.xml
anydocs index vuejs
anydocs search "composition api"
Next.js API Docs
anydocs config nextjs \
https://nextjs.org \
https://nextjs.org/sitemap.xml
anydocs index nextjs
anydocs search "app router" --profile nextjs
Internal Company Documentation
anydocs config internal \
https://docs.company.local \
https://docs.company.local/sitemap.xml
anydocs index internal --force
anydocs search "deployment" --profile internal
Architecture
- scraper.py: Discovers URLs via sitemap, fetches and parses HTML
- indexer.py: Builds searchable indexes, implements multiple search strategies
- config.py: Manages configuration profiles
- cache.py: TTL-based file caching for pages and indexes
- cli.py: Click-based command-line interface
Contributing
To add new documentation sites, run:
anydocs config <profile> <base_url> <sitemap_url>
To extend search functionality, modify lib/indexer.py.
License
Part of the OpenClaw system.