RLAMA - Local RAG System

RLAMA (Retrieval-Augmented Language Model Adapter) provides fully local, offline RAG for semantic search over your documents.

When to Use This Skill

Building knowledge bases from local documents
Searching personal notes, research papers, or code documentation
Document-based Q&A without sending data to the cloud
Indexing project documentation for quick semantic lookup
Creating searchable archives of PDFs, markdown, or code files

Prerequisites

RLAMA requires Ollama running locally:

Verify Ollama is running

ollama list

If not running, start it

brew services start ollama # macOS

or: ollama serve
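The same check can be scripted. The sketch below is a minimal example, not part of the skill: it assumes Ollama's standard HTTP API on port 11434 and its model-listing endpoint `/api/tags`. The helper names `tags_url` and `ollama_models` are hypothetical.

```python
import json
import urllib.error
import urllib.request


def tags_url(host: str) -> str:
    """Build the URL of Ollama's model-listing endpoint from a host string."""
    # OLLAMA_HOST is sometimes set without a scheme (e.g. "localhost:11434").
    if not host.startswith(("http://", "https://")):
        host = "http://" + host
    return host.rstrip("/") + "/api/tags"


def ollama_models(host: str = "http://localhost:11434", timeout: float = 3.0):
    """Return the installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(tags_url(host), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    return [m.get("name", "") for m in payload.get("models", [])]
```

Calling `ollama_models()` returns `None` when nothing is listening, which is the cue to start Ollama with one of the commands above.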

Quick Reference

Query a RAG (Default: Retrieve-Only)

Always use retrieve-only mode by default. Claude synthesizes far better answers than local 7B models. The raw chunks give Claude direct evidence to reason over and cite.

DEFAULT: Retrieve top 10 chunks — Claude reads and synthesizes

python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query"

More chunks for broad queries

python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" -k 20

JSON output for programmatic use

python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" --json

Force rebuild embedding cache

python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" --rebuild-cache

List RAGs with cache status

python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py --list

First run per collection builds an embedding cache (~60s for 4K chunks). Subsequent queries are <1s.
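To illustrate why cached queries are fast, here is a minimal sketch of top-k retrieval over precomputed embeddings. It assumes cosine similarity as the ranking metric, which this document does not confirm for `rlama_retrieve.py`. The names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, cache, k=10):
    """Rank cached (chunk_text, embedding) pairs against the query vector.

    Assumption: the cache is a plain list of pairs held in memory.
    """
    scored = [(cosine(query_vec, emb), text) for text, emb in cache]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Once the embeddings exist, each query costs one embedding call plus a linear scan, which is consistent with the sub-second timing quoted above.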

Local LLM Query (Fallback Only)

Use rlama run only when Claude is not in the loop (e.g., standalone CLI usage, cron jobs, scripts):

Local model generates the answer (weaker than Claude synthesis)

rlama run <rag-name> --query "your question here"

With more context chunks

rlama run <rag-name> --query "explain the authentication flow" --context-size 30

Show source documents

rlama run <rag-name> --query "what are the API endpoints?" --show-context

Script wrapper for cleaner output:

python3 ~/.claude/skills/rlama/scripts/rlama_query.py <rag-name> "your query"
python3 ~/.claude/skills/rlama/scripts/rlama_query.py my-docs "what is the main idea?" --show-sources

External LLM Synthesis (optional—retrieve chunks AND synthesize via OpenRouter, TogetherAI, Ollama, or any OpenAI-compatible endpoint):

Synthesize via OpenRouter (auto-detected from model with /)

python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" --synthesize --synth-model anthropic/claude-sonnet-4

Synthesize via TogetherAI

python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" --synthesize --provider togetherai

Synthesize via local Ollama (fully offline, uses research-grade system prompt)

python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" --synthesize --provider ollama

Synthesize via custom endpoint

python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" --synthesize --endpoint https://my-api.com/v1/chat/completions

Environment variables for synthesis:

| Variable | Provider |
|---|---|
| OPENROUTER_API_KEY | OpenRouter (default, auto-detected first) |
| TOGETHER_API_KEY | TogetherAI |
| SYNTH_API_KEY | Custom endpoint (via --endpoint) |
| (none needed) | Ollama (local, no auth) |

Provider auto-detection: model names with / → OpenRouter, otherwise → TogetherAI. Falls back to whichever API key is set.
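The detection rule can be sketched as follows. This is one reading of the rule above, not the script's actual code: an explicit provider wins, then the model name decides, then the first API key found is used. The function name `detect_provider`, the provider strings, and the `None` return when nothing matches are all assumptions.

```python
import os

# Order in which API keys are checked when no model name is given
# (OpenRouter is described above as "auto-detected first").
KEY_ORDER = [
    ("openrouter", "OPENROUTER_API_KEY"),
    ("togetherai", "TOGETHER_API_KEY"),
]


def detect_provider(model=None, provider=None, env=None):
    """Pick a synthesis provider; returns None if nothing can be inferred."""
    env = os.environ if env is None else env
    if provider:                      # explicit --provider always wins
        return provider
    if model:                         # "vendor/model" style names imply OpenRouter
        return "openrouter" if "/" in model else "togetherai"
    for name, key in KEY_ORDER:       # fall back to whichever key is set
        if env.get(key):
            return name
    return None
```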

Quality tiers:

| Tier | Method | Quality | Latency | Default? |
|---|---|---|---|---|
| Best | Retrieve-only → Claude synthesizes | Strongest synthesis | ~1s retrieve | YES |
| Good | --synthesize --synth-model anthropic/claude-sonnet-4 | Strong, cited | ~3s | |
| Decent | --synthesize --provider togetherai (Llama 70B) | Solid for factual | ~2s | |
| Reasoning | --synthesize --reasoning (Qwen 3.5 9B) | Strong local, cited | ~8s | |
| Local | --synthesize --provider ollama (Qwen 2.5 7B) | Basic, may hedge | ~5s | |
| Baseline | rlama_query.py (RLAMA built-in) | Weakest, no prompt control | ~3s | |

Small local models (7B) use a tuned prompt optimized for Qwen (structured output, anti-hedge, domain-keyword aware). Cloud providers use a strict research-grade prompt with mandatory citations. Reasoning mode (--reasoning) uses qwen3.5:9b with the strict prompt and 4096 max tokens; it is the best local option for complex cross-document synthesis.

First run builds an embedding cache (~30s for 3K chunks, ~10min for 25K chunks). Subsequent queries are <1s. Large RAGs use incremental checkpointing—if Ollama crashes mid-build, re-run to resume from the last checkpoint. Individual chunks are truncated to 5K chars to stay within nomic-embed-text's context window.
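The checkpoint-and-resume behaviour can be pictured with a short sketch. It is illustrative only: the checkpoint file layout, the `every` interval, and the `embed` callable are assumptions, and only the 5K-character truncation comes from the text above.

```python
import json
import os

MAX_CHARS = 5000  # per-chunk truncation limit stated above


def build_cache(chunks, embed, checkpoint_path, every=100):
    """Embed chunks, saving progress periodically so a crash can be resumed.

    `embed` is any callable mapping text to a list of floats (hypothetical).
    """
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as fh:
            done = json.load(fh)          # resume from the last checkpoint

    def save():
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(done, fh)
        os.replace(tmp, checkpoint_path)  # atomic swap avoids half-written files

    for i, text in enumerate(chunks):
        key = str(i)
        if key in done:
            continue                      # already embedded in a previous run
        done[key] = embed(text[:MAX_CHARS])
        if len(done) % every == 0:
            save()
    save()
    return [done[str(i)] for i in range(len(chunks))]
```

Re-running `build_cache` with the same checkpoint path skips every chunk that was already embedded, so a crash mid-build costs at most `every` embeddings.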

Benchmarking:

Retrieval quality only

python3 ~/.claude/skills/rlama/scripts/rlama_bench.py <rag-name> --retrieval-only

Full synthesis benchmark (8 test cases)

python3 ~/.claude/skills/rlama/scripts/rlama_bench.py <rag-name> --provider ollama --verbose

Single test case

python3 ~/.claude/skills/rlama/scripts/rlama_bench.py <rag-name> --provider ollama --case 0

JSON output for analysis

python3 ~/.claude/skills/rlama/scripts/rlama_bench.py <rag-name> --provider ollama --json

Scores: retrieval precision, topic coverage, grounding, directness (anti-hedge), composite (0-100).

Create a RAG

Index documents from a folder into a new RAG system:

Basic creation (uses llama3.2 by default)

rlama rag llama3.2 <rag-name> <folder-path>

Examples

rlama rag llama3.2 my-notes ~/Notes
rlama rag llama3.2 project-docs ./docs
rlama rag llama3.2 research-papers ~/Papers

With exclusions

rlama rag llama3.2 codebase ./src --exclude-dir=node_modules,dist,.git --exclude-ext=.log,.tmp

Only specific file types

rlama rag llama3.2 markdown-docs ./docs --process-ext=.md,.txt

Custom chunking strategy

rlama rag llama3.2 my-rag ./docs --chunking=semantic --chunk-size=1500 --chunk-overlap=300

Chunking strategies:

hybrid (default) - Combines semantic and fixed chunking
semantic - Respects document structure (paragraphs, sections)
fixed - Fixed character count chunks
hierarchical - Preserves document hierarchy
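To make the --chunk-size and --chunk-overlap parameters concrete, here is a minimal sketch of the fixed strategy. It is an illustration of the general technique, not RLAMA's implementation, and the function name `fixed_chunks` is hypothetical.

```python
def fixed_chunks(text, chunk_size=1500, chunk_overlap=300):
    """Split text into fixed-size character windows that overlap.

    Each window starts (chunk_size - chunk_overlap) characters after the last.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

With the values from the example command (1500 and 300), consecutive chunks share 300 characters, so a sentence cut at one boundary still appears whole in the neighbouring chunk.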

List RAG Systems

List all RAGs

rlama list

List documents in a specific RAG

rlama list-docs <rag-name>

Inspect chunks (debugging)

rlama list-chunks <rag-name> --document=filename.pdf

Manage Documents

Add documents to existing RAG:

rlama add-docs <rag-name> <folder-or-file>

Examples

rlama add-docs my-notes ~/Notes/new-notes
rlama add-docs research ./papers/new-paper.pdf

Remove a document:

rlama remove-doc <rag-name> <document-id>

Document ID is typically the filename

rlama remove-doc my-notes old-note.md
rlama remove-doc research outdated-paper.pdf

Force remove without confirmation

rlama remove-doc my-notes old-note.md --force

Delete a RAG

rlama delete <rag-name>

Or manually remove the data directory

rm -rf ~/.rlama/<rag-name>

Advanced Features

Web Crawling

Create a RAG from website content:

Crawl a website and create RAG

rlama crawl-rag llama3.2 docs-rag https://docs.example.com

Add web content to existing RAG

rlama crawl-add-docs my-rag https://blog.example.com

Directory Watching

Automatically update RAG when files change:

Enable watching

rlama watch <rag-name> <folder-path>

Check for new files manually

rlama check-watched <rag-name>

Disable watching

rlama watch-off <rag-name>

Website Watching

Monitor websites for content updates:

rlama web-watch <rag-name> https://docs.example.com
rlama check-web-watched <rag-name>
rlama web-watch-off <rag-name>

Reranking

Improve result relevance with reranking:

Add reranker to existing RAG

rlama add-reranker <rag-name>

Configure reranker weight (0-1, default 0.7)

rlama update-reranker <rag-name> --reranker-weight=0.8

Disable reranking

rlama rag llama3.2 my-rag ./docs --disable-reranker
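One way to read the reranker weight is as a linear blend between the reranker score and the original vector-similarity score. The sketch below assumes that interpretation; the formula and the name `blend_scores` are assumptions, not confirmed by this document.

```python
def blend_scores(vector_score, rerank_score, weight=0.7):
    """Combine two relevance scores; weight=1.0 trusts the reranker alone.

    Assumed formula: weight * rerank + (1 - weight) * vector.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * rerank_score + (1.0 - weight) * vector_score
```

Under this reading, raising the weight from 0.7 to 0.8 shifts more of the final ranking onto the reranker's judgement.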

API Server

Run RLAMA as an API server for programmatic access:

Start API server

rlama api --port 11249

Query via API

curl -X POST http://localhost:11249/rag \
  -H "Content-Type: application/json" \
  -d '{ "rag_name": "my-docs", "prompt": "What are the key points?", "context_size": 20 }'
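The same request can be issued from Python using only the standard library. This is a minimal sketch: the endpoint, port, and JSON fields come from the curl example, while the helper names are hypothetical and the response is returned as raw text because its schema is not described here.

```python
import json
import urllib.request


def build_rag_request(rag_name, prompt, context_size=20,
                      base_url="http://localhost:11249"):
    """Build the POST request for the /rag endpoint without sending it."""
    body = json.dumps({
        "rag_name": rag_name,
        "prompt": prompt,
        "context_size": context_size,
    }).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/rag",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_rag(rag_name, prompt, timeout=120, **kwargs):
    """Send the request and return the raw response body as text."""
    request = build_rag_request(rag_name, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches the server.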

Model Management

Update the model used by a RAG

rlama update-model <rag-name> <new-model>

Example: Switch to a more powerful model

rlama update-model my-rag deepseek-r1:8b

Use Hugging Face models

rlama rag hf.co/username/repo my-rag ./docs
rlama rag hf.co/username/repo:Q4_K_M my-rag ./docs

Use OpenAI models (requires OPENAI_API_KEY)

export OPENAI_API_KEY="your-key"
rlama rag gpt-4-turbo my-openai-rag ./docs

Configuration

Data Directory

By default, RLAMA stores data in ~/.rlama/. Change this with --data-dir:

Use custom data directory

rlama --data-dir=/path/to/custom list
rlama --data-dir=/projects/rag-data rag llama3.2 project-rag ./docs

Or set via environment (add to ~/.zshrc)

export RLAMA_DATA_DIR="/path/to/custom"

Ollama Configuration

Custom Ollama host

rlama --host=192.168.1.100 --port=11434 run my-rag

Or via environment

export OLLAMA_HOST="http://192.168.1.100:11434"

Default Model

The skill uses qwen2.5:7b by default (changed from llama3.2 in Jan 2026). For legacy mode:

Use the old llama3.2 default

python3 ~/.claude/skills/rlama/scripts/rlama_manage.py create my-rag ./docs --legacy

Per-command model override

rlama rag deepseek-r1:8b my-rag ./docs

For queries

rlama run my-rag --query "question" -m deepseek-r1:8b

Recommended models:

| Model | Size | Best For |
|---|---|---|
| qwen2.5:7b | 7B | Default: fast RAG queries (recommended) |
| qwen3.5:9b | 9B | Reasoning mode: deeper synthesis, strict citations (--reasoning) |
| llama3.2 | 3B | Fast, legacy default (use --legacy) |
| deepseek-r1:8b | 8B | Complex questions |
| llama3.3:70b | 70B | Highest quality (slow) |

Reasoning mode (--reasoning flag) uses qwen3.5:9b for local Ollama synthesis with the strict research-grade prompt (normally reserved for cloud providers). This gives research-quality cited answers without leaving the machine. Override the model via RLAMA_REASONING_MODEL env var.

Reasoning mode — complex cross-document synthesis (think OFF, fast)

python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag> "complex query" --synthesize --reasoning

Reasoning mode with thinking (chain-of-thought, slower but deeper)

python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag> "complex query" --synthesize --reasoning --think

Equivalent explicit invocation

python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag> "query" --synthesize --provider ollama --synth-model qwen3.5:9b

| Flag | Model | Think | Prompt | Max Tokens | Timeout |
|---|---|---|---|---|---|
| (default) | qwen2.5:7b | off | light (anti-hedge) | 2048 | 120s |
| --reasoning | qwen3.5:9b | off | strict (cited) | 4096 | 300s |
| --reasoning --think | qwen3.5:9b | on | strict (cited) | 4096 | 300s |

Thinking mode produces internal chain-of-thought reasoning before the answer. The thinking text is included in JSON output (synthesis.thinking field) but not printed in plain text mode. Use for ambiguous cross-document analysis where you want to see the model's working.
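When consuming the --json output, the thinking text can be pulled out separately from the rest of the payload. The sketch below relies only on the `synthesis.thinking` field named above; the rest of the JSON structure is not documented here, so nothing else is assumed. The function name `extract_thinking` is hypothetical.

```python
import json


def extract_thinking(raw_json):
    """Return the chain-of-thought text from --json output, or None if absent."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        return None
    synthesis = data.get("synthesis")
    if not isinstance(synthesis, dict):
        return None
    return synthesis.get("thinking")
```

A `None` result means the run did not use --think or did not synthesize at all.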

Supported File Types

RLAMA indexes these formats:

Text: .txt, .md, .markdown
Documents: .pdf, .docx, .doc
Code: .py, .js, .ts, .go, .rs, .java, .rb, .cpp, .c, .h
Data: .json, .yaml, .yml, .csv
Web: .html, .htm
Org-mode: .org
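Before indexing a large folder, it can help to preview which files match the supported extensions. The sketch below is a standalone helper, not part of RLAMA; the name `indexable_files` is hypothetical, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions copied from the list above.
SUPPORTED = {
    ".txt", ".md", ".markdown",
    ".pdf", ".docx", ".doc",
    ".py", ".js", ".ts", ".go", ".rs", ".java", ".rb", ".cpp", ".c", ".h",
    ".json", ".yaml", ".yml", ".csv",
    ".html", ".htm",
    ".org",
}


def indexable_files(folder):
    """Recursively list files under `folder` whose extension is supported."""
    root = Path(folder).expanduser()
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```

Comparing `len(indexable_files("~/Documents"))` with the total file count gives a quick sense of how much of a folder RLAMA can use.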

Example Workflows

Personal Knowledge Base

Create from multiple folders

rlama rag llama3.2 personal-kb ~/Documents
rlama add-docs personal-kb ~/Notes
rlama add-docs personal-kb ~/Downloads/papers

Query

rlama run personal-kb --query "what did I write about project management?"

Code Documentation

Index project docs

rlama rag llama3.2 project-docs ./docs
rlama add-docs project-docs ./README.md

Query architecture

rlama run project-docs --query "how does authentication work?" --context-size 25

Research Papers

Create research RAG

rlama rag llama3.2 papers ~/Papers --exclude-ext=.bib

Add specific paper

rlama add-docs papers ./new-paper.pdf

Query with high context

rlama run papers --query "what methods are used for evaluation?" --context-size 30

Interactive Wizard

For guided RAG creation:

rlama wizard

Resilient Indexing (Skip Problem Files)

For folders with mixed content where some files may exceed embedding context limits (e.g., large PDFs), use the resilient script that processes files individually and skips failures:

Create RAG, skipping files that fail

python3 ~/.claude/skills/rlama/scripts/rlama_resilient.py create my-rag ~/Documents

Add to existing RAG, skipping failures

python3 ~/.claude/skills/rlama/scripts/rlama_resilient.py add my-rag ~/MoreDocs

With docs-only filter

python3 ~/.claude/skills/rlama/scripts/rlama_resilient.py create research ~/Papers --docs-only

With legacy model

python3 ~/.claude/skills/rlama/scripts/rlama_resilient.py create my-rag ~/Docs --legacy

The script reports which files were added and which were skipped due to errors.
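The skip-on-failure pattern is simple to picture. The sketch below shows the general idea under stated assumptions: `index_one` stands in for whatever adds a single file to the RAG, and both names are hypothetical rather than taken from `rlama_resilient.py`.

```python
def index_resiliently(files, index_one):
    """Index files one at a time, collecting failures instead of aborting.

    Returns (added, skipped), where skipped holds (path, error message) pairs.
    """
    added, skipped = [], []
    for path in files:
        try:
            index_one(path)
        except Exception as exc:  # one bad file must not stop the batch
            skipped.append((path, str(exc)))
        else:
            added.append(path)
    return added, skipped
```

Processing files individually trades some speed for the guarantee that one oversized PDF cannot abort the whole run.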

Progress Monitoring

Monitor long-running RLAMA operations in real-time using the logging system.

Tail the Log File

Watch all operations in real-time

tail -f ~/.rlama/logs/rlama.log

Filter by RAG name

tail -f ~/.rlama/logs/rlama.log | grep my-rag

Pretty-print with jq

tail -f ~/.rlama/logs/rlama.log | jq -r '"\(.ts) [\(.cat)] \(.msg)"'

Show only progress updates

tail -f ~/.rlama/logs/rlama.log | jq -r 'select(.data.i) | "\(.ts) [\(.cat)] \(.data.i)/\(.data.total) \(.data.file // .data.status)"'

Check Operation Status

Show active operations

python3 ~/.claude/skills/rlama/scripts/rlama_status.py

Show recent completed operations

python3 ~/.claude/skills/rlama/scripts/rlama_status.py --recent

Show both active and recent

python3 ~/.claude/skills/rlama/scripts/rlama_status.py --all

Follow mode (formatted tail -f)

python3 ~/.claude/skills/rlama/scripts/rlama_status.py --follow

JSON output

python3 ~/.claude/skills/rlama/scripts/rlama_status.py --json

Log File Format

Logs are written in JSON Lines format to ~/.rlama/logs/rlama.log:

{"ts": "2026-02-03T12:34:56.789", "level": "info", "cat": "INGEST", "msg": "Progress 45/100", "data": {"op_id": "ingest_abc123", "i": 45, "total": 100, "file": "doc.pdf", "eta_sec": 85}}
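A log line in this format can be parsed with the standard library alone. The sketch below mirrors the progress filter shown earlier and uses only the field names from the sample record; the function name `format_progress` is hypothetical.

```python
import json


def format_progress(line):
    """Turn one JSON Lines record into a progress string, or None if it has no progress."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # tolerate blank or partially written lines
    if not isinstance(record, dict):
        return None
    data = record.get("data") or {}
    if "i" not in data or "total" not in data:
        return None
    label = data.get("file") or data.get("status") or ""
    text = f'{record.get("ts", "")} [{record.get("cat", "")}] {data["i"]}/{data["total"]} {label}'
    return text.rstrip()
```

Applied to each line of `~/.rlama/logs/rlama.log`, this yields output such as `2026-02-03T12:34:56.789 [INGEST] 45/100 doc.pdf` for the sample record.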

Operations State

Active and recent operations are tracked in ~/.rlama/logs/operations.json:

{
  "active": {
    "ingest_abc123": {
      "type": "ingest",
      "rag_name": "my-docs",
      "started": "2026-02-03T12:30:00",
      "processed": 45,
      "total": 100,
      "eta_sec": 85
    }
  },
  "recent": [...]
}
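The state file can be summarized in a few lines of Python. This sketch uses only the keys visible in the example above; the function name `summarize_active` is hypothetical, and the layout of `recent` entries is left untouched because it is not shown.

```python
import json
from pathlib import Path


def summarize_active(state):
    """Return one human-readable line per active operation."""
    rows = []
    for op_id, op in (state.get("active") or {}).items():
        total = op.get("total") or 0
        processed = op.get("processed") or 0
        percent = 100.0 * processed / total if total else 0.0
        rows.append(
            f"{op_id}: {op.get('rag_name', '?')} {processed}/{total} "
            f"({percent:.0f}%), ETA {op.get('eta_sec', '?')}s"
        )
    return rows


def load_state(path="~/.rlama/logs/operations.json"):
    """Read the operations file from its default location."""
    return json.loads(Path(path).expanduser().read_text(encoding="utf-8"))
```

Calling `summarize_active(load_state())` prints a compact view of what is currently running without tailing the log.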

Troubleshooting

"Ollama not found"

Check Ollama status

ollama --version
ollama list

Start Ollama

brew services start ollama # macOS
ollama serve # Manual start

"Model not found"

Pull the required model

ollama pull llama3.2
ollama pull nomic-embed-text # Embedding model

Slow Indexing

Use smaller embedding models
Exclude large binary files: --exclude-ext=.bin,.zip,.tar
Exclude build directories: --exclude-dir=node_modules,dist,build

Poor Query Results

Increase context size: --context-size=30
Use a better model: -m deepseek-r1:8b
Re-index with semantic chunking: --chunking=semantic
Enable reranking: rlama add-reranker <rag-name>

Index Corruption

Delete and recreate

rm -rf ~/.rlama/<rag-name>
rlama rag llama3.2 <rag-name> <folder-path>

CLI Reference

Full command reference available at:

rlama --help
rlama <command> --help

Or see references/rlama-commands.md for complete documentation.
