osgrep

Semantic NLP-based code search using neural embeddings and hybrid ranking

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "osgrep" with this command: npx skills add zpankz/mcp-skillset/zpankz-mcp-skillset-osgrep

osgrep - Semantic Code Search Skill

Overview

osgrep transforms code search from exact string matching to semantic understanding. It uses neural embeddings (Granite + ColBERT) with hybrid ranking (Vector + BM25 + RRF) to find code by meaning, not just tokens.

Core principle: grep finds strings, osgrep finds concepts.

When to Use This Skill

Activate osgrep when:

  • User asks to "find" or "search for" concepts (not exact strings)
  • Exploring unfamiliar codebases
  • Finding similar implementations across files/languages
  • Questions like "how does the code handle X?"
  • Architectural understanding ("show me all API endpoints")
  • Pattern discovery ("find retry logic examples")

DO NOT use osgrep when:

  • User provides exact string/regex pattern → use rg or grep
  • Searching for specific variable/function names → use rg
  • Simple token search → use rg
  • User explicitly requests grep/ripgrep → respect their choice

Core Concepts

Semantic vs Lexical Search

# Lexical (grep/rg) - exact tokens
rg "authenticate"  # Finds: authenticate, authenticated, authentication

# Semantic (osgrep) - conceptual meaning
osgrep "user login verification"
# Finds: authenticate(), verifyCredentials(), checkUserSession(),
#        validateToken(), isAuthorized() - all semantically similar

Hybrid Architecture

Query → [Vector Search] + [BM25 Full-Text] → [RRF Fusion] → [ColBERT Rerank] → Results
         (semantic)        (lexical)          (rank merge)    (precision)

Why hybrid:

  • Vector search: Captures semantic similarity ("auth" ≈ "login" ≈ "credential")
  • BM25: Ensures exact tokens aren't missed ("JWT", "bcrypt")
  • RRF: Merges rankings without score normalization
  • ColBERT: Refines top results with token-level cross-attention

Relevance Scoring

0.9-1.0: Nearly identical (duplicates)
0.7-0.9: Highly relevant ✓ (trust these)
0.5-0.7: Moderately relevant (check manually)
0.3-0.5: Weakly relevant (likely noise)
0.0-0.3: False positive (ignore)

Best practice: Filter for scores > 0.6 in automated workflows.

Command Reference

Basic Search

osgrep "<conceptual query>" [path] [options]

# Examples
osgrep "JWT token validation"
osgrep "error handling for database connections" src/
osgrep "rate limiting middleware" --max-count 20

Server Mode (Recommended)

# Start server with live indexing
osgrep serve --port 4444 &

# All queries now use auto-updated index
osgrep "query"

Indexing

osgrep index                    # Index current directory
osgrep list                     # Show all indexed projects
osgrep doctor                   # Check installation health

Output Formats

osgrep "query"                  # Default: snippets with context
osgrep "query" --content        # Full chunk content
osgrep "query" --compact        # File paths only (for piping)
osgrep "query" --scores         # Show relevance scores
osgrep "query" --json           # JSON output (for scripting)
osgrep "query" --plain          # No ANSI colors

Result Control

osgrep "query" --max-count 5    # Total results (default: 10)
osgrep "query" --per-file 3     # Matches per file (default: 1)
osgrep "query" --sync           # Sync index before search

Integration with Claude Code

Prefer --compact for File Lists

# When Claude needs file paths for further processing
files=$(osgrep "authentication logic" --compact)
echo "$files" | xargs cat  # Read files
echo "$files" | xargs rg "specific_token"  # Further filtering

Use --json for Structured Output

# When Claude needs to parse results programmatically
osgrep "error handling" --json |
  jq -r '.results[] | select(.score > 0.7) | .path' |
  sort -u

Use --scores for Confidence

# When Claude needs to assess result relevance
osgrep "database migration" --scores |
  awk '$2 > 0.65 {print $1}'  # Only high-confidence matches

Combine with Other Tools

# osgrep → rg pipeline
osgrep "API endpoint" --compact | xargs rg "POST|PUT"

# osgrep → ast-grep pipeline
osgrep "validation function" --compact | xargs ast-grep --pattern 'validate($$$)'

# osgrep → fzf pipeline (interactive)
osgrep "component definition" --compact | fzf --preview 'bat {}'

Common Workflows

1. Exploratory Search (Learning Codebase)

# Start broad
osgrep "authentication system" --max-count 20 --scores

# Identify key files
osgrep "authentication system" --compact | sort -u

# Deep dive specific aspects
osgrep "JWT token generation" --content
osgrep "password hashing" --content
osgrep "session management" --content

2. Finding Similar Implementations

# Find pattern variations
osgrep "rate limiting with token bucket" --max-count 15

# Cross-language search
cd ../python-service && osgrep "rate limiting middleware"
cd ../go-service && osgrep "rate limiting middleware"

3. Debugging (Find Error Sources)

# Semantic error search
osgrep "connection timeout error handling" --scores

# Filter high-confidence results
osgrep "connection timeout error handling" --scores |
  awk '$2 > 0.7' |
  cut -d: -f1 |
  sort -u |
  xargs bat

4. API Discovery

# Find endpoints
osgrep "REST API endpoint handlers" --max-count 30

# Filter by HTTP method
osgrep "POST API endpoint" --compact | xargs rg "POST"

# Find middleware
osgrep "authentication middleware" --content

5. Security Audit

# Find anti-patterns
osgrep "SQL injection vulnerability" --scores
osgrep "hardcoded secrets" --scores
osgrep "plaintext password" --scores

# Verify good patterns exist
osgrep "parameterized SQL queries" --max-count 10
osgrep "bcrypt password hashing" --max-count 10

6. Refactoring Prep (Impact Analysis)

# Find old pattern
osgrep "legacy cookie authentication" --compact > legacy.txt

# Find new pattern
osgrep "JWT token authentication" --compact > modern.txt

# Identify migration candidates
comm -23 <(sort legacy.txt) <(sort modern.txt)

7. Testing Gap Analysis

# Find tests
osgrep "unit test for authentication" --compact | sort > tested.txt

# Find implementations
osgrep "authentication logic" --compact | sort > all.txt

# Find untested code
comm -13 tested.txt all.txt

Query Optimization

Effective Queries (Conceptual + Specific)

# Good: Concept + context
osgrep "JWT authentication with refresh token rotation"
osgrep "error handling for database connection timeouts"
osgrep "input validation using zod schemas"
osgrep "pagination with cursor-based approach"

Ineffective Queries (Too Vague or Too Specific)

# Too vague
osgrep "function"  # Use rg instead
osgrep "error"     # Too broad

# Too specific (exact token)
osgrep "getUserById"  # Use rg for exact names
osgrep "import { something }"  # Use rg for imports

Query Expansion Strategy

# Instead of single term
osgrep "cache"  # ✗ Vague

# Expand to concept
osgrep "caching strategy with TTL expiration"  # ✓ Clear intent

# Add domain context
osgrep "Redis caching with automatic invalidation"  # ✓ Very specific

Performance Considerations

Query Speed

Fast (<200ms):
  - osgrep "query" --max-count 3 --compact
  - osgrep "query" specific/path/ --max-count 5

Medium (~400ms):
  - osgrep "query"  # Default: 10 results
  - osgrep "query" --scores

Slow (>600ms):
  - osgrep "query" --max-count 50
  - osgrep "query" --content --max-count 20

Index Size

Small (<1k files):     ~50-100MB,    index time ~10-30s
Medium (1k-5k files):  ~100-500MB,   index time ~1-3min
Large (5k-10k files):  ~500MB-2GB,   index time ~5-15min
Monorepo (>10k files): ~2-10GB,      index time ~15-60min

Optimization Tips

# Use server mode for active development (auto-indexing)
osgrep serve &

# Limit results for speed
osgrep "query" --max-count 5

# Target specific paths
osgrep "query" src/api/

# Use --compact to skip rendering
osgrep "query" --compact

Best Practices for Claude Code

1. Always Check Index Status

# Before first search in a project
osgrep doctor  # Verify installation
osgrep list    # Check if project is indexed

# If not indexed
osgrep index

2. Start with Server Mode

# At project start
osgrep serve --port 4444 &

# Subsequent searches use live index
osgrep "query"

3. Use Appropriate Output Format

# For file lists → --compact
osgrep "concept" --compact

# For confidence assessment → --scores
osgrep "concept" --scores

# For scripting → --json
osgrep "concept" --json | jq ...

# For reading → default or --content
osgrep "concept"
osgrep "concept" --content

4. Filter by Relevance

# Only high-confidence results
osgrep "query" --scores | awk '$2 > 0.7'

# JSON filtering
osgrep "query" --json |
  jq -r '.results[] | select(.score > 0.7) | .path'

5. Combine with Traditional Tools

# Semantic → lexical pipeline
osgrep "API endpoint" --compact | xargs rg "router\.(get|post)"

# Semantic → syntax pipeline
osgrep "validation" --compact | xargs ast-grep --pattern 'validate($$$)'

6. Handle No Results Gracefully

# If no results, try broader query
results=$(osgrep "very specific query" --compact)
if [[ -z "$results" ]]; then
  results=$(osgrep "broader concept" --compact)
fi

# Or sync and retry
osgrep "query" --sync

7. Use Per-File Limits Strategically

# Finding entry points (one per file)
osgrep "main application initialization" --per-file 1

# Deep exploration (many per file)
osgrep "error handling patterns" --per-file 5

Limitations and Workarounds

Limitation: No Boolean Operators

# No AND/OR/NOT in queries
osgrep "auth AND JWT"  # ✗ Doesn't work

# Workaround: Multiple queries + set operations
comm -12 \
  <(osgrep "authentication" --compact | sort) \
  <(osgrep "JWT" --compact | sort)

Limitation: No Regex

# Cannot use regex patterns
osgrep "user_\d+"  # ✗ Doesn't work

# Workaround: osgrep then grep
osgrep "user identifier" --compact | xargs rg "user_\d+"

Limitation: English-Optimized

# Less effective on non-English code
osgrep "用户认证"  # May not work well

# Workaround: Use English equivalent
osgrep "user authentication"

Limitation: Cold Start Delay

# First query after boot ~1-2s (model loading)

# Workaround: Use server mode (models stay loaded)
osgrep serve &

Troubleshooting

No Results or Low Scores

# Check if index exists
osgrep list

# Reindex if stale
osgrep index

# Try broader query
osgrep "broader concept" --max-count 20

# Check scores to diagnose
osgrep "query" --scores

Server Issues

# Check if server is running
lsof -i :4444

# Start server if not running
osgrep serve &

# Check installation health
osgrep doctor

Index Corruption

# Clear and rebuild
cd /path/to/project
rm -rf ~/.osgrep/data/$(basename $(pwd))
osgrep index

Examples for Claude Code

Example 1: Architecture Exploration

# User: "Explain how the authentication system works"

# Step 1: Find auth components
osgrep "authentication system" --compact > auth_files.txt

# Step 2: Get detailed content
cat auth_files.txt | head -5 | xargs osgrep "authentication flow" --content

# Step 3: Find related components
osgrep "JWT token management" --content
osgrep "session storage" --content
osgrep "password validation" --content

Example 2: Bug Investigation

# User: "Why are database connections timing out?"

# Step 1: Find timeout handling
osgrep "database connection timeout" --scores --max-count 20

# Step 2: Filter high-confidence matches
osgrep "database connection timeout" --scores |
  awk '$2 > 0.7' |
  cut -d: -f1 |
  sort -u > relevant_files.txt

# Step 3: Check retry logic
cat relevant_files.txt | xargs osgrep "retry connection" --content

Example 3: API Endpoint Discovery

# User: "List all POST endpoints"

# Step 1: Find API handlers
osgrep "API endpoint handler" --compact | sort -u > endpoints.txt

# Step 2: Filter by HTTP method
cat endpoints.txt | xargs rg "POST" | cut -d: -f1 | sort -u

# Step 3: Get endpoint details
osgrep "POST endpoint implementation" --content --max-count 10

Example 4: Security Review

# User: "Audit the code for security issues"

# Step 1: Find potential vulnerabilities
osgrep "SQL injection" --scores > sql_issues.txt
osgrep "hardcoded credentials" --scores > cred_issues.txt
osgrep "plaintext password" --scores > pwd_issues.txt

# Step 2: Check for proper patterns
osgrep "parameterized SQL query" --max-count 10
osgrep "environment variable configuration" --max-count 10
osgrep "bcrypt password hashing" --max-count 10

# Step 3: Generate report
cat sql_issues.txt cred_issues.txt pwd_issues.txt |
  awk '$2 > 0.6 {print $0}'

Example 5: Refactoring Analysis

# User: "Help me refactor legacy authentication to JWT"

# Step 1: Find legacy pattern
osgrep "session cookie authentication" --compact | sort > legacy.txt

# Step 2: Find modern pattern examples
osgrep "JWT token authentication" --compact | sort > modern.txt

# Step 3: Identify migration files
comm -23 legacy.txt modern.txt > needs_migration.txt

# Step 4: Analyze migration complexity
wc -l needs_migration.txt
cat needs_migration.txt | xargs wc -l | sort -n

References

See the codebase documentation:

  • Core principles: ../osgrep-codebase/principles/semantic-search.md
  • Ranking details: ../osgrep-codebase/principles/hybrid-ranking.md
  • Workflow templates: ../osgrep-codebase/templates/search-workflow.md
  • Type definitions: ../osgrep-codebase/types/core.ts
  • Cheatsheet: ./assets/cheatsheet.md
  • Configuration: ./references/configuration.md
  • Search patterns: ./references/search-patterns.md

Scripts

  • search-validator.sh: Validate search result relevance
  • See ./scripts/ directory

Version

  • osgrep: 0.4.15
  • Models: Granite Embedding (30M) + osgrep-colbert (Q8)
  • Storage: LanceDB with Tree-sitter parsing

Quick Start for Claude Code

# 1. Check installation
osgrep doctor

# 2. Index project (if not already)
cd /path/to/project
osgrep list | grep -q "$(pwd)" || osgrep index

# 3. Start server (recommended)
osgrep serve --port 4444 &

# 4. Search semantically
osgrep "user authentication logic" --scores

# 5. Use with other tools
osgrep "API endpoint" --compact | xargs rg "router"

Key Principle: Use osgrep for semantic exploration, grep/rg for lexical precision. Combine both for powerful code understanding workflows.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

code refactoring

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

dspy-code

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

code-refactor

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

code

No summary provided by upstream source.

Repository SourceNeeds Review