spice-search

Search data using vector similarity, full-text keywords, or hybrid methods with Reciprocal Rank Fusion (RRF). Use when setting up embeddings for search, configuring full-text indexing, writing vector_search/text_search/rrf SQL queries, using the /v1/search HTTP API, or configuring vector engines like S3 Vectors.


Install the "spice-search" skill:

```shell
npx skills add spiceai/skills/spiceai-skills-spice-search
```

Search Data

Spice provides integrated search capabilities: vector (semantic) search, full-text (keyword) search, and hybrid search with Reciprocal Rank Fusion (RRF) — all via SQL functions and HTTP APIs. Search indexes are built on top of accelerated datasets.

Search Methods

| Method | When to Use | Requires |
|---|---|---|
| Vector search | Semantic similarity, RAG, recommendations | Embedding model + column embeddings |
| Full-text search | Keyword/phrase matching, exact terms | `full_text_search.enabled: true` on columns |
| Hybrid (RRF) | Best of both: combines rankings from multiple methods | Multiple search methods configured |
| Lexical (LIKE/=) | Exact pattern or value matching | Nothing extra |

Set Up Vector Search

1. Define an Embedding Model

```yaml
embeddings:
  - name: local_embeddings
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

  - name: openai_embeddings
    from: openai:text-embedding-3-small
    params:
      openai_api_key: ${ secrets:OPENAI_API_KEY }
```

Supported Embedding Providers

| Provider | From Format | Status |
|---|---|---|
| OpenAI | `openai:text-embedding-3-large` | Release Candidate |
| HuggingFace | `huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2` | Release Candidate |
| Local file | `file:model.safetensors` | Release Candidate |
| Azure OpenAI | `azure:my-deployment` | Alpha |
| Google AI | `google:text-embedding-004` | Alpha |
| Amazon Bedrock | `bedrock:amazon.titan-embed-text-v1` | Alpha |
| Databricks | `databricks:endpoint` | Alpha |
| Model2Vec | `model2vec:model-name` | Alpha |

2. Configure Dataset Columns for Embeddings

```yaml
datasets:
  - from: postgres:documents
    name: docs
    acceleration:
      enabled: true
    columns:
      - name: content
        embeddings:
          - from: local_embeddings
            row_id: id
            chunking:
              enabled: true
              target_chunk_size: 512
              overlap_size: 64
```

Embedding Methods

| Method | Description | When to Use |
|---|---|---|
| Accelerated | Precomputed and stored | Faster queries, frequently searched datasets |
| JIT (Just-in-Time) | Computed at query time (no acceleration) | Large or rarely queried datasets |
| Passthrough | Pre-existing embeddings used directly | Source already has `<col>_embedding` columns |
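As a sketch of the JIT mode (dataset and column names illustrative): omitting the `acceleration` block leaves embeddings to be computed at query time rather than precomputed:

```yaml
datasets:
  - from: postgres:archive_documents
    name: archive            # not accelerated: embeddings computed just-in-time
    columns:
      - name: content
        embeddings:
          - from: local_embeddings
```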

3. Query via HTTP API

```shell
curl -X POST http://localhost:8090/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "datasets": ["docs"],
    "text": "cutting edge AI",
    "where": "author=\"jeadie\"",
    "additional_columns": ["title", "state"],
    "limit": 5
  }'
```
| Field | Required | Description |
|---|---|---|
| `text` | Yes | Search text |
| `datasets` | No | Datasets to search (null = all searchable) |
| `additional_columns` | No | Extra columns to return |
| `where` | No | SQL filter predicate |
| `limit` | No | Max results per dataset |

To retrieve full documents (not just chunks), include the embedding column name in additional_columns.
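For example, if `content` is the embedded column, a request body like the following (values illustrative) returns the full `content` value alongside each matching chunk:

```json
{
  "datasets": ["docs"],
  "text": "cutting edge AI",
  "additional_columns": ["title", "content"],
  "limit": 5
}
```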

4. Query via SQL UDTF

```sql
SELECT id, title, score
FROM vector_search(docs, 'cutting edge AI')
WHERE state = 'Open'
ORDER BY score DESC
LIMIT 5;
```

vector_search signature:

```sql
vector_search(
  table STRING,          -- Dataset name (required)
  query STRING,          -- Search text (required)
  col STRING,            -- Column (optional if single embedding column)
  limit INTEGER,         -- Max results (default: 1000)
  include_score BOOLEAN  -- Include score column (default: TRUE)
) RETURNS TABLE
```
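When a dataset has more than one embedding column, the column must be passed explicitly. A sketch using the positional arguments above (dataset and column names illustrative):

```sql
SELECT id, title, score
FROM vector_search(docs, 'cutting edge AI', content, 10, TRUE)
ORDER BY score DESC;
```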

Limitation: vector_search UDTF does not yet support chunked embedding columns. Use the HTTP API for chunked data.

Set Up Full-Text Search

Full-text search uses BM25 scoring (powered by Tantivy) for keyword relevance ranking.

1. Enable Indexing on Columns

```yaml
datasets:
  - from: postgres:articles
    name: articles
    acceleration:
      enabled: true
    columns:
      - name: title
        full_text_search:
          enabled: true
          row_id:
            - id
      - name: body
        full_text_search:
          enabled: true
```

2. Query via SQL UDTF

```sql
SELECT id, title, score
FROM text_search(articles, 'search keywords', body)
ORDER BY score DESC
LIMIT 5;
```

text_search signature:

```sql
text_search(
  table STRING,          -- Dataset name (required)
  query STRING,          -- Keywords/phrase (required)
  col STRING,            -- Column (required if multiple indexed columns)
  limit INTEGER,         -- Max results (default: 1000)
  include_score BOOLEAN  -- Include score column (default: TRUE)
) RETURNS TABLE
```
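BM25 combines term frequency, inverse document frequency, and document-length normalization. The toy scorer below illustrates the ranking behavior; Spice's actual scoring is performed by Tantivy, not this code:

```python
import math

# Toy BM25 scorer with the common k1/b defaults (illustrative only).
def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    avgdl = sum(len(d) for d in corpus) / len(corpus)  # average document length
    n = len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # document frequency
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))  # inverse document frequency
        tf = doc.count(term)                             # term frequency in this doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "neural networks learn representations".split(),
    "gradient descent optimizes weights".split(),
    "cooking pasta requires boiling water".split(),
]
```

A document containing the query terms scores above zero; one without them scores zero, so keyword overlap drives the ranking.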

Hybrid Search with RRF

Reciprocal Rank Fusion merges rankings from multiple search methods. Each query runs independently, then results are combined:

RRF Score = Σ(rank_weight / (k + rank))

Documents appearing across multiple result sets receive higher scores.
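The fusion step can be sketched in a few lines (a simplified model, not Spice's implementation; `k` and the per-list weights correspond to the `k` and `rank_weight` parameters of `rrf`):

```python
# Simplified Reciprocal Rank Fusion: merge ranked lists of document ids.
def rrf_scores(rankings, k=60.0, weights=None):
    """rankings: list of ranked id lists (best first); weights: per-list rank_weight."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes rank_weight / (k + rank)
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

vector_hits = ["d3", "d1", "d2"]   # semantic ranking
text_hits = ["d1", "d4", "d3"]     # keyword ranking
fused = rrf_scores([vector_hits, text_hits])
```

Here `d1` fuses highest because it appears near the top of both lists, while documents found by only one method score lower.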

Basic Hybrid Search

```sql
SELECT id, title, content, fused_score
FROM rrf(
    vector_search(documents, 'machine learning algorithms'),
    text_search(documents, 'neural networks deep learning', content),
    join_key => 'id'
)
ORDER BY fused_score DESC
LIMIT 5;
```

Weighted Ranking

```sql
SELECT fused_score, title, content
FROM rrf(
    text_search(posts, 'artificial intelligence', rank_weight => 50.0),
    vector_search(posts, 'AI machine learning', rank_weight => 200.0)
)
ORDER BY fused_score DESC
LIMIT 10;
```

Recency-Boosted Search

```sql
-- Exponential decay (1-hour scale)
SELECT fused_score, title, created_at
FROM rrf(
    text_search(news, 'breaking news'),
    vector_search(news, 'latest updates'),
    time_column => 'created_at',
    recency_decay => 'exponential',
    decay_constant => 0.05,
    decay_scale_secs => 3600
)
ORDER BY fused_score DESC
LIMIT 10;
```

```sql
-- Linear decay (24-hour window)
SELECT fused_score, content
FROM rrf(
    text_search(posts, 'trending'),
    vector_search(posts, 'viral popular'),
    time_column => 'created_at',
    recency_decay => 'linear',
    decay_window_secs => 86400
)
ORDER BY fused_score DESC;
```
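The decay knobs can be read as follows. This is an assumed functional form for illustration only; the exact formula Spice applies internally is not documented here:

```python
import math

# Assumed decay shapes: exponential decay scaled by decay_constant and
# decay_scale_secs; linear decay reaching zero at decay_window_secs.
def exponential_decay(age_secs, decay_constant=0.01, scale_secs=86400):
    return math.exp(-decay_constant * age_secs / scale_secs)

def linear_decay(age_secs, window_secs=86400):
    return max(0.0, 1.0 - age_secs / window_secs)
```

Under either shape, a brand-new document keeps its full fused score and older documents are progressively discounted.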

Cross-Language Search

```sql
SELECT fused_score, text, langs
FROM rrf(
    vector_search(posts, 'ultimas noticias', rank_weight => 100),
    text_search(posts, 'news'),
    time_column => 'created_at',
    recency_decay => 'exponential',
    decay_constant => 0.05,
    decay_scale_secs => 3600
)
WHERE trim(text) != ''
ORDER BY fused_score DESC LIMIT 15;
```

rrf Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `query_1, query_2, ...` | Search UDTF | Yes (2+) | `vector_search` or `text_search` calls (variadic) |
| `join_key` | String | No | Column for joining results (default: auto-hash) |
| `k` | Float | No | Smoothing parameter (default: 60.0; lower = more aggressive) |
| `time_column` | String | No | Timestamp column for recency boosting |
| `recency_decay` | String | No | `'exponential'` (default) or `'linear'` |
| `decay_constant` | Float | No | Rate for exponential decay (default: 0.01) |
| `decay_scale_secs` | Float | No | Time scale for exponential decay (default: 86400) |
| `decay_window_secs` | Float | No | Window for linear decay (default: 86400) |
| `rank_weight` | Float | No | Per-query weight (specified inside search calls) |
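For example, lowering `k` from its default of 60.0 makes top-ranked documents dominate the fused score (dataset and column names illustrative):

```sql
SELECT fused_score, title
FROM rrf(
    vector_search(docs, 'machine learning'),
    text_search(docs, 'machine learning', content),
    k => 20.0
)
ORDER BY fused_score DESC
LIMIT 5;
```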

Vector Engines

Store and index embeddings at scale using dedicated vector engines:

```yaml
datasets:
  - from: postgres:documents
    name: docs
    acceleration:
      enabled: true
    columns:
      - name: content
        embeddings:
          - from: embed_model
            row_id: id
        metadata:
          vectors: non-filterable
      - name: category
        metadata:
          vectors: filterable # enable filtering on this column
    vectors:
      enabled: true
      engine: s3_vectors
      params:
        s3_vectors_bucket: my-bucket
        s3_vectors_region: us-east-1
```
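Engine-backed datasets should be queryable through the same `vector_search` UDTF and HTTP API, with columns marked `filterable` available in predicates. A sketch assuming the spicepod above (query text illustrative):

```sql
SELECT id, content, score
FROM vector_search(docs, 'renewable energy storage')
WHERE category = 'science'
ORDER BY score DESC
LIMIT 5;
```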

Lexical Search (SQL)

Standard SQL filtering:

```sql
SELECT * FROM my_table WHERE column LIKE '%substring%';
SELECT * FROM my_table WHERE column = 'exact value';
SELECT * FROM my_table WHERE regexp_like(column, '^spice.*ai$');
```

CLI Search

```shell
spice search "cutting edge AI" --dataset docs --limit 5
spice search --cache-control no-cache "search terms"
```

Complete Example

```yaml
version: v1
kind: Spicepod
name: search_app

secrets:
  - from: env
    name: env

embeddings:
  - name: embeddings
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

datasets:
  - from: file:articles.parquet
    name: articles
    acceleration:
      enabled: true
      engine: duckdb
    columns:
      - name: title
        full_text_search:
          enabled: true
          row_id:
            - id
      - name: content
        embeddings:
          - from: embeddings
        full_text_search:
          enabled: true
```
Then query with hybrid search:

```sql
SELECT id, title, content, fused_score
FROM rrf(
    vector_search(articles, 'machine learning best practices'),
    text_search(articles, 'neural network training', content),
    join_key => 'id',
    time_column => 'published_at',
    recency_decay => 'exponential',
    decay_constant => 0.01,
    decay_scale_secs => 86400
)
WHERE fused_score > 0.01
ORDER BY fused_score DESC
LIMIT 10;
```

Troubleshooting

| Issue | Solution |
|---|---|
| `vector_search` returns no results | Verify embeddings are configured on the column and the model is loaded |
| `text_search` returns no results | Check `full_text_search.enabled: true`; acceleration must be enabled |
| Poor hybrid search relevance | Tune `rank_weight` per query and adjust `k` |
| Results missing recent content | Add `time_column` and `recency_decay` to RRF |
| Chunked vector search not working via SQL | Use the HTTP API instead (the UDTF doesn't support chunked columns yet) |



Related Skills

- spice-data-connector
- spice-models
- spicepod-config
- spice-accelerators