spice-search

Search data using vector similarity, full-text keywords, or hybrid methods with Reciprocal Rank Fusion (RRF). Use when setting up embeddings for search, configuring full-text indexing, writing vector_search/text_search/rrf SQL queries, using the /v1/search HTTP API, or configuring vector engines like S3 Vectors.


Install the "spice-search" skill:

```shell
npx skills add spiceai/skills/spiceai-skills-spice-search
```

Search Data

Spice provides integrated search capabilities: vector (semantic) search, full-text (keyword) search, and hybrid search with Reciprocal Rank Fusion (RRF) — all via SQL functions and HTTP APIs. Search indexes are built on top of accelerated datasets.

Search Methods

| Method | When to Use | Requires |
|---|---|---|
| Vector search | Semantic similarity, RAG, recommendations | Embedding model + column embeddings |
| Full-text search | Keyword/phrase matching, exact terms | `full_text_search.enabled: true` on columns |
| Hybrid (RRF) | Best of both: combines rankings from multiple methods | Multiple search methods configured |
| Lexical (LIKE/=) | Exact pattern or value matching | Nothing extra |

Set Up Vector Search

1. Define an Embedding Model

```yaml
embeddings:
  - name: local_embeddings
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

  - name: openai_embeddings
    from: openai:text-embedding-3-small
    params:
      openai_api_key: ${ secrets:OPENAI_API_KEY }
```

Supported Embedding Providers

| Provider | From Format | Status |
|---|---|---|
| OpenAI | `openai:text-embedding-3-large` | Release Candidate |
| HuggingFace | `huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2` | Release Candidate |
| Local file | `file:model.safetensors` | Release Candidate |
| Azure OpenAI | `azure:my-deployment` | Alpha |
| Google AI | `google:text-embedding-004` | Alpha |
| Amazon Bedrock | `bedrock:amazon.titan-embed-text-v1` | Alpha |
| Databricks | `databricks:endpoint` | Alpha |
| Model2Vec | `model2vec:model-name` | Alpha |

2. Configure Dataset Columns for Embeddings

```yaml
datasets:
  - from: postgres:documents
    name: docs
    acceleration:
      enabled: true
    columns:
      - name: content
        embeddings:
          - from: local_embeddings
            row_id: id
            chunking:
              enabled: true
              target_chunk_size: 512
              overlap_size: 64
```

Embedding Methods

| Method | Description | When to Use |
|---|---|---|
| Accelerated | Precomputed and stored | Faster queries, frequently searched datasets |
| JIT (Just-in-Time) | Computed at query time (no acceleration) | Large or rarely queried datasets |
| Passthrough | Pre-existing embeddings used directly | Source already has `<col>_embedding` columns |
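As a sketch of the JIT mode (dataset and column names illustrative): omitting the `acceleration` block leaves embeddings to be computed at query time rather than precomputed:

```yaml
datasets:
  - from: postgres:archive_documents
    name: archive            # not accelerated: embeddings computed just-in-time
    columns:
      - name: content
        embeddings:
          - from: local_embeddings
```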

3. Query via HTTP API

```shell
curl -X POST http://localhost:8090/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "datasets": ["docs"],
    "text": "cutting edge AI",
    "where": "author=\"jeadie\"",
    "additional_columns": ["title", "state"],
    "limit": 5
  }'
```
| Field | Required | Description |
|---|---|---|
| `text` | Yes | Search text |
| `datasets` | No | Datasets to search (null = all searchable) |
| `additional_columns` | No | Extra columns to return |
| `where` | No | SQL filter predicate |
| `limit` | No | Max results per dataset |

To retrieve full documents (not just chunks), include the embedding column name in additional_columns.
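For example, if `content` is the embedded column, a request body like the following (values illustrative) returns the full `content` value alongside each matching chunk:

```json
{
  "datasets": ["docs"],
  "text": "cutting edge AI",
  "additional_columns": ["title", "content"],
  "limit": 5
}
```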

4. Query via SQL UDTF

```sql
SELECT id, title, score
FROM vector_search(docs, 'cutting edge AI')
WHERE state = 'Open'
ORDER BY score DESC
LIMIT 5;
```

vector_search signature:

```sql
vector_search(
  table STRING,          -- Dataset name (required)
  query STRING,          -- Search text (required)
  col STRING,            -- Column (optional if single embedding column)
  limit INTEGER,         -- Max results (default: 1000)
  include_score BOOLEAN  -- Include score column (default: TRUE)
) RETURNS TABLE
```
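When a dataset has more than one embedding column, the column must be passed explicitly. A sketch using the positional arguments above (dataset and column names illustrative):

```sql
SELECT id, title, score
FROM vector_search(docs, 'cutting edge AI', content, 10, TRUE)
ORDER BY score DESC;
```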

Limitation: vector_search UDTF does not yet support chunked embedding columns. Use the HTTP API for chunked data.

Set Up Full-Text Search

Full-text search uses BM25 scoring (powered by Tantivy) for keyword relevance ranking.

1. Enable Indexing on Columns

```yaml
datasets:
  - from: postgres:articles
    name: articles
    acceleration:
      enabled: true
    columns:
      - name: title
        full_text_search:
          enabled: true
          row_id:
            - id
      - name: body
        full_text_search:
          enabled: true
```

2. Query via SQL UDTF

```sql
SELECT id, title, score
FROM text_search(articles, 'search keywords', body)
ORDER BY score DESC
LIMIT 5;
```

text_search signature:

```sql
text_search(
  table STRING,          -- Dataset name (required)
  query STRING,          -- Keywords/phrase (required)
  col STRING,            -- Column (required if multiple indexed columns)
  limit INTEGER,         -- Max results (default: 1000)
  include_score BOOLEAN  -- Include score column (default: TRUE)
) RETURNS TABLE
```
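BM25 combines term frequency, inverse document frequency, and document-length normalization. The toy scorer below illustrates the ranking behavior; Spice's actual scoring is performed by Tantivy, not this code:

```python
import math

# Toy BM25 scorer with the common k1/b defaults (illustrative only).
def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    avgdl = sum(len(d) for d in corpus) / len(corpus)  # average document length
    n = len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # document frequency
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))  # inverse document frequency
        tf = doc.count(term)                             # term frequency in this doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "neural networks learn representations".split(),
    "gradient descent optimizes weights".split(),
    "cooking pasta requires boiling water".split(),
]
```

A document containing the query terms scores above zero; one without them scores zero, so keyword overlap drives the ranking.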

Hybrid Search with RRF

Reciprocal Rank Fusion merges rankings from multiple search methods. Each query runs independently, then results are combined:

RRF Score = Σ(rank_weight / (k + rank))

Documents appearing across multiple result sets receive higher scores.
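The fusion step can be sketched in a few lines (a simplified model, not Spice's implementation; `k` and the per-list weights correspond to the `k` and `rank_weight` parameters of `rrf`):

```python
# Simplified Reciprocal Rank Fusion: merge ranked lists of document ids.
def rrf_scores(rankings, k=60.0, weights=None):
    """rankings: list of ranked id lists (best first); weights: per-list rank_weight."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes rank_weight / (k + rank)
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

vector_hits = ["d3", "d1", "d2"]   # semantic ranking
text_hits = ["d1", "d4", "d3"]     # keyword ranking
fused = rrf_scores([vector_hits, text_hits])
```

Here `d1` fuses highest because it appears near the top of both lists, while documents found by only one method score lower.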

Basic Hybrid Search

```sql
SELECT id, title, content, fused_score
FROM rrf(
    vector_search(documents, 'machine learning algorithms'),
    text_search(documents, 'neural networks deep learning', content),
    join_key => 'id'
)
ORDER BY fused_score DESC
LIMIT 5;
```

Weighted Ranking

```sql
SELECT fused_score, title, content
FROM rrf(
    text_search(posts, 'artificial intelligence', rank_weight => 50.0),
    vector_search(posts, 'AI machine learning', rank_weight => 200.0)
)
ORDER BY fused_score DESC
LIMIT 10;
```

Recency-Boosted Search

```sql
-- Exponential decay (1-hour scale)
SELECT fused_score, title, created_at
FROM rrf(
    text_search(news, 'breaking news'),
    vector_search(news, 'latest updates'),
    time_column => 'created_at',
    recency_decay => 'exponential',
    decay_constant => 0.05,
    decay_scale_secs => 3600
)
ORDER BY fused_score DESC
LIMIT 10;
```

```sql
-- Linear decay (24-hour window)
SELECT fused_score, content
FROM rrf(
    text_search(posts, 'trending'),
    vector_search(posts, 'viral popular'),
    time_column => 'created_at',
    recency_decay => 'linear',
    decay_window_secs => 86400
)
ORDER BY fused_score DESC;
```
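The decay knobs can be read as follows. This is an assumed functional form for illustration only; the exact formula Spice applies internally is not documented here:

```python
import math

# Assumed decay shapes: exponential decay scaled by decay_constant and
# decay_scale_secs; linear decay reaching zero at decay_window_secs.
def exponential_decay(age_secs, decay_constant=0.01, scale_secs=86400):
    return math.exp(-decay_constant * age_secs / scale_secs)

def linear_decay(age_secs, window_secs=86400):
    return max(0.0, 1.0 - age_secs / window_secs)
```

Under either shape, a brand-new document keeps its full fused score and older documents are progressively discounted.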

Cross-Language Search

```sql
SELECT fused_score, text, langs
FROM rrf(
    vector_search(posts, 'ultimas noticias', rank_weight => 100),
    text_search(posts, 'news'),
    time_column => 'created_at',
    recency_decay => 'exponential',
    decay_constant => 0.05,
    decay_scale_secs => 3600
)
WHERE trim(text) != ''
ORDER BY fused_score DESC LIMIT 15;
```

rrf Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `query_1, query_2, ...` | Search UDTF | Yes (2+) | `vector_search` or `text_search` calls (variadic) |
| `join_key` | String | No | Column for joining results (default: auto-hash) |
| `k` | Float | No | Smoothing parameter (default: 60.0; lower = more aggressive) |
| `time_column` | String | No | Timestamp column for recency boosting |
| `recency_decay` | String | No | `'exponential'` (default) or `'linear'` |
| `decay_constant` | Float | No | Rate for exponential decay (default: 0.01) |
| `decay_scale_secs` | Float | No | Time scale for exponential decay (default: 86400) |
| `decay_window_secs` | Float | No | Window for linear decay (default: 86400) |
| `rank_weight` | Float | No | Per-query weight (specified inside search calls) |
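For example, lowering `k` from its default of 60.0 makes top-ranked documents dominate the fused score (dataset and column names illustrative):

```sql
SELECT fused_score, title
FROM rrf(
    vector_search(docs, 'machine learning'),
    text_search(docs, 'machine learning', content),
    k => 20.0
)
ORDER BY fused_score DESC
LIMIT 5;
```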

Vector Engines

Store and index embeddings at scale using dedicated vector engines:

```yaml
datasets:
  - from: postgres:documents
    name: docs
    acceleration:
      enabled: true
    columns:
      - name: content
        embeddings:
          - from: embed_model
            row_id: id
        metadata:
          vectors: non-filterable
      - name: category
        metadata:
          vectors: filterable # enable filtering on this column
    vectors:
      enabled: true
      engine: s3_vectors
      params:
        s3_vectors_bucket: my-bucket
        s3_vectors_region: us-east-1
```
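Engine-backed datasets should be queryable through the same `vector_search` UDTF and HTTP API, with columns marked `filterable` available in predicates. A sketch assuming the spicepod above (query text illustrative):

```sql
SELECT id, content, score
FROM vector_search(docs, 'renewable energy storage')
WHERE category = 'science'
ORDER BY score DESC
LIMIT 5;
```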

Lexical Search (SQL)

Standard SQL filtering:

```sql
SELECT * FROM my_table WHERE column LIKE '%substring%';
SELECT * FROM my_table WHERE column = 'exact value';
SELECT * FROM my_table WHERE regexp_like(column, '^spice.*ai$');
```

CLI Search

```shell
spice search "cutting edge AI" --dataset docs --limit 5
spice search --cache-control no-cache "search terms"
```

Complete Example

```yaml
version: v1
kind: Spicepod
name: search_app

secrets:
  - from: env
    name: env

embeddings:
  - name: embeddings
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

datasets:
  - from: file:articles.parquet
    name: articles
    acceleration:
      enabled: true
      engine: duckdb
    columns:
      - name: title
        full_text_search:
          enabled: true
          row_id:
            - id
      - name: content
        embeddings:
          - from: embeddings
        full_text_search:
          enabled: true
```
Then query with hybrid search:

```sql
SELECT id, title, content, fused_score
FROM rrf(
    vector_search(articles, 'machine learning best practices'),
    text_search(articles, 'neural network training', content),
    join_key => 'id',
    time_column => 'published_at',
    recency_decay => 'exponential',
    decay_constant => 0.01,
    decay_scale_secs => 86400
)
WHERE fused_score > 0.01
ORDER BY fused_score DESC
LIMIT 10;
```

Troubleshooting

| Issue | Solution |
|---|---|
| `vector_search` returns no results | Verify embeddings are configured on the column and the model is loaded |
| `text_search` returns no results | Check `full_text_search.enabled: true`; acceleration must be enabled |
| Poor hybrid search relevance | Tune `rank_weight` per query and adjust `k` |
| Results missing recent content | Add `time_column` and `recency_decay` to RRF |
| Chunked vector search not working via SQL | Use the HTTP API instead (the UDTF doesn't support chunked columns yet) |



Related Skills

- spice-data-connector
- spice-models
- spicepod-config
- spice-accelerators