Kreuzberg Document Extraction
Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 88+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.
Use this skill when writing code that:
- Extracts text or metadata from documents
- Performs OCR on scanned documents or images
- Batch-processes multiple files
- Configures extraction options (output format, chunking, OCR, language detection)
- Implements custom plugins (post-processors, validators, OCR backends)
Installation
Python
pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr] # EasyOCR
pip install kreuzberg[paddleocr] # PaddleOCR
Node.js
npm install @kreuzberg/node
Rust
# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
# embeddings, language-detection, keywords-yake, keywords-rake
CLI
# Download from GitHub releases, or:
cargo install kreuzberg-cli
Quick Start
Python (Async)
from kreuzberg import extract_file
result = await extract_file("document.pdf")
print(result.content) # extracted text
print(result.metadata) # document metadata
print(result.tables) # extracted tables
Python (Sync)
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
print(result.content)
Node.js
import { extractFile } from '@kreuzberg/node';
const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);
Node.js (Sync)
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf');
Rust (Async)
use kreuzberg::{extract_file, ExtractionConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig::default();
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);
Ok(())
}
Rust (Sync) — requires tokio-runtime feature
use kreuzberg::{extract_file_sync, ExtractionConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig::default();
let result = extract_file_sync("document.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
CLI
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown
Configuration
All languages use the same configuration structure with language-appropriate naming conventions.
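As a rough illustration of the naming convention between the Python and Node.js examples in this section, the sketch below converts snake_case keys to camelCase. The helper names `to_camel` and `camelize_keys` are hypothetical and not part of Kreuzberg. The mapping is not guaranteed to be mechanical for every binding: the Rust example in this section uses different chunking field names.

```python
def to_camel(name: str) -> str:
    """Convert a snake_case identifier to camelCase."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camelize_keys(value):
    """Recursively rename dict keys from snake_case to camelCase."""
    if isinstance(value, dict):
        return {to_camel(key): camelize_keys(item) for key, item in value.items()}
    if isinstance(value, list):
        return [camelize_keys(item) for item in value]
    return value


# Plain-dict stand-in for a Python-style config (not a Kreuzberg object).
python_style = {
    "pdf_options": {"passwords": ["secret123"]},
    "chunking": {"max_chars": 1000, "max_overlap": 200},
    "output_format": "markdown",
}
node_style = camelize_keys(python_style)
print(node_style)
```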
Python (snake_case)
from kreuzberg import (
extract_file, ExtractionConfig, OcrConfig, TesseractConfig,
PdfConfig, ChunkingConfig,
)
config = ExtractionConfig(
ocr=OcrConfig(
backend="tesseract",
language="eng",
tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
),
pdf_options=PdfConfig(passwords=["secret123"]),
chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
output_format="markdown",
)
result = await extract_file("document.pdf", config=config)
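To make the two chunking parameters concrete, here is a minimal sliding-window sketch of what a character limit and an overlap mean. It is an illustration only, under the assumption that max_chars caps chunk length and max_overlap is the number of characters shared between neighbouring chunks. Kreuzberg's own chunker may split on different boundaries, and `chunk_text` is a hypothetical helper, not a library function.

```python
def chunk_text(text: str, max_chars: int, max_overlap: int) -> list[str]:
    """Split text into windows of at most max_chars, overlapping by max_overlap."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in [0, max_chars)")

    step = max_chars - max_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", max_chars=4, max_overlap=1))
```

With max_chars=4 and max_overlap=1, each chunk repeats the last character of the previous one, which is the behaviour the overlap setting is meant to provide.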
Node.js (camelCase)
import { extractFile, type ExtractionConfig } from '@kreuzberg/node';
const config: ExtractionConfig = {
ocr: { backend: 'tesseract', language: 'eng' },
pdfOptions: { passwords: ['secret123'] },
chunking: { maxChars: 1000, maxOverlap: 200 },
outputFormat: 'markdown',
};
const result = await extractFile('document.pdf', null, config);
Rust (snake_case)
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".into(),
language: "eng".into(),
..Default::default()
}),
chunking: Some(ChunkingConfig {
max_characters: 1000,
overlap: 200,
..Default::default()
}),
output_format: OutputFormat::Markdown,
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
Config File (TOML)
output_format = "markdown"
[ocr]
backend = "tesseract"
language = "eng"
[chunking]
max_chars = 1000
max_overlap = 200
[pdf_options]
passwords = ["secret123"]
# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
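The auto-discovery described above (looking for kreuzberg.toml in the current directory, then in its parents) can be sketched as an upward directory walk. This is an assumption about the search order, not the CLI's actual implementation, and `find_config` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def find_config(start: Path, filename: str = "kreuzberg.toml") -> Optional[Path]:
    """Return the nearest config file at or above start, or None if absent."""
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the root.
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


print(find_config(Path.cwd()))
```

Under this reading, the nearest file wins, so a project-level kreuzberg.toml would take priority over one higher up the tree.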
Batch Processing
Python
from kreuzberg import batch_extract_files, batch_extract_files_sync
# Async
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])
# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])
for result in results:
print(f"{len(result.content)} chars extracted")
Node.js
import { batchExtractFiles } from '@kreuzberg/node';
const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);
Rust — requires tokio-runtime feature
use kreuzberg::{batch_extract_file, ExtractionConfig};
let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;
CLI
kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
OCR
OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).
Backends
- Tesseract (default): Built-in native binding. All Tesseract languages supported.
- EasyOCR (Python only): pip install kreuzberg[easyocr]. Pass easyocr_kwargs={"gpu": True}.
- PaddleOCR (Python only): pip install kreuzberg[paddleocr]. Pass paddleocr_kwargs={"use_angle_cls": True}.
- Guten (Node.js only): Built-in OCR backend via GutenOcrBackend.
Language Codes
config = ExtractionConfig(ocr=OcrConfig(language="eng")) # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu")) # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all")) # All installed
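The language strings above join codes with a plus sign. A small sketch for building and checking such strings before passing them to the config follows. `split_languages` and `join_languages` are hypothetical helpers, and the sketch does not check whether a given code is actually installed.

```python
def split_languages(spec: str) -> list[str]:
    """Split a language spec such as 'eng+deu' into individual codes."""
    codes = [code.strip() for code in spec.split("+")]
    if not all(codes):
        raise ValueError(f"empty language code in {spec!r}")
    return codes


def join_languages(codes: list[str]) -> str:
    """Join individual codes back into a single language spec."""
    return "+".join(codes)


print(split_languages("eng+deu"))
print(join_languages(["eng", "fra", "deu"]))
```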
Force OCR
config = ExtractionConfig(force_ocr=True) # OCR even if text is extractable
ExtractionResult Fields
| Field | Python | Node.js | Rust | Description |
|---|---|---|---|---|
| Text content | result.content | result.content | result.content | Extracted text (str/String) |
| MIME type | result.mime_type | result.mimeType | result.mime_type | Input document MIME type |
| Metadata | result.metadata | result.metadata | result.metadata | Document metadata (dict/object/HashMap) |
| Tables | result.tables | result.tables | result.tables | Extracted tables with cells + markdown |
| Languages | result.detected_languages | result.detectedLanguages | result.detected_languages | Detected languages (if enabled) |
| Chunks | result.chunks | result.chunks | result.chunks | Text chunks (if chunking enabled) |
| Images | result.images | result.images | result.images | Extracted images (if enabled) |
| Elements | result.elements | result.elements | result.elements | Semantic elements (if element_based format) |
| Pages | result.pages | result.pages | result.pages | Per-page content (if page extraction enabled) |
| Keywords | result.keywords | result.keywords | result.keywords | Extracted keywords (if enabled) |
Error Handling
Python
from kreuzberg import (
extract_file_sync, KreuzbergError, ParsingError,
OCRError, ValidationError, MissingDependencyError,
)
try:
result = extract_file_sync("file.pdf")
except ParsingError as e:
print(f"Failed to parse: {e}")
except OCRError as e:
print(f"OCR failed: {e}")
except ValidationError as e:
print(f"Invalid input: {e}")
except MissingDependencyError as e:
print(f"Missing dependency: {e}")
except KreuzbergError as e:
print(f"Extraction failed: {e}")
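The order of the except clauses above matters: the specific errors come first and KreuzbergError last. The sketch below shows why, using stand-in classes defined locally so it runs without the library. It assumes the specific errors subclass KreuzbergError, which the catch-all position in the example suggests but this document does not state outright.

```python
# Stand-in hierarchy for illustration only; these are NOT the real classes.
class KreuzbergError(Exception):
    pass


class ParsingError(KreuzbergError):
    pass


class OCRError(KreuzbergError):
    pass


def classify(exc: Exception) -> str:
    """Report which handler catches exc when specific handlers come first."""
    try:
        raise exc
    except ParsingError:
        return "parsing"
    except OCRError:
        return "ocr"
    except KreuzbergError:
        return "other"


print(classify(ParsingError("bad file")))
print(classify(KreuzbergError("generic")))
```

If the KreuzbergError handler were listed first, it would catch every subclass and the specific branches would never run.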
Node.js
import {
extractFile, KreuzbergError, ParsingError,
OcrError, ValidationError, MissingDependencyError,
} from '@kreuzberg/node';
try {
const result = await extractFile('file.pdf');
} catch (e) {
if (e instanceof ParsingError) { /* ... */ }
else if (e instanceof OcrError) { /* ... */ }
else if (e instanceof ValidationError) { /* ... */ }
else if (e instanceof KreuzbergError) { /* ... */ }
}
Rust
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};
let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
Ok(result) => println!("{}", result.content),
Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
Err(e) => eprintln!("Error: {e}"),
}
Common Pitfalls
- Python ChunkingConfig fields: Use max_chars and max_overlap, NOT max_characters or overlap.
- Rust extract_file signature: Third argument is &ExtractionConfig (a reference), not Option. Use &ExtractionConfig::default() for defaults.
- Rust feature gates: extract_file_sync, batch_extract_file, and batch_extract_file_sync all require features = ["tokio-runtime"] in Cargo.toml.
- Rust async context: extract_file is async. Use #[tokio::main] or call from an async context.
- CLI --format vs --output-format: --format controls CLI output (text/json). --output-format controls content format (plain/markdown/djot/html).
- Node.js extractFile signature: extractFile(path, mimeType?, config?). mimeType is the second arg (pass null to skip).
- Python detect_mime_type: The function for detecting from bytes is detect_mime_type(data). For paths use detect_mime_type_from_path(path).
- Config file field names: Use snake_case in TOML/YAML/JSON config files (e.g., max_chars, max_overlap, pdf_options).
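One way to avoid mixing up the two CLI flags when scripting is to validate the values before building the command line. The sketch below only constructs the argument list and does not run anything. `build_extract_command` is a hypothetical helper, and it assumes both flags may be combined in one invocation, which the examples in this document do not show explicitly.

```python
import shlex
from typing import Optional

CLI_FORMATS = {"text", "json"}                        # values for --format
CONTENT_FORMATS = {"plain", "markdown", "djot", "html"}  # values for --output-format


def build_extract_command(
    path: str,
    cli_format: Optional[str] = None,
    content_format: Optional[str] = None,
) -> list[str]:
    """Build an argv list for `kreuzberg extract`, rejecting swapped values."""
    argv = ["kreuzberg", "extract", path]
    if cli_format is not None:
        if cli_format not in CLI_FORMATS:
            raise ValueError(f"--format expects one of {sorted(CLI_FORMATS)}")
        argv += ["--format", cli_format]
    if content_format is not None:
        if content_format not in CONTENT_FORMATS:
            raise ValueError(f"--output-format expects one of {sorted(CONTENT_FORMATS)}")
        argv += ["--output-format", content_format]
    return argv


command = build_extract_command("document.pdf", cli_format="json", content_format="markdown")
print(shlex.join(command))
```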
Supported Formats (Summary)
| Category | Extensions |
|---|---|
| PDF | .pdf |
| Word | .docx, .odt |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods |
| Presentations | .pptx, .ppt, .ppsx |
| eBooks | .epub, .fb2 |
| Images | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif, .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm, .svg |
| Markup | .html, .htm, .xhtml, .xml |
| Data | .json, .yaml, .yml, .toml, .csv, .tsv |
| Text | .txt, .md, .markdown, .djot, .rst, .org, .rtf |
| Email | .eml, .msg |
| Archives | .zip, .tar, .tgz, .gz, .7z |
| Academic | .bib, .biblatex, .ris, .nbib, .enw, .csl, .tex, .latex, .typ, .jats, .ipynb, .docbook, .opml, .pod, .mdoc, .troff |
See references/supported-formats.md for the complete format reference with MIME types.
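For routing files before extraction, the summary table can be turned into a lookup from extension to category. The sketch below copies only a few rows, using PDF and Email as the category names for .pdf and .eml/.msg. `category_for` is a hypothetical helper, and matching on the final suffix alone is a simplification.

```python
from pathlib import Path
from typing import Optional

# Partial copy of the summary table; extend with the remaining rows as needed.
CATEGORIES = {
    "PDF": [".pdf"],
    "Word": [".docx", ".odt"],
    "Presentations": [".pptx", ".ppt", ".ppsx"],
    "eBooks": [".epub", ".fb2"],
    "Email": [".eml", ".msg"],
    "Archives": [".zip", ".tar", ".tgz", ".gz", ".7z"],
}

EXTENSION_TO_CATEGORY = {
    extension: category
    for category, extensions in CATEGORIES.items()
    for extension in extensions
}


def category_for(path: str) -> Optional[str]:
    """Return the table category for a path, based on its final suffix."""
    return EXTENSION_TO_CATEGORY.get(Path(path).suffix.lower())


print(category_for("Report.PDF"))
print(category_for("backup.tar.gz"))
```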
Additional Resources
Detailed reference files for specific topics:
- Python API Reference — All functions, config classes, plugin protocols, exact signatures
- Node.js API Reference — All functions, TypeScript interfaces, worker pool APIs
- Rust API Reference — All functions with feature gates, structs, Cargo.toml examples
- CLI Reference — All commands, flags, config precedence, exit codes
- Configuration Reference — TOML/YAML/JSON formats, auto-discovery, env vars, full schema
- Supported Formats — All 85+ formats with file extensions and MIME types
- Advanced Features — Plugins, embeddings, MCP server, API server, security limits
- Other Language Bindings — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker
Full documentation: https://docs.kreuzberg.dev
GitHub: https://github.com/kreuzberg-dev/kreuzberg