Long Document LLM Processing Pipeline
Extracted: 2026-02-08 (updated 2026-02-09)
Context: When processing documents over ~50K characters through LLM APIs for extraction, generation, or analysis tasks.
Problem
Sending large documents (>50K chars) as a single LLM prompt causes:
- Lost in the Middle - LLMs lose attention on content in the middle of long inputs (30%+ accuracy drop, per Liu et al. 2023)
- High cost - the entire document becomes input tokens even if only portions are relevant
- No partial retry - if generation fails, the entire document must be re-processed
- No parallelism - a single sequential API call
Solution: 6-Step Pipeline
```
Document
  |
  v
[1] Text Extraction (pymupdf4llm, page_chunks=True)
  |
  v
[2] Structure Detection (Markdown headers, TOC, Japanese patterns)
  |
  v
[3] Section Splitting (5K-30K chars per section)
  |
  v
[4] Breadcrumb Context (prepend section path to each chunk)
  |
  v
[5] Batch API / Async Parallel (50% cost reduction with Batch)
  |
  v
[6] Merge + Deduplicate Results
```
Step 1: Structured Extraction (pymupdf4llm)
Use page_chunks=True to get structured per-page data with metadata:
```python
import pymupdf4llm

# BAD: flat string, loses structure
text = pymupdf4llm.to_markdown("input.pdf")

# GOOD: structured per-page data with TOC and metadata
chunks = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)
```
Returns: list[dict] with keys:
- "metadata": {file_path, page_count, page_number, ...}
- "toc_items": [[level, title, page_number], ...]
- "text": "# Heading\n\nContent..."
- "tables": [...], "images": [...], "page_boxes": [...]
Key Parameters
| Parameter | Type | Description |
|---|---|---|
| `page_chunks` | bool | Return list of page dicts instead of a string |
| `hdr_info` | callable/None | Custom header detection; None = auto-detect by font size |
| `page_separators` | bool | Insert `--- end of page=n ---` markers |
| `margins` | float/seq | Page margins (exclude headers/footers) |
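Putting the parameters above together; the margin values are illustrative (assuming a (left, top, right, bottom) sequence), so adjust them to your page layout:

```python
chunks = pymupdf4llm.to_markdown(
    "input.pdf",
    page_chunks=True,         # per-page dicts with metadata
    hdr_info=None,            # auto-detect headings by font size
    margins=(0, 50, 0, 50),   # illustrative: crop 50pt top/bottom to drop running headers/footers
)
```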
Heading Detection
hdr_info=None auto-detects headings by font size via IdentifyHeaders and prefixes them with # markers. toc_items returns [level, title, page_number] from the PDF's built-in TOC.
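If font-size auto-detection misfires, hdr_info accepts custom logic. The sketch below assumes the documented pymupdf4llm convention of a callable that receives a span dict (plus a page keyword) and returns a "#"-prefix string; verify against the library version you use, and the size thresholds are purely illustrative:

```python
def my_hdr_info(span, page=None) -> str:
    """Assumed interface: return '# ', '## ', ... or '' for non-headings."""
    size = span["size"]   # font size of the text span
    if size >= 18:
        return "# "
    if size >= 14:
        return "## "
    return ""

chunks = pymupdf4llm.to_markdown("input.pdf", page_chunks=True, hdr_info=my_hdr_info)
```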
Steps 2-3: Heading-Stack Sectioning with Breadcrumb
Use a heading stack implemented as a dict keyed by heading level. When a new heading appears, clear all entries at levels >= its own, then set the new heading.
```python
heading_stack: dict[int, str] = {}

for heading_text, level in headings:
    # Clear deeper/same levels (new H1 clears H2, H3; new H2 clears H3)
    keys_to_remove = [k for k in heading_stack if k >= level]
    for k in keys_to_remove:
        del heading_stack[k]
    heading_stack[level] = heading_text

    # Build breadcrumb from the remaining stack (sorted by level)
    stack_list = [document_title] if document_title else []
    for lvl in sorted(heading_stack.keys()):
        stack_list.append(heading_stack[lvl])
    breadcrumb = " > ".join(stack_list)
```
Behavior
| Input | heading_stack | breadcrumb |
|---|---|---|
| 本論 | {1: "本論"} | "本論" |
| 第1章 | {1: "本論", 2: "第1章"} | "本論 > 第1章" |
| 第1節 | {1: "本論", 2: "第1章", 3: "第1節"} | "本論 > 第1章 > 第1節" |
| 第2章 ← clears H3 | {1: "本論", 2: "第2章"} | "本論 > 第2章" |
| 結論 ← clears H2, H3 | {1: "結論"} | "結論" |
Fallback Chain
- Markdown headings (#, ##, ###) → preferred
- Japanese headings (第X章, 序論/本論/結論, 1. etc.) → fallback
- Single preamble section (level=0) → last resort
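A minimal sketch of this fallback chain; the regex patterns here are illustrative assumptions, not the pipeline's actual ones:

```python
import re

MARKDOWN_HEADING = re.compile(r"^(#{1,3})\s+(.+)$")
JAPANESE_HEADING = re.compile(r"^(第\d+[章節]|序論|本論|結論|\d+\.)\s*(.*)$")

def detect_heading(line: str) -> tuple[int, str] | None:
    """Return (level, heading_text), or None if the line is not a heading."""
    if m := MARKDOWN_HEADING.match(line):
        return len(m.group(1)), m.group(2).strip()
    if m := JAPANESE_HEADING.match(line):
        # Illustrative: chapter-level markers map to level 1, everything else to level 2
        level = 1 if ("章" in m.group(1) or m.group(1) in ("序論", "本論", "結論")) else 2
        return level, line.strip()
    return None  # caller falls back to a single level-0 preamble section
```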
Oversized Section Sub-splitting
After heading-based splitting, any section exceeding max_chars gets sub-split at \n\n paragraph boundaries. Sub-sections inherit the parent's breadcrumb.
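A minimal sub-splitting sketch under those rules; the greedy paragraph packing and the max_chars default are illustrative:

```python
def subsplit(text: str, breadcrumb: str, max_chars: int = 30_000) -> list[tuple[str, str]]:
    """Split an oversized section at blank-line boundaries; each part keeps the parent breadcrumb."""
    if len(text) <= max_chars:
        return [(breadcrumb, text)]
    parts, current = [], ""
    for para in text.split("\n\n"):
        # +2 accounts for the "\n\n" re-join; a single paragraph longer than max_chars stays whole
        if current and len(current) + len(para) + 2 > max_chars:
            parts.append((breadcrumb, current))
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        parts.append((breadcrumb, current))
    return parts
```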
Data Model
```python
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class Section:
    id: str           # "section-0", "section-1-2"
    heading: str      # "第1章 概要"
    level: int        # 1=H1, 2=H2, 3=H3, 0=preamble
    breadcrumb: str   # "正理の海 > 本論 > 第1章"
    text: str         # Section body (including heading line)
    page_range: str   # "pp.3-18" or ""
    char_count: int   # len(text), precomputed
```
Step 4: Breadcrumb Context in Prompts
Always prepend section hierarchy to LLM prompts:
```python
prompt = (
    f"Document: {title}\n"
    f"Section: {breadcrumb}\n"   # e.g., "Chapter 3 > Section 2"
    f"Pages: {page_range}\n\n"
    f"---\n\n{section_text}"
)
```
Step 5: API Call Strategy
Decision Matrix: When to Chunk
| Document Size | Approach | Rationale |
|---|---|---|
| < 50K chars | Single prompt | Within attention sweet spot |
| 50K - 200K chars | Structure-aware chunking | Avoid Lost in the Middle |
| > 200K chars | Structure-aware chunking + model routing | Cost optimization critical |
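A tiny helper encoding this matrix; the thresholds come from the table, the strategy labels are illustrative:

```python
def choose_strategy(doc_chars: int) -> str:
    if doc_chars < 50_000:
        return "single_prompt"                     # within attention sweet spot
    if doc_chars <= 200_000:
        return "structure_aware_chunking"          # avoid Lost in the Middle
    return "structure_aware_chunking_with_model_routing"  # cost optimization critical
```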
Anthropic Batch API (50% Cost Reduction)
For non-real-time processing:
```python
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

requests = [
    Request(
        custom_id=f"section-{i}",
        params=MessageCreateParamsNonStreaming(
            model="claude-sonnet-4-5-20250929",
            max_tokens=8192,
            system=[{
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }],
            messages=[{"role": "user", "content": section_prompt}],
        ),
    )
    for i, section_prompt in enumerate(section_prompts)
]

batch = client.messages.batches.create(requests=requests)
```
Key facts: 50% discount, max 100K requests/256MB per batch, prompt caching stacks with discount.
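A minimal polling-and-collection sketch, assuming the Anthropic Python SDK's batches.retrieve / batches.results accessors; check the SDK docs for the exact result fields:

```python
import time

# Poll until the batch has finished processing
while (status := client.messages.batches.retrieve(batch.id)).processing_status != "ended":
    time.sleep(60)

# Collect successful results keyed by custom_id; failed sections can be retried individually
results = {}
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        results[entry.custom_id] = entry.result.message.content[0].text
```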
Model Routing per Section
```python
def select_model(section_text: str) -> str:
    if len(section_text) < 5_000:
        return "claude-haiku-4-5-20251001"    # Simple/short
    return "claude-sonnet-4-5-20250929"       # Complex
```
Cost Example
572K char Japanese document (20 sections):
| Approach | Estimated Cost |
|---|---|
| Single chunk, Sonnet | ~$0.90 |
| Structured + Batch + Sonnet | ~$0.45 |
| Structured + Batch + mixed models | ~$0.35 |
When to Use
- Processing PDFs/documents >50K characters through any LLM API
- Building document-to-X pipelines (flashcards, summaries, Q&A datasets)
- Japanese/multilingual documents with chapter/section structure
- Any task where hierarchical context improves LLM output quality
References
- Lost in the Middle (Liu et al. 2023)
- Claude Batch Processing API