
Long Document LLM Processing Pipeline


Extracted: 2026-02-08 (updated 2026-02-09)

Context: When processing documents over ~50K characters through LLM APIs for extraction, generation, or analysis tasks.

Problem

Sending large documents (>50K chars) as a single LLM prompt causes:

  • Lost in the Middle - LLMs lose attention on content in the middle of long inputs (30%+ accuracy drop, per Liu et al. 2023)

  • High cost - Entire document becomes input tokens even if only portions are relevant

  • No partial retry - If generation fails, must re-process the entire document

  • No parallelism - Single sequential API call

Solution: 6-Step Pipeline

Document
    |
    v
[1] Text Extraction (pymupdf4llm, page_chunks=True)
    |
    v
[2] Structure Detection (Markdown headers, TOC, Japanese patterns)
    |
    v
[3] Section Splitting (5K-30K chars per section)
    |
    v
[4] Breadcrumb Context (prepend section path to each chunk)
    |
    v
[5] Batch API / Async Parallel (50% cost reduction with Batch)
    |
    v
[6] Merge + Deduplicate Results

Step 1: Structured Extraction (pymupdf4llm)

Use page_chunks=True to get structured per-page data with metadata:

import pymupdf4llm

# BAD: flat string, loses structure
text = pymupdf4llm.to_markdown("input.pdf")

# GOOD: structured per-page data with TOC and metadata
chunks = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)

Returns: list[dict] with keys:

- "metadata": {file_path, page_count, page_number, ...}

- "toc_items": [[level, title, page_number], ...]

- "text": "# Heading\n\nContent..."

- "tables": [...], "images": [...], "page_boxes": [...]

Key Parameters

Parameter        Type           Description
page_chunks      bool           Return a list of page dicts instead of a single string
hdr_info         callable/None  Custom header detection; None = auto-detect by font size
page_separators  bool           Insert --- end of page=n --- markers
margins          float/seq      Page margins (exclude running headers/footers)
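Putting the parameters together, a hedged sketch (the margin values are illustrative assumptions, not recommendations):

chunks = pymupdf4llm.to_markdown(
    "input.pdf",
    page_chunks=True,
    hdr_info=None,           # auto-detect headings by font size
    margins=(0, 50, 0, 50),  # (left, top, right, bottom): trim running headers/footers
)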

Heading Detection

hdr_info=None auto-detects headings by font size via IdentifyHeaders and prefixes them with # markers. toc_items returns [level, title, page_number] from the PDF's built-in TOC.
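If font-size auto-detection misfires, hdr_info accepts a custom detector. A hedged sketch, assuming pymupdf4llm's documented contract that the object exposes get_header_id(span, page=None) and returns a "#"-prefix string (the size thresholds are arbitrary assumptions):

class FontSizeHeaders:
    def get_header_id(self, span, page=None):
        size = span["size"]  # font size of the text span
        if size >= 18:
            return "# "
        if size >= 14:
            return "## "
        return ""  # body text

chunks = pymupdf4llm.to_markdown("input.pdf", page_chunks=True, hdr_info=FontSizeHeaders())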

Steps 2-3: Heading-Stack Sectioning with Breadcrumb

Maintain the heading stack as a dict keyed by heading level. When a new heading appears, clear every entry at a level >= its own, then record the new heading:

heading_stack: dict[int, str] = {}

for heading_text, level in headings:
    # Clear deeper/same levels (new H1 clears H2, H3; new H2 clears H3)
    keys_to_remove = [k for k in heading_stack if k >= level]
    for k in keys_to_remove:
        del heading_stack[k]
    heading_stack[level] = heading_text

# Build breadcrumb from remaining stack (sorted by level)
stack_list = [document_title] if document_title else []
for lvl in sorted(heading_stack.keys()):
    stack_list.append(heading_stack[lvl])
breadcrumb = " > ".join(stack_list)

Behavior

Input                 heading_stack                        breadcrumb
本論                  {1: "本論"}                          "本論"
第1章                 {1: "本論", 2: "第1章"}              "本論 > 第1章"
第1節                 {1: "本論", 2: "第1章", 3: "第1節"}  "本論 > 第1章 > 第1節"
第2章 (clears H3)     {1: "本論", 2: "第2章"}              "本論 > 第2章"
結論 (clears H2, H3)  {1: "結論"}                          "結論"

Fallback Chain

  1. Markdown headings (#, ##, ###) → preferred
  2. Japanese headings (第X章, 序論/本論/結論, 1. etc.) → fallback
  3. Single preamble section (level=0) → last resort
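A hedged sketch of the first two rungs of this chain (the regexes are illustrative, not the skill's exact patterns, and the level assigned to Japanese headings is an assumption):

import re

MD_HEADING = re.compile(r"^(#{1,3})\s+(.+)$")  # rung 1: #, ##, ###
JP_HEADING = re.compile(r"^(第[0-9一二三四五六七八九十百]+章|序論|本論|結論|\d+\.)")  # rung 2

def detect_heading(line: str) -> tuple[int, str] | None:
    if m := MD_HEADING.match(line):
        return len(m.group(1)), m.group(2)
    if JP_HEADING.match(line):
        return 1, line.strip()
    return None  # rung 3: caller falls back to a single level=0 preamble section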

Oversized Section Sub-splitting

After heading-based splitting, any section exceeding max_chars gets sub-split at \n\n paragraph boundaries. Sub-sections inherit the parent's breadcrumb.
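A minimal sketch of the sub-split (greedy packing at paragraph boundaries is an assumption; the skill only specifies \n\n split points and breadcrumb inheritance):

def sub_split(text: str, max_chars: int = 30_000) -> list[str]:
    parts: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) > max_chars and current:
            parts.append(current)  # flush before the limit is crossed
            current = para
        else:
            current = candidate
    if current:
        parts.append(current)
    return parts

Each returned part becomes a sub-section carrying the parent's breadcrumb.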

Data Model

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class Section:
    id: str           # "section-0", "section-1-2"
    heading: str      # "第1章 概要"
    level: int        # 1=H1, 2=H2, 3=H3, 0=preamble
    breadcrumb: str   # "正理の海 > 本論 > 第1章"
    text: str         # Section body (including heading line)
    page_range: str   # "pp.3-18" or ""
    char_count: int   # len(text), precomputed

Step 4: Breadcrumb Context in Prompts

Always prepend section hierarchy to LLM prompts:

prompt = (
    f"Document: {title}\n"
    f"Section: {breadcrumb}\n"  # e.g., "Chapter 3 > Section 2"
    f"Pages: {page_range}\n\n"
    f"---\n\n{section_text}"
)

Step 5: API Call Strategy

Decision Matrix: When to Chunk

Document Size     Approach                         Rationale
< 50K chars       Single prompt                    Within the attention sweet spot
50K-200K chars    Structure-aware chunking         Avoid Lost in the Middle
> 200K chars      Structure-aware + model routing  Cost optimization critical
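A minimal dispatch sketch of this matrix (thresholds from the table; the returned labels are illustrative):

def choose_strategy(document: str) -> str:
    n = len(document)
    if n < 50_000:
        return "single-prompt"
    if n <= 200_000:
        return "structure-aware-chunking"
    return "structure-aware-chunking+model-routing"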

Anthropic Batch API (50% Cost Reduction)

For non-real-time processing:

from anthropic import Anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

requests = [
    Request(
        custom_id=f"section-{i}",
        params=MessageCreateParamsNonStreaming(
            model="claude-sonnet-4-5-20250929",
            max_tokens=8192,
            system=[{
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # caching stacks with the batch discount
            }],
            messages=[{"role": "user", "content": section_prompt}],
        ),
    )
    for i, section_prompt in enumerate(section_prompts)
]
batch = client.messages.batches.create(requests=requests)

Key facts: 50% discount, a limit of 100K requests or 256MB per batch, and prompt caching stacks with the discount.
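Step 6 (merge + deduplicate) consumes the batch output. A hedged sketch of polling and collecting results by custom_id via the SDK's batches.retrieve/results calls (the polling interval and success-only filter are illustrative):

import time

while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(60)  # batches finish asynchronously

results: dict[str, str] = {}
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        results[entry.custom_id] = entry.result.message.content[0].text
# Merge in section order (custom_ids encode the index), then deduplicate
# content repeated across section boundaries.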

Model Routing per Section

def select_model(section_text: str) -> str:
    if len(section_text) < 5_000:
        return "claude-haiku-4-5-20251001"  # simple/short section
    return "claude-sonnet-4-5-20250929"     # complex section

Cost Example

572K char Japanese document (20 sections):

Approach                           Estimated Cost
Single chunk, Sonnet               ~$0.90
Structured + Batch + Sonnet        ~$0.45
Structured + Batch + mixed models  ~$0.35

When to Use

  • Processing PDFs/documents >50K characters through any LLM API

  • Building document-to-X pipelines (flashcards, summaries, Q&A datasets)

  • Japanese/multilingual documents with chapter/section structure

  • Any task where hierarchical context improves LLM output quality

References

  • Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023)

  • Anthropic Claude Message Batches API documentation
