PDF Processing Skill

Overview

This skill enables comprehensive PDF operations through Python libraries and command-line tools. Use it for reading, creating, modifying, and analyzing PDF documents.

Quick Start

from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)

When to Use

Converting PDFs to Markdown - Use OpenAI Codex for intelligent conversion (RECOMMENDED FIRST STEP)
Extracting text and metadata from PDF files
Merging multiple PDFs into a single document
Splitting large PDFs into individual pages
Adding watermarks or annotations to PDFs
Password-protecting or decrypting PDFs
Extracting images from PDF documents
OCR processing for scanned documents
Creating new PDFs with reportlab
Extracting tables from structured PDFs

PDF to Markdown Conversion (OpenAI Codex)

IMPORTANT: For all PDF documents, use OpenAI Codex to convert the contents to a .md file first, then work from the markdown.

Why Convert to Markdown First?

Better structure preservation - Maintains headings, lists, tables
Easier text processing - Standard markdown format
Improved AI understanding - Codex understands document structure
Format flexibility - Markdown can be converted to any format
Version control friendly - Plain text, diff-friendly

OpenAI Codex Conversion

Prerequisites:

pip install openai pypdf
export OPENAI_API_KEY="your-api-key-here"

Basic Conversion:

import openai
from pypdf import PdfReader
from pathlib import Path


def pdf_to_markdown_codex(pdf_path, output_md_path=None, model="gpt-4.1"):
    """
    Convert PDF to markdown using OpenAI Codex.

    Args:
        pdf_path: Path to PDF file
        output_md_path: Optional path for output .md file (auto-generated if None)
        model: OpenAI model to use (gpt-4.1, gpt-4.1-mini, etc.)

    Returns:
        Path to generated markdown file
    """
    # Extract text from PDF
    reader = PdfReader(pdf_path)
    pdf_text = ""

    for page_num, page in enumerate(reader.pages, 1):
        text = page.extract_text()
        pdf_text += f"\n\n--- Page {page_num} ---\n\n{text}"

    # Generate markdown using OpenAI Codex
    client = openai.OpenAI()

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": """You are an expert document converter. Convert the provided PDF text
                to well-structured markdown format. Preserve:
                - Document structure (headings, sections)
                - Lists and bullet points
                - Tables (convert to markdown tables)
                - Code blocks and technical content
                - Links and references

                Format the output as clean, readable markdown."""
            },
            {
                "role": "user",
                "content": f"Convert this PDF text to markdown:\n\n{pdf_text}"
            }
        ],
        temperature=0.3,  # Lower temperature for more consistent formatting
    )

    markdown_content = response.choices[0].message.content

    # Save to file
    if output_md_path is None:
        pdf_stem = Path(pdf_path).stem
        output_md_path = Path(pdf_path).parent / f"{pdf_stem}.md"

    # Ensure parent directory exists
    Path(output_md_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_md_path).write_text(markdown_content, encoding='utf-8')

    return output_md_path


# Usage
md_file = pdf_to_markdown_codex("document.pdf")
print(f"Markdown saved to: {md_file}")

Batch Conversion:

from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def batch_pdf_to_markdown(pdf_directory, output_directory=None, model="gpt-4.1"):
    """
    Convert all PDFs in a directory to markdown.

    Args:
        pdf_directory: Directory containing PDF files
        output_directory: Optional output directory (defaults to pdf_directory/markdown)
        model: OpenAI model to use
    """
    pdf_dir = Path(pdf_directory)

    if output_directory is None:
        output_dir = pdf_dir / "markdown"
    else:
        output_dir = Path(output_directory)

    output_dir.mkdir(parents=True, exist_ok=True)

    pdf_files = list(pdf_dir.glob("*.pdf"))
    total = len(pdf_files)

    logger.info(f"Found {total} PDF files to convert")

    for i, pdf_file in enumerate(pdf_files, 1):
        try:
            output_md = output_dir / f"{pdf_file.stem}.md"

            logger.info(f"[{i}/{total}] Converting {pdf_file.name}...")
            # Relies on pdf_to_markdown_codex() from the basic conversion example
            pdf_to_markdown_codex(pdf_file, output_md, model=model)
            logger.info(f"✓ Saved to {output_md.name}")

        except Exception as e:
            # Log the failure and continue with the remaining files
            logger.error(f"✗ Failed to convert {pdf_file.name}: {e}")

    logger.info(f"\nConversion complete! Files in: {output_dir}")


# Usage
batch_pdf_to_markdown("/path/to/pdfs", model="gpt-4.1")

Chunked Conversion for Large PDFs:

def pdf_to_markdown_chunked(pdf_path, output_md_path=None, chunk_pages=10, model="gpt-4.1"):
    """
    Convert large PDF by processing in chunks.

    Args:
        pdf_path: Path to PDF file
        output_md_path: Optional output path
        chunk_pages: Number of pages per chunk
        model: OpenAI model to use
    """
    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)

    # One client is enough for all chunks
    client = openai.OpenAI()
    markdown_sections = []

    for start_page in range(0, total_pages, chunk_pages):
        end_page = min(start_page + chunk_pages, total_pages)

        # Extract chunk
        chunk_text = ""
        for page_num in range(start_page, end_page):
            text = reader.pages[page_num].extract_text()
            chunk_text += f"\n\n--- Page {page_num + 1} ---\n\n{text}"

        # Convert chunk
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "system",
                    "content": "Convert PDF text to markdown. Maintain structure and formatting."
                },
                {
                    "role": "user",
                    "content": f"Convert pages {start_page + 1}-{end_page} to markdown:\n\n{chunk_text}"
                }
            ],
            temperature=0.3,
        )

        markdown_sections.append(response.choices[0].message.content)
        print(f"Processed pages {start_page + 1}-{end_page}/{total_pages}")

    # Combine sections
    full_markdown = "\n\n---\n\n".join(markdown_sections)

    # Save
    if output_md_path is None:
        output_md_path = Path(pdf_path).with_suffix('.md')

    # Ensure parent directory exists
    Path(output_md_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_md_path).write_text(full_markdown, encoding='utf-8')

    return output_md_path


# Usage
md_file = pdf_to_markdown_chunked("large_document.pdf", chunk_pages=20)

Workflow: PDF → Markdown → Further Processing:

from pathlib import Path


def pdf_workflow(pdf_path):
    """
    Complete workflow: PDF → Markdown → Process markdown.

    Returns:
        dict with paths to original PDF, markdown, and processed content
    """
    # Step 1: Convert PDF to markdown using Codex
    print("Step 1: Converting PDF to markdown...")
    md_path = pdf_to_markdown_codex(pdf_path)

    # Step 2: Read markdown for further processing
    print("Step 2: Reading markdown content...")
    markdown_content = Path(md_path).read_text(encoding='utf-8')

    # Step 3: Further processing (example: extract headings)
    print("Step 3: Processing markdown...")
    headings = [line for line in markdown_content.split('\n') if line.startswith('#')]

    # Step 4: Additional analysis
    word_count = len(markdown_content.split())

    return {
        'pdf_path': pdf_path,
        'markdown_path': md_path,
        'markdown_content': markdown_content,
        'headings': headings,
        'word_count': word_count,
    }


# Usage
result = pdf_workflow("technical_document.pdf")
print(f"Markdown saved: {result['markdown_path']}")
print(f"Found {len(result['headings'])} headings")
print(f"Word count: {result['word_count']}")

# Now work with the markdown
with open(result['markdown_path'], encoding='utf-8') as f:
    markdown = f.read()
    # Do further processing with clean markdown

Cost-Effective Options:

# Use GPT-4.1-mini for cost savings
md_file = pdf_to_markdown_codex("document.pdf", model="gpt-4.1-mini")

# Or use local extraction + Codex for formatting only
from pathlib import Path

import openai
from pypdf import PdfReader


def hybrid_conversion(pdf_path):
    """Extract text locally, use Codex only for formatting."""
    # Extract text (free)
    reader = PdfReader(pdf_path)
    raw_text = ""
    for page in reader.pages:
        raw_text += page.extract_text()

    # Use Codex just for markdown formatting (lower cost)
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {
                "role": "system",
                "content": "Format the text as markdown. Add appropriate headings, lists, and structure."
            },
            {
                "role": "user",
                "content": raw_text
            }
        ],
        temperature=0.3,
    )

    markdown = response.choices[0].message.content
    output_path = Path(pdf_path).with_suffix('.md')
    output_path.write_text(markdown, encoding='utf-8')

    return output_path

Best Practices:

Always convert to markdown first - Makes downstream processing easier
Use chunking for large PDFs - Avoids token limits and API timeouts
Cache conversions - Store markdown files to avoid re-conversion
Choose model based on complexity - GPT-4.1 for complex docs, GPT-4.1-mini for simple ones
Validate output - Check that markdown structure makes sense
Handle errors gracefully - Log failures, continue batch processing
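
The "Cache conversions" practice can be sketched as a freshness check that runs before the converter is called. This is a minimal sketch, assuming the cached markdown sits next to the PDF and that file modification times are a good enough staleness signal; `needs_conversion` is a hypothetical helper name, not part of any library used in this skill.

```python
from pathlib import Path


def needs_conversion(pdf_path, md_path=None):
    """Return True when no up-to-date markdown exists for the PDF.

    Hypothetical helper: assumes the cached markdown lives next to the
    PDF (same name, .md suffix) unless md_path is given.
    """
    pdf = Path(pdf_path)
    md = Path(md_path) if md_path is not None else pdf.with_suffix(".md")
    if not md.exists():
        return True
    # Re-convert when the PDF was modified after the markdown was written
    return pdf.stat().st_mtime > md.stat().st_mtime
```

A caller would skip the API request whenever `needs_conversion(pdf_path)` is False and read the existing .md file instead.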

CLI Tool:

#!/usr/bin/env python3
"""PDF to Markdown converter using OpenAI Codex."""

import argparse
import logging
from pathlib import Path

import openai
from pypdf import PdfReader

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def pdf_to_markdown_codex(pdf_path, output_md_path=None, model="gpt-4.1"):
    """
    Convert PDF to markdown using OpenAI Codex.

    Args:
        pdf_path: Path to PDF file
        output_md_path: Optional path for output .md file (auto-generated if None)
        model: OpenAI model to use (gpt-4.1, gpt-4.1-mini, etc.)

    Returns:
        Path to generated markdown file
    """
    # Extract text from PDF
    reader = PdfReader(pdf_path)
    pdf_text = ""

    for page_num, page in enumerate(reader.pages, 1):
        text = page.extract_text()
        pdf_text += f"\n\n--- Page {page_num} ---\n\n{text}"

    # Generate markdown using OpenAI Codex
    client = openai.OpenAI()

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": """You are an expert document converter. Convert the provided PDF text
                to well-structured markdown format. Preserve:
                - Document structure (headings, sections)
                - Lists and bullet points
                - Tables (convert to markdown tables)
                - Code blocks and technical content
                - Links and references

                Format the output as clean, readable markdown."""
            },
            {
                "role": "user",
                "content": f"Convert this PDF text to markdown:\n\n{pdf_text}"
            }
        ],
        temperature=0.3,
    )

    markdown_content = response.choices[0].message.content

    # Save to file
    if output_md_path is None:
        pdf_stem = Path(pdf_path).stem
        output_md_path = Path(pdf_path).parent / f"{pdf_stem}.md"

    # Ensure parent directory exists
    Path(output_md_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_md_path).write_text(markdown_content, encoding='utf-8')

    return output_md_path

def batch_pdf_to_markdown(pdf_directory, output_directory=None, model="gpt-4.1"):
    """
    Convert all PDFs in a directory to markdown.

    Args:
        pdf_directory: Directory containing PDF files
        output_directory: Optional output directory (defaults to pdf_directory/markdown)
        model: OpenAI model to use
    """
    pdf_dir = Path(pdf_directory)

    if output_directory is None:
        output_dir = pdf_dir / "markdown"
    else:
        output_dir = Path(output_directory)

    output_dir.mkdir(parents=True, exist_ok=True)

    pdf_files = list(pdf_dir.glob("*.pdf"))
    total = len(pdf_files)

    logger.info(f"Found {total} PDF files to convert")

    for i, pdf_file in enumerate(pdf_files, 1):
        try:
            output_md = output_dir / f"{pdf_file.stem}.md"

            logger.info(f"[{i}/{total}] Converting {pdf_file.name}...")
            pdf_to_markdown_codex(pdf_file, output_md, model=model)
            logger.info(f"✓ Saved to {output_md.name}")

        except Exception as e:
            logger.error(f"✗ Failed to convert {pdf_file.name}: {e}")

    logger.info(f"\nConversion complete! Files in: {output_dir}")

def main():
    parser = argparse.ArgumentParser(description='Convert PDF to Markdown using OpenAI')
    parser.add_argument('input', help='PDF file or directory')
    parser.add_argument('-o', '--output', help='Output directory or file')
    parser.add_argument('-m', '--model', default='gpt-4.1',
                        help='OpenAI model (gpt-4.1, gpt-4.1-mini)')
    parser.add_argument('--chunk-pages', type=int, default=10,
                        help='Pages per chunk (unused in basic mode)')

    args = parser.parse_args()

    input_path = Path(args.input)

    if input_path.is_file():
        # Single file
        output = args.output or input_path.with_suffix('.md')
        md_path = pdf_to_markdown_codex(input_path, output, model=args.model)
        print(f"✓ Converted: {md_path}")
    else:
        # Directory
        batch_pdf_to_markdown(input_path, args.output, model=args.model)


if __name__ == '__main__':
    main()

Save as pdf2md.py and use:

# Single file
python pdf2md.py document.pdf

# With GPT-4.1-mini (cheaper)
python pdf2md.py document.pdf --model gpt-4.1-mini

Python Libraries

pypdf - Core PDF Operations

Merging PDFs:

from pypdf import PdfWriter

# PdfMerger was removed in pypdf 5.0; PdfWriter.append() handles merging
writer = PdfWriter()
writer.append("file1.pdf")
writer.append("file2.pdf")
writer.write("merged.pdf")
writer.close()

Splitting PDFs:

from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
for i, page in enumerate(reader.pages):
    # One single-page output file per input page
    writer = PdfWriter()
    writer.add_page(page)
    writer.write(f"page_{i+1}.pdf")

Extracting Metadata:

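from pypdf import PdfReader

reader = PdfReader("document.pdf")
info = reader.metadata  # May be None when the PDF carries no document info
if info is not None:
    print(f"Author: {info.author}")
    print(f"Title: {info.title}")
print(f"Pages: {len(reader.pages)}")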

pdfplumber - Advanced Text Extraction

Text with Layout Preservation:

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        # layout=True asks pdfplumber to approximate the page's visual layout
        text = page.extract_text(layout=True)
        print(text)

Table Extraction:

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

reportlab - Creating PDFs

Create PDF from Scratch:

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Hello, World!")
c.showPage()
c.save()

Multi-page Documents:

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Title", styles['Heading1']))
story.append(Paragraph("Body text here.", styles['Normal']))

doc.build(story)

Command-Line Tools

pdftotext (Poppler)

pdftotext document.pdf output.txt
pdftotext -layout document.pdf output.txt  # Preserve layout

qpdf

# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf document.pdf --pages . 1-5 -- first_five.pdf

# Decrypt
qpdf --decrypt encrypted.pdf decrypted.pdf

pdftk

# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk document.pdf burst output page_%02d.pdf

# Rotate
pdftk document.pdf cat 1-endeast output rotated.pdf

PDF-Large-Reader - Robust Extraction for Large Files

For large PDFs (100MB+, 1000+ pages), use the pdf-large-reader library with robust table extraction.

Why Use PDF-Large-Reader?

Memory-efficient - Handles 100MB+ PDFs without memory issues
Robust table extraction - Handles irregular tables with column count normalization
Multiple output formats - Generator (streaming), List, or Plain Text
Automatic strategy selection - Intelligent chunk size calculation
Complete extraction - Text, images, tables, and metadata in one pass
High test coverage - 93.58% coverage with 215 tests

Installation

# From the pdf-large-reader repository
cd /mnt/github/workspace-hub/pdf-large-reader
pip install -e .

# Or with extras
pip install -e ".[dev,progress]"

Quick Start

from pdf_large_reader import process_large_pdf, extract_text_only, extract_everything

# Simple text extraction
text = extract_text_only("large_document.pdf")
print(text)

# Process with automatic strategy selection
pages = process_large_pdf(
    "large_document.pdf",
    output_format="list",
    extract_images=True,
    extract_tables=True,
)

# Memory-efficient streaming for very large files
for page in process_large_pdf("huge_file.pdf", output_format="generator"):
    print(f"Page {page.page_number}: {len(page.text)} characters")

Robust Table Extraction

NEW: Column Count Normalization (v1.3.0+)

The table extraction now handles irregular tables with different column counts:

from pdf_large_reader import extract_everything

# Extract everything including tables with robust error handling
pages = extract_everything("technical_standard.pdf")

for page in pages:
    if 'tables' in page.metadata:
        tables = page.metadata['tables']
        print(f"Page {page.page_number}: Found {len(tables)} tables")

        for i, table_df in enumerate(tables):
            print(f"  Table {i+1}: {table_df.shape[0]} rows x {table_df.shape[1]} cols")
            print(table_df.head())

How It Works:

Detects table-like structures from text positioning
Normalizes column counts across all rows
Pads short rows with empty strings
Gracefully handles malformed tables with try-except
Logs warnings instead of crashing
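
The normalization and padding steps above can be illustrated with a few lines of plain Python. This is a minimal sketch of the idea only, not the library's actual implementation; `normalize_rows` is a hypothetical name and the sample rows are made up for illustration.

```python
def normalize_rows(rows, fill=""):
    """Pad every row to the width of the widest row (hypothetical helper)."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


# Made-up irregular table: rows have 3, 2 and 4 cells
table = [
    ["Item", "Qty", "Unit"],
    ["Bolt", "4"],
    ["Nut", "4", "ea", "spare"],
]
for row in normalize_rows(table):
    print(row)
```

After padding, every row has the same number of cells, so the rows can be loaded into a rectangular structure such as a DataFrame without a shape error.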

Typical Performance:

API Std 650 (28 MB, 461 pages): 14,648 chars/sec, 5.18 pages/sec
API RP 579 (41 MB, 966 pages): 2,090 chars/sec, 8.48 pages/sec

Command Line Usage

# Extract text from PDF
pdf-large-reader document.pdf

# Save to file
pdf-large-reader document.pdf --output result.txt

# Extract with images and tables
pdf-large-reader document.pdf --extract-images --extract-tables

# Use generator format for large files
pdf-large-reader huge.pdf --output-format generator

# Verbose output
pdf-large-reader document.pdf --verbose

API Reference

# Main entry point with automatic strategy
process_large_pdf(
    pdf_path,
    output_format="generator",   # "generator" (default), "list", or "text"
    extract_images=False,        # Extract images
    extract_tables=False,        # Extract tables with normalization
    chunk_size=None,             # Auto-calculated if None
    fallback_api_key=None,       # OpenAI API key for complex pages
    fallback_model="gpt-4.1",    # Model for fallback extraction
    progress_callback=None,      # Progress tracking function
    auto_strategy=True           # Enable automatic strategy selection
)

# Quick text extraction
extract_text_only(pdf_path) -> str

# Extract with images
extract_pages_with_images(pdf_path) -> List[PDFPage]

# Extract with tables
extract_pages_with_tables(pdf_path) -> List[PDFPage]

# Extract everything
extract_everything(pdf_path) -> List[PDFPage]

PDFPage Data Structure

@dataclass
class PDFPage:
    page_number: int      # Page number (1-indexed)
    text: str             # Extracted text from page
    images: List[dict]    # Extracted images with metadata
    metadata: dict        # Page metadata including tables

Performance Benchmarks

Tested on Ubuntu 22.04, Python 3.11, 16GB RAM:

File Size   Pages   Time     Memory    Strategy
5 MB        10      < 5s     ~50 MB    batch_all
50 MB       100     < 30s    ~150 MB   chunked
100 MB      500     < 60s    ~200 MB   stream_pages
200 MB      1000    < 2min   ~250 MB   stream_pages

Real-World Validation

Tested with actual API standards:

✅ API RP 579 (2000) - 41 MB, 966 pages
✅ API Std 650 (2001) - 28 MB, 461 pages
✅ All extraction methods working (text, auto strategy, generator, complete)
✅ Table extraction with column normalization
✅ Image extraction (461-966 images per document)

Common Tasks

OCR for Scanned Documents

import pytesseract
from pdf2image import convert_from_path

# Render each page to an image, then run OCR on it
images = convert_from_path("scanned.pdf")
for i, image in enumerate(images):
    text = pytesseract.image_to_string(image)
    print(f"Page {i+1}:\n{text}")

Add Watermark

from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
watermark = PdfReader("watermark.pdf")
writer = PdfWriter()

for page in reader.pages:
    # Stamp the first page of watermark.pdf onto every page
    page.merge_page(watermark.pages[0])
    writer.add_page(page)

writer.write("watermarked.pdf")

Extract Images

from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page_num, page in enumerate(reader.pages):
    for img_num, image in enumerate(page.images):
        # image.name carries the real extension; embedded images are not always PNG
        with open(f"image_{page_num}_{img_num}_{image.name}", "wb") as f:
            f.write(image.data)

Password Protection

from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

writer.encrypt("user_password", "owner_password")
writer.write("protected.pdf")

Execution Checklist

Verify input PDF exists and is readable
Check if PDF is encrypted/DRM-protected
Choose appropriate library for task (pypdf vs pdfplumber)
Handle multi-page documents correctly
Validate output file was created
Clean up temporary files
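
The first and fifth checklist items can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: `preflight` and `output_ok` are hypothetical helper names, and the header test only looks for the `%PDF-` marker near the start of the file, so it does not prove the PDF is well-formed.

```python
from pathlib import Path


def preflight(pdf_path):
    """Return a list of problems found before processing (empty if none)."""
    path = Path(pdf_path)
    if not path.is_file():
        return [f"not found: {path}"]
    try:
        with path.open("rb") as f:
            header = f.read(1024)
    except OSError as exc:
        return [f"not readable: {exc}"]
    problems = []
    # PDF files carry a "%PDF-" marker at (or very near) the start
    if b"%PDF-" not in header:
        problems.append("missing %PDF- header")
    return problems


def output_ok(output_path):
    """Check that an output file was created and is not empty."""
    path = Path(output_path)
    return path.is_file() and path.stat().st_size > 0
```

The encryption check is left out of the sketch on purpose; with pypdf it can be done by reading `PdfReader(path).is_encrypted` once `preflight` returns no problems.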

Error Handling

Common Errors

Error: FileNotFoundError

Cause: PDF file path is incorrect
Solution: Verify file path and ensure file exists

Error: PdfReadError (encrypted)

Cause: PDF is password-protected or DRM-encrypted
Solution: Provide password or use qpdf to decrypt

Error: Empty text extraction

Cause: PDF contains scanned images, not text
Solution: Use OCR with pytesseract and pdf2image

Error: TesseractNotFoundError

Cause: Tesseract OCR not installed
Solution: sudo apt-get install tesseract-ocr or brew install tesseract
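
For the "Empty text extraction" case above, one way to decide when to fall back to OCR is to flag pages whose extracted text is nearly empty. This is a minimal sketch; `pages_needing_ocr` is a hypothetical helper, and the 20-character threshold is an arbitrary assumption that should be tuned per document set.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-indexed page numbers whose extracted text is (nearly) empty.

    page_texts: one extracted-text string (or None) per page.
    min_chars: assumed threshold below which a page is treated as scanned.
    """
    return [
        number
        for number, text in enumerate(page_texts, 1)
        if len((text or "").strip()) < min_chars
    ]
```

The input would typically be `[page.extract_text() for page in reader.pages]`, and only the flagged pages need to go through the slower pytesseract and pdf2image path.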

Metrics

Metric                   Typical Value
Text extraction speed    ~50 pages/second
OCR processing speed     ~2-5 pages/minute
Memory usage (pypdf)     ~10MB per 100 pages
Merge operation          ~100 PDFs/second
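
The typical values above can be turned into a rough time budget. This is a back-of-the-envelope sketch that simply divides page counts by the rates listed in the table; the function names are hypothetical, and real throughput depends on hardware and document content.

```python
def estimate_text_seconds(pages, pages_per_second=50):
    """Rough text-extraction time, using the ~50 pages/second figure."""
    return pages / pages_per_second


def estimate_ocr_minutes(pages, slow_ppm=2, fast_ppm=5):
    """Rough OCR time as (best_case, worst_case) minutes, using ~2-5 pages/minute."""
    return pages / fast_ppm, pages / slow_ppm


print(estimate_text_seconds(500))   # 10.0 seconds for 500 pages
print(estimate_ocr_minutes(100))    # (20.0, 50.0) minutes for 100 scanned pages
```

The gap between the two estimates is the practical reason to try plain text extraction first and reserve OCR for pages that need it.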

Quick Reference

Task              Tool
Read text         pypdf, pdfplumber
Extract tables    pdfplumber
Create PDFs       reportlab
Merge/split       pypdf, qpdf, pdftk
OCR               pytesseract + pdf2image
Fill forms        pypdf, pdfrw
Watermark         pypdf
Encrypt/decrypt   pypdf, qpdf

Dependencies

# Core PDF libraries
pip install pypdf pdfplumber reportlab pytesseract pdf2image

# OpenAI Codex for PDF to Markdown conversion
pip install openai

System tools:

Poppler (pdftotext, pdftoppm)
qpdf
pdftk
Tesseract OCR

Environment variables:

export OPENAI_API_KEY="your-api-key-here"

Version History

1.2.2 (2026-01-04): Fixed P2 issue - added parents=True to all mkdir() calls to handle nested output paths; prevents FileNotFoundError when creating directories with non-existent parent paths
1.2.1 (2026-01-04): Fixed CLI tool missing imports - added complete standalone script with all required imports (openai, pypdf, logging) and function definitions; resolved P1 issue from Codex review
1.2.0 (2026-01-04): MAJOR UPDATE - Added OpenAI Codex integration for PDF-to-Markdown conversion as recommended first step for all PDF processing; includes batch conversion, chunking for large files, cost-effective options, and complete CLI tool
1.1.0 (2026-01-02): Added Quick Start, When to Use, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
1.0.0 (2024-10-15): Initial release with pypdf, pdfplumber, reportlab, CLI tools

Related Skills

cli-productivity

python-docx

python-scientific-computing

python-pptx