PDF Processing Skill
Overview
This skill enables comprehensive PDF operations through Python libraries and command-line tools. Use it for reading, creating, modifying, and analyzing PDF documents.
Quick Start
from pypdf import PdfReader
reader = PdfReader("document.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)
When to Use
- Converting PDFs to Markdown - Use OpenAI Codex for intelligent conversion (RECOMMENDED FIRST STEP)
- Extracting text and metadata from PDF files
- Merging multiple PDFs into a single document
- Splitting large PDFs into individual pages
- Adding watermarks or annotations to PDFs
- Password-protecting or decrypting PDFs
- Extracting images from PDF documents
- OCR processing for scanned documents
- Creating new PDFs with reportlab
- Extracting tables from structured PDFs
PDF to Markdown Conversion (OpenAI Codex)
IMPORTANT: For all PDF documents, use OpenAI Codex to convert the contents to a .md file first, then work from the markdown in every later step.
Why Convert to Markdown First?
- Better structure preservation - Maintains headings, lists, tables
- Easier text processing - Standard markdown format
- Improved AI understanding - Codex understands document structure
- Format flexibility - Markdown can be converted to any format
- Version control friendly - Plain text, diff-friendly
OpenAI Codex Conversion
Prerequisites:
pip install openai pypdf
export OPENAI_API_KEY="your-api-key-here"
Basic Conversion:
import openai
from pypdf import PdfReader
from pathlib import Path

def pdf_to_markdown_codex(pdf_path, output_md_path=None, model="gpt-4.1"):
    """
    Convert PDF to markdown using OpenAI Codex.

    Args:
        pdf_path: Path to PDF file
        output_md_path: Optional path for output .md file (auto-generated if None)
        model: OpenAI model to use (gpt-4.1, gpt-4.1-mini, etc.)

    Returns:
        Path to generated markdown file
    """
    # Extract text from PDF
    reader = PdfReader(pdf_path)
    pdf_text = ""
    for page_num, page in enumerate(reader.pages, 1):
        text = page.extract_text()
        pdf_text += f"\n\n--- Page {page_num} ---\n\n{text}"

    # Generate markdown using OpenAI Codex
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": """You are an expert document converter. Convert the provided PDF text
to well-structured markdown format. Preserve:
- Document structure (headings, sections)
- Lists and bullet points
- Tables (convert to markdown tables)
- Code blocks and technical content
- Links and references
Format the output as clean, readable markdown."""
            },
            {
                "role": "user",
                "content": f"Convert this PDF text to markdown:\n\n{pdf_text}"
            }
        ],
        temperature=0.3,  # Lower temperature for more consistent formatting
    )
    markdown_content = response.choices[0].message.content

    # Save to file
    if output_md_path is None:
        pdf_stem = Path(pdf_path).stem
        output_md_path = Path(pdf_path).parent / f"{pdf_stem}.md"

    # Ensure parent directory exists
    Path(output_md_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_md_path).write_text(markdown_content, encoding='utf-8')
    return output_md_path

# Usage
md_file = pdf_to_markdown_codex("document.pdf")
print(f"Markdown saved to: {md_file}")
Batch Conversion:
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def batch_pdf_to_markdown(pdf_directory, output_directory=None, model="gpt-4.1"):
    """
    Convert all PDFs in a directory to markdown.

    Args:
        pdf_directory: Directory containing PDF files
        output_directory: Optional output directory (defaults to pdf_directory/markdown)
        model: OpenAI model to use
    """
    pdf_dir = Path(pdf_directory)
    if output_directory is None:
        output_dir = pdf_dir / "markdown"
    else:
        output_dir = Path(output_directory)
    output_dir.mkdir(parents=True, exist_ok=True)

    pdf_files = list(pdf_dir.glob("*.pdf"))
    total = len(pdf_files)
    logger.info(f"Found {total} PDF files to convert")

    for i, pdf_file in enumerate(pdf_files, 1):
        try:
            output_md = output_dir / f"{pdf_file.stem}.md"
            logger.info(f"[{i}/{total}] Converting {pdf_file.name}...")
            pdf_to_markdown_codex(pdf_file, output_md, model=model)
            logger.info(f"✓ Saved to {output_md.name}")
        except Exception as e:
            # Log the failure and continue with the remaining files
            logger.error(f"✗ Failed to convert {pdf_file.name}: {e}")

    logger.info(f"\nConversion complete! Files in: {output_dir}")

# Usage
batch_pdf_to_markdown("/path/to/pdfs", model="gpt-4.1")
Chunked Conversion for Large PDFs:
import openai
from pathlib import Path
from pypdf import PdfReader

def pdf_to_markdown_chunked(pdf_path, output_md_path=None, chunk_pages=10, model="gpt-4.1"):
    """
    Convert large PDF by processing in chunks.

    Args:
        pdf_path: Path to PDF file
        output_md_path: Optional output path
        chunk_pages: Number of pages per chunk
        model: OpenAI model to use
    """
    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)
    markdown_sections = []

    # Create the client once and reuse it for every chunk
    client = openai.OpenAI()

    for start_page in range(0, total_pages, chunk_pages):
        end_page = min(start_page + chunk_pages, total_pages)

        # Extract chunk
        chunk_text = ""
        for page_num in range(start_page, end_page):
            text = reader.pages[page_num].extract_text()
            chunk_text += f"\n\n--- Page {page_num + 1} ---\n\n{text}"

        # Convert chunk
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "system",
                    "content": "Convert PDF text to markdown. Maintain structure and formatting."
                },
                {
                    "role": "user",
                    "content": f"Convert pages {start_page + 1}-{end_page} to markdown:\n\n{chunk_text}"
                }
            ],
            temperature=0.3,
        )
        markdown_sections.append(response.choices[0].message.content)
        print(f"Processed pages {start_page + 1}-{end_page}/{total_pages}")

    # Combine sections
    full_markdown = "\n\n---\n\n".join(markdown_sections)

    # Save
    if output_md_path is None:
        output_md_path = Path(pdf_path).with_suffix('.md')

    # Ensure parent directory exists
    Path(output_md_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_md_path).write_text(full_markdown, encoding='utf-8')
    return output_md_path

# Usage
md_file = pdf_to_markdown_chunked("large_document.pdf", chunk_pages=20)
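The chunk boundaries can be previewed without calling the API. The following is a minimal sketch, assuming a hypothetical helper named `page_chunks` (it is not part of pypdf or of the functions in this skill) that reproduces the `range()` arithmetic of `pdf_to_markdown_chunked` and reports 1-indexed, inclusive page ranges:

```python
def page_chunks(total_pages, chunk_pages=10):
    """Yield (first_page, last_page) tuples, 1-indexed and inclusive.

    Hypothetical helper for illustration; it mirrors the range() arithmetic
    used by pdf_to_markdown_chunked.
    """
    if chunk_pages < 1:
        raise ValueError("chunk_pages must be at least 1")
    for start in range(0, total_pages, chunk_pages):
        end = min(start + chunk_pages, total_pages)
        yield start + 1, end


if __name__ == "__main__":
    # A 45-page document split into chunks of 20 pages
    for first, last in page_chunks(45, chunk_pages=20):
        print(f"pages {first}-{last}")
```

Checking the ranges first makes it easier to pick a `chunk_pages` value that keeps the number of API calls reasonable.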
Workflow: PDF → Markdown → Further Processing:
from pathlib import Path

def pdf_workflow(pdf_path):
    """
    Complete workflow: PDF → Markdown → Process markdown.

    Returns:
        dict with paths to original PDF, markdown, and processed content
    """
    # Step 1: Convert PDF to markdown using Codex
    print("Step 1: Converting PDF to markdown...")
    md_path = pdf_to_markdown_codex(pdf_path)

    # Step 2: Read markdown for further processing
    print("Step 2: Reading markdown content...")
    markdown_content = Path(md_path).read_text(encoding='utf-8')

    # Step 3: Further processing (example: extract headings)
    print("Step 3: Processing markdown...")
    headings = [line for line in markdown_content.split('\n') if line.startswith('#')]

    # Step 4: Additional analysis
    word_count = len(markdown_content.split())

    return {
        'pdf_path': pdf_path,
        'markdown_path': md_path,
        'markdown_content': markdown_content,
        'headings': headings,
        'word_count': word_count,
    }

# Usage
result = pdf_workflow("technical_document.pdf")
print(f"Markdown saved: {result['markdown_path']}")
print(f"Found {len(result['headings'])} headings")
print(f"Word count: {result['word_count']}")

# Now work with the markdown (read with the same encoding it was written in)
with open(result['markdown_path'], encoding='utf-8') as f:
    markdown = f.read()
    # Do further processing with clean markdown
Cost-Effective Options:
# Use GPT-4.1-mini for cost savings
md_file = pdf_to_markdown_codex("document.pdf", model="gpt-4.1-mini")

# Or use local extraction + Codex for formatting only
from pathlib import Path

import openai
from pypdf import PdfReader

def hybrid_conversion(pdf_path):
    """Extract text locally, use Codex only for formatting."""
    # Extract text (free)
    reader = PdfReader(pdf_path)
    raw_text = ""
    for page in reader.pages:
        raw_text += page.extract_text()

    # Use Codex just for markdown formatting (lower cost)
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {
                "role": "system",
                "content": "Format the text as markdown. Add appropriate headings, lists, and structure."
            },
            {
                "role": "user",
                "content": raw_text
            }
        ],
        temperature=0.3,
    )
    markdown = response.choices[0].message.content
    output_path = Path(pdf_path).with_suffix('.md')
    output_path.write_text(markdown, encoding='utf-8')
    return output_path
Best Practices:
- Always convert to markdown first - Makes downstream processing easier
- Use chunking for large PDFs - Avoids token limits and API timeouts
- Cache conversions - Store markdown files to avoid re-conversion
- Choose model based on complexity - GPT-4.1 for complex docs, GPT-4.1-mini for simple ones
- Validate output - Check that markdown structure makes sense
- Handle errors gracefully - Log failures, continue batch processing
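The "cache conversions" practice can be sketched as a thin wrapper around any conversion function. This is a minimal illustration, not part of the skill itself: `convert_with_cache` is a hypothetical name, and the freshness rule (reuse the markdown when it is at least as new as the PDF) is an assumption that may need adjusting for your storage setup.

```python
from pathlib import Path


def convert_with_cache(pdf_path, convert, output_md_path=None):
    """Return the markdown path, converting only when the cache is stale.

    `convert` is any callable taking (pdf_path, output_md_path), for example
    pdf_to_markdown_codex. The mtime comparison is an assumption of this sketch.
    """
    pdf_path = Path(pdf_path)
    md_path = Path(output_md_path) if output_md_path else pdf_path.with_suffix(".md")

    # Reuse the existing markdown if it is at least as new as the PDF
    if md_path.exists() and md_path.stat().st_mtime >= pdf_path.stat().st_mtime:
        return md_path

    convert(pdf_path, md_path)
    return md_path
```

A call such as `convert_with_cache("document.pdf", pdf_to_markdown_codex)` would then skip the API request whenever an up-to-date `document.md` already sits next to the PDF.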
CLI Tool:
#!/usr/bin/env python3
"""PDF to Markdown converter using OpenAI Codex."""

import argparse
import logging
from pathlib import Path

import openai
from pypdf import PdfReader

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def pdf_to_markdown_codex(pdf_path, output_md_path=None, model="gpt-4.1"):
    """
    Convert PDF to markdown using OpenAI Codex.

    Args:
        pdf_path: Path to PDF file
        output_md_path: Optional path for output .md file (auto-generated if None)
        model: OpenAI model to use (gpt-4.1, gpt-4.1-mini, etc.)

    Returns:
        Path to generated markdown file
    """
    # Extract text from PDF
    reader = PdfReader(pdf_path)
    pdf_text = ""
    for page_num, page in enumerate(reader.pages, 1):
        text = page.extract_text()
        pdf_text += f"\n\n--- Page {page_num} ---\n\n{text}"

    # Generate markdown using OpenAI Codex
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": """You are an expert document converter. Convert the provided PDF text
to well-structured markdown format. Preserve:
- Document structure (headings, sections)
- Lists and bullet points
- Tables (convert to markdown tables)
- Code blocks and technical content
- Links and references
Format the output as clean, readable markdown."""
            },
            {
                "role": "user",
                "content": f"Convert this PDF text to markdown:\n\n{pdf_text}"
            }
        ],
        temperature=0.3,
    )
    markdown_content = response.choices[0].message.content

    # Save to file
    if output_md_path is None:
        pdf_stem = Path(pdf_path).stem
        output_md_path = Path(pdf_path).parent / f"{pdf_stem}.md"

    # Ensure parent directory exists
    Path(output_md_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_md_path).write_text(markdown_content, encoding='utf-8')
    return output_md_path

def batch_pdf_to_markdown(pdf_directory, output_directory=None, model="gpt-4.1"):
    """
    Convert all PDFs in a directory to markdown.

    Args:
        pdf_directory: Directory containing PDF files
        output_directory: Optional output directory (defaults to pdf_directory/markdown)
        model: OpenAI model to use
    """
    pdf_dir = Path(pdf_directory)
    if output_directory is None:
        output_dir = pdf_dir / "markdown"
    else:
        output_dir = Path(output_directory)
    output_dir.mkdir(parents=True, exist_ok=True)

    pdf_files = list(pdf_dir.glob("*.pdf"))
    total = len(pdf_files)
    logger.info(f"Found {total} PDF files to convert")

    for i, pdf_file in enumerate(pdf_files, 1):
        try:
            output_md = output_dir / f"{pdf_file.stem}.md"
            logger.info(f"[{i}/{total}] Converting {pdf_file.name}...")
            pdf_to_markdown_codex(pdf_file, output_md, model=model)
            logger.info(f"✓ Saved to {output_md.name}")
        except Exception as e:
            logger.error(f"✗ Failed to convert {pdf_file.name}: {e}")

    logger.info(f"\nConversion complete! Files in: {output_dir}")

def main():
    parser = argparse.ArgumentParser(description='Convert PDF to Markdown using OpenAI')
    parser.add_argument('input', help='PDF file or directory')
    parser.add_argument('-o', '--output', help='Output directory or file')
    parser.add_argument('-m', '--model', default='gpt-4.1',
                        help='OpenAI model (gpt-4.1, gpt-4.1-mini)')
    parser.add_argument('--chunk-pages', type=int, default=10,
                        help='Pages per chunk (unused in basic mode)')

    args = parser.parse_args()
    input_path = Path(args.input)

    if input_path.is_file():
        # Single file
        output = args.output or input_path.with_suffix('.md')
        md_path = pdf_to_markdown_codex(input_path, output, model=args.model)
        print(f"✓ Converted: {md_path}")
    else:
        # Directory
        batch_pdf_to_markdown(input_path, args.output, model=args.model)


if __name__ == '__main__':
    main()
Save as pdf2md.py and use:
# Single file
python pdf2md.py document.pdf

# Directory
python pdf2md.py /path/to/pdfs -o /path/to/markdown

# With GPT-4.1-mini (cheaper)
python pdf2md.py document.pdf --model gpt-4.1-mini
Python Libraries
pypdf - Core PDF Operations
Merging PDFs:
from pypdf import PdfWriter

# PdfMerger was deprecated and then removed in pypdf 5.0; PdfWriter.append() replaces it
writer = PdfWriter()
writer.append("file1.pdf")
writer.append("file2.pdf")
writer.write("merged.pdf")
writer.close()
Splitting PDFs:
from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
for i, page in enumerate(reader.pages):
    # One writer per page, so each page lands in its own file
    writer = PdfWriter()
    writer.add_page(page)
    writer.write(f"page_{i+1}.pdf")
Extracting Metadata:
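from pypdf import PdfReader

reader = PdfReader("document.pdf")
info = reader.metadata  # May be None when the PDF carries no document info
if info is not None:
    print(f"Author: {info.author}")
    print(f"Title: {info.title}")
print(f"Pages: {len(reader.pages)}")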
pdfplumber - Advanced Text Extraction
Text with Layout Preservation:
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
Table Extraction:
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)
reportlab - Creating PDFs
Create PDF from Scratch:
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Hello, World!")
c.showPage()
c.save()
Multi-page Documents:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Title", styles['Heading1']))
story.append(Paragraph("Body text here.", styles['Normal']))

doc.build(story)
Command-Line Tools
pdftotext (Poppler)
pdftotext document.pdf output.txt
pdftotext -layout document.pdf output.txt  # Preserve layout
qpdf
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf document.pdf --pages . 1-5 -- first_five.pdf

# Decrypt
qpdf --decrypt encrypted.pdf decrypted.pdf
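When the qpdf merge needs to be driven from Python, the command can be assembled as an argument list and passed to `subprocess`. This is a minimal sketch under stated assumptions: `qpdf_merge_command` and `run_qpdf_merge` are hypothetical helper names, and only the flags already shown in the merge command above are used.

```python
import shutil
import subprocess


def qpdf_merge_command(inputs, output):
    """Build the argument list for: qpdf --empty --pages <inputs> -- <output>."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return ["qpdf", "--empty", "--pages", *map(str, inputs), "--", str(output)]


def run_qpdf_merge(inputs, output):
    """Run the merge; raises if qpdf is missing or exits with a non-zero status."""
    if shutil.which("qpdf") is None:
        raise FileNotFoundError("qpdf not found on PATH")
    subprocess.run(qpdf_merge_command(inputs, output), check=True)
```

Passing a list instead of a shell string avoids quoting problems with file names that contain spaces. Note that `check=True` treats every non-zero exit status as a failure, which may be stricter than you want if qpdf reports warnings through its exit status.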
pdftk
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk document.pdf burst output page_%02d.pdf

# Rotate
pdftk document.pdf cat 1-endeast output rotated.pdf
PDF-Large-Reader - Robust Extraction for Large Files
For large PDFs (100MB+, 1000+ pages), use the pdf-large-reader library with robust table extraction.
Why Use PDF-Large-Reader?
- Memory-efficient - Handles 100MB+ PDFs without memory issues
- Robust table extraction - Handles irregular tables with column count normalization
- Multiple output formats - Generator (streaming), List, or Plain Text
- Automatic strategy selection - Intelligent chunk size calculation
- Complete extraction - Text, images, tables, and metadata in one pass
- High test coverage - 93.58% coverage with 215 tests
Installation
# From the pdf-large-reader repository
cd /mnt/github/workspace-hub/pdf-large-reader
pip install -e .

# Or with extras
pip install -e ".[dev,progress]"
Quick Start
from pdf_large_reader import process_large_pdf, extract_text_only, extract_everything

# Simple text extraction
text = extract_text_only("large_document.pdf")
print(text)

# Process with automatic strategy selection
pages = process_large_pdf(
    "large_document.pdf",
    output_format="list",
    extract_images=True,
    extract_tables=True
)

# Memory-efficient streaming for very large files
for page in process_large_pdf("huge_file.pdf", output_format="generator"):
    print(f"Page {page.page_number}: {len(page.text)} characters")
Robust Table Extraction
NEW: Column Count Normalization (v1.3.0+)
The table extraction now handles irregular tables with different column counts:
from pdf_large_reader import extract_everything

# Extract everything including tables with robust error handling
pages = extract_everything("technical_standard.pdf")

for page in pages:
    if 'tables' in page.metadata:
        tables = page.metadata['tables']
        print(f"Page {page.page_number}: Found {len(tables)} tables")

        for i, table_df in enumerate(tables):
            print(f"  Table {i+1}: {table_df.shape[0]} rows x {table_df.shape[1]} cols")
            print(table_df.head())
How It Works:
- Detects table-like structures from text positioning
- Normalizes column counts across all rows
- Pads short rows with empty strings
- Gracefully handles malformed tables with try-except
- Logs warnings instead of crashing
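The normalization and padding steps listed above can be pictured with a few lines of plain Python. This is an illustrative sketch only: `normalize_rows` is a hypothetical name, and the code is not taken from pdf-large-reader's internals.

```python
def normalize_rows(rows, fill=""):
    """Pad every row to the width of the widest row.

    Illustrative only; this is not pdf-large-reader's internal code.
    """
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


if __name__ == "__main__":
    ragged = [["Item", "Qty", "Unit"], ["Bolt", "4"], ["Nut"]]
    for row in normalize_rows(ragged):
        print(row)
```

Once every row has the same length, the rows can be loaded into a tabular structure without a shape mismatch.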
Typical Performance:
- API Std 650 (28 MB, 461 pages): 14,648 chars/sec, 5.18 pages/sec
- API RP 579 (41 MB, 966 pages): 2,090 chars/sec, 8.48 pages/sec
Command Line Usage
# Extract text from PDF
pdf-large-reader document.pdf

# Save to file
pdf-large-reader document.pdf --output result.txt

# Extract with images and tables
pdf-large-reader document.pdf --extract-images --extract-tables

# Use generator format for large files
pdf-large-reader huge.pdf --output-format generator

# Verbose output
pdf-large-reader document.pdf --verbose
API Reference
# Main entry point with automatic strategy
process_large_pdf(
    pdf_path,
    output_format="generator",   # "generator" (default), "list", or "text"
    extract_images=False,        # Extract images
    extract_tables=False,        # Extract tables with normalization
    chunk_size=None,             # Auto-calculated if None
    fallback_api_key=None,       # OpenAI API key for complex pages
    fallback_model="gpt-4.1",    # Model for fallback extraction
    progress_callback=None,      # Progress tracking function
    auto_strategy=True           # Enable automatic strategy selection
)

# Quick text extraction
extract_text_only(pdf_path) -> str

# Extract with images
extract_pages_with_images(pdf_path) -> List[PDFPage]

# Extract with tables
extract_pages_with_tables(pdf_path) -> List[PDFPage]

# Extract everything
extract_everything(pdf_path) -> List[PDFPage]
PDFPage Data Structure
from dataclasses import dataclass
from typing import List

@dataclass
class PDFPage:
    page_number: int      # Page number (1-indexed)
    text: str             # Extracted text from page
    images: List[dict]    # Extracted images with metadata
    metadata: dict        # Page metadata including tables
Performance Benchmarks
Tested on Ubuntu 22.04, Python 3.11, 16GB RAM:
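| File Size | Pages | Time    | Memory  | Strategy     |
|-----------|-------|---------|---------|--------------|
| 5 MB      | 10    | < 5s    | ~50 MB  | batch_all    |
| 50 MB     | 100   | < 30s   | ~150 MB | chunked      |
| 100 MB    | 500   | < 60s   | ~200 MB | stream_pages |
| 200 MB    | 1000  | < 2min  | ~250 MB | stream_pages |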
Real-World Validation
Tested with actual API standards:
- ✅ API RP 579 (2000) - 41 MB, 966 pages
- ✅ API Std 650 (2001) - 28 MB, 461 pages
- ✅ All extraction methods working (text, auto strategy, generator, complete)
- ✅ Table extraction with column normalization
- ✅ Image extraction (461-966 images per document)
Common Tasks
OCR for Scanned Documents
import pytesseract
from pdf2image import convert_from_path

# Render each page to an image, then run OCR on it
images = convert_from_path("scanned.pdf")
for i, image in enumerate(images):
    text = pytesseract.image_to_string(image)
    print(f"Page {i+1}:\n{text}")
Add Watermark
from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
watermark = PdfReader("watermark.pdf")
writer = PdfWriter()

for page in reader.pages:
    # Overlay the first page of the watermark file on every page
    page.merge_page(watermark.pages[0])
    writer.add_page(page)

writer.write("watermarked.pdf")
Extract Images
from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page_num, page in enumerate(reader.pages):
    for img_num, image in enumerate(page.images):
        # image.name carries the real file extension; embedded images are not always PNG
        with open(f"page{page_num}_{img_num}_{image.name}", "wb") as f:
            f.write(image.data)
Password Protection
from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

writer.encrypt("user_password", "owner_password")
writer.write("protected.pdf")
Execution Checklist
- Verify input PDF exists and is readable
- Check if PDF is encrypted/DRM-protected
- Choose appropriate library for task (pypdf vs pdfplumber)
- Handle multi-page documents correctly
- Validate output file was created
- Clean up temporary files
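The first and fifth checklist items lend themselves to small automated checks. The sketch below uses only the standard library; `preflight` and `output_ok` are hypothetical helper names, and the header test only confirms that the file begins with the usual `%PDF-` marker rather than validating the whole document. The encryption check is left to the PDF library (pypdf exposes `PdfReader.is_encrypted` for this).

```python
from pathlib import Path


def preflight(pdf_path):
    """Return a list of problems found before processing; empty means OK.

    Hypothetical helper. The header test only confirms the file starts
    with the "%PDF-" marker; it does not validate the rest of the file.
    """
    path = Path(pdf_path)
    if not path.is_file():
        return [f"not found: {path}"]
    try:
        with path.open("rb") as fh:
            header = fh.read(5)
    except OSError as exc:
        return [f"not readable: {exc}"]
    problems = []
    if header != b"%PDF-":
        problems.append("missing %PDF- header")
    return problems


def output_ok(output_path):
    """Check that an output file exists and is not empty."""
    path = Path(output_path)
    return path.is_file() and path.stat().st_size > 0
```

Running `preflight` before any conversion and `output_ok` afterwards catches the most common path and empty-output mistakes early.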
Error Handling
Common Errors
Error: FileNotFoundError
- Cause: PDF file path is incorrect
- Solution: Verify file path and ensure file exists
Error: PdfReadError (encrypted)
- Cause: PDF is password-protected or DRM-encrypted
- Solution: Provide password or use qpdf to decrypt
Error: Empty text extraction
- Cause: PDF contains scanned images, not text
- Solution: Use OCR with pytesseract and pdf2image
Error: DependencyError (Tesseract)
- Cause: Tesseract OCR not installed
- Solution: sudo apt-get install tesseract-ocr or brew install tesseract
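The "empty text extraction" case can be detected automatically before deciding to fall back to OCR. This is a rough heuristic sketch: `needs_ocr` is a hypothetical helper, and the 20-character threshold is an arbitrary assumption that should be tuned for your documents.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is scanned, given the text extracted per page.

    The threshold is an arbitrary assumption for this sketch.
    """
    if not page_texts:
        return True
    # Count non-whitespace-trimmed characters, treating None as empty text
    total = sum(len((text or "").strip()) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page
```

With pypdf, `page_texts` could be built as `[page.extract_text() for page in reader.pages]`; when `needs_ocr` returns `True`, the pytesseract and pdf2image route is the likely next step.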
Metrics
| Metric                | Typical Value        |
|-----------------------|----------------------|
| Text extraction speed | ~50 pages/second     |
| OCR processing speed  | ~2-5 pages/minute    |
| Memory usage (pypdf)  | ~10MB per 100 pages  |
| Merge operation       | ~100 PDFs/second     |
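These typical values can be turned into rough time estimates for planning a job. The sketch below simply divides a page count by the rates quoted above; the function names are hypothetical, and real throughput will vary with hardware and document content.

```python
TEXT_PAGES_PER_SECOND = 50      # "~50 pages/second" from the metrics above
OCR_PAGES_PER_MINUTE = (2, 5)   # "~2-5 pages/minute" from the metrics above


def estimate_text_seconds(pages):
    """Rough seconds needed for plain text extraction."""
    return pages / TEXT_PAGES_PER_SECOND


def estimate_ocr_minutes(pages):
    """Return (best_case, worst_case) minutes for OCR."""
    slow, fast = OCR_PAGES_PER_MINUTE
    return pages / fast, pages / slow


if __name__ == "__main__":
    pages = 100
    best, worst = estimate_ocr_minutes(pages)
    print(f"text: about {estimate_text_seconds(pages):.0f} s")
    print(f"OCR: between {best:.0f} and {worst:.0f} minutes")
```

The gap between the two estimates is the practical argument for trying text extraction first and reserving OCR for pages that genuinely need it.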
Quick Reference
| Task            | Tool                    |
|-----------------|-------------------------|
| Read text       | pypdf, pdfplumber       |
| Extract tables  | pdfplumber              |
| Create PDFs     | reportlab               |
| Merge/split     | pypdf, qpdf, pdftk      |
| OCR             | pytesseract + pdf2image |
| Fill forms      | pypdf, pdfrw            |
| Watermark       | pypdf                   |
| Encrypt/decrypt | pypdf, qpdf             |
Dependencies
# Core PDF libraries
pip install pypdf pdfplumber reportlab pytesseract pdf2image

# OpenAI Codex for PDF to Markdown conversion
pip install openai
System tools:
- Poppler (pdftotext, pdftoppm)
- qpdf
- pdftk
- Tesseract OCR
Environment variables:
export OPENAI_API_KEY="your-api-key-here"
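Checking for the key up front gives a clearer error than a failed API call halfway through a batch. A minimal sketch, assuming a hypothetical helper named `require_api_key` that reads the same `OPENAI_API_KEY` variable:

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key, or raise a clear error if it is not set."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the conversion")
    return key
```

Calling `require_api_key()` at the start of a script stops the run before any PDF is read if the variable is missing.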
Version History
- 1.2.2 (2026-01-04): Fixed P2 issue - added parents=True to all mkdir() calls to handle nested output paths; prevents FileNotFoundError when creating directories with non-existent parent paths
- 1.2.1 (2026-01-04): Fixed CLI tool missing imports - added complete standalone script with all required imports (openai, pypdf, logging) and function definitions; resolved P1 issue from Codex review
- 1.2.0 (2026-01-04): MAJOR UPDATE - Added OpenAI Codex integration for PDF-to-Markdown conversion as recommended first step for all PDF processing; includes batch conversion, chunking for large files, cost-effective options, and complete CLI tool
- 1.1.0 (2026-01-02): Added Quick Start, When to Use, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
- 1.0.0 (2024-10-15): Initial release with pypdf, pdfplumber, reportlab, CLI tools