PDF Skill
Read, create, and manipulate PDF documents with support for text extraction, document generation, merging, and form filling.
Capabilities
-
Read PDFs: Extract text, tables, and metadata from PDF files
-
Create PDFs: Generate PDF documents from scratch using ReportLab
-
Merge PDFs: Combine multiple PDFs into a single document
-
Split PDFs: Extract specific pages from PDF documents
-
Form Operations: Fill PDF forms programmatically
-
Watermarks: Add watermarks and headers/footers to documents
-
Convert: Convert between PDF and other formats
Quick Start
import pdfplumber from PyPDF2 import PdfReader, PdfWriter
Read text from PDF
with pdfplumber.open('document.pdf') as pdf: for page in pdf.pages: print(page.extract_text())
Merge PDFs
merger = PdfWriter() for pdf_file in ['doc1.pdf', 'doc2.pdf']: merger.append(pdf_file) merger.write('merged.pdf')
Usage
Extracting Text from PDFs
Extract text content from PDF files with layout preservation.
Input: Path to a PDF file
Process:
-
Open PDF with pdfplumber for accurate text extraction
-
Iterate through pages
-
Extract text, optionally preserving layout
Example:
import pdfplumber from pathlib import Path
def extract_text(pdf_path: Path) -> str: """Extract all text from a PDF file.""" text_content = []
with pdfplumber.open(pdf_path) as pdf:
for page in pdf.pages:
text = page.extract_text()
if text:
text_content.append(text)
return '\n\n'.join(text_content)
Usage
text = extract_text(Path('report.pdf')) print(text)
Extracting Tables from PDFs
Extract tabular data from PDF files into structured formats.
Input: Path to PDF file containing tables
Process:
-
Open PDF with pdfplumber
-
Detect and extract tables from each page
-
Return as list of lists (rows and cells)
Example:
import pdfplumber import csv
def extract_tables(pdf_path: str, output_csv: str = None): """Extract tables from PDF, optionally save to CSV.""" all_tables = []
with pdfplumber.open(pdf_path) as pdf:
for page_num, page in enumerate(pdf.pages, 1):
tables = page.extract_tables()
for table_num, table in enumerate(tables, 1):
all_tables.append({
'page': page_num,
'table_num': table_num,
'data': table
})
# Optionally save to CSV
if output_csv and all_tables:
with open(output_csv, 'w', newline='') as f:
writer = csv.writer(f)
for table in all_tables:
writer.writerow([f"Page {table['page']}, Table {table['table_num']}"])
writer.writerows(table['data'])
writer.writerow([]) # Empty row between tables
return all_tables
Usage
tables = extract_tables('financial_report.pdf', 'extracted_tables.csv') for table in tables: print(f"Page {table['page']}, Table {table['table_num']}") for row in table['data']: print(row)
Creating PDF Documents
Generate PDF documents from scratch using ReportLab.
Input: Content to include in the PDF
Process:
-
Create a canvas or use higher-level constructs
-
Add text, tables, images
-
Save to file
Example:
from reportlab.lib import colors from reportlab.lib.pagesizes import letter, A4 from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle from reportlab.lib.units import inch from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle
def create_report(output_path: str, title: str, content: list): """Create a formatted PDF report.""" doc = SimpleDocTemplate(output_path, pagesize=letter) styles = getSampleStyleSheet() story = []
# Add title
title_style = ParagraphStyle(
'CustomTitle',
parent=styles['Heading1'],
fontSize=24,
spaceAfter=30
)
story.append(Paragraph(title, title_style))
# Add content paragraphs
for item in content:
if isinstance(item, str):
story.append(Paragraph(item, styles['Normal']))
story.append(Spacer(1, 12))
elif isinstance(item, list):
# Treat as table data
table = Table(item)
table.setStyle(TableStyle([
('BACKGROUND', (0, 0), (-1, 0), colors.grey),
('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
('ALIGN', (0, 0), (-1, -1), 'CENTER'),
('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
('FONTSIZE', (0, 0), (-1, 0), 12),
('BOTTOMPADDING', (0, 0), (-1, 0), 12),
('BACKGROUND', (0, 1), (-1, -1), colors.beige),
('GRID', (0, 0), (-1, -1), 1, colors.black)
]))
story.append(table)
story.append(Spacer(1, 20))
doc.build(story)
Usage
create_report( 'security_report.pdf', 'Security Assessment Report', [ 'This report summarizes the findings from our security assessment.', [ ['Finding', 'Severity', 'Status'], ['SQL Injection', 'Critical', 'Open'], ['XSS Vulnerability', 'High', 'Remediated'], ['Weak Password Policy', 'Medium', 'In Progress'] ], 'Immediate remediation is recommended for all critical findings.' ] )
Merging PDF Documents
Combine multiple PDF files into a single document.
Input: List of PDF file paths
Process:
-
Create a PdfWriter object
-
Append each PDF
-
Write to output file
Example:
from PyPDF2 import PdfWriter, PdfReader from pathlib import Path
def merge_pdfs(pdf_list: list, output_path: str, add_bookmarks: bool = True): """Merge multiple PDFs into one document.""" writer = PdfWriter()
for pdf_path in pdf_list:
reader = PdfReader(pdf_path)
# Add bookmark for this document
if add_bookmarks:
bookmark_title = Path(pdf_path).stem
writer.add_outline_item(bookmark_title, len(writer.pages))
# Add all pages from this PDF
for page in reader.pages:
writer.add_page(page)
# Write the merged PDF
with open(output_path, 'wb') as output_file:
writer.write(output_file)
return output_path
Usage
pdfs_to_merge = [ 'cover_page.pdf', 'executive_summary.pdf', 'detailed_findings.pdf', 'appendix.pdf' ] merge_pdfs(pdfs_to_merge, 'complete_report.pdf')
Splitting PDF Documents
Extract specific pages from a PDF into new documents.
Input: PDF path and page ranges
Process:
-
Open source PDF
-
Select specific pages
-
Write to new PDF
Example:
from PyPDF2 import PdfReader, PdfWriter
def split_pdf(input_path: str, page_ranges: list, output_prefix: str): """ Split a PDF into multiple files based on page ranges.
Args:
input_path: Source PDF file
page_ranges: List of tuples (start, end) - 1-indexed, inclusive
output_prefix: Prefix for output files
Returns:
List of created file paths
"""
reader = PdfReader(input_path)
output_files = []
for i, (start, end) in enumerate(page_ranges, 1):
writer = PdfWriter()
# Pages are 0-indexed in PyPDF2
for page_num in range(start - 1, min(end, len(reader.pages))):
writer.add_page(reader.pages[page_num])
output_path = f"{output_prefix}_part{i}.pdf"
with open(output_path, 'wb') as output_file:
writer.write(output_file)
output_files.append(output_path)
return output_files
Usage - Split a 20-page document
split_pdf('large_report.pdf', [(1, 5), (6, 10), (11, 20)], 'report')
Creates: report_part1.pdf, report_part2.pdf, report_part3.pdf
Adding Watermarks
Add watermarks to PDF pages.
Input: PDF file and watermark content
Process:
-
Create watermark PDF
-
Overlay on each page
-
Save result
Example:
from PyPDF2 import PdfReader, PdfWriter from reportlab.pdfgen import canvas from reportlab.lib.pagesizes import letter from io import BytesIO
def add_watermark(input_path: str, output_path: str, watermark_text: str): """Add a text watermark to all pages of a PDF.""" # Create watermark watermark_buffer = BytesIO() c = canvas.Canvas(watermark_buffer, pagesize=letter)
# Configure watermark appearance
c.setFont("Helvetica", 50)
c.setFillColorRGB(0.5, 0.5, 0.5, alpha=0.3)
c.saveState()
c.translate(300, 400)
c.rotate(45)
c.drawCentredString(0, 0, watermark_text)
c.restoreState()
c.save()
watermark_buffer.seek(0)
watermark_pdf = PdfReader(watermark_buffer)
watermark_page = watermark_pdf.pages[0]
# Apply watermark to each page
reader = PdfReader(input_path)
writer = PdfWriter()
for page in reader.pages:
page.merge_page(watermark_page)
writer.add_page(page)
with open(output_path, 'wb') as output_file:
writer.write(output_file)
Usage
add_watermark('report.pdf', 'report_confidential.pdf', 'CONFIDENTIAL')
Extracting Metadata
Read and modify PDF metadata.
Example:
from PyPDF2 import PdfReader, PdfWriter
def get_pdf_metadata(pdf_path: str) -> dict: """Extract metadata from a PDF file.""" reader = PdfReader(pdf_path) metadata = reader.metadata
return {
'title': metadata.get('/Title', ''),
'author': metadata.get('/Author', ''),
'subject': metadata.get('/Subject', ''),
'creator': metadata.get('/Creator', ''),
'producer': metadata.get('/Producer', ''),
'creation_date': metadata.get('/CreationDate', ''),
'modification_date': metadata.get('/ModDate', ''),
'page_count': len(reader.pages)
}
def set_pdf_metadata(input_path: str, output_path: str, metadata: dict): """Set metadata on a PDF file.""" reader = PdfReader(input_path) writer = PdfWriter()
for page in reader.pages:
writer.add_page(page)
writer.add_metadata(metadata)
with open(output_path, 'wb') as output_file:
writer.write(output_file)
Usage
meta = get_pdf_metadata('document.pdf') print(f"Title: {meta['title']}") print(f"Pages: {meta['page_count']}")
set_pdf_metadata('input.pdf', 'output.pdf', { '/Title': 'Security Assessment Report', '/Author': 'Security Team', '/Subject': 'Q1 2024 Assessment' })
Configuration
Environment Variables
Variable Description Required Default
PDF_TEMPLATE_DIR
Default template directory No ./assets/templates
PDF_OUTPUT_DIR
Default output directory No ./output
Script Options
Option Type Description
--input
path Input PDF file
--output
path Output file path
--pages
string Page range (e.g., "1-5,8,10-12")
--verbose
flag Enable verbose logging
Examples
Example 1: Generate a Security Report PDF
Scenario: Create a professional security assessment report as PDF.
from reportlab.lib import colors from reportlab.lib.pagesizes import letter from reportlab.lib.styles import getSampleStyleSheet from reportlab.platypus import SimpleDocTemplate, Paragraph, Table, TableStyle, Spacer
def generate_security_report(findings: list, output_path: str): """Generate a security report PDF from findings data.""" doc = SimpleDocTemplate(output_path, pagesize=letter) styles = getSampleStyleSheet() story = []
# Title
story.append(Paragraph("Security Assessment Report", styles['Title']))
story.append(Spacer(1, 20))
# Executive Summary
story.append(Paragraph("Executive Summary", styles['Heading1']))
critical = sum(1 for f in findings if f['severity'] == 'Critical')
high = sum(1 for f in findings if f['severity'] == 'High')
story.append(Paragraph(
f"This assessment identified {critical} critical and {high} high severity findings.",
styles['Normal']
))
story.append(Spacer(1, 20))
# Findings Table
story.append(Paragraph("Findings Summary", styles['Heading1']))
table_data = [['ID', 'Finding', 'Severity', 'Status']]
for i, f in enumerate(findings, 1):
table_data.append([str(i), f['title'], f['severity'], f['status']])
table = Table(table_data, colWidths=[40, 250, 80, 80])
table.setStyle(TableStyle([
('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#2c3e50')),
('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
('ALIGN', (0, 0), (-1, -1), 'CENTER'),
('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.HexColor('#ecf0f1')])
]))
story.append(table)
doc.build(story)
Usage
findings = [ {'title': 'SQL Injection in Login Form', 'severity': 'Critical', 'status': 'Open'}, {'title': 'Reflected XSS', 'severity': 'High', 'status': 'Open'}, {'title': 'Missing Security Headers', 'severity': 'Medium', 'status': 'Fixed'} ] generate_security_report(findings, 'pentest_report.pdf')
Example 2: Extract and Analyze PDF Content
Scenario: Extract text and tables from a vendor security questionnaire.
import pdfplumber import json
def analyze_questionnaire(pdf_path: str) -> dict: """Extract and analyze a security questionnaire PDF.""" results = { 'total_pages': 0, 'questions': [], 'tables': [], 'text_content': '' }
with pdfplumber.open(pdf_path) as pdf:
results['total_pages'] = len(pdf.pages)
for page_num, page in enumerate(pdf.pages, 1):
# Extract text
text = page.extract_text() or ''
results['text_content'] += f"\n--- Page {page_num} ---\n{text}"
# Find questions (lines ending with ?)
for line in text.split('\n'):
if line.strip().endswith('?'):
results['questions'].append({
'page': page_num,
'question': line.strip()
})
# Extract tables
for table in page.extract_tables():
if table:
results['tables'].append({
'page': page_num,
'rows': len(table),
'data': table
})
return results
Usage
analysis = analyze_questionnaire('vendor_questionnaire.pdf') print(f"Total pages: {analysis['total_pages']}") print(f"Questions found: {len(analysis['questions'])}") print(f"Tables found: {len(analysis['tables'])}")
Example 3: Batch PDF Processing
Scenario: Process multiple PDFs, extract metadata, and generate a summary.
from PyPDF2 import PdfReader from pathlib import Path import csv
def batch_analyze_pdfs(directory: str, output_csv: str): """Analyze all PDFs in a directory and create a summary CSV.""" pdf_dir = Path(directory) results = []
for pdf_path in pdf_dir.glob('*.pdf'):
try:
reader = PdfReader(pdf_path)
meta = reader.metadata or {}
results.append({
'filename': pdf_path.name,
'pages': len(reader.pages),
'title': meta.get('/Title', ''),
'author': meta.get('/Author', ''),
'encrypted': reader.is_encrypted,
'size_kb': pdf_path.stat().st_size / 1024
})
except Exception as e:
results.append({
'filename': pdf_path.name,
'error': str(e)
})
# Write CSV summary
with open(output_csv, 'w', newline='') as f:
if results:
writer = csv.DictWriter(f, fieldnames=results[0].keys())
writer.writeheader()
writer.writerows(results)
return results
Usage
summary = batch_analyze_pdfs('./reports/', 'pdf_inventory.csv')
Limitations
-
Scanned PDFs: Text extraction requires OCR for image-based PDFs (not included by default)
-
Complex Layouts: Multi-column or heavily formatted PDFs may have extraction issues
-
Form Fields: Complex interactive forms may not be fully supported
-
Digital Signatures: Cannot create or verify digital signatures
-
Encryption: Limited support for encrypted PDFs (password-protected reading only)
-
Large Files: Very large PDFs (1000+ pages) may require streaming approaches
Troubleshooting
Text Extraction Returns Empty
Problem: extract_text() returns empty or garbled text
Solutions:
The PDF may be image-based (scanned). Use OCR:
Install: pip install pdf2image pytesseract
from pdf2image import convert_from_path import pytesseract
images = convert_from_path('scanned.pdf') text = '\n'.join(pytesseract.image_to_string(img) for img in images)
Try different extraction settings:
with pdfplumber.open('document.pdf') as pdf: page = pdf.pages[0] text = page.extract_text(layout=True) # Preserve layout
Merge Fails with Encrypted PDF
Problem: Cannot merge password-protected PDFs
Solution:
reader = PdfReader('protected.pdf') if reader.is_encrypted: reader.decrypt('password')
Then proceed with merging
Table Extraction Incorrect
Problem: Table cells are misaligned or merged incorrectly
Solution: Use explicit table settings:
with pdfplumber.open('document.pdf') as pdf: page = pdf.pages[0] tables = page.extract_tables(table_settings={ "vertical_strategy": "text", "horizontal_strategy": "text", "snap_tolerance": 3, })
Related Skills
-
docx: Convert between DOCX and PDF formats
-
xlsx: Extract tabular data for spreadsheet analysis
-
image-generation: Generate charts and diagrams for PDF reports
References
-
Detailed API Reference
-
PyPDF2 Documentation
-
pdfplumber Documentation
-
ReportLab User Guide