PDF Skill

Read, create, and manipulate PDF documents with support for text extraction, document generation, merging, and form filling.

Capabilities

Read PDFs: Extract text, tables, and metadata from PDF files
Create PDFs: Generate PDF documents from scratch using ReportLab
Merge PDFs: Combine multiple PDFs into a single document
Split PDFs: Extract specific pages from PDF documents
Form Operations: Fill PDF forms programmatically
Watermarks: Add watermarks and headers/footers to documents
Convert: Convert between PDF and other formats

Quick Start

import pdfplumber
from PyPDF2 import PdfReader, PdfWriter

# Read text from PDF
with pdfplumber.open('document.pdf') as pdf:
    for page in pdf.pages:
        print(page.extract_text())

# Merge PDFs
merger = PdfWriter()
for pdf_file in ['doc1.pdf', 'doc2.pdf']:
    merger.append(pdf_file)
merger.write('merged.pdf')

Usage

Extracting Text from PDFs

Extract text content from PDF files, optionally preserving the page layout.

Input: Path to a PDF file

Process:

Open PDF with pdfplumber for accurate text extraction
Iterate through pages
Extract text, optionally preserving layout

Example:

import pdfplumber
from pathlib import Path

def extract_text(pdf_path: Path) -> str:
    """Extract all text from a PDF file."""
    text_content = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            # Pages without a text layer yield None or an empty string
            if text:
                text_content.append(text)

    return '\n\n'.join(text_content)

# Usage
text = extract_text(Path('report.pdf'))
print(text)

Extracting Tables from PDFs

Extract tabular data from PDF files into structured formats.

Input: Path to PDF file containing tables

Process:

Open PDF with pdfplumber
Detect and extract tables from each page
Return as list of lists (rows and cells)

Example:

import csv
from typing import Optional

import pdfplumber

def extract_tables(pdf_path: str, output_csv: Optional[str] = None):
    """Extract tables from PDF, optionally save to CSV."""
    all_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
            tables = page.extract_tables()
            for table_num, table in enumerate(tables, 1):
                all_tables.append({
                    'page': page_num,
                    'table_num': table_num,
                    'data': table
                })

    # Optionally save to CSV
    if output_csv and all_tables:
        with open(output_csv, 'w', newline='') as f:
            writer = csv.writer(f)
            for table in all_tables:
                writer.writerow([f"Page {table['page']}, Table {table['table_num']}"])
                writer.writerows(table['data'])
                writer.writerow([])  # Empty row between tables

    return all_tables

# Usage
tables = extract_tables('financial_report.pdf', 'extracted_tables.csv')
for table in tables:
    print(f"Page {table['page']}, Table {table['table_num']}")
    for row in table['data']:
        print(row)
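
Each extracted table is a list of rows, and pdfplumber reports cells without text as None. The sketch below shows one possible way to turn such a table into a list of dictionaries. It assumes the first row is a header row, which will not hold for every PDF, and the helper name table_to_records is hypothetical rather than part of pdfplumber.

```python
from typing import Optional


def table_to_records(table: list[list[Optional[str]]]) -> list[dict[str, str]]:
    """Convert a list-of-rows table into dicts keyed by the header row."""
    if not table:
        return []

    # Normalize cells: None becomes '' and surrounding whitespace is stripped.
    cleaned = [[(cell or '').strip() for cell in row] for row in table]
    header, *rows = cleaned

    records = []
    for row in rows:
        # Skip rows that are entirely empty after cleaning.
        if not any(row):
            continue
        # Pad short rows so every header column gets a value.
        padded = row + [''] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records


# Sample data shaped like the 'data' field of one extracted table.
sample = [
    ['Finding', 'Severity', 'Status'],
    ['SQL Injection', 'Critical', 'Open'],
    [None, None, None],
    ['XSS', 'High'],
]
records = table_to_records(sample)
for record in records:
    print(record)
```

Keyed rows are easier to filter or serialize than positional lists, but tables with duplicate or blank header cells would need extra handling that this sketch leaves out.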

Creating PDF Documents

Generate PDF documents from scratch using ReportLab.

Input: Content to include in the PDF

Process:

Create a canvas or use higher-level constructs
Add text, tables, images
Save to file

Example:

from reportlab.lib import colors
from reportlab.lib.pagesizes import letter, A4
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle

def create_report(output_path: str, title: str, content: list):
    """Create a formatted PDF report."""
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    styles = getSampleStyleSheet()
    story = []

    # Add title
    title_style = ParagraphStyle(
        'CustomTitle',
        parent=styles['Heading1'],
        fontSize=24,
        spaceAfter=30
    )
    story.append(Paragraph(title, title_style))

    # Add content: strings become paragraphs, lists become tables
    for item in content:
        if isinstance(item, str):
            story.append(Paragraph(item, styles['Normal']))
            story.append(Spacer(1, 12))
        elif isinstance(item, list):
            # Treat as table data
            table = Table(item)
            table.setStyle(TableStyle([
                ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
                ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
                ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
                ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
                ('FONTSIZE', (0, 0), (-1, 0), 12),
                ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
                ('BACKGROUND', (0, 1), (-1, -1), colors.beige),
                ('GRID', (0, 0), (-1, -1), 1, colors.black)
            ]))
            story.append(table)
            story.append(Spacer(1, 20))

    doc.build(story)

# Usage
create_report(
    'security_report.pdf',
    'Security Assessment Report',
    [
        'This report summarizes the findings from our security assessment.',
        [
            ['Finding', 'Severity', 'Status'],
            ['SQL Injection', 'Critical', 'Open'],
            ['XSS Vulnerability', 'High', 'Remediated'],
            ['Weak Password Policy', 'Medium', 'In Progress']
        ],
        'Immediate remediation is recommended for all critical findings.'
    ]
)

Merging PDF Documents

Combine multiple PDF files into a single document.

Input: List of PDF file paths

Process:

Create a PdfWriter object
Append each PDF
Write to output file

Example:

from pathlib import Path

from PyPDF2 import PdfReader, PdfWriter

def merge_pdfs(pdf_list: list, output_path: str, add_bookmarks: bool = True):
    """Merge multiple PDFs into one document."""
    writer = PdfWriter()

    for pdf_path in pdf_list:
        reader = PdfReader(pdf_path)

        # Remember where this document starts in the merged output
        start_page = len(writer.pages)

        # Add all pages from this PDF
        for page in reader.pages:
            writer.add_page(page)

        # Add a bookmark for this document once its first page exists in the writer
        if add_bookmarks and len(writer.pages) > start_page:
            bookmark_title = Path(pdf_path).stem
            writer.add_outline_item(bookmark_title, start_page)

    # Write the merged PDF
    with open(output_path, 'wb') as output_file:
        writer.write(output_file)

    return output_path

# Usage
pdfs_to_merge = [
    'cover_page.pdf',
    'executive_summary.pdf',
    'detailed_findings.pdf',
    'appendix.pdf'
]
merge_pdfs(pdfs_to_merge, 'complete_report.pdf')

Splitting PDF Documents

Extract specific pages from a PDF into new documents.

Input: PDF path and page ranges

Process:

Open source PDF
Select specific pages
Write to new PDF

Example:

from PyPDF2 import PdfReader, PdfWriter

def split_pdf(input_path: str, page_ranges: list, output_prefix: str):
    """
    Split a PDF into multiple files based on page ranges.

    Args:
        input_path: Source PDF file
        page_ranges: List of tuples (start, end) - 1-indexed, inclusive
        output_prefix: Prefix for output files

    Returns:
        List of created file paths
    """
    reader = PdfReader(input_path)
    output_files = []

    for i, (start, end) in enumerate(page_ranges, 1):
        writer = PdfWriter()

        # Pages are 0-indexed in PyPDF2; clamp the end to the last page
        for page_num in range(start - 1, min(end, len(reader.pages))):
            writer.add_page(reader.pages[page_num])

        output_path = f"{output_prefix}_part{i}.pdf"
        with open(output_path, 'wb') as output_file:
            writer.write(output_file)

        output_files.append(output_path)

    return output_files

# Usage - Split a 20-page document
split_pdf('large_report.pdf', [(1, 5), (6, 10), (11, 20)], 'report')
# Creates: report_part1.pdf, report_part2.pdf, report_part3.pdf

Adding Watermarks

Add watermarks to PDF pages.

Input: PDF file and watermark content

Process:

Create watermark PDF
Overlay on each page
Save result

Example:

from io import BytesIO

from PyPDF2 import PdfReader, PdfWriter
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def add_watermark(input_path: str, output_path: str, watermark_text: str):
    """Add a text watermark to all pages of a PDF."""
    # Create watermark as a one-page PDF held in memory
    watermark_buffer = BytesIO()
    c = canvas.Canvas(watermark_buffer, pagesize=letter)

    # Configure watermark appearance
    c.setFont("Helvetica", 50)
    c.setFillColorRGB(0.5, 0.5, 0.5, alpha=0.3)
    c.saveState()
    c.translate(300, 400)
    c.rotate(45)
    c.drawCentredString(0, 0, watermark_text)
    c.restoreState()
    c.save()

    watermark_buffer.seek(0)
    watermark_pdf = PdfReader(watermark_buffer)
    watermark_page = watermark_pdf.pages[0]

    # Apply watermark to each page
    reader = PdfReader(input_path)
    writer = PdfWriter()

    for page in reader.pages:
        page.merge_page(watermark_page)
        writer.add_page(page)

    with open(output_path, 'wb') as output_file:
        writer.write(output_file)

# Usage
add_watermark('report.pdf', 'report_confidential.pdf', 'CONFIDENTIAL')

Extracting Metadata

Read and modify PDF metadata.

Example:

from PyPDF2 import PdfReader, PdfWriter

def get_pdf_metadata(pdf_path: str) -> dict:
    """Extract metadata from a PDF file."""
    reader = PdfReader(pdf_path)
    # reader.metadata is None when the file has no document information
    metadata = reader.metadata or {}

    return {
        'title': metadata.get('/Title', ''),
        'author': metadata.get('/Author', ''),
        'subject': metadata.get('/Subject', ''),
        'creator': metadata.get('/Creator', ''),
        'producer': metadata.get('/Producer', ''),
        'creation_date': metadata.get('/CreationDate', ''),
        'modification_date': metadata.get('/ModDate', ''),
        'page_count': len(reader.pages)
    }

def set_pdf_metadata(input_path: str, output_path: str, metadata: dict):
    """Set metadata on a PDF file."""
    reader = PdfReader(input_path)
    writer = PdfWriter()

    # Copy every page, then attach the new metadata to the copy
    for page in reader.pages:
        writer.add_page(page)

    writer.add_metadata(metadata)

    with open(output_path, 'wb') as output_file:
        writer.write(output_file)

# Usage
meta = get_pdf_metadata('document.pdf')
print(f"Title: {meta['title']}")
print(f"Pages: {meta['page_count']}")

set_pdf_metadata('input.pdf', 'output.pdf', {
    '/Title': 'Security Assessment Report',
    '/Author': 'Security Team',
    '/Subject': 'Q1 2024 Assessment'
})
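
The creation and modification dates above come back as raw strings. In the PDF date format these usually look like D:YYYYMMDDHHmmSS followed by an optional UTC offset such as +05'30' or Z. The sketch below is one possible way to convert such a string into a datetime. The helper name parse_pdf_date is hypothetical, and real-world files sometimes deviate from the format, so it returns None instead of raising when the value does not match.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYY is mandatory; every later field is optional.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<tz>[Zz](?:00'?00'?)?|[+-]\d{2}'?(?:\d{2}'?)?)?$"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Parse a PDF date string; return None if it does not match the format."""
    match = _PDF_DATE.match(value.strip())
    if not match:
        return None
    parts = match.groupdict()

    try:
        tz = parts['tz']
        if tz is None:
            tzinfo = None  # No offset given: return a naive datetime
        elif tz[0] in 'Zz':
            tzinfo = timezone.utc
        else:
            digits = re.sub(r"\D", "", tz)
            offset = timedelta(hours=int(digits[:2]), minutes=int(digits[2:4] or 0))
            tzinfo = timezone(-offset if tz[0] == '-' else offset)

        return datetime(
            int(parts['year']),
            int(parts['month'] or 1),
            int(parts['day'] or 1),
            int(parts['hour'] or 0),
            int(parts['minute'] or 0),
            int(parts['second'] or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        # Out-of-range field such as month 13 or an offset of 24 hours or more
        return None


print(parse_pdf_date("D:20240115103000+05'30'"))
print(parse_pdf_date("D:2024"))
print(parse_pdf_date("not a date"))
```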

Configuration

Environment Variables

Variable          Description                 Required  Default
PDF_TEMPLATE_DIR  Default template directory  No        ./assets/templates
PDF_OUTPUT_DIR    Default output directory    No        ./output
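
How the skill's own scripts read these variables is not shown here. As a minimal sketch, assuming they are plain directory paths that fall back to the documented defaults when unset or empty, a lookup could look like this. The function name load_pdf_dirs and the dictionary keys are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def load_pdf_dirs(environ: Optional[Mapping[str, str]] = None) -> dict[str, Path]:
    """Resolve the template and output directories from environment variables."""
    env = os.environ if environ is None else environ
    return {
        # An unset or empty variable falls back to the documented default
        'template_dir': Path(env.get('PDF_TEMPLATE_DIR') or './assets/templates'),
        'output_dir': Path(env.get('PDF_OUTPUT_DIR') or './output'),
    }


dirs = load_pdf_dirs({'PDF_OUTPUT_DIR': '/tmp/pdf-out'})
print(dirs['template_dir'])
print(dirs['output_dir'])
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.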

Script Options

Option     Type    Description
--input    path    Input PDF file
--output   path    Output file path
--pages    string  Page range (e.g., "1-5,8,10-12")
--verbose  flag    Enable verbose logging
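
The script that accepts these options is not included in this document. The following is only a sketch of how the four options could be declared with argparse and how a --pages value such as "1-5,8,10-12" could be expanded into page numbers. The names parse_page_spec and build_parser are hypothetical, and treating pages as 1-indexed with inclusive ranges is an assumption based on the example value.

```python
import argparse
from pathlib import Path


def parse_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "1-5,8,10-12" into sorted, unique page numbers."""
    pages = set()
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue
        if '-' in part:
            start_text, end_text = part.split('-', 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"Invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options; argparse reports bad --pages values."""
    parser = argparse.ArgumentParser(description='PDF script options (illustrative)')
    parser.add_argument('--input', type=Path, help='Input PDF file')
    parser.add_argument('--output', type=Path, help='Output file path')
    parser.add_argument('--pages', type=parse_page_spec,
                        help='Page range (e.g., "1-5,8,10-12")')
    parser.add_argument('--verbose', action='store_true',
                        help='Enable verbose logging')
    return parser


args = build_parser().parse_args(['--input', 'report.pdf', '--pages', '1-3,8'])
print(args.pages)  # [1, 2, 3, 8]
```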

Examples

Example 1: Generate a Security Report PDF

Scenario: Create a professional security assessment report as PDF.

from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Table, TableStyle, Spacer

def generate_security_report(findings: list, output_path: str):
    """Generate a security report PDF from findings data."""
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    styles = getSampleStyleSheet()
    story = []

    # Title
    story.append(Paragraph("Security Assessment Report", styles['Title']))
    story.append(Spacer(1, 20))

    # Executive Summary
    story.append(Paragraph("Executive Summary", styles['Heading1']))
    critical = sum(1 for f in findings if f['severity'] == 'Critical')
    high = sum(1 for f in findings if f['severity'] == 'High')
    story.append(Paragraph(
        f"This assessment identified {critical} critical and {high} high severity findings.",
        styles['Normal']
    ))
    story.append(Spacer(1, 20))

    # Findings Table
    story.append(Paragraph("Findings Summary", styles['Heading1']))
    table_data = [['ID', 'Finding', 'Severity', 'Status']]
    for i, f in enumerate(findings, 1):
        table_data.append([str(i), f['title'], f['severity'], f['status']])

    table = Table(table_data, colWidths=[40, 250, 80, 80])
    table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#2c3e50')),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
        ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.HexColor('#ecf0f1')])
    ]))
    story.append(table)

    doc.build(story)

# Usage
findings = [
    {'title': 'SQL Injection in Login Form', 'severity': 'Critical', 'status': 'Open'},
    {'title': 'Reflected XSS', 'severity': 'High', 'status': 'Open'},
    {'title': 'Missing Security Headers', 'severity': 'Medium', 'status': 'Fixed'}
]
generate_security_report(findings, 'pentest_report.pdf')

Example 2: Extract and Analyze PDF Content

Scenario: Extract text and tables from a vendor security questionnaire.

import pdfplumber
import json

def analyze_questionnaire(pdf_path: str) -> dict:
    """Extract and analyze a security questionnaire PDF."""
    results = {
        'total_pages': 0,
        'questions': [],
        'tables': [],
        'text_content': ''
    }

    with pdfplumber.open(pdf_path) as pdf:
        results['total_pages'] = len(pdf.pages)

        for page_num, page in enumerate(pdf.pages, 1):
            # Extract text
            text = page.extract_text() or ''
            results['text_content'] += f"\n--- Page {page_num} ---\n{text}"

            # Find questions (lines ending with ?)
            for line in text.split('\n'):
                if line.strip().endswith('?'):
                    results['questions'].append({
                        'page': page_num,
                        'question': line.strip()
                    })

            # Extract tables
            for table in page.extract_tables():
                if table:
                    results['tables'].append({
                        'page': page_num,
                        'rows': len(table),
                        'data': table
                    })

    return results

# Usage
analysis = analyze_questionnaire('vendor_questionnaire.pdf')
print(f"Total pages: {analysis['total_pages']}")
print(f"Questions found: {len(analysis['questions'])}")
print(f"Tables found: {len(analysis['tables'])}")

Example 3: Batch PDF Processing

Scenario: Process multiple PDFs, extract metadata, and generate a summary.

import csv
from pathlib import Path

from PyPDF2 import PdfReader

def batch_analyze_pdfs(directory: str, output_csv: str):
    """Analyze all PDFs in a directory and create a summary CSV."""
    pdf_dir = Path(directory)
    results = []

    for pdf_path in pdf_dir.glob('*.pdf'):
        try:
            reader = PdfReader(pdf_path)
            meta = reader.metadata or {}

            results.append({
                'filename': pdf_path.name,
                'pages': len(reader.pages),
                'title': meta.get('/Title', ''),
                'author': meta.get('/Author', ''),
                'encrypted': reader.is_encrypted,
                'size_kb': pdf_path.stat().st_size / 1024
            })
        except Exception as e:
            results.append({
                'filename': pdf_path.name,
                'error': str(e)
            })

    # Write CSV summary. The column list is fixed so that success rows and
    # error rows, which carry different keys, can share one file.
    fieldnames = ['filename', 'pages', 'title', 'author', 'encrypted', 'size_kb', 'error']
    with open(output_csv, 'w', newline='') as f:
        if results:
            writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
            writer.writeheader()
            writer.writerows(results)

    return results

# Usage
summary = batch_analyze_pdfs('./reports/', 'pdf_inventory.csv')

Limitations

Scanned PDFs: Text extraction requires OCR for image-based PDFs (not included by default)
Complex Layouts: Multi-column or heavily formatted PDFs may have extraction issues
Form Fields: Complex interactive forms may not be fully supported
Digital Signatures: Cannot create or verify digital signatures
Encryption: Limited support for encrypted PDFs (password-protected reading only)
Large Files: Very large PDFs (1000+ pages) may require streaming approaches
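
One simple way to keep memory bounded on very large files is to work through the document in fixed-size page batches rather than all at once. This is batching, not true streaming, and the sketch below is only an illustration. The helper name chunk_page_ranges is hypothetical. It yields (start, end) tuples that are 1-indexed and inclusive, the same shape as the page_ranges argument of split_pdf.

```python
from typing import Iterator


def chunk_page_ranges(total_pages: int, chunk_size: int) -> Iterator[tuple[int, int]]:
    """Yield 1-indexed, inclusive (start, end) page ranges of at most chunk_size pages."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    for start in range(1, total_pages + 1, chunk_size):
        # The last chunk is shortened so it never runs past the final page
        yield start, min(start + chunk_size - 1, total_pages)


print(list(chunk_page_ranges(1200, 500)))  # [(1, 500), (501, 1000), (1001, 1200)]
```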

Troubleshooting

Text Extraction Returns Empty

Problem: extract_text() returns empty or garbled text

Solutions:

The PDF may be image-based (scanned). Use OCR:

# Install: pip install pdf2image pytesseract
# pdf2image also needs Poppler and pytesseract needs the Tesseract binary on the system
from pdf2image import convert_from_path
import pytesseract

images = convert_from_path('scanned.pdf')
text = '\n'.join(pytesseract.image_to_string(img) for img in images)

Try different extraction settings:

import pdfplumber

with pdfplumber.open('document.pdf') as pdf:
    page = pdf.pages[0]
    text = page.extract_text(layout=True)  # Preserve layout
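
To decide when the OCR route is worth the extra cost, a rough heuristic is to check how many pages came back with almost no text. The sketch below assumes you have already collected the per-page results of extract_text() in a list. The function name needs_ocr and both thresholds are hypothetical and would need tuning for real documents.

```python
from typing import Optional


def needs_ocr(page_texts: list[Optional[str]], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF is image-based from its per-page extracted text."""
    if not page_texts:
        return False

    # Count pages whose extracted text is missing or nearly empty
    sparse_pages = sum(
        1 for text in page_texts
        if len((text or '').strip()) < min_chars_per_page
    )
    # Treat the document as scanned when more than half the pages are sparse
    return sparse_pages / len(page_texts) > 0.5


print(needs_ocr([None, '', '   ']))
print(needs_ocr(['This page has a normal amount of extracted text.', None]))
```

The heuristic does not catch garbled text from broken font encodings, which is a separate failure mode from a missing text layer.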

Merge Fails with Encrypted PDF

Problem: Cannot merge password-protected PDFs

Solution:

from PyPDF2 import PdfReader

reader = PdfReader('protected.pdf')
if reader.is_encrypted:
    reader.decrypt('password')

# Then proceed with merging

Table Extraction Incorrect

Problem: Table cells are misaligned or merged incorrectly

Solution: Use explicit table settings:

import pdfplumber

with pdfplumber.open('document.pdf') as pdf:
    page = pdf.pages[0]
    # Infer cell boundaries from text alignment instead of ruled lines
    tables = page.extract_tables(table_settings={
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        "snap_tolerance": 3,
    })

Related Skills

docx: Convert between DOCX and PDF formats
xlsx: Extract tabular data for spreadsheet analysis
image-generation: Generate charts and diagrams for PDF reports

References

Detailed API Reference
PyPDF2 Documentation
pdfplumber Documentation
ReportLab User Guide
