pdf

This guide covers essential PDF processing operations using Python libraries and command-line tools.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "pdf" with this command: npx skills add 5dlabs/cto/5dlabs-cto-pdf

PDF Processing Guide

Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools.

Quick Start

from pypdf import PdfReader, PdfWriter

Read a PDF

reader = PdfReader("document.pdf") print(f"Pages: {len(reader.pages)}")

Extract text

text = "" for page in reader.pages: text += page.extract_text()

Python Libraries

pypdf - Basic Operations

Merge PDFs

from pypdf import PdfWriter, PdfReader

writer = PdfWriter() for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]: reader = PdfReader(pdf_file) for page in reader.pages: writer.add_page(page)

with open("merged.pdf", "wb") as output: writer.write(output)

Split PDF

reader = PdfReader("input.pdf") for i, page in enumerate(reader.pages): writer = PdfWriter() writer.add_page(page) with open(f"page_{i+1}.pdf", "wb") as output: writer.write(output)

Extract Metadata

reader = PdfReader("document.pdf") meta = reader.metadata print(f"Title: {meta.title}") print(f"Author: {meta.author}") print(f"Subject: {meta.subject}")

Rotate Pages

reader = PdfReader("input.pdf") writer = PdfWriter() page = reader.pages[0] page.rotate(90) # Rotate 90 degrees clockwise writer.add_page(page) with open("rotated.pdf", "wb") as output: writer.write(output)

pdfplumber - Text and Table Extraction

Extract Text with Layout

import pdfplumber

with pdfplumber.open("document.pdf") as pdf: for page in pdf.pages: text = page.extract_text() print(text)

Extract Tables

import pandas as pd

with pdfplumber.open("document.pdf") as pdf: all_tables = [] for page in pdf.pages: tables = page.extract_tables() for table in tables: if table: df = pd.DataFrame(table[1:], columns=table[0]) all_tables.append(df)

if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)

reportlab - Create PDFs

from reportlab.lib.pagesizes import letter from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter) width, height = letter

Add text

c.drawString(100, height - 100, "Hello World!") c.drawString(100, height - 120, "This is a PDF created with reportlab")

Add a line

c.line(100, height - 140, 400, height - 140)

c.save()

Command-Line Tools

pdftotext (poppler-utils)

Extract text

pdftotext input.pdf output.txt

Extract text preserving layout

pdftotext -layout input.pdf output.txt

Extract specific pages

pdftotext -f 1 -l 5 input.pdf output.txt # Pages 1-5

qpdf

Merge PDFs

qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

Split pages

qpdf input.pdf --pages . 1-5 -- pages1-5.pdf

Rotate pages

qpdf input.pdf output.pdf --rotate=+90:1 # Rotate page 1 by 90 degrees

Remove password

qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf

Common Tasks

Extract Text from Scanned PDFs (OCR)

Requires: pip install pytesseract pdf2image

import pytesseract from pdf2image import convert_from_path

Convert PDF to images

images = convert_from_path('scanned.pdf')

OCR each page

text = "" for i, image in enumerate(images): text += f"Page {i + 1}:\n" text += pytesseract.image_to_string(image) text += "\n\n"

print(text)

Add Watermark

from pypdf import PdfReader, PdfWriter

Create watermark (or load existing)

watermark = PdfReader("watermark.pdf").pages[0]

Apply to all pages

reader = PdfReader("document.pdf") writer = PdfWriter() for page in reader.pages: page.merge_page(watermark) writer.add_page(page)

with open("watermarked.pdf", "wb") as output: writer.write(output)

Extract Images

Using pdfimages (poppler-utils)

pdfimages -j input.pdf output_prefix

Creates output_prefix-000.jpg, output_prefix-001.jpg, etc.

Password Protection

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf") writer = PdfWriter() for page in reader.pages: writer.add_page(page)

Add password

writer.encrypt("userpassword", "ownerpassword") with open("encrypted.pdf", "wb") as output: writer.write(output)

Quick Reference

Task Best Tool Command/Code

Merge PDFs pypdf writer.add_page(page)

Split PDFs pypdf One page per file

Extract text pdfplumber page.extract_text()

Extract tables pdfplumber page.extract_tables()

Create PDFs reportlab Canvas or Platypus

Command line merge qpdf qpdf --empty --pages ...

OCR scanned PDFs pytesseract Convert to image first

Dependencies

Python libraries

pip install pypdf pdfplumber reportlab pytesseract pdf2image

System tools

macOS

brew install poppler tesseract qpdf

Ubuntu/Debian

sudo apt-get install poppler-utils tesseract-ocr qpdf

Attribution

Based on anthropics/skills pdf skill.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

code-review

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

github-mcp

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

test-driven-development

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

project-development

No summary provided by upstream source.

Repository SourceNeeds Review