PDF Processing Guide

Overview

Essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see reference.md. For filling PDF forms, read forms.md and follow its instructions.

Quick Start

from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

text = ""
for page in reader.pages:
    text += page.extract_text()

Python Libraries

pypdf - Basic Operations

Merge PDFs

from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

Split PDF

from pypdf import PdfReader, PdfWriter

# Write each page of the input to its own single-page file
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
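Splitting into one file per page is often finer-grained than needed. The following is a minimal sketch of extracting a chosen set of pages instead. The helpers `parse_page_ranges` and `extract_pages` and the "1-3,5" range syntax are hypothetical names introduced for illustration; they are not part of pypdf.

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based range string like "1-3,5" into 0-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"Range {part!r} is outside 1-{page_count}")
        indices.extend(range(start - 1, end))
    return indices


def extract_pages(src: str, dst: str, spec: str) -> None:
    """Copy the pages named by `spec` from `src` into a new file `dst`."""
    # Imported lazily so the range parser can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as output:
        writer.write(output)
```

A call such as `extract_pages("input.pdf", "subset.pdf", "1-3,5")` would write pages 1, 2, 3, and 5 to `subset.pdf`, assuming the input has at least five pages.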

Extract Metadata

from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
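In pypdf, `reader.metadata` can be `None` when a file has no document information, and individual fields such as `title` can also be `None`. The sketch below shows one way to read them defensively. The helper name `metadata_summary` is hypothetical, and a `SimpleNamespace` stands in for the real metadata object so the example runs without a PDF on disk.

```python
from types import SimpleNamespace


def metadata_summary(meta) -> dict:
    """Return title/author/subject as plain strings, tolerating missing data."""
    fields = ("title", "author", "subject")
    if meta is None:  # the PDF carries no document information at all
        return {name: "" for name in fields}
    # A field that is absent or None becomes an empty string.
    return {name: getattr(meta, name, None) or "" for name in fields}


# Stand-in for reader.metadata (an assumption made for illustration only).
sample = SimpleNamespace(title="Annual Report", author=None, subject=None)
print(metadata_summary(sample))
print(metadata_summary(None))
```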

Rotate Pages

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90)  # rotate the first page 90 degrees clockwise
writer.add_page(page)
with open("rotated.pdf", "wb") as output:
    writer.write(output)

pdfplumber - Text and Table Extraction

Extract Text with Layout

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

Extract Tables

import pdfplumber
import pandas as pd

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # skip empty tables
                # Treat the first row as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
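Tables returned by `page.extract_tables()` are lists of rows, and cells that pdfplumber could not fill come back as `None`. The following sketch cleans one such table before it is handed to pandas or written elsewhere. The helper `table_to_records` and the `column_N` fallback names are hypothetical, and the sample data is invented for illustration.

```python
def table_to_records(table):
    """Convert one pdfplumber-style table (list of rows) into a list of dicts."""
    if not table or len(table) < 2:
        return []
    # Give empty header cells a positional name so every column has a key.
    header = [
        (cell or "").strip() or f"column_{i}"
        for i, cell in enumerate(table[0])
    ]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):  # skip rows that are entirely empty
            continue
        records.append(dict(zip(header, cells)))
    return records


# Invented sample shaped like the output of page.extract_tables()[0].
sample_table = [
    ["Name", None, "Qty"],
    ["Widget", "blue", " 3 "],
    [None, None, None],
    ["Gadget", None, "7"],
]
print(table_to_records(sample_table))
```

This sketch does not deduplicate repeated header names; if two columns share a header, the later one overwrites the earlier one in each record.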

reportlab - Create PDFs

Basic PDF Creation

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
c.drawString(100, height - 100, "Hello World!")
c.line(100, height - 140, 400, height - 140)
c.save()

Multi-Page with Platypus

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Report Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("This is the body of the report. " * 20, styles['Normal']))
story.append(PageBreak())
story.append(Paragraph("Page 2", styles['Heading1']))

doc.build(story)

Subscripts and Superscripts

IMPORTANT: Never use Unicode subscript/superscript characters in ReportLab PDFs. The built-in fonts do not include these glyphs, causing them to render as solid black boxes.

Use ReportLab's XML markup tags in Paragraph objects:

from reportlab.platypus import Paragraph
from reportlab.lib.styles import getSampleStyleSheet

styles = getSampleStyleSheet()

chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles['Normal'])
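When the input text already contains Unicode subscript or superscript characters, it can be converted to the markup above before it reaches a Paragraph. This is a minimal sketch: the helper `to_reportlab_markup` is hypothetical, it covers only digits and plus/minus signs, and it does not escape other characters such as `<` or `&`.

```python
# Map Unicode subscript/superscript characters to their plain equivalents.
SUBSCRIPTS = dict(zip("₀₁₂₃₄₅₆₇₈₉₊₋", "0123456789+-"))
SUPERSCRIPTS = dict(zip("⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻", "0123456789+-"))


def to_reportlab_markup(text: str) -> str:
    """Replace Unicode sub/superscripts with <sub>/<super> tags."""
    out = []
    open_tag = None  # tag currently open: "sub", "super", or None
    for char in text:
        if char in SUBSCRIPTS:
            tag, plain = "sub", SUBSCRIPTS[char]
        elif char in SUPERSCRIPTS:
            tag, plain = "super", SUPERSCRIPTS[char]
        else:
            tag, plain = None, char
        if tag != open_tag:
            # Close the previous run and open a new one where needed.
            if open_tag:
                out.append(f"</{open_tag}>")
            if tag:
                out.append(f"<{tag}>")
            open_tag = tag
        out.append(plain)
    if open_tag:
        out.append(f"</{open_tag}>")
    return "".join(out)


print(to_reportlab_markup("H₂O"))
print(to_reportlab_markup("x² + y²"))
```

Consecutive characters of the same kind share one tag pair, so "C₁₀H₁₆" produces two `<sub>` runs rather than four.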

Command-Line Tools

pdftotext (poppler-utils)

# Extract text
pdftotext input.pdf output.txt

# Extract text, preserving the page layout
pdftotext -layout input.pdf output.txt

# Extract pages 1 through 5 only
pdftotext -f 1 -l 5 input.pdf output.txt

qpdf

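# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Extract pages 1 through 5
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf

# Rotate page 1 by 90 degrees clockwise
qpdf input.pdf output.pdf --rotate=+90:1

# Remove password protection
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf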
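The qpdf commands above can also be driven from Python with the standard `subprocess` module. The sketch below builds the merge command as an argument list, which avoids shell quoting problems with file names. The helpers `qpdf_merge_command` and `run_qpdf` are hypothetical names, and accepting exit status 3 relies on the qpdf manual's description of that status as success with warnings.

```python
import shutil
import subprocess


def qpdf_merge_command(inputs, output):
    """Build the argument list for `qpdf --empty --pages ... -- output`."""
    if not inputs:
        raise ValueError("At least one input file is required")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def run_qpdf(args):
    """Run a qpdf command, raising if qpdf is missing or reports errors."""
    if shutil.which(args[0]) is None:
        raise FileNotFoundError(f"{args[0]} is not installed or not on PATH")
    result = subprocess.run(args)
    # Exit status 3 is treated as "succeeded with warnings" (see lead-in).
    if result.returncode not in (0, 3):
        raise subprocess.CalledProcessError(result.returncode, args)


print(qpdf_merge_command(["file1.pdf", "file2.pdf"], "merged.pdf"))
```

Calling `run_qpdf(qpdf_merge_command(["file1.pdf", "file2.pdf"], "merged.pdf"))` would then perform the merge, provided qpdf is installed and both input files exist.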

Common Tasks

OCR Scanned PDFs

import pytesseract
from pdf2image import convert_from_path

images = convert_from_path('scanned.pdf')
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"
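OCR is slow, so it can help to run it only on pages where ordinary text extraction returned almost nothing. The following is a sketch of that idea, not a tested recipe: `pages_needing_ocr`, `ocr_selected_pages`, and the 20-character threshold are assumptions chosen for illustration.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indices of pages whose extracted text looks empty."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]


def ocr_selected_pages(pdf_path, indices):
    """OCR only the given 0-based pages and return {index: text}."""
    # Imported lazily; requires pytesseract and pdf2image to be installed.
    import pytesseract
    from pdf2image import convert_from_path

    results = {}
    for index in indices:
        # pdf2image page numbers are 1-based.
        images = convert_from_path(
            pdf_path, first_page=index + 1, last_page=index + 1
        )
        results[index] = pytesseract.image_to_string(images[0])
    return results


# Invented sample: text as it might come back from page.extract_text().
sample_texts = ["A full paragraph of extracted text.", "", None, "  \n "]
print(pages_needing_ocr(sample_texts))
```

The per-page texts would typically come from `page.extract_text()` in pypdf or pdfplumber; the threshold should be tuned to the documents at hand.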

Add Watermark

from pypdf import PdfReader, PdfWriter

watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)

Extract Images

pdfimages -j input.pdf output_prefix

Password Protection

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)

writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
    writer.write(output)

Quick Reference

| Task | Best Tool | Command/Code |
|------|-----------|--------------|
| Merge PDFs | pypdf | writer.add_page(page) |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | qpdf --empty --pages ... |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pypdf or pdf-lib | See forms.md |

Next Steps

  • For advanced pypdfium2 usage, see reference.md

  • For JavaScript libraries (pdf-lib), see reference.md

  • For filling PDF forms, follow instructions in forms.md
