Working with Documents
Quick Reference
| Format | Read | Create | Edit |
|--------|------|--------|------|
| DOCX | pandoc, python-docx | docx-js | OOXML (unpack/edit/pack) |
| PDF | pdfplumber, pypdf | reportlab | pypdf (merge/split) |
| PPTX | markitdown | html2pptx | OOXML (unpack/edit/pack) |
Word Documents (.docx)
Reading Content
Convert to markdown (preserves structure)
pandoc document.docx -o output.md
With tracked changes visible
pandoc --track-changes=all document.docx -o output.md
Creating New Documents
Use docx-js (JavaScript):
const fs = require('fs');
const { Document, Packer, Paragraph, TextRun } = require('docx');

const doc = new Document({
  sections: [{
    children: [
      new Paragraph({
        children: [
          new TextRun({ text: "Hello World", bold: true }),
        ],
      }),
    ],
  }],
});

// Serialize the document and write it to disk
Packer.toBuffer(doc).then(buffer => {
  fs.writeFileSync("output.docx", buffer);
});
Editing Existing Documents (Tracked Changes)
1. Unpack
python ooxml/scripts/unpack.py document.docx unpacked/
2. Edit XML files in unpacked/word/document.xml
Key files:
- word/document.xml (main content)
- word/comments.xml (comments)
- word/media/ (images)
3. Pack
python ooxml/scripts/pack.py unpacked/ edited.docx
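The unpack and pack scripts themselves are not shown here. Because a .docx file is a ZIP archive, the round trip can be approximated with the standard library. The sketch below assumes the scripts do little more than extract and re-zip; the real ones may also pretty-print or validate the XML. The function names `unpack` and `pack` are hypothetical.

```python
import zipfile
from pathlib import Path


def unpack(archive_path, out_dir):
    """Extract every part of an OOXML package into out_dir."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(out_dir)


def pack(src_dir, archive_path):
    """Zip src_dir back into an OOXML package.

    Paths are stored relative to src_dir with forward slashes, and
    [Content_Types].xml is written first, the conventional part order.
    """
    src = Path(src_dir)
    files = sorted(p for p in src.rglob("*") if p.is_file())
    # Stable sort: [Content_Types].xml (key False) moves to the front.
    files.sort(key=lambda p: p.name != "[Content_Types].xml")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            zf.write(path, path.relative_to(src).as_posix())
```

Zipping the contents of the directory rather than the directory itself matters: Word rejects a package whose parts sit under an extra top-level folder.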
Tracked changes XML pattern:
<!-- Deletion --> <w:del><w:r><w:delText>old text</w:delText></w:r></w:del>
<!-- Insertion --> <w:ins><w:r><w:t>new text</w:t></w:r></w:ins>
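In documents saved by Word, `w:ins` and `w:del` also carry `w:id`, `w:author`, and `w:date` attributes, which the pattern above leaves out for brevity. The following is a minimal sketch of generating both elements with those attributes filled in. The helper names and the default author are hypothetical, and the output is a fragment meant to be pasted into a paragraph in word/document.xml, not a complete document.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape, quoteattr


def _attrs(change_id, author, when):
    # Revision metadata: unique id, author name, and a UTC timestamp.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f'w:id="{change_id}" w:author={quoteattr(author)} w:date="{stamp}"'


def tracked_deletion(text, change_id, author="Editor", when=None):
    """Return a <w:del> fragment marking `text` as deleted."""
    when = when or datetime.now(timezone.utc)
    return (
        f"<w:del {_attrs(change_id, author, when)}>"
        f'<w:r><w:delText xml:space="preserve">{escape(text)}</w:delText></w:r>'
        "</w:del>"
    )


def tracked_insertion(text, change_id, author="Editor", when=None):
    """Return a <w:ins> fragment marking `text` as inserted."""
    when = when or datetime.now(timezone.utc)
    return (
        f"<w:ins {_attrs(change_id, author, when)}>"
        f'<w:r><w:t xml:space="preserve">{escape(text)}</w:t></w:r>'
        "</w:ins>"
    )


print(tracked_deletion("old text", 1))
print(tracked_insertion("new text", 2))
```

Escaping the text keeps characters such as `<` and `&` from breaking the XML, and `xml:space="preserve"` keeps leading and trailing spaces in the run.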
PDF Documents
Reading PDFs
import pdfplumber

# Extract text
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())

# Extract tables
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
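Each table returned by `extract_tables()` is a list of rows, and each row is a list of cell strings with `None` for empty cells. Printing rows is fine for inspection, but saving them usually means CSV. Here is a small sketch of that step using only the standard library. The `sample` rows stand in for real pdfplumber output, and `table_to_csv` is a hypothetical helper name.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one extracted table (a list of rows) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table:
        # Replace None (empty cell) with "" and trim stray whitespace.
        writer.writerow([(cell or "").strip() for cell in row])
    return buf.getvalue()


# Stand-in for one table from page.extract_tables()
sample = [["Item", "Qty"], ["Widget", "3"], ["Gadget", None]]
print(table_to_csv(sample))
```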
Creating PDFs
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = [
    Paragraph("Report Title", styles['Title']),
    Paragraph("Body text goes here.", styles['Normal']),
]
doc.build(story)
Merging/Splitting PDFs
from pypdf import PdfReader, PdfWriter

# Merge: append every page of each input to a single writer
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as f:
    writer.write(f)

# Split: write each page to its own file
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as f:
        writer.write(f)
Command-Line Tools
Extract text
pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt  # Preserve layout
Merge with qpdf
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
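One detail is easy to trip over when moving between the two tools: qpdf page ranges such as `1-5` are 1-based and inclusive, while `reader.pages` in pypdf is zero-indexed. The sketch below converts a simple range string into zero-based indices. It handles only the `N` and `N-M` forms (qpdf's own syntax is richer), and `range_to_indices` is a hypothetical helper name.

```python
def range_to_indices(spec, page_count):
    """Convert a 1-based range string like "1-5,8" to zero-based indices."""
    indices = []
    for part in spec.split(","):
        start, _, end = part.strip().partition("-")
        first = int(start)
        last = int(end) if end else first
        if not 1 <= first <= last <= page_count:
            raise ValueError(f"range {part!r} is outside 1..{page_count}")
        # Inclusive 1-based [first, last] becomes range(first - 1, last).
        indices.extend(range(first - 1, last))
    return indices


print(range_to_indices("1-5", 10))
```

With pypdf, the result can drive a loop such as `for i in range_to_indices("1-5", len(reader.pages)): writer.add_page(reader.pages[i])`.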
PowerPoint Presentations (.pptx)
Reading Content
Convert to markdown
python -m markitdown presentation.pptx
Creating New Presentations
Use html2pptx workflow:
- Create HTML slides (720pt × 405pt for 16:9)
- Convert with html2pptx.js library
- Validate with thumbnail grid
Create thumbnails for validation
python scripts/thumbnail.py output.pptx --cols 4
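The 720pt × 405pt slide size in the workflow above can be checked with a little arithmetic. This worked example confirms the 16:9 ratio and converts the size to inches and to EMU (English Metric Units, the unit OOXML uses for slide dimensions), assuming the standard 72 points and 914,400 EMU per inch.

```python
from fractions import Fraction

POINTS_PER_INCH = 72
EMU_PER_POINT = 12700  # 914,400 EMU per inch / 72 points per inch

width_pt, height_pt = 720, 405

aspect = Fraction(width_pt, height_pt)
size_in = (width_pt / POINTS_PER_INCH, height_pt / POINTS_PER_INCH)
size_emu = (width_pt * EMU_PER_POINT, height_pt * EMU_PER_POINT)

print(aspect)    # 16/9
print(size_in)   # (10.0, 5.625)
print(size_emu)  # (9144000, 5143500)
```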
Editing Existing Presentations
1. Unpack
python ooxml/scripts/unpack.py presentation.pptx unpacked/
Key files:
- ppt/slides/slide1.xml, slide2.xml, etc.
- ppt/notesSlides/ (speaker notes)
- ppt/media/ (images)
2. Edit XML
3. Validate
python ooxml/scripts/validate.py unpacked/ --original presentation.pptx
4. Pack
python ooxml/scripts/pack.py unpacked/ edited.pptx
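For the "Edit XML" step, slide text lives in `a:t` elements in the DrawingML namespace. The following is a minimal sketch of a text replacement with the standard library. The `replace_slide_text` helper is hypothetical, and the inline slide is a heavily simplified stand-in for a real ppt/slides/slide1.xml.

```python
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"

# Keep the conventional prefixes when the tree is serialized again.
ET.register_namespace("a", A_NS)
ET.register_namespace("p", P_NS)


def replace_slide_text(xml_text, old, new):
    """Replace `old` with `new` in every <a:t> run; return (xml, count)."""
    root = ET.fromstring(xml_text)
    changed = 0
    for node in root.iter(f"{{{A_NS}}}t"):
        if node.text and old in node.text:
            node.text = node.text.replace(old, new)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Simplified stand-in for a slide part
slide = (
    f'<p:sld xmlns:p="{P_NS}" xmlns:a="{A_NS}">'
    "<p:cSld><p:spTree><p:sp><p:txBody><a:p><a:r>"
    "<a:t>Q3 Results</a:t>"
    "</a:r></a:p></p:txBody></p:sp></p:spTree></p:cSld></p:sld>"
)
updated, count = replace_slide_text(slide, "Q3", "Q4")
print(count, updated)
```

Two caveats apply. A phrase split across several runs (for example by mixed formatting) will not match, since each `a:t` is checked on its own. Real slides also use namespaces beyond `a` and `p`, and ElementTree renames any prefix that has not been registered, so register those too before writing a file back.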
Rearranging Slides
Duplicate, reorder, delete slides
python scripts/rearrange.py template.pptx output.pptx 0,3,3,5,7
Creates: slide 0, slide 3 (twice), slide 5, slide 7
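The index list reads as a selection over the template's slides: each number picks a slide, repeats duplicate it, and omitted numbers drop it. A small sketch of that mapping follows, assuming zero-based indices as the "slide 0" in the example suggests. It models the semantics on a plain list and is not the code of scripts/rearrange.py, which is not shown here.

```python
def rearrange(slides, order):
    """Return slides selected by a comma-separated, zero-based index list."""
    indices = [int(i) for i in order.split(",")]
    for i in indices:
        if not 0 <= i < len(slides):
            raise IndexError(f"slide index {i} is outside 0..{len(slides) - 1}")
    return [slides[i] for i in indices]


template = [f"slide {n}" for n in range(8)]
print(rearrange(template, "0,3,3,5,7"))
# ['slide 0', 'slide 3', 'slide 3', 'slide 5', 'slide 7']
```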
Converting Between Formats
DOCX/PPTX to PDF
soffice --headless --convert-to pdf document.docx
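When the conversion runs from a script, it helps to control where the PDF lands and to fail loudly on errors. Here is a hedged sketch of a wrapper around the same command, using soffice's `--outdir` option. The function names and the 120-second timeout are illustrative assumptions, and `soffice` must be on the PATH.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, out_dir="."):
    """Build the soffice argument list for a headless PDF conversion."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_pdf(source, out_dir=".", timeout=120):
    """Run the conversion and return the expected output path."""
    subprocess.run(build_convert_command(source, out_dir),
                   check=True, timeout=timeout)
    # soffice names the output after the input file's stem.
    return Path(out_dir) / (Path(source).stem + ".pdf")
```

Calling `convert_to_pdf("document.docx", "out")` would run the command and return `out/document.pdf`.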
PDF to images
pdftoppm -jpeg -r 150 document.pdf page
Creates: page-1.jpg, page-2.jpg, etc. (page numbers are zero-padded when the PDF has 10 or more pages, e.g. page-01.jpg)
DOCX to Markdown
pandoc document.docx -o output.md
OCR for Scanned Documents
import pytesseract
from pdf2image import convert_from_path

# Render each page to an image, then run OCR page by page
images = convert_from_path('scanned.pdf')
text = ""
for image in images:
    text += pytesseract.image_to_string(image)
Design Guidelines (Presentations)
Color Palettes
Pick 3-5 colors that work together:
| Palette | Colors |
|---------|--------|
| Classic Blue | Navy #1C2833, Slate #2E4053, Silver #AAB7B8 |
| Teal & Coral | Teal #5EA8A7, Coral #FE4447, White #FFFFFF |
| Black & Gold | Gold #BF9A4A, Black #000000, Cream #F4F6F6 |
Web-Safe Fonts Only
Arial, Helvetica, Times New Roman, Georgia, Verdana, Tahoma, Trebuchet MS, Courier New, Impact
Layout Rules
- Two-column: Use for exactly 2 distinct items
- Three-column: Use for exactly 3 items
- Never vertically stack charts below text
- Full-bleed images with text overlays work well
Dependencies
Python
pip install pypdf pdfplumber reportlab python-docx openpyxl pytesseract pdf2image "markitdown[pptx]"
System tools
apt-get install pandoc poppler-utils libreoffice qpdf tesseract-ocr
Node.js (for docx-js)
npm install docx
Verification
Run: python scripts/verify.py
Related Skills
- working-with-spreadsheets: Excel file handling
- building-nextjs-apps: Frontend for document uploads