markitdown

Document to Markdown Conversion

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "markitdown" with this command: npx skills add rysweet/amplihack/rysweet-amplihack-markitdown

Overview

Convert various document formats to clean Markdown using Microsoft's MarkItDown tool. Optimized for LLM processing, content extraction, and document analysis workflows.

Supported Formats: PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx/.xls), Images (with OCR/LLM), HTML, Audio (with transcription), CSV, JSON, XML, ZIP archives, EPubs

Quick Start

Basic Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

Command Line

Convert single file

markitdown document.pdf > output.md
markitdown document.pdf -o output.md

Pipe input

cat document.pdf | markitdown

🔒 Security Considerations

Before using in production:

  • ✅ Validate file types (MIME, not extension)

  • ✅ Limit file sizes (prevent DoS)

  • ✅ Sanitize file paths (prevent traversal)

  • ✅ Protect API keys (never hardcode)

  • ✅ Consider data privacy (external services)

See patterns.md for implementation details.
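As one way to approach the first item in the checklist above, the sketch below guesses a file's type from its leading bytes rather than its extension. The helper names `sniff_type` and `is_allowed` and the signature table are illustrative assumptions, not part of MarkItDown; a production system might prefer a dedicated library such as libmagic.

```python
from typing import Optional, Set

# Leading "magic" bytes for a few of the supported formats.
# This table is illustrative, not exhaustive.
_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),  # also the container for .docx/.pptx/.xlsx/.epub
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff_type(header: bytes) -> Optional[str]:
    """Return a MIME type guessed from the first bytes, or None if unknown."""
    for magic, mime in _SIGNATURES:
        if header.startswith(magic):
            return mime
    return None


def is_allowed(path: str, allowed: Set[str]) -> bool:
    """Read only the first few bytes of the file and check them against an allowlist."""
    with open(path, "rb") as fh:
        return sniff_type(fh.read(16)) in allowed
```

A caller would run `is_allowed(path, {"application/pdf"})` before handing the path to `md.convert`. Because Office and EPUB files share the ZIP signature, this check alone cannot tell them apart.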

API Key Security

❌ NEVER:

  • Hardcode keys in code

  • Commit .env files to git

  • Log environment variables

✅ ALWAYS:

  • Use environment variables: export OPENAI_API_KEY="sk-..."

  • Use secret management (AWS Secrets Manager, Azure Key Vault)

  • Rotate keys regularly

Common Patterns

PDF Documents

Basic PDF conversion

md = MarkItDown()
result = md.convert("report.pdf")

With Azure Document Intelligence (better quality)

md = MarkItDown(docintel_endpoint="<your-endpoint>")
result = md.convert("report.pdf")

Office Documents

Word documents - preserves structure

result = md.convert("document.docx")

Excel - converts tables to markdown tables

result = md.convert("spreadsheet.xlsx")

PowerPoint - extracts slide content

result = md.convert("presentation.pptx")

Images with Descriptions

✅ SECURE: Using environment variables for API keys

import os

from markitdown import MarkItDown
from openai import OpenAI

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY not set")

client = OpenAI(api_key=api_key)
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.jpg")  # Gets AI-generated description

Batch Processing

from pathlib import Path

md = MarkItDown()
documents = Path(".").glob("*.pdf")

for doc in documents:
    result = md.convert(str(doc))
    output_path = doc.with_suffix(".md")
    output_path.write_text(result.text_content)
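The loop above stops at the first file that fails to convert. A minimal sketch of a more forgiving variant follows, assuming the caller passes in a `convert` callable that maps a file path to Markdown text. The function name `convert_folder` and the shape of the report are hypothetical choices for illustration.

```python
from pathlib import Path
from typing import Callable, Dict, List


def convert_folder(
    folder: Path,
    convert: Callable[[str], str],
    pattern: str = "*.pdf",
) -> Dict[str, List[str]]:
    """Convert every matching file, recording failures instead of aborting."""
    report: Dict[str, List[str]] = {"converted": [], "failed": []}
    for doc in sorted(folder.glob(pattern)):
        try:
            text = convert(str(doc))
        except Exception as exc:  # keep going; one bad file should not stop the batch
            report["failed"].append(f"{doc.name}: {exc}")
            continue
        # Write the Markdown next to the source file, e.g. report.pdf -> report.md
        doc.with_suffix(".md").write_text(text, encoding="utf-8")
        report["converted"].append(doc.name)
    return report
```

With MarkItDown this could be called as `convert_folder(Path("."), lambda p: md.convert(p).text_content)`, reusing a single `md = MarkItDown()` instance for the whole batch.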

Installation

Full installation (all features)

pip install 'markitdown[all]'

Selective features

pip install 'markitdown[pdf, docx, pptx]'

Requirements: Python 3.10 or higher

Key Features

  • Structure Preservation: Maintains headings, lists, tables, links

  • Plugin System: Extend with custom converters

  • Docker Support: Containerized deployments

  • MCP Integration: Model Context Protocol server for LLM apps

When to Read Supporting Files

reference.md - Read when you need:

  • Complete API reference and all configuration options

  • Azure Document Intelligence integration details

  • Plugin development guide

  • Docker and MCP server setup

  • Troubleshooting and error handling

examples.md - Read when you need:

  • Working examples for specific file types

  • Batch processing workflows

  • Error handling patterns

  • Integration with existing pipelines

patterns.md - Read when you need:

  • Production deployment patterns

  • Performance optimization strategies

  • Security considerations

  • Anti-patterns to avoid

Quick Reference

| File Type | Use Case | Command |
|---|---|---|
| PDF | Reports, papers | md.convert("file.pdf") |
| Word | Documents | md.convert("file.docx") |
| Excel | Data tables | md.convert("file.xlsx") |
| PowerPoint | Presentations | md.convert("file.pptx") |
| Images | Diagrams with OCR | md = MarkItDown(llm_client=client); md.convert("img.jpg") |
| HTML | Web pages | md.convert("page.html") |
| ZIP | Archives | md.convert("archive.zip") (processes contents) |

⚠️ Common Mistakes to Avoid

Anti-Pattern 1: Hardcoded API Keys

❌ NEVER DO THIS

md = MarkItDown(llm_client=OpenAI(api_key="sk-hardcoded-key"))

✅ ALWAYS DO THIS

api_key = os.getenv("OPENAI_API_KEY")
md = MarkItDown(llm_client=OpenAI(api_key=api_key))

Anti-Pattern 2: Unvalidated File Paths

❌ Vulnerable to path traversal

user_input = "../../../etc/passwd"
md.convert(user_input)

✅ Validate and sanitize

from pathlib import Path

allowed_dir = Path("uploads").resolve()  # example value: the only directory users may read from
safe_path = Path(user_input).resolve()
if not safe_path.is_relative_to(allowed_dir):
    raise ValueError("Invalid path")
md.convert(str(safe_path))
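The same check can be wrapped in a reusable helper. The sketch below is one possible shape, assuming user input should always be interpreted relative to a single base directory. The name `resolve_inside` is hypothetical, and `Path.is_relative_to` requires Python 3.9 or newer.

```python
from pathlib import Path


def resolve_inside(base_dir: Path, user_input: str) -> Path:
    """Resolve user_input under base_dir, rejecting anything that escapes it."""
    base = base_dir.resolve()
    # Joining first means relative input is interpreted under base;
    # absolute input replaces base entirely and is caught by the check below.
    candidate = (base / user_input).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"Path escapes {base}: {user_input!r}")
    return candidate
```

Resolving before comparing matters: `..` segments and symlinks are collapsed first, so the containment test runs against the real target rather than the raw string.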

Anti-Pattern 3: Ignoring File Size Limits

❌ Can cause DoS

md.convert("huge_file.pdf") # No size check

✅ Check size first

max_size = 50 * 1024 * 1024  # 50 MB
if Path("file.pdf").stat().st_size > max_size:
    raise ValueError("File too large")
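A small guard function keeps the limit in one place so every call site applies it before conversion. This is a sketch only: `check_size` and `MAX_BYTES` are hypothetical names, and the 50 MB figure is simply the value used in the snippet above.

```python
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024  # 50 MB, matching the limit shown above


def check_size(path: str, max_bytes: int = MAX_BYTES) -> int:
    """Return the file size in bytes, or raise ValueError if it exceeds max_bytes."""
    size = Path(path).stat().st_size
    if size > max_bytes:
        raise ValueError(f"File too large: {size} bytes (limit {max_bytes})")
    return size
```

Calling `check_size("file.pdf")` immediately before `md.convert("file.pdf")` rejects oversized input without reading its contents, since `stat` only inspects file metadata.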

Common Issues

Import Error: Ensure Python >= 3.10 and that markitdown is installed

Missing Dependencies: Install with pip install 'markitdown[all]'

Image Descriptions Not Working: Requires LLM client (OpenAI or compatible)

For detailed troubleshooting, see reference.md.
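The first of the issues above can be checked programmatically before any conversion is attempted. The following sketch uses only the standard library. The function name `diagnose` is hypothetical, and it checks only that a package named `markitdown` is importable, not which optional extras are installed.

```python
import importlib.util
import sys
from typing import List, Tuple


def diagnose(min_version: Tuple[int, int] = (3, 10)) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    if sys.version_info < min_version:
        found = ".".join(str(n) for n in sys.version_info[:3])
        wanted = ".".join(str(n) for n in min_version)
        problems.append(f"Python {wanted}+ required, found {found}")
    # find_spec returns None when the top-level package cannot be located
    if importlib.util.find_spec("markitdown") is None:
        problems.append("markitdown is not installed: pip install 'markitdown[all]'")
    return problems
```

Printing the result of `diagnose()` at startup gives a quick answer to whether an import failure comes from the interpreter version or from a missing package.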

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
