to-markdown

Convert any file or URL to clean Markdown. Handles PDF, Word (DOCX), Excel (XLSX), PowerPoint (PPTX), HTML, images (EXIF + OCR), audio (transcription), CSV, JSON, XML, YouTube URLs, EPubs, and more. Output is optimised for LLM pipelines, knowledge bases, and document ingestion workflows. Use this skill whenever the user wants to: convert a file to markdown, extract text from a document, scrape a URL to markdown, turn a PDF into readable text, "get the content of this file", ingest documents for RAG, prepare files for an LLM, or says "convert to md / markdown". Trigger on: "convert to markdown", "extract text from PDF", "turn this into markdown", "scrape this URL", "file to md", "document ingestion", "read this PDF", "get contents of", "parse this document", "ingest for RAG".

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "to-markdown" with this command: npx skills add mathews-tom/praxis-skills/mathews-tom-praxis-skills-to-markdown

To Markdown

Convert any file or URL to clean Markdown using MarkItDown as the conversion engine, with a lightweight fetch layer for URLs.

Reference Files

| File | Purpose |
| --- | --- |
| references/formats.md | Per-format handling notes, internal engines, known gaps |
| references/fetch.md | URL fetch layer: trafilatura + Playwright strategies |
| references/install.md | Dependency install guide for all variants |

Decision Tree

Determine the input type before touching any tool:

Input type?
  Local file path        -> markitdown directly
  URL
    YouTube URL          -> markitdown directly (transcript extraction built-in)
    Static page          -> trafilatura fetch -> markitdown on HTML result
    JS-rendered / auth   -> Playwright fetch -> markitdown on result
  Pasted HTML string     -> markitdown directly on string

Do not use web_fetch or WebFetch for URLs — route through the fetch layer described in references/fetch.md to preserve the conversion pipeline.
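The branch on input type in the tree above can be sketched as a small classifier. This is a minimal sketch, not part of the skill itself: the function name `detect_input_type` is hypothetical, and note that "static page" vs "JS-rendered" cannot be decided from the string alone — that split only emerges after a trafilatura attempt comes back empty.

```python
from urllib.parse import urlparse

def detect_input_type(source: str) -> str:
    """Classify an input per the decision tree: youtube, url, html-string, or file."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        # YouTube URLs go straight to markitdown (built-in transcript extraction)
        if "youtube.com" in host or "youtu.be" in host:
            return "youtube"
        return "url"  # static vs JS-rendered is decided later, by trying trafilatura first
    if source.lstrip().startswith("<"):
        return "html-string"
    return "file"
```

A caller would route "youtube" and "file" directly to markitdown, "url" through the fetch layer, and "html-string" to markitdown on the raw string.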

Core Conversion Workflow

Step 1: Ensure dependencies

uv pip show markitdown || uv pip install 'markitdown[all]' trafilatura

See references/install.md for selective installs and full dependency table.

Step 2: Convert

from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False)
result = md.convert("path/to/file.pdf")
print(result.text_content)

Step 3: Workflow

  1. Detect input type (file path, URL, raw HTML).
  2. If URL, run fetch layer first (see references/fetch.md).
  3. Run markitdown conversion on the local file or fetched content.
  4. Post-process if needed (strip boilerplate, trim to main content).
  5. Write output or return inline per output conventions below.
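The five steps above can be sketched as one routing function. To keep the sketch runnable without network access, the fetcher and converter are injected as callables; in practice `fetch_html` would be `trafilatura.fetch_url` (or the Playwright fallback) and `convert_file` a thin wrapper around `MarkItDown().convert`. The function name and signature are illustrative, not part of the skill's API.

```python
import os
import tempfile
from urllib.parse import urlparse

def convert_input(source, fetch_html, convert_file):
    """Route source through the workflow: fetch URLs first, then convert.

    fetch_html(url) -> HTML string; convert_file(path) -> Markdown string.
    """
    if urlparse(source).scheme in ("http", "https"):
        html = fetch_html(source)
        if not html:
            raise RuntimeError(f"Fetch returned nothing for {source}")
        # markitdown converts files, so stage the fetched HTML in a temp file
        with tempfile.NamedTemporaryFile(
            "w", suffix=".html", delete=False, encoding="utf-8"
        ) as f:
            f.write(html)
            tmp = f.name
        try:
            return convert_file(tmp)
        finally:
            os.unlink(tmp)
    return convert_file(source)  # local file path: convert directly
```

Injecting the two callables also makes the empty-fetch escalation point explicit: a falsy fetch result raises before conversion is ever attempted.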

Output Conventions

| Context | Output behaviour |
| --- | --- |
| Single file, user wants file | Write <input_stem>.md to same directory |
| Single file, inline request | Return Markdown in conversation |
| Batch (multiple files) | Write each to <stem>.md, summarise what was produced |
| URL | Write <slug>.md to current directory or return inline |
| Piped into another workflow | Return result.text_content string only |

Default: "convert this file" -> write a file. "Read this" or "what does this say" -> return inline.
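The file-naming rows of the table can be sketched as a small helper. This is an assumption-laden sketch: the `output_path` name is hypothetical, and the URL slug rule (lowercase host + path, non-alphanumerics collapsed to hyphens) is one reasonable reading of "<slug>.md", not a convention the skill mandates.

```python
import re
from pathlib import Path
from urllib.parse import urlparse

def output_path(source: str) -> Path:
    """Derive the .md output path: <input_stem>.md beside a file, <slug>.md for a URL."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        raw = (parsed.netloc + parsed.path).rstrip("/")
        slug = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
        return Path(f"{slug or 'page'}.md")
    return Path(source).with_suffix(".md")
```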

Output Example

Source (two-column PDF with a table):

Annual Report 2024                    Financial Highlights
Revenue grew 12% year-over-year...    | Metric   | 2023  | 2024  |
                                      | Revenue  | $4.2B | $4.7B |
                                      | EBITDA   | $1.1B | $1.3B |

Converted Markdown:

# Annual Report 2024

Revenue grew 12% year-over-year...

## Financial Highlights

| Metric  | 2023  | 2024  |
| ------- | ----- | ----- |
| Revenue | $4.2B | $4.7B |
| EBITDA  | $1.1B | $1.3B |

Multi-column layouts merge into linear flow. Tables are preserved as Markdown tables. Headings are inferred from font size/weight.

LLM Image Description (opt-in)

Markitdown supports an llm_client for image description in PPTX and image files. Never enable by default — it incurs cost, latency, and unexpected API calls. Prompt the user first: "This file contains images. Do you want me to use Claude to describe them? This will make additional API calls."

import anthropic
from markitdown import MarkItDown

client = anthropic.Anthropic()
md = MarkItDown(llm_client=client, llm_model="claude-sonnet-4-20250514")
result = md.convert("presentation.pptx")

Error Handling

| Severity | Condition | Action |
| --- | --- | --- |
| Terminal | Unsupported format (no converter exists) | Report to user immediately; do not retry |
| Terminal | Password-protected Office file | Report to user; no programmatic workaround |
| Terminal | File not found / path invalid | Report exact path; ask user to verify |
| Recover | Empty output from PDF | Likely scanned; escalate to OCR path in references/formats.md |
| Recover | Missing optional dependency (e.g. playwright) | Install the dependency, then retry the conversion |
| Recover | URL fetch returns paywall page | Report fetch limitation; do not retry or attempt bypass |
| Recover | trafilatura returns empty | Escalate to Playwright fetch strategy per references/fetch.md |

Minimal guard for the empty-output case:

result = md.convert(path)
if not result.text_content.strip():
    raise ValueError(f"No text extracted from {path}. See references/formats.md for OCR options.")

Never silently return empty Markdown. Surface the failure with the severity and a pointer to the relevant reference file.

Known Gaps and Escalation

  • HTML fidelity: markitdown uses html2text internally — complex layouts lose structure. For high-fidelity HTML conversion where DOM structure matters, suggest Turndown via Node subprocess.
  • Hard paywalls: The fetch layer returns the regwall page, not the content. This is a fetch limitation, not a conversion problem.
  • Scanned PDFs (image-only, no text layer): markitdown returns near-empty output. Escalate to OCR workflow (Azure Document Intelligence or Tesseract). See references/formats.md.
  • Protected Office files: Password-protected DOCX/XLSX will fail. Inform the user.

Calibration Rules

  1. Converted output must contain at least 10 words per page of source document. Below this threshold, treat as empty extraction and escalate per the error handling table.
  2. Tables in the source must appear as Markdown tables in the output — if a table is present in the original but missing in the conversion, flag it to the user.
  3. Heading hierarchy from the source document must be preserved (H1 > H2 > H3). Flat output with no headings from a structured document indicates a conversion quality issue.
  4. For URL conversions, output must not contain navigation elements, cookie banners, or footer boilerplate. If present, re-run through trafilatura with include_tables=True to strip boilerplate.
  5. Multi-sheet XLSX must produce one clearly labeled section per sheet. Missing sheets indicate a partial conversion — report which sheets were extracted.
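Rule 1 above is mechanically checkable. A minimal sketch, assuming the caller already knows the source page count (e.g. from a PDF library); the helper name `passes_density_check` is hypothetical.

```python
def passes_density_check(markdown: str, page_count: int,
                         min_words_per_page: int = 10) -> bool:
    """Rule 1: flag near-empty extractions (typically scanned, image-only PDFs)."""
    words = len(markdown.split())
    return words >= min_words_per_page * max(page_count, 1)
```

A failing check should route into the "Empty output from PDF" row of the error-handling table rather than silently returning thin Markdown.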

Limitations

  • No paywall bypass. Document it, don't attempt it.
  • No Turndown integration built-in. Different runtime (Node.js).
  • No scheduled/batch crawling. One conversion per invocation.
  • No output format other than Markdown.
  • Auto-generated YouTube captions may contain errors for technical terms.
  • Scanned PDFs require external OCR — markitdown alone returns empty output.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
