Upstage Document Parse
Convert documents into structured HTML/Markdown. Recognizes layout elements such as tables, images, equations, and charts with bounding box coordinates.
Quick Start
import os
import requests
with open("report.pdf", "rb") as f:
response = requests.post(
"https://api.upstage.ai/v1/document-digitization",
headers={"Authorization": f"Bearer {os.environ['UPSTAGE_API_KEY']}"},
files={"document": f},
data={"model": "document-parse", "output_formats": "['markdown']"}
)
print(response.json()["content"]["markdown"])
API Key: Always use os.environ["UPSTAGE_API_KEY"]. Get your key at console.upstage.ai.
Supported Formats
JPEG, PNG, BMP, PDF (up to 1000 pages with async), TIFF, HEIC, DOCX, PPTX, XLSX, HWP, HWPX
Sync vs Async
| Mode | Endpoint | Max pages | Max file size | Notes |
|---|---|---|---|---|
| Sync | /v1/document-digitization | 100 | 50 MB | Result returned in response (5 min server timeout). Best for ≤ 100 pages and quick turnaround. |
| Async | /v1/document-digitization/async | 1000 | 50 MB | Returns request_id; processed in 10-page batches. Use when document exceeds sync limits or sync would time out. |
Decision rule:
- ≤ 100 pages and expected to finish within 5 min → sync.
- 100 pages, scanned/complex content, or batch jobs → async.
For async submit/poll workflow, see references/async-workflow.md.
Key Parameters (Sync)
| Parameter | Default | Common Values |
|---|---|---|
model | required | document-parse |
output_formats | ['html'] | ['markdown'], ['html', 'markdown'] |
mode | standard | enhanced (complex tables), auto |
ocr | auto | force (always OCR scanned PDFs) |
coordinates | true | false to omit bounding boxes |
For full parameter reference and curl variations (enhanced mode, force OCR, base64 table images, LangChain integration), see references/sync-options.md.
Response Structure
{
"api": "2.0",
"model": "document-parse-251217",
"content": {
"html": "<h1>...</h1>",
"markdown": "# ...",
"text": "..."
},
"elements": [
{
"id": 0,
"category": "heading1",
"content": { "html": "...", "markdown": "...", "text": "..." },
"page": 1,
"coordinates": [{"x": 0.06, "y": 0.05}, ...]
}
],
"usage": { "pages": 1 }
}
Element Categories
paragraph, heading1, heading2, heading3, list, table, figure, chart, equation, caption, header, footer, index, footnote
Output Files
- Default: write to
<system-temp>/<input-stem>.parsed.<ext>where<ext>matchesoutput_formats(mdorhtml). Example:/tmp/report.parsed.md. Usetempfile.gettempdir()for cross-platform code. - Override: if the user specifies an output path, use it.
- Always print the resolved absolute path in your response so the user can locate the file.
Tips
- Use
mode=enhancedfor complex tables, charts, images - Use
mode=autoto let API decide per page - Use async API for documents > 100 pages, > 50 MB, or when sync would exceed the 5-min timeout (async caps at 1000 pages)
- Use
ocr=forcefor scanned PDFs or images merge_multipage_tables=truecombines split tables (max 20 pages with enhanced mode)- Standard documents process in ~3 seconds; sync API timeout is 5 minutes
Detailed References
| File | Content |
|---|---|
references/sync-options.md | Full sync parameter reference, mode selection, curl variations, LangChain |
references/async-workflow.md | Async submit/poll/status, Python polling pattern, retention rules |