Gemini Document Processing

Process and analyze PDF documents using Google Gemini's native vision capabilities. Extract structured information, summarize content, answer questions, and understand complex documents with text, images, diagrams, charts, and tables.

Core Capabilities

PDF Vision Processing: Native understanding of PDFs up to 1,000 pages (258 tokens/page)
Multimodal Analysis: Process text, images, diagrams, charts, and tables
Structured Extraction: Output to JSON with schema validation
Document Q&A: Answer questions based on document content
Summarization: Generate summaries preserving context
Format Conversion: Transcribe to HTML while preserving layout

When to Use This Skill

Use this skill when you need to:

Extract structured data from PDF documents (invoices, resumes, forms)
Summarize long documents or reports
Answer questions about PDF content
Analyze documents with complex layouts, charts, or diagrams
Convert PDFs to structured formats (JSON, HTML)
Process multiple documents in batch
Build document processing pipelines

Quick Setup

API Key Configuration

The skill supports both Google AI Studio and Vertex AI endpoints.

Option 1: Google AI Studio (Default)

The skill checks for GEMINI_API_KEY in this priority order:

Process environment variable
Project root .env
.claude/.env
.claude/skills/.env
.env file in skill directory (.claude/skills/gemini-document-processing/.env )

Get your API key: https://aistudio.google.com/apikey

Environment Variable (Recommended)

export GEMINI_API_KEY="your-api-key-here"

Or in .env file:

echo "GEMINI_API_KEY=your-api-key-here" > .env

Option 2: Vertex AI

To use Vertex AI instead:

Enable Vertex AI

export GEMINI_USE_VERTEX=true export VERTEX_PROJECT_ID=your-gcp-project-id export VERTEX_LOCATION=us-central1 # Optional, defaults to us-central1

Or in .env file:

GEMINI_USE_VERTEX=true VERTEX_PROJECT_ID=your-gcp-project-id VERTEX_LOCATION=us-central1

Install Dependencies

pip install google-genai python-dotenv

Common Use Cases

Extract Structured Data from PDF

Use the provided script

python .claude/skills/gemini-document-processing/scripts/process-document.py
--file invoice.pdf
--prompt "Extract invoice details as JSON"
--format json

Summarize Long Document

Process and summarize

python .claude/skills/gemini-document-processing/scripts/process-document.py
--file report.pdf
--prompt "Provide a concise executive summary"

Answer Questions About Document

Q&A on document content

python .claude/skills/gemini-document-processing/scripts/process-document.py
--file contract.pdf
--prompt "What are the key terms and conditions?"

Process with Python SDK

from google import genai

client = genai.Client()

Read PDF

with open('document.pdf', 'rb') as f: pdf_data = f.read()

Process document

response = client.models.generate_content( model='gemini-2.5-flash', contents=[ 'Extract key information from this document', genai.types.Part.from_bytes( data=pdf_data, mime_type='application/pdf' ) ] )

print(response.text)

Structured Output with JSON Schema

from google import genai from pydantic import BaseModel

class InvoiceData(BaseModel): invoice_number: str date: str total: float vendor: str

client = genai.Client()

response = client.models.generate_content( model='gemini-2.5-flash', contents=[ 'Extract invoice details', genai.types.Part.from_bytes( data=open('invoice.pdf', 'rb').read(), mime_type='application/pdf' ) ], config=genai.types.GenerateContentConfig( response_mime_type='application/json', response_schema=InvoiceData ) )

invoice_data = InvoiceData.model_validate_json(response.text)

Key Constraints

Format: Only PDFs get vision processing (TXT, HTML, Markdown are text-only)
Size: < 20MB use inline encoding, > 20MB use File API
Pages: Max 1,000 pages per document
Storage: File API stores for 48 hours only
Cost: 258 tokens per page (fixed, regardless of content density)

Performance Tips

Use Inline Encoding for PDFs < 20MB (simpler, single request)
Use File API for larger files or repeated queries (enables context caching)
Place Prompt After PDF for single-page documents
Use Context Caching when querying same PDF multiple times
Process in Parallel for multiple independent documents
Use gemini-2.5-flash for best price/performance ratio

Decision Guide

PDF < 20MB? ├─ Yes → Use inline base64 encoding └─ No → Use File API

Need structured JSON output? ├─ Yes → Define response_schema with Pydantic └─ No → Get text response

Multiple queries on same PDF? ├─ Yes → Use File API + Context Caching └─ No → Inline encoding is sufficient

Script Reference

The skill includes a ready-to-use processing script:

Basic usage

python scripts/process-document.py --file document.pdf --prompt "Your prompt"

With JSON output

python scripts/process-document.py --file document.pdf --prompt "Extract data" --format json

With File API (for large files)

python scripts/process-document.py --file large-document.pdf --prompt "Summarize" --use-file-api

Multiple prompts

python scripts/process-document.py --file document.pdf --prompt "Question 1" --prompt "Question 2"

References

For comprehensive documentation, see:

references/gemini-document-processing-report.md
Complete API reference
references/quick-reference.md
Quick lookup guide
references/code-examples.md
Additional code patterns

Troubleshooting

API Key Not Found:

Check API key is set

./scripts/check-api-key.sh

File Too Large:

Use File API for files > 20MB
Add --use-file-api flag to the script

Vision Not Working:

Ensure file is PDF format
Other formats (TXT, HTML) don't support vision processing

Support

API Documentation: https://ai.google.dev/gemini-api/docs/document-processing
Get API Key: https://aistudio.google.com/apikey
Model Info: https://ai.google.dev/gemini-api/docs/models/gemini

gemini-document-processing

Safety Notice

Copy this and send it to your AI assistant to learn

Enable Vertex AI

Use the provided script

Process and summarize

Q&A on document content

Read PDF

Process document

Basic usage

With JSON output

With File API (for large files)

Multiple prompts

Check API key is set

Source Transparency

Related Skills

imagemagick

gemini-video-understanding

remix-icon

obsidian-qa-saver