SARIF Parsing Best Practices
Parse, analyze, and process SARIF files from static analysis tools like CodeQL, Semgrep, and others.
When to Use
-
Reading or interpreting static analysis scan results in SARIF format
-
Aggregating findings from multiple security tools
-
Deduplicating or filtering security alerts
-
Extracting specific vulnerabilities from SARIF files
-
Integrating SARIF data into CI/CD pipelines
-
Converting SARIF output to other formats
SARIF Structure Overview
SARIF 2.1.0 is the current OASIS standard:
sarifLog ├── version: "2.1.0" └── runs[] (array of analysis runs) ├── tool │ ├── driver │ │ ├── name (required) │ │ ├── version │ │ └── rules[] (rule definitions) │ └── extensions[] (plugins) ├── results[] (findings) │ ├── ruleId │ ├── level (error/warning/note) │ ├── message.text │ ├── locations[] │ │ └── physicalLocation │ │ ├── artifactLocation.uri │ │ └── region (startLine, startColumn, etc.) │ ├── fingerprints{} │ └── partialFingerprints{} └── artifacts[] (scanned files metadata)
Why Fingerprinting Matters
Without stable fingerprints, you can't track findings across runs:
-
Baseline comparison: "Is this a new finding or did we see it before?"
-
Regression detection: "Did this PR introduce new vulnerabilities?"
-
Suppression: "Ignore this known false positive in future runs"
Tool Selection Guide
Use Case Tool Installation
Quick CLI queries jq brew install jq / apt install jq
Python scripting (simple) pysarif pip install pysarif
Python scripting (advanced) sarif-tools pip install sarif-tools
.NET applications SARIF SDK NuGet package
JavaScript/Node.js sarif-js npm package
Quick Analysis with jq
Pretty print the file
jq '.' results.sarif
Count total findings
jq '[.runs[].results[]] | length' results.sarif
List all rule IDs triggered
jq '[.runs[].results[].ruleId] | unique' results.sarif
Extract errors only
jq '.runs[].results[] | select(.level == "error")' results.sarif
Get findings with file locations
jq '.runs[].results[] | { rule: .ruleId, message: .message.text, file: .locations[0].physicalLocation.artifactLocation.uri, line: .locations[0].physicalLocation.region.startLine }' results.sarif
Filter by severity and get count per rule
jq '[.runs[].results[] | select(.level == "error")] | group_by(.ruleId) | map({rule: .[0].ruleId, count: length})' results.sarif
Python with sarif-tools
from sarif import loader
Load single file
sarif_data = loader.load_sarif_file("results.sarif")
Or load multiple files
sarif_set = loader.load_sarif_files(["tool1.sarif", "tool2.sarif"])
Get summary report
report = sarif_data.get_report()
Get histogram by severity
errors = report.get_issue_type_histogram_for_severity("error") warnings = report.get_issue_type_histogram_for_severity("warning")
Filter results
high_severity = [r for r in sarif_data.get_results() if r.get("level") == "error"]
sarif-tools CLI commands:
Summary of findings
sarif summary results.sarif
List all results with details
sarif ls results.sarif
Get results by severity
sarif ls --level error results.sarif
Diff two SARIF files (find new/fixed issues)
sarif diff baseline.sarif current.sarif
Convert to other formats
sarif csv results.sarif > results.csv sarif html results.sarif > report.html
Aggregating Multiple SARIF Files
import json from pathlib import Path
def aggregate_sarif_files(sarif_paths: list[str]) -> dict: """Combine multiple SARIF files into one.""" aggregated = { "version": "2.1.0", "$schema": "https://json.schemastore.org/sarif-2.1.0.json", "runs": [] }
for path in sarif_paths:
with open(path) as f:
sarif = json.load(f)
aggregated["runs"].extend(sarif.get("runs", []))
return aggregated
def deduplicate_results(sarif: dict) -> dict: """Remove duplicate findings based on fingerprints.""" seen_fingerprints = set()
for run in sarif["runs"]:
unique_results = []
for result in run.get("results", []):
# Use partialFingerprints or create key from location
fp = None
if result.get("partialFingerprints"):
fp = tuple(sorted(result["partialFingerprints"].items()))
elif result.get("fingerprints"):
fp = tuple(sorted(result["fingerprints"].items()))
else:
# Fallback: create fingerprint from rule + location
loc = result.get("locations", [{}])[0]
phys = loc.get("physicalLocation", {})
fp = (
result.get("ruleId"),
phys.get("artifactLocation", {}).get("uri"),
phys.get("region", {}).get("startLine")
)
if fp not in seen_fingerprints:
seen_fingerprints.add(fp)
unique_results.append(result)
run["results"] = unique_results
return sarif
CI/CD Integration
GitHub Actions
-
name: Upload SARIF uses: github/codeql-action/upload-sarif@v3 with: sarif_file: results.sarif
-
name: Check for high severity run: | HIGH_COUNT=$(jq '[.runs[].results[] | select(.level == "error")] | length' results.sarif) if [ "$HIGH_COUNT" -gt 0 ]; then echo "Found $HIGH_COUNT high severity issues" exit 1 fi
Fail on New Issues
from sarif import loader
def check_for_regressions(baseline: str, current: str) -> int: """Return count of new issues not in baseline.""" baseline_data = loader.load_sarif_file(baseline) current_data = loader.load_sarif_file(current)
baseline_fps = {get_fingerprint(r) for r in baseline_data.get_results()}
new_issues = [r for r in current_data.get_results()
if get_fingerprint(r) not in baseline_fps]
return len(new_issues)
Common Pitfalls and Solutions
Path Normalization Issues
from urllib.parse import unquote from pathlib import Path
def normalize_path(uri: str, base_path: str = "") -> str: """Normalize SARIF artifact URI to consistent path.""" # Remove file:// prefix if present if uri.startswith("file://"): uri = uri[7:]
# URL decode
uri = unquote(uri)
# Handle relative paths
if not Path(uri).is_absolute() and base_path:
uri = str(Path(base_path) / uri)
return str(Path(uri))
Safe Data Access
def safe_get_location(result: dict) -> tuple[str, int]: """Safely extract file and line from result.""" try: loc = result.get("locations", [{}])[0] phys = loc.get("physicalLocation", {}) file_path = phys.get("artifactLocation", {}).get("uri", "unknown") line = phys.get("region", {}).get("startLine", 0) return file_path, line except (IndexError, KeyError, TypeError): return "unknown", 0
Key Principles
-
Validate first: Check SARIF structure before processing
-
Handle optionals: Many fields are optional; use defensive access
-
Normalize paths: Tools report paths differently; normalize early
-
Fingerprint wisely: Combine multiple strategies for stable deduplication
-
Stream large files: Use ijson or similar for 100MB+ files
Resources
-
OASIS SARIF 2.1.0 Specification
-
GitHub SARIF Support
-
SARIF Validator
Attribution
Based on trailofbits/skills sarif-parsing skill.