DVC Pipeline Auditor
Audit DVC (Data Version Control) pipelines for reproducibility, storage efficiency, and operational correctness. Reviews dvc.yaml pipeline definitions, .dvc file tracking, remote storage configuration, parameter management, metric collection, and dependency chains. Acts as a senior ML engineer auditing your data versioning and pipeline infrastructure.
Usage
Basic: Audit the DVC pipeline in /path/to/project/
Focused: Check pipeline stage dependencies | Find .dvc files out of sync | Analyze DVC storage usage | Review params.yaml structure
How It Works
Step 1: Discover DVC Project Structure
find /path/to/project -name ".dvc" -type d
cat /path/to/project/.dvc/config
find /path/to/project -name "dvc.yaml" -o -name "dvc.lock" -o -name "*.dvc"
find /path/to/project -name "params.yaml" -o -name "params.json"
Parses pipeline stages (cmd, deps, outs, params, metrics, plots), lock file hashes, .dvc tracking metadata, and remote configuration.
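A minimal dvc.yaml sketch of the fields the auditor parses (stage names, paths, and parameter keys are illustrative, not from a real project):

```yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    params:
      - prepare.split_ratio
    outs:
      - data/prepared/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/prepared/
    params:
      - train.learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false     # small text file: commit to git, not the DVC cache
    plots:
      - plots/loss.csv:
          cache: false
```

Each stage's deps/outs also feed the dependency-graph validation in Step 3.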
Step 2: Audit Pipeline Stages
Stage: prepare
PASS: Script in deps — changes trigger re-run
PASS: Params explicitly listed, output directory tracked
Stage: train
PASS: Dependencies chain from prepare outputs
PASS: Metrics with cache:false — committed to git
FAIL: No evaluation stage defined
Pipeline trains but never evaluates on holdout set
FIX: Add evaluate stage with model + test data as deps
FAIL: Stage "featurize" missing dependency
cmd uses config.yaml but it's NOT listed in deps
RISK: config changes won't trigger re-run
FIX: Add "config.yaml" to deps
FAIL: Stage "train" has undeclared output
Script writes logs/training.log but not listed in outs
FIX: Declare logs/training.log in outs (cache: false) or write logs outside tracked paths
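Applied to dvc.yaml, the fixes above might look like this (stage layout and paths are assumptions based on the findings):

```yaml
stages:
  featurize:
    cmd: python src/featurize.py
    deps:
      - src/featurize.py
      - config.yaml          # previously missing: config changes now trigger re-run
      - data/prepared/
    outs:
      - data/features/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features/
    outs:
      - models/model.pkl
      - logs/training.log:   # previously undeclared output
          cache: false
  evaluate:                  # previously missing stage
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/test/
    metrics:
      - eval/metrics.json:
          cache: false
```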
Step 3: Validate Dependency Graph
prepare -> featurize -> train -> evaluate -> export_model
Depth: 4 | No circular deps | No orphan stages
FAIL: Implicit dependency — "train" reads data/features/features.csv
produced by "featurize", but deps list only data/prepared/ (prepare's output)
FIX: Add data/features/ (the featurize output) to train's deps
WARN: "export_model" has no downstream consumers — verify terminal stage
Step 4: Check Lock File Integrity
FAIL: dvc.lock is STALE
Stage "prepare" lock hash doesn't match current src/prepare.py
RISK: Pipeline results don't reflect current code
FIX: Run `dvc repro` (re-runs "prepare" and any stale downstream stages)
FAIL: dvc.lock references deleted file src/old_featurize.py
FIX: Update dvc.yaml dep, run `dvc repro featurize`
WARN: dvc.lock not committed to git
RISK: Collaborators can't reproduce exact pipeline state
FIX: Ensure dvc.lock is tracked (NOT in .gitignore)
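A simplified staleness check, assuming the relevant dvc.lock entries have already been parsed into dicts (`file_md5` and `stale_deps` are illustrative helpers; DVC records md5 hashes for file-level deps):

```python
import hashlib
import os

def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file's contents, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def stale_deps(lock_deps, root="."):
    """Compare recorded hashes against the working tree.

    `lock_deps` is a list of {"path": ..., "md5": ...} entries pulled from
    a stage in dvc.lock (parsing the YAML itself is left out of this sketch).
    Returns (path, reason) pairs for every mismatch.
    """
    stale = []
    for entry in lock_deps:
        path = os.path.join(root, entry["path"])
        if not os.path.exists(path):
            stale.append((entry["path"], "deleted"))
        elif file_md5(path) != entry["md5"]:
            stale.append((entry["path"], "modified"))
    return stale
```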
Step 5: Audit .dvc File Tracking
8 tracked files, 4.7 GB total
FAIL: data/raw/users.csv — hash mismatch
Local file modified but .dvc not updated
FIX: `dvc add data/raw/users.csv`
FAIL: models/model_v2.pkl — .dvc exists but file missing locally
Not in cache either. FIX: `dvc pull models/model_v2.pkl`
WARN: 3 large files (1.4 GB total) NOT tracked by DVC
data/embeddings/vectors.npy (890 MB)
models/backup_model.pkl (320 MB)
data/external/reference.parquet (180 MB)
RISK: Committed to git (bloating) or untracked (lost on clone)
FIX: `dvc add <file>` for each
WARN: data/raw/transactions.csv — 2.1 GB CSV
Consider Parquet for 60-80% size reduction
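The large-file scan can be sketched as a heuristic: flag anything over a size threshold with no sibling .dvc pointer file. This is an assumption-laden simplification — files tracked via dvc.yaml outs have no pointer file and would be false positives here:

```python
import os

def large_untracked_files(root, threshold_bytes=100 * 1024 * 1024):
    """Find files over `threshold_bytes` with no sibling .dvc pointer file."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip DVC's own metadata and git internals
        dirnames[:] = [d for d in dirnames if d not in (".dvc", ".git")]
        for name in filenames:
            if name.endswith(".dvc"):
                continue
            path = os.path.join(dirpath, name)
            if (os.path.getsize(path) >= threshold_bytes
                    and not os.path.exists(path + ".dvc")):
                hits.append(path)
    return sorted(hits)
```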
Step 6: Review Remote Storage
Remotes: "storage" (default, s3://ml-data-bucket/dvc-store/),
"backup" (gs://backup-bucket/dvc/)
FAIL: No authentication configured for "storage"
.dvc/config.local missing. `dvc push`/`pull` will fail.
FIX: `dvc remote modify --local storage access_key_id <KEY>` (repeat for secret_access_key)
FAIL: Remote "backup" has never been pushed to
Disaster recovery not functioning.
FIX: `dvc push -r backup`
WARN: No shared cache configured
FIX: `dvc cache dir /shared/dvc-cache` on shared machines
Storage: 4.7 GB tracked, 3.2 GB remote (32% dedup savings)
RECOMMEND: `dvc gc -w -c` to clean ~800 MB unused cache (gc requires a scope flag such as -w)
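A sketch of the expected remote configuration, with credentials kept out of the committed config (bucket names taken from the audit above; key values are placeholders):

```ini
# .dvc/config -- committed to git
[core]
    remote = storage
['remote "storage"']
    url = s3://ml-data-bucket/dvc-store/
['remote "backup"']
    url = gs://backup-bucket/dvc/

# .dvc/config.local -- git-ignored, per-user credentials
['remote "storage"']
    access_key_id = <KEY>
    secret_access_key = <SECRET>
```

`dvc remote modify --local ...` writes to config.local so secrets never land in git history.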
Step 7: Audit Parameters and Metrics
FAIL: Duplicate parameter "learning_rate" in params.yaml AND params/train.yaml
Values differ (0.01 vs 0.001). FIX: Single source of truth
FAIL: Unused param "model.dropout" — no stage references it
FIX: Remove or wire into training script
WARN: Hardcoded values in src/train.py
batch_size=64, num_epochs=100 should be in params.yaml
WARN: Only 1 plot defined. Add confusion matrix, ROC, feature importance.
WARN: No `dvc exp` usage — switch to `dvc exp run` for experiment tracking
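One way to consolidate: keep a single params.yaml and declare per-stage consumption in dvc.yaml (values and keys are illustrative):

```yaml
# params.yaml -- single source of truth
train:
  learning_rate: 0.001
  batch_size: 64        # moved out of src/train.py
  num_epochs: 100
model:
  dropout: 0.2          # now actually consumed by the train stage

# dvc.yaml (excerpt): declare which params each stage reads
stages:
  train:
    cmd: python src/train.py
    params:
      - train             # the whole train group
      - model.dropout     # a single dotted key
```

With params declared this way, `dvc repro` re-runs only the stages whose parameters changed, and `dvc exp run --set-param train.learning_rate=0.01` can sweep values without editing files.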
Step 8: Check Reproducibility
FAIL: Python version not pinned (no .python-version)
FAIL: Deps not fully pinned: "scikit-learn>=1.0", "pandas" (any version)
FIX: `pip freeze > requirements.txt` for exact versions
FAIL: Random seed only set for numpy, not Python random or torch
RISK: Non-deterministic results across runs
WARN: System dep (ffmpeg) not documented — affects output but untracked
WARN: No CI/CD running `dvc repro --dry` on PRs
Reproducibility Score: 45/100
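A seed helper covering the three RNG sources flagged above (a sketch: numpy and torch are imported lazily so it degrades gracefully when either is absent):

```python
import os
import random

def set_global_seeds(seed: int = 42) -> None:
    """Seed every RNG the pipeline touches (extend for other libraries)."""
    random.seed(seed)                      # Python's built-in RNG
    # NOTE: PYTHONHASHSEED only affects *subprocesses*; export it before
    # launching the interpreter to cover the current process as well.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)               # NumPy's legacy global RNG
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)            # CPU RNG
        torch.cuda.manual_seed_all(seed)   # all CUDA devices
    except ImportError:
        pass
```

Call it once at the top of every stage script so reruns of the same stage are bit-for-bit comparable.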
Step 9: Final Report
# DVC Pipeline Audit Report
## Overall Health Score: 54/100
Pipeline structure: 7/10
Lock integrity:     4/10
Data tracking:      5/10
Remote storage:     5/10
Parameters:         5/10
Metrics/plots:      4/10
Reproducibility:    3/10
Stage deps:         6/10
## Critical Issues
1. Lock file stale — results don't match current code
2. Hash mismatch on users.csv — data change not versioned
3. 1.4 GB of large files not tracked by DVC
4. No remote authentication — push/pull will fail
5. Dependencies not pinned — non-reproducible environment
## High Priority
6. Missing evaluation stage
7. Duplicate parameter definitions
8. No CI/CD pipeline validation on PRs
9. Hardcoded hyperparameters
10. Partial random seeds — non-deterministic results
Output
- Pipeline graph with dependency validation and issue annotations
- Lock file integrity check with staleness detection
- Data tracking audit covering hash mismatches and untracked large files
- Remote storage review for auth, efficiency, and backup status
- Parameter audit for duplicates, hardcoded values, unused params
- Reproducibility score covering deps, seeds, CI/CD validation
- Health score 0-100 with per-category breakdown and remediation commands
Tips for Best Results
- Point the agent at your project root (where .dvc/ directory lives)
- Include source code (src/) to detect hardcoded values
- Share requirements.txt for dependency analysis
- Run after adding new pipeline stages to catch dependency issues
- Combine with mlops-experiment-tracker for full ML workflow audit