# Human Extractor Skill

## Description
GPU-accelerated pipeline for detecting, tracking, and classifying humans in dashcam footage. Processes MP4 videos to extract human crops with optional CLIP-based head covering classification, saving all outputs to a unified directory with comprehensive indexing.

## Purpose
Extract visual evidence of human presence from dashcam recordings for investigative analysis. Optimized for high throughput using NVDEC decoding, batched YOLOv8 detection, ByteTrack multi-object tracking, and optional CLIP classification.
## Usage

### Basic Invocation
> "Extract humans from Park_R videos on October 6, 2025"

### Advanced Invocation
> "Scan Park_R\20251006 and 20251007, keep only frames with people, save all outputs in one folder, add one full-frame per timestamp with boxes, use my GPU at max, filter for head-covered individuals at 80% confidence"
## Input Parameters

### Required
- `roots` (list[str]): One or more source directories containing MP4 files
  - Example: `["G:\My Drive\PROJECTS\INVESTIGATION\DASHCAM\Park_R\20251006"]`

### Core Detection
- `confidence` (float, default: 0.35): YOLOv8 detection confidence threshold (0.0-1.0)
- `iou` (float, default: 0.50): IoU threshold for NMS (0.0-1.0)
- `yolo_batch` (int, default: 64): YOLOv8 batch size (32-128 depending on VRAM)

### CLIP Filtering (Optional)
- `clip_filter.enabled` (bool, default: false): Enable head covering classification
- `clip_filter.threshold` (float, default: 0.80): CLIP confidence threshold
- `clip_filter.batch` (int, default: 384): CLIP batch size (256-512)

### Hardware Acceleration
- `nvdec` (bool, default: true): Use NVIDIA hardware video decoding
- `gpu_id` (int, default: 0): CUDA device ID

### Output Control
- `single_output_dir` (str, default: `parsed\ALL_CROPS`): Unified output directory
- `save_full_frame` (bool, default: false): Save one annotated full-frame per timestamp
- `full_frame_maxw` (int, default: 1280): Max width for full-frame saves
- `draw_boxes` (bool, default: true): Annotate boxes on full-frames

### Filename Convention
- `filename_version` (str, default: "v1"): Version tag for output filenames

### Deduplication
- `dedup.enabled` (bool, default: true): Enable similarity deduplication
- `dedup.ssim` (float, default: 0.92): SSIM threshold (0.0-1.0)
- `dedup.rate_cap_per_track_per_min` (int, default: 12): Max crops saved per track per minute
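One way the per-track rate cap could work is a rolling 60-second window per track ID. The sketch below is illustrative only; `TrackRateCap` is a hypothetical name, not the pipeline's actual class:

```python
from collections import defaultdict, deque

class TrackRateCap:
    """Sketch of the rate cap: allow at most `max_per_min` saved crops
    per track within any rolling 60-second window (hypothetical helper)."""

    def __init__(self, max_per_min: int = 12):
        self.max_per_min = max_per_min
        self.history = defaultdict(deque)  # track_id -> saved timestamps (ms)

    def allow(self, track_id: int, ts_ms: int) -> bool:
        window = self.history[track_id]
        # Drop timestamps that have aged out of the 60 s window.
        while window and ts_ms - window[0] >= 60_000:
            window.popleft()
        if len(window) >= self.max_per_min:
            return False
        window.append(ts_ms)
        return True

cap = TrackRateCap(max_per_min=2)
print(cap.allow(17, 0))       # True
print(cap.allow(17, 1_000))   # True
print(cap.allow(17, 2_000))   # False (2 crops already in the window)
print(cap.allow(17, 61_000))  # True  (earlier crops aged out)
```

The SSIM check and this cap work together: a crop must be visually novel *and* under the per-track budget before it is written.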
### Parallel Processing
- `parallel.dates` (list[str], optional): Process multiple dates concurrently
- `parallel.max_workers` (int, default: 3): Max parallel date workers
## Output Format

### Success Response
```json
{
  "status": "ok",
  "summary": {
    "videos_processed": 142,
    "crops_saved": 4414,
    "frames_saved": 728,
    "gpu_util_avg": 0.85,
    "processing_time_sec": 2847,
    "errors": 0
  },
  "artifacts": {
    "index_csv": "G:\\My Drive\\PROJECTS\\APPS\\Human_Detection\\parsed\\ALL_CROPS\\INDEX.csv",
    "output_dir": "G:\\My Drive\\PROJECTS\\APPS\\Human_Detection\\parsed\\ALL_CROPS",
    "log_file": "G:\\My Drive\\PROJECTS\\APPS\\Human_Detection\\parsed\\ALL_CROPS\\run_20251006_143022.log"
  },
  "performance": {
    "nvdec_active": true,
    "yolo_batch": 64,
    "clip_batch": 384,
    "avg_fps": 48.3,
    "vram_peak_gb": 9.2
  },
  "notes": [
    "NVDEC hardware decoding active",
    "Batched YOLO=64, CLIP=384",
    "GPU utilization: 85%"
  ]
}
```

### Error Response
```json
{
  "status": "error",
  "error": "CUDA out of memory",
  "suggestion": "Reduce batch sizes: yolo_batch=48, clip_batch=256",
  "partial_results": {
    "videos_processed": 67,
    "crops_saved": 2103
  }
}
```
## Output Structure

### Directory Layout
```
parsed\ALL_CROPS
├── INDEX.csv                     # Global master index
├── INDEX.20251006_pid1234.csv    # Shard (pre-merge)
├── run_20251006_143022.log       # Execution log
│
│   # Crop files (per person detection)
├── 20251006__20251006142644_070785B__t15234__f365__trk017__x1014y46w266h659__c85__v1.webp
├── 20251006__20251006143844_070787B__t8420__f202__trk003__x234y567w180h420__c92__v1.webp
│
│   # Full-frame files (optional, one per timestamp)
├── 20251006__20251006142644_070785B__t15234__FRAME__v1.webp
└── 20251006__20251006143844_070787B__t8420__FRAME__v1.webp
```
## Filename Convention

**Crop format:**
```
<date>__<video_stem>__t<ts_ms>__f<frame_idx>__trk<track_id>__x<x1>y<y1>w<w>h<h>__c<covered_0to100>__v<ver>.webp
```

Example: `20251006__20251006142644_070785B__t15234__f365__trk017__x1014y46w266h659__c85__v1.webp`

Decoded:
- Date: 2025-10-06
- Video: 20251006142644_070785B.MP4
- Timestamp: 15234 ms
- Frame: 365
- Track: 17
- BBox: x=1014, y=46, w=266, h=659
- CLIP confidence: 85% (head covering)
- Version: v1

**Full-frame format:**
```
<date>__<video_stem>__t<ts_ms>__FRAME__v<ver>.webp
```

Example: `20251006__20251006142644_070785B__t15234__FRAME__v1.webp`
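Because every field is machine-readable, downstream tools can recover the full detection context from the filename alone. A sketch of a parser for the documented crop format (the regex and `parse_crop_name` are illustrative, not shipped code):

```python
import re

# Regex mirroring the documented crop filename convention (illustrative).
CROP_RE = re.compile(
    r"^(?P<date>\d{8})__(?P<stem>.+?)"
    r"__t(?P<ts_ms>\d+)__f(?P<frame>\d+)__trk(?P<track>\d+)"
    r"__x(?P<x>\d+)y(?P<y>\d+)w(?P<w>\d+)h(?P<h>\d+)"
    r"__c(?P<covered>\d+)__v(?P<ver>\d+)\.webp$"
)

def parse_crop_name(name: str) -> dict:
    m = CROP_RE.match(name)
    if not m:
        raise ValueError(f"not a crop filename: {name}")
    return {
        "date": m["date"],
        "video_stem": m["stem"],
        "ts_ms": int(m["ts_ms"]),
        "frame_idx": int(m["frame"]),
        "track_id": int(m["track"]),
        "bbox": (int(m["x"]), int(m["y"]), int(m["w"]), int(m["h"])),
        "covered_conf": int(m["covered"]),  # 0-100 scale, as in the filename
        "version": int(m["ver"]),
    }

info = parse_crop_name(
    "20251006__20251006142644_070785B__t15234__f365__trk017"
    "__x1014y46w266h659__c85__v1.webp"
)
print(info["track_id"], info["bbox"], info["covered_conf"])
# 17 (1014, 46, 266, 659) 85
```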
## INDEX.csv Schema

```
dataset,date,video_rel,video_stem,frame_idx,ts_ms,track_id,x1,y1,w,h,person_conf,covered_conf,file_type,crop_file,sha1,bboxes_json,annotated,pipeline_ver,yolo_batch,clip_batch,nvdec,created_utc
```

Example rows:

```
Park_R,20251006,20251006\20251006142644_070785B.MP4,20251006142644_070785B,365,15234,17,1014,46,266,659,0.92,0.85,crop,20251006__20251006142644_070785B__t15234__f365__trk017__x1014y46w266h659__c85__v1.webp,a3f2c8b9...,,,v1,64,384,1,2025-10-06T14:30:22Z
Park_R,20251006,20251006\20251006142644_070785B.MP4,20251006142644_070785B,365,15234,,,,,,,frame,20251006__20251006142644_070785B__t15234__FRAME__v1.webp,d4e1a2c7...,"[{""x1"":1014,""y1"":46,""w"":266,""h"":659,""conf"":0.92,""track"":17}]",1,v1,64,384,1,2025-10-06T14:30:22Z
```
**Column definitions:**

- `dataset`: Source camera (Park_R, Park_F, Movie_F, Movie_R)
- `date`: YYYYMMDD
- `video_rel`: Relative path from dataset root
- `video_stem`: Filename without .MP4 extension
- `frame_idx`: Frame number in video
- `ts_ms`: Timestamp in milliseconds
- `track_id`: ByteTrack ID (empty for FRAME rows)
- `x1,y1,w,h`: Bounding box (empty for FRAME rows)
- `person_conf`: YOLOv8 detection confidence (0-1)
- `covered_conf`: CLIP head covering confidence (0-1 scale in the CSV, encoded as 0-100 in filenames; empty if disabled)
- `file_type`: "crop" or "frame"
- `crop_file`: Relative filename
- `sha1`: File hash for integrity
- `bboxes_json`: All detections in the frame (FRAME rows only)
- `annotated`: 1 if boxes drawn on frame, 0 otherwise
- `pipeline_ver`: Semantic version tag
- `yolo_batch`: YOLO batch size used
- `clip_batch`: CLIP batch size used (0 if disabled)
- `nvdec`: 1 if NVDEC used, 0 otherwise
- `created_utc`: ISO 8601 timestamp
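Because the index is plain CSV, post-run filtering is straightforward in pandas. A minimal sketch (the inline sample rows and column subset are illustrative, not real pipeline output):

```python
import io

import pandas as pd

# Tiny inline sample standing in for a real INDEX.csv (subset of columns).
sample = io.StringIO(
    "dataset,date,track_id,person_conf,covered_conf,file_type,crop_file\n"
    "Park_R,20251006,17,0.92,0.85,crop,a.webp\n"
    "Park_R,20251006,3,0.88,0.40,crop,b.webp\n"
    "Park_R,20251006,,,,frame,c.webp\n"
)
df = pd.read_csv(sample)

# Keep only crop rows whose CLIP head-covering score clears the 0.80 threshold.
# FRAME rows have empty covered_conf (NaN), so they drop out automatically.
covered = df[(df["file_type"] == "crop") & (df["covered_conf"] >= 0.80)]
print(covered["crop_file"].tolist())  # ['a.webp']
```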
## Implementation Details

### Processing Pipeline
```
[MP4 Videos]
     │
     ▼
[NVDEC Decoder (GPU)]      RGB tensor → CUDA Stream A
     │
     ▼
[YOLOv8s Detection]        Batched (64 frames), FP16, conf=0.35
     │
     ▼
[ByteTrack Tracking]       IoU=0.5, max_age=10
     │
     ├──────────────────►  [Full-Frame Saver]
     │                     (optional, downscaled, annotated)
     ▼
[ROI Align (GPU)]          Extract crops on GPU
     │
     ▼
[CLIP Classification]  ◄── (optional)
     │                     Batched (384 crops), FP16, threshold=0.80
     ▼
[Deduplication Filter]     SSIM ≥ 0.92, rate cap: 12/min/track
     │
     ▼
[Async I/O Thread Pool]    WebP encode (q=85), shard INDEX writes
     │
     ▼
[Final Merge]              INDEX.csv
```
### GPU Optimization Strategy

**Dual CUDA streams:**
- Stream A: YOLOv8 detection
- Stream B: CLIP classification
- Overlap compute and memory transfers

**Dynamic batching:**
- Accumulate frames until the batch size is reached
- Flush immediately on timeout (100 ms)
- Keep the GPU pipeline full (80-90% utilization)

**Memory management:**
- Pinned memory for faster CPU↔GPU transfers
- Pre-allocated tensor buffers
- Stream-ordered operations

**Decoder priority:**
- NVDEC (GPU hardware decoder), 5-10x faster
- CPU fallback (OpenCV) if NVDEC is unavailable
- Multi-threaded DataLoader (8-12 workers)
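The dynamic-batching rule above (fill the batch, but flush whatever has accumulated once the timeout expires) can be sketched in plain Python. `collect_batch` and the queue are illustrative stand-ins for the real frame pipeline:

```python
import time
from queue import Queue, Empty

def collect_batch(frames: Queue, batch_size: int = 64,
                  timeout_s: float = 0.1) -> list:
    """Gather frames until the batch is full, but flush early once the
    timeout expires so the GPU never sits idle waiting for a full batch."""
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout: flush a partial batch
        try:
            batch.append(frames.get(timeout=remaining))
        except Empty:
            break  # producer has nothing queued; flush what we have
    return batch

q = Queue()
for i in range(10):
    q.put(i)
print(len(collect_batch(q, batch_size=4)))   # 4 (full batch)
print(len(collect_batch(q, batch_size=64)))  # 6 (flushed on timeout)
```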
### Performance Targets (RTX 4080 16GB)

| Metric | Target | Notes |
|---|---|---|
| GPU utilization | 80-90% | NVDEC + dual streams + large batches |
| Throughput | 3-4 videos/min | Parking videos (2 FPS sampling) |
| VRAM usage | 6-10 GB | YOLO=64, CLIP=384 |
| Latency | <30 s per video | Including decode, detect, track, classify |
### Configuration Tuning

**If GPU utilization < 70%:**
- Increase batch sizes: `yolo_batch=80`, `clip_batch=448`
- Verify NVDEC is active (check `nvdec_active` in the response)
- Increase parallel workers: `max_workers=4`

**If CUDA OOM:**
- Reduce the CLIP batch first: `clip_batch=256`
- Then reduce the YOLO batch: `yolo_batch=48`
- Disable full-frame saves: `save_full_frame=false`

**If disk I/O is the bottleneck:**
- Disable full-frame saves: `save_full_frame=false`
- Reduce output size: `full_frame_maxw=960`, WebP q=75
- Use faster storage (NVMe SSD)
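The "handle OOM gracefully" guardrail can be sketched as a retry loop that shrinks the batch, mirroring the 64 to 48 reduction suggested above. `run_with_oom_backoff` and `fake_step` are hypothetical names, not part of the CLI:

```python
def run_with_oom_backoff(run_batch, yolo_batch: int = 64, min_batch: int = 16):
    """Retry one pipeline step with progressively smaller batches instead
    of aborting the run. `run_batch` stands in for a real step and raises
    RuntimeError("CUDA out of memory") when the batch does not fit."""
    batch = yolo_batch
    while True:
        try:
            return run_batch(batch), batch
        except RuntimeError as e:
            if "out of memory" not in str(e) or batch <= min_batch:
                raise  # not an OOM, or nothing left to shrink
            batch = max(min_batch, batch * 3 // 4)  # e.g. 64 -> 48 -> 36

# Simulated step that only fits in memory at batch <= 48:
def fake_step(batch):
    if batch > 48:
        raise RuntimeError("CUDA out of memory")
    return f"ok@{batch}"

print(run_with_oom_backoff(fake_step))  # ('ok@48', 48)
```

In a real run, the surviving batch size would also be logged to the response's `performance` block so the next invocation can start from a known-good value.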
## CLI Equivalent

Windows `cmd` syntax, with `^` as the line-continuation character:

```bat
:: Basic usage
python -m src.cli.run_multi_dates ^
  --root "G:\My Drive\PROJECTS\INVESTIGATION\DASHCAM\Park_R" ^
  --out parsed\ALL_CROPS ^
  --dates 20251006 20251007 ^
  --use-nvdec --conf 0.35 --iou 0.5

:: Advanced usage with CLIP filtering
python -m src.cli.run_multi_dates ^
  --root "G:\My Drive\PROJECTS\INVESTIGATION\DASHCAM\Park_R" ^
  --out parsed\ALL_CROPS ^
  --dates 20251006 20251007 20251008 ^
  --use-nvdec ^
  --yolo-batch 64 ^
  --clip-batch 384 ^
  --clip-threshold 0.80 ^
  --conf 0.35 ^
  --iou 0.5 ^
  --save-full-frame ^
  --draw-boxes ^
  --parallel 3
```
## Example Interactions

### Example 1: Basic Detection
User: "Extract all humans from Park_R videos on October 6"

Skill invokes:
```json
{
  "mode": "extract_humans",
  "roots": ["G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251006"],
  "confidence": 0.35,
  "single_output_dir": "parsed\\ALL_CROPS",
  "nvdec": true
}
```

### Example 2: Advanced with CLIP
User: "Scan Park_R for October 6-8, filter for people with head coverings at 80% confidence, save annotated frames, max GPU usage"

Skill invokes:
```json
{
  "mode": "extract_humans",
  "roots": [
    "G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251006",
    "G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251007",
    "G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251008"
  ],
  "confidence": 0.35,
  "iou": 0.50,
  "yolo_batch": 64,
  "clip_filter": { "enabled": true, "threshold": 0.80, "batch": 384 },
  "nvdec": true,
  "save_full_frame": true,
  "draw_boxes": true,
  "single_output_dir": "parsed\\ALL_CROPS",
  "parallel": { "max_workers": 3 }
}
```

### Example 3: Low-Resource Mode
User: "Process Park_R October 6 with minimal GPU memory"

Skill invokes:
```json
{
  "mode": "extract_humans",
  "roots": ["G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251006"],
  "confidence": 0.35,
  "yolo_batch": 32,
  "clip_filter": { "enabled": false },
  "nvdec": false,
  "save_full_frame": false,
  "single_output_dir": "parsed\\ALL_CROPS"
}
```
## Safety & Guardrails

### Do Not
- ❌ Move or delete source MP4 files
- ❌ Infer gender unless explicitly enabled (sensitive, noisy)
- ❌ Process videos without user consent
- ❌ Share outputs containing identifiable persons

### Do
- ✅ Verify GPU availability before processing
- ✅ Enforce longitude sign corrections for GPS overlays
- ✅ Maintain an audit trail in INDEX.csv
- ✅ Log versions, batch sizes, and NVDEC usage
- ✅ Handle OOM gracefully with actionable suggestions

### Resume Safety
- Idempotent: already-processed crops are skipped by filename
- Shard-based: partial runs can resume
- Index integrity: SHA1 hashes verify file correctness
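The idempotent-resume rule follows from the filename convention: the name fully determines the output, so an existing file means the work was already done. A minimal write-guard sketch (`save_crop` is an illustrative helper, not the pipeline's actual writer):

```python
import tempfile
from pathlib import Path

def save_crop(out_dir: Path, name: str, data: bytes) -> bool:
    """Skip the write if the deterministic filename already exists,
    making partial re-runs safe (illustrative sketch)."""
    path = out_dir / name
    if path.exists():
        return False  # already processed in a previous (partial) run
    path.write_bytes(data)
    return True

with tempfile.TemporaryDirectory() as d:
    out = Path(d)
    print(save_crop(out, "trk017__v1.webp", b"\x00"))  # True  (written)
    print(save_crop(out, "trk017__v1.webp", b"\x00"))  # False (skipped on resume)
```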
## Testing & Verification

### Pre-Run Checks
```python
import os
from pathlib import Path

import torch

# GPU availability
assert torch.cuda.is_available(), "CUDA required"
assert torch.cuda.device_count() > 0, "No GPU found"

# Model files
assert Path("models/yolov8s.pt").exists(), "YOLOv8 model missing"

# Output directory writable
output_dir = Path("parsed/ALL_CROPS")
output_dir.mkdir(parents=True, exist_ok=True)
assert os.access(output_dir, os.W_OK), "Output dir not writable"
```

### Post-Run Verification
```python
from pathlib import Path

import pandas as pd

# Check outputs exist
assert Path("parsed/ALL_CROPS/INDEX.csv").exists()
assert len(list(Path("parsed/ALL_CROPS").glob("*.webp"))) > 0

# Validate INDEX.csv (FRAME rows have empty person_conf, so check crops only)
df = pd.read_csv("parsed/ALL_CROPS/INDEX.csv")
assert df["crop_file"].notna().all()
assert df.loc[df["file_type"] == "crop", "person_conf"].between(0, 1).all()

# Sample roundtrip: an indexed crop should exist on disk
sample = df.sample(1).iloc[0]
assert Path(f"parsed/ALL_CROPS/{sample['crop_file']}").exists()

# GPU utilization check (gpu_util_avg comes from the run summary)
assert gpu_util_avg > 0.70, f"Low GPU util: {gpu_util_avg}"
```
## Dependencies

### Required
- Python 3.10+
- PyTorch 2.0+ with CUDA 11.8+
- ultralytics (YOLOv8)
- transformers (CLIP)
- opencv-python
- pillow
- pandas
- numpy

### Optional (Performance)
- NVIDIA Video Codec SDK (NVDEC)
- TensorRT (future optimization)
- nvJPEG (GPU JPEG encoding)

### Installation
```bat
cd "G:\My Drive\PROJECTS\APPS\Human_Detection"
pip install -r requirements.txt
```
## Troubleshooting

### CUDA Out of Memory
```
Error: CUDA out of memory. Tried to allocate 2.50 GiB
```
Solution: reduce batch sizes step by step:
- `yolo_batch`: 64 → 48 → 32
- `clip_batch`: 384 → 256 → 128

### NVDEC Not Available
```
Warning: NVDEC unavailable, falling back to CPU decode
```
Solution:
- Check the NVIDIA driver version (≥525.60)
- The GPU must support the Video Codec SDK
- Verify with: `nvidia-smi --query-gpu=name --format=csv`

### Low GPU Utilization
```
Warning: GPU util only 45%
```
Solutions:
- Increase batch sizes (if VRAM allows)
- Enable NVDEC: `nvdec=true`
- Increase parallel workers: `max_workers=4`
- Check for a CPU bottleneck (use more DataLoader workers)

### Slow Processing
```
Performance: 0.8 videos/min (expected 3-4)
```
Diagnostics:
- Check disk I/O (use an SSD)
- Verify NVDEC is active (5-10x faster than CPU decode)
- Profile with: `python -m torch.utils.bottleneck script.py`
## Future Enhancements

### Planned
- TensorRT optimization (2-4x CLIP speedup)
- Multi-GPU sharding (process different dates on different GPUs)
- GPU JPEG/WebP encoding (nvJPEG)
- Real-time streaming mode

### Under Consideration
- Face recognition integration
- Gender classification (opt-in only, with warnings)
- Action recognition (walking, standing, etc.)
- Multi-camera fusion (correlate detections across cameras)

## Version History

### v1.0 (Current)
- Initial release
- YOLOv8s + ByteTrack + CLIP
- NVDEC support
- Unified output directory
- Global INDEX.csv
## References

- YOLOv8 Documentation
- ByteTrack Paper
- CLIP Paper
- NVIDIA Video Codec SDK

## Contact & Support

For issues or questions:
1. Check `parsed/ALL_CROPS/run_*.log` for error details
2. Review GPU diagnostics: `nvidia-smi`
3. Validate that input paths exist and are readable
4. Verify the CUDA/PyTorch installation: `python -c "import torch; print(torch.cuda.is_available())"`