Senior Computer Vision Engineer
Production computer vision engineering skill for object detection, image segmentation, and visual AI system deployment.
Table of Contents
- Quick Start
- Core Expertise
- Tech Stack
- Workflow 1: Object Detection Pipeline
- Workflow 2: Model Optimization and Deployment
- Workflow 3: Custom Dataset Preparation
- Architecture Selection Guide
- Reference Documentation
- Common Commands
Quick Start
Generate training configuration for YOLO or Faster R-CNN
python scripts/vision_model_trainer.py models/ --task detection --arch yolov8
Analyze model for optimization opportunities (quantization, pruning)
python scripts/inference_optimizer.py model.pt --target onnx --benchmark
Build dataset pipeline with augmentations
python scripts/dataset_pipeline_builder.py images/ --format coco --augment
Core Expertise
This skill provides guidance on:
- Object Detection: YOLO family (v5-v11), Faster R-CNN, DETR, RT-DETR
- Instance Segmentation: Mask R-CNN, YOLACT, SOLOv2
- Semantic Segmentation: DeepLabV3+, SegFormer, SAM (Segment Anything)
- Image Classification: ResNet, EfficientNet, Vision Transformers (ViT, DeiT)
- Video Analysis: Object tracking (ByteTrack, SORT), action recognition
- 3D Vision: Depth estimation, point cloud processing, NeRF
- Production Deployment: ONNX, TensorRT, OpenVINO, CoreML
Tech Stack
| Category | Technologies |
|---|---|
| Frameworks | PyTorch, torchvision, timm |
| Detection | Ultralytics (YOLO), Detectron2, MMDetection |
| Segmentation | segment-anything, mmsegmentation |
| Optimization | ONNX, TensorRT, OpenVINO, torch.compile |
| Image Processing | OpenCV, Pillow, albumentations |
| Annotation | CVAT, Label Studio, Roboflow |
| Experiment Tracking | MLflow, Weights & Biases |
| Serving | Triton Inference Server, TorchServe |
Workflow 1: Object Detection Pipeline
Use this workflow when building an object detection system from scratch.
Step 1: Define Detection Requirements
Analyze the detection task requirements:
Detection Requirements Analysis:
- Target objects: [list specific classes to detect]
- Real-time requirement: [yes/no, target FPS]
- Accuracy priority: [speed vs accuracy trade-off]
- Deployment target: [cloud GPU, edge device, mobile]
- Dataset size: [number of images, annotations per class]
Step 2: Select Detection Architecture
Choose architecture based on requirements:
| Requirement | Recommended Architecture | Why |
|---|---|---|
| Real-time (>30 FPS) | YOLOv8/v11, RT-DETR | Single-stage, optimized for speed |
| High accuracy | Faster R-CNN, DINO | Two-stage, better localization |
| Small objects | YOLO + SAHI, Faster R-CNN + FPN | Multi-scale detection |
| Edge deployment | YOLOv8n, MobileNetV3-SSD | Lightweight architectures |
| Transformer-based | DETR, DINO, RT-DETR | End-to-end, no NMS required |
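The table can be read as a simple priority order. The sketch below is one hypothetical way to encode it; the function name, the boolean flags, and the order in which requirements win are assumptions for illustration, not part of the skill's scripts.

```python
def recommend_architecture(realtime: bool, edge: bool, small_objects: bool) -> str:
    """Return a starting-point architecture from the selection table.

    Assumed priority: edge constraints first, then small objects,
    then real-time needs, otherwise accuracy-oriented models.
    """
    if edge:
        return "YOLOv8n or MobileNetV3-SSD"
    if small_objects:
        return "YOLO + SAHI or Faster R-CNN + FPN"
    if realtime:
        return "YOLOv8/v11 or RT-DETR"
    return "Faster R-CNN or DINO"


if __name__ == "__main__":
    print(recommend_architecture(realtime=True, edge=False, small_objects=False))
```

Treat the result as a first candidate to benchmark, since the right choice also depends on dataset size and deployment hardware.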
Step 3: Prepare Dataset
Convert annotations to required format:
COCO format (recommended)
python scripts/dataset_pipeline_builder.py data/images/ \
    --annotations data/labels/ \
    --format coco \
    --split 0.8 0.1 0.1 \
    --output data/coco/
Verify dataset
python -c "from pycocotools.coco import COCO; coco = COCO('data/coco/train.json'); print(f'Images: {len(coco.imgs)}, Categories: {len(coco.cats)}')"
Step 4: Configure Training
Generate training configuration:
For Ultralytics YOLO
python scripts/vision_model_trainer.py data/coco/ \
    --task detection \
    --arch yolov8m \
    --epochs 100 \
    --batch 16 \
    --imgsz 640 \
    --output configs/
For Detectron2
python scripts/vision_model_trainer.py data/coco/ \
    --task detection \
    --arch faster_rcnn_R_50_FPN \
    --framework detectron2 \
    --output configs/
Step 5: Train and Validate
Ultralytics training
yolo detect train data=data.yaml model=yolov8m.pt epochs=100 imgsz=640
Detectron2 training
python train_net.py --config-file configs/faster_rcnn.yaml --num-gpus 1
Validate on test set (split=test selects the test split; the default is val)
yolo detect val model=runs/detect/train/weights/best.pt data=data.yaml split=test
Step 6: Evaluate Results
Key metrics to analyze:
| Metric | Target | Description |
|---|---|---|
| mAP@50 | >0.7 | Mean Average Precision at IoU 0.5 |
| mAP@50:95 | >0.5 | COCO primary metric |
| Precision | >0.8 | Low false positives |
| Recall | >0.8 | Low missed detections |
| Inference time | <33ms | For 30 FPS real-time |
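To make the metrics concrete, here is a minimal sketch of how precision and recall at a fixed IoU threshold might be computed for one image and one class. It assumes boxes in `(x1, y1, x2, y2)` pixel format and greedy matching in descending score order; the function names are illustrative, and full mAP evaluation (as done by pycocotools) additionally sweeps score thresholds and IoU levels.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, gts, iou_thr=0.5):
    """preds: list of (box, score); gts: list of boxes.

    Each ground-truth box can be matched by at most one prediction.
    """
    matched = set()
    tp = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp  # predictions with no matching ground truth
    fn = len(gts) - tp    # ground-truth boxes that were missed
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall


if __name__ == "__main__":
    preds = [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)]
    gts = [(0, 0, 10, 10), (100, 100, 110, 110)]
    print(precision_recall(preds, gts))
```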
Workflow 2: Model Optimization and Deployment
Use this workflow when preparing a trained model for production deployment.
Step 1: Benchmark Baseline Performance
Measure current model performance
python scripts/inference_optimizer.py model.pt \
    --benchmark \
    --input-size 640 640 \
    --batch-sizes 1 4 8 16 \
    --warmup 10 \
    --iterations 100
Expected output:
Baseline Performance (PyTorch FP32):
- Batch 1: 45.2ms (22.1 FPS)
- Batch 4: 89.4ms (44.7 FPS)
- Batch 8: 165.3ms (48.4 FPS)
- Memory: 2.1 GB
- Parameters: 25.9M
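The FPS figures in a report like this follow from batch latency as `batch_size * 1000 / latency_ms`. The sketch below shows a warmup-then-measure timing loop of the kind such a benchmark might use; it is an assumption about the approach, not the script's actual implementation. On a GPU, the timed callable would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before the timer is read.

```python
import statistics
import time


def throughput_fps(latency_ms, batch_size):
    """Images per second for a batch that takes latency_ms to process."""
    if latency_ms <= 0:
        return float("inf")
    return batch_size * 1000.0 / latency_ms


def benchmark(fn, batch_size, warmup=10, iterations=100):
    """Time fn() after warmup; returns (mean latency in ms, FPS)."""
    for _ in range(warmup):
        fn()  # warmup runs are discarded (caches, lazy init, autotuning)
    times_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    mean_ms = statistics.mean(times_ms)
    return mean_ms, throughput_fps(mean_ms, batch_size)


if __name__ == "__main__":
    # Stand-in workload; replace with a model forward pass.
    print(benchmark(lambda: sum(range(10000)), batch_size=1))
    print(round(throughput_fps(89.4, 4), 1))
```

Larger batches raise throughput but also raise per-request latency, which is why the report lists both.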
Step 2: Select Optimization Strategy
| Deployment Target | Optimization Path |
|---|---|
| NVIDIA GPU (cloud) | PyTorch → ONNX → TensorRT FP16 |
| NVIDIA GPU (edge) | PyTorch → TensorRT INT8 |
| Intel CPU | PyTorch → ONNX → OpenVINO |
| Apple Silicon | PyTorch → CoreML |
| Generic CPU | PyTorch → ONNX Runtime |
| Mobile | PyTorch → TFLite or ONNX Runtime Mobile |
Step 3: Export to ONNX
Export with dynamic batch size
python scripts/inference_optimizer.py model.pt \
    --export onnx \
    --input-size 640 640 \
    --dynamic-batch \
    --simplify \
    --output model.onnx
Verify ONNX model
python -c "import onnx; model = onnx.load('model.onnx'); onnx.checker.check_model(model); print('ONNX model valid')"
Step 4: Apply Quantization (Optional)
For INT8 quantization with calibration:
Quantize to INT8 using a calibration dataset
python scripts/inference_optimizer.py model.onnx \
    --quantize int8 \
    --calibration-data data/calibration/ \
    --calibration-samples 500 \
    --output model_int8.onnx
Quantization impact analysis:
| Precision | Size | Speed | Accuracy Drop |
|---|---|---|---|
| FP32 | 100% | 1x | 0% |
| FP16 | 50% | 1.5-2x | <0.5% |
| INT8 | 25% | 2-4x | 1-3% |
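The 25% size figure follows from storing each value in 1 byte instead of 4, and the accuracy drop comes from rounding error. The sketch below illustrates symmetric per-tensor INT8 quantization in plain Python. It is a simplified illustration under that assumption; production toolchains such as TensorRT or ONNX Runtime use calibration data to choose scales and often quantize per channel.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.27, 0.0, 0.64, 1.27]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    # The per-value error is bounded by half a quantization step.
    print(q, scale, [abs(a - b) for a, b in zip(weights, restored)])
```

Outliers inflate `max_abs` and therefore the step size, which is one reason calibration on representative data matters.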
Step 5: Convert to Target Runtime
TensorRT (NVIDIA GPU)
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
OpenVINO (Intel)
ovc model.onnx --output_model openvino/model
CoreML (Apple)
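python -c "import torch, coremltools as ct; m = torch.load('model.pt', weights_only=False).eval(); t = torch.jit.trace(m, torch.randn(1, 3, 640, 640)); ct.convert(t, inputs=[ct.TensorType(shape=(1, 3, 640, 640))], convert_to='mlprogram').save('model.mlpackage')"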
Step 6: Benchmark Optimized Model
python scripts/inference_optimizer.py model.engine \
    --benchmark \
    --runtime tensorrt \
    --compare model.pt
Expected speedup:
Optimization Results:
- Original (PyTorch FP32): 45.2ms
- Optimized (TensorRT FP16): 12.8ms
- Speedup: 3.5x
- Accuracy change: -0.3% mAP
Workflow 3: Custom Dataset Preparation
Use this workflow when preparing a computer vision dataset for training.
Step 1: Audit Raw Data
Analyze image dataset
python scripts/dataset_pipeline_builder.py data/raw/ \
    --analyze \
    --output analysis/
Analysis report includes:
Dataset Analysis:
- Total images: 5,234
- Image sizes: 640x480 to 4096x3072 (variable)
- Formats: JPEG (4,891), PNG (343)
- Corrupted: 12 files
- Duplicates: 45 pairs
Annotation Analysis:
- Format detected: Pascal VOC XML
- Total annotations: 28,456
- Classes: 5 (car, person, bicycle, dog, cat)
- Distribution: car (12,340), person (8,234), bicycle (3,456), dog (2,890), cat (1,536)
- Empty images: 234
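A class distribution like the one in this sample report is worth quantifying before training. The sketch below computes the imbalance ratio and inverse-frequency class weights from those counts; the helper names are illustrative, and inverse-frequency weighting is only one of several common ways to compensate for imbalance.

```python
def imbalance_ratio(counts):
    """Ratio between the most and least frequent class."""
    return max(counts.values()) / min(counts.values())


def inverse_frequency_weights(counts):
    """Weights that average to 1 per annotation; rare classes weigh more."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


if __name__ == "__main__":
    # Counts taken from the sample analysis report.
    counts = {"car": 12340, "person": 8234, "bicycle": 3456, "dog": 2890, "cat": 1536}
    print(sum(counts.values()))
    print(round(imbalance_ratio(counts), 2))
    print({k: round(v, 2) for k, v in inverse_frequency_weights(counts).items()})
```

Here the most frequent class has roughly eight times the annotations of the rarest, so stratified splitting and targeted augmentation of rare classes may be worth considering.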
Step 2: Clean and Validate
Remove corrupted and duplicate images
python scripts/dataset_pipeline_builder.py data/raw/ \
    --clean \
    --remove-corrupted \
    --remove-duplicates \
    --output data/cleaned/
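As an illustration of what duplicate removal can involve, the sketch below groups byte-identical files by SHA-256 digest using only the standard library. The function name is hypothetical, and how the script itself detects duplicates is not specified here. This approach only finds exact copies; resized or re-encoded near-duplicates would need perceptual hashing instead.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(root):
    """Return groups of files under root whose contents are byte-identical."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


if __name__ == "__main__":
    for group in find_exact_duplicates("data/raw"):
        # Keep the first file of each group; the rest are candidates for removal.
        print("keep:", group[0], "duplicates:", group[1:])
```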
Step 3: Convert Annotation Format
Convert VOC to COCO format
python scripts/dataset_pipeline_builder.py data/cleaned/ \
    --annotations data/annotations/ \
    --input-format voc \
    --output-format coco \
    --output data/coco/
Supported format conversions:
| From | To |
|---|---|
| Pascal VOC XML | COCO JSON |
| YOLO TXT | COCO JSON |
| COCO JSON | YOLO TXT |
| LabelMe JSON | COCO JSON |
| CVAT XML | COCO JSON |
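Most of these conversions come down to changing the bounding-box convention. YOLO TXT stores a normalized center and size, COCO stores `[x_min, y_min, width, height]` in pixels, and Pascal VOC stores corner coordinates. The sketch below shows the box arithmetic only; the function names are illustrative, and it ignores details such as VOC's 1-based pixel indexing and category ID remapping.

```python
def yolo_to_coco(cx, cy, w, h, img_w, img_h):
    """Normalized (cx, cy, w, h) -> absolute [x_min, y_min, width, height]."""
    bw, bh = w * img_w, h * img_h
    return [cx * img_w - bw / 2, cy * img_h - bh / 2, bw, bh]


def coco_to_yolo(x, y, bw, bh, img_w, img_h):
    """Absolute [x_min, y_min, width, height] -> normalized (cx, cy, w, h)."""
    return [(x + bw / 2) / img_w, (y + bh / 2) / img_h, bw / img_w, bh / img_h]


def voc_to_coco(xmin, ymin, xmax, ymax):
    """Corner coordinates -> [x_min, y_min, width, height]."""
    return [xmin, ymin, xmax - xmin, ymax - ymin]


if __name__ == "__main__":
    print(yolo_to_coco(0.5, 0.5, 0.5, 0.5, 640, 480))
    print(coco_to_yolo(160, 120, 320, 240, 640, 480))
    print(voc_to_coco(160, 120, 480, 360))
```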
Step 4: Apply Augmentations
Apply augmentations from a config file
python scripts/dataset_pipeline_builder.py data/coco/ \
    --augment \
    --aug-config configs/augmentation.yaml \
    --output data/augmented/
Recommended augmentations for detection:
configs/augmentation.yaml
augmentations:
  geometric:
    - horizontal_flip: { p: 0.5 }
    - vertical_flip: { p: 0.1 }  # Only if orientation invariant
    - rotate: { limit: 15, p: 0.3 }
    - scale: { scale_limit: 0.2, p: 0.5 }
  color:
    - brightness_contrast: { brightness_limit: 0.2, contrast_limit: 0.2, p: 0.5 }
    - hue_saturation: { hue_shift_limit: 20, sat_shift_limit: 30, p: 0.3 }
    - blur: { blur_limit: 3, p: 0.1 }
  advanced:
    - mosaic: { p: 0.5 }  # YOLO-style mosaic
    - mixup: { p: 0.1 }  # Image mixing
    - cutout: { num_holes: 8, max_h_size: 32, max_w_size: 32, p: 0.3 }
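For detection, every geometric augmentation has to transform the boxes together with the pixels. The sketch below shows the box update for a horizontal flip applied with probability `p`, assuming COCO-style `[x_min, y_min, width, height]` boxes. The function names are illustrative; libraries such as albumentations handle this bookkeeping when bounding-box parameters are configured.

```python
import random


def hflip_coco_bbox(bbox, img_w):
    """Mirror a [x_min, y_min, width, height] box around the vertical axis."""
    x, y, w, h = bbox
    return [img_w - x - w, y, w, h]


def maybe_hflip_boxes(bboxes, img_w, p=0.5, rng=random):
    """Flip all boxes with probability p; returns (boxes, flipped).

    The caller must flip the image itself whenever flipped is True.
    """
    if rng.random() < p:
        return [hflip_coco_bbox(b, img_w) for b in bboxes], True
    return [list(b) for b in bboxes], False


if __name__ == "__main__":
    rng = random.Random(42)
    print(maybe_hflip_boxes([[10, 20, 30, 40]], img_w=100, p=0.5, rng=rng))
```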
Step 5: Create Train/Val/Test Splits
python scripts/dataset_pipeline_builder.py data/augmented/ \
    --split 0.8 0.1 0.1 \
    --stratify \
    --seed 42 \
    --output data/final/
Split strategy guidelines:
| Dataset Size | Train | Val | Test |
|---|---|---|---|
| <1,000 images | 70% | 15% | 15% |
| 1,000-10,000 images | 80% | 10% | 10% |
| >10,000 images | 90% | 5% | 5% |
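A seeded, stratified split can be sketched as follows. It assumes each image has been reduced to a single stratification label (for example its most frequent class), which is a simplification for multi-object detection data, and the function name is illustrative. If augmented copies were written to disk before splitting, all variants of a source image should stay in the same split to avoid leakage between train and test.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """labels: dict of image_id -> class label. Returns train/val/test id lists."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, label in sorted(labels.items()):
        by_class[label].append(image_id)

    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        ids = by_class[label]
        rng.shuffle(ids)
        n_train = int(len(ids) * ratios[0])
        n_val = int(len(ids) * ratios[1])
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]  # remainder goes to test
    return splits


if __name__ == "__main__":
    labels = {f"img{i:03d}": ("car" if i < 100 else "cat") for i in range(150)}
    result = stratified_split(labels)
    print({name: len(ids) for name, ids in result.items()})
```

Fixing the seed makes the split reproducible, so later experiments are evaluated on the same held-out images.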
Step 6: Generate Dataset Configuration
For Ultralytics YOLO
python scripts/dataset_pipeline_builder.py data/final/ \
    --generate-config yolo \
    --output data.yaml
For Detectron2
python scripts/dataset_pipeline_builder.py data/final/ \
    --generate-config detectron2 \
    --output detectron2_config.py
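For orientation, an Ultralytics dataset config is a small YAML file with a dataset root, split directories relative to it, and an index-to-name class map. The sketch below renders one as text. The `images/train`, `images/val`, and `images/test` layout and the function name are assumptions; the file actually produced by the script may differ.

```python
def build_ultralytics_data_yaml(root, class_names):
    """Render a minimal Ultralytics-style dataset config as YAML text."""
    lines = [
        f"path: {root}",
        "train: images/train",  # assumed directory layout under root
        "val: images/val",
        "test: images/test",
        "names:",
    ]
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    classes = ["car", "person", "bicycle", "dog", "cat"]
    print(build_ultralytics_data_yaml("data/final", classes))
```

The class order has to match the class indices used in the label files, since YOLO TXT labels refer to classes by index only.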
Architecture Selection Guide
Object Detection Architectures
| Architecture | Speed | Accuracy | Best For |
|---|---|---|---|
| YOLOv8n | 1.2ms | 37.3 mAP | Edge, mobile, real-time |
| YOLOv8s | 2.1ms | 44.9 mAP | Balanced speed/accuracy |
| YOLOv8m | 4.2ms | 50.2 mAP | General purpose |
| YOLOv8l | 6.8ms | 52.9 mAP | High accuracy |
| YOLOv8x | 10.1ms | 53.9 mAP | Maximum accuracy |
| RT-DETR-L | 5.3ms | 53.0 mAP | Transformer, no NMS |
| Faster R-CNN R50 | 46ms | 40.2 mAP | Two-stage, high quality |
| DINO-4scale | 85ms | 49.0 mAP | SOTA transformer |
Segmentation Architectures
| Architecture | Type | Speed | Best For |
|---|---|---|---|
| YOLOv8-seg | Instance | 4.5ms | Real-time instance seg |
| Mask R-CNN | Instance | 67ms | High-quality masks |
| SAM | Promptable | 50ms | Zero-shot segmentation |
| DeepLabV3+ | Semantic | 25ms | Scene parsing |
| SegFormer | Semantic | 15ms | Efficient semantic seg |
CNN vs Vision Transformer Trade-offs
| Aspect | CNN (YOLO, R-CNN) | ViT (DETR, DINO) |
|---|---|---|
| Training data needed | 1K-10K images | 10K-100K+ images |
| Training time | Fast | Slow (needs more epochs) |
| Inference speed | Faster | Slower |
| Small objects | Good with FPN | Needs multi-scale |
| Global context | Limited | Excellent |
| Positional encoding | Implicit | Explicit |
Reference Documentation
- Computer Vision Architectures
See references/computer_vision_architectures.md for:
- CNN backbone architectures (ResNet, EfficientNet, ConvNeXt)
- Vision Transformer variants (ViT, DeiT, Swin)
- Detection heads (anchor-based vs anchor-free)
- Feature Pyramid Networks (FPN, BiFPN, PANet)
- Neck architectures for multi-scale detection
- Object Detection Optimization
See references/object_detection_optimization.md for:
- Non-Maximum Suppression variants (NMS, Soft-NMS, DIoU-NMS)
- Anchor optimization and anchor-free alternatives
- Loss function design (focal loss, GIoU, CIoU, DIoU)
- Training strategies (warmup, cosine annealing, EMA)
- Data augmentation for detection (mosaic, mixup, copy-paste)
- Production Vision Systems
See references/production_vision_systems.md for:
- ONNX export and optimization
- TensorRT deployment pipeline
- Batch inference optimization
- Edge device deployment (Jetson, Intel NCS)
- Model serving with Triton
- Video processing pipelines
Common Commands
Ultralytics YOLO
Training
yolo detect train data=coco.yaml model=yolov8m.pt epochs=100 imgsz=640
Validation
yolo detect val model=best.pt data=coco.yaml
Inference
yolo detect predict model=best.pt source=images/ save=True
Export
yolo export model=best.pt format=onnx simplify=True dynamic=True
Detectron2
Training
python train_net.py --config-file configs/COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml \
    --num-gpus 1 OUTPUT_DIR ./output
Evaluation
python train_net.py --config-file configs/faster_rcnn.yaml --eval-only \
    MODEL.WEIGHTS output/model_final.pth
Inference
python demo.py --config-file configs/faster_rcnn.yaml \
    --input images/*.jpg --output results/ \
    --opts MODEL.WEIGHTS output/model_final.pth
MMDetection
Training
python tools/train.py configs/faster_rcnn/faster-rcnn_r50_fpn_1x_coco.py
Testing
python tools/test.py configs/faster_rcnn.py checkpoints/latest.pth
Inference
python demo/image_demo.py demo.jpg configs/faster_rcnn.py --weights checkpoints/latest.pth
Model Optimization
ONNX export and simplify
python -c "import torch; model = torch.load('model.pt', weights_only=False).eval(); torch.onnx.export(model, torch.randn(1, 3, 640, 640), 'model.onnx', opset_version=17)"
python -m onnxsim model.onnx model_sim.onnx
TensorRT conversion
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16 --memPoolSize=workspace:4096
Benchmark
trtexec --loadEngine=model.engine --iterations=1000 --avgRuns=100
Performance Targets
| Metric | Real-time | High Accuracy | Edge |
|---|---|---|---|
| FPS | >30 | >10 | >15 |
| mAP@50 | >0.6 | >0.8 | >0.5 |
| Latency P99 | <50ms | <150ms | <100ms |
| GPU Memory | <4GB | <8GB | <2GB |
| Model Size | <50MB | <200MB | <20MB |
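Checking measured latencies against these targets could look like the sketch below. It uses the nearest-rank definition of P99 and derives FPS from mean latency at batch size 1. The `TARGETS` dictionary, its key names, and the function names are illustrative, and only the FPS and P99 rows of the table are encoded.

```python
import math

# FPS and P99 latency targets from the table; keys are illustrative.
TARGETS = {
    "real-time": {"min_fps": 30, "max_p99_ms": 50},
    "high-accuracy": {"min_fps": 10, "max_p99_ms": 150},
    "edge": {"min_fps": 15, "max_p99_ms": 100},
}


def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_target(latencies_ms, profile):
    """True if mean-latency FPS and P99 latency both satisfy the profile."""
    target = TARGETS[profile]
    mean_ms = sum(latencies_ms) / len(latencies_ms)
    fps = 1000.0 / mean_ms
    p99 = percentile(latencies_ms, 99)
    return fps > target["min_fps"] and p99 < target["max_p99_ms"]


if __name__ == "__main__":
    samples = [20.0] * 99 + [60.0]
    print(percentile(samples, 99), meets_target(samples, "real-time"))
```

Tail latency is tracked separately from FPS because a model can meet its average throughput while occasional slow frames still break a real-time budget.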
Resources
- Architecture Guide: references/computer_vision_architectures.md
- Optimization Guide: references/object_detection_optimization.md
- Deployment Guide: references/production_vision_systems.md
- Scripts: scripts/ directory for automation tools