Multi-AI Verification

Overview

multi-ai-verification provides comprehensive quality assurance through a 5-layer verification pyramid, from automated rules to LLM-as-judge evaluation.

Purpose: Multi-layer independent verification ensuring production-ready quality

Pattern: Task-based (5 independent verification operations, one per layer)

Key Innovation: 5-layer pyramid (95% automated at the base → 0-20% at the apex) with independent verification to prevent bias and test gaming

Core Principles (validated by tri-AI research):

Multi-Layer Defense - 5 layers catch different types of issues
Independent Verification - Separate agent from implementation/testing
Progressive Automation - Automate what can be automated (95% at Layer 1 → 0-20% at Layer 5)
Quality Scoring - Objective 0-100 scoring with ≥90 threshold
Actionable Feedback - All feedback is specific and actionable (What/Where/Why/How/Priority)
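As a rough illustration of the What/Where/Why/How/Priority format, a feedback item could be modeled as below. The type and function names are hypothetical (they are not part of the skill), and the priority levels are the ones used in the Layer 5 outputs. The sample values are taken from the gap analysis later in this document.

```typescript
// Hypothetical shape for one feedback item; field names are illustrative.
type Priority = "Critical" | "High" | "Medium" | "Low";

interface FeedbackItem {
  what: string;  // the problem
  where: string; // file:line reference
  why: string;   // why it matters
  how: string;   // the concrete fix
  priority: Priority;
}

// Returns the names of text fields that are empty, so vague feedback can be rejected.
function missingFeedbackFields(item: FeedbackItem): string[] {
  const textFields: Array<keyof FeedbackItem> = ["what", "where", "why", "how"];
  return textFields.filter((field) => item[field].trim() === "");
}

const bcryptFeedback: FeedbackItem = {
  what: "bcrypt using 10 rounds",
  where: "src/auth/hash.ts:15",
  why: "Current standard is 12-14 rounds",
  how: "Change bcrypt.hash(password, 10) to bcrypt.hash(password, 12)",
  priority: "High",
};

console.log(missingFeedbackFields(bcryptFeedback)); // []
```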

Quality Gates: All 5 layers must pass for production approval

When to Use

Use multi-ai-verification for:

Final quality check before commit/deployment
Independent code review (preventing bias)
Security verification (OWASP, vulnerabilities)
Comprehensive QA (all layers)
Test quality verification (prevent gaming)
Production readiness validation

Prerequisites

Required

Code to verify (implementation complete)
Tests available (for functional verification)
Quality standards defined

Tools Available

Linters (ESLint, Pylint)
Type checkers (TypeScript, mypy)
Coverage tools (c8, pytest-cov)
Security scanners (Semgrep, Bandit)
Test frameworks (Jest, pytest)

The 5-Layer Verification Pyramid

         Layer 5: Quality Scoring
         (LLM-as-Judge, 0-20% automated)
              /\
             /  \
        Layer 4: Integration
        (E2E, System, 20-30% automated)
          /      \
         /        \
    Layer 3: Visual
    (UI, Screenshots, 30-50% automated)
      /          \
     /            \
Layer 2: Functional
(Tests, Coverage, 60-80% automated)
  /              \
 /                \
Layer 1: Rules-Based
(Linting, Types, Schema, 95% automated)

Principle: Fail fast at automated layers (cheap, fast) before expensive LLM-as-judge evaluation
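The fail-fast principle can be sketched as a loop that stops at the first failing gate. This is a minimal sketch, not the skill's implementation: `PyramidLayer`, `runPyramid`, and the synchronous `run` callbacks are assumptions made for illustration (real checks would shell out to tools and be asynchronous).

```typescript
// Hypothetical layer descriptor; `run` stands in for the real checks.
interface PyramidLayer {
  name: string;
  run: () => boolean; // true = gate passed
}

interface PyramidResult {
  approved: boolean;
  failedAt: string | null;
  executed: string[];
}

// Runs layers in order and stops at the first failing gate,
// so later (more expensive) layers are never started.
function runPyramid(layers: PyramidLayer[]): PyramidResult {
  const executed: string[] = [];
  for (const layer of layers) {
    executed.push(layer.name);
    if (!layer.run()) {
      return { approved: false, failedAt: layer.name, executed };
    }
  }
  return { approved: true, failedAt: null, executed };
}

const demoResult = runPyramid([
  { name: "Layer 1: Rules-Based", run: () => true },
  { name: "Layer 2: Functional", run: () => false }, // simulated failure
  { name: "Layer 5: Quality Scoring", run: () => true },
]);
console.log(demoResult.failedAt, demoResult.executed.length); // Layer 2: Functional 2
```

In this simulated run, Layer 5 never executes because Layer 2 failed first.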

Verification Operations

Operation 1: Rules-Based Verification (Layer 1)

Purpose: Automated validation of code structure, formatting, types

Automation: 95% automated
Speed: Seconds (fast feedback)
Confidence: High (deterministic)

Process:

Schema Validation (if applicable):

# Validate JSON/YAML against schemas
ajv validate -s plan.schema.json -d plan.json
ajv validate -s task.schema.json -d "tasks/*.json"

Linting:

# JavaScript/TypeScript (quote the glob so ESLint, not the shell, expands **)
npx eslint "src/**/*.{ts,tsx,js,jsx}"

# Python (pylint takes a package or directory, not a glob)
pylint src/

# Expected: Zero linting errors

Type Checking:

# TypeScript
npx tsc --noEmit

# Python
mypy src/

# Expected: Zero type errors

Format Validation:

# Check formatting (quote the glob so Prettier, not the shell, expands **)
npx prettier --check "src/**/*.{ts,tsx}"

# Or auto-fix
npx prettier --write "src/**/*.{ts,tsx}"

Security Scanning (SAST):

# Static security analysis (Semgrep is installed via pip or Homebrew, not npm)
semgrep --config=auto src/

# Or for Python
bandit -r src/

# Check for:
# - Hardcoded secrets
# - SQL injection risks
# - XSS vulnerabilities
# - Insecure dependencies

Generate Layer 1 Report:

# Layer 1: Rules-Based Verification

## Schema Validation
✅ plan.json validates
✅ All task files validate

## Linting
✅ 0 linting errors
⚠️ 3 warnings (non-blocking)

## Type Checking
✅ 0 type errors

## Formatting
✅ All files formatted correctly

## Security Scan (SAST)
✅ No critical vulnerabilities
⚠️ 1 medium: Weak password hashing rounds (bcrypt)

**Layer 1 Status**: ✅ PASS (0 critical issues)
**Issues to Address**: 1 medium security issue

Outputs:

Lint report (errors/warnings)
Type check results
Schema validation results
Security scan findings
Layer 1 status (PASS/FAIL)

Validation:

All automated checks run
Results documented
Critical issues = 0 for PASS
Actionable feedback for warnings

Time Estimate: 15-30 minutes (mostly automated)

Gate 1: ✅ PASS if no critical issues (warnings acceptable)
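The Gate 1 rule (zero critical issues, warnings allowed) might be expressed as follows. The `Finding` shape and `evaluateGate1` name are hypothetical, and the lint warning message is an invented placeholder. The medium finding mirrors the sample report above.

```typescript
// Hypothetical finding shape; severity labels follow the sample report.
type Severity = "critical" | "medium" | "warning";

interface Finding {
  tool: string;
  severity: Severity;
  message: string;
}

// Gate 1 passes when there are zero critical findings; everything else
// is reported as feedback to address but does not block.
function evaluateGate1(findings: Finding[]): { pass: boolean; critical: number; nonBlocking: number } {
  const critical = findings.filter((f) => f.severity === "critical").length;
  return { pass: critical === 0, critical, nonBlocking: findings.length - critical };
}

const layer1Findings: Finding[] = [
  { tool: "eslint", severity: "warning", message: "unused variable" }, // placeholder message
  { tool: "semgrep", severity: "medium", message: "Weak password hashing rounds (bcrypt)" },
];
console.log(evaluateGate1(layer1Findings)); // { pass: true, critical: 0, nonBlocking: 2 }
```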

Operation 2: Functional Verification (Layer 2)

Purpose: Validate functionality through test execution and coverage

Automation: 60-80% automated
Speed: Minutes (medium feedback)
Confidence: High (measurable outcomes)

Process:

Execute Complete Test Suite:

# Run all tests with coverage
npm test -- --coverage --verbose

# Capture results
# - Tests passed/failed
# - Coverage metrics
# - Execution time

Validate Example Code (from documentation):

# Extract examples from SKILL.md
# Execute each example automatically
# Verify outputs match expected

# Target: ≥90% examples work

Check Coverage:

# Coverage Report

**Line Coverage**: 87% ✅ (gate: ≥80%)
**Branch Coverage**: 82% ✅
**Function Coverage**: 92% ✅
**Path Coverage**: 74% ⚠️ (below 80%, not gated)

**Gate Status**: PASS ✅ (line coverage ≥80%)

**Uncovered Code**:
- src/admin/legacy.ts: 23% (low priority)
- src/utils/deprecated.ts: 15% (deprecated, ok)

Regression Testing (for updates):

# Compare before/after
git diff main...feature --stat

# Run all tests
npm test

# Verify: No new failures (regression prevention)

Performance Validation:

# Run performance tests
npm run test:performance

# Check response times
# Verify: Within acceptable ranges

Generate Layer 2 Report:

# Layer 2: Functional Verification

## Test Execution
✅ 245/245 tests passing (100%)
⏱️ Execution time: 8.3 seconds

## Coverage
✅ Line: 87% (gate: ≥80%)
✅ Branch: 82%
✅ Function: 92%

## Example Validation
✅ 18/20 examples work (90%)
❌ 2 examples fail (outdated)

## Regression
✅ All existing tests still pass

## Performance
✅ All endpoints <200ms

**Layer 2 Status**: ✅ PASS
**Issues**: 2 outdated examples (update docs)

Outputs:

Test execution results
Coverage report
Example validation results
Regression check
Performance metrics
Layer 2 status

Validation:

Time Estimate: 30-60 minutes

Gate 2: ✅ PASS if tests pass + coverage ≥80%
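A minimal sketch of the Gate 2 check, assuming "coverage" means line coverage (the only metric the sample report marks with a gate). `FunctionalResults` and `evaluateGate2` are illustrative names, not part of any tool.

```typescript
// Hypothetical summary of a test run; values would come from the test runner's report.
interface FunctionalResults {
  testsPassed: number;
  testsTotal: number;
  lineCoveragePct: number;
}

// Gate 2: every test passes AND line coverage meets the threshold (default 80%).
// An empty test suite is treated as a failure rather than a vacuous pass.
function evaluateGate2(r: FunctionalResults, minCoveragePct = 80): boolean {
  const allTestsPass = r.testsTotal > 0 && r.testsPassed === r.testsTotal;
  return allTestsPass && r.lineCoveragePct >= minCoveragePct;
}

console.log(evaluateGate2({ testsPassed: 245, testsTotal: 245, lineCoveragePct: 87 })); // true
console.log(evaluateGate2({ testsPassed: 244, testsTotal: 245, lineCoveragePct: 87 })); // false
```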

Operation 3: Visual Verification (Layer 3)

Purpose: Validate UI appearance, layout, accessibility (for UI features)

Automation: 30-50% automated
Speed: Minutes-Hours
Confidence: Medium (subjective elements)

Process:

Screenshot Generation:

# Generate screenshots of UI
# (assumes `use: { screenshot: 'on' }` is set in playwright.config.ts)
npx playwright test

# Or manually:
# Open application
# Capture screenshots of key views

Visual Comparison (if previous version exists):

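# Compare against baseline (tests that call expect(page).toHaveScreenshot())
npx playwright test

# Create baselines that do not exist yet
npx playwright test --update-snapshots=missing

# Or use Percy/Chromatic for visual regression
# (percy upload sends a directory of images; percy snapshot is for static HTML)
npx percy upload screenshots/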

Layout Validation:

# Visual Checklist

## Layout
- [ ] Components positioned correctly
- [ ] Spacing/margins match mockup
- [ ] Alignment proper
- [ ] No overlapping elements

## Styling
- [ ] Colors match design system
- [ ] Typography correct (fonts, sizes)
- [ ] Icons/images display properly

## Responsiveness
- [ ] Mobile view (320px-480px): ✅
- [ ] Tablet view (768px-1024px): ✅
- [ ] Desktop view (>1024px): ✅

Accessibility Testing:

# Automated accessibility scan (the axe CLI runs against a URL, not a source directory)
npx @axe-core/cli http://localhost:3000

# Check WCAG compliance
npx pa11y http://localhost:3000

# Manual checks:
# - Keyboard navigation
# - Screen reader compatibility
# - Color contrast ratios

Generate Layer 3 Report:

# Layer 3: Visual Verification

## Screenshot Comparison
✅ Login page matches mockup
✅ Dashboard layout correct
⚠️ Profile page: Avatar alignment off by 5px

## Responsiveness
✅ Mobile: All components visible
✅ Tablet: Layout adapts correctly
✅ Desktop: Full functionality

## Accessibility
✅ WCAG 2.1 AA compliance
✅ Keyboard navigation works
⚠️ 2 color contrast warnings (non-critical)

**Layer 3 Status**: ✅ PASS (minor issues acceptable)
**Issues**: Avatar alignment (cosmetic), contrast warnings

Outputs:

Screenshots of UI
Visual comparison results
Responsiveness validation
Accessibility report
Layer 3 status

Validation:

Time Estimate: 30-90 minutes (skip if no UI)

Gate 3: ✅ PASS if no critical visual/a11y issues

Operation 4: Integration Verification (Layer 4)

Purpose: Validate system-level integration, data flow, API compatibility

Automation: 20-30% automated
Speed: Hours (complex)
Confidence: Medium-High

Process:

Component Integration Tests:

# Run integration test suite
npm test -- tests/integration/

# Verify components work together
# - Database ← → API
# - API ← → Frontend
# - Frontend ← → User

Data Flow Validation:

# Data Flow Verification

**Flow 1: User Registration**
Frontend form → API endpoint → Validation → Database → Email service
✅ Data flows correctly
✅ No data loss
✅ Transactions atomic

**Flow 2: Authentication**
Login request → API → Database lookup → Token generation → Response
✅ Token generated correctly
✅ Session stored
✅ Response includes token

API Integration Tests:

# Test all API endpoints
npm run test:api

# Verify:
# - All endpoints respond
# - Status codes correct
# - Response formats match spec
# - Error handling works

End-to-End Workflow Tests:

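// Complete user journeys
// Assumes `api` is an axios-style HTTP client and that `userData`,
// `credentials`, and `confirmLink` are fixtures defined by the test setup.
test('Complete registration and login flow', async () => {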
  // 1. Register new user
  const registerResponse = await api.post('/register', userData);
  expect(registerResponse.status).toBe(201);

  // 2. Confirm email
  const confirmResponse = await api.get(confirmLink);
  expect(confirmResponse.status).toBe(200);

  // 3. Login
  const loginResponse = await api.post('/login', credentials);
  expect(loginResponse.status).toBe(200);
  expect(loginResponse.data.token).toBeDefined();

  // 4. Access protected resource
  const profileResponse = await api.get('/profile', {
    headers: { Authorization: `Bearer ${loginResponse.data.token}` }
  });
  expect(profileResponse.status).toBe(200);
});

Dependency Compatibility:

# Check dependencies for known vulnerabilities
npm audit

# List outdated packages (review changelogs for breaking changes before upgrading)
npm outdated

# Verify integration with services
# - Database connection
# - Redis/cache
# - External APIs

Generate Layer 4 Report:

# Layer 4: Integration Verification

## Component Integration
✅ 12/12 integration tests passing
✅ All components integrate correctly

## Data Flow
✅ All 5 data flows validated
✅ No data loss or corruption

## API Integration
✅ All 15 endpoints functional
✅ Response formats correct
✅ Error handling works

## E2E Workflows
✅ 8/8 user journeys complete successfully
✅ No workflow breaks

## Dependencies
✅ 0 critical vulnerabilities
⚠️ 2 moderate (non-blocking)

**Layer 4 Status**: ✅ PASS

Outputs:

Integration test results
Data flow validation
API compatibility report
E2E workflow results
Dependency audit
Layer 4 status

Validation:

Time Estimate: 45-90 minutes

Gate 4: ✅ PASS if all integration tests pass and there are no critical dependency vulnerabilities

Operation 5: Quality Scoring (Layer 5)

Purpose: Holistic quality assessment using LLM-as-judge and Agent-as-a-Judge patterns

Automation: 0-20% automated
Speed: Hours (expensive)
Confidence: Medium (requires judgment)

Process:

Spawn Independent Quality Assessor (Agent-as-a-Judge):

Key: Use a different model family if possible (to prevent self-preference bias)

const qualityAssessment = await task({
  description: "Assess code quality holistically",
  prompt: `Evaluate code quality in src/ and tests/.

  DO NOT read implementation conversation history.

  You have access to tools:
  - Read files
  - Execute tests
  - Run linters
  - Query database (if needed)

  Assess 5 dimensions (score each /20):

  1. CORRECTNESS (/20):
     - Logic correctness
     - Edge case handling
     - Error handling completeness
     - Security considerations

  2. FUNCTIONALITY (/20):
     - Meets all requirements
     - User workflows work
     - Performance acceptable
     - No regressions

  3. QUALITY (/20):
     - Code maintainability
     - Best practices followed
     - Anti-patterns avoided
     - Documentation complete

  4. INTEGRATION (/20):
     - Components integrate smoothly
     - API contracts correct
     - Data flow works
     - Backward compatible

  5. SECURITY (/20):
     - No vulnerabilities
     - Input validation
     - Authentication/authorization
     - Data protection

  TOTAL: /100 (sum of 5 dimensions)

  For each dimension, provide:
  - Score (/20)
  - Strengths (what's good)
  - Weaknesses (what needs improvement)
  - Evidence (file:line references)
  - Recommendations (specific, actionable)

  Write comprehensive report to: quality-assessment.md`
});

Multi-Agent Ensemble (for critical features):

3-5 Agent Voting Committee:

// Spawn 3 independent quality assessors
const [judge1, judge2, judge3] = await Promise.all([
  task({description: "Quality Judge 1", prompt: assessmentPrompt}),
  task({description: "Quality Judge 2", prompt: assessmentPrompt}),
  task({description: "Quality Judge 3", prompt: assessmentPrompt})
]);

// Aggregate scores
const scores = {
  correctness: median([judge1.correctness, judge2.correctness, judge3.correctness]),
  functionality: median([...]),
  quality: median([...]),
  integration: median([...]),
  security: median([...])
};

const totalScore = sum(Object.values(scores)); // Total /100

// Check disagreement between judges (range of totals, not statistical variance)
const totalScores = [judge1.total, judge2.total, judge3.total];
const scoreRange = max(totalScores) - min(totalScores);

if (scoreRange > 15) {
  // High disagreement → spawn 2 more judges (total 5)
  // Use 5-agent ensemble for final score
}

// Final score: sum of per-dimension medians across the 3 or 5 judges
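The aggregation step above leaves `median`, `sum`, `max`, and `min` undefined. A self-contained sketch of the same idea follows, assuming each judge's report has already been parsed into five numeric dimension scores. `JudgeScores`, `aggregateJudges`, and the sample scores are illustrative assumptions, not output from a real run.

```typescript
type Dimension = "correctness" | "functionality" | "quality" | "integration" | "security";
type JudgeScores = Record<Dimension, number>; // each dimension scored 0-20

const DIMENSIONS: Dimension[] = ["correctness", "functionality", "quality", "integration", "security"];

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function totalOf(scores: JudgeScores): number {
  return DIMENSIONS.reduce((sum, d) => sum + scores[d], 0);
}

// Takes the median per dimension, sums the medians for the total (/100),
// and flags high disagreement (range of judge totals > 15) while fewer than 5 judges ran.
function aggregateJudges(judges: JudgeScores[]) {
  const perDimension = {} as JudgeScores;
  for (const d of DIMENSIONS) {
    perDimension[d] = median(judges.map((j) => j[d]));
  }
  const totals = judges.map(totalOf);
  const spread = Math.max(...totals) - Math.min(...totals);
  return {
    perDimension,
    total: totalOf(perDimension),
    spread,
    needsMoreJudges: judges.length < 5 && spread > 15,
  };
}

// Made-up scores from three judges.
const sampleJudges: JudgeScores[] = [
  { correctness: 18, functionality: 19, quality: 17, integration: 18, security: 16 },
  { correctness: 17, functionality: 19, quality: 18, integration: 18, security: 15 },
  { correctness: 19, functionality: 18, quality: 17, integration: 17, security: 17 },
];
const ensemble = aggregateJudges(sampleJudges);
console.log(ensemble.total, ensemble.spread, ensemble.needsMoreJudges); // 88 1 false
```

Taking the median per dimension keeps a single outlier judge from dragging any one score. The sum of per-dimension medians can differ from the median of the judges' totals, so it helps to pick one definition and use it consistently.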

Calibration Against Rubric:

# Scoring Calibration

## Correctness: 18/20 (Excellent)
**20**: Zero errors, all edge cases handled perfectly
**18**: Minor edge case missing, otherwise excellent ✅ (achieved)
**15**: 1-2 significant edge cases missing
**10**: Some logic errors present
**0**: Major functionality broken

**Evidence**: All tests pass, edge cases covered except timezone DST edge case (minor)

## Functionality: 19/20 (Excellent)
[Similar rubric with evidence]

## Quality: 17/20 (Good)
[Similar rubric with evidence]

## Integration: 18/20 (Excellent)
[Similar rubric with evidence]

## Security: 16/20 (Good)
[Similar rubric with evidence]

**Total**: 88/100 ⚠️ (Below ≥90 gate)

Gap Analysis (if <90):

# Quality Gap Analysis

**Current Score**: 88/100
**Target**: ≥90/100
**Gap**: 2 points

## Critical Gaps (Blocking Approval)
None

## High Priority (Should Fix for ≥90)
1. **Security: Weak bcrypt rounds**
   - **What**: bcrypt using 10 rounds (outdated)
   - **Where**: src/auth/hash.ts:15
   - **Why**: Current standard is 12-14 rounds
   - **How**: Change `bcrypt.hash(password, 10)` to `bcrypt.hash(password, 12)`
   - **Priority**: High
   - **Impact**: +2 points → 90/100

## Medium Priority
1. **Quality: Missing JSDoc for 3 functions**
   - Impact: +1 point → 91/100

**Recommendation**: Fix high priority issue to reach ≥90 threshold
**Estimated Effort**: 15 minutes
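The arithmetic behind the gap analysis can be sketched as a small planner that applies fixes in priority order until the projected score reaches the target. `ProposedFix` and `planFixes` are hypothetical names, and the point impacts are the estimates from the sample analysis above (not guaranteed score changes).

```typescript
// Hypothetical description of one recommended fix.
interface ProposedFix {
  title: string;
  priority: "Critical" | "High" | "Medium" | "Low";
  impact: number; // estimated points gained
}

const PRIORITY_ORDER = ["Critical", "High", "Medium", "Low"];

// Selects fixes from highest to lowest priority until the projected score meets the target.
function planFixes(current: number, target: number, fixes: ProposedFix[]) {
  const ordered = [...fixes].sort(
    (a, b) => PRIORITY_ORDER.indexOf(a.priority) - PRIORITY_ORDER.indexOf(b.priority),
  );
  const selected: string[] = [];
  let projected = current;
  for (const fix of ordered) {
    if (projected >= target) break;
    selected.push(fix.title);
    projected += fix.impact;
  }
  return { gap: Math.max(0, target - current), selected, projected };
}

const sampleFixes: ProposedFix[] = [
  { title: "Add JSDoc for 3 functions", priority: "Medium", impact: 1 },
  { title: "Increase bcrypt rounds", priority: "High", impact: 2 },
];
console.log(planFixes(88, 90, sampleFixes));
// { gap: 2, selected: [ 'Increase bcrypt rounds' ], projected: 90 }
```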

Generate Comprehensive Quality Report:

# Layer 5: Quality Scoring Report

## Executive Summary
**Total Score**: 88/100 ⚠️ (Below ≥90 gate)
**Status**: NEEDS MINOR REVISION

## Dimension Scores
- Correctness: 18/20 ⭐⭐⭐⭐⭐
- Functionality: 19/20 ⭐⭐⭐⭐⭐
- Quality: 17/20 ⭐⭐⭐⭐
- Integration: 18/20 ⭐⭐⭐⭐⭐
- Security: 16/20 ⭐⭐⭐⭐

## Strengths
1. Comprehensive test coverage (87%)
2. All functionality working correctly
3. Clean integration with all components
4. Good error handling

## Weaknesses
1. Bcrypt rounds below current standard (security)
2. Missing documentation for helper functions (quality)
3. One timezone edge case not handled (correctness)

## Recommendations (Prioritized)

### Priority 1 (High - Needed for ≥90)
1. Increase bcrypt rounds: 10 → 12
   - File: src/auth/hash.ts:15
   - Effort: 5 min
   - Impact: +2 points

### Priority 2 (Medium - Nice to Have)
1. Add JSDoc to helper functions
   - Files: src/utils/validation.ts
   - Effort: 30 min
   - Impact: +1 point

2. Handle timezone DST edge case
   - File: src/auth/tokens.ts:78
   - Effort: 20 min
   - Impact: +1 point

**Next Steps**: Apply Priority 1 fix, re-verify to reach ≥90

Outputs:

Quality score (0-100) with dimension breakdown
Calibrated against rubric
Gap analysis
Prioritized recommendations (Critical/High/Medium/Low)
Evidence-based feedback (file:line references)
Action plan to reach ≥90

Validation:

All 5 dimensions scored
Scores calibrated against rubric
Evidence provided for each score
Gap analysis if <90
Recommendations actionable
Ensemble used for critical features (optional)

Time Estimate: 60-120 minutes (ensemble adds 30-60 min)

Gate 5: ✅ PASS if total score ≥90/100

Quality Gates Summary

All 5 Gates Must Pass for production approval:

Gate 1: Rules Pass ✅
   ↓ (Linting, types, schema, security)

Gate 2: Tests Pass ✅
   ↓ (All tests, coverage ≥80%)

Gate 3: Visual OK ✅
   ↓ (UI validated, a11y checked)

Gate 4: Integration OK ✅
   ↓ (E2E works, APIs integrate)

Gate 5: Quality ≥90 ✅
   ↓ (LLM-as-judge score ≥90/100)

✅ PRODUCTION APPROVED

If Any Gate Fails:

Failed Gate → Gap Analysis → Apply Fixes → Re-Verify → Repeat Until Pass
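The re-verify cycle might look like the loop below. This is a sketch under stated assumptions: `verify` and `applyFixes` are hypothetical callbacks, and the `maxRounds` cap is an added safeguard (the flow above simply repeats until pass).

```typescript
// `verify` returns the name of the first failed gate, or null when all gates pass.
// `applyFixes` stands in for gap analysis plus applying the fixes.
function verifyUntilPass(
  verify: () => string | null,
  applyFixes: (failedGate: string) => void,
  maxRounds = 5, // assumed cap so the loop cannot run forever
): { approved: boolean; rounds: number } {
  for (let round = 1; round <= maxRounds; round++) {
    const failedGate = verify();
    if (failedGate === null) return { approved: true, rounds: round };
    applyFixes(failedGate);
  }
  return { approved: false, rounds: maxRounds };
}

// Simulated run: Gate 5 fails once, then passes after the fix is applied.
let simulatedBcryptRounds = 10;
const outcome = verifyUntilPass(
  () => (simulatedBcryptRounds >= 12 ? null : "Gate 5"),
  () => { simulatedBcryptRounds = 12; },
);
console.log(outcome); // { approved: true, rounds: 2 }
```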

Appendix A: Independence Protocol

How Verification Independence is Maintained

Verification Agent Spawning:

// After implementation and testing complete
const verification = await task({
  description: "Independent quality verification",
  prompt: `Verify code quality independently.

  DO NOT read prior conversation history.

  Review:
  - Code: src/**/*.ts
  - Tests: tests/**/*.test.ts
  - Specs: specs/requirements.md

  Verify against specifications ONLY (not implementation decisions).

  Use tools:
  - Read files to inspect code
  - Run tests to verify functionality
  - Execute linters for quality checks

  Score quality (0-100) with evidence.
  Write report to: independent-verification.md`
});

Bias Prevention Checklist:

Specifications written BEFORE implementation
Verification agent prompt has no implementation context
Agent evaluates against specs, not what code does
Fresh context (via Task tool)
Different model family used (if possible)

Validation of Independence:

## Independence Audit

**Expected Behavior**:
- ✅ Verifier finds 1-3 issues (healthy skepticism)
- ✅ Verifier references specifications
- ✅ Verifier uses tools to verify claims

**Warning Signs**:
- ⚠️ Verifier finds 0 issues (possible rubber stamp)
- ⚠️ Verifier doesn't use tools
- ⚠️ Verifier parrots implementation justifications

**If Warning**: Re-verify with stronger independence prompt
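Two of the warning signs above can be checked mechanically if the verifier's run is summarized in a small record. The `VerifierRun` fields are assumptions about what such a summary would contain. The spec-reference check is the inverse of an expected behavior, not one of the listed warning signs, and the "parrots implementation justifications" sign still needs human review.

```typescript
// Hypothetical summary of one verifier run.
interface VerifierRun {
  issuesFound: number;
  toolCalls: number;
  specReferences: number;
}

// Returns a list of independence warnings; an empty list means no mechanical red flags.
function independenceWarnings(run: VerifierRun): string[] {
  const warnings: string[] = [];
  if (run.issuesFound === 0) warnings.push("0 issues found (possible rubber stamp)");
  if (run.toolCalls === 0) warnings.push("verifier did not use tools");
  if (run.specReferences === 0) warnings.push("verifier did not reference specifications");
  return warnings;
}

console.log(independenceWarnings({ issuesFound: 2, toolCalls: 7, specReferences: 3 })); // []
console.log(independenceWarnings({ issuesFound: 0, toolCalls: 0, specReferences: 1 }).length); // 2
```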

Appendix B: Operational Scoring Rubrics

Complete Rubrics for All 5 Dimensions

Correctness (/20)

20 (Perfect): Zero logic errors, all edge cases handled, security perfect
18 (Excellent): 1 minor edge case missing, otherwise flawless
15 (Good): 2-3 edge cases missing, no critical errors
12 (Acceptable): Some edge cases missing, 1 minor logic issue
10 (Needs Work): Multiple edge cases missing or 1 significant logic error
5 (Poor): Major logic errors present
0 (Broken): Critical functionality broken

Functionality (/20)

20: All requirements met, exceeds expectations
18: All requirements met, well implemented
15: All requirements met, basic implementation
12: 1 requirement partially missing
10: 2+ requirements partially missing
5: Several requirements not met
0: Core functionality missing

Quality (/20)

20: Exceptional code quality, best practices exemplified
18: High quality, follows best practices
15: Good quality, minor style issues
12: Acceptable quality, several style issues
10: Below standard, needs refactoring
5: Poor quality, significant issues
0: Unmaintainable code

Integration (/20)

20: Perfect integration, all touch points verified
18: Excellent integration, minor docs needed
15: Good integration, all major points work
12: Acceptable, 1-2 integration issues
10: Integration issues present
5: Multiple integration problems
0: Does not integrate

Security (/20)

20: Passes all security scans, OWASP compliant, hardened
18: Passes scans, 1 minor non-critical issue
15: Passes, 2-3 minor issues
12: 1 medium security issue
10: Multiple medium issues
5: 1 critical issue present
0: Multiple critical vulnerabilities

Appendix C: Technical Foundation

Verification Tools

Linting:

ESLint (JavaScript/TypeScript)
Pylint/Ruff (Python)

Type Checking:

TypeScript compiler (tsc)
mypy (Python)

Security (SAST):

Semgrep (multi-language)
Bandit (Python)
npm audit (JavaScript)

Visual Testing:

Playwright (screenshot, visual regression)
Percy/Chromatic (visual diff)
axe-core (accessibility)

Coverage:

c8/nyc (JavaScript)
pytest-cov (Python)

Cost Controls

Budget Caps:

LLM-as-judge: $50/month
Ensemble verification: $20/month
Total verification: $70/month

Optimization:

Cache quality scores for 24h (same code → same score)
Skip Layer 5 for changes <50 lines
Use ensemble (3-5 agents) only for critical features
Use cheaper models for pre-filtering (Haiku for Layer 1-2)
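The optimization rules above could be wired together roughly as follows. This is a sketch, not the skill's implementation: `ChangeInfo`, `planLayer5`, and the in-memory `scoreCache` are hypothetical, and the code hash is assumed to be computed elsewhere (for example from the file contents).

```typescript
// Hypothetical change descriptor; `critical` would come from your own labeling.
interface ChangeInfo {
  linesChanged: number;
  critical: boolean;
}

type Layer5Plan = "skip" | "single-judge" | "ensemble";

// Assumption: the <50-line rule is checked first, so small changes skip
// Layer 5 even when marked critical. Swap the order if that is not wanted.
function planLayer5(change: ChangeInfo): Layer5Plan {
  if (change.linesChanged < 50) return "skip";
  return change.critical ? "ensemble" : "single-judge";
}

const DAY_MS = 24 * 60 * 60 * 1000;
const scoreCache = new Map<string, { score: number; storedAt: number }>();

// Returns a cached score for this code hash if it is younger than 24h.
// `now` is passed in (milliseconds) so the function stays deterministic.
function cachedScore(codeHash: string, now: number): number | undefined {
  const entry = scoreCache.get(codeHash);
  return entry && now - entry.storedAt < DAY_MS ? entry.score : undefined;
}

scoreCache.set("abc123", { score: 88, storedAt: 0 });
console.log(planLayer5({ linesChanged: 30, critical: false })); // skip
console.log(planLayer5({ linesChanged: 400, critical: true })); // ensemble
console.log(cachedScore("abc123", DAY_MS - 1)); // 88
console.log(cachedScore("abc123", DAY_MS)); // undefined
```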

Quick Reference

The 5 Layers

Layer	Purpose	Automation	Time	Tools
1	Rules-based	95%	15-30m	Linters, types, SAST
2	Functional	60-80%	30-60m	Test execution, coverage
3	Visual	30-50%	30-90m	Screenshots, a11y
4	Integration	20-30%	45-90m	E2E, API tests
5	Quality Scoring	0-20%	60-120m	LLM-as-judge, ensemble

Total: 3-6.5 hours for complete 5-layer verification (sum of the per-layer estimates above)

Quality Thresholds

≥90: ✅ Excellent (production-ready)
80-89: ⚠️ Good (needs minor improvements)
70-79: ❌ Acceptable (needs work before production)
<70: ❌ Poor (significant rework required)

Gates

All 5 Must Pass:

Rules pass (no critical lint/type/security)
Tests pass + coverage ≥80%
Visual OK (no critical UI issues)
Integration OK (E2E works)
Quality ≥90/100

multi-ai-verification provides comprehensive, multi-layer quality assurance with independent LLM-as-judge evaluation, ensuring production-ready code through systematic verification from automated rules to holistic quality assessment.

For rubrics, see Appendix B. For independence protocol, see Appendix A.