QA & Testing Engine — Complete Software Quality System

The definitive testing methodology for AI agents. From test strategy to execution, coverage to reporting — everything you need to ship quality software.

Phase 1: Test Strategy Design

Before writing a single test, design the strategy.

Strategy Brief Template

project:
  name: ""
  type: web-app | api | mobile | library | cli | data-pipeline
  languages: [typescript, python, go, java]
  frameworks: [react, express, django, spring]
  
risk_profile:
  data_sensitivity: low | medium | high | critical  # PII, financial, health
  user_impact: internal | b2b | b2c | life-safety
  deployment_frequency: daily | weekly | monthly
  regulatory: [none, SOC2, HIPAA, PCI-DSS, GDPR]

test_scope:
  in_scope: []    # Features, services, components
  out_of_scope: [] # Explicitly excluded (with reason)
  
environments:
  dev: { url: "", db: "local" }
  staging: { url: "", db: "seeded" }
  prod: { url: "", smoke_only: true }

Test Type Decision Matrix

Risk Profile	Unit	Integration	E2E	Performance	Security	Accessibility
Internal tool	✅ Core	✅ API	⚠️ Happy path	❌	⚠️ Basic	❌
B2B SaaS	✅ Full	✅ Full	✅ Critical flows	✅ Load	✅ OWASP Top 10	✅ WCAG AA
B2C high-traffic	✅ Full	✅ Full	✅ Full	✅ Stress + soak	✅ Full	✅ WCAG AA
Financial/Health	✅ Full + mutation	✅ Full + contract	✅ Full + chaos	✅ Full suite	✅ Pen test	✅ WCAG AAA

Test Pyramid Architecture

         /  E2E  \          5-10% — Critical user journeys only
        / Integration \     20-30% — API contracts, service boundaries
       /    Unit Tests   \  60-70% — Business logic, pure functions

Anti-pattern: Ice cream cone — More E2E than unit tests. Slow, flaky, expensive. Fix by pushing test coverage DOWN the pyramid.

Anti-pattern: Hourglass — Lots of unit + E2E, no integration. Misses contract bugs between services.

Phase 2: Unit Testing Mastery

The AAA Pattern (Arrange-Act-Assert)

Every unit test follows this structure:

describe('PricingCalculator', () => {
  // Group by behavior, not by method
  describe('when customer has volume discount', () => {
    it('applies tiered pricing above threshold', () => {
      // ARRANGE — Set up the scenario
      const calculator = new PricingCalculator();
      const customer = createCustomer({ tier: 'enterprise', units: 150 });
      
      // ACT — Execute the behavior under test
      const price = calculator.calculate(customer);
      
      // ASSERT — Verify the outcome (ONE logical assertion)
      expect(price).toEqual({
        subtotal: 12000,
        discount: 1800,  // 15% volume discount
        total: 10200,
      });
    });
  });
});

Test Naming Convention

Format: [unit] [scenario] [expected behavior]

✅ Good:

PricingCalculator applies 15% discount when units exceed 100
UserService throws NotFoundError when user ID is invalid
parseDate returns null for malformed ISO strings

❌ Bad:

test1, should work, calculates price

What to Unit Test (Priority Order)

Business logic — Pricing, rules, calculations, state machines
Data transformations — Parsers, formatters, serializers, mappers
Edge cases — Boundaries, null/undefined, empty collections, overflow
Error handling — Every catch block, every validation path
Pure functions — Easiest to test, highest ROI

What NOT to Unit Test

Framework internals (React rendering, Express routing)
Simple getters/setters with no logic
Third-party library behavior
Implementation details (private methods, internal state)

Mocking Rules

Dependency Type	Strategy	Example
Database	Mock the repository/DAO	`jest.mock('./userRepo')`
HTTP API	Mock the client or use MSW	`msw.http.get('/api/users', ...)`
File system	Mock fs or use temp dirs	`jest.mock('fs/promises')`
Time/Date	Fake timers	`jest.useFakeTimers()`
Randomness	Seed or mock	`jest.spyOn(Math, 'random')`
Environment	Override env vars	`process.env.NODE_ENV = 'test'`

Rule: Mock at boundaries, not internals. If you're mocking a class you own, your design might need refactoring.

Coverage Targets

Metric	Minimum	Good	Excellent
Line coverage	70%	85%	95%+
Branch coverage	60%	80%	90%+
Function coverage	75%	90%	95%+
Critical path coverage	100%	100%	100%

Warning: 100% coverage ≠ quality. Coverage measures what code ran, not what was verified. A test with no assertions has coverage but no value.

Phase 3: Integration Testing

API Testing Checklist

For every API endpoint, test:

endpoint: POST /api/orders
tests:
  happy_path:
    - Valid request returns 201 with order ID
    - Response matches schema
    - Database record created correctly
    - Events/webhooks fired
    
  validation:
    - Missing required fields → 400 with field errors
    - Invalid data types → 400 with type errors
    - Business rule violations → 422 with explanation
    
  authentication:
    - No token → 401
    - Expired token → 401
    - Wrong role → 403
    - Valid token → proceeds
    
  edge_cases:
    - Duplicate request (idempotency) → same response
    - Concurrent requests → no race condition
    - Maximum payload size → 413 or graceful handling
    - Special characters in input → no injection
    
  error_handling:
    - Database down → 503 with retry hint
    - External service timeout → 504 or fallback
    - Rate limit exceeded → 429 with retry-after

Contract Testing

When services communicate, test the contract:

contract:
  consumer: order-service
  provider: payment-service
  
  interactions:
    - description: "Process payment"
      request:
        method: POST
        path: /payments
        body:
          amount: 99.99
          currency: USD
          order_id: "ord_123"
      response:
        status: 200
        body:
          payment_id: "pay_xxx"  # string, not null
          status: "completed"    # enum: completed|pending|failed
          
  breaking_changes:  # NEVER do these without versioning
    - Remove a field from response
    - Change a field's type
    - Add a required field to request
    - Change the URL path
    - Change error response format

Database Testing Rules

Each test gets a clean state — Use transactions that rollback, or truncate between tests
Use factories, not fixtures — createUser({ role: 'admin' }) > hardcoded SQL dumps
Test migrations — Run migrate-up, migrate-down, migrate-up (roundtrip)
Test constraints — Unique violations, FK cascades, NOT NULL
Test queries — Especially complex JOINs, aggregations, window functions

Phase 4: End-to-End Testing

Critical User Journey Mapping

Identify and test the flows that generate revenue or block users:

critical_journeys:
  - name: "Sign up → First value"
    steps:
      - Visit landing page
      - Click sign up
      - Fill registration form
      - Verify email
      - Complete onboarding
      - Perform first key action
    max_duration: 3 minutes
    
  - name: "Purchase flow"
    steps:
      - Browse products
      - Add to cart
      - Enter shipping
      - Enter payment
      - Confirm order
      - Receive confirmation email
    max_duration: 2 minutes
    
  - name: "Login → Core task → Logout"
    steps:
      - Login (password + SSO + MFA variants)
      - Navigate to core feature
      - Complete primary workflow
      - Verify result
      - Logout
    max_duration: 1 minute

E2E Best Practices

Test user behavior, not implementation — Click buttons by text/role, not by CSS class
Use data-testid sparingly — Only when no accessible selector exists
Wait for state, not time — waitFor(element) not sleep(3000)
Isolate test data — Each test creates its own users/data
Run in CI with retries — 1 retry for flaky network, investigate if >5% flake rate

Selector Priority (Best → Worst)

getByRole('button', { name: 'Submit' }) — Accessible, resilient
getByLabelText('Email') — Form-specific, accessible
getByText('Welcome back') — Content-based
getByTestId('submit-btn') — Explicit test hook
querySelector('.btn-primary') — ❌ Fragile, breaks on CSS changes

Flaky Test Triage

Symptom	Likely Cause	Fix
Passes locally, fails in CI	Timing/race condition	Add explicit waits, check CI resource limits
Fails intermittently	Shared state between tests	Isolate test data, reset state
Fails after deploy	Environment difference	Check env vars, API versions, feature flags
Fails at specific time	Time-dependent logic	Mock dates/times, avoid time-sensitive assertions
Fails in parallel	Resource contention	Use unique ports/DBs per worker

Rule: Quarantine flaky tests within 24 hours. A flaky test suite that everyone ignores is worse than no tests.

Phase 5: Performance Testing

Load Test Design

performance_tests:
  smoke:
    vus: 5
    duration: 1m
    purpose: "Verify test works"
    
  load:
    vus: 100  # Expected concurrent users
    duration: 10m
    ramp_up: 2m
    purpose: "Normal traffic behavior"
    thresholds:
      p95_response: <500ms
      error_rate: <1%
      
  stress:
    vus: 300  # 3x expected load
    duration: 15m
    ramp_up: 5m
    purpose: "Find breaking point"
    
  soak:
    vus: 80
    duration: 2h
    purpose: "Memory leaks, connection exhaustion"
    
  spike:
    stages:
      - { vus: 50, duration: 2m }
      - { vus: 500, duration: 30s }  # Sudden spike
      - { vus: 50, duration: 2m }
    purpose: "Recovery behavior"

Performance Budgets

Metric	Web App	API	Background Job
Response time (p50)	<200ms	<100ms	N/A
Response time (p95)	<1s	<500ms	N/A
Response time (p99)	<3s	<1s	N/A
Throughput	>100 rps	>500 rps	>1000/min
Error rate	<0.1%	<0.1%	<0.5%
CPU usage	<70%	<70%	<90%
Memory growth	<5%/hr	<2%/hr	<10%/hr

Database Performance Testing

db_performance:
  query_tests:
    - name: "Dashboard aggregate query"
      baseline: 50ms
      max_acceptable: 200ms
      with_1M_rows: measure
      with_10M_rows: measure
      
  index_verification:
    - Run EXPLAIN ANALYZE on all critical queries
    - Verify no sequential scans on tables >10K rows
    - Check index usage statistics weekly
    
  connection_pool:
    - Test at max connections
    - Verify graceful handling when pool exhausted
    - Monitor connection wait time

Phase 6: Security Testing

OWASP Top 10 Test Checklist

security_tests:
  A01_broken_access_control:
    - [ ] Horizontal privilege escalation (access other user's data)
    - [ ] Vertical privilege escalation (access admin functions)
    - [ ] IDOR (Insecure Direct Object References)
    - [ ] Missing function-level access control
    - [ ] CORS misconfiguration
    
  A02_cryptographic_failures:
    - [ ] Sensitive data in transit (TLS 1.2+)
    - [ ] Sensitive data at rest (encryption)
    - [ ] Password hashing (bcrypt/argon2, not MD5/SHA)
    - [ ] No secrets in code/logs/URLs
    
  A03_injection:
    - [ ] SQL injection (parameterized queries)
    - [ ] NoSQL injection
    - [ ] Command injection (OS commands)
    - [ ] XSS (stored, reflected, DOM-based)
    - [ ] Template injection (SSTI)
    
  A04_insecure_design:
    - [ ] Rate limiting on auth endpoints
    - [ ] Account lockout after N failures
    - [ ] CAPTCHA on public forms
    - [ ] Business logic abuse scenarios
    
  A05_security_misconfiguration:
    - [ ] Default credentials removed
    - [ ] Error messages don't leak stack traces
    - [ ] Security headers set (CSP, HSTS, X-Frame-Options)
    - [ ] Directory listing disabled
    - [ ] Unnecessary HTTP methods disabled
    
  A07_auth_failures:
    - [ ] Brute force protection
    - [ ] Session fixation
    - [ ] Session timeout
    - [ ] JWT validation (signature, expiry, issuer)
    - [ ] MFA bypass attempts

Input Validation Test Payloads

Test every user input with:

injection_payloads:
  sql: ["' OR 1=1--", "'; DROP TABLE users;--", "1 UNION SELECT * FROM users"]
  xss: ["<script>alert(1)</script>", "<img onerror=alert(1) src=x>", "javascript:alert(1)"]
  path_traversal: ["../../etc/passwd", "..\\..\\windows\\system32", "%2e%2e%2f"]
  command: ["; ls -la", "| cat /etc/passwd", "$(whoami)", "`id`"]
  
boundary_values:
  strings: ["", " ", "a"*10000, null, undefined, "emoji: 🎯", "unicode: é à ü", "rtl: مرحبا"]
  numbers: [0, -1, 2147483647, -2147483648, NaN, Infinity, 0.1+0.2]
  arrays: [[], [null], Array(10000)]
  dates: ["1970-01-01", "2099-12-31", "invalid-date", "2024-02-29", "2023-02-29"]

Phase 7: Test Automation Architecture

Framework Selection Guide

Need	JavaScript/TS	Python	Go	Java
Unit	Vitest / Jest	pytest	testing + testify	JUnit 5
API	Supertest	httpx + pytest	net/http/httptest	RestAssured
E2E (browser)	Playwright	Playwright	chromedp	Selenium
Performance	k6	Locust	vegeta	Gatling
Contract	Pact	Pact	Pact	Pact
Security	ZAP + custom	Bandit + custom	gosec	SpotBugs

CI Pipeline Test Stages

pipeline:
  stage_1_fast:  # <2 min, blocks PR
    - Lint + type check
    - Unit tests
    - Security: dependency scan (npm audit / safety)
    
  stage_2_thorough:  # <10 min, blocks merge
    - Integration tests
    - Contract tests
    - Security: SAST scan
    - Coverage report + threshold check
    
  stage_3_confidence:  # <30 min, blocks deploy
    - E2E critical journeys
    - Visual regression (if applicable)
    - Security: container scan
    
  stage_4_post_deploy:  # After deploy to staging
    - Smoke tests against staging
    - Performance baseline check
    - Security: DAST scan (ZAP)
    
  stage_5_production:  # After prod deploy
    - Smoke tests (critical paths only)
    - Synthetic monitoring enabled
    - Canary metrics watching

Test Data Management

test_data_strategy:
  unit_tests:
    approach: factories  # Builder pattern, create exactly what you need
    example: "createUser({ role: 'admin', plan: 'enterprise' })"
    
  integration_tests:
    approach: seeded_database
    reset: per_test_suite  # Transaction rollback or truncate
    sensitive_data: anonymized  # Never use real PII
    
  e2e_tests:
    approach: api_setup  # Create data via API before test
    cleanup: after_each  # Delete created data
    isolation: unique_identifiers  # Timestamp or UUID in test data
    
  performance_tests:
    approach: representative_dataset
    volume: 10x_production  # Test with more data than prod
    generation: faker_libraries  # Realistic but synthetic

Phase 8: Quality Metrics & Reporting

Test Health Dashboard

metrics:
  test_suite_health:
    total_tests: 0
    passing: 0
    failing: 0
    skipped: 0  # >5% skipped = tech debt alarm
    flaky: 0    # >2% flaky = quarantine immediately
    
  coverage:
    line: "0%"
    branch: "0%"
    critical_paths: "0%"  # Must be 100%
    
  execution:
    unit_duration: "0s"    # Target: <30s
    integration_duration: "0s"  # Target: <5m
    e2e_duration: "0s"     # Target: <15m
    total_ci_time: "0s"    # Target: <20m
    
  defect_metrics:
    bugs_found_in_test: 0
    bugs_escaped_to_prod: 0
    escape_rate: "0%"      # Target: <5%
    mttr: "0h"             # Mean time to resolve
    
  trends:  # Track weekly
    new_tests_added: 0
    tests_deleted: 0  # Healthy deletion = removing redundant tests
    coverage_delta: "+0%"
    flake_rate_delta: "+0%"

Test Report Template

# Test Report — [Feature/Sprint/Release]

## Summary
- **Status:** ✅ PASS / ⚠️ PASS WITH RISKS / ❌ FAIL
- **Tests Run:** X | **Passed:** X | **Failed:** X | **Skipped:** X
- **Coverage:** Line X% | Branch X% | Critical 100%
- **Duration:** Xm Xs

## Key Findings

### 🔴 Critical (Block Release)
1. [Finding] — [Impact] — [Fix recommendation]

### 🟡 High (Fix Before Next Release)
1. [Finding] — [Impact] — [Fix recommendation]

### 🟢 Medium/Low (Backlog)
1. [Finding] — [Impact]

## Risk Assessment
- **Untested areas:** [list]
- **Known flaky tests:** [list with ticket IDs]
- **Performance concerns:** [if any]

## Recommendation
[Ship / Ship with monitoring / Hold for fixes]

Quality Score (0-100)

Dimension	Weight	Scoring
Test coverage	20%	<60%=0, 60-70%=5, 70-80%=10, 80-90%=15, 90%+=20
Critical path coverage	20%	<100%=0, 100%=20
Defect escape rate	15%	>10%=0, 5-10%=5, 2-5%=10, <2%=15
Test suite speed	10%	>30m=0, 20-30m=3, 10-20m=7, <10m=10
Flake rate	10%	>5%=0, 2-5%=3, 1-2%=7, <1%=10
Security test coverage	10%	None=0, Basic=3, OWASP Top 10=7, Full=10
Documentation	5%	None=0, Basic=2, Complete=5
Automation ratio	10%	<50%=0, 50-70%=3, 70-90%=7, 90%+=10

Scoring: 0-40 = 🔴 Critical | 41-60 = 🟡 Needs Work | 61-80 = 🟢 Good | 81-100 = 💎 Excellent

Phase 9: Specialized Testing

Accessibility Testing (WCAG 2.1)

accessibility_checklist:
  level_a:  # Minimum compliance
    - [ ] All images have alt text
    - [ ] All form inputs have labels
    - [ ] Color is not the only visual indicator
    - [ ] Page has proper heading hierarchy (h1→h2→h3)
    - [ ] All functionality available via keyboard
    - [ ] Focus is visible and logical
    - [ ] No content flashes >3 times/second
    
  level_aa:  # Standard compliance (recommended)
    - [ ] Color contrast ratio ≥4.5:1 (normal text)
    - [ ] Color contrast ratio ≥3:1 (large text)
    - [ ] Text resizable to 200% without loss
    - [ ] Skip navigation links
    - [ ] Consistent navigation across pages
    - [ ] Error suggestions provided
    - [ ] ARIA landmarks for page regions
    
  tools:
    - axe-core (automated, catches ~30% of issues)
    - Lighthouse accessibility audit
    - Manual keyboard navigation test
    - Screen reader testing (VoiceOver/NVDA)

API Backward Compatibility Testing

compatibility_tests:
  when_updating_api:
    - [ ] All existing fields still present in response
    - [ ] No field type changes (string→number)
    - [ ] New required request fields have defaults
    - [ ] Deprecated fields still work (with warning header)
    - [ ] Error format unchanged
    - [ ] Pagination behavior unchanged
    - [ ] Rate limits not reduced
    
  versioning_strategy:
    - URL versioning: /v1/users, /v2/users
    - Header versioning: Accept: application/vnd.api+json;version=2
    - Sunset header for deprecated versions
    - Minimum 6-month deprecation notice

Chaos Engineering Principles

chaos_tests:
  network:
    - Service dependency goes down → graceful degradation?
    - Network latency increases 10x → timeout handling?
    - DNS resolution fails → fallback behavior?
    
  infrastructure:
    - Database primary fails → replica promotion?
    - Cache (Redis) goes down → DB fallback works?
    - Disk fills up → alerting + graceful failure?
    
  application:
    - Memory pressure → OOM handling?
    - CPU saturation → request queuing?
    - Certificate expiry → monitoring alert?
    
  data:
    - Corrupt message in queue → dead letter + alert?
    - Schema migration fails mid-way → rollback works?
    - Clock skew between services → idempotency holds?

Phase 10: Daily QA Workflow

For New Features

Review requirements — Identify test scenarios before code is written (shift-left)
Write test cases — Cover happy path, edge cases, error cases, security
Review PR tests — Are tests meaningful? Do they test behavior, not implementation?
Run full suite — Unit + integration + E2E for affected areas
Report findings — Use the test report template above

For Bug Fixes

Write failing test first — Reproduce the bug as a test
Verify fix makes test pass — The test IS the proof
Check for regression — Run related test suites
Add to regression suite — Bug tests prevent re-introduction

Weekly QA Review

weekly_review:
  monday:
    - Review flaky test quarantine — fix or delete
    - Check coverage trends — declining = tech debt
    - Review escaped defects — update test strategy
    
  friday:
    - Update test health dashboard
    - Clean up obsolete tests
    - Document new testing patterns discovered
    - Plan next week's testing focus

Natural Language Commands

"Create test strategy for [project/feature]" → Full strategy brief
"Write unit tests for [function/class]" → AAA pattern tests with edge cases
"Test this API endpoint: [method] [path]" → Full API test checklist
"Review these tests for quality" → Test code review with scoring
"Generate performance test plan" → k6/Locust test design
"Security test [feature/endpoint]" → OWASP-based test checklist
"Create test report for [release]" → Formatted test report
"What's our test health?" → Dashboard with metrics and recommendations
"Find gaps in our test coverage" → Analysis with prioritized recommendations
"Help debug this flaky test" → Root cause analysis with fix suggestions
"Set up CI test pipeline" → Stage-by-stage pipeline config
"Accessibility audit [page/component]" → WCAG checklist with findings