Cognitive Skills Benchmarking Framework
Overview
A rigorous framework for A/B/C testing reasoning patterns to empirically determine which cognitive methodologies perform best across problem categories. This framework enables data-driven pattern selection rather than heuristic-based choices.
Why Benchmark?
Different reasoning patterns (Tree of Thoughts, Breadth of Thought, Self-Reflecting Chain, Hypothesis Engine, Adversarial Reasoning, Dialectical Reasoning, Analogical Transfer, Rapid Triage Reasoning, and Negotiated Decision, abbreviated ToT, BoT, SRC, HE, AR, DR, AT, RTR, and NDF) claim different strengths, but without empirical measurement:

- We cannot validate these claims
- We cannot quantify trade-offs (quality vs. cost vs. time)
- We cannot track improvement over time
- Pattern selection remains subjective
This framework provides scientific rigor to cognitive skill evaluation.
Benchmark Structure
Problem Set Organization
```
benchmark-problems/
├── optimization/      # ToT territory
│   ├── easy/          # 5-10 min problems
│   ├── medium/        # 15-30 min problems
│   └── hard/          # 30-60 min problems
├── exploration/       # BoT territory
│   ├── easy/
│   ├── medium/
│   └── hard/
├── diagnosis/         # HE territory
│   ├── easy/
│   ├── medium/
│   └── hard/
├── security/          # AR territory
│   ├── easy/
│   ├── medium/
│   └── hard/
├── tradeoffs/         # DR territory
│   ├── easy/
│   ├── medium/
│   └── hard/
├── novel/             # AT territory
│   ├── easy/
│   ├── medium/
│   └── hard/
├── time-critical/     # RTR territory
│   ├── easy/
│   ├── medium/
│   └── hard/
└── stakeholder/       # NDF territory
    ├── easy/
    ├── medium/
    └── hard/
```
Problem Definition Schema
```yaml
# problem-template.yaml
problem_id: "OPT-001"
domain: "optimization"
difficulty: "medium"
title: "API Rate Limiter Design"
description: |
  Design a rate limiting system for a public API that handles
  10,000 requests/second with fair distribution across users.

context:
  constraints:
    - "Must handle burst traffic gracefully"
    - "Sub-millisecond latency requirement"
    - "Distributed deployment across 5 regions"
  resources:
    - "Redis cluster available"
    - "Current architecture uses nginx"

evaluation_criteria:
  - criterion: "Scalability"
    weight: 0.3
    rubric: |
      5: Handles 10x traffic with linear cost
      4: Handles 5x traffic efficiently
      3: Handles 2x traffic
      2: Handles current load only
      1: Cannot meet requirements
  - criterion: "Fairness"
    weight: 0.25
    rubric: |
      5: Per-user fairness with adaptive limits
      4: Per-user fairness with fixed limits
      3: Global fairness only
      2: Basic fairness, exploitable
      1: No fairness consideration
  - criterion: "Implementability"
    weight: 0.25
    rubric: |
      5: Clear implementation path, <1 week
      4: Implementation path, 1-2 weeks
      3: Requires some research, 2-4 weeks
      2: Significant unknowns
      1: Impractical to implement
  - criterion: "Operational Simplicity"
    weight: 0.2
    rubric: |
      5: Self-healing, minimal ops burden
      4: Standard monitoring/alerting sufficient
      3: Requires dedicated monitoring
      2: High operational complexity
      1: Operational nightmare

ground_truth:
  known_good_solutions:
    - "Token bucket with Redis MULTI/EXEC"
    - "Sliding window log with sorted sets"
  common_pitfalls:
    - "Race conditions in distributed counting"
    - "Memory explosion with naive approaches"
  expert_rating: 4.2  # If available

tags:
  - "distributed-systems"
  - "performance"
  - "redis"
```
Metrics Framework
Core Metrics
| Metric | Type | Range | Description |
|---|---|---|---|
| Quality Score | Aggregate | 0-100 | Weighted sum of evaluation criteria |
| Confidence | Self-reported | 0-100% | Pattern's reported confidence in solution |
| Token Cost | Integer | 0-∞ | Total tokens consumed (input + output) |
| Execution Time | Duration | ms | Wall-clock time to solution |
| Correctness | Binary/Partial | 0-1 | Does solution actually work? |
| Completeness | Percentage | 0-100% | How much of the problem addressed? |
| Human Preference | Rank | 1-N | Human ranking among alternatives |
Quality Score Calculation
```python
def calculate_quality_score(solution, criteria):
    """
    Calculate weighted quality score from rubric evaluations.

    Args:
        solution: The solution being evaluated
        criteria: List of (criterion, weight, score) tuples

    Returns:
        Quality score 0-100
    """
    weighted_sum = 0
    total_weight = 0
    for criterion, weight, score in criteria:
        # Score is 1-5; normalize to 0-100 (1 -> 0, 5 -> 100), then weight
        normalized = (score - 1) * 25
        weighted_sum += normalized * weight
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else 0
```
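For concreteness, here is how the OPT-001 rubric weights from the template above feed into this calculation; the individual rubric scores are made up for illustration:

```python
# Illustrative scores against the OPT-001 rubric weights defined above
criteria = [
    ("Scalability",            0.30, 4),
    ("Fairness",               0.25, 5),
    ("Implementability",       0.25, 4),
    ("Operational Simplicity", 0.20, 3),
]
print(calculate_quality_score(solution=None, criteria=criteria))
# (3*25)*0.30 + (4*25)*0.25 + (3*25)*0.25 + (2*25)*0.20 = 76.25
```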
Efficiency Metrics
```python
from dataclasses import dataclass


@dataclass
class EfficiencyMetrics:
    tokens_per_quality_point: float   # Lower is better
    time_per_quality_point: float     # Lower is better
    quality_per_minute: float         # Higher is better

    @classmethod
    def calculate(cls, quality: float, tokens: int, time_ms: int):
        return cls(
            tokens_per_quality_point=tokens / max(quality, 1),
            time_per_quality_point=time_ms / max(quality, 1),
            quality_per_minute=(quality * 60000) / max(time_ms, 1)
        )
```
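A quick usage example, plugging in the BoT averages from the cost-quality analysis later in this document (quality 71.0, 12,200 tokens, 58,000 ms):

```python
m = EfficiencyMetrics.calculate(quality=71.0, tokens=12200, time_ms=58000)
# m.tokens_per_quality_point ~ 171.8
# m.time_per_quality_point   ~ 816.9 ms
# m.quality_per_minute       ~ 73.4
```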
Confidence Calibration
Track how well self-reported confidence predicts actual quality:
```python
from collections import defaultdict
from typing import List, Tuple


def calibration_score(predictions: List[Tuple[float, float]]) -> float:
    """
    Calculate calibration: does confidence predict quality?

    Args:
        predictions: List of (confidence, actual_quality) pairs

    Returns:
        Calibration score (-1 to 1, 1 is perfect), or None with insufficient data
    """
    if len(predictions) < 10:
        return None  # Insufficient data

    # Bin by confidence and compare to actual quality
    bins = defaultdict(list)
    for conf, quality in predictions:
        bin_key = int(conf // 10) * 10  # 0-10, 10-20, etc.
        bins[bin_key].append(quality)

    errors = []
    for bin_start, qualities in bins.items():
        expected = bin_start + 5  # Center of bin
        actual = sum(qualities) / len(qualities)
        errors.append(abs(expected - actual))

    # Average error, inverted and normalized
    avg_error = sum(errors) / len(errors) if errors else 50
    return 1 - (avg_error / 50)  # 0 error -> 1, 50 error -> 0
```
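Two synthetic sanity checks of this scoring behavior (the values are invented for illustration): perfectly calibrated pairs score 1.0, and systematic overconfidence scores markedly lower:

```python
# Confidence sits at each bin center and matches delivered quality exactly
perfect = [(c, c) for c in range(5, 100, 10)]
print(calibration_score(perfect))        # 1.0

# Reports ~85% confidence but delivers ~55 quality
overconfident = [(85.0, 55.0)] * 10
print(calibration_score(overconfident))  # 1 - 30/50 = 0.4
```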
A/B/C Testing Protocol
Experimental Design
```yaml
experiment:
  id: "EXP-2024-001"
  hypothesis: "ToT outperforms BoT on optimization problems"

  conditions:
    - name: "baseline"
      pattern: "direct_analysis"
      description: "No specialized pattern"

    - name: "condition_a"
      pattern: "tree_of_thoughts"
      description: "ToT with default parameters"

    - name: "condition_b"
      pattern: "breadth_of_thought"
      description: "BoT with default parameters"

    - name: "condition_c"
      pattern: "tree_of_thoughts"
      parameters:
        max_branches: 5
        pruning_threshold: 0.6
      description: "ToT with aggressive pruning"

  problem_set:
    domain: "optimization"
    difficulties: ["medium", "hard"]
    sample_size: 30  # Problems per condition

  randomization:
    seed: 42
    counterbalancing: true  # Vary problem order

  controls:
    temperature: 0.7     # Fixed across conditions
    max_tokens: 8000     # Fixed across conditions
    time_limit: 300000   # 5 minutes per problem
```
Statistical Requirements
Sample Size Calculation
```python
import numpy as np
from scipy import stats


def required_sample_size(
    effect_size: float = 0.5,   # Cohen's d
    alpha: float = 0.05,        # Significance level
    power: float = 0.80         # Statistical power
) -> int:
    """
    Calculate minimum sample size per condition for a meaningful comparison.

    For quality score comparisons (continuous 0-100):
    - Small effect (d=0.2): n=393 per condition
    - Medium effect (d=0.5): n=64 per condition
    - Large effect (d=0.8): n=26 per condition
    """
    # Two-tailed z-approximation to the two-sample t-test
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return int(np.ceil(n))
```
Minimum Requirements
| Comparison Type | Minimum N | Statistical Test |
|---|---|---|
| Two patterns | 30/condition | Independent t-test |
| Multiple patterns | 30/condition | ANOVA + Tukey HSD |
| Paired (same problem) | 20 problems | Paired t-test |
| Win/loss record | 50 comparisons | Sign test |
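A sketch of how the tests named above could be run with SciPy and statsmodels; the library choice and the synthetic quality scores are assumptions for illustration, not part of the framework:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic per-problem quality scores for three conditions (illustrative)
rng = np.random.default_rng(42)
quality_a = rng.normal(70, 10, 30)   # e.g., ToT
quality_b = rng.normal(65, 10, 30)   # e.g., BoT
quality_c = rng.normal(58, 10, 30)   # e.g., Direct

# Two patterns: independent t-test
t_stat, p_two = stats.ttest_ind(quality_a, quality_b)

# Multiple patterns: one-way ANOVA, then Tukey HSD post-hoc
f_stat, p_anova = stats.f_oneway(quality_a, quality_b, quality_c)
tukey = pairwise_tukeyhsd(
    endog=np.concatenate([quality_a, quality_b, quality_c]),
    groups=["ToT"] * 30 + ["BoT"] * 30 + ["Direct"] * 30,
    alpha=0.05,
)

# Paired design (same problems, two patterns): paired t-test
t_paired, p_paired = stats.ttest_rel(quality_a, quality_b)

# Win/loss record: sign test as a binomial test on wins (ties dropped)
wins, losses = 34, 16
p_sign = stats.binomtest(wins, n=wins + losses, p=0.5).pvalue
```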
Confound Control
```yaml
confound_controls:
  # Problem-level controls
  problem_randomization:
    enabled: true
    seed_per_experiment: true

  # Order effects
  counterbalancing:
    method: "latin_square"
    wash_out_period: true  # Clear context between conditions

  # LLM variability
  temperature_control:
    fixed_temperature: 0.7
    multiple_runs: 3  # Run each problem 3x, average

  # Time effects
  session_controls:
    max_problems_per_session: 10
    break_between_conditions: true

  # Evaluator bias
  blind_evaluation:
    enabled: true
    solutions_anonymized: true
    random_order: true
```
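One way to generate the `latin_square` condition orderings referenced above; the helper name is hypothetical, but the cyclic construction guarantees every condition appears in every position equally often:

```python
from typing import List


def latin_square_orders(conditions: List[str]) -> List[List[str]]:
    """Cyclic Latin square: row i runs condition (i + j) % n in position j."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


orders = latin_square_orders(["baseline", "condition_a", "condition_b", "condition_c"])
# orders[0] = ['baseline', 'condition_a', 'condition_b', 'condition_c']
# orders[1] = ['condition_a', 'condition_b', 'condition_c', 'baseline'] ...
# Assign problem blocks to rows round-robin so positions stay balanced.
```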
Running an A/B/C Test
```python
import random
import time


async def run_abc_test(experiment_config: dict) -> ExperimentResults:
    """
    Execute an A/B/C test according to the protocol.
    """
    results = ExperimentResults(experiment_id=experiment_config['id'])

    # Load problem set
    problems = load_problems(
        domain=experiment_config['problem_set']['domain'],
        difficulties=experiment_config['problem_set']['difficulties']
    )

    # Randomize
    random.seed(experiment_config['randomization']['seed'])
    random.shuffle(problems)

    # Sample required number
    problems = problems[:experiment_config['problem_set']['sample_size']]

    for problem in problems:
        problem_results = {}

        for condition in experiment_config['conditions']:
            # Clear context (wash-out)
            await clear_context()

            # Run pattern
            start_time = time.time()
            solution = await run_pattern(
                pattern=condition['pattern'],
                parameters=condition.get('parameters', {}),
                problem=problem,
                controls=experiment_config['controls']
            )
            execution_time = time.time() - start_time

            # Collect metrics
            problem_results[condition['name']] = {
                'solution': solution,
                'quality_score': evaluate_quality(solution, problem),
                'confidence': solution.confidence,
                'tokens': solution.token_count,
                'execution_time': execution_time,
                'correctness': verify_correctness(solution, problem)
            }

        results.add_problem_results(problem.id, problem_results)

    # Statistical analysis
    results.analyze()
    return results
```
Problem Categories
- Optimization Problems (ToT Territory)
Characteristics: Single best solution, prunable search space, clear evaluation criteria.
```yaml
example_problems:
  - id: "OPT-001"
    title: "Database Query Optimization"
    type: "performance"
  - id: "OPT-002"
    title: "Memory Allocation Strategy"
    type: "resource"
  - id: "OPT-003"
    title: "API Response Caching"
    type: "architecture"

expected_pattern_performance:
  tree_of_thoughts: "primary"
  breadth_of_thought: "secondary"
  direct_analysis: "baseline"
```
- Exploration Problems (BoT Territory)
Characteristics: Multiple valid solutions, unknown solution space, need diversity.
```yaml
example_problems:
  - id: "EXP-001"
    title: "Architecture Options for New Service"
    type: "greenfield"
  - id: "EXP-002"
    title: "Possible Causes of Intermittent Bug"
    type: "diagnostic"
  - id: "EXP-003"
    title: "Migration Strategy Alternatives"
    type: "strategic"

expected_pattern_performance:
  breadth_of_thought: "primary"
  tree_of_thoughts: "secondary"
  direct_analysis: "baseline"
```
- Diagnosis Problems (HE Territory)
Characteristics: Information uncertainty, need for multiple hypotheses, testing required.
```yaml
example_problems:
  - id: "DIA-001"
    title: "Production Latency Spike Investigation"
    type: "performance"
  - id: "DIA-002"
    title: "Data Inconsistency Root Cause"
    type: "data"
  - id: "DIA-003"
    title: "Memory Leak Identification"
    type: "resource"

expected_pattern_performance:
  hypothesis_engine: "primary"
  self_reflecting_chain: "secondary"
  direct_analysis: "baseline"
```
- Security Problems (AR Territory)
Characteristics: Adversarial thinking, attack vectors, defense in depth.
```yaml
example_problems:
  - id: "SEC-001"
    title: "Authentication Flow Security Review"
    type: "authentication"
  - id: "SEC-002"
    title: "API Endpoint Vulnerability Assessment"
    type: "api_security"
  - id: "SEC-003"
    title: "Data Encryption Strategy"
    type: "data_protection"

expected_pattern_performance:
  adversarial_reasoning: "primary"
  hypothesis_engine: "secondary"
  direct_analysis: "baseline"
```
- Trade-off Problems (DR Territory)
Characteristics: Competing values, no perfect solution, stakeholder tensions.
```yaml
example_problems:
  - id: "TRD-001"
    title: "Consistency vs. Availability Trade-off"
    type: "cap_theorem"
  - id: "TRD-002"
    title: "Technical Debt vs. Feature Velocity"
    type: "strategic"
  - id: "TRD-003"
    title: "Privacy vs. Personalization"
    type: "product"

expected_pattern_performance:
  dialectical_reasoning: "primary"
  negotiated_decision: "secondary"
  direct_analysis: "baseline"
```
- Novel Problems (AT Territory)
Characteristics: No established approaches, requires creativity, analogical thinking.
```yaml
example_problems:
  - id: "NOV-001"
    title: "AI-Native UX Paradigm"
    type: "design"
  - id: "NOV-002"
    title: "Cross-Domain Integration Pattern"
    type: "architecture"
  - id: "NOV-003"
    title: "Emergent Technology Application"
    type: "strategic"

expected_pattern_performance:
  analogical_transfer: "primary"
  breadth_of_thought: "secondary"
  direct_analysis: "baseline"
```
- Time-Critical Problems (RTR Territory)
Characteristics: Minutes not hours, good enough now, actionable immediately.
```yaml
example_problems:
  - id: "RTR-001"
    title: "Production Incident Triage"
    type: "incident"
  - id: "RTR-002"
    title: "Deadline-Driven Decision"
    type: "strategic"
  - id: "RTR-003"
    title: "Security Breach Response"
    type: "security"

expected_pattern_performance:
  rapid_triage_reasoning: "primary"
  direct_analysis: "secondary"
  tree_of_thoughts: "too_slow"
```
- Stakeholder Problems (NDF Territory)
Characteristics: Multiple parties, competing interests, need consensus.
```yaml
example_problems:
  - id: "STK-001"
    title: "Cross-Team Resource Allocation"
    type: "organizational"
  - id: "STK-002"
    title: "Feature Prioritization Conflict"
    type: "product"
  - id: "STK-003"
    title: "Architecture Decision with Team Buy-in"
    type: "technical"

expected_pattern_performance:
  negotiated_decision: "primary"
  dialectical_reasoning: "secondary"
  direct_analysis: "baseline"
```
Benchmark Reporting
Pattern Performance Matrix
Experiment: EXP-2024-001. Problems: 240 (30 per category).

| Domain | ToT | BoT | SRC | HE | AR | DR | Direct |
|---|---|---|---|---|---|---|---|
| Optimization | 82.3* | 71.2 | 68.4 | 65.1 | 62.3 | 64.8 | 58.2 |
| Exploration | 69.1 | 84.7* | 72.3 | 74.2 | 68.9 | 71.5 | 55.3 |
| Diagnosis | 71.4 | 73.8 | 78.2 | 85.6* | 72.1 | 69.4 | 61.7 |
| Security | 68.9 | 71.2 | 74.5 | 76.3 | 86.2* | 70.8 | 57.9 |
| Trade-offs | 65.4 | 72.1 | 70.3 | 68.7 | 71.2 | 83.9* | 54.6 |
| Novel | 62.8 | 78.4 | 71.6 | 69.2 | 65.7 | 72.3 | 51.2 |
| Time-critical | 48.2 | 45.6 | 52.3 | 55.1 | 51.8 | 49.7 | 62.4 |
| Stakeholder | 64.7 | 71.3 | 68.9 | 70.2 | 69.5 | 78.4 | 52.8 |
| OVERALL | 66.6 | 71.0 | 69.6 | 70.6 | 68.5 | 70.1 | 56.8 |

Asterisk (*) = best in category (p < 0.05). Quality scores normalized 0-100.
Win/Loss Records
Head-to-head records (wins-losses-ties) for each row pattern against each column pattern:

| vs | ToT | BoT | SRC | HE | Direct |
|---|---|---|---|---|---|
| ToT | - | 98-127-15 | 112-108-20 | 105-118-17 | 178-52-10 |
| BoT | 127-98-15 | - | 124-103-13 | 119-108-13 | 189-42-9 |
| SRC | 108-112-20 | 103-124-13 | - | 115-112-13 | 175-58-7 |
| HE | 118-105-17 | 108-119-13 | 112-115-13 | - | 182-48-10 |
| Direct | 52-178-10 | 42-189-9 | 58-175-7 | 48-182-10 | - |

Statistical significance: chi-square test, p < 0.05 for every pattern vs. Direct.
Cost-Quality Trade-offs
| Pattern | Avg Quality | Avg Tokens | Tokens/Point | ROI vs Direct |
|---|---|---|---|---|
| Direct | 56.8 | 2,100 | 37.0 | 1.00x |
| ToT | 66.6 | 8,500 | 127.6 | 0.29x |
| BoT | 71.0 | 12,200 | 171.8 | 0.22x |
| SRC | 69.6 | 6,400 | 92.0 | 0.40x |
| HE | 70.6 | 7,800 | 110.5 | 0.34x |
| AR | 68.5 | 9,100 | 132.8 | 0.28x |
| DR | 70.1 | 8,900 | 127.0 | 0.29x |

ROI = quality delivered per token, relative to Direct. Higher ROI means a more efficient pattern.

Recommendation: SRC offers the best efficiency for moderate quality gains; BoT and HE are justified when the problem fits their specialty.
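How the Tokens/Point and ROI columns are derived, using the Direct and ToT rows as a worked example (small differences come from rounding):

```python
def tokens_per_point(avg_tokens: float, avg_quality: float) -> float:
    return avg_tokens / avg_quality


def roi_vs_direct(pattern_tpp: float, direct_tpp: float) -> float:
    # Quality obtained per token, relative to Direct (Direct = 1.00x)
    return direct_tpp / pattern_tpp


direct_tpp = tokens_per_point(2100, 56.8)           # ~37.0
tot_tpp = tokens_per_point(8500, 66.6)              # ~127.6
print(round(roi_vs_direct(tot_tpp, direct_tpp), 2))  # ~0.29
```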
Detailed Report Template
Benchmark Report: [Experiment ID]
Executive Summary
- Best Overall Pattern: [Pattern] (avg quality: X)
- Most Efficient Pattern: [Pattern] (tokens/point: X)
- Biggest Surprise: [Finding that contradicted expectations]
- Key Recommendation: [Actionable insight]
Methodology
- Problems tested: [N]
- Patterns compared: [List]
- Statistical tests: [Tests used]
- Significance level: α = 0.05
Results by Category
[Category 1]
| Pattern | Quality | Tokens | Time | Sig |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
Winner: [Pattern] (p = X)
Analysis: [Why this pattern excelled]
[Repeat for each category]
Statistical Analysis
ANOVA Results
- F-statistic: X
- p-value: X
- Effect size (η²): X
Post-hoc Comparisons (Tukey HSD)
[Significant pairwise differences]
Limitations
- [Sample size constraints]
- [Problem selection bias]
- [Evaluator variability]
Recommendations
- [Primary recommendation]
- [Secondary recommendation]
- [Areas for further research]
Appendix: Raw Data
[Link to detailed results]
ChromaDB Integration
Schema Definition
```python
# Benchmark results collection
BENCHMARK_SCHEMA = {
    "collection_name": "cognitive_benchmarks",
    "embedding_function": "sentence-transformers/all-MiniLM-L6-v2",
    "metadata_schema": {
        "experiment_id": "string",
        "problem_id": "string",
        "pattern": "string",
        "domain": "string",
        "difficulty": "string",
        "quality_score": "float",
        "confidence": "float",
        "tokens": "int",
        "execution_time_ms": "int",
        "correctness": "float",
        "timestamp": "datetime",
        "session_id": "string"
    }
}
```
Storage Operations
import chromadb
from datetime import datetime
from typing import List, Dict, Any


class BenchmarkStore:
    def __init__(self, persist_directory: str = "./benchmark_db"):
        self.client = chromadb.PersistentClient(path=persist_directory)
        self.collection = self.client.get_or_create_collection(
            name="cognitive_benchmarks",
            metadata={"description": "Cognitive pattern benchmark results"}
        )
def store_result(self, result: Dict[str, Any]) -> str:
"""Store a single benchmark result."""
doc_id = f"{result['experiment_id']}_{result['problem_id']}_{result['pattern']}"
# Create searchable document text
document = f"""
Pattern: {result['pattern']}
Domain: {result['domain']}
Problem: {result['problem_id']}
Quality: {result['quality_score']}
Solution summary: {result.get('solution_summary', '')}
"""
self.collection.upsert(
ids=[doc_id],
documents=[document],
metadatas=[{
"experiment_id": result['experiment_id'],
"problem_id": result['problem_id'],
"pattern": result['pattern'],
"domain": result['domain'],
"difficulty": result['difficulty'],
"quality_score": result['quality_score'],
"confidence": result['confidence'],
"tokens": result['tokens'],
"execution_time_ms": result['execution_time_ms'],
"correctness": result['correctness'],
"timestamp": datetime.now().isoformat(),
"session_id": result.get('session_id', 'default')
}]
)
return doc_id
def store_experiment(self, experiment_results: List[Dict[str, Any]]) -> int:
"""Store all results from an experiment."""
count = 0
for result in experiment_results:
self.store_result(result)
count += 1
return count
def query_pattern_performance(
self,
pattern: str,
domain: str = None,
limit: int = 100
) -> List[Dict[str, Any]]:
"""Query historical performance for a pattern."""
where_clause = {"pattern": pattern}
if domain:
where_clause = {
"$and": [
{"pattern": pattern},
{"domain": domain}
]
}
results = self.collection.query(
query_texts=[f"pattern {pattern} performance"],
where=where_clause,
n_results=limit
)
return results['metadatas'][0] if results['metadatas'] else []
def get_improvement_trend(
self,
pattern: str,
domain: str,
window_days: int = 30
) -> Dict[str, float]:
"""Calculate performance trend over time."""
results = self.query_pattern_performance(pattern, domain, limit=1000)
if len(results) < 10:
return {"trend": None, "insufficient_data": True}
# Sort by timestamp
sorted_results = sorted(results, key=lambda x: x['timestamp'])
# Split into early and recent
midpoint = len(sorted_results) // 2
early_avg = sum(r['quality_score'] for r in sorted_results[:midpoint]) / midpoint
recent_avg = sum(r['quality_score'] for r in sorted_results[midpoint:]) / (len(sorted_results) - midpoint)
return {
"early_average": early_avg,
"recent_average": recent_avg,
"improvement": recent_avg - early_avg,
"improvement_pct": ((recent_avg - early_avg) / early_avg) * 100,
"sample_size": len(sorted_results)
}
def compare_sessions(
self,
session_a: str,
session_b: str
) -> Dict[str, Any]:
"""Compare performance across sessions."""
results_a = self.collection.get(
where={"session_id": session_a}
)
results_b = self.collection.get(
where={"session_id": session_b}
)
# Aggregate by pattern
def aggregate(results):
by_pattern = {}
for meta in results['metadatas']:
pattern = meta['pattern']
if pattern not in by_pattern:
by_pattern[pattern] = []
by_pattern[pattern].append(meta['quality_score'])
return {p: sum(s)/len(s) for p, s in by_pattern.items()}
agg_a = aggregate(results_a)
agg_b = aggregate(results_b)
comparison = {}
for pattern in set(agg_a.keys()) | set(agg_b.keys()):
comparison[pattern] = {
"session_a": agg_a.get(pattern, None),
"session_b": agg_b.get(pattern, None),
"difference": (agg_b.get(pattern, 0) - agg_a.get(pattern, 0))
if pattern in agg_a and pattern in agg_b else None
}
return comparison
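A usage sketch for the store above; the field values are illustrative and mirror the result template later in this document:

```python
store = BenchmarkStore(persist_directory="./benchmark_db")

store.store_result({
    "experiment_id": "EXP-2024-001",
    "problem_id": "OPT-001",
    "pattern": "tree_of_thoughts",
    "domain": "optimization",
    "difficulty": "medium",
    "quality_score": 82.5,
    "confidence": 85.0,
    "tokens": 5700,               # tokens_input + tokens_output
    "execution_time_ms": 45000,
    "correctness": 1.0,
    "solution_summary": "Token bucket with Redis MULTI/EXEC",
    "session_id": "2024-01-15-a",
})

history = store.query_pattern_performance("tree_of_thoughts", domain="optimization")
trend = store.get_improvement_trend("tree_of_thoughts", "optimization")
```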
Query Patterns
Find best pattern for a domain
def recommend_pattern(store: BenchmarkStore, domain: str) -> str:
    patterns = ["ToT", "BoT", "SRC", "HE", "AR", "DR", "AT", "RTR", "NDF"]
best_pattern = None
best_score = 0
for pattern in patterns:
results = store.query_pattern_performance(pattern, domain)
if results:
avg_score = sum(r['quality_score'] for r in results) / len(results)
if avg_score > best_score:
best_score = avg_score
best_pattern = pattern
return best_pattern
Track calibration over time
def calibration_trend(store: BenchmarkStore, pattern: str) -> Dict:
    results = store.query_pattern_performance(pattern, limit=500)
predictions = [(r['confidence'], r['quality_score']) for r in results]
# Calculate calibration for different time periods
# ... (implementation as before)
return {"pattern": pattern, "calibration_history": [...]}
Templates
Running a Benchmark
```yaml
# benchmark-run-template.yaml
benchmark_run:
  experiment_id: "EXP-YYYY-NNN"
  date: "YYYY-MM-DD"
  operator: "[Name]"

  configuration:
    patterns_under_test:
      - name: "tree_of_thoughts"
        version: "1.0"
        parameters: {}

    baseline:
      name: "direct_analysis"

    problem_set:
      source: "benchmark-problems/optimization/"
      selection: "random"
      count: 30

    controls:
      temperature: 0.7
      max_tokens: 8000
      runs_per_problem: 3

  execution_checklist:
    - "[ ] Problem set loaded and verified"
    - "[ ] Patterns configured correctly"
    - "[ ] ChromaDB connection verified"
    - "[ ] Evaluation rubrics reviewed"
    - "[ ] Blind evaluation setup complete"

  notes: |
    [Pre-run observations and expectations]
```
Recording Results
```yaml
# result-template.yaml
result:
  problem_id: "OPT-001"
  pattern: "tree_of_thoughts"
  run_number: 1

  solution:
    summary: "[Brief description of solution approach]"
    full_response: "[Complete pattern output]"

  raw_metrics:
    tokens_input: 2500
    tokens_output: 3200
    execution_time_ms: 45000
    self_reported_confidence: 85

  evaluation:
    evaluator: "[Name or 'blind']"
    criteria_scores:
      - criterion: "Scalability"
        score: 4
        justification: "[Why this score]"
      - criterion: "Implementability"
        score: 5
        justification: "[Why this score]"

    correctness_check:
      verified: true
      method: "[How verified - test, review, etc.]"
      issues_found: []

    calculated_quality: 82.5

  notes: |
    [Observations about this run]
```
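A sketch of mapping a filled-in result file onto the flat record that `BenchmarkStore.store_result` expects; the file path, and the experiment/domain fields pulled from the run configuration and problem definition, are assumptions for illustration:

```python
import yaml

# Hypothetical path to a filled-in result file
with open("results/OPT-001_tree_of_thoughts_run1.yaml") as f:
    r = yaml.safe_load(f)["result"]

record = {
    "experiment_id": "EXP-2024-001",          # from the run configuration
    "problem_id": r["problem_id"],
    "pattern": r["pattern"],
    "domain": "optimization",                 # from the problem definition
    "difficulty": "medium",
    "quality_score": r["evaluation"]["calculated_quality"],
    "confidence": r["raw_metrics"]["self_reported_confidence"],
    "tokens": r["raw_metrics"]["tokens_input"] + r["raw_metrics"]["tokens_output"],
    "execution_time_ms": r["raw_metrics"]["execution_time_ms"],
    "correctness": 1.0 if r["evaluation"]["correctness_check"]["verified"] else 0.0,
    "solution_summary": r["solution"]["summary"],
}
# store.store_result(record)
```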
Experiment Summary
```yaml
# experiment-summary-template.yaml
experiment_summary:
  experiment_id: "EXP-YYYY-NNN"
  status: "completed"
  completion_date: "YYYY-MM-DD"

  overview:
    total_problems: 90
    total_runs: 270
    patterns_tested: 3

  aggregate_results:
    by_pattern:
      tree_of_thoughts:
        mean_quality: 82.3
        std_quality: 8.7
        mean_tokens: 5800
        mean_time_ms: 42000
      breadth_of_thought:
        mean_quality: 71.2
        std_quality: 12.4
        mean_tokens: 8200
        mean_time_ms: 58000
      direct_analysis:
        mean_quality: 58.2
        std_quality: 15.1
        mean_tokens: 2100
        mean_time_ms: 18000

  statistical_tests:
    anova:
      f_statistic: 24.7
      p_value: 0.00001
      significant: true

    pairwise:
      - comparison: "ToT vs Direct"
        difference: 24.1
        p_value: 0.00001
        significant: true
      - comparison: "ToT vs BoT"
        difference: 11.1
        p_value: 0.003
        significant: true

  conclusions:
    winner: "tree_of_thoughts"
    key_finding: |
      ToT significantly outperformed both BoT and Direct analysis on
      optimization problems, with a large effect size (d=1.2).
    caveats:
      - "Limited to optimization domain"
      - "Single difficulty level tested"
    recommendations:
      - "Use ToT as default for optimization problems"
      - "Extend testing to other domains"
      - "Investigate ToT parameter sensitivity"

  chromadb_stored: true
  storage_ids: ["EXP-YYYY-NNN_*"]
```
Best Practices
Problem Design
- Clear evaluation criteria: Every problem must have explicit rubrics
- Known baselines: Include problems with known good solutions
- Difficulty calibration: Pre-test problems to verify difficulty ratings
- Domain balance: Equal representation across categories
Test Execution
- Randomization: Always randomize problem order
- Wash-out periods: Clear context between conditions
- Multiple runs: Run each problem 3x minimum for reliability
- Blind evaluation: Anonymize solutions before scoring
Analysis
- Statistical rigor: Report confidence intervals, not just means
- Effect sizes: Report Cohen's d alongside p-values
- Practical significance: Is the difference meaningful?
- Honest limitations: Acknowledge constraints
Reporting
- Reproducibility: Include all configuration details
- Raw data availability: Store complete results
- Negative results: Report when hypotheses fail
- Actionable recommendations: What should change?
Appendix: Statistical Reference
Effect Size Interpretation (Cohen's d)
| d | Interpretation |
|---|---|
| 0.2 | Small |
| 0.5 | Medium |
| 0.8 | Large |
| 1.2 | Very large |
Sample Size Requirements
| Desired Power | Small Effect | Medium Effect | Large Effect |
|---|---|---|---|
| 0.80 | 393 | 64 | 26 |
| 0.90 | 526 | 86 | 34 |
| 0.95 | 651 | 105 | 42 |
Multiple Comparison Correction
When comparing N patterns:
- Bonferroni: α_adjusted = 0.05 / (N(N-1)/2)
- For 6 patterns: α_adjusted = 0.05 / 15 ≈ 0.0033
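The same arithmetic made explicit in code:

```python
def bonferroni_alpha(n_patterns: int, alpha: float = 0.05) -> float:
    """Adjusted alpha for all pairwise comparisons among n_patterns."""
    n_comparisons = n_patterns * (n_patterns - 1) // 2
    return alpha / n_comparisons


print(bonferroni_alpha(6))   # 0.05 / 15 = 0.00333...
```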
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2024-01-XX | Initial framework |