# Backtest Expert

Systematic approach to backtesting trading strategies, based on professional methodology that prioritizes robustness over optimistic results.

## Core Philosophy

**Goal:** Find strategies that "break the least", not strategies that "profit the most" on paper.

**Principle:** Add friction, stress test assumptions, and see what survives. If a strategy holds up under pessimistic conditions, it's more likely to work in live trading.

## When to Use This Skill

Use this skill when:
- Developing or validating systematic trading strategies
- Evaluating whether a trading idea is robust enough for live implementation
- Troubleshooting why a backtest might be misleading
- Learning proper backtesting methodology
- Avoiding common pitfalls (curve-fitting, look-ahead bias, survivorship bias)
- Assessing parameter sensitivity and regime dependence
- Setting realistic expectations for slippage and execution costs
## Backtesting Workflow

### 1. State the Hypothesis

Define the edge in one sentence.

**Example:** "Stocks that gap up >3% on earnings and pull back to the previous day's close within the first hour provide a mean-reversion opportunity."

If you can't articulate the edge clearly, don't proceed to testing.
### 2. Codify Rules with Zero Discretion

Define with complete specificity:

- **Entry:** Exact conditions, timing, price type
- **Exit:** Stop loss, profit target, time-based exit
- **Position sizing:** Fixed dollar amount, % of portfolio, or volatility-adjusted
- **Filters:** Market cap, volume, sector, volatility conditions
- **Universe:** What instruments are eligible

**Critical:** No subjective judgment allowed. Every decision must be rule-based and unambiguous.
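The "zero discretion" requirement can be made concrete in code. The sketch below assumes a hypothetical gap-and-pullback strategy; every field name and threshold is illustrative, and the entry decision is a pure function of the rules plus market data, so the same inputs always produce the same answer:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StrategyRules:
    """Every rule is an explicit, machine-checkable field -- no discretion.
    All names and thresholds are illustrative, not recommendations."""
    entry_gap_pct: float        # enter only if the earnings gap-up exceeds this
    entry_window_minutes: int   # entries allowed only this long after the open
    stop_loss_pct: float        # hard stop as a fraction of entry price
    profit_target_pct: float    # fixed profit target
    max_hold_minutes: int       # time-based exit if neither stop nor target hits
    min_dollar_volume: float    # liquidity filter on the eligible universe

def entry_signal(rules: StrategyRules, gap_pct: float,
                 minutes_since_open: int, dollar_volume: float) -> bool:
    """Rule-based entry decision with no subjective inputs."""
    return (gap_pct > rules.entry_gap_pct
            and minutes_since_open <= rules.entry_window_minutes
            and dollar_volume >= rules.min_dollar_volume)

rules = StrategyRules(entry_gap_pct=3.0, entry_window_minutes=60,
                      stop_loss_pct=0.02, profit_target_pct=0.03,
                      max_hold_minutes=240, min_dollar_volume=5e6)
```

If any decision can't be written as a field or a pure function like this, it is discretion and doesn't belong in the backtest.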
### 3. Run Initial Backtest

Test over:

- Minimum 5 years (preferably 10+)
- Multiple market regimes (bull, bear, high/low volatility)
- Realistic costs: commissions + conservative slippage

Examine initial results for basic viability. If the strategy is fundamentally broken, iterate on the hypothesis.
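As a sketch of what "realistic costs" means in practice, the helper below charges round-trip commissions plus a conservative per-share slippage on both sides; the dollar figures are placeholder assumptions, not market data:

```python
def net_trade_pnl(entry_price, exit_price, shares,
                  commission_per_trade=1.0, slippage_per_share=0.02):
    """Gross P&L minus round-trip commissions and a conservative
    per-share slippage charge on both entry and exit."""
    gross = (exit_price - entry_price) * shares
    costs = 2 * commission_per_trade + 2 * slippage_per_share * shares
    return gross - costs

# A $0.50/share winner on 100 shares: $50 gross shrinks to $44 net.
net = net_trade_pnl(100.00, 100.50, 100)
```

Marginal trades that look profitable gross often flip negative once this friction is applied, which is exactly the point of applying it early.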
### 4. Stress Test the Strategy

This is where 80% of testing time should be spent.

**Parameter sensitivity:**

- Test stop loss at 50%, 75%, 100%, 125%, 150% of baseline
- Test profit target at 80%, 90%, 100%, 110%, 120% of baseline
- Vary entry/exit timing by ±15-30 minutes
- Look for "plateaus" of stable performance, not narrow spikes
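The plateau check itself can be automated. In this sketch, `evaluate` stands in for your own backtest wrapped as a function from a stop-loss value to a performance metric; the two lambdas are synthetic stand-ins for a robust edge and a curve-fit spike:

```python
def sweep(baseline_stop, multipliers, evaluate):
    """Run the backtest at each stop-loss multiple of baseline
    and collect the resulting metric (e.g. expectancy per trade)."""
    return {m: evaluate(baseline_stop * m) for m in multipliers}

def is_plateau(results, tolerance=0.5):
    """Stable if the worst setting keeps at least `tolerance` of the
    best setting's performance (metrics assumed non-negative)."""
    vals = list(results.values())
    return min(vals) >= tolerance * max(vals)

multipliers = [0.5, 0.75, 1.0, 1.25, 1.5]
# Synthetic metrics: one degrades gently off-baseline, one collapses.
robust = sweep(0.02, multipliers, lambda s: 0.10 - abs(s - 0.02))
fragile = sweep(0.02, multipliers, lambda s: max(0.0, 0.10 - 50 * abs(s - 0.02)))
```

Here `is_plateau(robust)` holds while `is_plateau(fragile)` fails, mirroring the stable-range vs narrow-spike distinction.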
**Execution friction:**

- Increase slippage to 1.5-2x typical estimates
- Model worst-case fills (buy at ask + 1 tick, sell at bid - 1 tick)
- Add realistic order rejection scenarios
- Test with pessimistic commission structures
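A minimal model of worst-case fills, assuming one tick of adverse movement on every execution:

```python
def worst_case_fill(side, bid, ask, tick=0.01):
    """Pessimistic execution: buys lift the ask plus one tick,
    sells hit the bid minus one tick."""
    if side == "buy":
        return ask + tick
    return bid - tick

buy_px = worst_case_fill("buy", bid=99.98, ask=100.00)
sell_px = worst_case_fill("sell", bid=99.98, ask=100.00)
```

Re-running the backtest with only these fills is a cheap way to see how much of the apparent edge was an artifact of optimistic execution.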
**Time robustness:**

- Analyze year-by-year performance
- Require positive expectancy in the majority of years
- Ensure the strategy doesn't rely on 1-2 exceptional periods
- Test in different market regimes separately
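These checks can be scripted. The sketch below uses fabricated toy data and an illustrative rule: a majority of years must show positive expectancy, and no single year may supply more than half the total profit:

```python
def passes_time_robustness(trades, min_positive_fraction=0.6):
    """trades: list of (year, pnl). Pass only if a majority of years have
    positive expectancy AND no single year supplies over half the profit."""
    totals, counts = {}, {}
    for year, pnl in trades:
        totals[year] = totals.get(year, 0.0) + pnl
        counts[year] = counts.get(year, 0) + 1
    positive_years = sum(1 for y in totals if totals[y] / counts[y] > 0)
    if positive_years / len(totals) < min_positive_fraction:
        return False
    grand = sum(totals.values())
    return grand > 0 and max(totals.values()) <= 0.5 * grand

# Fabricated toy data: a steady edge vs. one lucky year carrying everything.
steady = [(2019, 50), (2019, 50), (2020, 60), (2020, 40), (2021, -10),
          (2021, -10), (2022, 80), (2022, 20), (2023, 30), (2023, 70)]
lucky = [(2019, 5), (2020, 5), (2021, 500), (2022, 5), (2023, 5)]
```

The `steady` series passes despite one losing year; the `lucky` series fails because 2021 dominates the total, exactly the 1-2 exceptional-periods failure mode.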
**Sample size:**

- Absolute minimum: 30 trades
- Preferred: 100+ trades
- High confidence: 200+ trades
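The reason 30 trades is only an absolute floor: the standard error of an observed win rate shrinks with the square root of the trade count, as this quick calculation shows:

```python
import math

def win_rate_stderr(wins, trades):
    """Standard error of an observed win rate: sqrt(p * (1 - p) / n)."""
    p = wins / trades
    return math.sqrt(p * (1 - p) / trades)

# 60% observed wins: at 30 trades the 95% interval is roughly +/- 17
# percentage points (still consistent with a coin flip); at 200 trades
# it tightens to about +/- 7 points.
se_small = win_rate_stderr(18, 30)
se_large = win_rate_stderr(120, 200)
```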
### 5. Out-of-Sample Validation

**Walk-forward analysis:**

- Optimize on a training period (e.g., years 1-3)
- Test on a validation period (year 4)
- Roll forward and repeat
- Compare in-sample vs out-of-sample performance
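The walk-forward loop can be sketched generically. Here `optimize` and `evaluate` stand in for your own training and scoring routines, and `history` is synthetic data, so the fold structure is the only thing being demonstrated:

```python
def walk_forward(years, train_len=3, test_len=1, optimize=None, evaluate=None):
    """Roll a training window forward; optimize ONLY on the training slice,
    then score the chosen parameters on the slice that follows it."""
    folds = []
    for start in range(len(years) - train_len - test_len + 1):
        train = years[start:start + train_len]
        test = years[start + train_len:start + train_len + test_len]
        params = optimize(train)
        folds.append((test[0], params, evaluate(params, test)))
    return folds

# Synthetic "history": a slowly drifting best parameter per year.
history = {y: 0.02 + 0.001 * i for i, y in enumerate(range(2015, 2023))}
years = sorted(history)

def optimize(train):
    # Stand-in: pick the average in-sample best stop-loss.
    return sum(history[y] for y in train) / len(train)

def evaluate(params, test):
    # Stand-in: score decays as the trained parameter drifts from the test year.
    return 1.0 - abs(params - history[test[0]])

folds = walk_forward(years, optimize=optimize, evaluate=evaluate)
```

The key property is that each fold's test year never appears in the slice its parameters were fitted on, which is what separates validation from curve-fitting.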
**Warning signs:**

- Out-of-sample performance below 50% of in-sample
- Frequent parameter re-optimization is needed
- Parameters change dramatically between periods
### 6. Evaluate Results

**Questions to answer:**

- Does the edge survive pessimistic assumptions?
- Is performance stable across parameter variations?
- Does the strategy work in multiple market regimes?
- Is the sample size sufficient for statistical confidence?
- Are results realistic, not "too good to be true"?

**Decision criteria:**

- ✅ **Deploy:** Survives all stress tests with acceptable performance
- 🔄 **Refine:** Core logic is sound but needs parameter adjustment
- ❌ **Abandon:** Fails stress tests or relies on fragile assumptions
## Key Testing Principles

### Punish the Strategy

Add friction everywhere:

- Commissions higher than reality
- Slippage 1.5-2x typical
- Worst-case fills
- Order rejections
- Partial fills

**Rationale:** Strategies that survive pessimistic assumptions often outperform in live trading.
### Seek Plateaus, Not Peaks

Look for parameter ranges where performance is stable, not optimal values that create performance spikes.

**Good:** Strategy is profitable with a stop loss anywhere from 1.5% to 3.0%
**Bad:** Strategy only works with a stop loss at exactly 2.13%

Stable performance indicates genuine edge; narrow optima suggest curve-fitting.
### Test All Cases, Not Cherry-Picked Examples

**Wrong approach:** Study hand-picked "market leaders" that worked
**Right approach:** Test every stock that met the criteria, including those that failed

Selective examples create survivorship bias and overestimate strategy quality.

### Separate Idea Generation from Validation

**Intuition:** Useful for generating hypotheses
**Validation:** Must be purely data-driven

Never let attachment to an idea influence interpretation of test results.
## Common Failure Patterns

Recognize these patterns early to save time:

- **Parameter sensitivity:** Only works with exact parameter values
- **Regime-specific:** Great in some years, terrible in others
- **Slippage sensitivity:** Unprofitable when realistic costs are added
- **Small sample:** Too few trades for statistical confidence
- **Look-ahead bias:** "Too good to be true" results
- **Over-optimization:** Many parameters, poor out-of-sample results

See `references/failed_tests.md` for detailed examples and a diagnostic framework.
## Available Reference Documentation

### Methodology Reference

**File:** `references/methodology.md`

**When to read:** For detailed guidance on specific testing techniques.

**Contents:**

- Stress testing methods
- Parameter sensitivity analysis
- Slippage and friction modeling
- Sample size requirements
- Market regime classification
- Common biases and pitfalls (survivorship, look-ahead, curve-fitting, etc.)

### Failed Tests Reference

**File:** `references/failed_tests.md`

**When to read:** When a strategy fails tests, or when learning from past mistakes.

**Contents:**

- Why failures are valuable
- Common failure patterns with examples
- Case study documentation framework
- Red flags checklist for evaluating backtests
## Critical Reminders

**Time allocation:** Spend 20% of your time generating ideas and 80% trying to break them.

**Context-free requirement:** If a strategy requires "perfect context" to work, it's not robust enough for systematic trading.

**Red flag:** If backtest results look too good (>90% win rate, minimal drawdowns, perfect timing), audit carefully for look-ahead bias or data issues.

**Tool limitations:** Understand your backtesting platform's quirks (interpolation methods, handling of low liquidity, data alignment issues).

**Statistical significance:** Small edges require large sample sizes to prove. A 5% edge per trade needs 100+ trades to distinguish from luck.
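That sample-size rule of thumb follows from a back-of-the-envelope calculation; the 25% per-trade return standard deviation used below is an assumption for illustration, not a measured figure:

```python
import math

def trades_needed(edge, stdev, z=2.0):
    """Trades required for the mean P&L to sit z standard errors above
    zero: solve edge >= z * stdev / sqrt(n), i.e. n >= (z * stdev / edge)**2."""
    return math.ceil((z * stdev / edge) ** 2)

n = trades_needed(edge=0.05, stdev=0.25)   # 100 trades for a 5% edge
```

Note the square: halving the edge quadruples the required sample, which is why small edges demand such long test histories.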
## Discretionary vs Systematic Differences

This skill focuses on systematic/quantitative backtesting where:

- All rules are codified in advance
- No discretion or "feel" in execution
- Testing happens on all historical examples, not cherry-picked cases
- Context (news, macro) is deliberately stripped out

Discretionary traders study setups differently; this skill may not apply to setups requiring subjective judgment.