# Implement Paper From Scratch

The best way to truly understand a paper is to implement it. This skill guides you through that process methodically.
## Philosophy

- **No copy-pasting from reference implementations** - We build understanding, not just working code
- **Checkpoint questions verify understanding** - You should be able to answer "why" at each step
- **Minimal dependencies** - Use NumPy/PyTorch fundamentals, not high-level wrappers
- **Deliberate debugging** - Bugs are learning opportunities, not obstacles
## Process

### Phase 1: Pre-Implementation Analysis

Before writing any code:
1. **Identify the core algorithm** - Strip away ablations, extensions, bells and whistles. What's the minimal version?

2. **List the components** - Break into modules:
   - Data pipeline
   - Model architecture
   - Loss function(s)
   - Training loop
   - Evaluation metrics

3. **Find the tricky parts** - What's non-obvious?
   - Custom layers or operations
   - Numerical stability concerns
   - Hyperparameter sensitivity
   - Implementation details buried in appendices

4. **Gather reference numbers** - What should we expect?
   - Training loss trajectory
   - Validation metrics at convergence
   - Compute requirements (if stated)
### Phase 2: Scaffolded Implementation

Build up the implementation in this order:

#### Step 1: Data

```python
# Start with synthetic/toy data
# Verify shapes and types before touching real data
```

**Checkpoint:** Can you describe what each tensor represents and its expected shape?
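A minimal sketch of what that can look like, assuming a next-token-prediction task; every name and dimension here is illustrative, not taken from any particular paper:

```python
import numpy as np

# Toy data with explicit shape checks before any real data is loaded.
def make_toy_batch(batch_size=8, seq_len=16, vocab_size=100, seed=0):
    rng = np.random.default_rng(seed)
    tokens = rng.integers(0, vocab_size, size=(batch_size, seq_len))  # (B, T) int token ids
    targets = np.roll(tokens, -1, axis=1)                             # next-token targets, (B, T)
    return tokens, targets

tokens, targets = make_toy_batch()
assert tokens.shape == (8, 16) and tokens.dtype.kind == "i"
assert (targets[:, :-1] == tokens[:, 1:]).all()  # each target is the following token
```

The point is not the data itself but that you can state, in one line each, what every array means, its dtype, and its shape.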
#### Step 2: Model Architecture

```python
# Build layer by layer
# Print shapes at each stage
# Verify parameter counts match paper
```

**Checkpoint:** If you randomly initialize and do a forward pass, do the output shapes match what the paper describes?
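A sketch of the parameter-count and forward-shape check, using a small two-layer MLP as a stand-in (the dimensions are illustrative, not from any paper):

```python
import numpy as np

# Count parameters by hand and verify against the code's own count.
def init_mlp(d_in=32, d_hidden=64, d_out=10, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((d_in, d_hidden)) * 0.02, "b1": np.zeros(d_hidden),
        "W2": rng.standard_normal((d_hidden, d_out)) * 0.02, "b2": np.zeros(d_out),
    }

def forward(params, x):
    h = np.maximum(0, x @ params["W1"] + params["b1"])  # ReLU hidden layer, (B, d_hidden)
    return h @ params["W2"] + params["b2"]              # logits, (B, d_out)

params = init_mlp()
n_params = sum(p.size for p in params.values())
assert n_params == 32 * 64 + 64 + 64 * 10 + 10          # 2762 -- compare to the paper's count
out = forward(params, np.zeros((4, 32)))
assert out.shape == (4, 10)                             # (batch, classes) as expected
```

For a real paper, the same arithmetic done per layer usually catches off-by-one dimension bugs before any training happens.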
#### Step 3: Loss Function

```python
# Implement exactly as described
# Test with known inputs/outputs
# Check gradient flow
```

**Checkpoint:** Can you explain each term in the loss and why it's there?
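A sketch of both checks on a loss you can verify in advance, here cross-entropy (the helper name is illustrative; the max-subtraction is the standard numerical-stability trick):

```python
import numpy as np

def cross_entropy(logits, labels):
    z = logits - logits.max(axis=1, keepdims=True)               # stabilize exp
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True)) # log-softmax
    return -log_probs[np.arange(len(labels)), labels].mean()

# Known input/output: uniform logits over C classes must give exactly log(C).
C = 10
loss = cross_entropy(np.zeros((4, C)), np.array([0, 1, 2, 3]))
assert np.isclose(loss, np.log(C))

# Gradient check: central finite difference on one entry vs the analytic
# gradient of mean cross-entropy (softmax minus one-hot, over batch size).
logits = np.random.default_rng(0).standard_normal((2, C))
labels = np.array([3, 7])
eps = 1e-5
lp, lm = logits.copy(), logits.copy()
lp[0, 3] += eps
lm[0, 3] -= eps
num_grad = (cross_entropy(lp, labels) - cross_entropy(lm, labels)) / (2 * eps)
probs = np.exp(logits[0] - logits[0].max())
probs /= probs.sum()
assert np.isclose(num_grad, (probs[3] - 1.0) / 2, atol=1e-6)
```

The same finite-difference pattern works for any custom loss term in a paper, one parameter entry at a time.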
#### Step 4: Training Loop

```python
# Minimal loop first (no logging, checkpointing, etc.)
# Verify loss decreases on tiny overfit test
# Then add bells and whistles
```

**Checkpoint:** Can you overfit a single batch? If not, something is broken.
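The overfit test in miniature: one fixed batch, a model that can memorize it, and a loss that must fall toward zero. Linear regression stands in for the real model here; the principle is identical, and if even this fails the loop itself is suspect.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 4))
y = X @ rng.standard_normal(4)             # targets a linear model can fit exactly
W = np.zeros(4)
lr = 0.1
for _ in range(1000):
    grad = 2 * X.T @ (X @ W - y) / len(y)  # gradient of mean squared error
    W -= lr * grad
final_loss = np.mean((X @ W - y) ** 2)
assert final_loss < 1e-3                   # single-batch overfit succeeded
```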
#### Step 5: Evaluation

```python
# Implement paper's exact metrics
# Compare against reported numbers
```

**Checkpoint:** On the same data split, how close are you to the paper's numbers?
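Before comparing against the paper's numbers, verify the metric itself on a case small enough to check by hand (the function name is illustrative):

```python
import numpy as np

def top1_accuracy(logits, labels):
    return float((logits.argmax(axis=1) == labels).mean())

logits = np.array([[2.0, 0.1, 0.0],   # predicts class 0 -> correct
                   [0.0, 0.5, 3.0],   # predicts class 2 -> correct
                   [1.0, 0.9, 0.0]])  # predicts class 0, label is 1 -> wrong
labels = np.array([0, 2, 1])
assert np.isclose(top1_accuracy(logits, labels), 2 / 3)
```

A metric you have hand-verified turns any remaining gap to the paper into a model or data question, not a measurement question.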
### Phase 3: The Debugging Gauntlet

When it doesn't work (and it won't at first):
1. **The Overfit Test**
   - Can you memorize 1 example? 10? 100?
   - If not, suspect an architecture or gradient bug

2. **The Gradient Check**
   - Are gradients flowing to all parameters?
   - Any NaN or exploding gradients?

3. **The Initialization Check**
   - Match the paper's initialization exactly
   - This matters more than people think

4. **The Learning Rate Sweep**
   - Log scale: 1e-5 to 1e-1
   - Loss should decrease for some range

5. **The Ablation Debug**
   - Remove components until it works
   - Add them back one at a time
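The gradient check above can be sketched framework-agnostically: given a mapping from parameter name to gradient array (however your framework exposes it), flag anything missing, NaN, or exploding. The function name and threshold are illustrative.

```python
import numpy as np

def audit_gradients(grads, explode_threshold=1e3):
    problems = []
    for name, g in grads.items():
        if g is None:
            problems.append(f"{name}: no gradient (detached from the graph?)")
        elif np.isnan(g).any():
            problems.append(f"{name}: NaN gradient")
        elif np.abs(g).max() > explode_threshold:
            problems.append(f"{name}: exploding gradient (max {np.abs(g).max():.1e})")
    return problems

# Two deliberately broken entries: a missing gradient and a NaN.
grads = {"W1": np.ones((3, 3)), "b1": None, "W2": np.array([np.nan])}
issues = audit_gradients(grads)
assert len(issues) == 2
```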
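The learning rate sweep above, sketched with linear regression standing in for the model: train briefly at each log-spaced rate and record the final loss. A healthy setup shows some range of rates where the loss drops clearly below its starting value.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))
y = X @ rng.standard_normal(4)

def short_run(lr, steps=50):
    W = np.zeros(4)
    for _ in range(steps):
        W -= lr * 2 * X.T @ (X @ W - y) / len(y)   # gradient step on MSE
    loss = np.mean((X @ W - y) ** 2)
    return loss if np.isfinite(loss) else float("inf")  # diverged runs count as inf

results = {lr: short_run(lr) for lr in np.logspace(-5, -1, 5)}  # 1e-5 .. 1e-1
baseline = np.mean(y ** 2)                     # loss at initialization (W = 0)
assert any(loss < baseline for loss in results.values())
```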
### Phase 4: Checkpoint Questions

At each stage, you should be able to answer:

**Understanding:**
- Why does this component exist?
- What would happen without it?
- What alternatives were considered?

**Implementation:**
- Why this specific implementation choice?
- Where could numerical issues arise?
- What's the computational complexity?

**Debugging:**
- What would it look like if this were broken?
- How would you test this in isolation?
- What are the most likely bugs?
## Output Format

For each implementation session, provide:

```markdown
## Today's Implementation Goal
[Specific component we're building]

## Prerequisites Check
- [ ] Previous components working
- [ ] Understand what we're building
- [ ] Know expected behavior

## Implementation

### Code
[Code blocks with extensive comments]

### Checkpoint Questions
1. [Question]
   <details><summary>Answer</summary>[Answer]</details>
2. [Question]
   <details><summary>Answer</summary>[Answer]</details>

### Verification Steps
- [ ] Test 1: [What to check]
- [ ] Test 2: [What to check]

### Common Bugs at This Stage
1. [Bug pattern]: [How to identify and fix]

## What's Next
[Preview of next component and how it connects]
```
## Tips for Specific Paper Types

### Transformer-based
- Attention mask shapes are the #1 bug source
- Verify positional encoding is applied correctly
- Check layer norm placement (pre vs post)
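For the mask-shape bug in particular, a sketch of the sanity check: build a causal mask explicitly and assert its shape and orientation before trusting it inside attention. Dimensions here are illustrative.

```python
import numpy as np

T = 5
causal = np.tril(np.ones((T, T), dtype=bool))   # position t may attend to positions <= t
assert causal.shape == (T, T)
assert causal[0, 0] and not causal[0, 1]        # row = query, column = key
assert causal[T - 1].all()                      # the last query sees everything

# Applying it: masked positions get -inf before the softmax.
scores = np.zeros((T, T))
scores = np.where(causal, scores, -np.inf)
assert np.isneginf(scores[0, 1])
```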
### RL/Policy Gradient
- Sign errors in policy gradient are silent killers
- Advantage normalization matters
- Verify discount factor handling
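All three pitfalls in one sketch, with illustrative names: discounted returns computed backwards with a hand-checkable case, per-batch advantage normalization, and the leading minus sign (the objective is maximized, so the loss you minimize is its negative).

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # G_t = r_t + gamma * G_{t+1}
        out[t] = running
    return out

def pg_loss(log_probs, advantages, eps=1e-8):
    adv = (advantages - advantages.mean()) / (advantages.std() + eps)
    return -(log_probs * adv).mean()             # note the leading minus sign

# Hand-checkable case: gamma = 0.5, rewards [1, 0, 1]
returns = discounted_returns(np.array([1.0, 0.0, 1.0]), gamma=0.5)
assert np.allclose(returns, [1.25, 0.5, 1.0])
```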
### Generative Models
- KL term balancing is finicky
- Check latent space distribution
- Verify reconstruction looks reasonable before training
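For VAE-style models, one concrete sanity check on the KL term, assuming a diagonal-Gaussian posterior (the closed form below is the standard one; variable names are illustrative):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    # Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions.
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

# Sanity checks: the prior against itself is exactly zero, and shifting the
# mean away from zero strictly increases the KL.
zeros = np.zeros((2, 8))
assert np.allclose(gaussian_kl(zeros, zeros), 0.0)
assert (gaussian_kl(zeros + 1.0, zeros) > 0).all()
```

A KL term that is never exactly zero for the prior, or negative anywhere, is a sign bug worth catching before any balancing experiments.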
### Computer Vision
- Normalization (ImageNet stats, batch norm) is crucial
- Data augmentation can make or break results
- Verify input preprocessing matches paper exactly
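A sketch of verifying the preprocessing numerically rather than by eye. The mean/std values below are the widely used ImageNet statistics; confirm them against the paper you are following.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(img_uint8):
    x = img_uint8.astype(np.float32) / 255.0     # scale to [0, 1]
    return (x - IMAGENET_MEAN) / IMAGENET_STD    # per-channel normalize

img = np.full((4, 4, 3), 255, dtype=np.uint8)    # all-white test image
out = preprocess(img)
assert out.shape == (4, 4, 3)
assert np.allclose(out[0, 0], (1.0 - IMAGENET_MEAN) / IMAGENET_STD)
```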
## Success Criteria

You're done when:

- **Numbers match** - Within reasonable variance of the paper's results
- **Understanding is deep** - You can explain every line of code
- **You found the gotchas** - You know what breaks and why
- **You could modify it** - Confident to try your own variations
## Anti-Patterns to Avoid

- ❌ Copying code you don't understand
- ❌ Skipping checkpoint questions
- ❌ Using pre-built components for the core algorithm
- ❌ Ignoring discrepancies with the paper
- ❌ Moving on before the current step works