# Using Deep RL Meta-Skill

## When to Use This Skill
Invoke this meta-skill when you encounter:
- **RL Implementation**: Implementing reinforcement learning algorithms (Q-learning, DQN, PPO, SAC, etc.)
- **Agent Training**: Training agents in environments (games, robotics, control systems)
- **Sequential Decision-Making**: Problems requiring learning from trial and error
- **Policy Optimization**: Learning policies that maximize cumulative rewards
- **RL Debugging**: Debugging training issues, agents not learning, reward problems
This is the entry point for the deep-rl pack. It routes to 13 specialized skills based on problem characteristics.
## How to Access Reference Sheets

**IMPORTANT**: All reference sheets are located in the SAME DIRECTORY as this SKILL.md file.

When this skill is loaded from: `skills/using-deep-rl/SKILL.md`
Reference sheets like `rl-foundations.md` are at: `skills/using-deep-rl/rl-foundations.md`
NOT at: `skills/rl-foundations.md` ← WRONG PATH
## Core Principle

**Problem type determines algorithm family.**
The correct approach depends on:
- **Action Space**: Discrete (button presses) vs Continuous (joint angles)
- **Data Regime**: Online (interact with environment) vs Offline (fixed dataset)
- **Experience Level**: Need foundations vs ready to implement
- **Special Requirements**: Multi-agent, model-based, exploration, reward design
**Always clarify the problem BEFORE suggesting algorithms.**
## The 13 Deep RL Skills

- **rl-foundations** - MDP formulation, Bellman equations, value vs policy basics
- **value-based-methods** - Q-learning, DQN, Double DQN, Dueling DQN, Rainbow
- **policy-gradient-methods** - REINFORCE, PPO, TRPO, policy optimization
- **actor-critic-methods** - A2C, A3C, SAC, TD3, advantage functions
- **model-based-rl** - World models, Dyna, MBPO, planning with learned models
- **offline-rl** - Batch RL, CQL, IQL, learning from fixed datasets
- **multi-agent-rl** - MARL, cooperative/competitive, communication
- **exploration-strategies** - ε-greedy, UCB, curiosity, RND, intrinsic motivation
- **reward-shaping** - Reward design, potential-based shaping, inverse RL
- **counterfactual-reasoning** - Causal inference, HER, off-policy evaluation, twin networks
- **rl-debugging** - Common RL bugs, why not learning, systematic debugging
- **rl-environments** - Gym, MuJoCo, custom envs, wrappers, vectorization
- **rl-evaluation** - Evaluation methodology, variance, sample efficiency metrics
## Routing Decision Framework

### Step 1: Assess Experience Level

- If user asks "what is RL" or "how does RL work" → rl-foundations
- If confused about value vs policy, on-policy vs off-policy → rl-foundations
- If user has a specific problem and RL background → Continue to Step 2

**Why foundations first**: Cannot implement algorithms without understanding MDPs, Bellman equations, and exploration-exploitation tradeoffs.
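To make the prerequisite concrete: every value-based method revolves around the Bellman optimality equation,

$$Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[\, r(s, a) + \gamma \max_{a'} Q^*(s', a') \,\big],$$

where $\gamma \in [0, 1)$ is the discount factor and $P$ is the transition kernel. A user who cannot parse this equation should be routed to rl-foundations before any implementation work.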
### Step 2: Classify Action Space

**Discrete Actions** (buttons, menu selections, discrete signals)

| Condition | Route To | Why |
|-----------|----------|-----|
| Small action space (< 100) + online | value-based-methods (DQN) | Q-networks excel at discrete |
| Large action space OR need policy flexibility | policy-gradient-methods (PPO) | Scales to larger spaces |

**Continuous Actions** (joint angles, motor forces, steering)

| Condition | Route To | Why |
|-----------|----------|-----|
| Sample efficiency critical | actor-critic-methods (SAC) | Off-policy, automatic entropy |
| Stability critical | actor-critic-methods (TD3) | Deterministic, handles overestimation |
| Simplicity preferred | policy-gradient-methods (PPO) | On-policy, simpler |

**CRITICAL**: NEVER suggest DQN for continuous actions. DQN requires discrete actions. The check below makes this mechanical.
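A minimal sketch of the Step 2 check, assuming Gymnasium is installed; the skill-name strings mirror the routing tables above:

```python
import gymnasium as gym
from gymnasium import spaces

def classify_action_space(env_id: str) -> str:
    """Route by action space type; never DQN for continuous actions."""
    env = gym.make(env_id)
    space = env.action_space
    if isinstance(space, spaces.Discrete):
        # Small discrete space: Q-networks (DQN family) excel here.
        if space.n < 100:
            return "value-based-methods"
        # Large discrete space: policy gradients scale better.
        return "policy-gradient-methods"
    if isinstance(space, spaces.Box):
        # Continuous actions: SAC/TD3 territory, never DQN.
        return "actor-critic-methods"
    raise ValueError(f"Unhandled action space: {type(space).__name__}")

# classify_action_space("CartPole-v1")  -> "value-based-methods"
# classify_action_space("Pendulum-v1")  -> "actor-critic-methods"
```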
### Step 3: Identify Data Regime

**Online Learning** (agent interacts with environment)

- Discrete → value-based-methods OR policy-gradient-methods
- Continuous → actor-critic-methods
- Sample efficiency critical → Consider model-based-rl

**Offline Learning** (fixed dataset, no interaction)

→ offline-rl (CQL, IQL)

**Red Flag**: If user has fixed dataset and suggests DQN/PPO/SAC, STOP and route to offline-rl. Standard algorithms assume online interaction and will fail.
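To see why, contrast the regimes: an offline loop only ever samples from a frozen dataset and never calls `env.step()`. A minimal sketch; the dataset layout (dict of parallel NumPy arrays) and the `update` callback are illustrative assumptions, not the API of any particular offline RL library:

```python
import numpy as np

def offline_training_loop(dataset: dict, update, batch_size: int = 256,
                          gradient_steps: int = 100_000) -> None:
    """Train purely from a fixed dataset: no env.step(), no new data."""
    n = len(dataset["observations"])
    for _ in range(gradient_steps):
        idx = np.random.randint(0, n, size=batch_size)
        batch = {key: array[idx] for key, array in dataset.items()}
        # The update must be conservative (CQL/IQL-style): the policy
        # will otherwise exploit actions the dataset never covers.
        update(batch)
```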
### Step 4: Special Problem Types

| Problem | Route To | Key Consideration |
|---------|----------|-------------------|
| Multiple agents | multi-agent-rl | Non-stationarity, credit assignment |
| Sample efficiency extreme | model-based-rl | Learns environment model |
| Counterfactual/causal | counterfactual-reasoning | HER, off-policy evaluation (sketch below) |
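HER, mentioned above, is worth one concrete illustration: a failed trajectory is relabeled as a success for the goal it actually reached. A minimal sketch of the relabeling step, assuming goal-conditioned transitions stored as dicts with `achieved_goal` and `goal` arrays (an illustrative layout, not a fixed format):

```python
import numpy as np

def her_relabel(trajectory: list, eps: float = 0.05) -> list:
    """Relabel a trajectory with the goal it actually achieved (HER)."""
    achieved = trajectory[-1]["achieved_goal"]  # goal actually reached
    relabeled = []
    for step in trajectory:
        new_step = dict(step, goal=achieved)
        # Sparse goal-reaching reward under the substituted goal.
        distance = np.linalg.norm(step["achieved_goal"] - achieved)
        new_step["reward"] = 0.0 if distance < eps else -1.0
        relabeled.append(new_step)
    return relabeled
```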
### Step 5: Debugging and Infrastructure

| Problem | Route To | Why |
|---------|----------|-----|
| "Not learning" / reward flat | rl-debugging FIRST | 80% of issues are bugs, not algorithms |
| Exploration problems | exploration-strategies | Curiosity, RND, intrinsic motivation |
| Reward design issues | reward-shaping | Potential-based shaping, inverse RL |
| Environment setup | rl-environments | Gym API, wrappers, vectorization |
| Evaluation questions | rl-evaluation | Deterministic vs stochastic, multiple seeds |

**Red Flag**: If user immediately wants to change algorithms because "it's not learning," route to rl-debugging first. A quick sanity check is sketched below.
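The first debugging step is usually not algorithmic at all: confirm the environment emits a learnable reward signal under a random policy. A minimal sketch, assuming Gymnasium:

```python
import gymnasium as gym

def random_baseline(env_id: str, episodes: int = 20) -> float:
    """Average return of a random policy; a floor for any learned agent."""
    env = gym.make(env_id)
    returns = []
    for _ in range(episodes):
        obs, info = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, info = env.step(
                env.action_space.sample()
            )
            total += reward
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / len(returns)

# If every episode returns the same number (e.g. always 0.0), the problem
# is likely the environment or reward, not the algorithm.
```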
## Rationalization Resistance Table

| Rationalization | Reality | Counter-Guidance |
|-----------------|---------|------------------|
| "Just use PPO for everything" | PPO is general but not optimal for all cases | Clarify: discrete or continuous? Sample efficiency constraints? |
| "DQN for continuous actions" | DQN requires discrete actions | Use SAC or TD3 for continuous |
| "Offline RL is just RL on a dataset" | Offline has distribution shift, needs special algorithms | Route to offline-rl for CQL, IQL |
| "More data always helps" | Sample efficiency and distribution matter | Off-policy vs on-policy matters |
| "My algorithm isn't learning, I need a better one" | Usually bugs, not algorithm | Route to rl-debugging first |
| "I'll discretize continuous actions for DQN" | Discretization loses precision, explodes action space | Use actor-critic-methods |
| "Epsilon-greedy is enough for exploration" | Complex environments need sophisticated exploration | Route to exploration-strategies |
| "I'll just increase the reward when it doesn't learn" | Reward scaling breaks learning | Route to rl-debugging |
| "I can reuse online RL code for offline data" | Offline needs conservative algorithms | Route to offline-rl |
| "Test reward lower than training = overfitting" | Exploration vs exploitation difference | Route to rl-evaluation (see sketch below) |
## Red Flags Checklist
Watch for these signs of incorrect routing:
- **Algorithm-First Thinking**: Recommending algorithm before asking about action space, data regime
- **DQN for Continuous**: Suggesting DQN/Q-learning for continuous action spaces
- **Offline Blindness**: Not recognizing fixed dataset requires offline-rl
- **PPO Cargo-Culting**: Defaulting to PPO without considering alternatives
- **No Problem Characterization**: Not asking: discrete vs continuous? online vs offline?
- **Skipping Foundations**: Implementing algorithms when user doesn't understand RL basics
- **Debug-Last**: Suggesting algorithm changes before systematic debugging
- **Sample Efficiency Ignorance**: Not asking about sample constraints
If any red flag triggered → STOP → Ask diagnostic questions → Route correctly
## Routing Decision Tree Summary

```
START: RL problem
│
├─ Need foundations? → rl-foundations
│
├─ DISCRETE actions?
│   ├─ Small space + online → value-based-methods (DQN)
│   └─ Large space → policy-gradient-methods (PPO)
│
├─ CONTINUOUS actions?
│   ├─ Sample efficiency → actor-critic-methods (SAC)
│   ├─ Stability → actor-critic-methods (TD3)
│   └─ Simplicity → policy-gradient-methods (PPO)
│
├─ OFFLINE data? → offline-rl (CQL, IQL) [CRITICAL]
│
├─ MULTI-AGENT? → multi-agent-rl
│
├─ Sample efficiency EXTREME? → model-based-rl
│
├─ COUNTERFACTUAL? → counterfactual-reasoning
│
└─ DEBUGGING?
    ├─ Not learning → rl-debugging
    ├─ Exploration → exploration-strategies
    ├─ Reward design → reward-shaping
    ├─ Environment → rl-environments
    └─ Evaluation → rl-evaluation
```
## Diagnostic Questions

### Action Space

- "Discrete choices or continuous values?"
- "How many actions? Small (< 100), large, or infinite?"

### Data Regime

- "Can agent interact with environment, or fixed dataset?"
- "Online learning or offline?"

### Experience Level

- "New to RL, or specific problem?"
- "Understand MDPs, value functions, policy gradients?"

### Special Requirements

- "Multiple agents? Cooperate or compete?"
- "Sample efficiency critical? How many episodes?"
- "Sparse reward (only at goal) or dense (every step)?"
## When NOT to Use This Pack

| User Request | Correct Pack | Reason |
|--------------|--------------|--------|
| "Train classifier on labeled data" | training-optimization | Supervised learning |
| "Design transformer architecture" | neural-architectures | Architecture design |
| "Deploy model to production" | ml-production | Deployment |
| "Fine-tune LLM with RLHF" | llm-specialist | LLM-specific |
## Multi-Skill Scenarios

See `multi-skill-scenarios.md` for detailed routing sequences:

- Complete beginner to RL
- Continuous control (robotics)
- Offline RL from dataset
- Multi-agent cooperative task
- Sample-efficient learning
- Sparse reward problem
- RL-controlled neural architecture
## Final Reminders

- Problem characterization BEFORE algorithm selection
- DQN for discrete ONLY (never continuous)
- Offline data needs offline-rl (CQL, IQL)
- PPO is not universal (good general-purpose, not optimal everywhere)
- Debug before changing algorithms (route to rl-debugging)
- Ask questions, don't assume (action space? data regime?)
## Deep RL Specialist Skills

After routing, load the appropriate specialist skill for detailed guidance:

- `rl-foundations.md` - MDP formulation, Bellman equations, value vs policy basics
- `value-based-methods.md` - Q-learning, DQN, Double DQN, Dueling DQN, Rainbow
- `policy-gradient-methods.md` - REINFORCE, PPO, TRPO, policy optimization
- `actor-critic-methods.md` - A2C, A3C, SAC, TD3, advantage functions
- `model-based-rl.md` - World models, Dyna, MBPO, planning with learned models
- `offline-rl.md` - Batch RL, CQL, IQL, learning from fixed datasets
- `multi-agent-rl.md` - MARL, cooperative/competitive, communication
- `exploration-strategies.md` - ε-greedy, UCB, curiosity, RND, intrinsic motivation
- `reward-shaping-engineering.md` - Reward design, potential-based shaping, inverse RL
- `counterfactual-reasoning.md` - Causal inference, HER, off-policy evaluation, twin networks
- `rl-debugging.md` - Common RL bugs, why not learning, systematic debugging
- `rl-environments.md` - Gym, MuJoCo, custom envs, wrappers, vectorization
- `rl-evaluation.md` - Evaluation methodology, variance, sample efficiency metrics
- `multi-skill-scenarios.md` - Common problem routing sequences