
Reinforcement Learning Best Practices

Overview

This skill provides comprehensive guidance for implementing reinforcement learning in Python using the modern ecosystem (2024-2025). Gymnasium has replaced OpenAI Gym as the standard environment interface. Stable-Baselines3 (SB3) is recommended for prototyping, RLlib for production/distributed training, and CleanRL for research.
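In practice, the main API change is the environment loop: Gymnasium's reset returns an (observation, info) tuple and step returns a 5-tuple that separates termination from truncation. A minimal interaction loop:

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)                 # Gymnasium: reset returns (obs, info)
for _ in range(200):
    action = env.action_space.sample()         # random policy, for illustration only
    obs, reward, terminated, truncated, info = env.step(action)  # 5-tuple
    if terminated or truncated:
        obs, info = env.reset()
env.close()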

When to Use

  • Building RL agents for discrete or continuous control tasks

  • Creating custom simulation environments

  • Tuning hyperparameters for RL algorithms

  • Debugging training issues (reward curves, policy collapse, numerical instability)

  • Deploying trained policies to production

Library Selection

Library             Best For                  Ease    Flexibility  Production
Stable-Baselines3   Prototyping, learning     High    Medium       Good
RLlib               Production, distributed   Medium  High         Excellent
CleanRL             Research, understanding   High    Low          Poor
TorchRL             Custom implementations    Low     Highest      Good
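For contrast with the SB3 examples below, a rough sketch of the RLlib style, assuming Ray 2.x's PPOConfig builder API (method names and result keys vary across Ray versions):

from ray.rllib.algorithms.ppo import PPOConfig

# Build a PPO algorithm from a config object, then call train() once per iteration
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(lr=3e-4)
)
algo = config.build()
for _ in range(10):
    result = algo.train()   # returns a metrics dict; inspect it for episode returns
algo.stop()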

Algorithm Decision Tree

Start
  |
  v
Action space type?
  |
  +-- Discrete --> Sample efficiency critical?
  |                  |
  |                  +-- Yes --> DQN (or Double/Dueling DQN)
  |                  +-- No  --> Stability critical?
  |                               |
  |                               +-- Yes --> PPO
  |                               +-- No  --> A2C (faster iterations)
  |
  +-- Continuous --> Sample efficiency critical?
                       |
                       +-- Yes --> SAC (auto entropy) or TD3
                       +-- No  --> PPO (more stable, less efficient)

Quick Selection Table:

Scenario                                 Recommended  Why
Discrete actions, getting started        PPO          Stable, good defaults
Continuous control                       SAC or TD3   Sample efficient, handles continuous actions well
Sample efficiency critical               SAC, DQN     Off-policy, reuses experience
Stability critical                       PPO          Trust-region-style clipping, consistent
High-dimensional observations (images)   PPO + CNN    Handles visual input well
Fast iteration needed                    A2C          Simpler, faster per update
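A minimal sketch of encoding the decision tree in code with SB3, dispatching on the action-space type (SAC requires a continuous Box action space; PPO covers the discrete case):

import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO, SAC

def make_model(env_id: str):
    env = gym.make(env_id)
    if isinstance(env.action_space, spaces.Box):
        # Continuous control: off-policy SAC for sample efficiency
        return SAC("MlpPolicy", env, verbose=1)
    # Discrete actions: PPO as a stable default
    return PPO("MlpPolicy", env, verbose=1)

model = make_model("Pendulum-v1")   # picks SAC for this continuous-control task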

Quick Start with Stable-Baselines3

Basic Training

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create vectorized environment (4 parallel envs)
env = make_vec_env("CartPole-v1", n_envs=4)

# Initialize and train
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Save and load
model.save("ppo_cartpole")
loaded_model = PPO.load("ppo_cartpole")

# Evaluate
obs = env.reset()
for _ in range(1000):
    action, _ = loaded_model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
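The loop above runs the policy back on the training environment; for reporting, evaluate on a separate, Monitor-wrapped environment instead. A sketch using evaluate_policy with the loaded_model from the block above:

import gymnasium as gym
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# Fresh environment used only for evaluation
eval_env = Monitor(gym.make("CartPole-v1"))
mean_reward, std_reward = evaluate_policy(
    loaded_model, eval_env, n_eval_episodes=20, deterministic=True
)
print(f"eval reward: {mean_reward:.1f} +/- {std_reward:.1f}")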

Custom Environment Template

import gymnasium as gym
from gymnasium import spaces
import numpy as np


class CustomEnv(gym.Env):
    metadata = {"render_modes": ["human", "rgb_array"]}

    def __init__(self, render_mode=None):
        super().__init__()
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32
        )
        self.action_space = spaces.Discrete(2)
        self.render_mode = render_mode

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(4,))
        return self.state.astype(np.float32), {}

    def step(self, action):
        # Implement environment dynamics here
        observation = self.state.astype(np.float32)
        reward = 1.0
        terminated = False  # Episode ended due to task completion/failure
        truncated = False   # Episode ended due to time limit
        info = {}
        return observation, reward, terminated, truncated, info

    def render(self):
        pass
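Before training on a custom environment, validate it against the API; a short sketch using SB3's environment checker (Gymnasium ships a similar checker in gymnasium.utils.env_checker):

from stable_baselines3.common.env_checker import check_env

env = CustomEnv()
check_env(env, warn=True)   # checks spaces, reset/step signatures, and dtypes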

Hyperparameter Tuning with Optuna

import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_categorical("n_steps", [256, 512, 1024, 2048])
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)

    model = PPO(
        "MlpPolicy", "CartPole-v1",
        learning_rate=learning_rate,
        n_steps=n_steps,
        gamma=gamma,
        verbose=0,
    )
    model.learn(total_timesteps=50_000)

    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
    return mean_reward


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(f"Best params: {study.best_params}")
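Once the search finishes, a common follow-up is to retrain a final model with the best hyperparameters on a longer budget; a sketch continuing from the study above (the 200k-step budget is an arbitrary choice):

# study.best_params contains learning_rate, n_steps, and gamma from the search
best_model = PPO("MlpPolicy", "CartPole-v1", verbose=0, **study.best_params)
best_model.learn(total_timesteps=200_000)
best_model.save("ppo_cartpole_tuned")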

Core Workflow

  • Define the environment - Use Gymnasium API, validate spaces

  • Select algorithm - Based on action space and requirements

  • Start simple - Default hyperparameters, short training

  • Monitor training - TensorBoard, check reward curves (see the monitoring sketch after this list)

  • Debug issues - Use the debugging playbook

  • Tune hyperparameters - Optuna for systematic search

  • Evaluate properly - Separate eval env, multiple seeds

  • Deploy - Export to ONNX/TorchScript
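For the monitoring and evaluation steps, a sketch combining TensorBoard logging with a periodic EvalCallback on a separate environment (log paths and frequencies are arbitrary examples):

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env

train_env = make_vec_env("CartPole-v1", n_envs=4)
eval_env = make_vec_env("CartPole-v1", n_envs=1)

# Periodically evaluate on the held-out env and keep the best checkpoint
eval_callback = EvalCallback(
    eval_env,
    eval_freq=5_000,
    n_eval_episodes=10,
    best_model_save_path="./best_model",
    log_path="./eval_logs",
)

model = PPO("MlpPolicy", train_env, tensorboard_log="./tb_logs", verbose=1)
model.learn(total_timesteps=100_000, callback=eval_callback)
# Inspect reward and entropy curves with: tensorboard --logdir ./tb_logs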

Reference Files

  • algorithms.md - Deep dive on DQN, PPO, SAC, A2C, TD3

  • environments.md - Gymnasium setup, custom envs, wrappers

  • training.md - Hyperparameters, reward engineering, normalization

  • debugging.md - Failure modes, diagnostics, sanity checks

  • evaluation.md - Metrics, logging, reproducibility

  • deployment.md - ONNX export, inference optimization, safety
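deployment.md covers export in detail; as a rough, version-dependent sketch of the wrapper pattern for exporting an SB3 policy to ONNX (the wrapper class, dummy input shape, and opset version are assumptions and may need adjusting for your SB3/PyTorch versions):

import torch as th
from stable_baselines3 import PPO

class OnnxablePolicy(th.nn.Module):
    # Wraps the SB3 policy so torch.onnx.export sees a plain nn.Module
    def __init__(self, policy):
        super().__init__()
        self.policy = policy

    def forward(self, observation):
        # Keep only the deterministic action; drop values and log-probs
        return self.policy(observation, deterministic=True)[0]

model = PPO.load("ppo_cartpole", device="cpu")
dummy_obs = th.randn(1, *model.observation_space.shape)
th.onnx.export(
    OnnxablePolicy(model.policy), dummy_obs, "ppo_cartpole.onnx",
    opset_version=17, input_names=["observation"], output_names=["action"],
)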

Essential Dependencies

pip install gymnasium stable-baselines3 tensorboard optuna

# For Atari environments
pip install gymnasium[atari] gymnasium[accept-rom-license]

# For MuJoCo
pip install gymnasium[mujoco]

Common Pitfalls to Avoid

  • Not normalizing observations - Use VecNormalize wrapper (see the sketch after this list)

  • Wrong action space handling - Check discrete vs continuous

  • Ignoring seed management - Set seeds for reproducibility

  • Training and eval on same env - Use separate eval environment

  • Not monitoring entropy - A rapid collapse in policy entropy signals premature convergence (policy collapse)

  • Sparse rewards without shaping - Add intermediate rewards

  • Too large/small learning rate - Start with 3e-4 for most algorithms
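A sketch addressing the normalization and seed-management pitfalls above (the seed value and file name are arbitrary; the 3e-4 learning rate follows the last item):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.vec_env import VecNormalize

set_random_seed(42)                                     # seeds random, numpy, and torch
env = make_vec_env("CartPole-v1", n_envs=4, seed=42)
env = VecNormalize(env, norm_obs=True, norm_reward=True)

model = PPO("MlpPolicy", env, learning_rate=3e-4, seed=42)
model.learn(total_timesteps=100_000)

# The normalization statistics must be saved and reloaded alongside the model
env.save("vecnormalize.pkl")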
