pytorch-lightning

PyTorch Lightning - High-Level Training Framework

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "pytorch-lightning" with this command: npx skills add davila7/claude-code-templates/davila7-claude-code-templates-pytorch-lightning

PyTorch Lightning - High-Level Training Framework

Quick start

PyTorch Lightning organizes PyTorch code to eliminate boilerplate while maintaining flexibility.

Installation:

pip install lightning

Convert PyTorch to Lightning (3 steps):

import lightning as L
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

Step 1: Define LightningModule (organize your PyTorch code)

class LitModel(L.LightningModule):
    def __init__(self, hidden_size=128):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(28 * 28, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 10)
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = nn.functional.cross_entropy(y_hat, y)
        self.log('train_loss', loss)  # Auto-logged to TensorBoard
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

Step 2: Create data

train_loader = DataLoader(train_dataset, batch_size=32)
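If you want a fully self-contained run, a throwaway dataset of random tensors stands in for real data. The `RandomMNIST` name and shapes below are illustrative, not part of Lightning:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class RandomMNIST(Dataset):
    # Hypothetical stand-in dataset: random 28x28 "images" with random labels
    def __len__(self):
        return 256

    def __getitem__(self, idx):
        return torch.randn(28 * 28), torch.randint(0, 10, ()).item()

train_loader = DataLoader(RandomMNIST(), batch_size=32)
```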

Step 3: Train with Trainer (handles everything else!)

trainer = L.Trainer(max_epochs=10, accelerator='gpu', devices=2)
model = LitModel()
trainer.fit(model, train_loader)

That's it! Trainer handles:

  • GPU/TPU/CPU switching

  • Distributed training (DDP, FSDP, DeepSpeed)

  • Mixed precision (FP16, BF16)

  • Gradient accumulation

  • Checkpointing

  • Logging

  • Progress bars

Common workflows

Workflow 1: From PyTorch to Lightning

Original PyTorch code:

model = MyModel()
optimizer = torch.optim.Adam(model.parameters())
model.to('cuda')

for epoch in range(max_epochs):
    for batch in train_loader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        optimizer.step()

Lightning version:

class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = MyModel()

    def training_step(self, batch, batch_idx):
        loss = self.model(batch)  # No .to('cuda') needed!
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

Train

trainer = L.Trainer(max_epochs=10, accelerator='gpu')
trainer.fit(LitModel(), train_loader)

Benefits: ~40 lines of boilerplate become ~15, no manual device management, and distributed training works without code changes.

Workflow 2: Validation and testing

class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = MyModel()

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = nn.functional.cross_entropy(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        val_loss = nn.functional.cross_entropy(y_hat, y)
        acc = (y_hat.argmax(dim=1) == y).float().mean()
        self.log('val_loss', val_loss)
        self.log('val_acc', acc)

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        test_loss = nn.functional.cross_entropy(y_hat, y)
        self.log('test_loss', test_loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

Train with validation

trainer = L.Trainer(max_epochs=10)
model = LitModel()
trainer.fit(model, train_loader, val_loader)

Test

trainer.test(model, test_loader)

Automatic features:

  • Validation runs every epoch by default

  • Metrics logged to TensorBoard

  • Checkpointing (by default the last epoch is saved; add a ModelCheckpoint callback with monitor='val_loss' to keep the best model)
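Checkpoints written during fit can be restored later with `load_from_checkpoint`, the standard Lightning API for this; the path below is illustrative and depends on your logger settings:

```python
# Restore a trained model from a saved checkpoint (path is illustrative)
model = LitModel.load_from_checkpoint(
    'lightning_logs/version_0/checkpoints/epoch=9-step=100.ckpt'
)
model.eval()  # switch to inference mode
```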

Workflow 3: Distributed training (DDP)

Same code as single GPU!

model = LitModel()

8 GPUs with DDP (automatic!)

trainer = L.Trainer(
    accelerator='gpu',
    devices=8,
    strategy='ddp'  # Or 'fsdp', 'deepspeed'
)

trainer.fit(model, train_loader)

Launch:

Single command, Lightning handles the rest

python train.py

No changes needed:

  • Automatic data distribution

  • Gradient synchronization

  • Multi-node support (just set num_nodes=2)
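A multi-node sketch, assuming a 2-node cluster with 8 GPUs each and a cluster launcher (e.g. SLURM) that Lightning reads rank information from; this configuration will not start on a machine without the GPUs:

```python
import lightning as L

trainer = L.Trainer(
    accelerator='gpu',
    devices=8,      # GPUs per node
    num_nodes=2,    # total nodes; rank info comes from the launcher environment
    strategy='ddp',
)
```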

Workflow 4: Callbacks for monitoring

from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor

Create callbacks

checkpoint = ModelCheckpoint(
    monitor='val_loss',
    mode='min',
    save_top_k=3,
    filename='model-{epoch:02d}-{val_loss:.2f}'
)

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,
    mode='min'
)

lr_monitor = LearningRateMonitor(logging_interval='epoch')

Add to Trainer

trainer = L.Trainer(
    max_epochs=100,
    callbacks=[checkpoint, early_stop, lr_monitor]
)

trainer.fit(model, train_loader, val_loader)

Result:

  • Auto-saves best 3 models

  • Stops early if no improvement for 5 epochs

  • Logs learning rate to TensorBoard

Workflow 5: Learning rate scheduling

class LitModel(L.LightningModule):
    # ... (training_step, etc.)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)

        # Cosine annealing
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=100,
            eta_min=1e-5
        )

        return {
            'optimizer': optimizer,
            'lr_scheduler': {
                'scheduler': scheduler,
                'interval': 'epoch',  # Update per epoch
                'frequency': 1
            }
        }

Learning rate auto-logged!

trainer = L.Trainer(max_epochs=100)
trainer.fit(model, train_loader)

When to use vs alternatives

Use PyTorch Lightning when:

  • Want clean, organized code

  • Need production-ready training loops

  • Switching between single GPU, multi-GPU, TPU

  • Want built-in callbacks and logging

  • Team collaboration (standardized structure)

Key advantages:

  • Organized: Separates research code from engineering

  • Automatic: DDP, FSDP, DeepSpeed with 1 line

  • Callbacks: Modular training extensions

  • Reproducible: Less boilerplate = fewer bugs

  • Battle-tested: 1M+ downloads/month

Use alternatives instead:

  • Accelerate: Minimal changes to existing code, more flexibility

  • Ray Train: Multi-node orchestration, hyperparameter tuning

  • Raw PyTorch: Maximum control, learning purposes

  • Keras: TensorFlow ecosystem

Common issues

Issue: Loss not decreasing

Check data and model setup:

Add to training_step

def training_step(self, batch, batch_idx):
    if batch_idx == 0:
        print(f"Batch shape: {batch[0].shape}")
        print(f"Labels: {batch[1]}")
    loss = ...
    return loss

Issue: Out of memory

Reduce batch size or use gradient accumulation:

trainer = L.Trainer(
    accumulate_grad_batches=4,  # Effective batch = batch_size × 4
    precision='bf16-mixed'      # Or '16-mixed'; roughly halves activation memory
)

Issue: Validation not running

Ensure you pass val_loader:

WRONG

trainer.fit(model, train_loader)

CORRECT

trainer.fit(model, train_loader, val_loader)

Issue: DDP spawns multiple processes unexpectedly

Lightning auto-detects GPUs. Explicitly set devices:

Test on CPU first

trainer = L.Trainer(accelerator='cpu', devices=1)

Then GPU

trainer = L.Trainer(accelerator='gpu', devices=1)
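Lightning's `fast_dev_run` flag is also useful before a DDP launch: it runs a single batch through train, validation, and test, then exits, so wiring errors surface in seconds:

```python
import lightning as L

# Smoke test: one batch per loop, then exit (no checkpoints or logs written)
trainer = L.Trainer(fast_dev_run=True, accelerator='cpu', devices=1)
```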

Advanced topics

Callbacks: See references/callbacks.md for EarlyStopping, ModelCheckpoint, custom callbacks, and callback hooks.

Distributed strategies: See references/distributed.md for DDP, FSDP, DeepSpeed ZeRO integration, multi-node setup.

Hyperparameter tuning: See references/hyperparameter-tuning.md for integration with Optuna, Ray Tune, and WandB sweeps.

Hardware requirements

  • CPU: Works (good for debugging)

  • Single GPU: Works

  • Multi-GPU: DDP (default), FSDP, or DeepSpeed

  • Multi-node: DDP, FSDP, DeepSpeed

  • TPU: Supported (8 cores)

  • Apple MPS: Supported

Precision options:

  • FP32 (default)

  • FP16 (V100, older GPUs)

  • BF16 (A100/H100, recommended)

  • FP8 (H100)
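Precision is a single Trainer argument. Common Lightning 2.x string values (hardware-dependent; FP8 additionally requires NVIDIA Transformer Engine support):

```python
import lightning as L

trainer = L.Trainer(precision='32-true')     # FP32 (default)
trainer = L.Trainer(precision='16-mixed')    # FP16 mixed precision
trainer = L.Trainer(precision='bf16-mixed')  # BF16 mixed precision
```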
