PyTorch Debugging Guide


This guide provides systematic approaches to debugging PyTorch models, from common tensor errors to complex training issues.

Common Error Patterns

  1. CUDA Out of Memory (OOM)

Error Message:

```
RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB
```

Causes:

  • Batch size too large for GPU memory

  • Accumulating gradients without clearing

  • Storing tensors on GPU unnecessarily

  • Memory leaks from not detaching tensors

Solutions:

Check current memory usage:

```python
print(torch.cuda.memory_summary(device=None, abbreviated=False))
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
```

Clear cache:

```python
torch.cuda.empty_cache()
```

Reduce batch size:

```python
batch_size = batch_size // 2
```

Use gradient checkpointing for large models:

```python
from torch.utils.checkpoint import checkpoint

output = checkpoint(self.heavy_layer, input)
```

Use mixed precision training:

```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

Detach tensors when storing for logging:

```python
logged_loss = loss.detach().cpu().item()
```

Use gradient accumulation instead of large batches:

```python
accumulation_steps = 4
for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
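When OOM is intermittent (e.g. with variable-length batches), the batch-size reduction above can be automated. A minimal sketch, not from the original guide; `run_batch` is a hypothetical callable that raises PyTorch's usual `RuntimeError` on OOM:

```python
def run_with_backoff(run_batch, batch_size, min_batch_size=1):
    """Call run_batch(batch_size), halving the batch size on CUDA OOM."""
    while batch_size >= min_batch_size:
        try:
            return run_batch(batch_size)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # only retry on OOM; propagate everything else
            # In real use, also call torch.cuda.empty_cache() here
            # to release cached blocks before retrying.
            batch_size //= 2
    raise RuntimeError("OOM even at the minimum batch size")
```

Newer PyTorch versions also expose `torch.cuda.OutOfMemoryError` (a `RuntimeError` subclass), which can be caught directly instead of matching the message text.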

  2. Tensor Size/Shape Mismatch

Error Message:

```
RuntimeError: size mismatch, m1: [32 x 512], m2: [256 x 10]
RuntimeError: The size of tensor a (64) must match the size of tensor b (32)
```

Causes:

  • Incorrect layer dimensions

  • Wrong tensor reshaping

  • Mismatched batch sizes

  • Incorrect input preprocessing

Solutions:

Debug by printing shapes at each layer:

```python
class DebugModel(nn.Module):
    def forward(self, x):
        print(f"Input shape: {x.shape}")
        x = self.layer1(x)
        print(f"After layer1: {x.shape}")
        x = self.layer2(x)
        print(f"After layer2: {x.shape}")
        return x
```

Add shape assertions as contracts:

```python
def forward(self, x):
    assert x.dim() == 4, f"Expected 4D input, got {x.dim()}D"
    assert x.shape[1] == 3, f"Expected 3 channels, got {x.shape[1]}"
    # ... rest of forward pass
```

Use einops for clearer reshaping:

```python
from einops import rearrange

x = rearrange(x, 'b c h w -> b (c h w)')
```

Calculate dimensions programmatically:

```python
def _get_conv_output_size(self, shape):
    with torch.no_grad():
        dummy = torch.zeros(1, *shape)
        output = self.conv_layers(dummy)
        return output.numel()
```

  3. NaN in Gradients/Loss

Error Message:

```
Loss is nan
RuntimeError: Function 'XXXBackward' returned nan values
```

Causes:

  • Learning rate too high

  • Numerical instability in operations

  • Division by zero

  • Log of zero or negative numbers

  • Exploding gradients

Solutions:

Enable anomaly detection to find the source:

```python
torch.autograd.set_detect_anomaly(True)
```

Check for NaN in tensors:

```python
def check_nan(tensor, name="tensor"):
    if torch.isnan(tensor).any():
        print(f"NaN detected in {name}")
        print(f"Shape: {tensor.shape}")
        print(f"NaN count: {torch.isnan(tensor).sum()}")
        raise ValueError(f"NaN in {name}")
```

Add epsilon for numerical stability:

```python
eps = 1e-8
log_probs = torch.log(probs + eps)
normalized = x / (x.norm() + eps)
```

Gradient clipping:

```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
```

Check gradients after backward:

```python
def check_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is not None:
            if torch.isnan(param.grad).any():
                print(f"NaN gradient in {name}")
            if torch.isinf(param.grad).any():
                print(f"Inf gradient in {name}")
            grad_norm = param.grad.norm()
            print(f"{name}: grad_norm = {grad_norm:.4f}")
```

Use stable loss functions:

```python
# BAD: nn.CrossEntropyLoss on softmax output
# GOOD: nn.CrossEntropyLoss on logits (raw scores)
loss = nn.CrossEntropyLoss()(logits, targets)  # Not softmax(logits)
```

  4. Device Mismatch (CPU/GPU)

Error Message:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
```

Causes:

  • Model on GPU, data on CPU (or vice versa)

  • Loading model saved on different device

  • Creating new tensors without specifying device

  • Mixing tensors from different GPUs

Solutions:

Always explicitly move to device:

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
inputs = inputs.to(device)
targets = targets.to(device)
```

Load model with map_location:

```python
model.load_state_dict(torch.load('model.pt', map_location=device))
```

Create tensors on the correct device:

```python
new_tensor = torch.zeros(10, device=device)
new_tensor = torch.zeros_like(existing_tensor)  # Same device as existing
```

Check device of all model parameters:

```python
def check_model_device(model):
    devices = {p.device for p in model.parameters()}
    print(f"Model parameters on devices: {devices}")
```

Debug device issues:

```python
print(f"Model device: {next(model.parameters()).device}")
print(f"Input device: {inputs.device}")
print(f"Target device: {targets.device}")
```

  5. Autograd Graph Issues

Error Message:

```
RuntimeError: Trying to backward through the graph a second time
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

Causes:

  • Calling backward() twice without retain_graph

  • In-place operations on tensors requiring gradients

  • Detaching tensors incorrectly

  • Not enabling gradients on input tensors

Solutions:

For multiple backward passes:

```python
loss.backward(retain_graph=True)  # Use sparingly - memory intensive
```

Avoid in-place operations on tensors with gradients:

```python
# BAD:
x += 1
x[0] = 0
x.add_(1)

# GOOD:
x = x + 1
x = x.clone()
x[0] = 0
```

Ensure requires_grad is set:

```python
input_tensor = torch.randn(10, requires_grad=True)
```

Clone before in-place modification:

```python
x_modified = x.clone()
x_modified[0] = 0
```

Check if a tensor has a gradient function:

```python
print(f"requires_grad: {tensor.requires_grad}")
print(f"grad_fn: {tensor.grad_fn}")
print(f"is_leaf: {tensor.is_leaf}")
```

Properly detach for logging/storage:

```python
logged_value = tensor.detach().cpu().numpy()
```

  6. DataLoader Problems

Error Message:

```
RuntimeError: DataLoader worker (pid X) is killed
BrokenPipeError: [Errno 32] Broken pipe
RuntimeError: Cannot re-initialize CUDA in forked subprocess
```

Causes:

  • Too many workers

  • Memory issues in workers

  • CUDA operations before DataLoader fork

  • Shared memory issues

Solutions:

Reduce the number of workers:

```python
dataloader = DataLoader(dataset, batch_size=32, num_workers=2)
```

Use spawn instead of fork for CUDA:

```python
import multiprocessing
multiprocessing.set_start_method('spawn', force=True)
```

Pin memory for faster GPU transfer (but uses more memory):

```python
dataloader = DataLoader(dataset, pin_memory=True)
```

Increase shared memory for Docker, in docker-compose.yml:

```yaml
shm_size: '2gb'
```

Debug DataLoader issues:

```python
dataloader = DataLoader(dataset, num_workers=0)  # Single process for debugging
```

Use persistent workers to avoid respawning:

```python
dataloader = DataLoader(dataset, num_workers=4, persistent_workers=True)
```

Custom collate function with error handling:

```python
def safe_collate(batch):
    try:
        return torch.utils.data.dataloader.default_collate(batch)
    except Exception as e:
        print(f"Collate error: {e}")
        print(f"Batch: {batch}")
        raise
```

  7. Dtype Mismatch

Error Message:

```
RuntimeError: expected scalar type Float but found Double
RuntimeError: expected scalar type Long but found Int
```

Causes:

  • Mixing float32 and float64

  • Wrong dtype for loss functions

  • NumPy default dtype conflicts

Solutions:

Explicitly set dtype:

```python
tensor = torch.tensor(data, dtype=torch.float32)
tensor = tensor.float()  # Convert to float32
tensor = tensor.long()   # Convert to int64
```

Set the default dtype globally:

```python
torch.set_default_dtype(torch.float32)
```

CrossEntropyLoss expects Long targets:

```python
targets = targets.long()
```

BCELoss expects Float targets:

```python
targets = targets.float()
```

Convert NumPy arrays properly:

```python
numpy_array = np.array([1.0, 2.0])  # float64 by default
tensor = torch.from_numpy(numpy_array).float()  # Convert to float32
```

Debugging Tools

  1. Anomaly Detection

Enable for debugging (disable in production - slow):

```python
torch.autograd.set_detect_anomaly(True)
```

Use as a context manager:

```python
with torch.autograd.detect_anomaly():
    output = model(input)
    loss = criterion(output, target)
    loss.backward()
```

  2. Python Debugger

Insert a breakpoint:

```python
breakpoint()  # Python 3.7+
import pdb; pdb.set_trace()  # Older Python
```

In Jupyter:

```python
from IPython.core.debugger import set_trace
set_trace()
```

Common pdb commands:

  • n - next line

  • s - step into

  • c - continue

  • p variable - print variable

  • pp variable - pretty print

  • l - list source code

  • q - quit
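A related technique, often the fastest route to a failing tensor, is post-mortem debugging: open pdb at the exact frame where an exception was raised, with all locals still inspectable. A sketch (not from the original guide); `step_fn` stands in for any training-step function:

```python
import pdb
import sys

def run_with_post_mortem(step_fn, *args):
    """Run step_fn; on failure, drop into pdb at the raising frame."""
    try:
        return step_fn(*args)
    except Exception:
        # sys.exc_info()[2] is the live traceback; post_mortem
        # starts pdb in the frame where the exception originated
        pdb.post_mortem(sys.exc_info()[2])
        raise
```

Inside the debugger you can then `p tensor.shape`, `p tensor.dtype`, etc. on the exact values that triggered the error.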

  3. Memory Profiling

GPU memory summary:

```python
print(torch.cuda.memory_summary())
```

Memory snapshots:

```python
torch.cuda.memory._record_memory_history()
# ... run code ...
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
```

Profile memory allocation:

```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    model(input)
print(prof.key_averages().table(sort_by="self_cuda_memory_usage"))
```

  4. TensorBoard Integration

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/experiment_1')

# Log scalars
writer.add_scalar('Loss/train', loss.item(), epoch)
writer.add_scalar('Accuracy/val', accuracy, epoch)

# Log histograms of weights and gradients
for name, param in model.named_parameters():
    writer.add_histogram(f'weights/{name}', param, epoch)
    if param.grad is not None:
        writer.add_histogram(f'gradients/{name}', param.grad, epoch)

# Log model graph
writer.add_graph(model, input_tensor)

# Log images
writer.add_images('predictions', predicted_images, epoch)

writer.close()
```

Launch: `tensorboard --logdir=runs`

  5. PyTorch Profiler

```python
from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    with record_function("model_inference"):
        model(input)

# Print summary
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Export for the Chrome trace viewer
prof.export_chrome_trace("trace.json")

# Export stack traces (e.g. for flame graph tools)
prof.export_stacks("profiler_stacks.txt", "self_cuda_time_total")
```

The Four Phases of PyTorch Debugging

Phase 1: Reproduce and Isolate

Set seeds for reproducibility:

```python
import random

import numpy as np
import torch

def set_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```

Create a minimal reproducible example: isolate the problematic component and test with synthetic data first.

```python
dummy_input = torch.randn(2, 3, 224, 224, device=device)
dummy_target = torch.randint(0, 10, (2,), device=device)

# Run a single forward/backward pass
model.train()
output = model(dummy_input)
loss = criterion(output, dummy_target)
loss.backward()
```

Phase 2: Validate Data Pipeline

Check the dataset:

```python
print(f"Dataset size: {len(dataset)}")
sample = dataset[0]
print(f"Sample type: {type(sample)}")
print(f"Sample shapes: {[s.shape if hasattr(s, 'shape') else type(s) for s in sample]}")
```

Check DataLoader output:

```python
batch = next(iter(dataloader))
for i, item in enumerate(batch):
    print(f"Batch item {i}: shape={item.shape}, dtype={item.dtype}, device={item.device}")
```

Validate data ranges:

```python
inputs, targets = batch
print(f"Input range: [{inputs.min():.4f}, {inputs.max():.4f}]")
print(f"Input mean: {inputs.mean():.4f}, std: {inputs.std():.4f}")
print(f"Target unique values: {targets.unique()}")
```

Check for data issues:

```python
assert not torch.isnan(inputs).any(), "NaN in inputs"
assert not torch.isinf(inputs).any(), "Inf in inputs"
```

Phase 3: Validate Model Architecture

Test with tiny data to check for bugs. A healthy model should be able to drive the loss near zero on a single batch; note that `criterion` is passed in explicitly here:

```python
def overfit_single_batch(model, batch, criterion, epochs=100):
    """Model should be able to overfit a single batch."""
    model.train()
    inputs, targets = batch
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f"Epoch {epoch}: Loss = {loss.item():.6f}")

    # Loss should be very low if the model can learn
    assert loss.item() < 0.1, "Model cannot overfit single batch!"
```

Check model parameters:

```python
def inspect_model(model):
    total_params = 0
    trainable_params = 0
    for name, param in model.named_parameters():
        total_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
        print(f"{name}: shape={param.shape}, requires_grad={param.requires_grad}")
    print(f"\nTotal params: {total_params:,}")
    print(f"Trainable params: {trainable_params:,}")
```

Verify forward-pass shape transformations:

```python
def trace_shapes(model, input_shape):
    """Trace shapes through the model using forward hooks."""
    hooks = []
    shapes = []

    def hook(module, input, output):
        shapes.append({
            'module': module.__class__.__name__,
            'input': [i.shape for i in input if hasattr(i, 'shape')],
            'output': output.shape if hasattr(output, 'shape') else type(output)
        })

    for layer in model.modules():
        hooks.append(layer.register_forward_hook(hook))

    dummy = torch.randn(1, *input_shape)
    model(dummy)

    for h in hooks:
        h.remove()

    for s in shapes:
        print(s)
```

Phase 4: Validate Training Loop

Comprehensive training-step debugging:

```python
def debug_training_step(model, batch, criterion, optimizer):
    model.train()
    inputs, targets = batch

    # Check inputs
    print(f"Input shape: {inputs.shape}, dtype: {inputs.dtype}")
    print(f"Target shape: {targets.shape}, dtype: {targets.dtype}")

    # Zero gradients
    optimizer.zero_grad()

    # Forward pass
    with torch.autograd.detect_anomaly():
        outputs = model(inputs)
        print(f"Output shape: {outputs.shape}")
        print(f"Output range: [{outputs.min():.4f}, {outputs.max():.4f}]")

        # Check for NaN in outputs
        if torch.isnan(outputs).any():
            print("WARNING: NaN in outputs!")

        # Compute loss
        loss = criterion(outputs, targets)
        print(f"Loss: {loss.item():.6f}")

        if torch.isnan(loss):
            print("WARNING: NaN loss!")
            return

        # Backward pass
        loss.backward()

    # Check gradients
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            print(f"{name}: grad_norm = {grad_norm:.6f}")
            if torch.isnan(param.grad).any():
                print(f"  WARNING: NaN gradient in {name}!")

    # Optimizer step
    optimizer.step()

    return loss.item()
```

Quick Reference Commands

Environment and Device Checks

PyTorch version:

```python
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
```

GPU information:

```python
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name()}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
```

Gradient Checking

Numerical gradient check for custom autograd functions:

```python
from torch.autograd import gradcheck

input = torch.randn(10, requires_grad=True, dtype=torch.double)
test = gradcheck(my_function, input, eps=1e-6, atol=1e-4)
print(f"Gradient check passed: {test}")
```

For modules:

```python
def check_module_gradients(module, input_shape):
    module = module.double()
    input = torch.randn(*input_shape, requires_grad=True, dtype=torch.double)
    return gradcheck(module, input, eps=1e-6, atol=1e-4)
```

Memory Management

Clear GPU memory:

```python
torch.cuda.empty_cache()
```

Memory stats:

```python
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"Max allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```

Reset peak stats:

```python
torch.cuda.reset_peak_memory_stats()
```

Find memory leaks (force garbage collection, then release cached blocks):

```python
import gc
gc.collect()
torch.cuda.empty_cache()
```

Model Inspection

Count parameters:

```python
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total_params:,}, Trainable: {trainable_params:,}")
```

Model summary (requires torchsummary):

```python
from torchsummary import summary

summary(model, input_size=(3, 224, 224))
```

Export the model to ONNX for visualization:

```python
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
```

Tensor Debugging

Comprehensive tensor info:

```python
def tensor_info(t, name="tensor"):
    print(f"{name}:")
    print(f"  shape: {t.shape}")
    print(f"  dtype: {t.dtype}")
    print(f"  device: {t.device}")
    print(f"  requires_grad: {t.requires_grad}")
    print(f"  is_leaf: {t.is_leaf}")
    print(f"  grad_fn: {t.grad_fn}")
    print(f"  min: {t.min().item():.6f}")
    print(f"  max: {t.max().item():.6f}")
    print(f"  mean: {t.mean().item():.6f}")
    print(f"  std: {t.std().item():.6f}")
    print(f"  has_nan: {torch.isnan(t).any().item()}")
    print(f"  has_inf: {torch.isinf(t).any().item()}")
```

Best Practices Summary

  • Always use explicit device placement - Never assume tensors are on the right device

  • Use loss functions on logits - CrossEntropyLoss expects raw scores, not softmax output

  • Register modules properly - Use nn.ModuleList/ModuleDict for parameter detection

  • Avoid in-place operations - They can break autograd graphs

  • Set seeds for reproducibility - Makes debugging deterministic

  • Start with small data - Test overfitting on a single batch first

  • Use anomaly detection - Enable during development, disable in production

  • Monitor gradients - Check for NaN, Inf, and exploding/vanishing gradients

  • Profile memory - Use memory_summary() to identify leaks

  • Print shapes liberally - Most bugs are shape mismatches
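To illustrate the module-registration point above: submodules kept in a plain Python list are invisible to `model.parameters()`, and therefore to the optimizer, while `nn.ModuleList` registers them. A minimal sketch (the class names are illustrative):

```python
import torch.nn as nn

class Unregistered(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain list: these Linear layers are NOT registered as submodules
        self.layers = [nn.Linear(8, 8) for _ in range(3)]

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList: parameters appear in model.parameters()
        self.layers = nn.ModuleList(nn.Linear(8, 8) for _ in range(3))

print(len(list(Unregistered().parameters())))  # 0 - the optimizer would see nothing
print(len(list(Registered().parameters())))    # 6 - weight + bias per layer
```

The silent failure mode is what makes this bug nasty: training runs without errors, but the unregistered layers never update.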

Resources

  • UvA Deep Learning Notebooks - Debugging Guide

  • PyTorch Lightning Debugging Guide

  • PyTorch Performance Tuning Guide

  • TorchRL Common Errors and Solutions

  • Machine Learning Mastery - Debugging PyTorch
