huggingface-accelerate

HuggingFace Accelerate - Unified Distributed Training

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "huggingface-accelerate" with this command: npx skills add davila7/claude-code-templates/davila7-claude-code-templates-huggingface-accelerate


Quick start

Accelerate adds distributed training to a standard PyTorch script with just four changed lines of code.

Installation:

pip install accelerate

Convert PyTorch script (4 lines):

  import torch
+ from accelerate import Accelerator

+ accelerator = Accelerator()

  model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())
  dataloader = torch.utils.data.DataLoader(dataset)

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  for batch in dataloader:
      optimizer.zero_grad()
      loss = model(batch)
-     loss.backward()
+     accelerator.backward(loss)
      optimizer.step()

Run (single command):

accelerate launch train.py
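
Sanity-check the installation and active config before a long run (both commands ship with the Accelerate CLI):

accelerate env    # print versions and the current accelerate configuration
accelerate test   # launch a minimal end-to-end script to verify the setup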

Common workflows

Workflow 1: From single GPU to multi-GPU

Original script:

# train.py
import torch

model = torch.nn.Linear(10, 2).to('cuda')
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

for epoch in range(10):
    for batch in dataloader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch).mean()
        loss.backward()
        optimizer.step()

With Accelerate (4 lines added):

# train.py
import torch
from accelerate import Accelerator  # +1

accelerator = Accelerator()  # +2

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # +3

for epoch in range(10):
    for batch in dataloader:  # no .to('cuda') needed - automatic!
        optimizer.zero_grad()
        loss = model(batch).mean()
        accelerator.backward(loss)  # +4
        optimizer.step()
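
Every process runs the whole script, so naive printing duplicates output per GPU. A minimal sketch of rank-aware logging; the loss reduction here is an addition to the example above, not part of it:

for epoch in range(10):
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(batch).mean()
        accelerator.backward(loss)
        optimizer.step()
    # average the last per-process loss across all ranks
    avg_loss = accelerator.reduce(loss.detach(), reduction="mean")
    # accelerator.print emits from the main process only
    accelerator.print(f"epoch {epoch}: loss {avg_loss.item():.4f}")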

Configure (interactive):

accelerate config

Questions:

  • Which machine? (single/multi GPU/TPU/CPU)

  • How many machines? (1)

  • Mixed precision? (no/fp16/bf16/fp8)

  • DeepSpeed? (no/yes)
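
Answers are saved as YAML (default: ~/.cache/huggingface/accelerate/default_config.yaml). A sketch of what a single-node, 8-GPU, bf16 answer set might produce; exact keys vary by Accelerate version:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 8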

Launch (works on any setup):

Single GPU

accelerate launch train.py

Multi-GPU (8 GPUs)

accelerate launch --multi_gpu --num_processes 8 train.py

Multi-node

accelerate launch --multi_gpu --num_processes 16 \
    --num_machines 2 --machine_rank 0 \
    --main_process_ip $MASTER_ADDR \
    train.py
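
Machine 2 (rank 1) runs the same command with a different rank; every node must reach $MASTER_ADDR (add --main_process_port if the default port is taken):

accelerate launch --multi_gpu --num_processes 16 \
    --num_machines 2 --machine_rank 1 \
    --main_process_ip $MASTER_ADDR \
    train.py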

Workflow 2: Mixed precision training

Enable FP16/BF16:

from accelerate import Accelerator

FP16 (with gradient scaling)

accelerator = Accelerator(mixed_precision='fp16')

BF16 (no scaling, more stable)

accelerator = Accelerator(mixed_precision='bf16')

FP8 (H100+)

accelerator = Accelerator(mixed_precision='fp8')

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Everything else is automatic!

for batch in dataloader:
    with accelerator.autocast():  # optional - already applied inside prepared models
        loss = model(batch)
    accelerator.backward(loss)
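
Gradient clipping does need an Accelerate-aware call: with fp16 the gradients are scaled, and accelerator.clip_grad_norm_ unscales them before clipping. A minimal sketch:

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    accelerator.backward(loss)
    # unscales fp16 gradients first, unlike torch.nn.utils.clip_grad_norm_
    accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()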

Workflow 3: DeepSpeed ZeRO integration

Enable DeepSpeed ZeRO-2:

from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,                     # ZeRO-2
    offload_optimizer_device="none",  # keep optimizer state on GPU
    gradient_accumulation_steps=4,
)

accelerator = Accelerator(
    mixed_precision='bf16',
    deepspeed_plugin=deepspeed_plugin,
)

Same code as before!

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Or via config:

accelerate config

Select: DeepSpeed → ZeRO-2

deepspeed_config.json:

{ "fp16": {"enabled": false}, "bf16": {"enabled": true}, "zero_optimization": { "stage": 2, "offload_optimizer": {"device": "cpu"}, "allgather_bucket_size": 5e8, "reduce_bucket_size": 5e8 } }

Launch (note: --config_file takes the Accelerate YAML written by accelerate config, which references the DeepSpeed JSON, not the DeepSpeed JSON itself):

accelerate launch --config_file default_config.yaml train.py
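
Alternatively, skip accelerate config and point the plugin at the JSON directly; a sketch using DeepSpeedPlugin's hf_ds_config argument:

from accelerate import Accelerator, DeepSpeedPlugin

# load ZeRO settings straight from the DeepSpeed JSON
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config="deepspeed_config.json")
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)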

Workflow 4: FSDP (Fully Sharded Data Parallel)

Enable FSDP:

from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy="FULL_SHARD",             # ZeRO-3 equivalent
    auto_wrap_policy="transformer_based_wrap",  # wrap per transformer block
    cpu_offload=False,
)

accelerator = Accelerator(
    mixed_precision='bf16',
    fsdp_plugin=fsdp_plugin,
)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Or via config:

accelerate config

Select: FSDP → Full Shard → No CPU Offload
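
Exporting weights under FULL_SHARD needs a gather, since no single rank holds the full model. A minimal sketch with accelerator.get_state_dict, which consolidates the shards:

import torch

# run on all processes so every shard participates in the gather
state_dict = accelerator.get_state_dict(model)
if accelerator.is_main_process:
    torch.save(state_dict, "model.pt")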

Workflow 5: Gradient accumulation

Accumulate gradients:

from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):  # handles accumulation and gradient sync
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)
        optimizer.step()

Effective batch size: batch_size * num_gpus * gradient_accumulation_steps (e.g. 32 * 8 * 4 = 1024).

When to use vs alternatives

Use Accelerate when:

  • Want simplest distributed training

  • Need single script for any hardware

  • Use HuggingFace ecosystem

  • Want flexibility (DDP/DeepSpeed/FSDP/Megatron)

  • Need quick prototyping

Key advantages:

  • 4 lines: Minimal code changes

  • Unified API: Same code for DDP, DeepSpeed, FSDP, Megatron

  • Automatic: Device placement, mixed precision, sharding

  • Interactive config: No manual launcher setup

  • Single launch: Works everywhere

Use alternatives instead:

  • PyTorch Lightning: Need callbacks, high-level abstractions

  • Ray Train: Multi-node orchestration, hyperparameter tuning

  • DeepSpeed: Direct API control, advanced features

  • Raw DDP: Maximum control, minimal abstraction

Common issues

Issue: Wrong device placement

Don't manually move to device:

WRONG

batch = batch.to('cuda')

CORRECT

Accelerate handles it automatically after prepare()
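
Tensors created by hand (noise, masks, labels built on the fly) should target accelerator.device instead of a hard-coded 'cuda'; a small sketch:

import torch

# works unchanged on CPU, single GPU, multi-GPU, and TPU
noise = torch.randn(8, 10, device=accelerator.device)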

Issue: Gradient accumulation not working

Use context manager:

CORRECT

with accelerator.accumulate(model):
    optimizer.zero_grad()
    loss = model(batch)
    accelerator.backward(loss)
    optimizer.step()

Issue: Checkpointing in distributed

Use accelerator methods:

Save (call on all processes - with sharded optimizer state every rank must participate; Accelerate decides which ranks write)

accelerator.save_state('checkpoint/')

Load on all processes

accelerator.load_state('checkpoint/')
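
save_state/load_state cover prepared models, optimizers, and RNG state. Custom objects that expose state_dict()/load_state_dict() can be registered too (the lr_scheduler name here is just an example object):

# any object with state_dict()/load_state_dict() can ride along in checkpoints
accelerator.register_for_checkpointing(lr_scheduler)
accelerator.save_state('checkpoint/')  # now also includes the scheduler state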

Issue: Different results with FSDP

Ensure same random seed:

from accelerate.utils import set_seed

set_seed(42)

Advanced topics

Megatron integration: See references/megatron-integration.md for tensor parallelism, pipeline parallelism, and sequence parallelism setup.

Custom plugins: See references/custom-plugins.md for creating custom distributed plugins and advanced configuration.

Performance tuning: See references/performance.md for profiling, memory optimization, and best practices.

Hardware requirements

  • CPU: Works (slow)

  • Single GPU: Works

  • Multi-GPU: DDP (default), DeepSpeed, or FSDP

  • Multi-node: DDP, DeepSpeed, FSDP, Megatron

  • TPU: Supported

  • Apple MPS: Supported

Launcher requirements:

  • DDP: torch.distributed.run (built-in)

  • DeepSpeed: deepspeed (pip install deepspeed)

  • FSDP: PyTorch 1.12+ (built-in)

  • Megatron: Custom setup
