training-llms-megatron

Megatron-Core - Large-Scale LLM Training

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running anything.


Install skill "training-llms-megatron" with this command: npx skills add orchestra-research/ai-research-skills/orchestra-research-ai-research-skills-training-llms-megatron


Quick start

Megatron-Core trains LLMs from 2B to 462B parameters with up to 47% Model FLOP Utilization on H100 GPUs through advanced parallelism strategies.

Installation:

Docker (recommended)

docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3

Or pip

pip install megatron-core
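
A quick way to confirm the install before launching anything distributed is to import the package (a trivial check; it only verifies that megatron.core resolves, not that GPUs or Transformer Engine are set up):

python -c "import megatron.core; print('megatron.core import OK')"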

Simple distributed training:

Train with 2 GPUs using data parallelism

torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py

Or LLaMA-3 8B training

./examples/llama/train_llama3_8b_fp8.sh

Common workflows

Workflow 1: Train LLaMA-style model with 3D parallelism

Copy this checklist:

LLaMA Training Setup:

  • Step 1: Choose parallelism configuration
  • Step 2: Configure training hyperparameters
  • Step 3: Launch distributed training
  • Step 4: Monitor performance metrics

Step 1: Choose parallelism configuration

Model size determines parallelism strategy:

Model Size | GPUs | Tensor Parallel | Pipeline Parallel | Data Parallel | Context Parallel
7B         | 8    | 1               | 1                 | 8             | 1
13B        | 8    | 2               | 1                 | 4             | 1
70B        | 64   | 4               | 4                 | 4             | 1
405B       | 128  | 8               | 8                 | 1             | 2
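
Whichever row you pick, the product of the parallel sizes must equal the GPU count (TP × PP × DP × CP = GPUs). A quick sanity check in bash (the variable names are illustrative, not Megatron flags):

# Check that a parallelism layout multiplies out to the GPU count
TP=4; PP=4; DP=4; CP=1; GPUS=64     # the 70B row above
if [ $((TP * PP * DP * CP)) -ne $GPUS ]; then
  echo "layout mismatch: $((TP * PP * DP * CP)) GPUs required, $GPUS available" >&2
fi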

Step 2: Configure training hyperparameters

#!/bin/bash

train_llama_70b.sh

GPUS_PER_NODE=8
NNODES=8        # 64 GPUs total
TP=4            # Tensor parallel
PP=4            # Pipeline parallel
CP=1            # Context parallel

LLaMA 70B configuration

MODEL_SIZE=70       # Billion parameters
HIDDEN_SIZE=8192
NUM_LAYERS=80
NUM_HEADS=64
SEQ_LENGTH=4096

Training hyperparameters

MICRO_BATCH=1
GLOBAL_BATCH=1024
LR=3e-4

torchrun \
  --nproc_per_node=$GPUS_PER_NODE \
  --nnodes=$NNODES \
  pretrain_gpt.py \
  --tensor-model-parallel-size $TP \
  --pipeline-model-parallel-size $PP \
  --context-parallel-size $CP \
  --sequence-parallel \
  --num-layers $NUM_LAYERS \
  --hidden-size $HIDDEN_SIZE \
  --num-attention-heads $NUM_HEADS \
  --seq-length $SEQ_LENGTH \
  --max-position-embeddings $SEQ_LENGTH \
  --micro-batch-size $MICRO_BATCH \
  --global-batch-size $GLOBAL_BATCH \
  --lr $LR \
  --train-iters 100000 \
  --lr-decay-style cosine \
  --lr-warmup-iters 2000 \
  --weight-decay 0.1 \
  --clip-grad 1.0 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt
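
The global batch size also fixes the gradient-accumulation factor: Megatron runs GLOBAL_BATCH / (MICRO_BATCH × data-parallel size) micro-batches per optimizer step. A quick divisibility check that could be appended to the script above (a sketch; the variable names come from this script, not from Megatron itself):

# Gradient accumulation implied by the batch settings above
WORLD_SIZE=$((GPUS_PER_NODE * NNODES))     # 64
DP=$((WORLD_SIZE / (TP * PP * CP)))        # 64 / (4*4*1) = 4
if [ $((GLOBAL_BATCH % (MICRO_BATCH * DP))) -ne 0 ]; then
  echo "global batch must be divisible by micro-batch * DP" >&2
fi
echo "gradient accumulation steps: $((GLOBAL_BATCH / (MICRO_BATCH * DP)))"   # 256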

Step 3: Launch distributed training

Single node (8 GPUs)

bash train_llama_70b.sh

Multi-node with SLURM

sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh
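
On SLURM clusters the launcher also needs a rendezvous endpoint so ranks on different nodes can find each other. A minimal batch-script sketch (directive values and the rendezvous setup are illustrative and depend on the cluster; the training flags are the ones from train_llama_70b.sh above):

#!/bin/bash
#SBATCH --nodes=8
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1

# Use the first node in the allocation as the rendezvous host
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=6000

srun torchrun \
  --nproc_per_node=8 \
  --nnodes=$SLURM_NNODES \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  pretrain_gpt.py "$@"        # append the model/training flags shown above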

Step 4: Monitor performance metrics

Key metrics to track:

  • Model FLOP Utilization (MFU): target >40% on H100
  • Throughput: tokens/sec/GPU
  • Memory usage: <80GB per GPU for a 70B model
  • Loss: should decrease steadily
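
MFU can be estimated from measured throughput with the common "6 × parameters FLOPs per token" approximation (this ignores attention FLOPs, so it slightly underestimates; the throughput value and the ~989 TFLOPS H100 BF16 dense peak are illustrative assumptions):

# Rough MFU estimate from tokens/sec/GPU
TOKENS_PER_SEC_PER_GPU=1000     # example measurement for a 70B model
PARAMS=70e9
PEAK_TFLOPS=989                 # H100 BF16 dense peak, approx.
awk -v t=$TOKENS_PER_SEC_PER_GPU -v p=$PARAMS -v peak=$PEAK_TFLOPS \
  'BEGIN { printf "MFU ~ %.1f%%\n", 100 * t * 6 * p / (peak * 1e12) }'   # ~42%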

Workflow 2: Configure Mixture of Experts (MoE) training

For sparse MoE models like Mixtral.

MoE Training:

  • Step 1: Configure expert parallelism
  • Step 2: Set MoE hyperparameters
  • Step 3: Launch training with EP

Step 1: Configure expert parallelism

Mixtral 8x7B example

TENSOR_PARALLEL=2
PIPELINE_PARALLEL=1
EXPERT_PARALLEL=4    # Split 8 experts across 4 GPUs
DATA_PARALLEL=4

TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * EXPERT_PARALLEL * DATA_PARALLEL))   # 2 * 1 * 4 * 4 = 32 GPUs
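
Also check that the expert count divides evenly by the expert-parallel size, since each expert-parallel rank hosts num-experts / EP experts. A small sketch:

# Each EP rank holds NUM_EXPERTS / EXPERT_PARALLEL experts
NUM_EXPERTS=8
if [ $((NUM_EXPERTS % EXPERT_PARALLEL)) -ne 0 ]; then
  echo "--num-experts must be divisible by --expert-model-parallel-size" >&2
else
  echo "local experts per GPU: $((NUM_EXPERTS / EXPERT_PARALLEL))"
fi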

Step 2: Set MoE hyperparameters

torchrun \
  --nproc_per_node=8 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 1 \
  --expert-model-parallel-size 4 \
  --num-experts 8 \
  --moe-router-topk 2 \
  --moe-router-load-balancing-type aux_loss \
  --moe-aux-loss-coeff 0.01 \
  --hidden-size 4096 \
  --num-layers 32 \
  --num-attention-heads 32 \
  --seq-length 4096 \
  --max-position-embeddings 4096 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt

Step 3: Launch training with EP

Expert parallelism distributes different experts across GPUs, reducing memory while maintaining capacity.

Memory without EP: each GPU holds all 8 experts (~56B expert parameters)
Memory with EP=4: each GPU holds 2 experts (~14B expert parameters)
Savings: 75% reduction in expert-weight memory
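
The figures above follow directly from local expert count × parameters per expert; a back-of-envelope sketch (parameter counts only, using the nominal 7B-per-expert figure implied by the "8x7B" naming; actual memory also depends on precision, gradients, and optimizer state):

# Expert parameters resident per GPU, with and without expert parallelism
NUM_EXPERTS=8
EXPERT_PARALLEL=4
PARAMS_PER_EXPERT_B=7                                              # nominal, in billions
NO_EP=$((NUM_EXPERTS * PARAMS_PER_EXPERT_B))                       # 56B params/GPU
WITH_EP=$((NUM_EXPERTS / EXPERT_PARALLEL * PARAMS_PER_EXPERT_B))   # 14B params/GPU
echo "reduction: $((100 - 100 * WITH_EP / NO_EP))%"                # 75%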

Workflow 3: Optimize for maximum throughput

Achieve 47% MFU on H100.

Performance Optimization:

  • Step 1: Enable Flash Attention
  • Step 2: Use FP8 precision (H100)
  • Step 3: Optimize micro-batch size
  • Step 4: Tune parallelism degrees

Step 1: Enable optimizations

--use-flash-attn                        # Flash Attention kernels
--use-mcore-models                      # Use Megatron Core models
--transformer-impl transformer_engine   # Use Transformer Engine
--sequence-parallel                     # Reduce activation memory (use with TP)

Step 2: Use FP8 precision (H100 only)

--fp8-hybrid # FP8 mixed precision training

Transformer Engine handles FP8 automatically

Result: 1.5-2x speedup on H100 vs BF16.

Step 3: Optimize micro-batch size

Find largest micro-batch that fits in memory:

Start with 1, increase until OOM

for MBS in 1 2 4 8; do
  echo "Testing micro-batch-size=$MBS"
  torchrun ... --micro-batch-size $MBS
done

Typical values:

  • 7B model: 4-8

  • 70B model: 1-2

  • 405B model: 1
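
To stop the sweep automatically at the first size that fails, wrap the launch and check its exit status (a sketch; the launch function is a placeholder for the full torchrun command from Workflow 1, and a non-zero exit is taken as a proxy for OOM):

# Stop the micro-batch sweep at the first failing size
launch() {
  # placeholder: substitute the full torchrun command and flags here
  torchrun --nproc_per_node=8 pretrain_gpt.py "$@"
}
for MBS in 1 2 4 8; do
  echo "Testing micro-batch-size=$MBS"
  if ! launch --micro-batch-size "$MBS" --train-iters 20; then
    echo "micro-batch-size=$MBS failed; keep the last size that succeeded"
    break
  fi
done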

Step 4: Tune parallelism degrees

Rules of thumb:

  • Tensor Parallel: use ≤8 (limited by NVLink within a node)
  • Pipeline Parallel: use for >70B models
  • Context Parallel: use for sequences >8K tokens
  • Data Parallel: fill the remaining GPUs

Example 405B on 128 H100s:

TP=8   # within one node
PP=8   # across nodes
CP=2   # long sequences
DP=1
Total = 8 × 8 × 2 × 1 = 128 GPUs

When to use vs alternatives

Use Megatron-Core when:

  • Training models >10B parameters

  • Need maximum efficiency (target >40% MFU)

  • Using NVIDIA GPUs (A100, H100)

  • Production training at scale

  • Want fine-grained parallelism control

Use alternatives instead:

  • PyTorch FSDP: Models <70B, simpler API, PyTorch native

  • DeepSpeed: Easier setup, good for <100B models

  • HuggingFace Accelerate: Prototyping, simpler workflows

  • LitGPT: Educational, single-file implementations

Common issues

Issue: Low GPU utilization (<30% MFU)

Causes:

  • Micro-batch too small

  • Too much parallelism overhead

  • Not using Flash Attention

Fixes:

Increase micro-batch

--micro-batch-size 4 # Was 1

Enable optimizations

--use-flash-attn --sequence-parallel

Reduce TP if >8

--tensor-model-parallel-size 4 # Was 16

Issue: Out of memory

Reduce memory with:

--tensor-model-parallel-size 2   # Split model across GPUs
--recompute-granularity full     # Gradient checkpointing
--recompute-method block         # Checkpoint transformer blocks
--recompute-num-layers 1         # Checkpoint every layer

Or use CPU/NVMe offloading:

--cpu-optimizer             # Offload optimizer to CPU
--cpu-optimizer-type ADAM   # CPU Adam variant
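
As a back-of-envelope guide for when these flags become necessary: mixed-precision Adam keeps roughly 18 bytes of model state per parameter (BF16 weight and gradient plus FP32 master weight and two optimizer moments), split across the TP × PP ranks; activations come on top, and --use-distributed-optimizer shards the optimizer states further across data-parallel ranks. A rough sketch under those assumptions:

# Rough per-GPU model-state memory in GB (weights + grads + Adam states)
PARAMS_B=70            # model size in billions of parameters
BYTES_PER_PARAM=18     # ~2+2 bf16 weight/grad + 4+4+4 fp32 master + Adam m,v
TP=4; PP=4
awk -v p=$PARAMS_B -v b=$BYTES_PER_PARAM -v tp=$TP -v pp=$PP \
  'BEGIN { printf "~%.0f GB per GPU before activations\n", p * b / (tp * pp) }'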

Issue: Training slower than expected

Check:

  • Network bottleneck: Ensure InfiniBand/NVLink enabled

  • Pipeline bubbles: Use interleaved pipeline schedule --num-layers-per-virtual-pipeline-stage 2

  • Data loading: Use fast data loader --dataloader-type cyclic

Issue: Diverging loss

Stabilize training:

--lr-warmup-iters 2000      # Longer warmup
--clip-grad 1.0             # Gradient clipping
--init-method-std 0.006     # Smaller init
--attention-dropout 0.0     # No dropout in attention
--hidden-dropout 0.0        # No dropout in FFN

Advanced topics

Parallelism strategies: See references/parallelism-guide.md for detailed comparison of TP/PP/DP/CP/EP with performance analysis and when to use each.

Performance benchmarks: See references/benchmarks.md for MFU numbers across different model sizes and GPU configurations.

Production configurations: See references/production-examples.md for real-world setups from LLaMA 3 405B, Nemotron-4 340B, and DeepSeek-V3 671B.

Training recipes: See references/training-recipes.md for complete hyperparameter configurations for GPT/LLaMA/Mixtral architectures.

Hardware requirements

  • GPU: NVIDIA Ampere+ (A100, H100, B200)
      • Turing works but is slower
      • FP8 requires Hopper/Ada/Blackwell
  • Network: InfiniBand or 400Gb+ Ethernet for multi-node
  • Memory per GPU:
      • 7B model: 40GB+
      • 70B model: 80GB (with TP=4)
      • 405B model: 80GB (with TP=8, PP=8)
  • Storage: fast NVMe for checkpoints (1TB+ for 70B+ models)
