llm-training

Frameworks and techniques for training and finetuning large language models.


Framework Comparison

| Framework | Best For | Multi-GPU | Memory Efficient |
|---|---|---|---|
| Accelerate | Simple distributed training | Yes | Basic |
| DeepSpeed | Large models, ZeRO | Yes | Excellent |
| PyTorch Lightning | Clean training loops | Yes | Good |
| Ray Train | Scalable, multi-node | Yes | Good |
| TRL | RLHF, reward modeling | Yes | Good |
| Unsloth | Fast LoRA finetuning | Limited | Excellent |

Accelerate (HuggingFace)

Minimal wrapper for distributed training. Run `accelerate config` for interactive setup.

Key concept: wrap the model, optimizer, and dataloader with `accelerator.prepare()`, and call `accelerator.backward(loss)` in place of `loss.backward()`.

DeepSpeed (Large Models)

Microsoft's optimization library for training massive models.

ZeRO Stages:

  • Stage 1: Optimizer states partitioned across GPUs

  • Stage 2: + Gradients partitioned

  • Stage 3: + Parameters partitioned (for largest models, 100B+)
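The memory effect of the stages can be sketched with back-of-envelope arithmetic (assuming mixed-precision Adam: 2 bytes fp16 weights + 2 bytes fp16 gradients + 12 bytes fp32 optimizer states per parameter, ignoring activations):

```python
# Rough per-GPU memory for model states under each ZeRO stage.
# Byte counts follow the standard mixed-precision Adam accounting;
# activations and buffers are not included.
def zero_gb_per_gpu(n_params, stage, n_gpus):
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= n_gpus    # Stage 1: partition optimizer states
    if stage >= 2:
        grads /= n_gpus    # Stage 2: also partition gradients
    if stage >= 3:
        weights /= n_gpus  # Stage 3: also partition parameters
    return n_params * (weights + grads + optim) / 1e9

# A 7B-parameter model on 8 GPUs:
print(zero_gb_per_gpu(7e9, stage=0, n_gpus=8))  # no ZeRO: 112 GB per GPU
print(zero_gb_per_gpu(7e9, stage=2, n_gpus=8))  # ZeRO-2: ~26 GB per GPU
print(zero_gb_per_gpu(7e9, stage=3, n_gpus=8))  # ZeRO-3: 14 GB per GPU
```

This is why a 7B model that cannot fit on a single 80 GB GPU with plain data parallelism becomes trainable at higher ZeRO stages.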

Key concept: configure via a JSON file; higher stages save more memory but add communication overhead.
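As a sketch, a minimal ZeRO-2 config might look like the following (shown as the Python dict equivalent of the JSON file; key names follow DeepSpeed's config schema, but the values here are arbitrary examples):

```python
# Illustrative DeepSpeed config; normally written as JSON and passed
# via the --deepspeed flag or the Trainer's deepspeed argument.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # partition optimizer states + gradients
        "offload_optimizer": {"device": "cpu"},  # optional CPU offload
    },
}
```

Switching to `"stage": 3` additionally partitions the parameters themselves, at the cost of more inter-GPU communication.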

TRL (RLHF/DPO)

HuggingFace library for reinforcement learning from human feedback.

Training types:

  • SFT (Supervised Finetuning): Standard instruction tuning

  • DPO (Direct Preference Optimization): Simpler than RLHF, uses preference pairs

  • PPO: Classic RLHF with reward model

Key concept: DPO is often preferred over PPO: it is simpler, needs no separate reward model, and trains directly on chosen/rejected response pairs.
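A toy numeric sketch of the DPO objective may help: the standard loss is -log σ(β·margin), where the margin is the difference in policy-vs-reference log-ratios between the chosen and rejected responses (the log-probabilities below are invented numbers, not from any real model):

```python
import math

# DPO loss on a single preference pair; all log-probs are made-up values.
def dpo_loss(chosen_logp, chosen_ref_logp,
             rejected_logp, rejected_ref_logp, beta=0.1):
    margin = (chosen_logp - chosen_ref_logp) - (rejected_logp - rejected_ref_logp)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log(sigmoid)

# Policy identical to reference: margin is 0, loss is ln 2.
print(dpo_loss(-12.0, -12.0, -15.0, -15.0))  # ≈ 0.693
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-10.0, -12.0, -17.0, -15.0))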

Unsloth (Fast LoRA)

Optimized LoRA finetuning: roughly 2x faster and around 60% less memory, per the project's own benchmarks.

Key concept: Drop-in replacement for standard LoRA with automatic optimizations. Best for 7B-13B models.
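The reason LoRA finetuning is so cheap is plain arithmetic (this is general LoRA math, not Unsloth-specific; the 4096×4096 layer shape below is illustrative):

```python
# LoRA factors the weight update of a d_out x d_in matrix as B @ A,
# with A: (rank, d_in) and B: (d_out, rank), so only rank * (d_in + d_out)
# parameters are trained per adapted matrix.
def lora_params(d_in, d_out, rank):
    return rank * (d_in + d_out)

full = 4096 * 4096                    # one full projection matrix
lora = lora_params(4096, 4096, 16)    # its rank-16 adapter
print(lora, lora / full)              # 131072 trainable params, < 1% of full
```

At rank 16 the adapter is under 1% of the matrix it adapts, which is why optimizer state and gradient memory shrink so dramatically.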

Memory Optimization Techniques

| Technique | Memory Savings | Trade-off |
|---|---|---|
| Gradient checkpointing | ~30-50% | Slower training |
| Mixed precision (fp16/bf16) | ~50% | Minor precision loss |
| 4-bit quantization (QLoRA) | ~75% | Some quality loss |
| Flash Attention | ~20-40% | Requires compatible GPU |
| Gradient accumulation | Effective batch ↑ | No memory cost |
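Gradient accumulation's effect is simple arithmetic (function name is illustrative): the optimizer steps once every N micro-batches, so the effective batch size grows without any extra activation memory.

```python
# Effective batch size under gradient accumulation: the optimizer sees
# the summed gradients of accum_steps micro-batches on each of n_gpus.
def effective_batch_size(micro_batch, accum_steps, n_gpus=1):
    return micro_batch * accum_steps * n_gpus

print(effective_batch_size(4, 8, n_gpus=2))  # → 64
```

The trade-off is wall-clock time, not memory: each optimizer step now requires `accum_steps` forward/backward passes.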

Decision Guide

| Scenario | Recommendation |
|---|---|
| Simple finetuning | Accelerate + PEFT |
| 7B-13B models | Unsloth (fastest) |
| 70B+ models | DeepSpeed ZeRO-3 |
| RLHF/DPO alignment | TRL |
| Multi-node cluster | Ray Train |
| Clean code structure | PyTorch Lightning |

