slurm

Help write, debug, and manage SLURM jobs. Use when the user asks about sbatch, salloc, squeue, job scripts, or cluster resource allocation.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the command below and send it to your AI assistant to install the skill:

Install skill "slurm" with this command: npx skills add michaelrizvi/claude-config/michaelrizvi-claude-config-slurm

SLURM Assistant

Help the user write job scripts, debug failed jobs, and manage cluster resources.

Job Script Guidelines

  • Always include: --job-name, --output, --error, --time, --mem, --gres (for GPUs), --cpus-per-task
  • Place scripts in a dedicated folder (e.g. scripts/)
  • Use set -euo pipefail in the bash portion
  • Log key info at the start: hostname, GPU info (nvidia-smi), date, git commit hash
  • Activate the correct virtual environment before running Python
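A minimal template following the checklist above (job name, paths, time, and the venv location are placeholders, not cluster defaults):

```shell
#!/bin/bash
#SBATCH --job-name=my-exp
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
#SBATCH --time=04:00:00
#SBATCH --mem=32G
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8

set -euo pipefail

# Log key info at the start
hostname
date
nvidia-smi
git rev-parse HEAD || true   # commit hash, if run from a git checkout

# Activate the environment before running Python (path is an example)
source "$HOME/venvs/myproject/bin/activate"

python train.py "$@"
```

`%x` and `%j` in the output paths expand to the job name and job ID, so logs from different runs don't clobber each other.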

Resource Allocation Rules

  • Small experiments (<1M params): 1 GPU, 4-8 CPUs, 16-32GB RAM
  • Medium experiments (1M-1B params): 1-2 GPUs, 8-16 CPUs, 32-64GB RAM
  • Large models (7B+): multiple GPUs, 64-128GB+ RAM
  • 32B+ inference: 4+ GPUs, match tensor parallelism to GPU count
  • Rule of thumb: ~4-8 CPUs per GPU, ~2x model size in FP16 for VRAM
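The rules of thumb above reduce to simple arithmetic; a sketch (weights-only estimate, so activations, optimizer state, and KV cache need headroom on top):

```shell
# FP16 stores ~2 bytes per parameter, so VRAM in GB ~= 2 x params in billions
params_b=7                 # model size in billions of parameters (example)
vram_gb=$((2 * params_b))  # 7B model -> ~14GB for weights alone

# ~4-8 CPUs per GPU; take the low end for 2 GPUs
gpus=2
cpus=$((4 * gpus))

echo "request >= ${vram_gb}GB VRAM and ${cpus} CPUs"
```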

Known GPU Types & Selection

GPU types (use with --gres=gpu:<type>:N)

  • a100: A100 40GB HBM2e
  • a100l: A100 80GB HBM2e
  • a6000: RTX A6000 48GB GDDR6
  • h100: H100 80GB HBM3
  • l40s: L40S 48GB GDDR6 (~45GB usable)
  • rtx8000: Quadro RTX 8000 48GB GDDR6
  • v100: V100 32GB HBM2

GPU selection by attribute

You can also request GPUs by memory, architecture, or feature:

  • By memory: --gres=gpu:48gb:1 (any 48GB GPU: RTX8000, A6000, L40S)
  • By arch: --gres=gpu:ampere:1 (A100, A6000, L40S)
  • By interconnect: --gres=gpu:nvlink:1
  • By system: --gres=gpu:dgx:1
  • Memory tags: 12gb, 32gb, 40gb, 48gb, 80gb
  • Arch tags: volta, turing, ampere
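For instance, the tags above compose with salloc/sbatch like this (CPU, memory, and time values are illustrative):

```shell
# interactive shell on any 48GB GPU (RTX8000, A6000, or L40S)
salloc --gres=gpu:48gb:1 --cpus-per-task=8 --mem=32G --time=2:00:00

# batch job pinned to two A100 80GB GPUs (script name is an example)
sbatch --gres=gpu:a100l:2 train.sbatch
```

Requesting by attribute (48gb, ampere) rather than by exact type usually shortens queue time, since more nodes can satisfy the request.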

Node Inventory

Nodes                  Count  GPUs                 CPUs  RAM
cn-l[001-091]          91     4x L40S (48GB)       48    1024GB
cn-c[001-040]          40     8x RTX8000 (48GB)    64    384GB
cn-g[001-029]          29     4x A100 (80GB)       64    1024GB
cn-a[001-011]          11     8x RTX8000 (48GB)    40    384GB
cn-b[001-005]          5      8x V100 (32GB)       40    384GB
cn-k[001-004]          4      4x A100 (40GB)       48    512GB
cn-n[001-002]          2      8x H100 (80GB)       192   2048GB
cn-d[001-004] (DGX)    4      8x A100 (40/80GB)    128   1024-2048GB
cn-j001                1      8x A6000 (48GB)      64    1024GB

Nodes have either 4 or 8 GPUs; don't request more GPUs than the node type provides.

Partitions & Preemption

Partition        Time Limit  Per-User Limits
long (default)   7 days      No per-user GPU cap
main             5 days      2 GPUs, 8 CPUs, 48GB
short            3 hours     4 GPUs, 1TB mem
unkillable       2 days      1 GPU, 6 CPUs, 32GB

Preemption hierarchy: unkillable > main > long. Preempted jobs are killed and automatically requeued. main jobs do NOT preempt other main jobs. The -grace partition variants send SIGTERM and allow a grace period before the kill. Checkpoint frequently on the long partition.
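A common pattern for surviving preemption is to trap the termination signal and checkpoint before exiting; a sketch (the checkpoint step is a placeholder for your framework's save call):

```shell
#!/bin/bash
#SBATCH --partition=long
#SBATCH --signal=B:TERM@120   # ask SLURM to send SIGTERM to this script 120s early

checkpoint_and_exit() {
    echo "caught SIGTERM, checkpointing..."
    # replace with your real checkpoint command
    touch "$SCRATCH/ckpt.flag"
    exit 0
}
trap checkpoint_and_exit TERM

# run the workload in the background so the trap can fire immediately;
# bash defers traps while a foreground command is running
python train.py --resume-if-exists &
wait $!
```

Pair this with resume-from-checkpoint logic in the training code, since requeued jobs start the script from the top.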

Storage

Path                        Quota             Key Policy
$HOME                       100GB / 1M files  Daily backup, low I/O; don't write logs here
$SCRATCH                    5TB / unlimited   Files unused >90 days deleted
$SLURM_TMPDIR               No quota          Fastest I/O, cleared after job
/network/projects/<group>/  1TB / 1M files    Shared project storage
$ARCHIVE                    5TB               No backup, not on GPU nodes

Always copy data to $SLURM_TMPDIR at job start for performance. Write logs/outputs to $SCRATCH, not $HOME. Check usage with disk-quota.
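The staging pattern described above, as it typically appears inside a job script (dataset and output paths are examples):

```shell
# stage input data onto fast node-local storage at job start
cp -r "$SCRATCH/datasets/cifar10" "$SLURM_TMPDIR/"

# train against the local copy, writing intermediate output locally too
python train.py --data "$SLURM_TMPDIR/cifar10" --out "$SLURM_TMPDIR/run"

# persist results to $SCRATCH before the job ends ($SLURM_TMPDIR is cleared)
cp -r "$SLURM_TMPDIR/run" "$SCRATCH/experiments/"
```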

Module System

  • module load python/3.10 — required before creating venvs on cluster
  • module load miniconda/3 — for conda environments
  • module avail / module spider <term> — search available modules
  • Pre-built PyTorch/TF modules exist for Mila GPUs
  • On login/CPU nodes without GPUs: CONDA_OVERRIDE_CUDA=11.8 before conda commands
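Putting the module steps together for a fresh virtual environment (venv path and package list are examples):

```shell
module load python/3.10                     # required before creating venvs
python -m venv "$HOME/venvs/myproject"
source "$HOME/venvs/myproject/bin/activate"
pip install --upgrade pip
pip install torch numpy
```

Job scripts then only need the `module load` and `source .../activate` lines, since the venv persists in $HOME.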

Debugging Failed Jobs

  • Check .err files first — experiment logs go to stderr
  • sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS,Elapsed,NodeList for completed jobs
  • Common issues: OOM (check MaxRSS), time limit, bad path, missing module/env
  • For OOM: check batch size, model size, gradient accumulation, and whether --mem was sufficient
  • torch.autograd.set_detect_anomaly(True) causes extreme filesystem IOPS — never leave on in batch jobs, admins will flag it

Monitoring

  • disk-quota — check storage usage
  • squeue -u $USER — your active jobs
  • echo $SLURM_JOB_GPUS — which GPU(s) your job got
  • Netdata per-node: <node>.server.mila.quebec:19999 (requires Mila wifi or SSH tunnel)
  • Grafana dashboard: dashboard.server.mila.quebec
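Off-network, the per-node Netdata page above is reachable through a standard SSH local forward; a sketch (node name and the `mila` SSH host alias are assumptions, substitute your own):

```shell
# forward local port 19999 to a compute node's Netdata endpoint
ssh -N -L 19999:cn-g001.server.mila.quebec:19999 mila
# then browse http://localhost:19999
```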

Limits

  • Max 1000 jobs per user in the system at any time

Safety

  • Never submit jobs (sbatch) without explicit user confirmation
  • Verify paths and configs before submission
  • Test on small instances first when possible

Scope

$ARGUMENTS

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals. No summaries provided by the upstream source; each is flagged Repository Source / Needs Review.

  • pytorch-debug
  • bash
  • review
  • experiment