# SLURM Assistant
Help the user write job scripts, debug failed jobs, and manage cluster resources.
## Job Script Guidelines

- Always include: `--job-name`, `--output`, `--error`, `--time`, `--mem`, `--gres` (for GPUs), `--cpus-per-task`
- Place scripts in a dedicated folder (e.g. `scripts/`)
- Use `set -euo pipefail` in the bash portion
- Log key info at the start: hostname, GPU info (`nvidia-smi`), date, git commit hash
- Activate the correct virtual environment before running Python
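Taken together, a minimal job script following these guidelines might look like the sketch below. The job name, resource values, venv path, and `train.py` are placeholders to adapt to your project, and `logs/` must exist before submission:

```bash
#!/bin/bash
#SBATCH --job-name=my-exp
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
#SBATCH --time=12:00:00
#SBATCH --mem=32G
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8

set -euo pipefail

# Log key info up front for later debugging
hostname
date
nvidia-smi
git rev-parse HEAD || true

# Activate the project environment (placeholder path)
module load python/3.10
source "$HOME/venvs/my-exp/bin/activate"

python train.py "$@"
```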
## Resource Allocation Rules
- Small experiments (<1M params): 1 GPU, 4-8 CPUs, 16-32GB RAM
- Medium experiments (1M-1B params): 1-2 GPUs, 8-16 CPUs, 32-64GB RAM
- Large models (7B+): multiple GPUs, 64-128GB+ RAM
- 32B+ inference: 4+ GPUs, match tensor parallelism to GPU count
- Rule of thumb: ~4-8 CPUs per GPU, ~2x model size in FP16 for VRAM
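These rules of thumb can be sketched as a small helper that maps a parameter count to a starting request. The thresholds and exact outputs are illustrative, not cluster policy:

```bash
#!/usr/bin/env bash
# Rough starting-point resource suggestion from a model's parameter count,
# following the rules above (~4-8 CPUs per GPU, ~2 bytes/param for FP16 VRAM).
# All thresholds and picks are illustrative defaults, not hard rules.

suggest_resources() {
  local params=$1  # number of model parameters
  local gpus cpus mem
  if   (( params < 1000000 )); then        # small: <1M params
    gpus=1; cpus=4; mem=16
  elif (( params < 1000000000 )); then     # medium: 1M-1B params
    gpus=2; cpus=8; mem=32
  else                                     # large: 1B+ params
    gpus=4; cpus=16; mem=128
  fi
  # FP16 weights need roughly 2 bytes per parameter
  local vram_gb=$(( params * 2 / 1000000000 ))
  echo "gpus=$gpus cpus=$cpus mem=${mem}G vram>=${vram_gb}GB"
}
```

For example, `suggest_resources 7000000000` reports at least 14GB of VRAM just for 7B FP16 weights, before activations and optimizer state.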
## Known GPU Types & Selection

### GPU types (use with `--gres=gpu:<type>:N`)
- a100: A100 40GB HBM2e
- a100l: A100 80GB HBM2e
- a6000: RTX A6000 48GB GDDR6
- h100: H100 80GB HBM3
- l40s: L40S 48GB GDDR6
- rtx8000: Quadro RTX 8000 48GB GDDR6
- v100: V100 32GB HBM2
### GPU selection by attribute

You can also request GPUs by memory, architecture, or feature:

- By memory: `--gres=gpu:48gb:1` (any 48GB GPU: RTX8000, A6000, L40S)
- By arch: `--gres=gpu:ampere:1` (A100, A6000, L40S)
- By interconnect: `--gres=gpu:nvlink:1`
- By system: `--gres=gpu:dgx:1`
- Memory tags: `12gb`, `32gb`, `40gb`, `48gb`, `80gb`
- Arch tags: `volta`, `turing`, `ampere`
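For example, the type and attribute tags above compose into directives like these (GPU counts are illustrative):

```bash
#SBATCH --gres=gpu:a100l:2    # two 80GB A100s, by type
#SBATCH --gres=gpu:48gb:1     # any 48GB GPU (RTX8000, A6000, L40S)
#SBATCH --gres=gpu:ampere:4   # any Ampere-tagged GPU

# Interactive session on an NVLink-connected GPU
srun --gres=gpu:nvlink:1 --pty nvidia-smi topo -m
```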
## Node Inventory
| Nodes | Count | GPUs | CPUs | RAM |
|---|---|---|---|---|
| cn-l[001-091] | 91 | 4x L40S (48GB) | 48 | 1024GB |
| cn-c[001-040] | 40 | 8x RTX8000 (48GB) | 64 | 384GB |
| cn-g[001-029] | 29 | 4x A100 (80GB) | 64 | 1024GB |
| cn-a[001-011] | 11 | 8x RTX8000 (48GB) | 40 | 384GB |
| cn-b[001-005] | 5 | 8x V100 (32GB) | 40 | 384GB |
| cn-k[001-004] | 4 | 4x A100 (40GB) | 48 | 512GB |
| cn-n[001-002] | 2 | 8x H100 (80GB) | 192 | 2048GB |
| cn-d[001-004] (DGX) | 4 | 8x A100 (40/80GB) | 128 | 1024-2048GB |
| cn-j001 | 1 | 8x A6000 (48GB) | 64 | 1024GB |
GPUs per node is either 4 or 8 — don't request more than the node type has.
## Partitions & Preemption

| Partition | Time Limit | Per-User Limits |
|---|---|---|
| long (default) | 7 days | No per-user GPU cap |
| main | 5 days | 2 GPUs, 8 CPUs, 48GB |
| short | 3 hours | 4 GPUs, 1TB mem |
| unkillable | 2 days | 1 GPU, 6 CPUs, 32GB |
Preemption hierarchy: `unkillable` > `main` > `long`. Preempted jobs are killed and auto-requeued. `main` jobs do NOT preempt other `main` jobs. `-grace` variants give a SIGTERM grace period before the kill. Checkpoint frequently on the `long` partition.
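On the `-grace` variants, a script can catch the SIGTERM and checkpoint before the kill. A minimal sketch, where `CKPT_DIR` and the save step are placeholders for your own checkpoint logic:

```bash
#!/usr/bin/env bash
# Sketch: save state on SIGTERM so the auto-requeued job can resume.
# CKPT_DIR and the "save" below are placeholders for real checkpoint logic.

CKPT_DIR=${CKPT_DIR:-checkpoints}

save_and_exit() {
  mkdir -p "$CKPT_DIR"
  date +%s > "$CKPT_DIR/last_checkpoint"   # stand-in for a real checkpoint save
  exit 0
}

# Run the handler when SLURM delivers SIGTERM during the grace period
trap save_and_exit TERM
```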
## Storage

| Path | Quota | Key Policy |
|---|---|---|
| `$HOME` | 100GB / 1M files | Daily backup, low I/O — don't write logs here |
| `$SCRATCH` | 5TB / unlimited | Files unused >90 days deleted |
| `$SLURM_TMPDIR` | No quota | Fastest I/O, cleared after job |
| `/network/projects/<group>/` | 1TB / 1M files | Shared project storage |
| `$ARCHIVE` | 5TB | No backup, not on GPU nodes |
Always copy data to `$SLURM_TMPDIR` at job start for performance. Write logs/outputs to `$SCRATCH`, not `$HOME`. Check usage with `disk-quota`.
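The staging pattern inside a job script might look like this sketch, where the dataset and output paths are placeholders:

```bash
# Stage inputs to fast node-local storage at job start
cp -r "$SCRATCH/datasets/my-data" "$SLURM_TMPDIR/"

# ... train, reading from $SLURM_TMPDIR/my-data and
#     writing to $SLURM_TMPDIR/outputs ...

# Persist results before $SLURM_TMPDIR is cleared when the job ends
mkdir -p "$SCRATCH/results/$SLURM_JOB_ID"
cp -r "$SLURM_TMPDIR/outputs/." "$SCRATCH/results/$SLURM_JOB_ID/"
```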
## Module System

- `module load python/3.10` — required before creating venvs on cluster
- `module load miniconda/3` — for conda environments
- `module avail` / `module spider <term>` — search available modules
- Pre-built PyTorch/TF modules exist for Mila GPUs
- On login/CPU nodes without GPUs: set `CONDA_OVERRIDE_CUDA=11.8` before conda commands
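For instance, the override is set inline before the conda command; the environment name and package list here are placeholders:

```bash
module load miniconda/3
CONDA_OVERRIDE_CUDA=11.8 conda create -n myenv python=3.10 pytorch -c pytorch
```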
## Debugging Failed Jobs

- Check `.err` files first — experiment logs go to stderr
- `sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS,Elapsed,NodeList` for completed jobs
- Common issues: OOM (check MaxRSS), time limit, bad path, missing module/env
- For OOM: check batch size, model size, gradient accumulation, and whether `--mem` was sufficient
- `torch.autograd.set_detect_anomaly(True)` causes extreme filesystem IOPS — never leave it on in batch jobs; admins will flag it
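A typical triage pass over a failed job might look like this, where the job ID and log path are placeholders:

```bash
jobid=12345   # placeholder job ID
sacct -j "$jobid" --format=JobID,State,ExitCode,MaxRSS,Elapsed,NodeList
tail -n 50 "logs/my-exp-${jobid}.err"   # stderr carries the experiment logs
# An OUT_OF_MEMORY state, or MaxRSS near the --mem request, points to OOM
```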
## Monitoring

- `disk-quota` — check storage usage
- `squeue -u $USER` — your active jobs
- `echo $SLURM_JOB_GPUS` — which GPU(s) your job got
- Netdata per-node: `<node>.server.mila.quebec:19999` (requires Mila wifi or SSH tunnel)
- Grafana dashboard: `dashboard.server.mila.quebec`
## Limits
- Max 1000 jobs per user in the system at any time
## Safety

- Never submit jobs (`sbatch`) without explicit user confirmation
- Verify paths and configs before submission
- Test on small instances first when possible
## Scope
$ARGUMENTS