Alliance Canada HPC for ML Researchers
This skill helps you write correct Slurm job scripts, set up Python environments, choose the right GPU cluster, manage data, and scale training on Alliance Canada (formerly Compute Canada) infrastructure.
Alliance clusters use Slurm for scheduling, Lmod for software modules, and virtualenv (never Conda) for Python environments. Pre-built Python wheels are available for most ML packages.
Quick Reference
Connect to a cluster
ssh username@trillium.alliancecan.ca
Cluster login nodes: narval.alliancecan.ca, cedar.alliancecan.ca, trillium.alliancecan.ca, graham.alliancecan.ca, fir.alliancecan.ca, nibi.alliancecan.ca, rorqual.alliancecan.ca
Connect IDE to a compute node (Claude Code, Cursor, VSCode, Codex)
Do NOT run these tools on login nodes; always use a compute node. They are banned on Fir and tamIA and should be avoided on the login nodes of every cluster.
Never request GPUs for IDE sessions — these tools don't use GPU compute. Use minimal CPU resources (2 CPUs, 4G RAM).
Internet required: Claude Code and Codex need internet to reach their APIs. Many clusters block internet on compute nodes. Clusters with internet: Fir, Nibi, Vulcan, Killarney. Clusters without: Narval (proxy blocks api.anthropic.com), Trillium, Cedar, Graham. See references/remote-development.md for details.
# One-command workflow (recommended, prompts for time):
cluster-claude fir def-yourpi # Claude Code (Fir — has internet)
cluster-claude killarney aip-yourpi # Killarney (aip- accounts)
cluster-cursor narval def-yourpi # Cursor/VSCode
# Manual workflow:
ssh narval # 1. login node
salloc --time=3:00:00 --mem=4G --account=def-yourpi # 2. reserve compute node
srun --pty bash # 3. shell on compute node
# Or from local: ssh -t nc10305 claude (Narval/Rorqual only, needs ProxyJump)
See references/remote-development.md for the cluster-claude/cluster-cursor scripts, SSH ProxyJump setup, per-cluster node prefixes, and Vector Institute workflow.
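For the manual workflow, a ProxyJump setup along these lines in your local `~/.ssh/config` lets an IDE reach a compute node in one hop (a sketch: `username` is a placeholder and the `nc*` pattern is an illustrative Narval node-name prefix; see references/remote-development.md for the actual per-cluster prefixes):

```
Host narval
    HostName narval.alliancecan.ca
    User username

# Match Narval compute nodes by name prefix and hop through the login node
Host nc*
    ProxyJump narval
    User username
```

With this in place, `ssh -t nc10305 claude` from your laptop jumps through the login node to the reserved compute node.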
Set up a Python environment
module load python/3.11
virtualenv --no-download ~/ENV
source ~/ENV/bin/activate
pip install --no-index --upgrade pip
pip install --no-index torch torchvision
# If using HuggingFace datasets/evaluate: load arrow BEFORE install
module load gcc arrow
pip install --no-index datasets evaluate
The --no-index flag uses Alliance pre-built wheels (optimized for cluster hardware). Never use Conda/Anaconda on these clusters.
Important: datasets and evaluate depend on pyarrow, which is provided by the arrow module. You must module load gcc arrow before installing them and every time you activate your virtualenv to use them.
Test before you train
Always run a short test job before submitting long training runs. This catches module issues, data path errors, and GPU visibility problems before you burn hours of allocation.
# Quick GPU test (5-10 min) — works on most clusters
sbatch --time=0:10:00 --gpus-per-node=h100:1 --cpus-per-task=6 \
--mem=32000M --account=def-yourpi train.sh
# Trillium: use the dedicated debugjob command (fast-start, up to 2h for 1 GPU)
debugjob -g 1
# tamIA: whole GPU nodes only (4×H100 or 8×H200, no partial)
sbatch --time=0:10:00 --gpus=h100:4 --account=aip-yourpi train.sh
Most clusters allow a minimum of 5 minutes for test jobs (vs. 1 hour for regular jobs). See references/best-practices.md for a pre-flight checklist.
After any job completes, run seff <jobid> to check actual time/memory usage and right-size future jobs. See references/best-practices.md for the full extrapolation method.
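The pre-flight idea can be sketched as a small script run inside the short test job (a sketch: `DATA_DIR` and the `$SCRATCH/data` default are placeholder paths to adapt, and the GPU check is skipped gracefully when PyTorch is absent):

```python
"""Minimal pre-flight smoke test -- run in a short test job before long runs."""
import os
import sys


def check(name, ok, detail=""):
    """Print a PASS/FAIL line and return the boolean unchanged."""
    print(f"[{'OK  ' if ok else 'FAIL'}] {name} {detail}")
    return ok


def main():
    results = []
    # 1. Python version the job actually picked up (module load worked)
    results.append(check("python", sys.version_info >= (3, 9),
                         sys.version.split()[0]))
    # 2. Fast node-local storage, set by Slurm on compute nodes
    tmpdir = os.environ.get("SLURM_TMPDIR")
    results.append(check("SLURM_TMPDIR", tmpdir is not None, tmpdir or "unset"))
    # 3. Dataset path exists (placeholder -- point DATA_DIR at your real data)
    data_dir = os.environ.get("DATA_DIR", os.path.expandvars("$SCRATCH/data"))
    results.append(check("data dir", os.path.isdir(data_dir), data_dir))
    # 4. GPU visible to PyTorch (skip if torch is not installed here)
    try:
        import torch
        ok = torch.cuda.is_available()
        results.append(check("cuda", ok,
                             torch.cuda.get_device_name(0) if ok else ""))
    except ImportError:
        print("[SKIP] torch not installed in this environment")
    return 0 if all(results) else 1


if __name__ == "__main__":
    print("smoke test", "passed" if main() == 0 else "FAILED")
```

A non-zero summary here means fix the environment before submitting the real job.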
Submit a basic GPU job
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --gpus-per-node=a100:1
#SBATCH --cpus-per-task=6
#SBATCH --mem=32000M
#SBATCH --time=0-03:00
#SBATCH --output=%N-%j.out
module load python/3.11
source ~/ENV/bin/activate
python train.py
Submit with sbatch train_job.sh. Check status with sq (an alias that lists only your own jobs).
GPU specifiers by cluster
| Cluster | GPU | Slurm specifier |
|---|---|---|
| Trillium | H100 80GB | h100 |
| Fir | H100 80GB | h100 |
| Nibi | H100 80GB | h100 |
| Rorqual | H100 80GB | h100 |
| Killarney | H100 80GB / L40S 48GB | h100 / l40s |
| Narval | A100 40GB | a100 |
| tamIA | H100 80GB / H200 | h100 / h200 |
| Vulcan | L40S 48GB | l40s |
Use: --gpus-per-node=h100:1 (or a100:1, l40s:1, etc.)
Storage tiers
| Filesystem | Purpose | Quota | Backed up | Purged |
|---|---|---|---|---|
| $HOME | Code, scripts, small configs | 50 GB | Yes | No |
| $SCRATCH | Large temp files, checkpoints | 20 TB | No | 60 days |
| $PROJECT | Shared datasets, results | 1 TB (expandable) | Yes | No |
| $SLURM_TMPDIR | Fast local node storage (per job) | Varies | No | Job end |
| Nearline | Long-term archive | 2 TB | Yes | No |
For ML datasets: copy to $SLURM_TMPDIR at job start for best I/O performance.
Path conventions: Projects are named after your PI: def-piname (default), rrg-piname (RAC). Symlink layout differs by cluster — always use $SCRATCH and $PROJECT env vars in scripts, not hardcoded paths. See references/storage-data.md for per-cluster details.
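The staging pattern above can be sketched as a job-script fragment (assumptions: inside a real job `$SLURM_TMPDIR` is set by Slurm, so a temp dir stands in here to make the sketch run anywhere, and a tiny demo archive stands in for a real dataset tarball on `$PROJECT`):

```shell
#!/bin/bash
# Pattern: archive many small files once, extract to fast node-local storage
# at every job start, then read only from the local copy.
set -euo pipefail

# Slurm sets SLURM_TMPDIR on compute nodes; fall back to a temp dir elsewhere.
DEST="${SLURM_TMPDIR:-$(mktemp -d)}"

# --- one-time, on $PROJECT or $SCRATCH: pack the dataset ---
# (demo archive so the sketch is self-contained; use your real data dir)
SRC=$(mktemp -d)
echo "sample" > "$SRC/img000.txt"
tar -cf "$DEST/dataset.tar" -C "$SRC" .

# --- at every job start: extract once to node-local storage ---
mkdir -p "$DEST/data"
tar -xf "$DEST/dataset.tar" -C "$DEST/data"
export DATA_DIR="$DEST/data"
echo "staged $(ls "$DATA_DIR" | wc -l) file(s) in $DATA_DIR"
```

Point your training script at `$DATA_DIR` so all per-epoch reads hit the fast local disk instead of the parallel filesystem.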
Check disk usage
diskusage_report
When to read reference files
The sections below point to detailed reference files. Read the one that matches the user's task:
Remote development (Claude Code, Cursor, VSCode, Codex on clusters)
Read references/remote-development.md when the user needs help with:
- Connecting Claude Code, Cursor, VSCode, or Codex to a cluster
- Why IDE/AI tools must NOT run on login nodes (banned on Fir, tamIA — avoid on all)
- Setting up SSH ProxyJump to reach compute nodes from local IDE
- The `salloc` → get node → connect workflow
- Helper sbatch scripts (`remote-dev.sh`, `remote-dev-gpu.sh`) for long sessions
- VSCode/Cursor remote machine settings (required on all clusters to reduce load)
- Saving SLURM environment variables for IDE sessions
- Vector Institute Killarney Jupyter/SSH workflow (vec-playbook)
Getting started (account, SSH, MFA)
Read references/getting-started.md when the user needs help with:
- Creating an account, first-time setup
- SSH connections, keys, MFA
- Basic Linux orientation on clusters
Python environment setup
Read references/python-env.md when the user needs help with:
- Installing Python packages (pip, wheels)
- Virtualenv creation and activation
- Why not Conda, and what to do instead
- SciPy stack, available wheels
- Creating virtualenvs inside jobs ($SLURM_TMPDIR)
GPU job submission
Read references/gpu-jobs.md when the user needs help with:
- Writing GPU job scripts (single/multi-GPU)
- Choosing GPU types and specifiers
- MIG (Multi-Instance GPU) partitions
- CPU/memory ratios per GPU
- Monitoring GPU jobs (nvidia-smi, nvtop)
Storage and data management
Read references/storage-data.md when the user needs help with:
- Choosing where to store datasets
- Filesystem paths per cluster ($SCRATCH, $PROJECT, symlink layout differences)
- PI project naming (def-piname, rrg-piname)
- Handling large collections of small files (tar, zip)
- Using $SLURM_TMPDIR for fast local I/O
- Transferring data (Globus, scp, rsync)
- Scratch purging policies
Distributed training
Read references/distributed-training.md when the user needs help with:
- Multi-GPU training on a single node
- Multi-node distributed training
- PyTorch DDP (DistributedDataParallel)
- DeepSpeed (ZeRO stages, config)
- torchrun launcher with Slurm
- NCCL environment variables
Cluster selection guide
Read references/clusters.md when the user needs help with:
- Which cluster to use for their workload
- Cluster specs (nodes, GPUs, memory, network)
- Trillium vs Narval vs Cedar vs others
- Whole-node scheduling (Trillium) vs per-core (others)
Job management and monitoring
Read references/job-management.md when the user needs help with:
- Slurm directives (time, memory, account)
- Job arrays for hyperparameter sweeps
- Monitoring jobs (squeue, sacct, sstat)
- Checkpointing long training runs
- Experiment tracking (W&B, MLflow)
- W&B per-cluster availability and offline workflow
- JupyterHub (Fir, Narval, Rorqual)
HuggingFace ecosystem
Read references/huggingface.md when the user needs help with:
- Installing transformers, datasets, evaluate, accelerate
- Downloading and caching models (git-lfs, hf CLI, Python)
- HF_TOKEN setup for gated models (Llama, Gemma, Mistral)
- Offline mode and environment variables (HF_HOME, TRANSFORMERS_CACHE)
- HuggingFace Accelerate for multi-GPU / multi-node training
- Fine-tuning LLMs with FSDP
- Using pipelines and tokenizers offline
- Checking dataset configs and splits before loading
Data formats (Arrow, Parquet)
Read references/data-formats.md when the user needs help with:
- Loading the Arrow module (required for datasets/evaluate)
- PyArrow with NumPy, Pandas, Parquet
- Converting CSV to Parquet for efficient storage
- CUDA-accelerated Arrow
- Data format selection for ML workloads
Containers (Apptainer)
Read references/containers.md when the user needs help with:
- Running software in containers on HPC (Apptainer/Singularity)
- GPU access inside containers (`--nv` flag)
- Bind mounts for cluster filesystems (`-B /project`, `-B /scratch`)
- Building SIF images from Docker images
- Using Conda/Micromamba inside containers
- Apptainer cache management
vLLM inference serving
Read references/vllm.md when the user needs help with:
- Installing and running vLLM on Alliance clusters
- Single-node LLM inference with tensor parallelism
- Multi-node inference with Ray
- Downloading and caching HuggingFace models for vLLM
ML best practices
Read references/best-practices.md when the user needs help with:
- Testing jobs before long runs (test job examples, per-cluster policies, pre-flight checklist, `debugjob` on Trillium)
- Job design (splitting training, right-sizing resources)
- Data I/O optimization (small files problem, $SLURM_TMPDIR)
- Checkpointing and auto-resubmission patterns
- Memory management (gradient checkpointing, mixed precision)
- Experiment organization and reproducibility
- Common anti-patterns to avoid
Common pitfalls
- Running VSCode/Claude Code/Cursor/Codex on login nodes: These tools are resource-heavy and will degrade the shared login node for everyone. Explicitly banned on Fir and tamIA, but should be avoided on all clusters. Always request a compute node with `salloc` first, then connect your tool to that node. See `references/remote-development.md`.
- Using Conda: Alliance clusters provide optimized wheels. Use `virtualenv` + `pip install --no-index`. Conda causes library conflicts and wastes quota.
- Not using `--no-index`: Without it, pip downloads from PyPI instead of using pre-built cluster wheels, which can cause CUDA mismatches.
- Forgetting `--account`: If you belong to multiple allocations, you must specify `--account=def-yourpi`.
- Storing datasets in `$HOME`: Home is only 50 GB. Use `$PROJECT` for persistent datasets, `$SCRATCH` for temporary large files.
- Reading many small files from `$PROJECT`/`$SCRATCH`: Parallel filesystems are slow with many small files. Archive them with `tar` and extract to `$SLURM_TMPDIR` at job start.
- Not checkpointing: Jobs have wall-time limits. Save checkpoints regularly so you can resume. Split a 3-day training run into three 24-hour jobs.
- Requesting too much time: Shorter jobs get scheduled faster. Request only what you need.
- H100 clusters need torch >= 2.5.1: On Trillium/Fir/Nibi, older PyTorch versions won't work with H100 GPUs.
- Using Docker directly: Docker is not available on Alliance HPC clusters (security reasons). Use Apptainer instead. You can convert Docker images to Apptainer SIF files: `apptainer build image.sif docker://...`
- Forgetting `module load gcc arrow` for datasets/evaluate: These packages depend on `pyarrow`, which is provided by a system module, not a pip package. Load `gcc arrow` before installing and every time you use them, or you'll get `ModuleNotFoundError: No module named 'pyarrow'`.
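The checkpointing pitfall can be illustrated with a minimal resumable loop (stdlib-only sketch: the JSON "state" and `1.0 / step` arithmetic stand in for a real framework checkpoint such as a PyTorch state_dict and optimizer step):

```python
import json
import os


def train(total_steps, ckpt_path, save_every=100):
    """Toy resumable loop: continue from the last checkpoint if one exists."""
    step, loss = 0, None
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
        step, loss = state["step"], state["loss"]
        print(f"resuming from step {step}")
    while step < total_steps:
        step += 1
        loss = 1.0 / step  # stand-in for a real optimizer step
        if step % save_every == 0 or step == total_steps:
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"step": step, "loss": loss}, f)
            os.replace(tmp, ckpt_path)  # atomic: never a half-written file
    return step, loss
```

If a job is killed at its wall-time limit, the next job in the chain resumes from the last saved step instead of step 0, so a long run can be split into several short jobs.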