# SkyPilot Multi-Cloud Orchestration

Comprehensive guide to running ML workloads across clouds with automatic cost optimization using SkyPilot.
## When to use SkyPilot

Use SkyPilot when:

- Running ML workloads across multiple clouds (AWS, GCP, Azure, etc.)
- Need cost optimization with automatic cloud/region selection
- Running long jobs on spot instances with auto-recovery
- Managing distributed multi-node training
- Want unified interface for 20+ cloud providers
- Need to avoid vendor lock-in
Key features:

- **Multi-cloud**: AWS, GCP, Azure, Kubernetes, Lambda, RunPod, 20+ providers
- **Cost optimization**: Automatic cheapest cloud/region selection
- **Spot instances**: 3-6x cost savings with automatic recovery
- **Distributed training**: Multi-node jobs with gang scheduling
- **Managed jobs**: Auto-recovery, checkpointing, fault tolerance
- **Sky Serve**: Model serving with autoscaling
Use alternatives instead:

- **Modal**: For simpler serverless GPU with Python-native API
- **RunPod**: For single-cloud persistent pods
- **Kubernetes**: For existing K8s infrastructure
- **Ray**: For pure Ray-based orchestration
## Quick start

### Installation

```bash
pip install "skypilot[aws,gcp,azure,kubernetes]"

# Verify cloud credentials
sky check
```

### Hello World

Create `hello.yaml`:

```yaml
resources:
  accelerators: T4:1

run: |
  nvidia-smi
  echo "Hello from SkyPilot!"
```

```bash
# Launch
sky launch -c hello hello.yaml

# SSH to cluster
ssh hello

# Terminate
sky down hello
```
## Core concepts

### Task YAML structure

```yaml
# Task name (optional)
name: my-task

# Resource requirements
resources:
  cloud: aws              # Optional: auto-select if omitted
  region: us-west-2       # Optional: auto-select if omitted
  accelerators: A100:4    # GPU type and count
  cpus: 8+                # Minimum CPUs
  memory: 32+             # Minimum memory (GB)
  use_spot: true          # Use spot instances
  disk_size: 256          # Disk size (GB)

# Number of nodes for distributed training
num_nodes: 2

# Working directory (synced to ~/sky_workdir)
workdir: .

# Setup commands (run once)
setup: |
  pip install -r requirements.txt

# Run commands
run: |
  python train.py
```
### Key commands

| Command | Purpose |
|---------|---------|
| `sky launch` | Launch cluster and run task |
| `sky exec` | Run task on existing cluster |
| `sky status` | Show cluster status |
| `sky stop` | Stop cluster (preserve state) |
| `sky down` | Terminate cluster |
| `sky logs` | View task logs |
| `sky queue` | Show job queue |
| `sky jobs launch` | Launch managed job |
| `sky serve up` | Deploy serving endpoint |
## GPU configuration

### Available accelerators

```yaml
# NVIDIA GPUs
accelerators: T4:1
accelerators: L4:1
accelerators: A10G:1
accelerators: L40S:1
accelerators: A100:4
accelerators: A100-80GB:8
accelerators: H100:8

# Cloud-specific
accelerators: V100:4      # AWS/GCP
accelerators: TPU-v4-8    # GCP TPUs
```

### GPU fallbacks

```yaml
resources:
  accelerators:
    H100: 8
    A100-80GB: 8
    A100: 8
  any_of:
    - cloud: gcp
    - cloud: aws
    - cloud: azure
```

### Spot instances

```yaml
resources:
  accelerators: A100:8
  use_spot: true
  spot_recovery: FAILOVER   # Auto-recover on preemption
```
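The spot savings factor quoted earlier ("3-6x") is simply the ratio of on-demand to spot hourly rates for the same instance type. A back-of-envelope sketch, using made-up prices (real rates vary by cloud, region, and time):

```python
# Hypothetical hourly rates for the same 8xA100 VM; illustrative only.
on_demand_rate = 32.77  # $/hr on demand
spot_rate = 9.83        # $/hr spot

savings_factor = on_demand_rate / spot_rate
print(round(savings_factor, 1))  # 3.3
```

Preemptions eat into this in practice (lost work between checkpoints, re-provisioning time), which is why the checkpointing patterns below matter.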
## Cluster management

### Launch and execute

```bash
# Launch new cluster
sky launch -c mycluster task.yaml

# Run on existing cluster (skip setup)
sky exec mycluster another_task.yaml

# Interactive SSH
ssh mycluster

# Stream logs
sky logs mycluster
```

### Autostop

```yaml
resources:
  accelerators: A100:4
  autostop:
    idle_minutes: 30
    down: true   # Terminate instead of stop
```

```bash
# Set autostop via CLI
sky autostop mycluster -i 30 --down
```

### Cluster status

```bash
# All clusters
sky status

# Detailed view
sky status -a
```
## Distributed training

### Multi-node setup

```yaml
resources:
  accelerators: A100:8

num_nodes: 4   # 4 nodes × 8 GPUs = 32 GPUs total

setup: |
  pip install torch torchvision

run: |
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
    --master_port=12355 \
    train.py
```
### Environment variables

| Variable | Description |
|----------|-------------|
| `SKYPILOT_NODE_RANK` | Node index (0 to num_nodes-1) |
| `SKYPILOT_NODE_IPS` | Newline-separated IP addresses |
| `SKYPILOT_NUM_NODES` | Total number of nodes |
| `SKYPILOT_NUM_GPUS_PER_NODE` | GPUs per node |
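Together these variables determine each process's place in the job: the first IP in `SKYPILOT_NODE_IPS` serves as the rendezvous address, and torchrun derives world size and global ranks from the node rank and GPU count. A sketch of that arithmetic, with illustrative values standing in for what SkyPilot would export on a 4-node, 8-GPU job:

```python
# Example values for the SkyPilot-provided variables; on a real cluster
# these come from the process environment (os.environ).
env = {
    "SKYPILOT_NUM_NODES": "4",
    "SKYPILOT_NUM_GPUS_PER_NODE": "8",
    "SKYPILOT_NODE_RANK": "2",
    "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2\n10.0.0.3\n10.0.0.4",
}

num_nodes = int(env["SKYPILOT_NUM_NODES"])
gpus_per_node = int(env["SKYPILOT_NUM_GPUS_PER_NODE"])
node_rank = int(env["SKYPILOT_NODE_RANK"])

# First IP is the head node -- what `head -n1` extracts for --master_addr.
master_addr = env["SKYPILOT_NODE_IPS"].splitlines()[0]

world_size = num_nodes * gpus_per_node
# Global rank of local process i on this node, as torchrun assigns it.
global_ranks = [node_rank * gpus_per_node + i for i in range(gpus_per_node)]

print(master_addr)       # 10.0.0.1
print(world_size)        # 32
print(global_ranks)      # [16, 17, 18, 19, 20, 21, 22, 23]
```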
### Head-node-only execution

```yaml
run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    python orchestrate.py
  fi
```
## Managed jobs

### Spot recovery

```bash
# Launch managed job with spot recovery
sky jobs launch -n my-job train.yaml
```

### Checkpointing

```yaml
name: training-job

file_mounts:
  /checkpoints:
    name: my-checkpoints
    store: s3
    mode: MOUNT

resources:
  accelerators: A100:8
  use_spot: true

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume-from-latest
```
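The `--resume-from-latest` flag here belongs to the user's own `train.py`, not SkyPilot. A hypothetical sketch of what such a flag might do on restart after a preemption: scan the mounted checkpoint directory and pick the newest checkpoint (file naming is illustrative):

```python
import tempfile
from pathlib import Path

def latest_checkpoint(ckpt_dir):
    """Return the checkpoint-<step>.pt path with the highest step, or None."""
    ckpts = sorted(
        Path(ckpt_dir).glob("checkpoint-*.pt"),
        key=lambda p: int(p.stem.split("-")[1]),
    )
    return ckpts[-1] if ckpts else None

# Demo against a throwaway directory standing in for /checkpoints.
with tempfile.TemporaryDirectory() as d:
    for step in (100, 2000, 500):
        (Path(d) / f"checkpoint-{step}.pt").touch()
    print(latest_checkpoint(d).name)  # checkpoint-2000.pt
```

Because `/checkpoints` is backed by the bucket, checkpoints written before a preemption are visible to the recovered replica.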
### Job management

```bash
# List jobs
sky jobs queue

# View logs
sky jobs logs my-job

# Cancel job
sky jobs cancel my-job
```
## File mounts and storage

### Local file sync

```yaml
workdir: ./my-project   # Synced to ~/sky_workdir

file_mounts:
  /data/config.yaml: ./config.yaml
  ~/.vimrc: ~/.vimrc
```

### Cloud storage

```yaml
file_mounts:
  # Mount S3 bucket
  /datasets:
    source: s3://my-bucket/datasets
    mode: MOUNT          # Stream from S3

  # Copy GCS bucket
  /models:
    source: gs://my-bucket/models
    mode: COPY           # Pre-fetch to disk

  # Cached mount (fast writes)
  /outputs:
    name: my-outputs
    store: s3
    mode: MOUNT_CACHED
```

### Storage modes

| Mode | Description | Best for |
|------|-------------|----------|
| `MOUNT` | Stream from cloud | Large datasets, read-heavy |
| `COPY` | Pre-fetch to disk | Small files, random access |
| `MOUNT_CACHED` | Cache with async upload | Checkpoints, outputs |
## Sky Serve (Model Serving)

### Basic service

```yaml
# service.yaml
service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0

resources:
  accelerators: A100:1

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000
```

```bash
# Deploy
sky serve up -n my-service service.yaml

# Check status
sky serve status

# Get endpoint
sky serve status my-service
```

### Autoscaling policies

```yaml
service:
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300
  load_balancing_policy: round_robin
```
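The core of a QPS-based policy like the one above is straightforward: provision enough replicas to serve the observed load at `target_qps_per_replica`, clamped to the configured bounds. An illustrative sketch (not SkyPilot's actual implementation, which also applies the up/downscale delays to smooth decisions):

```python
import math

def desired_replicas(observed_qps: float,
                     target_qps_per_replica: float = 2.0,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Replica count needed to serve observed_qps, clamped to [min, max]."""
    needed = math.ceil(observed_qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(0.5))   # 1  (floored at min_replicas)
print(desired_replicas(7.0))   # 4  (ceil(7 / 2))
print(desired_replicas(50.0))  # 10 (capped at max_replicas)
```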
## Cost optimization

### Automatic cloud selection

```yaml
# SkyPilot finds cheapest option
resources:
  accelerators: A100:8
  # No cloud specified - auto-select cheapest
```

```bash
# Show optimizer decision
sky launch task.yaml --dryrun
```

### Cloud preferences

```yaml
resources:
  accelerators: A100:8
  any_of:
    - cloud: gcp
      region: us-central1
    - cloud: aws
      region: us-east-1
    - cloud: azure
```
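Conceptually, the optimizer's decision reduces to: enumerate the (cloud, region) offerings that satisfy the resource request, then pick the cheapest. A toy sketch with made-up prices (the real optimizer draws from a maintained catalog of current rates):

```python
# Hypothetical 8xA100 hourly rates; real prices vary and change over time.
candidates = [
    ("aws", "us-east-1", 32.77),
    ("gcp", "us-central1", 29.39),
    ("azure", "eastus", 27.20),
]

cloud, region, price = min(candidates, key=lambda c: c[2])
print(f"{cloud}/{region} at ${price:.2f}/hr")  # azure/eastus at $27.20/hr
```

`sky launch --dryrun` shows this comparison table without provisioning anything.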
## Environment variables

```yaml
envs:
  HF_TOKEN: $HF_TOKEN             # Inherited from local env
  WANDB_API_KEY: $WANDB_API_KEY

# Or use secrets
secrets:
  - HF_TOKEN
  - WANDB_API_KEY
```
## Common workflows

### Workflow 1: Fine-tuning with checkpoints

```yaml
name: llm-finetune

file_mounts:
  /checkpoints:
    name: finetune-checkpoints
    store: s3
    mode: MOUNT_CACHED

resources:
  accelerators: A100:8
  use_spot: true

setup: |
  pip install transformers accelerate

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume
```

### Workflow 2: Hyperparameter sweep

```yaml
name: hp-sweep-${RUN_ID}

envs:
  RUN_ID: 0
  LEARNING_RATE: 1e-4
  BATCH_SIZE: 32

resources:
  accelerators: A100:1
  use_spot: true

run: |
  python train.py \
    --lr $LEARNING_RATE \
    --batch-size $BATCH_SIZE \
    --run-id $RUN_ID
```

```bash
# Launch multiple jobs
for i in {1..10}; do
  sky jobs launch sweep.yaml \
    --env RUN_ID=$i \
    --env LEARNING_RATE=$(python -c "import random; print(10**random.uniform(-5,-3))")
done
```
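The inline `python -c` above samples learning rates log-uniformly in [1e-5, 1e-3], the standard choice for hyperparameters that span orders of magnitude. Expanded and seeded for reproducible sweeps:

```python
import random

def log_uniform(low_exp: float, high_exp: float, rng: random.Random) -> float:
    """Sample 10**u with u drawn uniformly from [low_exp, high_exp]."""
    return 10 ** rng.uniform(low_exp, high_exp)

rng = random.Random(0)  # fixed seed so the sweep is reproducible
lrs = [log_uniform(-5, -3, rng) for _ in range(10)]
for lr in lrs:
    assert 1e-5 <= lr <= 1e-3
print(lrs)
```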
## Debugging

```bash
# SSH to cluster
ssh mycluster

# View logs
sky logs mycluster

# Check job queue
sky queue mycluster

# View managed job logs
sky jobs logs my-job
```

### Common issues

| Issue | Solution |
|-------|----------|
| Quota exceeded | Request quota increase, try different region |
| Spot preemption | Use `sky jobs launch` for auto-recovery |
| Slow file sync | Use `MOUNT_CACHED` mode for outputs |
| GPU not available | Use `any_of` for fallback clouds |
## References

- Advanced Usage - Multi-cloud, optimization, production patterns
- Troubleshooting - Common issues and solutions

## Resources

- Documentation: https://docs.skypilot.co
- Slack: https://slack.skypilot.co
- Examples: https://github.com/skypilot-org/skypilot/tree/master/examples