Use when:

- Setting up BigQuery ML for SQL-based machine learning
- Configuring Vertex AI custom training jobs
- Setting up GCP authentication for ML workflows
- Selecting appropriate GPU/TPU configurations
- Estimating costs for GCP ML training
- Deploying models to Vertex AI endpoints
- Configuring distributed training on GCP
- Optimizing cost vs. performance for cloud ML
## Platform Overview

### BigQuery ML

**What it is:** SQL-based machine learning directly in BigQuery

**Best for:**

- Quick ML prototypes using existing data warehouse data
- Classification, regression, and forecasting on structured data
- Users familiar with SQL but not Python/ML frameworks
- Large-scale batch predictions

**Available Models:**

- Linear/Logistic Regression
- XGBoost (`BOOSTED_TREE`)
- Deep Neural Networks (DNN)
- AutoML Tables
- Imported TensorFlow/PyTorch models

**Pricing:**

- Based on data processed (same as BigQuery queries)
- $5 per TB processed for analysis
- AutoML: $19.32/hour for training
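
For orientation, here is a minimal sketch of what BigQuery ML training looks like from the Python client; the project, dataset, and column names are placeholders, and it assumes Application Default Credentials are configured.

```python
# Minimal sketch: train a logistic-regression model in BigQuery ML from
# Python. Project, dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumes default credentials

training_sql = """
CREATE OR REPLACE MODEL `my-project.my_dataset.churn_model`
OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['churned']) AS
SELECT age, tenure_months, monthly_spend, churned
FROM `my-project.my_dataset.customers`
WHERE churned IS NOT NULL
"""
client.query(training_sql).result()  # blocks until training finishes
```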
### Vertex AI Training

**What it is:** Fully managed ML training platform

**Best for:**

- Custom PyTorch/TensorFlow training
- Large-scale distributed training
- GPU/TPU-accelerated workloads
- Production ML pipelines

**Available Compute:**

- CPUs: n1-standard, n1-highmem, n1-highcpu
- GPUs: NVIDIA T4, P4, V100, P100, A100, L4
- TPUs: v2, v3, v4, v5e (8 cores to 512 cores)

**Pricing:**

- CPU: $0.05-0.30/hour depending on machine type
- GPU T4: $0.35/hour
- GPU A100: $3.67/hour (40GB) or $4.95/hour (80GB)
- TPU v3: $8.00/hour (8 cores)
- TPU v4: $11.00/hour (8 cores)
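
As a rough sketch of how these machine types are requested in practice, the snippet below submits a single-T4 custom job with the `google-cloud-aiplatform` SDK; the project, bucket, and container image URI are placeholders.

```python
# Sketch: submit a single-GPU custom training job via the Vertex AI SDK.
# Project, region, bucket, and image URI are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-staging-bucket")

job = aiplatform.CustomJob(
    display_name="pytorch-t4-training",
    worker_pool_specs=[{
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/my-project/training/trainer:latest",
            "command": ["python", "train.py"],
        },
    }],
)
job.run()  # blocks until the job completes
```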
## GPU/TPU Selection Guide

### GPU Selection (Vertex AI)

**T4 (16GB VRAM):**

- Use case: Inference, light training, small models
- Cost: $0.35/hour
- Good for: BERT-base, small CNNs, inference serving

**V100 (16GB VRAM):**

- Use case: Mid-size training, mixed-precision training
- Cost: $2.48/hour
- Good for: ResNet training, medium transformers

**A100 (40GB/80GB VRAM):**

- Use case: Large model training, distributed training
- Cost: $3.67/hour (40GB), $4.95/hour (80GB)
- Good for: GPT-style models, large vision models, multi-GPU training

**L4 (24GB VRAM):**

- Use case: Modern alternative to T4, better performance
- Cost: $0.66/hour
- Good for: Mid-size models, efficient inference

### TPU Selection (Vertex AI)

**TPU v2 (8 cores):**

- Use case: TensorFlow/JAX training, matrix operations
- Cost: $4.50/hour
- Memory: 8GB per core (64GB total)
- Good for: Legacy TensorFlow models

**TPU v3 (8 cores):**

- Use case: Standard TPU training
- Cost: $8.00/hour
- Memory: 16GB per core (128GB total)
- Good for: BERT, T5, image classification

**TPU v4 (8 cores):**

- Use case: Latest generation, best performance
- Cost: $11.00/hour
- Memory: 32GB per core (256GB total)
- Good for: Large language models, cutting-edge research

**TPU v5e (8 cores):**

- Use case: Cost-optimized TPU
- Cost: $2.50/hour
- Good for: Development, training at scale on a budget

**Multi-node TPU Pods:**

- v3-32: 32 cores, $32/hour
- v3-128: 128 cores, $128/hour
- v4-128: 128 cores, $176/hour
- Use for: Massive distributed training (GPT-3 scale)
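
The guide above can be condensed into a simple heuristic. The function below is illustrative only (not part of any GCP API); the parameter-count thresholds are rough assumptions based on the VRAM and hourly rates listed.

```python
# Illustrative heuristic only: map model size and budget to an accelerator,
# using the VRAM and hourly rates from the guide above. Thresholds are
# assumptions, not official guidance.
def pick_accelerator(params_millions: float, budget_per_hour: float) -> str:
    if params_millions < 200 and budget_per_hour >= 0.35:
        return "NVIDIA_TESLA_T4"     # 16GB, ~$0.35/hour
    if params_millions < 500 and budget_per_hour >= 0.66:
        return "NVIDIA_L4"           # 24GB, ~$0.66/hour
    if budget_per_hour >= 3.67:
        return "NVIDIA_TESLA_A100"   # 40GB, ~$3.67/hour
    raise ValueError("No GPU fits this budget; consider TPU v5e at ~$2.50/hour")

print(pick_accelerator(params_millions=110, budget_per_hour=1.0))  # NVIDIA_TESLA_T4
```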
## Usage

### Setup BigQuery ML Environment

`bash scripts/setup-bigquery-ml.sh`

Prompts for:

- GCP Project ID
- BigQuery dataset name
- Service account credentials
- Default model type preference

Creates:

- `bigquery_config.json` (project configuration)
- `.bigqueryrc` (CLI configuration)
- Example training SQL in `examples/`
### Setup Vertex AI Training Environment

`bash scripts/setup-vertex-ai.sh`

Prompts for:

- GCP Project ID
- Region (us-central1, europe-west4, etc.)
- Service account credentials
- Default machine type
- GPU/TPU preference

Creates:

- `vertex_config.yaml` (training job configuration)
- `vertex_requirements.txt` (Python dependencies)
- Training script template
### Configure GCP Authentication

`bash scripts/configure-auth.sh`

Prompts for:

- Authentication method (service account, user account, workload identity)
- Service account key path (if applicable)
- IAM roles needed

Creates:

- `.gcp_auth_config` (authentication configuration)
- Sets the `GOOGLE_APPLICATION_CREDENTIALS` environment variable
- Validates permissions

Required IAM Roles:

- BigQuery ML: `roles/bigquery.dataEditor`, `roles/bigquery.jobUser`
- Vertex AI: `roles/aiplatform.user`, `roles/storage.objectAdmin`
- Both: `roles/serviceusage.serviceUsageConsumer`
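
A minimal sketch of explicit service-account authentication with a fail-fast permission check is shown below; the key path and project ID are placeholders, and in production you would typically rely on `GOOGLE_APPLICATION_CREDENTIALS` or Workload Identity instead.

```python
# Sketch: load service-account credentials explicitly and verify BigQuery
# access. Key path and project ID are placeholders.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
client = bigquery.Client(project="my-project", credentials=credentials)

# Fails fast with a 403 if roles/bigquery.jobUser or dataEditor is missing.
print([d.dataset_id for d in client.list_datasets(max_results=5)])
```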
### Estimate GCP Training Costs

`bash scripts/estimate-gcp-cost.sh`

Interactive prompts:

- Platform: BigQuery ML or Vertex AI
- If BigQuery ML: data size to process
- If Vertex AI:
  - Machine type (CPU/GPU/TPU)
  - Number of machines
  - Training duration estimate
  - Storage requirements

Output:

- Estimated compute cost
- Storage cost
- Data transfer cost (if applicable)
- Total estimated cost
- Cost comparison with other GCP options
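
The arithmetic behind such an estimate is straightforward. The sketch below mirrors a compute-plus-storage calculation using the hourly rates listed earlier; the rates and the GCS storage price are assumptions, so check the pricing calculator for current numbers.

```python
# Back-of-envelope cost estimate: compute hours plus GCS storage. Rates
# are taken from the tables above and may be out of date.
RATES_PER_HOUR = {"T4": 0.35, "L4": 0.66, "A100_40GB": 3.67, "TPU_V5E": 2.50}
GCS_STANDARD_PER_GB_MONTH = 0.02  # approximate, region-dependent

def estimate_cost(accelerator: str, machines: int, hours: float,
                  storage_gb: float = 0.0) -> float:
    compute = RATES_PER_HOUR[accelerator] * machines * hours
    storage = storage_gb * GCS_STANDARD_PER_GB_MONTH  # one month of storage
    return compute + storage

print(f"4x A100 (40GB) for 24h: ${estimate_cost('A100_40GB', 4, 24, 500):.2f}")
```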
## Templates

### BigQuery ML Training Template (`templates/bigquery_ml_training.sql`)

SQL template for creating and training models:

- Model creation syntax
- Feature engineering examples
- Training options (L1/L2 regularization, learning rate, etc.)
- Evaluation queries
- Prediction queries

Supported model types:

- `LINEAR_REG`, `LOGISTIC_REG`
- `BOOSTED_TREE_CLASSIFIER`, `BOOSTED_TREE_REGRESSOR`
- `DNN_CLASSIFIER`, `DNN_REGRESSOR`
- `AUTOML_CLASSIFIER`, `AUTOML_REGRESSOR`
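
As a sketch of the evaluation and prediction pattern the template covers, the snippet below runs `ML.EVALUATE` and `ML.PREDICT` from Python; the model and table names are placeholders.

```python
# Sketch: evaluate a trained BigQuery ML model and run batch predictions.
# Model and table names are placeholders; to_dataframe() requires the
# pandas and db-dtypes extras.
from google.cloud import bigquery

client = bigquery.Client()

metrics = client.query(
    "SELECT * FROM ML.EVALUATE(MODEL `my_dataset.churn_model`)"
).to_dataframe()

predictions = client.query("""
SELECT * FROM ML.PREDICT(
    MODEL `my_dataset.churn_model`,
    (SELECT age, tenure_months, monthly_spend FROM `my_dataset.new_customers`))
""").to_dataframe()
```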
### Vertex AI Training Job Template (`templates/vertex_training_job.py`)

Python template for custom training:

- Training loop structure
- Distributed training setup (PyTorch DDP; see the sketch below)
- Checkpointing and model saving
- Metrics logging to Vertex AI
- Hyperparameter tuning integration

Includes:

- Single-GPU training
- Multi-GPU training (DataParallel, DistributedDataParallel)
- TPU training with PyTorch/XLA
- Cloud Storage integration
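
A minimal sketch of the DDP initialization pattern such a template uses is below; it assumes the job launcher provides the standard `torch.distributed` environment variables (RANK, WORLD_SIZE, MASTER_ADDR, LOCAL_RANK), as Vertex AI's container launcher does.

```python
# Sketch: wrap a model in DistributedDataParallel, assuming the standard
# torch.distributed environment variables are set by the job launcher.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    dist.init_process_group(backend="nccl")  # reads RANK/WORLD_SIZE/MASTER_ADDR
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    return DDP(model.cuda(local_rank), device_ids=[local_rank])
```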
### GPU Configuration Template (`templates/vertex_gpu_config.yaml`)

YAML configuration for GPU training jobs:

- Machine type selection
- GPU type and count
- Disk configuration
- Network configuration
- Environment variables

Presets included:

- Single T4 (budget)
- Single A100 (standard)
- 4x A100 (distributed)
- 8x A100 (large-scale)
### TPU Configuration Template (`templates/vertex_tpu_config.yaml`)

YAML configuration for TPU training jobs:

- TPU type and topology
- TPU version selection
- JAX/TensorFlow runtime
- XLA compilation flags

Presets included:

- v3-8 (single TPU)
- v4-32 (TPU pod slice)
- v5e-8 (cost-optimized)

### GCP Authentication Template (`templates/gcp_auth.json`)

Service account configuration template:

- Project ID
- Service account email
- Key file path
- Required scopes
- IAM role assignments

Security notes:

- Uses placeholders only (never real keys)
- Documents how to create service accounts
- Includes `.gitignore` protection
## Examples

### BigQuery ML Regression Example (`examples/bigquery-regression-example.sql`)

Complete example:

- Dataset: NYC taxi trip data
- Task: Predict trip duration
- Model: `BOOSTED_TREE_REGRESSOR`
- Includes feature engineering, training, and evaluation

Demonstrates:

- `CREATE MODEL` syntax
- `TRANSFORM` clause for feature engineering (see the sketch below)
- Model evaluation
- Batch predictions
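
The `TRANSFORM` clause is the distinctive part of that example: it bakes feature engineering into the model so the same transforms apply automatically at prediction time. A sketch under assumed column names (not the exact schema of the example file):

```python
# Sketch of a TRANSFORM-based training query; the taxi column names are
# assumptions, not the exact schema used in the example file.
TRAINING_SQL = """
CREATE OR REPLACE MODEL `my_dataset.trip_duration_model`
TRANSFORM (
    ML.STANDARD_SCALER(trip_distance) OVER () AS trip_distance_scaled,
    EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour,
    trip_duration_seconds
)
OPTIONS (model_type = 'BOOSTED_TREE_REGRESSOR',
         input_label_cols = ['trip_duration_seconds']) AS
SELECT trip_distance, pickup_datetime, trip_duration_seconds
FROM `my_dataset.taxi_trips`
"""
```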
### Vertex AI PyTorch Training Example (`examples/vertex-pytorch-training.py`)

Complete training script:

- Dataset: IMDB sentiment analysis
- Model: DistilBERT fine-tuning
- Training: Single GPU
- Logging: Vertex AI Experiments

Demonstrates:

- Loading data from GCS
- Training loop with mixed precision (see the sketch below)
- Checkpointing to GCS
- Metrics logging
- Model export to Vertex AI
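
For reference, a minimal sketch of the mixed-precision pattern such a loop uses (`torch.cuda.amp`); the model, data loader, and optimizer are passed in as assumed arguments rather than taken from the example file.

```python
# Sketch: one epoch of mixed-precision (FP16) training with torch.cuda.amp.
import torch

def train_epoch(model, loader, optimizer, device="cuda"):
    scaler = torch.cuda.amp.GradScaler()
    model.train()
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():  # forward pass runs in reduced precision
            loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        scaler.scale(loss).backward()    # scale loss to avoid FP16 underflow
        scaler.step(optimizer)
        scaler.update()
```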
### Vertex AI Distributed Training Example (`examples/vertex-distributed-training.py`)

Multi-GPU training example:

- Dataset: ImageNet subset
- Model: ResNet-50
- Training: 4x A100 with DDP
- Scaling: Linear scaling rule

Demonstrates:

- PyTorch DistributedDataParallel
- Gradient accumulation
- Learning rate scaling
- Synchronized batch norm
- Multi-node coordination
### Hugging Face Fine-tuning on Vertex AI (`examples/vertex-huggingface-finetuning.py`)

Production fine-tuning template:

- Dataset: Custom text classification
- Model: BERT/RoBERTa/DeBERTa
- Training: Hugging Face Trainer API
- Deployment: Vertex AI endpoint

Demonstrates:

- Hugging Face Trainer integration
- Hyperparameter tuning with Vertex AI
- Model versioning
- Endpoint deployment
- Online predictions
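
A minimal sketch of the Trainer setup such a script builds on is shown below; the model name, output path, and dataset arguments are placeholders. It assumes the job runs in a Vertex AI custom training container, which mounts Cloud Storage buckets under `/gcs/`.

```python
# Sketch: Hugging Face Trainer setup for text classification. Model name,
# output path, and the train/eval datasets are placeholders.
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

def build_trainer(train_ds, eval_ds):
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    args = TrainingArguments(
        output_dir="/gcs/my-bucket/checkpoints",  # GCS FUSE mount on Vertex AI
        per_device_train_batch_size=32,
        num_train_epochs=3,
        fp16=True,
    )
    return Trainer(model=model, args=args,
                   train_dataset=train_ds, eval_dataset=eval_ds)

# build_trainer(train_ds, eval_ds).train()
```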
## Cost Optimization Tips

### BigQuery ML

**Reduce data processed:**

- Use partitioned tables
- Filter data in the WHERE clause before training
- Use table sampling for experimentation (see the sketch below)
- Cache intermediate results

**Use appropriate model types:**

- Start with `LINEAR_REG`/`LOGISTIC_REG` (cheapest)
- Use `BOOSTED_TREE` for better accuracy at moderate cost
- Reserve AutoML for when simpler models fail

**Optimize queries:**

- Avoid `SELECT *` (specify columns)
- Use clustering on filter columns
- Materialize views for repeated training
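
As a sketch of the sampling tip, the query below trains a prototype on roughly 10% of a partitioned table; table and column names are placeholders. `TABLESAMPLE` reads only the sampled blocks, and filtering on the partition column prunes data as well.

```python
# Sketch: cheap prototype training on ~10% of a partitioned table.
# Table and column names are placeholders.
SAMPLED_TRAINING_SQL = """
CREATE OR REPLACE MODEL `my_dataset.prototype_model`
OPTIONS (model_type = 'LINEAR_REG', input_label_cols = ['label']) AS
SELECT feature_a, feature_b, label
FROM `my_dataset.events` TABLESAMPLE SYSTEM (10 PERCENT)
WHERE event_date >= '2024-01-01'  -- filter on the partition column
"""
```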
### Vertex AI

**Machine type selection:**

- Start with CPU for prototyping
- Use T4 for small models (cheapest GPU)
- Use A100 only for large models that need it
- Consider TPU v5e for TensorFlow/JAX (very cost-effective)

**Training optimization:**

- Use preemptible instances (60-70% cheaper, but can be interrupted)
- Enable automatic checkpoint/resume when using preemptible instances
- Use mixed-precision training (FP16/BF16) for faster training
- Profile to eliminate CPU bottlenecks

**Storage optimization:**

- Store datasets in Cloud Storage (cheaper than persistent disk)
- Use Filestore only if you need a POSIX filesystem
- Clean up old model artifacts
- Use lifecycle policies to archive old data

**Multi-GPU efficiency:**

- Verify near-linear scaling before adding more GPUs
- Profile inter-GPU communication
- Use gradient accumulation instead of larger batch sizes (see the sketch below)
- Consider 2x GPUs instead of 1x larger GPU (often the same cost, with better availability)
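
A minimal sketch of the gradient-accumulation recommendation: the effective batch size becomes the loader batch times `accum_steps`, without the extra memory a larger per-step batch would need. Model, loader, and optimizer are assumed arguments.

```python
# Sketch: gradient accumulation. Effective batch = loader batch * accum_steps.
import torch

def train_with_accumulation(model, loader, optimizer, accum_steps=4):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(loader):
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        (loss / accum_steps).backward()  # average gradients across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```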
## Integration with ML Training Plugin

This skill integrates with other ml-training components:

- **training-patterns**: Provides GCP configs for generated training scripts
- **cost-calculator**: Uses GCP pricing data for budget planning
- **monitoring-dashboard**: Integrates with Vertex AI TensorBoard
- **validation-scripts**: Validates GCP credentials and permissions
- **integration-helpers**: Deploys trained models to Vertex AI endpoints
## Common Workflows

### Workflow 1: Quick BigQuery ML Prototype

1. Run `bash scripts/setup-bigquery-ml.sh`
2. Copy `templates/bigquery_ml_training.sql` to your project
3. Modify the SQL for your dataset and features
4. Run the training query in the BigQuery console
5. Evaluate with the built-in `ML.EVALUATE()`
6. Export predictions with `ML.PREDICT()`

**Time:** 30 minutes setup + training time
**Cost:** $5 per TB of data processed

### Workflow 2: Custom PyTorch Training on Vertex AI

1. Run `bash scripts/configure-auth.sh`
2. Run `bash scripts/setup-vertex-ai.sh`
3. Copy `templates/vertex_training_job.py`
4. Customize the training loop for your model
5. Copy `templates/vertex_gpu_config.yaml`
6. Submit the job: `gcloud ai custom-jobs create ...`
7. Monitor in the Vertex AI console

**Time:** 1 hour setup + training time
**Cost:** Depends on GPU/TPU selection

### Workflow 3: Large-Scale Distributed Training

1. Set up Vertex AI (Workflow 2)
2. Copy `examples/vertex-distributed-training.py`
3. Adapt it for your model architecture
4. Test locally with 1 GPU
5. Test with 2 GPUs to verify scaling
6. Scale to 4-8 GPUs for the full training run
7. Use preemptible instances with checkpointing

**Time:** 2-4 hours setup + training time
**Cost:** $15-60/hour depending on GPU count
## Troubleshooting

### BigQuery ML Issues

**"Insufficient permissions":**

- Verify `roles/bigquery.dataEditor` and `roles/bigquery.jobUser`
- Check dataset-level permissions
- Ensure billing is enabled

**"Model training failed":**

- Check for NULL values in features
- Verify data types match model expectations
- Review the feature engineering `TRANSFORM` clause
- Check for sufficient training data

### Vertex AI Issues

**"Service account lacks permissions":**

- Verify `roles/aiplatform.user`
- Add `roles/storage.objectAdmin` for GCS access
- Check project-level IAM policies

**"GPU/TPU quota exceeded":**

- Request a quota increase in the GCP console
- Use a different region with availability
- Start with a smaller GPU/TPU configuration
- Use preemptible instances (separate quota)

**"Training job crashes":**

- Check for CUDA OOM (reduce batch size)
- Verify dependencies in `requirements.txt`
- Review logs in Cloud Logging
- Test locally before submitting to Vertex AI
## Security Best Practices

### Credentials Management

**DO:**

- ✅ Use service accounts with minimal permissions
- ✅ Store credentials in Secret Manager
- ✅ Use Workload Identity for GKE deployments
- ✅ Rotate service account keys regularly
- ✅ Add `.gitignore` entries for `*.json` key files

**DON'T:**

- ❌ Hardcode credentials in code
- ❌ Commit service account keys to git
- ❌ Use overly permissive roles (e.g., Owner)
- ❌ Share service account keys across projects
- ❌ Use personal credentials for production

### IAM Best Practices

- Use separate service accounts for training vs. serving
- Grant roles at the resource level, not the project level, when possible
- Use Workload Identity Federation instead of keys when possible
- Enable Cloud Audit Logs for ML API usage
- Review IAM permissions quarterly
## Performance Benchmarks

### BigQuery ML vs. Vertex AI

**BigQuery ML:**

- Best for: Structured data, SQL users, quick prototypes
- Training time: Minutes to hours (depends on data size)
- Scalability: Automatic (serverless)
- Cost: $5/TB processed

**Vertex AI Custom Training:**

- Best for: Deep learning, custom architectures, GPU/TPU workloads
- Training time: Hours to days (configurable hardware)
- Scalability: Manual (choose machine type)
- Cost: $0.35-20/hour depending on hardware

**Rule of thumb:**

- Use BigQuery ML for tabular data with < 100M rows
- Use Vertex AI for images, text, audio, or custom models
- Use Vertex AI for models requiring GPU/TPU acceleration
## Additional Resources

- GCP ML Documentation: https://cloud.google.com/vertex-ai/docs
- BigQuery ML Reference: https://cloud.google.com/bigquery-ml/docs
- Pricing Calculator: https://cloud.google.com/products/calculator
- TPU Best Practices: https://cloud.google.com/tpu/docs/best-practices
- Vertex AI Samples: https://github.com/GoogleCloudPlatform/vertex-ai-samples