Cloud Storage Authentication
Secure authentication patterns for accessing cloud storage (S3, GCS, Azure Blob) and cloud services in data pipelines. Covers IAM roles, service principals, secret managers, and best practices for credential management.
Quick Reference
| Provider | Recommended Auth | Alternative |
|---|---|---|
| AWS | IAM roles (EC2/ECS/Lambda) | Environment variables, Secrets Manager |
| GCP | Workload Identity / ADC | Service account keys (discouraged) |
| Azure | Managed Identity | Service principal with certificate |
| Local Dev | .env files + local credentials | Static keys (temporary only) |
Core Principles
- Least privilege: Grant only the permissions a job needs (for example, read-only access to one specific bucket)
- Short-lived credentials: Prefer STS tokens or OIDC federation over long-term keys
- Automatic rotation: Prefer managed identities that rotate automatically
- Secret management: Never commit credentials; use secret managers
- Audit everything: Enable CloudTrail (AWS), Cloud Audit Logs (GCP), and Azure Monitor activity and resource logs
- Separate environments: Use different credentials for dev, staging, and prod
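To make the least-privilege principle concrete, here is a minimal sketch that builds a read-only IAM policy document for a single S3 bucket. The function name `read_only_bucket_policy` and the bucket and prefix values are illustrative assumptions, not part of this skill; adapt the actions to what your pipeline actually needs.

```python
def read_only_bucket_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy that allows listing and reading one bucket only."""
    list_statement = {
        "Sid": "ListBucket",
        "Effect": "Allow",
        "Action": ["s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{bucket}"],
    }
    if prefix:
        # Restrict listing to keys under the prefix as well.
        list_statement["Condition"] = {"StringLike": {"s3:prefix": [f"{prefix}*"]}}

    read_statement = {
        "Sid": "ReadObjects",
        "Effect": "Allow",
        "Action": ["s3:GetObject"],  # read only: no Put/Delete, no s3:*
        "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
    }
    return {"Version": "2012-10-17", "Statement": [list_statement, read_statement]}
```

Serializing the result with `json.dumps` gives a document that can be attached to a role; the same idea (narrow actions, narrow resources) applies to GCS IAM bindings and Azure role assignments.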
When to Use What?
- Production on cloud VMs: Use IAM roles/Managed Identities (no credentials in code)
- CI/CD pipelines: Use workload identity federation (OIDC) or short-lived tokens
- Local development: .env files with user credentials from aws configure, gcloud auth application-default login, az login
- Third-party integrations: Service principals with scoped permissions
- Cross-account access: Role assumption (AWS), workload identity (GCP), service principal (Azure)
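One way to enforce the "no credentials in code" rule for production is a startup guard that refuses to run when static credentials are present in the environment. The sketch below is an assumption about how such a guard might look; the helper names are hypothetical, and the variable list is limited to the environment variables this skill mentions.

```python
from typing import Mapping

# Variables that usually indicate a long-lived, static credential.
# Note: GOOGLE_APPLICATION_CREDENTIALS can also point at a keyless federation
# config, so treat a hit as "needs review", not proof of a leaked key.
STATIC_CREDENTIAL_VARS = (
    "AWS_SECRET_ACCESS_KEY",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "AZURE_STORAGE_KEY",
)


def find_static_credentials(env: Mapping[str, str]) -> list:
    """Return the names of set variables that look like static credentials."""
    found = []
    for name in STATIC_CREDENTIAL_VARS:
        if not env.get(name):
            continue
        # An AWS secret paired with a session token is a temporary STS credential.
        if name == "AWS_SECRET_ACCESS_KEY" and env.get("AWS_SESSION_TOKEN"):
            continue
        found.append(name)
    return found


def require_managed_identity(env: Mapping[str, str]) -> None:
    """Raise if the process appears to rely on static credentials."""
    found = find_static_credentials(env)
    if found:
        raise RuntimeError(f"Static credentials present in environment: {found}")
```

Calling `require_managed_identity(os.environ)` at the start of a production job makes a misconfigured deployment fail loudly instead of silently using a long-lived key.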
Skill Dependencies
This skill is foundational for:
- @data-engineering-storage-remote-access: All cloud storage backends
- @data-engineering-storage-lakehouse: Delta Lake/Iceberg with cloud catalogs
- @data-engineering-streaming: Kafka connectors with cloud auth
- @data-engineering-ai-ml: OpenAI, vector DBs with cloud storage
- @data-engineering-orchestration: dbt, Prefect, Dagster cloud connectors
Detailed Guides
AWS Authentication
See: aws.md
- IAM roles (EC2 instance profiles, ECS task roles, Lambda execution roles)
- IAM users with access keys (discouraged for production)
- STS temporary credentials (AssumeRole, GetSessionToken)
- S3 presigned URLs for temporary file access
- Cross-account access patterns
- AWS Secrets Manager integration
- Environment variable resolution (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN)
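As an illustration of STS temporary credentials and cross-account access, the sketch below exchanges the caller's identity for short-lived credentials via `AssumeRole`. The helper names, the role ARN you pass in, and the session name are hypothetical; the STS client is passed in so the helper can be exercised with a stub in tests.

```python
from typing import Any


def assume_role_kwargs(sts_client: Any, role_arn: str, session_name: str,
                       duration_seconds: int = 900) -> dict:
    """Return boto3 client keyword arguments backed by temporary credentials."""
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=duration_seconds,  # 900 seconds is the STS minimum
    )
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def s3_client_for_role(role_arn: str, session_name: str):
    """Create an S3 client that acts as the given role (requires boto3)."""
    import boto3  # imported here so the helper above has no hard dependency

    sts = boto3.client("sts")  # base identity comes from the default chain
    return boto3.client("s3", **assume_role_kwargs(sts, role_arn, session_name))
```

The credentials expire on their own, so nothing long-lived needs to be stored; a long-running job would call the helper again before expiry.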
Google Cloud Platform
See: gcp.md
- Service accounts (JSON keys)
- Workload Identity Federation (no keys needed!)
- Application Default Credentials (ADC)
- Cloud Storage signed URLs
- Secret Manager integration
- Environment variables (GOOGLE_APPLICATION_CREDENTIALS)
- Workload Identity for GKE; attached service accounts for Cloud Run and Compute Engine
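Because GOOGLE_APPLICATION_CREDENTIALS can point at several kinds of file, it can help to check which kind a pipeline is actually using. The sketch below reads the file's `type` field, which distinguishes service account keys from federation configs and user credentials. The helper name `describe_adc_file` is hypothetical.

```python
import json
from pathlib import Path

# Values of the "type" field found in Google credential JSON files.
CREDENTIAL_KINDS = {
    "service_account": "long-lived service account key (discouraged)",
    "external_account": "workload identity federation config (keyless)",
    "authorized_user": "user credentials from gcloud (local development)",
}


def describe_adc_file(path: str) -> str:
    """Describe the kind of credential stored in an ADC-style JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    kind = data.get("type", "")
    return CREDENTIAL_KINDS.get(kind, f"unrecognized credential type: {kind!r}")
```

A deployment check could call this on the path in GOOGLE_APPLICATION_CREDENTIALS and fail the build if it reports a service account key.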
Azure
See: azure.md
- Managed Identities (system-assigned, user-assigned)
- Service Principals (client secret, certificate)
- SAS tokens for Blob Storage
- Azure Key Vault integration
- Environment variables (AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_KEY)
- Azure AD (Microsoft Entra) workload identity for AKS; managed identities for App Service and VMs
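For user-assigned managed identities, the identity to use has to be named explicitly. The following sketch assumes the public-cloud Blob endpoint format and passes a client ID through `DefaultAzureCredential`'s `managed_identity_client_id` argument; the helper names and the example account name are hypothetical.

```python
from typing import Optional


def blob_account_url(account_name: str) -> str:
    """Build the Blob endpoint URL (public Azure cloud; sovereign clouds differ)."""
    return f"https://{account_name}.blob.core.windows.net"


def blob_service_client(account_name: str,
                        managed_identity_client_id: Optional[str] = None):
    """Create a BlobServiceClient authenticated without any stored secret."""
    # Imported lazily so blob_account_url works without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    kwargs = {}
    if managed_identity_client_id:
        # Selects a user-assigned identity; omit for a system-assigned one.
        kwargs["managed_identity_client_id"] = managed_identity_client_id
    credential = DefaultAzureCredential(**kwargs)
    return BlobServiceClient(account_url=blob_account_url(account_name),
                             credential=credential)
```

The same function works on a developer machine after `az login`, since `DefaultAzureCredential` falls back to the Azure CLI session when no managed identity is available.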
Patterns & Best Practices
See: patterns.md
- Secret rotation automation
- Multi-environment credential management
- Local development setup without production keys
- CI/CD pipeline authentication (GitHub Actions, GitLab CI, Jenkins)
- Testing with mock credentials (Moto, google-cloud-testutils)
- Credential leakage prevention (.gitignore, pre-commit hooks)
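A pre-commit hook for leakage prevention can be as small as a pattern scan. The sketch below looks for strings shaped like AWS access key IDs (the `AKIA` and `ASIA` prefixes followed by 16 uppercase letters or digits). It is a simplified illustration with hypothetical function names; dedicated scanners cover far more credential formats.

```python
import re

# AKIA = long-term access key ID, ASIA = temporary (STS) access key ID.
AWS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_aws_key_ids(text: str) -> list:
    """Return every substring that looks like an AWS access key ID."""
    return AWS_KEY_ID.findall(text)


def scan_files(paths) -> int:
    """Print a warning per suspicious line and return the number of hits."""
    hits = 0
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as handle:
            for lineno, line in enumerate(handle, start=1):
                for match in find_aws_key_ids(line):
                    # Print only the prefix so the hook does not echo the key.
                    print(f"{path}:{lineno}: possible AWS access key ID {match[:4]}...")
                    hits += 1
    return hits
```

Wired into a hook, the script would call `scan_files` on the staged file paths and exit non-zero when the count is greater than zero, which blocks the commit.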
Testing Strategies
See: testing.md
- Mocking cloud services for unit tests
- Using local emulators (MinIO, Azurite, LocalStack)
- Test credential patterns with placeholders
- Integration test setup with temporary credentials
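For the placeholder-credential pattern, one common approach is to override the AWS environment variables for the duration of a test so that no real key can be picked up by accident. This is a minimal sketch using only the standard library; the context manager name and the placeholder values are assumptions.

```python
import os
from contextlib import contextmanager
from unittest import mock

# Obviously fake values: safe to commit and useless against a real account.
PLACEHOLDER_AWS_ENV = {
    "AWS_ACCESS_KEY_ID": "testing",
    "AWS_SECRET_ACCESS_KEY": "testing",
    "AWS_SESSION_TOKEN": "testing",
    "AWS_DEFAULT_REGION": "us-east-1",
}


@contextmanager
def placeholder_aws_credentials():
    """Temporarily replace AWS credentials in os.environ with placeholders."""
    # patch.dict restores the original environment when the block exits.
    with mock.patch.dict(os.environ, PLACEHOLDER_AWS_ENV):
        yield
```

Inside the `with` block, clients created against a mock or local emulator see only the placeholder values; the original environment is restored afterwards, even if the test fails.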
Quick Examples
AWS IAM Role (Production)
No credentials in code; they are picked up automatically on EC2/ECS/Lambda:

    import boto3

    # Resolved by the default credential chain (instance profile, task role, or Lambda role)
    s3 = boto3.client("s3")
GCP Workload Identity (Production)
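Enable Workload Identity on GKE, or attach a service account to the Cloud Run service. Then in Python:

    import google.auth

    # Application Default Credentials resolve the workload's identity
    credentials, project = google.auth.default()

No environment variables are needed.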
Azure Managed Identity (Production)
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    credential = DefaultAzureCredential()  # Auto-detects managed identity
    client = BlobServiceClient(account_url="...", credential=credential)
Local Development
    # AWS: enter keys from an IAM user (dev only)
    aws configure

    # GCP
    gcloud auth application-default login

    # Azure
    az login
Common Pitfalls
❌ Hardcoding credentials: if a key is ever committed to git, rotate it immediately
❌ Using root/admin accounts: create scoped users/service principals instead
❌ Long-lived keys: rotate every 90 days or less
❌ Over-permissive roles: grant s3:GetObject, not s3:*
❌ Missing environment separation: dev credentials must never be used in prod
❌ Disabling TLS verification: acceptable only for local MinIO testing
References
- AWS IAM Best Practices
- GCP Workload Identity
- Azure Managed Identities
- HashiCorp Vault
- Legacy @data-engineering-storage-remote-access auth notes are deprecated; use this skill as the source of truth.