Cloud Solution Architect

Overview

Design well-architected, production-grade cloud systems following Azure Architecture Center best practices. This skill provides:

10 design principles for Azure applications
6 architecture styles with selection guidance
44 cloud design patterns mapped to WAF pillars
Technology choice frameworks for compute, storage, data, messaging
Performance antipatterns to avoid
Architecture review workflow for systematic design validation

Ten Design Principles for Azure Applications

Principle Key Tactics

1 Design for self-healing Retry with backoff, circuit breaker, bulkhead isolation, health endpoint monitoring, graceful degradation

2 Make all things redundant Eliminate single points of failure, use availability zones, deploy multi-region, replicate data

3 Minimize coordination Decouple services, use async messaging, embrace eventual consistency, use domain events

4 Design to scale out Horizontal scaling, autoscaling rules, stateless services, avoid session stickiness, partition workloads

5 Partition around limits Data partitioning (shard/hash/range), respect compute & network limits, use CDNs for static content

6 Design for operations Structured logging, distributed tracing, metrics & dashboards, runbook automation, infrastructure as code

7 Use managed services Prefer PaaS over IaaS, reduce operational burden, leverage built-in HA/DR/scaling

8 Use an identity service Microsoft Entra ID, managed identity, RBAC, avoid storing credentials, zero-trust principles

9 Design for evolution Loose coupling, versioned APIs, backward compatibility, async messaging for integration, feature flags

10 Build for business needs Define SLAs/SLOs, establish RTO/RPO targets, domain-driven design, cost modeling, composite SLAs

Architecture Styles

Style Description When to Use Key Services

N-tier Horizontal layers (presentation, business, data) Traditional enterprise apps, lift-and-shift App Service, SQL Database, VNets

Web-Queue-Worker Web frontend → message queue → backend worker Moderate-complexity apps with long-running tasks App Service, Service Bus, Functions

Microservices Small autonomous services, bounded contexts, independent deploy Complex domains, independent team scaling AKS, Container Apps, API Management

Event-driven Pub/sub model, event producers/consumers Real-time processing, IoT, reactive systems Event Hubs, Event Grid, Functions

Big data Batch + stream processing pipeline Analytics, ML pipelines, large-scale data Synapse, Data Factory, Databricks

Big compute HPC, parallel processing Simulations, modeling, rendering, genomics Batch, CycleCloud, HPC VMs

Selection Criteria

Domain complexity → Microservices (high), N-tier (low-medium)
Team autonomy → Microservices (independent teams), N-tier (single team)
Data volume → Big data (TB+), others (GB)
Latency requirements → Event-driven (real-time), Web-Queue-Worker (tolerant)

Cloud Design Patterns

44 patterns organized by primary concern. WAF pillar mapping: R=Reliability, S=Security, CO=Cost Optimization, OE=Operational Excellence, PE=Performance Efficiency.

Messaging & Communication

Pattern Summary Pillars

Asynchronous Request-Reply Decouple request/response with polling or callbacks R, PE

Claim Check Split large messages; store payload separately, pass reference R, PE

Choreography Services coordinate via events without central orchestrator R, OE

Competing Consumers Multiple consumers process messages from shared queue concurrently R, PE

Messaging Bridge Connect incompatible messaging systems R, OE

Pipes and Filters Decompose complex processing into reusable filter stages R, OE

Priority Queue Prioritize requests so higher-priority work is processed first R, PE

Publisher/Subscriber Decouple senders from receivers via topics/subscriptions R, PE

Queue-Based Load Leveling Buffer requests with a queue to smooth intermittent loads R, PE

Sequential Convoy Process related messages in order while allowing parallel groups R, PE

Reliability & Resilience

Pattern Summary Pillars

Bulkhead Isolate resources per workload to prevent cascading failure R

Circuit Breaker Stop calling a failing service; fail fast to protect resources R

Compensating Transaction Undo previously committed steps when a later step fails R

Health Endpoint Monitoring Expose health checks for load balancers and orchestrators R, OE

Leader Election Coordinate distributed instances by electing a leader R

Retry Handle transient faults by retrying with exponential backoff R

Saga Manage data consistency across microservices with compensating transactions R

Scheduler Agent Supervisor Coordinate distributed actions with retry and failure handling R

Data Management

Pattern Summary Pillars

Cache-Aside Load data on demand into cache from data store PE

CQRS Separate read and write models for independent scaling PE, R

Event Sourcing Store state as append-only sequence of domain events R, OE

Index Table Create indexes over frequently queried fields in data stores PE

Materialized View Pre-compute views over data for efficient queries PE

Sharding Distribute data across partitions for scale and performance PE, R

Static Content Hosting Serve static content from cloud storage/CDN directly PE, CO

Valet Key Grant clients limited direct access to storage resources S, PE

Design & Structure

Pattern Summary Pillars

Ambassador Offload cross-cutting concerns to a helper sidecar proxy OE

Anti-Corruption Layer Translate between new and legacy system models OE, R

Backends for Frontends Create separate backends per frontend type (mobile, web, etc.) OE, PE

Compute Resource Consolidation Combine multiple workloads into fewer compute instances CO

External Configuration Store Externalize configuration from deployment packages OE

Sidecar Deploy helper components alongside the main service OE

Strangler Fig Incrementally migrate legacy systems by replacing pieces OE, R

Security & Access

Pattern Summary Pillars

Federated Identity Delegate authentication to an external identity provider S

Gatekeeper Protect services using a dedicated broker that validates requests S

Quarantine Isolate and validate external assets before allowing use S

Rate Limiting Control consumption rate of resources by consumers R, S

Throttling Control resource consumption to sustain SLAs under load R, PE

Deployment & Scaling

Pattern Summary Pillars

Deployment Stamps Deploy multiple independent copies of application components R, PE

Edge Workload Configuration Configure workloads differently across diverse edge devices OE

Gateway Aggregation Aggregate multiple backend calls into a single client request PE

Gateway Offloading Offload shared functionality (SSL, auth) to a gateway OE, S

Gateway Routing Route requests to multiple backends using a single endpoint OE

Geode Deploy backends to multiple regions for active-active serving R, PE

See Design Patterns Reference for detailed implementation guidance.

Technology Choices

Decision Framework

For each technology area, evaluate: requirements → constraints → tradeoffs → select.

Area Key Options Selection Criteria

Compute App Service, Functions, Container Apps, AKS, VMs, Batch Hosting model, scaling, cost, team skills

Storage Blob Storage, Data Lake, Files, Disks, Managed Lustre Access patterns, throughput, cost tier

Data stores SQL Database, Cosmos DB, PostgreSQL, Redis, Table Storage Consistency model, query patterns, scale

Messaging Service Bus, Event Hubs, Event Grid, Queue Storage Ordering, throughput, pub/sub vs queue

Networking Front Door, Application Gateway, Load Balancer, Traffic Manager Global vs regional, L4 vs L7, WAF

AI services Azure OpenAI, AI Search, AI Foundry, Document Intelligence Model needs, data grounding, orchestration

Containers Container Apps, AKS, Container Instances Operational control vs simplicity

See Technology Choices Reference for detailed decision trees.

Best Practices

Practice Key Guidance

API design RESTful conventions, resource-oriented URIs, HATEOAS, versioning via URL path or header

API implementation Async operations, pagination, idempotent PUT/DELETE, content negotiation, ETag caching

Autoscaling Scale on metrics (CPU, queue depth, custom), cool-down periods, predictive scaling, scale-in protection

Background jobs Use queues or scheduled triggers, idempotent processing, poison message handling, graceful shutdown

Caching Cache-aside pattern, TTL policies, cache invalidation strategies, distributed cache for multi-instance

CDN Static asset offloading, cache-busting with versioned URLs, geo-distribution, HTTPS enforcement

Data partitioning Horizontal (sharding), vertical, functional partitioning; partition key selection for even distribution

Partitioning strategies Hash-based, range-based, directory-based; rebalancing approach, cross-partition query avoidance

Host name preservation Preserve original host header through proxies/gateways for cookies, redirects, auth flows

Message encoding Schema evolution (Avro/Protobuf), backward/forward compatibility, schema registry

Monitoring & diagnostics Structured logging, distributed tracing (W3C Trace Context), metrics, alerts, dashboards

Transient fault handling Retry with exponential backoff + jitter, circuit breaker, idempotency keys, timeout budgets

See Best Practices Reference for implementation details.

Performance Antipatterns

Avoid these common patterns that degrade performance under load:

Antipattern Problem Fix

Busy Database Offloading too much processing to the database Move logic to application tier, use caching

Busy Front End Resource-intensive work on frontend request threads Offload to background workers/queues

Chatty I/O Many small I/O requests instead of fewer large ones Batch requests, use bulk APIs, buffer writes

Extraneous Fetching Retrieving more data than needed Project only required fields, paginate, filter server-side

Improper Instantiation Recreating expensive objects per request Use singletons, connection pooling, HttpClientFactory

Monolithic Persistence Single data store for all data types Polyglot persistence — right store for each workload

No Caching Repeatedly fetching unchanged data Cache-aside pattern, CDN, output caching, Redis

Noisy Neighbor One tenant consuming all shared resources Bulkhead isolation, per-tenant quotas, throttling

Retry Storm Aggressive retries overwhelming a recovering service Exponential backoff + jitter, circuit breaker, retry budgets

Synchronous I/O Blocking threads on I/O operations Async/await, non-blocking I/O, reactive streams

Mission-Critical Design

For workloads targeting 99.99%+ SLO, address these design areas:

Design Area Key Considerations

Application platform Multi-region active-active, availability zones, Container Apps or AKS with zone redundancy

Application design Stateless services, idempotent operations, graceful degradation, bulkhead isolation

Networking Azure Front Door (global LB), DDoS Protection, private endpoints, redundant connectivity

Data platform Multi-region Cosmos DB, zone-redundant SQL, async replication, conflict resolution

Deployment & testing Blue-green deployments, canary releases, chaos engineering, automated rollback

Health modeling Composite health scores, dependency health tracking, automated remediation, SLI dashboards

Security Zero-trust, managed identity everywhere, key rotation, WAF policies, threat modeling

Operational procedures Automated runbooks, incident response playbooks, game days, postmortems

See Mission-Critical Reference for detailed guidance.

Well-Architected Framework (WAF) Pillars

Every architecture decision should be evaluated against all five pillars:

Pillar Focus Key Questions

Reliability Resiliency, availability, disaster recovery What is the RTO/RPO? How does it handle failures? Is there redundancy?

Security Threat protection, identity, data protection Is identity managed? Is data encrypted? Are there network controls?

Cost Optimization Cost management, efficiency, right-sizing Is compute right-sized? Are there reserved instances? Is there waste?

Operational Excellence Monitoring, deployment, automation Is deployment automated? Is there observability? Are there runbooks?

Performance Efficiency Scaling, load testing, performance targets Can it scale horizontally? Are there performance baselines? Is caching used?

WAF Tradeoff Matrix

Optimizing for... May impact...

Reliability (redundancy) Cost (more resources)

Security (isolation) Performance (added latency)

Cost (consolidation) Reliability (shared failure domains)

Performance (caching) Cost (cache infrastructure), Reliability (stale data)

Architecture Review Workflow

When reviewing or designing a system, follow this structured approach:

Step 1: Identify Requirements

Functional: What must the system do? Non-functional:

Availability target (e.g., 99.9%, 99.99%)
Latency requirements (p50, p95, p99)
Throughput (requests/sec, messages/sec)
Data residency and compliance
Recovery targets (RTO, RPO)
Cost constraints

Step 2: Select Architecture Style

Match requirements to architecture style using the selection criteria table above.

Step 3: Choose Technology Stack

Use the technology choices decision framework. Prefer managed services (PaaS) over IaaS.

Step 4: Apply Design Patterns

Select relevant patterns from the 44 cloud design patterns based on identified concerns.

Step 5: Address Cross-Cutting Concerns

Identity & access — Microsoft Entra ID, managed identity, RBAC
Monitoring — Application Insights, Azure Monitor, Log Analytics
Security — Network segmentation, encryption at rest/in transit, Key Vault
CI/CD — GitHub Actions, Azure DevOps Pipelines, infrastructure as code

Step 6: Validate Against WAF Pillars

Review each pillar systematically. Document tradeoffs explicitly.

Step 7: Document Decisions

Use Architecture Decision Records (ADRs):

ADR-NNN: [Decision Title]

Status: [Proposed | Accepted | Deprecated]

Context

[What is the issue we're addressing?]

Decision

[What did we decide and why?]

Consequences

[What are the positive and negative impacts?]

References

Design Patterns Reference — Detailed pattern implementations
Technology Choices Reference — Decision trees for Azure services
Best Practices Reference — Implementation guidance
Mission-Critical Reference — High-availability design

Source

Content derived from the Azure Architecture Center — Microsoft's official guidance for cloud solution architecture on Azure. Covers design principles, architecture styles, cloud design patterns, technology choices, best practices, performance antipatterns, mission-critical design, and the Well-Architected Framework.

cloud-solution-architect

Safety Notice

Copy this and send it to your AI assistant to learn

ADR-NNN: [Decision Title]

Status: [Proposed | Accepted | Deprecated]

Context

Decision

Consequences

Source Transparency

Related Skills

skill-creator

mcp-builder

frontend-design-review

continual-learning