software-backend

Software Backend Engineering

Use this skill to design, implement, and review production-grade backend services: API boundaries, data layer, auth, caching, observability, error handling, testing, and deployment.

Default to: type-safe boundaries (validation at the edge), OpenTelemetry for observability, zero-trust assumptions, idempotency for retries, RFC 9457 errors, Postgres with connection pooling, structured logs, timeouts, and rate limiting.

Scaffolding rule: When scaffolding a new project, show full working implementations for all domain logic — fraud rules, audit logging, webhook handlers, validation pipelines, background jobs. Don't just reference file names or stub functions; show the actual code so the user can run it immediately.

Quick Reference

| Task | Default Picks | Notes |
|------|---------------|-------|
| REST API | Fastify / Express / NestJS | Prefer typed boundaries + explicit timeouts |
| Edge API | Hono / platform-native handlers | Keep work stateless, CPU-light |
| Type-Safe API | tRPC | Prefer for TS monorepos and internal APIs |
| GraphQL API | Apollo Server / Pothos | Prefer for complex client-driven queries |
| Database | PostgreSQL | Use pooling + migrations + query budgets |
| ORM / Query Layer | Prisma / Drizzle / SQLAlchemy / GORM / SeaORM / EF Core | Prefer explicit transactions |
| Authentication | OIDC/OAuth + sessions/JWT | Prefer httpOnly cookies for browsers |
| Validation | Zod / Pydantic / validator libs | Validate at the boundary, not deep inside |
| Caching | Redis (or managed) | Use TTLs + invalidation strategy |
| Background Jobs | BullMQ / platform queues | Make jobs idempotent + retry-safe |
| Testing | Unit + integration + contract/E2E | Keep most tests below the UI layer |
| Observability | Structured logs + OpenTelemetry | Correlation IDs end-to-end |

Scope

Use this skill to:

Design and implement REST/GraphQL/tRPC APIs
Model data schemas and run safe migrations
Implement authentication/authorization (OIDC/OAuth, sessions/JWT)
Add validation, error handling, rate limiting, caching, and background jobs
Ship production readiness (timeouts, observability, deploy/runbooks)

When NOT to Use This Skill

Use a different skill when:

Frontend-only concerns -> See software-frontend
Infrastructure provisioning (Terraform, K8s manifests) -> See ops-devops-platform
API design patterns only (no implementation) -> See dev-api-design
SQL query optimization and indexing -> See data-sql-optimization
Security audits and threat modeling -> See software-security-appsec
System architecture (beyond single service) -> See software-architecture-design

Technology Selection

Pick based on the strongest constraint, not feature lists:

| Constraint | Default Pick | Why |
|------------|--------------|-----|
| Team knows TypeScript only | Fastify/Hono + Prisma/Drizzle | Ecosystem depth, hiring ease |
| Need <50ms P95, CPU-bound work | Go (net/http + sqlc/pgx) | Goroutines isolate CPU work; no event-loop risk |
| Data-heavy / ML integration | Python (FastAPI + SQLAlchemy) | Best ecosystem for numpy/pandas/ML pipelines |
| Memory-safety critical | Rust (Axum + SeaORM/SQLx) | Zero-cost abstractions, no GC |
| Enterprise/.NET team | C# (ASP.NET Core + EF Core) | Azure integration, mature tooling |
| Edge/serverless | Hono / platform-native handlers | Stateless, CPU-light, fast cold starts |
| Fintech/audit-sensitive | Go + sqlc (or raw SQL) | ORM magic is a liability; you need auditable SQL |

For detailed framework/ORM/auth/caching selection trees, see references/edge-deployment-guide.md and language-specific references. See assets/ for starter templates per language.

API Design Patterns (Dec 2025)

Idempotency Patterns

All mutating operations MUST support idempotency for retry safety.

Implementation:

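    // Read the client-supplied idempotency key from the request header
    const idempotencyKey = request.headers['idempotency-key'];
    const cached = await redis.get(`idem:${idempotencyKey}`);
    if (cached) return JSON.parse(cached); // duplicate request: replay the stored response

    const result = await processOperation();
    // Keep the response for 24h so retries with the same key get the same result
    await redis.set(`idem:${idempotencyKey}`, JSON.stringify(result), 'EX', 86400);
    return result;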

| Do | Avoid |
|----|-------|
| Store idempotency keys with TTL (24h typical) | Processing duplicate requests |
| Return cached response for duplicate keys | Different responses for same key |
| Use client-generated UUIDs | Server-generated keys |
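
The check-then-process snippet above can let two concurrent requests with the same key both miss the cache and both run the operation. A minimal sketch of one way to close that gap is to store the in-flight promise itself, so duplicates share it. `IdempotencyStore` and its in-memory `Map` are hypothetical stand-ins for Redis, used only to keep the example self-contained; a multi-instance service would need an atomic operation in the shared store instead.

```typescript
// Hypothetical in-memory stand-in for a shared store such as Redis.
type Entry = { result: Promise<unknown>; expiresAt: number };

class IdempotencyStore {
  private entries = new Map<string, Entry>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Runs `operation` at most once per key within the TTL. Duplicates,
  // including concurrent ones, receive the first call's result.
  run<T>(key: string, operation: () => Promise<T>): Promise<T> {
    const existing = this.entries.get(key);
    if (existing && existing.expiresAt > this.now()) {
      return existing.result as Promise<T>;
    }
    const result = operation();
    this.entries.set(key, { result, expiresAt: this.now() + this.ttlMs });
    // Failures are not cached, so the client can retry with the same key.
    result.catch(() => {
      if (this.entries.get(key)?.result === result) this.entries.delete(key);
    });
    return result;
  }
}
```

Usage would look like `store.run(idempotencyKey, () => processOperation())` inside the handler, with the TTL set to the 24h window from the table.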

Pagination Patterns

| Pattern | Use When | Example |
|---------|----------|---------|
| Cursor-based | Large datasets, real-time data | ?cursor=abc123&limit=20 |
| Offset-based | Small datasets, random access | ?page=3&per_page=20 |
| Keyset | Sorted data, high performance | ?after_id=1000&limit=20 |

Prefer cursor-based pagination for APIs with frequent inserts.
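
As a rough illustration of cursor-based pagination, the sketch below pages over an in-memory array using an opaque cursor that wraps the last returned `id`. The `Item` shape, the cursor encoding, and the function names are assumptions made for this example; a real service would push the `id > cursor` filter into the database query.

```typescript
// Hypothetical row shape and cursor format, for illustration only.
interface Item { id: number; title: string }
interface Page { items: Item[]; nextCursor: string | null }

// Opaque cursor: URL-safe JSON of the last item's sort key.
function encodeCursor(lastId: number): string {
  return encodeURIComponent(JSON.stringify({ id: lastId }));
}

function decodeCursor(cursor: string): number {
  const parsed = JSON.parse(decodeURIComponent(cursor));
  if (typeof parsed?.id !== "number") throw new Error("invalid cursor");
  return parsed.id;
}

function paginate(rows: Item[], limit: number, cursor?: string): Page {
  const afterId = cursor ? decodeCursor(cursor) : Number.NEGATIVE_INFINITY;
  // Roughly: WHERE id > afterId ORDER BY id LIMIT limit + 1
  const window = rows
    .filter((row) => row.id > afterId)
    .sort((a, b) => a.id - b.id)
    .slice(0, limit + 1);
  // Fetching one extra row tells us whether another page exists.
  const hasMore = window.length > limit;
  const items = hasMore ? window.slice(0, limit) : window;
  const last = items[items.length - 1];
  return { items, nextCursor: hasMore && last ? encodeCursor(last.id) : null };
}
```

Because each page starts strictly after the last seen `id`, rows inserted while a client is paging do not shift later pages the way they would with an offset.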

Error Response Standard (Problem Details)

Use a consistent machine-readable error format (RFC 9457 Problem Details): https://www.rfc-editor.org/rfc/rfc9457

    {
      "type": "https://example.com/problems/invalid-request",
      "title": "Invalid request",
      "status": 400,
      "detail": "email is required",
      "instance": "/v1/users"
    }

Health Check Patterns

    // Liveness: Is the process running?
    app.get('/health/live', (req, res) => {
      res.status(200).json({ status: 'ok' });
    });

    // Readiness: Can the service handle traffic?
    app.get('/health/ready', async (req, res) => {
      const dbOk = await checkDatabase();
      const cacheOk = await checkRedis();
      if (dbOk && cacheOk) {
        res.status(200).json({ status: 'ready', db: 'ok', cache: 'ok' });
      } else {
        res.status(503).json({ status: 'not ready', db: dbOk, cache: cacheOk });
      }
    });

Common Mistakes (Non-Obvious)

| Avoid | Instead | Why |
|-------|---------|-----|
| N+1 queries | include/select or DataLoader | 10-100x perf hit; easy to miss in ORM code |
| No request timeouts | Timeouts on HTTP clients, DB, handlers | Hung deps cascade; see Production Hardening below |
| Missing connection pooling | Prisma pool / PgBouncer / pgx pool | Exhaustion under load on shared DB tiers |
| Catching errors silently | Log + rethrow or handle explicitly | Hidden failures, impossible to debug |
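
To make the N+1 fix concrete, here is a minimal sketch of the batching idea behind DataLoader: lookups requested in the same tick are collected and resolved with a single batch query. `BatchLoader` and `BatchFn` are hypothetical names for this example and are not the DataLoader library's API; the real library also handles caching and more scheduling cases.

```typescript
// Hypothetical batch function: one query for many keys.
type BatchFn<K, V> = (keys: K[]) => Promise<Map<K, V>>;
type Waiter<V> = {
  resolve: (value: V | undefined) => void;
  reject: (reason: unknown) => void;
};

class BatchLoader<K, V> {
  private pending = new Map<K, Waiter<V>[]>();
  private scheduled = false;
  private batchFn: BatchFn<K, V>;

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  // Queues a key; all keys queued in the same tick share one batch call.
  load(key: K): Promise<V | undefined> {
    return new Promise<V | undefined>((resolve, reject) => {
      const waiters = this.pending.get(key) ?? [];
      waiters.push({ resolve, reject });
      this.pending.set(key, waiters);
      if (!this.scheduled) {
        this.scheduled = true;
        // Flush in a microtask, after the current synchronous code finishes.
        Promise.resolve().then(() => this.flush());
      }
    });
  }

  private async flush(): Promise<void> {
    const batch = this.pending;
    this.pending = new Map();
    this.scheduled = false;
    try {
      const values = await this.batchFn(Array.from(batch.keys()));
      batch.forEach((waiters, key) => waiters.forEach((w) => w.resolve(values.get(key))));
    } catch (err) {
      batch.forEach((waiters) => waiters.forEach((w) => w.reject(err)));
    }
  }
}
```

With a loader per request, resolving the author for each of 50 posts would issue one `WHERE id IN (...)` query instead of 50 single-row queries.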

Production Hardening: Patterns Models Skip

These are the patterns that separate "works in dev" from "survives production." Models tend to skip them unless explicitly prompted — add them to every service.

Request & Query Timeouts

Every outbound call needs a timeout. Without one, a hung dependency leaks connections and cascades failures.

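    // HTTP client timeout: abort the request after 5s
    const response = await fetch(url, { signal: AbortSignal.timeout(5000) });

    // Database query timeout (Prisma), in milliseconds.
    // SET applies to the current session only; with a connection pool, prefer
    // SET LOCAL inside a transaction or a role/database-level default.
    await prisma.$executeRaw`SET statement_timeout = '3000'`;

    // Fastify request timeout (confirm the plugin name against the Fastify ecosystem list before use)
    server.register(import('@fastify/timeout'), { timeout: 30000 });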

| Layer | Default Timeout | Rationale |
|-------|-----------------|-----------|
| HTTP client calls | 5s | External APIs shouldn't block you |
| Database queries | 3s | Slow queries = missing index or bad plan |
| Request handler | 30s | Safety net for the whole request lifecycle |
| Background jobs | 5min | Jobs that run longer need chunking |
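
For calls that do not accept an `AbortSignal`, one possible way to apply the per-layer budgets above is a small wrapper that races the work against a timer. `withTimeout`, `TimeoutError`, and the `TIMEOUTS` map are hypothetical names for this sketch. Note that racing only stops waiting; it does not cancel the underlying work.

```typescript
class TimeoutError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} timed out after ${ms}ms`);
    this.name = "TimeoutError";
  }
}

// Hypothetical defaults mirroring the table above, in milliseconds.
const TIMEOUTS = {
  httpClient: 5000,
  dbQuery: 3000,
  handler: 30000,
  backgroundJob: 300000,
} as const;

// Rejects with TimeoutError if `work` has not settled within `ms`.
// The underlying work keeps running; pass an AbortSignal where supported.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(label, ms)), ms);
  });
  // Clear the timer either way so it cannot keep the process alive.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

A call site might read `await withTimeout(loadReport(id), TIMEOUTS.dbQuery, "loadReport")`, where `loadReport` is whatever promise-returning call needs a budget.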

Field-Level Selection (Don't SELECT *)

ORMs default to fetching all columns. On wide tables this wastes bandwidth and hides performance problems.

    // BAD: fetches all 30 columns, plus every column of each related post
    const users = await prisma.user.findMany({ include: { posts: true } });

    // GOOD: fetch only what the endpoint needs.
    // Prisma rejects select and include at the same level, so nest the relation inside select.
    const users = await prisma.user.findMany({
      select: {
        id: true,
        name: true,
        email: true,
        posts: { select: { id: true, title: true } },
      },
    });

For Go (sqlc): write explicit column lists in SQL queries — sqlc enforces this naturally. For Python (SQLAlchemy): use load_only() or explicit column selection.

Structured Error Responses (RFC 9457)

Return machine-readable errors from day one. Clients shouldn't have to regex-parse error messages.

    {
      "type": "https://api.example.com/problems/validation-error",
      "title": "Validation failed",
      "status": 422,
      "detail": "email must be a valid email address",
      "instance": "/v1/users",
      "errors": [{ "field": "email", "message": "invalid format" }]
    }

Set Content-Type: application/problem+json. This format is a standard (RFC 9457) and parseable by any HTTP client.
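
A small helper can keep these responses consistent across handlers. The sketch below builds the validation problem shown above as a framework-neutral response object. `validationProblem`, `FieldError`, and the response shape are hypothetical; the `errors` array is an extension member, as in the example, and `PROBLEM_BASE` is the placeholder URL from that example.

```typescript
interface FieldError { field: string; message: string }

interface ProblemDetails {
  type: string;
  title: string;
  status: number;
  detail?: string;
  instance?: string;
  errors?: FieldError[]; // extension member, not a core RFC 9457 field
}

interface ProblemResponse {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// Placeholder base URL taken from the example above.
const PROBLEM_BASE = "https://api.example.com/problems";

// Builds a 422 problem response from a list of field-level errors.
function validationProblem(instance: string, errors: FieldError[]): ProblemResponse {
  const problem: ProblemDetails = {
    type: `${PROBLEM_BASE}/validation-error`,
    title: "Validation failed",
    status: 422,
    detail: errors.map((e) => `${e.field}: ${e.message}`).join("; "),
    instance,
    errors,
  };
  return {
    status: problem.status,
    headers: { "Content-Type": "application/problem+json" },
    body: JSON.stringify(problem),
  };
}
```

Keeping the HTTP status and the `status` member derived from the same value avoids the two drifting apart.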

Query Plan Verification

Before shipping any new query to production, verify its execution plan:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ... FROM ... WHERE ...;

Red flags in the output: Seq Scan on large tables, Nested Loop with high row estimates, Sort without index. Add indexes or rewrite the query before deploying.
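
As a rough aid for reviews or CI, a text-format plan can be scanned for some of these red flags. The sketch below is a heuristic only: `findPlanRedFlags` and the 10000-row threshold are assumptions, it reads the planner's estimated `rows=` value on each line, and it flags sorts that spill to disk as a stand-in for "Sort without index", which cannot be judged from a single plan line.

```typescript
// Hypothetical cutoff for what counts as a "large" row estimate.
const ROW_THRESHOLD = 10000;

// Scans EXPLAIN text output line by line and returns human-readable warnings.
function findPlanRedFlags(planText: string, rowThreshold: number = ROW_THRESHOLD): string[] {
  const flags: string[] = [];
  for (const line of planText.split("\n")) {
    // The first rows= on a plan node line is the planner's estimate.
    const rows = Number(/rows=(\d+)/.exec(line)?.[1] ?? 0);
    const seqScan = /Seq Scan on (\S+)/.exec(line);
    if (seqScan && rows >= rowThreshold) {
      flags.push(`Seq Scan on ${seqScan[1]} (~${rows} rows)`);
    }
    if (/Nested Loop/.test(line) && rows >= rowThreshold) {
      flags.push(`Nested Loop (~${rows} rows)`);
    }
    if (/Sort Method: external/.test(line)) {
      flags.push("Sort spilled to disk");
    }
  }
  return flags;
}
```

An empty result does not prove a plan is good; it only means none of these patterns matched, so a human read of the plan is still needed.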

Performance Debugging Workflow

When a service is slow, work through these layers in order. Fix the cheapest layer first — don't add caching before fixing N+1 queries.

| Step | What to Check | Fix |
|------|---------------|-----|
| Query analysis | Enable query logging, find N+1s and slow queries | Rewrite with include/joins, add select for field-level optimization |
| Indexing | Run EXPLAIN ANALYZE on slow queries | Add composite indexes matching WHERE + ORDER BY patterns |
| Connection pooling | Check connection count vs. pool size | Configure pool limits (Prisma connection_limit, PgBouncer, pgx pool) |
| Caching | Identify read-heavy, rarely-changing data | Add Redis/in-memory cache with TTL + invalidation strategy |
| Timeouts | Check for missing timeouts on DB, HTTP, handlers | Add timeouts at every layer (see Production Hardening above) |
| Platform tuning | Shared DB limits, cold starts, memory | Upgrade tier, add read replicas, tune runtime settings |

Key principle: always measure before and after. Use structured logging with request IDs to trace specific slow requests end-to-end.
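
One lightweight way to get before/after numbers is to wrap each suspect step in a timer that emits a structured log line tagged with the request ID. `timed` and `TimingLog` are hypothetical names for this sketch; in a real service the log sink would be the project's structured logger rather than `console.log`.

```typescript
interface TimingLog {
  requestId: string;
  step: string;
  durationMs: number;
  ok: boolean;
}

// Runs `work`, then logs how long it took and whether it succeeded.
// The log entry is emitted on both success and failure.
async function timed<T>(
  requestId: string,
  step: string,
  work: () => Promise<T>,
  log: (entry: TimingLog) => void = (entry) => console.log(JSON.stringify(entry)),
): Promise<T> {
  const start = Date.now();
  let ok = true;
  try {
    return await work();
  } catch (err) {
    ok = false;
    throw err; // rethrow so the failure is never swallowed
  } finally {
    log({ requestId, step, durationMs: Date.now() - start, ok });
  }
}
```

Filtering logs by one `requestId` then shows which step dominates a slow request, and the same filter after a fix confirms whether the change helped.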

Infrastructure Economics

Backend architecture decisions directly impact cost and revenue. See references/infrastructure-economics.md for detailed cost modeling, SLA-to-revenue mapping, unit economics checklists, and FinOps practices.

Navigation

Resources

references/backend-best-practices.md - Template authoring guide, quality checklist, and shared utilities pointers
references/edge-deployment-guide.md - Edge computing patterns, Cloudflare Workers vs Vercel Edge, tRPC, Hono, Bun
references/infrastructure-economics.md - Cost modeling, performance SLAs -> revenue, FinOps practices, cloud optimization
references/go-best-practices.md - Go idioms, concurrency, error handling, GORM usage, testing, profiling
references/rust-best-practices.md - Ownership, async, Axum, SeaORM, error handling, testing
references/python-best-practices.md - FastAPI, SQLAlchemy, async patterns, validation, testing, performance
references/nodejs-best-practices.md - Event loop, async patterns, Express/Fastify/NestJS/Hono, error handling, memory management, security, profiling
references/csharp-best-practices.md - C# 14 / .NET 10 LTS, extension members, field keyword, ASP.NET Core 10 (validation, SSE, OpenAPI 3.1), EF Core 10 (LeftJoin, named filters), HybridCache, Polly v8 resilience
references/database-patterns.md - PostgreSQL patterns (JSONB, CTEs, partitioning), connection pooling, migration strategies, ORM comparison, index design
references/message-queues-background-jobs.md - BullMQ patterns, broker comparison (Redis/SQS/Kafka/RabbitMQ), idempotent jobs, DLQ, scheduling, delivery guarantees
data/sources.json - External references per language/runtime
Shared checklists: ../software-clean-code-standard/assets/checklists/backend-api-review-checklist.md, ../software-clean-code-standard/assets/checklists/secure-code-review-checklist.md

Shared Utilities (Centralized patterns - extract, don't duplicate)

../software-clean-code-standard/utilities/auth-utilities.md - Argon2id, jose JWT, OAuth 2.1/PKCE
../software-clean-code-standard/utilities/error-handling.md - Effect Result types, correlation IDs
../software-clean-code-standard/utilities/config-validation.md - Zod 3.24+, Valibot, secrets management
../software-clean-code-standard/utilities/resilience-utilities.md - p-retry v6, opossum v8, OTel spans
../software-clean-code-standard/utilities/logging-utilities.md - pino v9 + OpenTelemetry integration
../software-clean-code-standard/utilities/testing-utilities.md - Vitest, MSW v2, factories, fixtures
../software-clean-code-standard/utilities/observability-utilities.md - OpenTelemetry SDK, tracing, metrics
../software-clean-code-standard/references/clean-code-standard.md - Canonical clean code rules (CC-*) for citation

Templates

assets/nodejs/template-nodejs-prisma-postgres.md - Node.js + Prisma + PostgreSQL
assets/go/template-go-fiber-gorm.md - Go + Fiber + GORM + PostgreSQL
assets/rust/template-rust-axum-seaorm.md - Rust + Axum + SeaORM + PostgreSQL
assets/python/template-python-fastapi-sqlalchemy.md - Python + FastAPI + SQLAlchemy + PostgreSQL
assets/csharp/template-csharp-aspnet-efcore.md - C# + ASP.NET Core + Entity Framework Core + PostgreSQL

Related Skills

../software-architecture-design/SKILL.md - System decomposition, SLAs, and data flows
../software-security-appsec/SKILL.md - Authentication/authorization and secure API design
../ops-devops-platform/SKILL.md - CI/CD, infrastructure, and deployment safety
../qa-resilience/SKILL.md - Resilience, retries, and failure playbooks
../software-code-review/SKILL.md - Review checklists and standards for backend changes
../qa-testing-strategy/SKILL.md - Testing strategies, test pyramids, and coverage goals
../dev-api-design/SKILL.md - RESTful design, GraphQL, and API versioning patterns
../data-sql-optimization/SKILL.md - SQL optimization, indexing, and query tuning patterns

Freshness Protocol

When users ask version-sensitive recommendation questions, do a quick freshness check before asserting "best" choices or quoting versions.

Trigger Conditions

"What's the best backend framework for [use case]?"
"What should I use for [API design/auth/database]?"
"What's the latest in Node.js/Go/Rust?"
"Current best practices for [REST/GraphQL/tRPC]?"
"Is [framework/runtime] still relevant in 2026?"
"[Express] vs [Fastify] vs [Hono]?"
"Best ORM for [database/use case]?"

How to Freshness-Check

Start from data/sources.json (official docs, release notes, support policies).
Run a targeted web search for the specific component and open release notes/support policy pages.
Prefer official sources over blogs for versions and support windows.

What to Report

Current landscape: what is stable and widely used now
Emerging trends: what is gaining traction (and why)
Deprecated/declining: what is falling out of favor (and why)
Recommendation: default choice + 1-2 alternatives, with trade-offs

Example Topics (verify with fresh search)

Node.js LTS support window and major changes
Bun vs Deno vs Node.js
Hono, Elysia, and edge-first frameworks
Drizzle vs Prisma for TypeScript
tRPC and end-to-end type safety
Edge computing and serverless patterns
.NET 10 LTS (Nov 2025) and C# 14 adoption
ASP.NET Core 10 built-in validation vs FluentValidation
EF Core 10 vs Dapper for C# data access
HybridCache vs manual IMemoryCache + IDistributedCache

Operational Playbooks

references/operational-playbook.md - Full backend architecture patterns, checklists, TypeScript notes, and decision tables

Fact-Checking

Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
Prefer primary sources; report source links and dates for volatile information.
If web access is unavailable, state the limitation and mark guidance as unverified.
