Software Backend Engineering
Use this skill to design, implement, and review production-grade backend services: API boundaries, data layer, auth, caching, observability, error handling, testing, and deployment.
Default biases: type-safe boundaries (validation at the edge), OpenTelemetry for observability, zero-trust assumptions, idempotency for retries, RFC 9457 errors, Postgres + pooling, structured logs, timeouts, and rate limiting.
Scaffolding rule: When scaffolding a new project, show full working implementations for all domain logic — fraud rules, audit logging, webhook handlers, validation pipelines, background jobs. Don't just reference file names or stub functions; show the actual code so the user can run it immediately.
Quick Reference
| Task | Default Picks | Notes |
|------|---------------|-------|
| REST API | Fastify / Express / NestJS | Prefer typed boundaries + explicit timeouts |
| Edge API | Hono / platform-native handlers | Keep work stateless, CPU-light |
| Type-Safe API | tRPC | Prefer for TS monorepos and internal APIs |
| GraphQL API | Apollo Server / Pothos | Prefer for complex client-driven queries |
| Database | PostgreSQL | Use pooling + migrations + query budgets |
| ORM / Query Layer | Prisma / Drizzle / SQLAlchemy / GORM / SeaORM / EF Core | Prefer explicit transactions |
| Authentication | OIDC/OAuth + sessions/JWT | Prefer httpOnly cookies for browsers |
| Validation | Zod / Pydantic / validator libs | Validate at the boundary, not deep inside |
| Caching | Redis (or managed) | Use TTLs + invalidation strategy |
| Background Jobs | BullMQ / platform queues | Make jobs idempotent + retry-safe |
| Testing | Unit + integration + contract/E2E | Keep most tests below the UI layer |
| Observability | Structured logs + OpenTelemetry | Correlation IDs end-to-end |
Scope
Use this skill to:
- Design and implement REST/GraphQL/tRPC APIs
- Model data schemas and run safe migrations
- Implement authentication/authorization (OIDC/OAuth, sessions/JWT)
- Add validation, error handling, rate limiting, caching, and background jobs
- Ship production readiness (timeouts, observability, deploy/runbooks)
When NOT to Use This Skill
Use a different skill when:
- Frontend-only concerns -> See software-frontend
- Infrastructure provisioning (Terraform, K8s manifests) -> See ops-devops-platform
- API design patterns only (no implementation) -> See dev-api-design
- SQL query optimization and indexing -> See data-sql-optimization
- Security audits and threat modeling -> See software-security-appsec
- System architecture (beyond single service) -> See software-architecture-design
Technology Selection
Pick based on the strongest constraint, not feature lists:
| Constraint | Default Pick | Why |
|------------|--------------|-----|
| Team knows TypeScript only | Fastify/Hono + Prisma/Drizzle | Ecosystem depth, hiring ease |
| Need <50ms P95, CPU-bound work | Go (net/http + sqlc/pgx) | Goroutines isolate CPU work; no event-loop risk |
| Data-heavy / ML integration | Python (FastAPI + SQLAlchemy) | Best ecosystem for numpy/pandas/ML pipelines |
| Memory-safety critical | Rust (Axum + SeaORM/SQLx) | Zero-cost abstractions, no GC |
| Enterprise/.NET team | C# (ASP.NET Core + EF Core) | Azure integration, mature tooling |
| Edge/serverless | Hono / platform-native handlers | Stateless, CPU-light, fast cold starts |
| Fintech/audit-sensitive | Go + sqlc (or raw SQL) | ORM magic is a liability; you need auditable SQL |
For detailed framework/ORM/auth/caching selection trees, see references/edge-deployment-guide.md and language-specific references. See assets/ for starter templates per language.
API Design Patterns (Dec 2025)
Idempotency Patterns
All mutating operations MUST support idempotency for retry safety.
Implementation:
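// Idempotency key header (client-generated UUID)
const idempotencyKey = request.headers['idempotency-key'];
if (!idempotencyKey) throw new Error('Idempotency-Key header is required');
// Replay the stored response for a key that was already processed
const cached = await redis.get(`idem:${idempotencyKey}`);
if (cached) return JSON.parse(cached);
const result = await processOperation();
// Keep the result for 24h (86400 s) so retries receive the same response
await redis.set(`idem:${idempotencyKey}`, JSON.stringify(result), 'EX', 86400);
return result;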
| Do | Avoid |
|----|-------|
| Store idempotency keys with TTL (24h typical) | Processing duplicate requests |
| Return cached response for duplicate keys | Different responses for same key |
| Use client-generated UUIDs | Server-generated keys |
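The snippet above checks the key and stores the result in two separate steps, so two concurrent retries can both miss the cache and both run the operation. The sketch below shows one way to close that gap by storing the in-flight promise under the key. `IdempotencyStore` and `charge` are hypothetical names, and the in-memory `Map` is only a stand-in for Redis, where a `SET ... NX` reservation would play a similar role.

```typescript
type Entry<T> = { result: Promise<T>; expiresAt: number };

// Hypothetical in-memory stand-in for a shared store such as Redis.
class IdempotencyStore<T> {
  private entries = new Map<string, Entry<T>>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number = 24 * 60 * 60 * 1000, now: () => number = Date.now) {
    this.ttlMs = ttlMs; // 24h default, matching the table above
    this.now = now;
  }

  // Runs `operation` at most once per key within the TTL.
  // Duplicate calls, including concurrent ones, receive the same promise.
  run(key: string, operation: () => Promise<T>): Promise<T> {
    const existing = this.entries.get(key);
    if (existing && existing.expiresAt > this.now()) {
      return existing.result;
    }
    const result = operation();
    this.entries.set(key, { result, expiresAt: this.now() + this.ttlMs });
    // Drop failed attempts so the client can retry with the same key.
    result.catch(() => {
      if (this.entries.get(key)?.result === result) this.entries.delete(key);
    });
    return result;
  }
}

let charges = 0;
const store = new IdempotencyStore<{ chargeId: number }>();
const charge = async () => ({ chargeId: ++charges });

// A retry with the same key reuses the first call's promise.
const first = store.run("example-key-1", charge);
const retry = store.run("example-key-1", charge);
const other = store.run("example-key-2", charge);
```

Whether failed operations should be cached or retried is a design choice; this sketch evicts them, which suits transient failures but not validation errors that will fail the same way again.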
Pagination Patterns
| Pattern | Use When | Example |
|---------|----------|---------|
| Cursor-based | Large datasets, real-time data | ?cursor=abc123&limit=20 |
| Offset-based | Small datasets, random access | ?page=3&per_page=20 |
| Keyset | Sorted data, high performance | ?after_id=1000&limit=20 |
Prefer cursor-based pagination for APIs with frequent inserts.
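A minimal sketch of cursor-based pagination over an in-memory array, assuming rows are ordered by a unique, increasing `id`. `listRows`, `encodeCursor`, and `decodeCursor` are hypothetical helpers, and the `c<base36>` cursor format is purely illustrative; a real API would typically run `WHERE id > $1 ORDER BY id LIMIT $2` and might sign or base64-encode the cursor.

```typescript
type Row = { id: number; title: string };
type Page = { items: Row[]; nextCursor: string | null };

// Opaque cursor wrapping the last seen id (illustrative encoding only).
const encodeCursor = (id: number): string => "c" + id.toString(36);

function decodeCursor(cursor: string): number {
  if (!/^c[0-9a-z]+$/.test(cursor)) throw new Error("invalid cursor");
  return parseInt(cursor.slice(1), 36);
}

function listRows(rows: Row[], limit: number, cursor?: string): Page {
  if (!Number.isInteger(limit) || limit < 1) throw new Error("limit must be >= 1");
  const afterId = cursor ? decodeCursor(cursor) : 0;
  const sorted = [...rows].sort((a, b) => a.id - b.id);
  // Take one extra row to learn whether another page exists.
  const slice = sorted.filter((r) => r.id > afterId).slice(0, limit + 1);
  const items = slice.slice(0, limit);
  const last = items[items.length - 1];
  const nextCursor = slice.length > limit && last ? encodeCursor(last.id) : null;
  return { items, nextCursor };
}

const rows: Row[] = [1, 2, 3, 4, 5].map((id) => ({ id, title: `post ${id}` }));
const page1 = listRows(rows, 2);
const page2 = listRows(rows, 2, page1.nextCursor ?? undefined);
const page3 = listRows(rows, 2, page2.nextCursor ?? undefined);
```

Because each page is defined by "rows after this id" rather than by position, rows inserted while a client is paging do not shift or duplicate items on later pages.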
Error Response Standard (Problem Details)
Use a consistent machine-readable error format (RFC 9457 Problem Details): https://www.rfc-editor.org/rfc/rfc9457
{
  "type": "https://example.com/problems/invalid-request",
  "title": "Invalid request",
  "status": 400,
  "detail": "email is required",
  "instance": "/v1/users"
}
Health Check Patterns
// Liveness: Is the process running?
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'ok' });
});

// Readiness: Can the service handle traffic?
app.get('/health/ready', async (req, res) => {
  const dbOk = await checkDatabase();
  const cacheOk = await checkRedis();
  const ready = dbOk && cacheOk;
  // Report each dependency in the same shape on both paths
  res.status(ready ? 200 : 503).json({
    status: ready ? 'ready' : 'not ready',
    db: dbOk ? 'ok' : 'fail',
    cache: cacheOk ? 'ok' : 'fail',
  });
});
Common Mistakes (Non-Obvious)
| Avoid | Instead | Why |
|-------|---------|-----|
| N+1 queries | include/select or DataLoader | 10-100x perf hit; easy to miss in ORM code |
| No request timeouts | Timeouts on HTTP clients, DB, handlers | Hung deps cascade; see Production Hardening below |
| Missing connection pooling | Prisma pool / PgBouncer / pgx pool | Exhaustion under load on shared DB tiers |
| Catching errors silently | Log + rethrow or handle explicitly | Hidden failures, impossible to debug |
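To make the N+1 row concrete, here is a small sketch of the batching idea behind `include` and DataLoader: collect the parent ids, fetch all children in one query, then group them in memory. The `db` object is a hypothetical in-memory stand-in that counts round trips, and `attachPosts` is an illustrative name, not an ORM API.

```typescript
type User = { id: number; name: string };
type Post = { id: number; userId: number; title: string };

// Hypothetical in-memory "database" that counts round trips.
const db = {
  queries: 0,
  posts: [
    { id: 1, userId: 1, title: "a" },
    { id: 2, userId: 1, title: "b" },
    { id: 3, userId: 2, title: "c" },
  ] as Post[],
  // Stands in for: SELECT ... FROM posts WHERE user_id = ANY($1)
  postsByUserIds(userIds: number[]): Post[] {
    this.queries += 1;
    const wanted = new Set(userIds);
    return this.posts.filter((p) => wanted.has(p.userId));
  },
};

function attachPosts(users: User[]): Array<User & { posts: Post[] }> {
  // One query for all users instead of one query per user.
  const posts = db.postsByUserIds(users.map((u) => u.id));
  const byUser = new Map<number, Post[]>();
  for (const post of posts) {
    const list = byUser.get(post.userId) ?? [];
    list.push(post);
    byUser.set(post.userId, list);
  }
  return users.map((u) => ({ ...u, posts: byUser.get(u.id) ?? [] }));
}

const users: User[] = [
  { id: 1, name: "Ada" },
  { id: 2, name: "Lin" },
  { id: 3, name: "Sam" },
];
const withPosts = attachPosts(users);
```

The query count stays at one regardless of how many users are on the page, which is the property to check for when reviewing ORM code.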
Production Hardening: Patterns Models Skip
These are the patterns that separate "works in dev" from "survives production." Models tend to skip them unless explicitly prompted — add them to every service.
Request & Query Timeouts
Every outbound call needs a timeout. Without one, a hung dependency leaks connections and cascades failures.
// HTTP client timeout
const response = await fetch(url, { signal: AbortSignal.timeout(5000) });
// Database query timeout (Prisma raw SQL; the value is in milliseconds and
// applies only to the connection that runs this statement)
await prisma.$executeRaw`SET statement_timeout = '3000'`;
// Request handler timeout (Fastify plugin registration)
server.register(import('@fastify/timeout'), { timeout: 30000 });
| Layer | Default Timeout | Rationale |
|-------|-----------------|-----------|
| HTTP client calls | 5s | External APIs shouldn't block you |
| Database queries | 3s | Slow queries = missing index or bad plan |
| Request handler | 30s | Safety net for the whole request lifecycle |
| Background jobs | 5min | Jobs that run longer need chunking |
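Per-layer defaults interact: a 5s HTTP timeout is pointless if the enclosing request has only 2s left. One possible way to combine them is to cap each call's timeout by the remaining request budget. The sketch below assumes the defaults from the table above; `timeoutFor` and `DEFAULT_TIMEOUT_MS` are hypothetical names.

```typescript
// Per-layer defaults from the table above, in milliseconds.
const DEFAULT_TIMEOUT_MS = {
  httpClient: 5_000,
  dbQuery: 3_000,
  requestHandler: 30_000,
  backgroundJob: 300_000,
} as const;

type Layer = keyof typeof DEFAULT_TIMEOUT_MS;

// Returns the timeout to use for a call at `layer`, never exceeding what is
// left of the enclosing deadline. Throws once the deadline has passed.
function timeoutFor(layer: Layer, deadlineAt: number, now: number = Date.now()): number {
  const remaining = deadlineAt - now;
  if (remaining <= 0) throw new Error(`deadline exceeded before ${layer} call`);
  return Math.min(DEFAULT_TIMEOUT_MS[layer], remaining);
}

const startedAt = 1_000_000; // fixed clock value so the example is deterministic
const deadlineAt = startedAt + DEFAULT_TIMEOUT_MS.requestHandler;

// Early in the request the layer default applies in full.
const earlyDbTimeout = timeoutFor("dbQuery", deadlineAt, startedAt);
// 28s into a 30s request, only 2s remain for the HTTP call.
const lateHttpTimeout = timeoutFor("httpClient", deadlineAt, startedAt + 28_000);
```

The returned value can be passed straight to `AbortSignal.timeout(...)` for outbound HTTP calls.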
Field-Level Selection (Don't SELECT *)
ORMs default to fetching all columns. On wide tables this wastes bandwidth and hides performance problems.
// BAD: fetches all 30 columns
const users = await prisma.user.findMany({ include: { posts: true } });

// GOOD: fetch only what the endpoint needs
// (Prisma rejects select and include at the same level, so nest the relation inside select)
const users = await prisma.user.findMany({
  select: {
    id: true,
    name: true,
    email: true,
    posts: { select: { id: true, title: true } },
  },
});
For Go (sqlc): write explicit column lists in SQL queries — sqlc enforces this naturally. For Python (SQLAlchemy): use load_only() or explicit column selection.
Structured Error Responses (RFC 9457)
Return machine-readable errors from day one. Clients shouldn't have to regex-parse error messages.
{
  "type": "https://api.example.com/problems/validation-error",
  "title": "Validation failed",
  "status": 422,
  "detail": "email must be a valid email address",
  "instance": "/v1/users",
  "errors": [{ "field": "email", "message": "invalid format" }]
}
Set Content-Type: application/problem+json. RFC 9457 defines this media type, so clients can detect an error body from the header and parse it the same way for every endpoint.
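A minimal sketch of producing such a response from field-level validation failures. `validationProblem` is a hypothetical helper, the `errors` member is an extension field as in the example above, and the `type` URI is the example value from this document rather than a registered problem type.

```typescript
type FieldError = { field: string; message: string };

type ProblemDetails = {
  type: string;
  title: string;
  status: number;
  detail: string;
  instance: string;
  errors: FieldError[]; // extension member carrying per-field failures
};

const PROBLEM_CONTENT_TYPE = "application/problem+json";

// Hypothetical helper: maps validation failures to a 422 problem response.
function validationProblem(
  instance: string,
  errors: FieldError[],
): { headers: Record<string, string>; body: ProblemDetails } {
  const firstError = errors[0];
  return {
    headers: { "Content-Type": PROBLEM_CONTENT_TYPE },
    body: {
      type: "https://api.example.com/problems/validation-error",
      title: "Validation failed",
      status: 422,
      detail: firstError
        ? `${firstError.field}: ${firstError.message}`
        : "request failed validation",
      instance,
      errors,
    },
  };
}

const problemResponse = validationProblem("/v1/users", [
  { field: "email", message: "invalid format" },
]);
```

Keeping the builder in one place means every handler emits the same shape, so clients can branch on `type` and `status` instead of parsing messages.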
Query Plan Verification
Before shipping any new query to production, verify its execution plan:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ... FROM ... WHERE ...;
Red flags in the output: Seq Scan on large tables, Nested Loop with high row estimates, Sort without index. Add indexes or rewrite the query before deploying.
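As a rough illustration, two of these red flags can be caught by scanning EXPLAIN text output, for example in a CI check. This is only a sketch: `findPlanRedFlags` is a hypothetical helper, the 10,000-row threshold is an arbitrary assumption, and `samplePlan` is hand-written in the shape of EXPLAIN text output rather than captured from a real database.

```typescript
type PlanFlag = { kind: "seq-scan" | "sort"; line: string };

// Scans EXPLAIN text output for large sequential scans and explicit sorts.
function findPlanRedFlags(planText: string, maxSeqScanRows = 10_000): PlanFlag[] {
  const flags: PlanFlag[] = [];
  for (const raw of planText.split("\n")) {
    const line = raw.trim();
    // The first rows=N on a plan node line is the planner's estimate.
    const seq = /Seq Scan on \S+.*?rows=(\d+)/.exec(line);
    if (seq && Number(seq[1]) > maxSeqScanRows) {
      flags.push({ kind: "seq-scan", line });
    }
    // A Sort node (not a "Sort Key:" detail line) means no index supplied the order.
    if (/^(->\s*)?Sort\s+\(/.test(line)) {
      flags.push({ kind: "sort", line });
    }
  }
  return flags;
}

// Hand-written sample, not real database output.
const samplePlan = [
  "Sort  (cost=12034.84..12284.84 rows=100000 width=36)",
  "  Sort Key: created_at",
  "  ->  Seq Scan on orders  (cost=0.00..1834.00 rows=100000 width=36)",
].join("\n");

const planFlags = findPlanRedFlags(samplePlan);
const smallTableFlags = findPlanRedFlags(
  "Seq Scan on settings  (cost=0.00..1.05 rows=5 width=64)",
);
```

A sequential scan on a small table is usually fine, which is why the check compares the row estimate against a threshold instead of flagging every `Seq Scan`.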
Performance Debugging Workflow
When a service is slow, work through these layers in order. Fix the cheapest layer first — don't add caching before fixing N+1 queries.
| Step | What to Check | Fix |
|------|---------------|-----|
| 1. Query analysis | Enable query logging, find N+1s and slow queries | Rewrite with include/joins, add select for field-level optimization |
| 2. Indexing | Run EXPLAIN ANALYZE on slow queries | Add composite indexes matching WHERE + ORDER BY patterns |
| 3. Connection pooling | Check connection count vs. pool size | Configure pool limits (Prisma connection_limit, PgBouncer, pgx pool) |
| 4. Caching | Identify read-heavy, rarely-changing data | Add Redis/in-memory cache with TTL + invalidation strategy |
| 5. Timeouts | Check for missing timeouts on DB, HTTP, handlers | Add timeouts at every layer (see Production Hardening above) |
| 6. Platform tuning | Shared DB limits, cold starts, memory | Upgrade tier, add read replicas, tune runtime settings |
Key principle: always measure before and after. Use structured logging with request IDs to trace specific slow requests end-to-end.
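For the caching step in the workflow above, a minimal in-memory sketch of "TTL + invalidation" might look like the following. `TtlCache` is a hypothetical class, not a library API, and the injectable clock is an assumption made so expiry can be exercised without waiting.

```typescript
// Minimal in-memory TTL cache with explicit invalidation.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Call on writes so readers do not see data older than the last update.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}

let fakeNow = 0;
const planCache = new TtlCache<string>(60_000, () => fakeNow);

planCache.set("plan:free", "Free tier");
const fresh = planCache.get("plan:free"); // still within the 60s TTL

planCache.set("plan:pro", "Pro tier");
planCache.invalidate("plan:pro");
const invalidated = planCache.get("plan:pro"); // removed by explicit invalidation

fakeNow = 60_000;
const expired = planCache.get("plan:free"); // TTL has elapsed
```

The TTL bounds staleness when an invalidation is missed, while explicit invalidation keeps the common write path fresh; relying on only one of the two tends to cause either stale reads or low hit rates.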
Infrastructure Economics
Backend architecture decisions directly impact cost and revenue. See references/infrastructure-economics.md for detailed cost modeling, SLA-to-revenue mapping, unit economics checklists, and FinOps practices.
Navigation
Resources
- references/backend-best-practices.md - Template authoring guide, quality checklist, and shared utilities pointers
- references/edge-deployment-guide.md - Edge computing patterns, Cloudflare Workers vs Vercel Edge, tRPC, Hono, Bun
- references/infrastructure-economics.md - Cost modeling, performance SLAs -> revenue, FinOps practices, cloud optimization
- references/go-best-practices.md - Go idioms, concurrency, error handling, GORM usage, testing, profiling
- references/rust-best-practices.md - Ownership, async, Axum, SeaORM, error handling, testing
- references/python-best-practices.md - FastAPI, SQLAlchemy, async patterns, validation, testing, performance
- references/nodejs-best-practices.md - Event loop, async patterns, Express/Fastify/NestJS/Hono, error handling, memory management, security, profiling
- references/csharp-best-practices.md - C# 14 / .NET 10 LTS, extension members, field keyword, ASP.NET Core 10 (validation, SSE, OpenAPI 3.1), EF Core 10 (LeftJoin, named filters), HybridCache, Polly v8 resilience
- references/database-patterns.md - PostgreSQL patterns (JSONB, CTEs, partitioning), connection pooling, migration strategies, ORM comparison, index design
- references/message-queues-background-jobs.md - BullMQ patterns, broker comparison (Redis/SQS/Kafka/RabbitMQ), idempotent jobs, DLQ, scheduling, delivery guarantees
- data/sources.json - External references per language/runtime
- Shared checklists: ../software-clean-code-standard/assets/checklists/backend-api-review-checklist.md, ../software-clean-code-standard/assets/checklists/secure-code-review-checklist.md
Shared Utilities (Centralized patterns - extract, don't duplicate)
- ../software-clean-code-standard/utilities/auth-utilities.md - Argon2id, jose JWT, OAuth 2.1/PKCE
- ../software-clean-code-standard/utilities/error-handling.md - Effect Result types, correlation IDs
- ../software-clean-code-standard/utilities/config-validation.md - Zod 3.24+, Valibot, secrets management
- ../software-clean-code-standard/utilities/resilience-utilities.md - p-retry v6, opossum v8, OTel spans
- ../software-clean-code-standard/utilities/logging-utilities.md - pino v9 + OpenTelemetry integration
- ../software-clean-code-standard/utilities/testing-utilities.md - Vitest, MSW v2, factories, fixtures
- ../software-clean-code-standard/utilities/observability-utilities.md - OpenTelemetry SDK, tracing, metrics
- ../software-clean-code-standard/references/clean-code-standard.md - Canonical clean code rules (CC-*) for citation
Templates
- assets/nodejs/template-nodejs-prisma-postgres.md - Node.js + Prisma + PostgreSQL
- assets/go/template-go-fiber-gorm.md - Go + Fiber + GORM + PostgreSQL
- assets/rust/template-rust-axum-seaorm.md - Rust + Axum + SeaORM + PostgreSQL
- assets/python/template-python-fastapi-sqlalchemy.md - Python + FastAPI + SQLAlchemy + PostgreSQL
- assets/csharp/template-csharp-aspnet-efcore.md - C# + ASP.NET Core + Entity Framework Core + PostgreSQL
Related Skills
- ../software-architecture-design/SKILL.md - System decomposition, SLAs, and data flows
- ../software-security-appsec/SKILL.md - Authentication/authorization and secure API design
- ../ops-devops-platform/SKILL.md - CI/CD, infrastructure, and deployment safety
- ../qa-resilience/SKILL.md - Resilience, retries, and failure playbooks
- ../software-code-review/SKILL.md - Review checklists and standards for backend changes
- ../qa-testing-strategy/SKILL.md - Testing strategies, test pyramids, and coverage goals
- ../dev-api-design/SKILL.md - RESTful design, GraphQL, and API versioning patterns
- ../data-sql-optimization/SKILL.md - SQL optimization, indexing, and query tuning patterns
Freshness Protocol
When users ask version-sensitive recommendation questions, do a quick freshness check before asserting "best" choices or quoting versions.
Trigger Conditions
- "What's the best backend framework for [use case]?"
- "What should I use for [API design/auth/database]?"
- "What's the latest in Node.js/Go/Rust?"
- "Current best practices for [REST/GraphQL/tRPC]?"
- "Is [framework/runtime] still relevant in 2026?"
- "[Express] vs [Fastify] vs [Hono]?"
- "Best ORM for [database/use case]?"
How to Freshness-Check
- Start from data/sources.json (official docs, release notes, support policies).
- Run a targeted web search for the specific component and open release notes/support policy pages.
- Prefer official sources over blogs for versions and support windows.
What to Report
- Current landscape: what is stable and widely used now
- Emerging trends: what is gaining traction (and why)
- Deprecated/declining: what is falling out of favor (and why)
- Recommendation: default choice + 1-2 alternatives, with trade-offs
Example Topics (verify with fresh search)
- Node.js LTS support window and major changes
- Bun vs Deno vs Node.js
- Hono, Elysia, and edge-first frameworks
- Drizzle vs Prisma for TypeScript
- tRPC and end-to-end type safety
- Edge computing and serverless patterns
- .NET 10 LTS (Nov 2025) and C# 14 adoption
- ASP.NET Core 10 built-in validation vs FluentValidation
- EF Core 10 vs Dapper for C# data access
- HybridCache vs manual IMemoryCache + IDistributedCache
Operational Playbooks
- references/operational-playbook.md - Full backend architecture patterns, checklists, TypeScript notes, and decision tables
Fact-Checking
- Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
- Prefer primary sources; report source links and dates for volatile information.
- If web access is unavailable, state the limitation and mark guidance as unverified.