# vLLM Studio Backend Architecture

## Overview
This skill explains how the backend is wired: controller runtime, OpenAI-compatible proxy, Pi-mono agent loop, LiteLLM gateway, and inference process management.
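As a rough orientation, here is a minimal TypeScript sketch of how those pieces relate. Everything in it is an illustrative assumption, not the project's real API: `LITELLM_BASE_URL`, `startVllmProcess`, `proxyToGateway`, and the ports are hypothetical names; see references/backend-architecture.md for the authoritative component map.

```typescript
// Hypothetical wiring sketch: controller fronts everything, LiteLLM sits
// between it and the vLLM inference process the backend manages.
import { spawn, type ChildProcess } from "node:child_process";

// Assumed gateway address; the real configuration may differ.
const LITELLM_BASE_URL = process.env.LITELLM_BASE_URL ?? "http://127.0.0.1:4000";

// Inference process management: a supervisor owns the vLLM server's
// lifecycle, e.g. restarting it when a different model is requested.
function startVllmProcess(model: string): ChildProcess {
  return spawn("vllm", ["serve", model, "--port", "8000"], { stdio: "inherit" });
}

// OpenAI-compatible proxy: /v1/* traffic is forwarded to the LiteLLM
// gateway, which routes to the running vLLM server.
async function proxyToGateway(path: string, body: unknown): Promise<Response> {
  return fetch(`${LITELLM_BASE_URL}${path}`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
}
```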
## When To Use

- Modifying controller routes or run streaming.
- Debugging OpenAI-compatible endpoint behavior.
- Updating the Pi-mono agent runtime or tool execution.
- Understanding how inference + LiteLLM fit together.
## Quick Start

- Read references/backend-architecture.md for the component map and data flow.
- Read references/openai-compat.md for /v1/models and /v1/chat/completions behavior.
- Read references/backend-commands.md for useful run/debug commands.
## Core Guarantees

- Keep the OpenAI-compatible endpoints stable (/v1/models, /v1/chat/completions); see the proxy sketch after this list.
- The /chat UI uses the controller run stream (/chats/:id/turn) and the Pi-mono runtime.
- Tool execution happens server-side (MCP + AgentFS + plan tools), as in the turn-stream sketch after this list.
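For concreteness, here is a minimal, hypothetical Express-style sketch of the two stable endpoints. Only the route paths come from the guarantee above; `express`, the `GATEWAY` address, and the pass-through behavior are assumptions about how the proxy might look, not its actual implementation.

```typescript
// Assumed Express app fronting the LiteLLM gateway.
import express from "express";

const app = express();
app.use(express.json());
const GATEWAY = process.env.LITELLM_BASE_URL ?? "http://127.0.0.1:4000";

// /v1/models: report whatever the gateway currently exposes.
app.get("/v1/models", async (_req, res) => {
  const upstream = await fetch(`${GATEWAY}/v1/models`);
  res.status(upstream.status).json(await upstream.json());
});

// /v1/chat/completions: pass the body through unchanged so stock OpenAI
// clients keep working; relay SSE bytes verbatim when streaming is requested.
app.post("/v1/chat/completions", async (req, res) => {
  const upstream = await fetch(`${GATEWAY}/v1/chat/completions`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(req.body),
  });
  res.status(upstream.status);
  if (req.body.stream && upstream.body) {
    res.setHeader("content-type", "text/event-stream");
    for await (const chunk of upstream.body) res.write(chunk);
    res.end();
  } else {
    res.json(await upstream.json());
  }
});
```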
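And a similarly hedged sketch of the controller run stream, continuing the `app` from the previous block. `runAgentTurn`, `executeTool`, and the event shapes are stand-ins for the Pi-mono runtime and the server-side tool layer (MCP + AgentFS + plan tools); the real types will differ.

```typescript
// Assumed event shape for one agent turn; illustrative only.
type AgentEvent =
  | { type: "text"; delta: string }
  | { type: "tool_call"; name: string; args: unknown; result?: unknown };

// Stand-in for the Pi-mono agent loop: yields events for one turn.
declare function runAgentTurn(chatId: string, message: string): AsyncIterable<AgentEvent>;
// Stand-in for the server-side tool layer (MCP + AgentFS + plan tools).
declare function executeTool(name: string, args: unknown): Promise<unknown>;

// The /chat UI consumes this as a server-sent event stream.
app.post("/chats/:id/turn", async (req, res) => {
  res.setHeader("content-type", "text/event-stream");
  for await (const event of runAgentTurn(req.params.id, req.body.message)) {
    // Tool calls are resolved here on the server; the browser never runs tools.
    if (event.type === "tool_call") {
      event.result = await executeTool(event.name, event.args);
    }
    res.write(`data: ${JSON.stringify(event)}\n\n`);
  }
  res.write("data: [DONE]\n\n");
  res.end();
});
```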
## References

- references/backend-architecture.md
- references/openai-compat.md
- references/backend-commands.md