dashboard (overview + applications, AI-agent first)
CRITICAL — before doing ANYTHING in this subtree, MUST Read ../olares-shared/SKILL.md for profile selection, login, factory-injected *http.Client, and HTTP 401/403 recovery rules. Every dashboard verb depends on that foundation.
This file is the source of truth for the dashboard subtree. Any code generation, refactor or fix touching cli/cmd/ctl/dashboard/**, cli/pkg/dashboard/**, cli/pkg/dashboard/format/**, cli/pkg/olares/id.go::DashboardURL, or cli/pkg/credential/{types,default_provider}.go::DashboardURL MUST first Read this file end-to-end and respect the Iteration red-lines section. Do not "modernize", "simplify" or "consolidate" anything listed under Frozen modules without an explicit user-approved plan that supersedes this skill.
Scope (frozen)
The dashboard CLI mirrors only the SPA routes that are still wired in packages/app/src/apps/dashboard/router/routes.ts:
- Overview2/IndexPage.vue (overview tree) + the per-detail-page subtrees: Overview2/CPU, Memory, Disk, Pods, Network, Fan, GPU.
- Applications2/IndexPage.vue (applications tree).
Everything else in the SPA (legacy Overview/, audit, settings, …) is out of scope. Do not add commands for deprecated routes; reject any request that asks for them with a pointer to the route file.
Architecture
The dashboard CLI is a two-layer split with mirrored subpackage trees on BOTH sides:
- cmd shell under cli/cmd/ctl/dashboard/ — one directory per command-tree parent, one .go file per leaf. Each leaf's RunE is a thin validate flags → prepareClient → pkg<area>.RunXxx(...) body. Per-area common.go is ≤ 70 lines and carries only var common, prepareClient, unknownSubcommandRunE.
- pkg layer under cli/pkg/dashboard/, split in two:
  - the pkg ROOT (cli/pkg/dashboard/*.go) owns infrastructure and shared fetchers — Client, Runner, CommonFlags, Envelope/Item/Meta, Fetch* helpers, BuildRankingEnvelope, capability gates, the format pkg, the schema bundle. No command-specific envelope builders or table writers live here.
  - the pkg subpackages (cli/pkg/dashboard/applications/, cli/pkg/dashboard/overview/, cli/pkg/dashboard/overview/{disk,fan,gpu}/) mirror the cmd parent dirs. Each carries one <verb>.go file per leaf, exposing a single RunXxx(ctx, c, cf, ...) entry plus its BuildXxxEnvelope / WriteXxxTable triplet. Tests sit alongside as <verb>_test.go (one shared helpers_test.go per area for httptest fixtures).
Every fetch / merge / aggregate / format call goes through pkg —
either the root infra or, for envelope assembly, the matching pkg
subpackage. The cmd shell never touches HTTP, never builds an
envelope, never writes a table, never runs Runner directly.
flowchart LR
User[olares-cli dashboard <verb>] --> Factory[cmdutil.Factory<br/>refreshingTransport]
Factory --> CmdShell["cli/cmd/ctl/dashboard/<area>/<leaf>.go<br/>(thin cobra shell:<br/>validate → prepareClient →<br/>pkg<area>.RunXxx)"]
CmdShell --> PkgArea["cli/pkg/dashboard/<area>/<leaf>.go<br/>(RunXxx + BuildXxxEnvelope +<br/>WriteXxxTable + watch loop)"]
PkgArea --> PkgRoot["cli/pkg/dashboard/<br/>(infra: Envelope, Client, Runner,<br/>fetchers, aggregators, gates, schema)"]
PkgRoot -->|/capi/* + /kapis/* + /hami/*| BFF["dashboard.<terminus><br/>(ks-apiserver-proxy)"]
BFF -->|wildcard proxy| KsAPI[ks-apiserver]
PkgRoot --> Format["cli/pkg/dashboard/format<br/>(1:1 JS port)"]
PkgArea --> Output["WriteJSON / WriteTable<br/>(pkgdashboard.*)"]
Output --> User
cmd area ↔ pkg subpackage mapping — every cmd leaf is a thin
prepareClient + pkg<area>.RunXxx(...) shell. The heavy logic
(envelope builders, table writers, fan-out, watch loops) lives in a
mirror subpackage under cli/pkg/dashboard/<area>/. The pkg root
(cli/pkg/dashboard/) only owns infrastructure and shared fetchers.
| cmd area | pkg subpackage entry points (Run*) | shared infra it pulls from cli/pkg/dashboard/ |
|---|---|---|
cmd/ctl/dashboard/root.go + options.go | n/a — pure cobra wiring + bindPersistentFlags | flags.go (CommonFlags), runner.go, emit.go, output.go |
cmd/ctl/dashboard/applications/root.go | pkg/dashboard/applications/list.go → RunList | ranking.go, workloads.go, apps.go, format/... |
cmd/ctl/dashboard/overview/{physical,user,ranking,root}.go | pkg/dashboard/overview/{physical,user,ranking,default}.go → RunPhysical/RunUser/RunRanking/RunDefault | monitoring.go, workloads.go, ranking.go, users.go, numbers.go |
cmd/ctl/dashboard/overview/{cpu,memory,pods,network}.go | pkg/dashboard/overview/{cpu,memory,pods,network}.go → RunCPU/RunMemory/RunPods/RunNetwork (built on nodes.go's RunPerNodeMetric scaffold) | monitoring.go, system.go (network → capi /system/ifs), numbers.go |
cmd/ctl/dashboard/overview/disk/{root,main,partitions}.go | pkg/dashboard/overview/disk/{default,main,partitions}.go → RunDefault/RunMain/RunPartitions | monitoring.go (node_disk_*), lsblk.go, numbers.go, format/... |
cmd/ctl/dashboard/overview/fan/{root,live,curve}.go | pkg/dashboard/overview/fan/{default,live,curve}.go → RunDefault/RunLive/RunCurve | system.go (FetchSystemFan), gates.go (GateOlaresOne), gpu.go (graphics list co-render), fan_curve.go |
cmd/ctl/dashboard/overview/gpu/{root,list,tasks,get,task,detail,task_detail}.go | pkg/dashboard/overview/gpu/{list,tasks,get,task,detail,task_detail}.go → RunList/RunTasks/RunGet/RunTask/RunDetail/RunTaskDetail (+ specs.go for the gauge/trend catalogue and concurrent fan-out) | gpu.go, gpu_format.go, gpu_query.go, gates.go, client.go |
cmd/ctl/dashboard/schema/root.go | n/a — schema loader stays in pkg root | schema.go (loader + go:embed bundle) |
- HTTP base = ResolvedProfile.DashboardURL = https://dashboard.<localPrefix><terminusName> (derived in cli/pkg/olares/id.go DashboardURL).
- All requests go through the factory's refreshingTransport — header injection + 401 retry happen for free; dashboard code MUST NOT touch X-Authorization or instantiate its own http.Client.
- Per-command leaf / aggregate decision is hard-coded; do not flip them.
- The shared *pkgdashboard.CommonFlags is bound once in cmd-root (bindPersistentFlags(&common, cmd)), then passed by pointer to every area factory (overview.NewOverviewCommand(f, &common) etc.). Each area subpackage stores it in a package-level var common *pkgdashboard.CommonFlags so leaf RunE bodies keep common.Output / common.Validate() / common.Timezone selectors; see the wiring sketch below.
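How that pointer threads through, as a sketch. bindPersistentFlags, NewDashboardCommand and overview.NewOverviewCommand are the documented names; the applications constructor name, cmd import paths and flag surface are assumptions, not copied from the repo.

// cli/cmd/ctl/dashboard/root.go (sketch) — one CommonFlags struct, shared by pointer.
package dashboard

import (
    "github.com/spf13/cobra"

    "github.com/beclab/Olares/cli/cmd/ctl/dashboard/applications"
    "github.com/beclab/Olares/cli/cmd/ctl/dashboard/overview"
    "github.com/beclab/Olares/cli/pkg/cmdutil"
    pkgdashboard "github.com/beclab/Olares/cli/pkg/dashboard"
)

var common pkgdashboard.CommonFlags

func NewDashboardCommand(f *cmdutil.Factory) *cobra.Command {
    cmd := &cobra.Command{
        Use:           "dashboard",
        Short:         "Overview + applications views (AI-agent first)",
        SilenceUsage:  true,
        SilenceErrors: true,
    }
    // cobra mutates &common via persistent-flag inheritance before any leaf RunE fires.
    bindPersistentFlags(&common, cmd)

    // Every area factory receives the SAME pointer; leaves later read common.Output etc.
    cmd.AddCommand(overview.NewOverviewCommand(f, &common))
    cmd.AddCommand(applications.NewApplicationsCommand(f, &common)) // constructor name assumed
    return cmd
}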
JSON envelope (frozen shapes)
Every dashboard command emits exactly one of two shapes; the choice is fixed per command and MUST NOT be changed.
Shape A — leaf
Used by every command except parent commands with a sections envelope (see Shape B).
{
"kind": "dashboard.<area>.<verb>",
"meta": { "fetched_at": "...", "iteration": 0, "recommended_poll_seconds": 60, "empty": false, "empty_reason": "", "error": "", "http_status": 200 },
"items": [ { "raw": { /* upstream-shape, units in source unit */ }, "display": { /* table-friendly strings */ } } ]
}
- raw is the canonical machine-friendly shape: numbers as numbers, timestamps as Unix seconds, temperatures as raw Celsius (NOT converted by --temp-unit). Agents pin on raw.
- display is human-presentation only; values are formatted via the format pkg with the current --temp-unit / --timezone. Agents MUST NOT pin on display — it can change with locale / format-pkg fixes.
- meta.recommended_poll_seconds — the page-level polling cadence the SPA uses; agents driving --watch SHOULD respect it.
- meta.iteration — present in every --watch payload, 1-based.
- meta.empty / meta.empty_reason — three-state empty data; see Three-state empty data.
- meta.error — only set on a failed --watch iteration or a per-section failure inside Shape B.
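A minimal Go sketch of the Shape A types, derived from the JSON keys above. The real declarations live in pkg/dashboard/output.go / envelope.go and carry additional omitempty fields (warnings, window, device_name, …); this is illustrative only.

// Illustrative Shape A types; field names mirror the documented JSON keys.
type Meta struct {
    FetchedAt              string `json:"fetched_at"`
    Iteration              int    `json:"iteration"`
    RecommendedPollSeconds int    `json:"recommended_poll_seconds"`
    Empty                  bool   `json:"empty"`
    EmptyReason            string `json:"empty_reason"`
    Error                  string `json:"error"`
    HTTPStatus             int    `json:"http_status"`
    Note                   string `json:"note,omitempty"`
}

type Item struct {
    Raw     map[string]any    `json:"raw"`     // canonical, machine-friendly — agents pin on this
    Display map[string]string `json:"display"` // human-only; rendered with --temp-unit / --timezone
}

type Envelope struct {
    Kind  string `json:"kind"`
    Meta  Meta   `json:"meta"`
    Items []Item `json:"items"`
}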
Shape B — sections envelope
Used by parent commands that aggregate multiple sub-views:
| Parent command | Sections | Section kinds |
|---|---|---|
dashboard overview | physical / user / ranking | dashboard.overview.physical / .user / .ranking |
dashboard overview disk | main / partitions | dashboard.overview.disk.main / .partitions (the latter is itself a sections envelope, keyed by device) |
dashboard overview fan | live / curve | dashboard.overview.fan.live / .curve |
dashboard schema | n/a — emits Shape A index | dashboard.schema.index |
{
"kind": "dashboard.overview",
"meta": { ... },
"sections": {
"physical": { "kind": "dashboard.overview.physical", "meta": {...}, "items": [...] },
"user": { "kind": "dashboard.overview.user", "meta": {...}, "items": [...] },
"ranking": { "kind": "dashboard.overview.ranking", "meta": {...}, "items": [...] }
}
}
- Sections are fetched concurrently; a single failed section degrades to meta.error="..." on that section, the others still return (see the fan-out sketch below).
- The section names (the Section column above) are part of the contract — do not rename or drop.
Kind constants
Declared in cli/pkg/dashboard/schema.go. AllKinds() MUST stay 1:1 with the actual command surface. Adding a command means:
- Add a KindXxx constant in pkg/dashboard/schema.go.
- Append it to AllKinds() (same file).
- Add cli/pkg/dashboard/schemas/<kind-without-namespace>.json.
- Register it in LoadSchemaIndex() (same file).
- Bump golden tests if the new shape appears in fixtures.
- Add a leaf file in the matching cmd/ctl/dashboard/<area>/ subdirectory and AddCommand it from the area's root.go.
Do NOT rename existing Kind values — agents rely on string equality.
The authoritative listing of every command, its envelope shape, and
its default-action semantics is the cmd/pkg directory tree itself
(see File layout) and olares-cli dashboard --help. Run olares-cli dashboard schema -o json for the live dashboard.<area>.<verb> Kind index.
Frozen modules
The following are load-bearing. Touching them requires a separate user-confirmed plan; otherwise reject the request with this list.
| Frozen module | Why it's frozen | Where |
|---|---|---|
Two-layer split with mirrored cmd/<area> ↔ pkg/dashboard/<area> subpackages: cmd is a thin cobra shell that calls pkg<area>.RunXxx(...); pkg owns every envelope builder, table writer, watch loop and fan-out. The pkg root holds only infrastructure + shared fetchers. | Architectural decision (P1–P7); each leaf's heavy logic is testable directly via RunXxx without dragging cobra in, and cmd-side line budgets stay flat as the surface grows | cli/cmd/ctl/dashboard/ + cli/pkg/dashboard/ |
| Directory tree mirrors command tree on BOTH sides | One file per leaf in cmd (<verb>.go) AND in the matching pkg subpackage (<verb>.go exposing one RunXxx entry); root.go+common.go per cmd parent, default.go per pkg parent — prevents drift between dashboard --help, the cmd ls, and the pkg ls | every cmd subdirectory under cli/cmd/ctl/dashboard/ and every pkg subpackage under cli/pkg/dashboard/<area>/ |
Kind constants + AllKinds() | Public agent contract | pkg/dashboard/schema.go |
| Envelope shapes A / B | Public agent contract | pkg/dashboard/output.go |
format pkg signatures + JS-parity behaviour | Backed by golden.json oracle; even rounding quirks are intentional | pkg/dashboard/format/format.go |
FetchClusterMetrics / FetchNodeMetrics / FetchUserMetric shared fetchers | Single upstream call powers multiple commands; must not fork | pkg/dashboard/monitoring.go |
FetchWorkloadsMetrics dual-fetch + merge + BuildRankingEnvelope | Single pkg function powers overview ranking + applications first-row parity; keep the dual-fetch shape | pkg/dashboard/workloads.go, pkg/dashboard/ranking.go |
FetchSystemIFS / FetchSystemFan / graphics+task helpers | Wire shape pinned in tests; capi vs hami URL prefixes are part of the contract | pkg/dashboard/system.go, pkg/dashboard/gpu.go |
FanCurveTable + FanSpeedMaxCPU + FanSpeedMaxGPU constants | 1:1 with SPA Fan/config.ts.tableData; CLI ships them as Go constants because the SPA does too | pkg/dashboard/fan_curve.go |
| Parent-command default = sections envelope (overview, overview disk, overview fan) | Plan decision; consumers branch on kind == "dashboard.overview.*" to know whether to expect sections or items | overview/root.go, overview/disk/root.go, overview/fan/root.go |
Runner + --watch semantics (interval / iterations / timeout / 3-consecutive-failure exit / NDJSON per-iteration / SIGINT graceful / ErrTokenInvalidated short-circuit) | Plan decision; agents script around it | pkg/dashboard/runner.go |
--since vs --start --end mutual-exclusion + sliding-window rules | Plan decision (since_slides_abs_repeats) | pkg/dashboard/flags.go Validate / ResolveWindow |
| Three-state empty-data semantics | Distinguishes "feature not installed" from "no hardware" without throwing | overview gpu {list,tasks} / overview fan live (cmd) ↔ pkg/dashboard/gates.go |
Client.EnsureUser sync.Once cache + RequireAdmin / ResolveTargetUser guards | Auth correctness for --user / overview user <name> | pkg/dashboard/client.go |
ResolvedProfile.DashboardURL derivation chain | URL is derived, not configured | cli/pkg/olares/id.go, cli/pkg/credential/types.go, cli/pkg/credential/default_provider.go |
Cobra SilenceUsage = true + SilenceErrors = true on every dashboard cmd | Clean machine-readable error text; pinned by TestAllLeafCommandsSilenced in cmd-root | every cobra.Command literal under cmd/ctl/dashboard/ |
Default output = table | Plan decision (table_default) | pkg/dashboard/output.go ParseOutputFormat |
Dependency direction (cmd side) — cli/cmd/ctl/dashboard/<area> may import cli/pkg/dashboard, cli/pkg/dashboard/<area>/..., and its own cobra children only; horizontal cmd↔cmd imports are FORBIDDEN. The cmd shell is forbidden from owning envelope shapes / table writers / fan-out — those belong in the matching pkg/dashboard/<area> subpackage. | Architectural integrity; prevents the circular-import class of bugs that killed the 1.x layout, and keeps the cmd surface trivially line-budgeted | every import block under cli/cmd/ctl/dashboard/ |
Dependency direction (pkg side) — cli/pkg/dashboard/<area>/... may import cli/pkg/dashboard (the root infra: Client, Runner, fetchers, format, numbers, gpu_format, gpu_query, gates, schema, output, emit, lsblk, fan_curve); the pkg root MUST NOT import any subpackage (no upward import). Sibling subpackages (e.g. overview/disk ↔ overview/fan) MUST NOT import each other — if two areas need the same heavy helper, hoist it to cli/pkg/dashboard/ (the BuildRankingEnvelope precedent). | Keeps the pkg root independently testable + reusable, and keeps each <area> subpackage a self-contained mirror of its cmd sibling | every import block under cli/pkg/dashboard/ |
Three-state empty data
Optional hardware (GPU / fan) and optional integrations have three legitimate "empty" states; the CLI distinguishes them in meta.empty_reason so agents can branch without parsing prose.
| Upstream response | meta.empty | meta.empty_reason | Example |
|---|---|---|---|
| HTTP 404 | true | no_<feature>_integration | HAMI vGPU not installed → dashboard overview gpu list |
| HTTP 200, empty body / empty metric | true | no_<feature>_detected | Server has no fans → dashboard overview fan live |
| HTTP 200, non-empty | false | "" | Normal case |
| Any other 4xx/5xx | n/a (envelope meta.error) | n/a | Real failure → propagate via watch error path |
Forbidden: turning a 404 into an error, or merging the two reasons into a single "empty=true" — agents currently key on the reason.
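The same mapping as a sketch; the function name and signature are illustrative — the real logic sits behind the fetchers and gates in the pkg root, and the GPU subtree additionally maps 5xx to vgpu_unavailable (see Capability gates below). Needs net/http.

// classifyEmpty turns an upstream response into the three-state meta fields.
// feature is the lowercase feature name, e.g. "vgpu" or "fan".
func classifyEmpty(status, itemCount int, feature string) (empty bool, reason string, realFailure bool) {
    switch {
    case status == http.StatusNotFound:
        return true, "no_" + feature + "_integration", false // integration absent
    case status >= 200 && status < 300 && itemCount == 0:
        return true, "no_" + feature + "_detected", false // integration up, nothing there
    case status >= 200 && status < 300:
        return false, "", false // normal case
    default:
        return false, "", true // other 4xx/5xx: propagate via meta.error / watch error path
    }
}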
Capability gates (overview fan / overview gpu)
The fan / gpu subtrees mirror the SPA's gates — but with two different strengths because the SPA itself treats them differently:
| Subtree | Gate strength | Source check |
|---|---|---|
overview fan * (default / live / curve) | Hard — empty envelope before any fetch | Fan.ts:67 (isOlaresOneDevice); fan data is meaningless on non-Olares One hardware |
overview gpu * (list / tasks / get / task) | Soft — advisory meta.note + stderr hint, then proceed with the HAMI fetch | ClusterResource.vue:232+278 hides the sidebar entry for non-admins / no-CUDA clusters; the GPU detail pages themselves do not hard-block |
Hard gate (fan) — empty envelope shape
{
"kind": "dashboard.overview.fan.live",
"meta": {
"empty": true,
"empty_reason": "not_olares_one",
"note": "Fan / cooling integration is only available on Olares One devices",
"device_name": "DIY-PC",
"fetched_at": "..."
},
"items": []
}
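A sketch of that gate. EnsureSystemStatus, IsOlaresOne and GateOlaresOne are the documented helpers, but the exact signatures and return shapes here are assumptions; Envelope/Item/Meta are the illustrative Shape A types from above.

// gateFan runs before any fetch (and inside Runner.Iter under --watch).
// Returning open=false means: emit this empty envelope, exit 0, fetch nothing.
func gateFan(ctx context.Context, c *Client, kind string) (env *Envelope, open bool, err error) {
    if err := c.EnsureSystemStatus(ctx); err != nil { // cached per Client (sync.Once)
        return nil, false, err
    }
    if c.IsOlaresOne() {
        return nil, true, nil // gate open — proceed with the real fan fetch
    }
    e := &Envelope{Kind: kind, Items: []Item{}}
    e.Meta.Empty = true
    e.Meta.EmptyReason = "not_olares_one"
    e.Meta.Note = "Fan / cooling integration is only available on Olares One devices"
    // the real meta also carries device_name, as in the JSON above
    return e, false, nil
}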
Soft gate (gpu) — advisory, not a block
The CLI ALWAYS calls HAMI for GPU verbs. The SPA-equivalent "would-have-been-hidden" reason is recorded as meta.note and (in table mode) as a one-liner on stderr. The actual meta.empty / empty_reason are determined by HAMI's response:
// non-admin caller, HAMI returned an empty list:
{
"kind": "dashboard.overview.gpu.list",
"meta": {
"empty": true,
"empty_reason": "no_gpu_detected",
"note": "GPU sidebar entry is hidden for non-admin profiles in the SPA; HAMI was queried directly",
...
},
"items": []
}
// any caller, HAMI returned 5xx:
{
"kind": "dashboard.overview.gpu.list",
"meta": {
"empty": true,
"empty_reason": "vgpu_unavailable",
"http_status": 500,
"error": "unknown request error",
"note": "HAMI vGPU controller responded with Internal Server Error; the integration is installed but unhealthy",
...
},
"items": []
}
empty_reason taxonomy:
| Reason | Where | Meaning |
|---|---|---|
not_olares_one | fan default / live / curve (hard gate) | Active device's device_name is not Olares One |
no_fan_integration | fan live (HTTP 404 fallback) | /user-service/api/mdns/olares-one/cpu-gpu returned 404 |
no_vgpu_integration | gpu list / tasks / get / task (HTTP 404) | HAMI vGPU controller absent |
vgpu_unavailable | gpu list / tasks / get / task (HTTP 5xx) | HAMI installed but unhealthy. meta.error carries the upstream message; meta.http_status carries the original status |
no_gpu_detected | gpu list / tasks / get / task (HTTP 200, empty body) | HAMI healthy but device list / detail is empty |
Soft-gate advisories never appear as empty_reason; they live in meta.note:
| Note | When |
|---|---|
GPU sidebar entry is hidden for non-admin profiles in the SPA; HAMI was queried directly | meta.user.globalrole != platform-admin (and not admin) |
no node carries gpu.bytetrade.io/cuda-supported=true; SPA hides the GPU card. HAMI was queried directly | Admin caller, but the cluster's node list has no CUDA-supported label |
When BOTH a HAMI failure AND a soft advisory apply, meta.note is the concatenation "<advisory> | <hami-explanation>".
HAMI wire-shape contract (paid in blood)
HAMI's WebUI returns the payload at the top level for every endpoint we call — there is no data envelope around it. Adding one in the decoder yields a silent "0 GPUs" / "0 tasks" failure even on machines where the SPA shows devices.
| Endpoint | Real wire shape | Decoder target |
|---|---|---|
POST /hami/api/vgpu/v1/gpus | {"list":[ {graphics...} ]} | struct{ List []map[string]any }, return raw.List |
POST /hami/api/vgpu/v1/containers | {"items":[ {task...} ]} | struct{ Items []map[string]any }, return raw.Items |
GET /hami/api/vgpu/v1/gpu?uuid=… | { ...graphics fields... } | map[string]any, return verbatim |
GET /hami/api/vgpu/v1/container?name=…&podUid=… | { ...task fields... } | map[string]any, return verbatim |
The fetchers in pkg/dashboard/gpu.go are pinned to this shape; TestFetchGraphicsList_ParsesTopLevelList / TestFetchTaskList_ParsesTopLevelItems / TestFetchGraphicsDetail_ReturnsBodyAsIs / TestFetchTaskDetail_ReturnsBodyAsIs enforce the contract (in cli/pkg/dashboard/dashboard_test.go).
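A self-contained decode sketch (plain encoding/json; the real fetchers go through Client.DoJSON) showing why the payload must be read at the top level — the sample values are invented.

package main

import (
    "encoding/json"
    "fmt"
)

// graphicsListBody mirrors the real wire shape: the list sits at the TOP level.
type graphicsListBody struct {
    List []map[string]any `json:"list"`
}

func main() {
    // What HAMI actually returns for POST /hami/api/vgpu/v1/gpus (values invented).
    wire := []byte(`{"list":[{"uuid":"GPU-0123abcd","type":"NVIDIA RTX 4090","coreUtilizedPercent":37.5}]}`)

    var body graphicsListBody
    if err := json.Unmarshal(wire, &body); err != nil {
        panic(err)
    }
    // Decoding into a {"data": {...}} wrapper instead would silently yield 0 devices.
    fmt.Println(len(body.List)) // 1
}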
GPU/task field names that round-trip from HAMI (used in Raw envelopes — agents can pull anything not surfaced in the table):
gpu list raw fields | gpu tasks raw fields |
|---|---|
uuid, type, shareMode, nodeName, nodeUid, health (bool), coreUtilizedPercent (already a percent — NOT a 0..1 ratio), coreUsed, coreTotal, memoryUsed / memoryUtilized / memoryTotal (all MiB), memoryUtilizedPercent, vgpuUsed, vgpuTotal, power, powerLimit (W), temperature (°C), node, mode | name, status, podUid, nodeName, nodeUid, type, deviceShareModes[], devicesCoreUtilizedPercent[], devicesMemUtilized[], allocatedDevices, allocatedCores, allocatedMem, createTime, startTime, endTime, namespace, deviceIds[], resourcePool, flavor, priority, appName |
Common renames the SPA does NOT do (we don't either — earlier revisions guessed at e.g. modelName / hostname / usedMem / totalMem / coreUtilization / sharemode; none of those field names appear on the wire):
- "model" →
type(yes, the field is literally calledtype) - "host_node" →
nodeName - "core_util" →
coreUtilizedPercent(already a percent) - "vram total / used" →
memoryTotal/memoryUsed(both MiB; multiply by 1024² before passing toformat.GetDiskSize) - "mode" →
shareMode(string"0"/"1"/"2"/"3"; SPA'sVRAMModeLabelenum only knows 0/1/2 — the CLI surfaces unknown values asmode=<raw>)
Monitor query endpoints (instant-vector / range-vector)
The wire-shape contract above ONLY covers the list/detail endpoints. The two monitor query endpoints behave the opposite way — they DO wrap the result in a single-level data envelope, matching the SPA's InstantVectorResponse / RangeVectorResponse types in packages/app/src/apps/dashboard/types/gpu.ts.
| Endpoint | Wire shape | Decoder |
|---|---|---|
POST /hami/api/vgpu/v1/monitor/query/instant-vector | {"data":[ {metric, value, timestamp} ]} | struct{ Data []instantVectorSample }, return raw.Data |
POST /hami/api/vgpu/v1/monitor/query/range-vector | {"data":[ {metric, values:[{value, timestamp}]} ]} | struct{ Data []rangeVectorSeries }, return raw.Data |
Both fetchers (fetchInstantVector / fetchRangeVector) are pinned by TestFetchInstantVector_ParsesDataEnvelope / TestFetchRangeVector_ParsesDataEnvelope. Do NOT "harmonise" with the list endpoints — different microservices behind HAMI, different wire shapes, period.
Range body shape: {"query":"<promql>","range":{"start":"YYYY-MM-DD HH:mm:ss","end":"…","step":"30m"}}. start/end are SPA's timeParse(date) (local clock, no offset suffix); the step is computed by gpuTrendStep(start, end) — a 1:1 port of timeRangeFormate(diff_s, 16) which preserves the SPA's preset table (10m→1m, 60m→10m, 480m→30m, 1440m→60m, etc.) and falls back to floor(minutes/16)m capped at [1m..60m].
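A sketch of that step computation: it shows only the preset rows named above (the full table is the SPA's timeRangeFormate) and assumes the documented GPUTrendStep helper has roughly this shape. Needs time and fmt.

// gpuTrendStep picks the range-vector step for a [start, end] window.
func gpuTrendStep(start, end time.Time) string {
    minutes := int(end.Sub(start).Minutes())
    switch minutes { // presets documented above; the SPA table carries more rows
    case 10:
        return "1m"
    case 60:
        return "10m"
    case 480:
        return "30m"
    case 1440:
        return "60m"
    }
    step := minutes / 16 // fallback: floor(minutes/16), clamped to [1m..60m]
    if step < 1 {
        step = 1
    }
    if step > 60 {
        step = 60
    }
    return fmt.Sprintf("%dm", step)
}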
dashboard overview gpu detail <uuid> / task-detail <name> <pod-uid> — full detail pages
Two new commands mirror the SPA's Overview2/GPU/GPUsDetails.vue and Overview2/GPU/TasksDetails.vue pages: basic info (HAMI /v1/gpu or /v1/container) + top gauges (instant-vector) + trend charts (range-vector), all assembled into a single sections envelope.
| Command | Kind | Default window | Sections | PromQL count |
|---|---|---|---|---|
overview gpu detail <uuid> | dashboard.overview.gpu.detail.full | 8h | detail / gauges / trends | 4 instant + 6 range |
overview gpu task-detail <name> <pod-uid> [--sharemode] | dashboard.overview.gpu.task.detail.full | 1h | detail / gauges / trends | 2 instant + 2 range |
Section item ordering is fixed and stable across releases — agents can index by position OR by raw.key:
GPU detail.full gauges keys (in order): alloc_core / alloc_mem / util_core / util_mem / power / temperature.
GPU detail.full trends keys (in order): alloc_trend (lines: core+memory) / usage_trend (core+memory) / power_trend / temp_trend.
Task task-detail.full gauges keys (in order): compute_usage / vram_usage.
Task task-detail.full trends keys (in order): compute_trend / vram_trend (each one line, label usage).
Time window flags reuse the existing --since / --start / --end. With neither set the SPA defaults apply (8h for GPU detail, 1h for task detail). meta.window carries {since, start, end, step} so an agent can replay the exact same query without recomputing.
--sharemode on task-detail mirrors the SPA's displayAllocation toggle: when set to "2" (Time slicing) the CLI skips the allocation gauges, matching the SPA's hidden-rows behaviour. Other values ("0", "1") keep the full gauge set. Pinned by TestBuildGPUTaskDetailFullEnvelope_TimeSlicingSkipsAllocation.
Soft-failure semantics — a single instant/range query failing does NOT abort the envelope:
- the failed gauge / trend item carries raw.error (and display.error in the table);
- the parent envelope's meta.warnings collects gauges "<key>": <err> / trends "<key>": <err> lines;
- agents should branch on len(meta.warnings) > 0 to detect partial data;
- pinned by TestBuildGPUDetailFullEnvelope_PartialFailure.
Hard-failure semantics still apply on the detail fetch itself: HAMI 404 → empty_reason=no_vgpu_integration, 5xx → vgpu_unavailable (meta.error + meta.http_status), 200 with empty body → no_gpu_detected. Watch mode polls every 30s by default (Recommended = 30 * time.Second — the monitor endpoints are slower than plain Prom queries).
device_no / driver_version are surfaced on the detail item — the CLI mirrors the SPA's trick of harvesting them from the power_trend range query's metric labels (HAMI doesn't ship them on /v1/gpu).
Behavior contract:
- Exit code is 0 for every gated / advisory / empty path — these are predictable states, not failures. Real failures still propagate via meta.error (or non-zero exit for top-level fatal errors).
- Stderr carries a single human-readable line in non-JSON modes; JSON / NDJSON modes stay silent on stderr (agents read stdout exclusively). Examples:
  - fan is only available on Olares One devices (current: <device_name>) (advisory)
  - GPU sidebar entry is hidden for non-admin profiles in the SPA; current user (<u>) is <role> (advisory)
  - no node carries gpu.bytetrade.io/cuda-supported=true; SPA hides the GPU card. HAMI was queried directly
  - gpu data temporarily unavailable: HAMI returned HTTP 500 (unknown request error)
- Capability calls (/user-service/api/system/status, /kapis/.../nodes, /capi/app/detail) are cached per-Client (sync.Once for system status; per-Client map for CUDA presence; sync.Once for user detail) — repeated calls inside an aggregated dashboard overview cost zero extra HTTP.
- Hard gate runs inside Runner.Iter for --watch so a stream against the wrong device terminates with consistent empty envelopes per tick. Soft advisories likewise re-evaluate per iteration.
Agent decision tree (recommended):
inspect meta.empty + meta.empty_reason + meta.note →
not_olares_one → skip the fan subtree on this device
no_*_integration → upstream component absent (HAMI / capi system fan)
vgpu_unavailable → transient: retry; check meta.http_status / meta.error
no_*_detected → integration up but hardware empty
(none) + meta.note → data is present but the SPA would have hidden the entry; surface the note to the user
(none) + (no note) → items[] populated, proceed normally
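The same branching expressed in Go for programmatic consumers; the Meta struct is the illustrative Shape A sketch above. Needs strings.

// decide maps meta.empty_reason / meta.note to the actions in the tree above.
func decide(m Meta) string {
    switch {
    case m.EmptyReason == "not_olares_one":
        return "skip the fan subtree on this device"
    case strings.HasSuffix(m.EmptyReason, "_integration"):
        return "upstream component absent (HAMI / capi system fan)"
    case m.EmptyReason == "vgpu_unavailable":
        return "transient: retry; check meta.http_status / meta.error"
    case strings.HasSuffix(m.EmptyReason, "_detected"):
        return "integration up but hardware empty"
    case m.Note != "":
        return "data present, but the SPA would have hidden it; surface the note"
    default:
        return "items[] populated, proceed normally"
    }
}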
Forbidden:
- Collapsing the empty reasons into a single "not supported" string.
- Skipping the soft advisory call to "save a request" — the cache makes that a non-issue.
- Treating a closed gate or HAMI 5xx as meta.error-only — error is reserved for the upstream message; the canonical signal is empty_reason.
- Treating vgpu_unavailable as a hard error / non-zero exit — agents must keep iterating in --watch.
--watch semantics (frozen)
Default behaviour: single execution. --watch opts into HTTP-polling that mirrors the SPA's setTimeout cadence.
flowchart TD
A[invoke leaf cmd] --> B{--watch set?}
B -- no --> C[RunOnce → emit one envelope → exit 0/1]
B -- yes --> D[Runner.Run]
D --> E[RunOnce iteration N]
E --> F{success?}
F -- yes --> G[emit envelope NDJSON line<br/>meta.iteration=N]
F -- transient err --> H[emit envelope NDJSON<br/>meta.error set, items:[]]
F -- ErrTokenInvalidated / ErrNotLoggedIn --> X[exit non-zero immediately]
G --> I{--watch-iterations cap reached?}
H --> J{N consecutive errors >= 3?}
J -- yes --> X
I -- yes --> Y[exit 0]
J -- no --> K[wait --watch-interval]
I -- no --> K
K --> L{SIGINT?}
L -- yes --> Y
L -- no --> E
Hard rules:
- One envelope per iteration; nothing else on stdout. Pretty -o json is for one-shot mode; --watch -o json is always NDJSON.
- Failed iteration ⇒ items:[] + meta.error="<msg>"; never raise to process exit unless it's an auth-class error or the 3-consecutive cap trips.
- --since N slides forward each iteration (window = [now-N, now]); --start/--end stays fixed across all iterations. Mutual exclusion is enforced in CommonFlags.Validate; do NOT remove the check.
- --watch-iterations without --watch is rejected. Same for --watch-timeout / --watch-interval.
- Runner exposes Iter (the per-iteration callback). It used to be called Run and was renamed to dodge a field/method-name clash with Runner.Run() — leave it.
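A condensed sketch of the loop the flowchart and rules describe — real code MUST drive pkgdashboard.Runner instead. isAuthError stands in for the typed *credential.ErrTokenInvalidated / *credential.ErrNotLoggedIn checks, and the whole signature is illustrative. Needs context and time.

// Illustrative watch loop only — real code drives pkgdashboard.Runner.
func watch(ctx context.Context, interval time.Duration, maxIterations int,
    runOnce func(iteration int) error, isAuthError func(error) bool) error {
    consecutiveFailures := 0
    for i := 1; maxIterations == 0 || i <= maxIterations; i++ {
        err := runOnce(i) // emits exactly one NDJSON envelope, meta.iteration=i
        switch {
        case err == nil:
            consecutiveFailures = 0
        case isAuthError(err):
            return err // auth-class error: exit non-zero immediately
        default:
            consecutiveFailures++ // envelope already carried meta.error + items:[]
            if consecutiveFailures >= 3 {
                return err // 3-consecutive-failure cap
            }
        }
        select {
        case <-ctx.Done(): // SIGINT cancels ctx → graceful exit 0
            return nil
        case <-time.After(interval):
        }
    }
    return nil
}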
Multi-user / --user flag
- Client.EnsureUser(ctx) lazily fetches /capi/app/detail (once per process) and caches globalrole. Subsequent calls return the cached value.
- RequireAdmin(ctx) is the only admit gate for --user <olaresId>; admin-only verbs (cross-tenant queries) MUST call it before issuing any upstream request.
- resolveTargetUser(ctx, c, requested) is the helper for commands that accept a user via positional arg OR --user (today: overview user [<username>]). Empty / self → returns the active profile; explicit + admin → returns a synthetic UserDetail{Name: requested}; explicit + non-admin → error mentioning "platform-admin".
- The CLI never spoofs identity via headers — --user only changes upstream filter args; the auth header is still the active profile's token.
Format pkg (frozen, JS-parity)
The format pkg is a byte-for-byte port of:
- @bytetrade/core/src/monitoring.ts → UnitTypes, GetValueByUnit, GetSuitableUnit, GetSuitableValue
- apps/packages/app/src/apps/dashboard/utils/ → WorthValue, FormatFrequency, ConvertTemperature, GetDiskSize, GetThroughput, FormatTime, GetMinuteValue, GetLastMonitoringData, GetResult
Rules:
- The reference is the JS source. format/testdata/golden-gen.js runs the JS oracle directly and writes golden.json; TestFormat_GoldenOracle asserts byte equality. Any divergence is a Go bug, not a JS bug.
- Use github.com/shopspring/decimal for arithmetic that the JS uses BigNumber for (e.g. WorthValue's cascading thresholds, FormatFrequency rounding). Do not introduce a different decimal lib.
- Quirks ARE the contract. Examples:
  - JS toFixed(2) of 1.005 returns "1.00" because of float repr — the Go port replicates that. Test cases must use unambiguous values like 1.234567 → 1.23.
  - WorthValue uses cascading if thresholds, not generic ladder logic — keep the structure.
- No display-side number formatting outside this pkg. Never fmt.Sprintf("%.2f", x) in overview.go / applications.go for a value that is shown to the user (the few %.2f left in overview.go are pinned to non-user-facing CPU core counts and are documented inline).
- Temperature: JSON raw carries Celsius (the upstream value); display is rendered with --temp-unit (default Celsius). Never strip Celsius from raw.
Shared helpers (the only legal call sites)
All shared infrastructure helpers live under
cli/pkg/dashboard/ (the pkg root) and are
imported as pkgdashboard "github.com/beclab/Olares/cli/pkg/dashboard"
by the pkg subpackages (pkg/dashboard/<area>/...) and — for the
plumbing types like CommonFlags / Client / ErrAlreadyReported —
also by the cmd shells. The cmd shells additionally import
pkg<area> "github.com/beclab/Olares/cli/pkg/dashboard/<area>" to
reach RunXxx.
pkg/dashboard/ (root) API surface — infrastructure + shared fetchers
| Domain file | Exposed surface | Used by |
|---|---|---|
flags.go | CommonFlags struct (raw + parsed fields), Validate, ResolveWindow, OutputFormat, ParseOutputFormat | every cmd leaf RunE (validate), every pkg RunXxx (read fields), every fetcher; cobra binding lives in cmd-root options.go |
output.go / emit.go / envelope.go | Envelope / Item / Meta / TableColumn, WriteJSON (NDJSON-aware), WriteTable, HeadItems, DisplayString, EmitDefault, NewMeta | every BuildXxxEnvelope / WriteXxxTable site under pkg/dashboard/<area>/... (hand-rolling json.Marshal / fmt.Println for envelope output is forbidden) |
client.go / users.go | Client.DoJSON / DoEmpty / DoRaw, EnsureUser (sync.Once), RequireAdmin, ResolveTargetUser, UserDetail, EnsureSystemStatus, IsOlaresOne, HTTPError, ClassifyTransportErr, IsHTTPError, ErrAlreadyReported | every HTTP call (direct httpClient.Do forbidden); admin gating in pkg/dashboard/overview/user.go |
runner.go | Runner{Iter, RunOnce}, ParseStep | every --watch-aware RunXxx (instantiated and driven from inside the pkg subpackage, never from cmd) |
monitoring.go | FetchClusterMetrics(ctx,c,cf,metrics,window,now,instant), FetchNodeMetrics, FetchUserMetric, MonitoringQuery, MonitoringWindow, DefaultClusterWindow, DefaultDetailWindow | pkg/dashboard/overview/{physical,cpu,memory,pods,user}.go, pkg/dashboard/overview/disk/{main,partitions}.go |
workloads.go | FetchWorkloadsMetrics(ctx,c,cf,req,window,now), MergeWorkloadMetrics, WorkloadAggregate / WorkloadApp / WorkloadRequest, AggregateByDeployment, AggregateByNamespace, PodCountByDeployment, SortWorkloadAggregates | only via BuildRankingEnvelope (do NOT call FetchWorkloadsMetrics directly from <area> subpackages) |
apps.go | FetchAppsList, RawAppListItem (with empty-entrances[] filter mirroring SPA appsWithNamespace) | only via BuildRankingEnvelope |
ranking.go | BuildRankingEnvelope(ctx,c,cf,target,sortBy,sortDir,now) — assembles a workload-grain ranking envelope | consumed by pkg/dashboard/overview/ranking.go and pkg/dashboard/applications/list.go so first-row parity holds (the ONE legitimate cross-area share, hoisted to the pkg root) |
system.go | FetchSystemIFS(ctx,c,testConnectivity), FetchSystemFan, SystemIFSItem, SystemFanData, SystemStatus, EnsureSystemStatus | pkg/dashboard/overview/network.go (capi /system/ifs), pkg/dashboard/overview/fan/live.go (capi mdns/olares-one/cpu-gpu) |
gates.go | GateOlaresOne (hard gate, fan), GPUAdvisory (soft gate, gpu), HasCUDANode, VgpuUnavailableFromError, ResetCUDANodeCache | pkg/dashboard/overview/fan/* (hard gate runs inside Runner.Iter), every pkg/dashboard/overview/gpu/* leaf (advisory note + soft proceed) |
gpu.go | FetchGraphicsList, FetchTaskList, FetchGraphicsDetail, FetchTaskDetail, ExtractHAMIMessage, GraphicsListBody, TaskListBody | every file under pkg/dashboard/overview/gpu/ (list / tasks / get / task / detail / task_detail) |
gpu_format.go | PercentString, PercentDirect, GPUModeLabel, GPUHealthLabel, GPUVRAMHuman, GPUTrendStep, RenderTemperature | every pkg/dashboard/overview/gpu/* table writer; pkg/dashboard/overview/disk/main.go (renderDiskTemperature wraps RenderTemperature) |
gpu_query.go | FetchInstantVector, FetchRangeVector, InstantVectorSample, RangeVectorSeries / Range / Point | pkg/dashboard/overview/gpu/specs.go (gauge/trend runners), pkg/dashboard/overview/gpu/detail.go, pkg/dashboard/overview/gpu/task_detail.go |
lsblk.go | HasPknameLabels, CollectSubtreeByPkname, ResolveParent, BuildLsblkTreePrefix, FlattenLsblkHierarchy, LsblkRow / LsblkFlatRow | pkg/dashboard/overview/disk/main.go, pkg/dashboard/overview/disk/partitions.go |
numbers.go | FormatFloat, SafeRatio, FormatRateAny, ParseRFCTimestamp, SampleFloat, LastSampleFromRow, FirstAnyInArray, ToFloat, RenderTemperature | every display-side rendering site under pkg/dashboard/<area>/... that doesn't go through format directly (typically aliased lowercase in the area's helpers.go) |
fan_curve.go | FanCurveTable (10 rows), FanSpeedMaxCPU, FanSpeedMaxGPU | pkg/dashboard/overview/fan/curve.go (do NOT fetch from BFF — UI constant; SPA also hardcodes) |
schema.go | Kind* constants, AllKinds(), LoadSchemaIndex, //go:embed schemas/*.json | cmd/ctl/dashboard/schema/root.go (loader); every pkg/dashboard/<area>/<verb>.go references the matching Kind* constant when emitting envelopes |
pkg/dashboard/<area>/ subpackages — heavy command-specific logic
| Subpackage | Entry points | Main files |
|---|---|---|
pkg/dashboard/applications/ | RunList(ctx,c,cf,sortBy,sortDir) | list.go (+ list_test.go) |
pkg/dashboard/overview/ | RunDefault (sections envelope), RunPhysical, RunUser(target), RunRanking(sortDir), RunCPU, RunMemory(mode), RunPods, RunNetwork(testConn) + the per-node scaffold RunPerNodeMetric (PerNodeDisplayFn) | default.go, physical.go, user.go, ranking.go, cpu.go, memory.go, pods.go, network.go, nodes.go, helpers.go (+ tests) |
pkg/dashboard/overview/disk/ | RunDefault, RunMain, RunPartitions(device) | default.go, main.go, partitions.go, helpers.go (+ tests) |
pkg/dashboard/overview/fan/ | RunDefault, RunLive, RunCurve | default.go, live.go, curve.go, helpers.go (+ tests) |
pkg/dashboard/overview/gpu/ | RunList, RunTasks, RunGet(uuid), RunTask(name,podUID,sharemode), RunDetail(uuid), RunTaskDetail(name,podUID,sharemode) + the specs.go query catalogue (gaugeSpec / trendSpec / runGauge / runTrend / fanoutGaugeAndTrend) | list.go, tasks.go, get.go, task.go, detail.go, task_detail.go, specs.go, helpers.go (+ tests) |
Each subpackage's helpers.go may carry package-private lowercase
aliases (e.g. formatFloat = pkgdashboard.FormatFloat,
gpuVRAMHuman = pkgdashboard.GPUVRAMHuman) for primitives the area
calls many times. These are readability-only — they are NOT new
helpers, do NOT carry behaviour, and are shadowed by helpers_test.go
in the same area for the matching test bodies.
cmd area common.go — slim, intentionally near-identical
Every cmd subpackage carries a small common.go whose contents are
intentionally near-identical between areas. This is expected, not
drift: each area declares its own factory + flag pointer so leaf code
inside that area can read common.X selector-style without importing
sibling areas. The current shape (≤ 70 lines per area, tighter for
sub-areas) is:
// Package <area> hosts the cobra wiring for `olares-cli dashboard <area>`;
// business logic lives in cli/pkg/dashboard/<area>/.
package <area>
import (
"context"
"fmt"
"strings"
"github.com/spf13/cobra"
"github.com/beclab/Olares/cli/pkg/cmdutil"
pkgdashboard "github.com/beclab/Olares/cli/pkg/dashboard"
)
// common is wired by NewXxxCommand; cobra's persistent-flag inheritance
// mutates the pointed-at struct before any leaf RunE fires.
var common *pkgdashboard.CommonFlags
// prepareClient is the area-private *pkgdashboard.Client factory.
func prepareClient(ctx context.Context, f *cmdutil.Factory) (*pkgdashboard.Client, error) { /* ResolveProfile + HTTPClient + NewClient */ }
// unknownSubcommandRunE prints a typed typo hint + returns ErrAlreadyReported.
func unknownSubcommandRunE(c *cobra.Command, args []string) error { /* SuggestionsFor + ErrAlreadyReported */ }
That's the WHOLE file. No trampoline aliases, no pkgdashboard.X
re-exports, no envelope helpers. If a leaf needs a pkg helper, it
imports pkgdashboard (or the area's pkg subpackage) and calls it
directly. The earlier ~400-line common.go shape — full of lowercase
re-exports of every pkg name — has been retired; reintroducing it is a
forbidden regression (see Iteration red-lines).
Do NOT "consolidate" common.go files by importing one area's
from another — that violates the no-horizontal-import red-line. If two
areas need the same heavyweight helper, hoist it to
cli/pkg/dashboard/ (as we did with BuildRankingEnvelope).
Coding rules (project-specific, hold tight)
- Package layout: cmd shell mirrors the command tree (one dir per parent, one <verb>.go per leaf); pkg subpackages mirror the same tree (one <verb>.go per leaf, exposing one RunXxx entry); pkg root is by domain (one file per fetcher / aggregator / infrastructure family). Extract a NEW pkg-root domain file ONLY when its function set is independently testable and isn't a shim over an existing domain. Cmd subpackages NEVER import each other horizontally; pkg subpackages NEVER import each other horizontally and the pkg root NEVER imports a subpackage.
- File organization (cmd): each parent has root.go (cobra assembly + AddCommand for children, RunE = pkg<area>.RunDefault or RunList per the parent-default rule) + common.go (≤ 70 lines: var common, area-local prepareClient, unknownSubcommandRunE — and nothing else). Each leaf is a single file named after the verb (e.g. cpu.go, disk/main.go, gpu/list.go); RunE = validate flags → prepareClient → pkg<area>.RunXxx(...). Stub-style trampoline aliases are forbidden (see Iteration red-lines). A leaf sketch follows this list.
- File organization (pkg): pkg-root files are domain-named (monitoring.go, workloads.go, gpu.go, gpu_format.go, gpu_query.go, lsblk.go, gates.go, fan_curve.go, numbers.go, …). Pkg subpackages (<area>/) carry <verb>.go files that own the RunXxx + BuildEnvelope + WriteTable triplet for each leaf, plus an optional helpers.go (package-private lowercase aliases of pkg-root primitives, readability only).
- Command constructors: signature NewXxxCommand(f *cmdutil.Factory, cf *pkgdashboard.CommonFlags) *cobra.Command. The factory + flags pointer MUST be passed in explicitly; the cmd area subpackage stores cf in var common *pkgdashboard.CommonFlags for leaf RunE bodies; no package-level / global factory variable.
- Cobra: every leaf has Use / Short / (where helpful) Example filled. SilenceUsage = true, SilenceErrors = true. Stub-style commands (none today) print an envelope with meta.error="not implemented" AND a clear stderr line. The contract is pinned by TestAllLeafCommandsSilenced in cmd-root.
- Errors: HTTP non-2xx → *pkgdashboard.HTTPError; auth errors → reformatted at the pkg layer. Never wrap typed *credential.ErrTokenInvalidated / *credential.ErrNotLoggedIn — surface them directly so the standard "run profile login" CTA fires. Soft-printed errors (already shown to the user) return pkgdashboard.ErrAlreadyReported so the cmd-root wrapper does not double-print.
- Reuse before extend: new cluster metric → check pkgdashboard.FetchClusterMetrics; new per-node metric → check FetchNodeMetrics; new user-grain metric → check FetchUserMetric; new workload-grain view → check BuildRankingEnvelope; new per-node leaf → reuse pkg/dashboard/overview/nodes.RunPerNodeMetric with a PerNodeDisplayFn; new watchable command → use pkgdashboard.Runner from inside the pkg RunXxx body.
- Tests stay tiered (hard floors): format 100%, runner 100%, new pkg-root fetcher fields 100% (httptest wire-shape pin), per-leaf RunXxx happy-path + non-trivial branches (3-state empty, gating, soft-failure) 100%. The ratchet only goes up.
- Help text: when a flag's behaviour depends on --watch (--since, --watch-interval, --watch-iterations, --watch-timeout), say so verbatim in the flag help to keep olares-cli ... --help self-documenting.
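The leaf shape those rules pin down, as a sketch — prepareClient, common and the pkg<area>.RunXxx call are the documented names, while the constructor name, Short text and exact RunCPU argument list are illustrative.

// cli/cmd/ctl/dashboard/overview/cpu.go (sketch) — the WHOLE leaf.
func newCPUCommand(f *cmdutil.Factory) *cobra.Command {
    return &cobra.Command{
        Use:           "cpu",
        Short:         "Per-node CPU utilisation",
        SilenceUsage:  true,
        SilenceErrors: true,
        RunE: func(cmd *cobra.Command, args []string) error {
            if err := common.Validate(); err != nil { // shared CommonFlags, bound in cmd-root
                return err
            }
            c, err := prepareClient(cmd.Context(), f)
            if err != nil {
                return err
            }
            // All envelope / table / watch logic lives on the pkg side.
            return pkgoverview.RunCPU(cmd.Context(), c, common)
        },
    }
}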
Iteration red-lines
Allowed (incremental work):
- Add a NEW leaf to an existing area:
  - Create the pkg entry first: cli/pkg/dashboard/<area>/<verb>.go exposing RunXxx(ctx, c, cf, ...) plus a BuildXxxEnvelope builder and (if it has a non-trivial table) a WriteXxxTable. Add a <verb>_test.go next to it (one happy-path test minimum, plus error-class coverage matching nearby leaves; reuse the area's helpers_test.go httptest fixtures).
  - Add the cmd shell: cli/cmd/ctl/dashboard/<area>/<verb>.go — a single cobra command literal whose RunE is pkg<area>.RunXxx(...) after common.Validate() + prepareClient(...). No envelope / table code in cmd.
  - Wire it in the area's root.go via AddCommand.
  - Register the schema: new KindXxx in pkg/dashboard/schema.go, append to AllKinds(), drop the JSON file under pkg/dashboard/schemas/, register in LoadSchemaIndex().
  - If the leaf participates in the --watch loop, drive pkgdashboard.Runner from inside RunXxx (do NOT spawn it from cmd).
- Add a NEW area (new top-level parent command):
  - Create the cmd shell: cli/cmd/ctl/dashboard/<area>/{root.go, common.go} — var common *pkgdashboard.CommonFlags, area-local prepareClient, unknownSubcommandRunE. common.go stays ≤ 70 lines; do NOT add trampoline aliases.
  - Create the pkg subpackage: cli/pkg/dashboard/<area>/{helpers.go, helpers_test.go, default.go} (and per-leaf <verb>.go files as you go). helpers.go may carry package-private lowercase aliases for frequently called pkg-root primitives (readability only, no new symbols).
  - The area's default.go exposes RunDefault — either the parent's sections envelope (Shape B) or a single-leaf default (Shape A, as overview/gpu does).
- Add a NEW omitempty field on Meta (forward-compatible).
- Add a NEW optional CommonFlags flag (in pkg/dashboard/flags.go; bind in cmd-root options.go; Validate must keep all current rules).
- Add a NEW per-leaf flag (today: overview memory --mode, overview network --test-connectivity, overview gpu task --sharemode, overview ranking --sort, applications --sort).
- Extend smoke / unit / golden tests; add httptest-based wire-shape pins to whichever tier owns the call site (pkg-root infra → pkg/dashboard/dashboard_test.go; per-leaf envelope → pkg/dashboard/<area>/<verb>_test.go; cobra wiring → cmd/ctl/dashboard/dashboard_test.go).
- Tighten doc strings, fix typos, fix display rendering as long as golden.json still passes.
Forbidden (regression / scope creep):
- Putting any leaf file directly under cli/cmd/ctl/dashboard/ — every leaf belongs in its area subpackage. The cmd-root only holds root.go (assembler), options.go (cobra ↔ CommonFlags binding), and dashboard_test.go (root-tier tests).
- Putting envelope / table / watch / fan-out logic in cmd — cmd leaf RunE bodies are a few lines (validate flags, build client, call pkg<area>.RunXxx). Any function that builds an Envelope, writes a table, runs Runner.Run, or fans out concurrent HTTP belongs in cli/pkg/dashboard/<area>/....
- Horizontal imports between cmd subpackages — applications/ may NOT import overview/; overview/disk/ may NOT import overview/fan/. Parent-area imports (overview/root.go importing overview/disk etc. for AddCommand wiring) are fine.
- Upward imports inside cli/pkg/dashboard/ — the pkg root MUST NOT import any subpackage; sibling subpackages (overview/disk, overview/fan, overview/gpu) MUST NOT import each other. If two area subpackages need the same heavy helper, hoist it to cli/pkg/dashboard/ (the BuildRankingEnvelope precedent — consumed by both pkg/dashboard/overview/ranking.go and pkg/dashboard/applications/list.go).
- Re-introducing trampoline alias files in cmd common.go — earlier revisions had ~400-line common.go files re-exporting every pkg name. The slim shape is the contract: common.go carries var common, prepareClient, unknownSubcommandRunE, and nothing else.
- Reverting to a flat cmd package — every fix-it-quick attempt to "just dump it next to overview.go" is a regression. The directory tree IS the command tree; keep them aligned on both sides of the split.
- Renaming, removing, or repurposing any Kind* constant.
- Changing envelope shapes A / B (additive omitempty is fine; structural change is not).
- Changing the parent-command default → sections envelope mapping. Specifically: overview, overview disk, overview fan default to Shape B; overview gpu defaults to gpu list (Shape A); overview cpu / overview memory / overview pods etc. are leaves and keep their Shape A defaults. New parent commands MUST pick one of these two shapes and pin it in the Use description (and in the area's pkg default.go).
- Performing unit / number formatting outside pkg/dashboard/format/ (including ad-hoc fmt.Sprintf("%.2f", ...) for user-visible values).
- Bypassing the factory-injected client (handwritten http.Client, manual X-Authorization).
- Editing --watch exit semantics (3-fail cap, ErrTokenInvalidated short-circuit, SIGINT exit-0, NDJSON-per-iteration). Adding new exit conditions requires a separate plan.
- Removing the --since ↔ --start/--end mutual-exclusion check in pkg/dashboard/flags.go::CommonFlags.Validate.
- Collapsing meta.empty_reason values or replacing them with prose.
- Touching EnsureUser cache lifetime / RequireAdmin guard placement / ResolveTargetUser admin check.
- Adding routes / verbs whose source page is NOT under Overview2 or Applications2.
- Swapping FetchSystemIFS / FetchSystemFan to monitoring endpoints — these are deliberate capi paths and the SPA uses the same.
- Drift between FanCurveTable (Go const in pkg/dashboard/fan_curve.go) and SPA Fan/config.ts.tableData. If you have to change one, change BOTH and update the test pinning the row count.
- Bypassing BuildRankingEnvelope (or its underlying FetchWorkloadsMetrics) with bespoke per-namespace fetches; first-row parity between overview ranking and applications MUST hold.
- Re-doing the plan. Subsequent work is incremental only — increments must reference this skill, not re-derive design from the SPA.
If a task seems to require any forbidden item, STOP, surface this skill to the user, and ask for a documented exception. Do not silently refactor.
Test infrastructure
Tests sit at four tiers, mirroring the two-layer split. The rule is:
a test belongs next to the code that owns its non-trivial
behaviour. Almost every dashboard test is now alongside the
Run* it exercises, in the matching pkg/dashboard/<area>/
subpackage; the pkg-root dashboard_test.go has narrowed to
infrastructure only.
| Tier | Where | What lives there | How to run |
|---|---|---|---|
| Pkg-root infrastructure tests | cli/pkg/dashboard/dashboard_test.go + helpers_test.go (in-package trampolines) | INFRA-only: CommonFlags.Validate / ResolveWindow, Client.EnsureUser / EnsureSystemStatus / RequireAdmin / IsOlaresOne, Runner (one-shot + watch + 3-fail + watch-required-flag), Fetch* wire-shape pins (httptest), Merge* / aggregator math, lsblk tree algorithms, gates (HasCUDANode, GateOlaresOne, GPUAdvisory, VgpuUnavailableFromError, ExtractHAMIMessage), GPU formatters & body builders (Graphics/TaskListBody), Fetch{Instant,Range}Vector, GPUTrendStep, WriteJSON / HeadItems / ClassifyTransportErr. No envelope-shape or table-render tests for individual leaves — those live in the area subpackages. | go test ./cli/pkg/dashboard/ |
| Pkg-area envelope / RunXxx tests | cli/pkg/dashboard/<area>/<verb>_test.go + per-area helpers_test.go (shared httptest fixtures) | Per-leaf coverage of the RunXxx / BuildXxxEnvelope / WriteXxxTable triplet, including 3-state empty handling, capability gating, soft-failure semantics, and partial-failure fan-out (e.g. TestBuildGPUDetailFullEnvelope_PartialFailure, TestBuildGPUTaskDetailFullEnvelope_TimeSlicingSkipsAllocation, TestBuildLiveEnvelope_NoGraphicsIntegrationStillSucceeds, TestBuildSectionsEnvelope_ErrorPropagation). Lark-aligned density: at least one happy-path + the bug classes the file actually has. | go test ./cli/pkg/dashboard/<area>/... |
| Cmd-root behavioural tests | cli/cmd/ctl/dashboard/dashboard_test.go | Only what's tied to NewDashboardCommand's cobra binding — TestUnknownSubcommandRunE_PrintsSuggestionAndFailsOnTypo, TestUnknownSubcommandRunE_NoArgsPrintsHelp, TestLeafErrorsAreReported, TestLeafErrorsSentinelNotDoublePrinted, TestAllLeafCommandsSilenced (regression net for "Cobra printed usage when HAMI returned 5xx"). The bar for this tier is high — keep the file short. | go test ./cli/cmd/ctl/dashboard/ |
| Format JS oracle | pkg/dashboard/format/format_test.go + testdata/golden-gen.js | TestFormat_GoldenOracle against golden.json (skips if absent) | cd cli/pkg/dashboard/format/testdata && node golden-gen.js; go test ./cli/pkg/dashboard/format/... |
| Coverage | go test -coverprofile | coverage-dashboard.html | go test -coverprofile=coverage-dashboard.out -covermode=atomic ./cli/cmd/ctl/dashboard/... ./cli/pkg/dashboard/... && go tool cover -html=coverage-dashboard.out -o coverage-dashboard.html |
Test layout rules:
- New pkg-root infra test (a new fetcher / aggregator / gate / runner behaviour / format helper at the pkg root) → cli/pkg/dashboard/dashboard_test.go (or, if the file gets too long, a new <domain>_test.go next to it). Leverage helpers_test.go's var common *CommonFlags and lowercase trampolines (fetchClusterMetrics, gateOlaresOne, newTestClient, …) so the body reads the same way the SPA-mirroring helpers do.
- New leaf-level test (covers RunXxx / BuildXxxEnvelope / WriteXxxTable for one cmd verb) → cli/pkg/dashboard/<area>/<verb>_test.go, sharing the area's helpers_test.go httptest fixtures (newTestClient, area-specific *StubMux, etc.). Do NOT put it on the cmd side.
- New root-tier behavioural test (cobra silencing, suggestion routing, leaf-error reporting) → cmd-root dashboard_test.go. Keep the file short; reject any test that could equally live in pkg.
- Cmd subpackages currently carry no *_test.go files — every leaf's behaviour is covered through Run* in pkg. Resist the urge to add cobra-bound tests on the cmd side; if you need them, justify it with a regression that genuinely depends on cobra binding (the existing pkg-area files already cover everything that Run* can be exercised with).
Real-machine end-to-end smoke / regression scripts live outside the
repo — agents should drive olares-cli dashboard <command> -o json
directly against a live profile and pipe through jq -e for assertion.
There is no committed bash harness; prefer ad-hoc jq queries against
the documented Kind enum + envelope shape.
Common errors → fixes
| Error message | Cause | Fix |
|---|---|---|
server rejected the access token (HTTP 401/403) | Token expired / wrong / missing | Defer to ../olares-shared/SKILL.md. Do not catch + retry inside dashboard code. |
--since cannot be combined with --start/--end (mutually exclusive) | User passed both | Drop one; sliding window vs fixed window are different intents. |
--watch-iterations requires --watch (similarly --watch-timeout, --watch-interval) | Watch knob without --watch | Add --watch, or drop the knob. |
--user "<id>" requires platform-admin; <self> does not have that role | Non-admin tried to query another user | Log in as admin, or drop --user. |
meta.empty_reason = no_vgpu_integration (200 envelope, items:[]) | HAMI vGPU absent (HTTP 404) | Not an error — install HAMI or skip the GPU view. |
meta.empty_reason = vgpu_unavailable (200 envelope, items:[]) + meta.http_status=5xx + meta.error="..." + stderr gpu data temporarily unavailable: HAMI returned HTTP 5xx ... | HAMI installed but unhealthy (HTTP 5xx) | Transient — retry; investigate the HAMI controller. CLI exit code stays 0 so --watch keeps streaming. |
meta.empty_reason = no_gpu_detected (200 envelope, items:[]) | HAMI healthy but the device list / detail is empty | Not an error. |
meta.empty_reason = no_fan_integration (200 envelope, items:[]) | /user-service/api/mdns/olares-one/cpu-gpu returned 404 (Olares One only) | Not an error on non-Olares-One deployments. |
meta.empty_reason = not_olares_one (200 envelope, items:[]) + stderr fan is only available on Olares One devices ... | Active device's device_name is not Olares One — fan subtree is hard-gated off | Not an error. Skip fan on this hardware; verify with curl /user-service/api/system/status. |
meta.note="GPU sidebar entry is hidden for non-admin profiles ..." + stderr (advisory) GPU sidebar entry ... | Non-admin profile invoked a gpu verb. Soft advisory only — data still fetched. | Not an error. If HAMI still returned data, surface the note alongside the items. Switch to a platform-admin profile if the agent needs the SPA-equivalent UX. |
meta.note="no node carries gpu.bytetrade.io/cuda-supported=true ..." + stderr (advisory) no node carries ... | Cluster has no node labelled gpu.bytetrade.io/cuda-supported=true. Soft advisory only. | Not an error. Surface the note; HAMI may still report devices. |
kind=dashboard.<x>, meta.error="..." inside an NDJSON line | Single failed --watch iteration | Fine — watch keeps running. Three in a row exits non-zero. |
unknown subcommand "application" (typo) | User typed application instead of applications | Use the suggestion printed; shell completion also helps. |
Typical workflows
One-shot snapshot for an agent:
olares-cli dashboard overview -o json | jq '.sections.user.items[0].raw.utilisation'
Stream CPU at the SPA's cadence, two iterations:
olares-cli dashboard overview cpu --watch --watch-iterations 2 -o json
Sliding 5-minute window, polled every 10s, until SIGINT:
olares-cli dashboard overview cpu --watch --since 5m --watch-interval 10s -o json
Fixed window (no sliding even if --watch set):
olares-cli dashboard overview cpu --start 2026-04-28T08:00:00Z --end 2026-04-28T08:30:00Z -o json
Disk sections (main table + per-disk partitions):
olares-cli dashboard overview disk -o json | jq '.sections.partitions.sections | keys'
olares-cli dashboard overview disk partitions sda -o json
Fan sections (live + curve):
olares-cli dashboard overview fan -o json | jq '.sections.curve.items | length' # → 10
olares-cli dashboard overview fan live --watch --watch-iterations 5 -o json
GPU + tasks (3-state):
olares-cli dashboard overview gpu list -o json
olares-cli dashboard overview gpu tasks -o json
olares-cli dashboard overview gpu get GPU-0123abcd -o json
olares-cli dashboard overview gpu task my-task abcdef-0123-...
Cross-user query as admin:
olares-cli dashboard overview user bob -o json
Real-machine smoke (drive -o json and jq -e directly):
olares-cli profile use olares-id
olares-cli dashboard schema -o json | jq -e '.kind == "dashboard.schema.index"'
olares-cli dashboard overview -o json | jq -e '.sections.physical.kind == "dashboard.overview.physical"'
olares-cli dashboard overview gpu list -o json | jq -e '.kind == "dashboard.overview.gpu.list"'