Skip to content

Container status API: compute live status instead of static DB column#352

Open
runleveldev wants to merge 15 commits into
mainfrom
350-container-status-api
Open

Container status API: compute live status instead of static DB column#352
runleveldev wants to merge 15 commits into
mainfrom
350-container-status-api

Conversation

@runleveldev

@runleveldev runleveldev commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Resolves #350. Merge #351 first.

Summary

Container status was a static Containers.status column updated only by the create/reconfigure job scripts, so it drifted from reality whenever Proxmox changed out of band or a job died mid-run. This replaces it with status computed live on demand, embedded directly on the existing container API payloads. The static column is dropped.

Live status resolver — create-a-container/utils/container-status.js

Status is derived from two cheap sources of truth:

  • Proxmox run-state — does the LXC exist and is it running/stopped? Read from a single clusterResources('lxc') snapshot per node.
  • Create job — for containers not yet in Proxmox: is the create job active, did it fail, or is it gone? The create job is linked by the creationJobId foreign key, so this is resolved from the eager-loaded creationJob association (zero extra queries) or a single PK lookup.

Resolved statuses:

Status Condition
running online in Proxmox
offline exists in Proxmox but stopped
creating no Proxmox VM yet, active create job
failed no Proxmox VM, create job returned failure
missing no Proxmox VM, create succeeded or no create job found
unknown Proxmox unreachable / node has no API credentials

Where status is returned

The live status is embedded on the existing list, show, and create container payloads — no new endpoint.

Earlier revisions of this PR also implemented restarting (active reconfigure job) and out-of-sync (live LXC config differs from expectation), plus a dedicated GET /containers/:id/status endpoint. All three were removed for performance/simplicity (see below). The PUT endpoint still enqueues a reconfigure job and returns its jobId; only the restarting status label is gone.

Performance (why the scope was trimmed)

Resolving the list must stay cheap. Two things were inherently O(N):

  • Reconfigure-job lookups (for restarting) — reconfigure jobs have no FK, so detecting them needed per-container command-string queries.
  • Config-drift reads (for out-of-sync) — Proxmox has no bulk config endpoint, so this needed one lxcConfig call per in-Proxmox container.

Both were dropped. The index now issues one Proxmox call per node and zero per-container Proxmox or DB job queries (verified at 30 containers). The create job comes free from the eager-loaded association.

Compatibility

  • status remains present on the same endpoints, so existing consumers keep working. Checked mieweb/launchpad (an external API client): it only checks status == "running" and never uses restarting/out-of-sync, so dropping them is safe there.
  • This narrows the status set vs. what main could emit (main could return restarting), so it's not a strict superset anymore — but no known client depends on the removed values.

Removed the static column

  • Migration 20260615000000-remove-container-status.js drops Containers.status; removed from the model.
  • All former writers updated to stop persisting status (create-container.js, reconfigure-container.js, nodes.js import, json-to-sql.js). Failure is now captured by the job's failure status.
  • The create-time status !== 'pending' guard became a Proxmox-VMID-presence guard.
  • routers/templates.js (nginx/dnsmasq config) previously filtered where: { status: 'running' }; it now selects routable containers by ipv4Address != null.
  • routers/api/v1/resource-requests.js (applyResourceToExistingContainers) previously filtered where: { ..., status: 'running' } — that would have thrown "Unknown column 'status'" after the migration. It now filters on containerId != null (a VMID is required to reconfigure anyway).

Frontend

  • Typed ContainerStatus union; the list page renders the live status from the list response via a small presentational StatusBadge (friendly labels + status → badge-variant mapping). No per-row polling.

Docs

  • OpenAPI: ContainerStatus enum, referenced from the Container schema.
  • README: ER diagram (removed status) and a container-status field description.

Testing

No test framework exists in the repo, so logic was validated with standalone Node assertion scripts (not committed):

  • Decision-table assertions for every status path.
  • Batch-resolution assertions confirming 0 per-container Proxmox/DB calls and 1 cluster snapshot per node at scale (30 containers).
  • All backend files pass node --check; OpenAPI YAML parses; client type-check and build pass.

Notes / follow-ups

  • Still a draft. This drops a DB column, so coordinate with deploy/migration timing.
  • Config-drift (out-of-sync) can be revisited later if Proxmox gains a bulk config endpoint or if drift is computed lazily on the detail view / cached.
  • Happy to add a prove/test harness if the team wants committed tests.

Comment thread create-a-container/client/src/pages/containers/ContainersListPage.tsx Outdated
Comment thread create-a-container/routers/templates.js Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR removes the persisted Containers.status column and replaces it with a live, computed container status derived from Proxmox state, job state, and config drift, while keeping status present on existing container API payloads. It also updates the UI and docs to reflect the expanded status set.

Changes:

  • Add a shared status resolver (utils/container-status.js) and wire it into container list/show/create responses.
  • Drop Containers.status from the Sequelize model + DB schema; remove status writes from job scripts/import paths.
  • Update templates generation filters and frontend types/rendering to support the new live status values.

Reviewed changes

Copilot reviewed 10 out of 13 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
create-a-container/utils/container-status.js New live status resolver (Proxmox snapshot + job lookup + drift detection).
create-a-container/routers/templates.js Switch template container selection from status='running' to a provisioned/proxy predicate.
create-a-container/routers/api/v1/nodes.js Stop persisting status during node/container import.
create-a-container/routers/api/v1/containers.js Embed computed status in list/show; return creating on create response; stop persisting status updates.
create-a-container/README.md Remove status from ER diagram and document the live status field and enum values.
create-a-container/openapi.v1.yaml Add ContainerStatus enum schema and reference it from Container.status.
create-a-container/models/container.js Remove status attribute from the Container model definition.
create-a-container/migrations/20260615000000-remove-container-status.js Migration to drop the status column (with rollback restoring a default).
create-a-container/client/src/pages/containers/ContainersListPage.tsx Render typed live status via StatusBadge and map new states to badge variants/labels.
create-a-container/client/src/lib/types.ts Introduce ContainerStatus union and type container/status-bearing payloads accordingly.
create-a-container/bin/reconfigure-container.js Stop updating container status; keep MAC/IP updates and adjust logs.
create-a-container/bin/json-to-sql.js Stop setting/updating container status during JSON import/update.
create-a-container/bin/create-container.js Replace the old “pending” guard with a VMID presence guard; stop persisting creating/running/failed status.

Comment thread create-a-container/utils/container-status.js Outdated
Comment thread create-a-container/utils/container-status.js Outdated
Comment thread create-a-container/routers/templates.js
Comment thread create-a-container/routers/templates.js
Comment thread create-a-container/README.md Outdated
@runleveldev runleveldev force-pushed the 350-container-status-api branch from ccc31d4 to 793ba81 Compare June 16, 2026 14:12
@runleveldev runleveldev marked this pull request as ready for review June 16, 2026 15:16
Admin-defined default environment variables were merged into a
container's DB record when the container was created. This froze the
defaults to their creation-time values and exposed them in the edit UI.

Now the container record stores only user-defined env vars. System
defaults (from Settings) and NVIDIA GPU defaults are merged in at
configure-time, just before the Proxmox API call, with user-defined
values taking precedence.

- models/container.js: buildLxcEnvConfig() now merges
  defaults < NVIDIA defaults < user vars; add getSystemDefaultEnvVars(),
  parseEnvironmentVars(), and nvidiaDefaultEnvVars() helpers.
- bin/create-container.js: merge system defaults at configure-time and
  stop writing merged env/entrypoint back into the DB record.
- bin/reconfigure-container.js: merge system defaults on edit so newly
  added defaults also reach existing containers.
- routers/api/v1/containers.js: stop baking NVIDIA defaults into the
  saved record on create.

Closes #349
Move all environment-variable merge logic into Container#buildLxcEnvConfig
so there is exactly one place that determines the env to deploy. The
method now loads the admin-defined system defaults itself (it became
async) and composes them with the NVIDIA and user-defined values:

  system defaults < NVIDIA defaults < user-defined values

Callers in create-container.js and reconfigure-container.js are reduced
to a single `await container.buildLxcEnvConfig()` call and no longer load
or pass defaults themselves. getSystemDefaultEnvVars, parseEnvironmentVars
and nvidiaDefaultEnvVars are now internal helpers of that method.
… the Proxmox env string

Addresses PR review: env var keys were only trimmed, so a key containing
'=' or NUL (or a non-string/array value) could corrupt the NUL-separated
KEY=value blob sent to Proxmox (splitting one var into many, breaking
parsing, or stringifying objects to "[object Object]").

Add Container.normalizeEnvVars() as the single definition of a valid env
var: input must be a plain object; keys are trimmed and must match a
conventional env-var name (no '=', NUL, or whitespace); values are coerced
to strings, with null/undefined, objects/arrays, and NUL-containing values
dropped.

Apply it at every ingest/load point:
- getSystemDefaultEnvVars() and parseEnvironmentVars() normalize their output.
- API create/update routes serialize user env vars via a shared
  serializeUserEnvVars() helper that normalizes before persisting.

buildLxcEnvConfig can now trust its inputs, so the per-pair guards are
dropped.
buildLxcEnvConfig always added env/entrypoint to Proxmox's `delete` list
when the container record had no value. On create the container is freshly
cloned from a template that may carry its own env/entrypoint, so emitting
`delete` actively wiped those template-provided defaults.

Add a deleteMissing option (default false). Proxmox's config PUT is a
partial update, so omitting a key leaves it untouched:

- create-container.js uses the default (false): empty env/entrypoint are
  omitted, preserving whatever the cloned template provides.
- reconfigure-container.js passes deleteMissing:true so clearing the last
  env var or removing a custom entrypoint still unsets it on the existing
  container.

The reconfigure restart-detection already ignores the `delete` key, so a
pure-delete config does not, by itself, trigger a restart.
Reconfigure uses deleteMissing, so any env/entrypoint that exists only on
the cloned template (and was never stored on the container) would be
unset on the first reconfigure — surprising the user. We can't re-query
the template later: templates are mutable Docker refs that may change or
disappear, and we don't cache them.

Compromise: at creation, after the template is cloned, read its LXC
config and fold its env/entrypoint into the container record as if the
user had supplied them (precedence: template < user). This guarantees
those values persist across future reconfigures without changing API
client expectations.

System and NVIDIA defaults are deliberately NOT persisted — they remain
configure-time-only (merged by buildLxcEnvConfig) so they aren't frozen
into the record or exposed as user values in the edit UI.

- models/container.js: add Container.parseLxcEnvString() and
  Container#persistTemplateDefaults().
- bin/create-container.js: call persistTemplateDefaults() with the cloned
  template's config before building/applying the LXC env config.
Resolves #350.

Container status was a static column updated only by the create/reconfigure
job scripts, so it drifted from reality whenever Proxmox changed out of band
or a job died mid-run. This replaces it with status computed on demand.

New utils/container-status.js resolves status from three sources:
  - Proxmox: does the LXC exist, and is it running/stopped?
  - Jobs: active create/reconfigure job, or did the last create job fail?
  - Config: does the live LXC config match what the API server expects?

Resolved statuses: running, offline, creating, restarting, failed, missing,
out-of-sync (plus unknown when Proxmox is unreachable).

New endpoint:
  GET /api/v1/sites/:siteId/containers/:id/status -> { data: { status } }

Non-breaking: the same live status is still embedded on the list/show/create
payloads, so existing consumers keep working.

The static Containers.status column is dropped (migration + model). All former
writers were updated to stop persisting status; templates.js now selects
provisioned containers by VMID + IPv4 instead of status='running'. The list
page queries the per-container status endpoint (seeded by the list value for
instant paint, polling while creating/restarting).

OpenAPI and README updated.
Live status is already embedded in the existing list/show/create container
payloads, so the dedicated GET /containers/:id/status endpoint is redundant.
Per-row polling + hydration on the index page is not needed at this time, so
revert the ContainerStatusBadge to a simple presentational badge that renders
the live status from the list response.

Removes the endpoint, the client query/key/result type, the OpenAPI path, and
the README endpoint section (the status-value reference is retained as a field
description). The ContainerStatus enum and the live resolver remain.
- out-of-sync is a warning, not a danger, badge variant
- templates.js routes containers by ipv4Address only (a VMID isn't required to route)
- container-status.js: match --container-id as a whole argument when looking up
  create/reconfigure jobs, so container 12 no longer matches container 123's
  jobs (which could misclassify status). LIKE narrows at the DB level; a regex
  confirms the id is terminated by a space or end-of-string.
- resource-requests.js: the applyResourceToExistingContainers query filtered on
  the now-removed Containers.status column ("Unknown column 'status'" after the
  migration). Use containerId != null (a VMID is required to reconfigure anyway).
- README: note the create endpoint also returns the live status field.
The calling convention for create/reconfigure job commands is stable and
deterministic, so match the command string exactly in the query rather than
fetching LIKE candidates and regex-filtering in JS:

- create: exact match on `node bin/create-container.js --container-id=<id>`
  (create commands never carry trailing args).
- reconfigure: exact match OR `<base> %` (exact base followed by a space), which
  covers the optional ` --<resource>=<value>` flags while still preventing id 12
  from matching id 123.
Resolving the container list status was O(N) DB queries because every container
ran its own create/reconfigure job lookup (up to 2 queries each). Add
prefetchLatestJobs() which fetches the latest create + reconfigure job for the
whole set of containers in a single query (bounded WHERE clauses, independent of
N), buckets them per container, and feeds the result into computeContainerStatus
via an optional `jobs` param.

computeContainerStatuses now issues one job query total instead of ~2N. The
single-container show path is unchanged (still queries per container when no
prefetched jobs are passed).
…ng, out-of-sync)

Resolving the container list was still slow because each container did per-item
work: a command-string job lookup for 'restarting', and a Proxmox lxcConfig call
for 'out-of-sync'. Both are inherently O(N) (Proxmox has no bulk config endpoint),
which dominated index latency.

Drop both statuses:
- 'restarting': not used by known API clients (launchpad only checks == running),
  and reconfigure jobs have no FK so detecting it required command-string queries.
- 'out-of-sync': required one lxcConfig call per in-Proxmox container; no bulk
  Proxmox config API exists to batch it.

Status now derives only from:
- Proxmox run-state, via one clusterResources('lxc') snapshot per node, and
- the create job, resolved from the eager-loaded creationJob association (or the
  creationJobId FK) — no command-string queries.

Result: the index makes one Proxmox call per node and zero per-container Proxmox
or DB job queries (verified at 30 containers). Remaining statuses: running,
offline, creating, failed, missing, unknown. PUT still enqueues a reconfigure job
and returns its jobId; only the 'restarting' status label is removed.

Updates the resolver, containers router (eager-load creationJob on the index),
client types/badges, OpenAPI enum, and README.
@runleveldev runleveldev force-pushed the 350-container-status-api branch from 91a3ddf to e790cfb Compare June 18, 2026 16:19
runleveldev added a commit that referenced this pull request Jun 19, 2026
… + DummyApi

Run the Manager standalone on localhost (SQLite, `make dev`) and create
containers without Proxmox, keeping the real provisioning code path.

- NodeApi abstraction: Node.api() dispatches on a new `nodeType` column.
  `proxmox` (default) returns ProxmoxApi; `dummy` returns a new DummyApi that
  implements the same surface and simulates Proxmox calls. Creation still runs
  POST /containers -> Job -> job-runner -> bin/create-container.js -> node.api().
- DummyApi.clusterResources('lxc') derives the running set from the DB so the
  live container-status resolver (#352) reports dummy containers correctly;
  dev-bootstrap gives the dummy node placeholder creds so nodeHasCreds() passes.
- Provider-agnostic node selection that only excludes unprovisionable nodes
  (dummy, or proxmox with apiUrl/tokenId/secret); DELETE routes teardown through
  node.api(). Node.api() validates apiUrl alongside credentials.
- Postgres-safe migration adding `nodeType` (existing rows default to proxmox).
- Fix the erroneous UNIQUE on Containers.nodeId via patch-package (v6 backport of
  sequelize/sequelize#17583); replaces the SQLite-specific migration.
- bin/dev-bootstrap.js seeds a localhost site + dummy node + external domain,
  no-op if any Site exists, so it never clobbers the docker-compose bootstrap.
- Make idempotent seeders (groups, push-notification) so re-running them in
  `make dev` doesn't violate unique constraints.
- Gate Sequelize SQL query logging behind LOG_LEVEL=trace; off otherwise.
- Replace the unmaintained sqlite3 with @vscode/sqlite3 (dev-only, builds from
  source via dialectModule) so there is no prebuilt-binary glibc mismatch; prod
  (Postgres) never installs or loads it.
- `make dev` lives in create-a-container/Makefile (root delegates): self-contained
  install (--no-package-lock, so it never rewrites the lockfile), migrate, seed,
  then run server (nodemon) + job-runner + client watch together.
- Docs: "Run the Manager Locally (make dev)" section; README + release-pipeline
  pointers; LOG_LEVEL documented in example.env.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Container status API

2 participants