vmm-cli update --compose with --env-file silently drops compose changes when allowed_envs differs #707

@lloydmak99

Summary

vmm-cli.py update <vm_id> --compose new.yaml --env-file new.env --kms-url ... silently drops the --compose update when the env-file's keys differ from the VM's current allowed_envs. The resulting VMM-stored compose_file keeps the old docker_compose_file but with the new allowed_envs. vmm-cli update exits 0 and reports success.

Reproduction

Any combined update in which --env-file adds or removes an env var changes allowed_envs and therefore triggers the bug. For us this surfaced when adding LAUNCHER_CHANNEL to the env list alongside a new service in the compose YAML: the new service was silently dropped on two hosts.

Root cause

vmm/src/vmm-cli.py, update_vm() (current master, lines 1051–1124): two unrelated branches both write to upgrade_params["compose_file"], and the env-file branch runs last:

# Branch 1 — compose update (line 1051)
if needs_compose_update:
    vm_configuration = vm_info_response["info"].get("configuration") or {}
    compose_file_content = vm_configuration.get("compose_file")
    app_compose = json.loads(compose_file_content) if compose_file_content else {}
    if docker_compose_content:
        app_compose["docker_compose_file"] = docker_compose_content   # ← inserts NEW YAML
    ...
    upgrade_params["compose_file"] = json.dumps(app_compose, ...)

# Branch 2 — env-file (line 1088)
if env_file:
    envs = parse_env_file(env_file)
    if envs:
        ...
        if compose_file_content:
            app_compose = json.loads(compose_file_content)            # ← RE-READS ORIGINAL (no new YAML)
            ...
            if app_compose.get("allowed_envs") != allowed_envs:
                app_compose["allowed_envs"] = allowed_envs
                compose_changed = True
            ...
            if compose_changed:
                upgrade_params["compose_file"] = json.dumps(app_compose, ...)   # ← OVERWRITES branch 1's result

Branch 2 reloads compose_file_content from vm_info_response (pre-update state) instead of continuing to mutate the app_compose dict already built by branch 1. When allowed_envs differs, compose_changed=True and branch 2's upgrade_params["compose_file"] = json.dumps(app_compose, ...) clobbers the new YAML.
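The overwrite can be shown without a VMM at all. The following is a minimal model of the two branches, not the real update_vm(): the function name build_upgrade_params_buggy and the sample compose strings are made up for illustration, and the elided parts of the original (prelaunch_script, launch_token_hash, encryption) are left out.

```python
import json


def build_upgrade_params_buggy(stored_compose_json, new_yaml, env_keys):
    """Simplified model of the two branches (hypothetical helper, not vmm-cli code)."""
    upgrade_params = {}

    # Branch 1: parse the stored app-compose and insert the new YAML.
    app_compose = json.loads(stored_compose_json)
    app_compose["docker_compose_file"] = new_yaml
    upgrade_params["compose_file"] = json.dumps(app_compose)

    # Branch 2: re-parse the ORIGINAL stored state, discarding branch 1's dict.
    app_compose = json.loads(stored_compose_json)
    if app_compose.get("allowed_envs") != env_keys:
        app_compose["allowed_envs"] = env_keys
        # This write replaces the value branch 1 just stored.
        upgrade_params["compose_file"] = json.dumps(app_compose)

    return upgrade_params


old_yaml = "services: {old: {}}"
new_yaml = "services: {old: {}, new: {}}"
stored = json.dumps({"docker_compose_file": old_yaml, "allowed_envs": ["A"]})

params = build_upgrade_params_buggy(stored, new_yaml, ["A", "LAUNCHER_CHANNEL"])
result = json.loads(params["compose_file"])
print(result["docker_compose_file"] == old_yaml)  # True: the old YAML survives
print(result["allowed_envs"])                     # the new env list is applied
```

With an unchanged env list (["A"] here) the second write never happens and the new YAML is kept, which is why the bug only shows up when allowed_envs differs.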

Why it's hard to notice

  • vmm-cli update exits 0 and prints success
  • The resulting compose_file still has the new allowed_envs, so subsequent env operations look correct
  • The KMS hash registered by the operator (computed from app-compose.json) matches what VMM stores — both are wrong-but-internally-consistent
  • The CVM boots fine; the missing service simply… never existed
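Since none of the signals above fail, the only reliable check is to compare the stored docker_compose_file against the YAML that was meant to be deployed. A minimal sketch, assuming the GetInfo response has the info.configuration.compose_file shape seen in the excerpts above and that VMM stores the YAML string verbatim; compose_update_landed is a hypothetical helper, not part of vmm-cli.

```python
import json


def compose_update_landed(vm_info_response: dict, expected_yaml: str) -> bool:
    """Return True only if the stored docker_compose_file equals expected_yaml.

    Hypothetical helper; the response shape is assumed from the issue's excerpts.
    """
    configuration = vm_info_response["info"].get("configuration") or {}
    compose_file_content = configuration.get("compose_file")
    if not compose_file_content:
        return False
    try:
        app_compose = json.loads(compose_file_content)
    except json.JSONDecodeError:
        return False
    return app_compose.get("docker_compose_file") == expected_yaml


# Example with a hand-built response standing in for a real GetInfo result.
fake_response = {
    "info": {
        "configuration": {
            "compose_file": json.dumps(
                {"docker_compose_file": "services: {old: {}}", "allowed_envs": ["A"]}
            )
        }
    }
}
print(compose_update_landed(fake_response, "services: {old: {}, new: {}}"))  # False
```

Running a check like this after every combined update would have flagged the dropped service on the first host.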

Suggested fix

Have branch 2 reuse the app_compose dict built by branch 1 instead of reloading from vm_configuration. Sketch:

app_compose = None  # accumulated across both branches

if needs_compose_update or env_file:
    vm_info_response = self.rpc_call("GetInfo", {"id": vm_id})
    ...

if needs_compose_update:
    vm_configuration = vm_info_response["info"].get("configuration") or {}
    compose_file_content = vm_configuration.get("compose_file")
    try:
        app_compose = json.loads(compose_file_content) if compose_file_content else {}
    except json.JSONDecodeError:
        app_compose = {}

    if docker_compose_content:
        app_compose["docker_compose_file"] = docker_compose_content
        updates.append("docker compose")
    # ... prelaunch_script, swap_size ...
    upgrade_params["compose_file"] = json.dumps(app_compose, ...)

if env_file:
    envs = parse_env_file(env_file)
    if envs:
        ...
        # Reuse the in-flight app_compose if branch 1 ran;
        # otherwise load from current VMM state.
        if app_compose is None:
            vm_configuration = vm_info_response["info"].get("configuration") or {}
            compose_file_content = vm_configuration.get("compose_file")
            try:
                app_compose = json.loads(compose_file_content) if compose_file_content else {}
            except json.JSONDecodeError:
                app_compose = {}

        compose_changed = False
        allowed_envs = list(envs.keys())
        if app_compose.get("allowed_envs") != allowed_envs:
            app_compose["allowed_envs"] = allowed_envs
            compose_changed = True
        # ... launch_token_hash ...
        if compose_changed or needs_compose_update:
            upgrade_params["compose_file"] = json.dumps(app_compose, ...)

Two key changes: (a) app_compose is shared across both branches; (b) when branch 1 ran, always re-serialize the merged result so the env updates don't drop the compose changes.
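The same two changes can be exercised in isolation. This is a reduced sketch of the idea, not a patch for vmm-cli.py: build_upgrade_params_fixed and its parameters are hypothetical names, and the fields elided in the sketch above are omitted.

```python
import json


def build_upgrade_params_fixed(stored_compose_json, new_yaml=None, env_keys=None):
    """One shared app_compose dict, serialized once after all mutations."""
    upgrade_params = {}
    try:
        app_compose = json.loads(stored_compose_json) if stored_compose_json else {}
    except json.JSONDecodeError:
        app_compose = {}

    changed = False

    # Compose branch: mutate the shared dict.
    if new_yaml is not None:
        app_compose["docker_compose_file"] = new_yaml
        changed = True

    # Env branch: mutate the SAME dict instead of re-parsing stored state.
    if env_keys is not None and app_compose.get("allowed_envs") != env_keys:
        app_compose["allowed_envs"] = env_keys
        changed = True

    if changed:
        upgrade_params["compose_file"] = json.dumps(app_compose)
    return upgrade_params


stored_state = json.dumps({"docker_compose_file": "services: {old: {}}", "allowed_envs": ["A"]})
merged = json.loads(
    build_upgrade_params_fixed(
        stored_state, "services: {old: {}, new: {}}", ["A", "LAUNCHER_CHANNEL"]
    )["compose_file"]
)
print(merged["docker_compose_file"])  # the new YAML
print(merged["allowed_envs"])         # the new env list
```

Serializing once at the end removes the ordering dependency between the branches, so adding a third field later cannot reintroduce the overwrite.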

Workaround (no upstream change needed)

Split the single update into two sequential vmm-cli update calls:

  1. vmm-cli update <vm_id> --env-file new.env --kms-url ... (settles allowed_envs and encrypted_env)
  2. vmm-cli update <vm_id> --compose new.yaml --vcpu ... --image ... --kms-url ... (applies the new compose against an already-matching allowed_envs, so branch 2 sees compose_changed=False and doesn't clobber the new compose)
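For scripted deployments, the two-step order can be wrapped in a small helper. A sketch under assumptions: it uses only the flags quoted in this issue (--env-file, --compose, --kms-url), takes the executable name vmm-cli.py from the summary, and passes any other options (the elided --vcpu / --image values) through untouched; split_update_commands and run_split_update are hypothetical names.

```python
import subprocess


def split_update_commands(vm_id, env_file, compose_file, kms_url, extra_compose_args=()):
    """Build the two argv lists for the workaround, env update first."""
    base = ["vmm-cli.py", "update", vm_id]
    env_cmd = base + ["--env-file", env_file, "--kms-url", kms_url]
    compose_cmd = base + ["--compose", compose_file, *extra_compose_args, "--kms-url", kms_url]
    return [env_cmd, compose_cmd]


def run_split_update(vm_id, env_file, compose_file, kms_url, extra_compose_args=()):
    """Run both updates in order; stop if the first one fails."""
    for cmd in split_update_commands(vm_id, env_file, compose_file, kms_url, extra_compose_args):
        subprocess.run(cmd, check=True)


# Inspect the commands without executing anything.
for cmd in split_update_commands("my-vm-id", "new.env", "new.yaml", "https://kms.example"):
    print(" ".join(cmd))
```

Because the original failure exits 0, check=True only guards against outright errors; it is still worth confirming the stored docker_compose_file after the second call.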

Environment

Reproduced on a downstream install (/usr/bin/vmm-cli.py, md5 da37c6fecd4219363e4c43076ca4fc30); upstream master at vmm/src/vmm-cli.py has the same code path. Hosts in question were built from a dstack release using dstack-nvidia-0.5.5.
