
docs(gpu): drop manual KubeVirt patch step now that the platform auto-wires permittedHostDevices #556

Open
Aleksei Sviridkin (lexfrei) wants to merge 2 commits into main from feat/gpu-auto-wiring


@lexfrei Aleksei Sviridkin (lexfrei) commented May 28, 2026

Companion to cozystack/cozystack#2768.

Rewrites step 2 of the GPU Passthrough guide. Until now the page instructed operators to run kubectl edit kubevirt -n cozy-kubevirt and hand-paste a permittedHostDevices.pciHostDevices block — that is the friction that ticket #2765 asked the platform to remove. With cozystack/cozystack#2768 landed, the bundle mirrors the chosen GPU variant into the KubeVirt CR automatically: HostDevices is appended to the feature-gate list and a starter NVIDIA pciHostDevices table (Hopper, Ada Lovelace, Ampere, Turing, Volta) is rendered alongside the operator's .gpu.permittedHostDevices extensions.

The new step 2 documents:

  • The contract — what the platform auto-injects and why (HostDevices gate, NVIDIA default table, externalResourceProvider: true semantics).
  • How to verify (kubectl -n cozy-kubevirt get kubevirt kubevirt -o yaml | yq ...).
  • The escape hatch — gpu.replaceDefaults, gpu.permittedHostDevices.pciHostDevices, plus the consequence of replaceDefaults: true with an empty list (no admittable GPU VMs).
  • The manual Package-CR override path — when an operator hand-crafts cozystack.gpu-operator outside the bundle for advanced overrides, they also hand-craft cozystack.kubevirt with the matching extraFeatureGates / permittedHostDevices. The manual override takes precedence over the bundle render.
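The extend-or-replace behaviour described in the escape-hatch bullet can be sketched in a few lines of Python. This is an illustration only, not the chart's template logic: the function name, the one-entry stand-in for the default table, the made-up custom entry, and the assumption that operator entries are simply appended after the defaults are all hypothetical.

```python
from typing import Any, Dict, List

PciHostDevice = Dict[str, Any]

# Hypothetical one-entry stand-in for the starter NVIDIA table; the real
# table lives in the chart and is not reproduced here.
DEFAULT_PCI_HOST_DEVICES: List[PciHostDevice] = [
    {
        "pciVendorSelector": "10DE:1EB8",  # illustrative vendor:device pair
        "resourceName": "nvidia.com/TU104GL_T4",
        "externalResourceProvider": True,
    },
]


def effective_pci_host_devices(
    defaults: List[PciHostDevice],
    operator_entries: List[PciHostDevice],
    replace_defaults: bool,
) -> List[PciHostDevice]:
    """Return the table that would end up in the KubeVirt CR.

    Assumed semantics: replaceDefaults=true drops the defaults entirely;
    otherwise operator entries are appended after the defaults.
    """
    if replace_defaults:
        return list(operator_entries)
    return list(defaults) + list(operator_entries)


# Made-up custom entry, used only to exercise the three cases.
custom = {
    "pciVendorSelector": "10DE:FFFF",
    "resourceName": "example.com/CUSTOM_GPU",
    "externalResourceProvider": True,
}

extended = effective_pci_host_devices(DEFAULT_PCI_HOST_DEVICES, [custom], False)
replaced = effective_pci_host_devices(DEFAULT_PCI_HOST_DEVICES, [custom], True)
emptied = effective_pci_host_devices(DEFAULT_PCI_HOST_DEVICES, [], True)
```

The last case mirrors the consequence called out above: `replaceDefaults: true` with an empty list leaves the table empty, so no GPU VM can be admitted.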

Only next/virtualization/gpu.md is touched. The released doc versions (v1.4 and earlier) describe earlier Cozystack releases that still require the manual kubectl edit, and stay as-is.

Release note

docs(gpu): the GPU Passthrough guide no longer instructs operators to manually patch the KubeVirt CR — Cozystack now auto-wires the HostDevices feature gate and a starter NVIDIA permittedHostDevices table whenever cozystack.gpu-operator is enabled in bundles.enabledPackages. Operators extend or replace the defaults via .gpu.permittedHostDevices and .gpu.replaceDefaults.

Summary by CodeRabbit

  • Documentation
    • Updated GPU virtualization documentation with automated configuration workflow when gpu-operator package is enabled
    • Added commands to verify GPU configuration and guidance for customization
    • Included upgrade instructions for existing manual GPU setup configurations
    • Documented manual override options for non-bundled deployments

…-wires permittedHostDevices

Step 2 of the GPU Passthrough guide instructed operators to
`kubectl edit kubevirt -n cozy-kubevirt` and hand-paste a
permittedHostDevices.pciHostDevices block. cozystack/cozystack#2768
removes the need for that step: when cozystack.gpu-operator is in
bundles.enabledPackages, the platform now mirrors the chosen GPU
variant into the KubeVirt CR automatically — appending HostDevices
to the feature-gate list and rendering a starter NVIDIA pciHostDevices
table covering Hopper, Ada Lovelace, Ampere, Turing and Volta.

The new step 2 documents the contract (what the platform auto-injects
and why), the verification recipe, the escape hatch via
.gpu.permittedHostDevices / .gpu.replaceDefaults, and the manual
Package-CR override path used by operators who need overrides the
bundle does not expose (driver settings, custom node selectors,
validator / dcgmExporter tweaks) — in that flow they also hand-craft
the matching cozystack.kubevirt Package CR.

Only next/virtualization/gpu.md is updated; v1.4 and earlier
describe releases that still require the manual patch and stay
as-is.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>

netlify Bot commented May 28, 2026

Deploy Preview for cozystack ready!

Name Link
🔨 Latest commit 5ba523c
🔍 Latest deploy log https://app.netlify.com/projects/cozystack/deploys/6a1f653c268721000711f1d1
😎 Deploy Preview https://deploy-preview-556--cozystack.netlify.app

coderabbitai Bot commented May 28, 2026

Walkthrough

This PR updates GPU configuration documentation for KubeVirt, replacing manual CR editing instructions with automated management via the cozystack.gpu-operator package. It adds verification commands, customization guidance via Platform Package values, upgrade instructions for existing users, and documents the manual override path for bundle opt-out scenarios.

Changes

GPU Configuration Workflow

| Layer / File(s) | Summary |
|---|---|
| Automatic configuration and customization (content/en/docs/next/virtualization/gpu.md) | Replaces manual KubeVirt CR editing with automatic wiring of HostDevices feature gates and permittedHostDevices when gpu-operator is enabled. Adds verification commands and documents extending or replacing NVIDIA defaults through Platform Package values including replaceDefaults behavior. |
| Upgrade and manual override paths (content/en/docs/next/virtualization/gpu.md) | Provides upgrade guidance for users with previously hand-edited permittedHostDevices entries, including steps to migrate custom entries into Platform Package values and validate resourceName. Documents the manual Package-CR override path when users opt out of bundle management. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A rabbit's gentle paw adjusts the GPU dials,
Where auto-wired configs now guide through trials,
No more kubectl edits, no manual blues—
Just smooth upgrades and feature-rich hues! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: dropping manual KubeVirt patch steps in favor of automatic platform wiring of GPU permittedHostDevices configuration. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the GPU virtualization documentation to explain that Cozystack now automatically configures and wires KubeVirt when the GPU operator is enabled. It details the automatic injection of host devices, how to extend or replace NVIDIA defaults, and the manual Package-CR override path. The review feedback suggests improving command portability by replacing yq with jq in the verification step, and correcting the configuration path from components.kubevirt.values to spec.values for standalone Package CRs.

Comment on lines +115 to +116
kubectl -n cozy-kubevirt get kubevirt kubevirt -o yaml \
| yq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}'

medium

Using yq can sometimes lead to compatibility issues depending on whether the user has the Python-based yq (which supports full jq syntax) or the Go-based yq (which has a different expression syntax) installed.

Using kubectl ... -o json | jq ... is more portable: jq has a single expression syntax, so the same filter behaves the same way wherever jq is installed.

Suggested change
- kubectl -n cozy-kubevirt get kubevirt kubevirt -o yaml \
- | yq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}'
+ kubectl -n cozy-kubevirt get kubevirt kubevirt -o json \
+ | jq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}'
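For readers who would rather avoid depending on either yq flavour or on jq, the same projection can be done on the parsed `kubectl ... -o json` output. The sketch below is a hypothetical helper, not part of the guide; the function name and the sample object are assumptions made for illustration.

```python
from typing import Any, Dict


def summarize_kubevirt(cr: Dict[str, Any]) -> Dict[str, Any]:
    """Pick out the two fields the verification step looks at.

    Missing keys yield None, which is roughly what the jq filter would
    print as null.
    """
    configuration = cr.get("spec", {}).get("configuration", {})
    developer = configuration.get("developerConfiguration", {})
    return {
        "featureGates": developer.get("featureGates"),
        "permittedHostDevices": configuration.get("permittedHostDevices"),
    }


# Hand-written sample shaped like a KubeVirt CR; values are illustrative.
sample = {
    "spec": {
        "configuration": {
            "developerConfiguration": {"featureGates": ["HostDevices"]},
            "permittedHostDevices": {"pciHostDevices": []},
        }
    }
}

summary = summarize_kubevirt(sample)
```

In practice the dictionary would come from `json.load(sys.stdin)` with the kubectl command piped in; that wiring is left out so the sketch stays runnable without a cluster.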


### Manual Package-CR override path

If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist.

medium

When creating a standalone cozystack.kubevirt Package CR directly, the configuration values should be defined under spec.values rather than components.kubevirt.values. The components.<name>.values structure is used when configuring components within the umbrella cozystack-platform package.

Updating this path ensures the standalone Package CR is configured correctly.

Suggested change
If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist.
If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `spec.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist.


@kvaps Andrei Kvapil (kvaps) left a comment


Requesting changes on one thing: keep a discoverable "GPU not in the default table" escape hatch — but route it through .gpu.permittedHostDevices, not kubectl edit. The rest of the rewrite is good.

I'd rather we not leave operators without a visible manual path. Two points:

  1. The reconcile-safe manual path already lives in this PR — the "Extending or replacing the NVIDIA defaults" section (.gpu.permittedHostDevices + replaceDefaults). That's the right answer for a card not in the static table, and it survives reconcile because it flows through platform values → the KubeVirt CR template. My only ask is to make it more discoverable — e.g. a short FAQ entry / callout titled "My GPU isn't in the default table" that links to it, since operators upgrading from the old flow will look for the removed kubectl edit step.

  2. Please don't reinstate the old kubectl edit kubevirt step verbatim behind a spoiler. Post-auto-wiring that field is owned by the chart template, so a hand edit is reverted on the next Flux/Helm reconcile — keeping it as-is would be a footgun. If we show the raw CR shape at all, it should be explicitly labelled "reference only — permittedHostDevices is reconciled from platform values; edit .gpu.permittedHostDevices instead" inside the collapsible.

This ties into the upgrade-safety request on the platform side — cozystack/cozystack#2768 (and the migration breakdown in cozystack/cozystack#2768 (comment)): operators who hand-edited permittedHostDevices need a clear, persistent migration target, and .gpu.permittedHostDevices is it. Worth surfacing the same upgrade note here too.

Net: keep the manual capability, just anchor it on the persistent knob and make it easy to find.
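The footgun described in point 2 can be modelled with a toy reconcile loop. This is a simplified sketch under stated assumptions: the function names are hypothetical, the default NVIDIA table is left out for brevity, and the real Flux/Helm render is far more involved. It only shows why an entry that lives in the CR but not in the platform values disappears.

```python
import copy
from typing import Any, Dict


def render_permitted_host_devices(values: Dict[str, Any]) -> Dict[str, Any]:
    """Render the block purely from platform values (defaults omitted)."""
    entries = (
        values.get("gpu", {})
        .get("permittedHostDevices", {})
        .get("pciHostDevices", [])
    )
    return {"pciHostDevices": copy.deepcopy(entries)}


def reconcile(live_cr: Dict[str, Any], values: Dict[str, Any]) -> Dict[str, Any]:
    """Overwrite the template-owned field with the rendered value."""
    desired = copy.deepcopy(live_cr)
    configuration = desired.setdefault("spec", {}).setdefault("configuration", {})
    configuration["permittedHostDevices"] = render_permitted_host_devices(values)
    return desired


# Made-up entry standing in for a card added by hand with kubectl edit.
hand_edit = {
    "pciVendorSelector": "10DE:FFFF",
    "resourceName": "example.com/CUSTOM_GPU",
    "externalResourceProvider": True,
}
live = {
    "spec": {"configuration": {"permittedHostDevices": {"pciHostDevices": [hand_edit]}}}
}

# Values that never mention the card: the hand edit is lost on reconcile.
after_plain = reconcile(live, {})

# Values that carry the card under gpu.permittedHostDevices: it survives.
after_migrated = reconcile(
    live, {"gpu": {"permittedHostDevices": {"pciHostDevices": [hand_edit]}}}
)
```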

…ostDevices

The bundle now owns spec.configuration.permittedHostDevices, so the first
reconcile after upgrade overwrites manual kubectl-edit entries with the NVIDIA
default table. Tell operators to move custom entries into
.gpu.permittedHostDevices and verify each resourceName against node-advertised
names before upgrading, since the default slugs (e.g. TU104GL_T4) differ from
legacy names (e.g. TU104GL_TESLA_T4) and a mismatch silently rejects GPU VMs.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
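The pre-upgrade check this commit asks for, comparing each configured resourceName against what nodes advertise, could look roughly like the following. It is a minimal sketch: the function name and the sample node data are hypothetical, and in a real cluster the allocatable map would be read from each node's `status.allocatable`.

```python
from typing import Dict, Iterable, List


def unadvertised_resource_names(
    configured: Iterable[str],
    node_allocatable: Dict[str, Dict[str, str]],
) -> List[str]:
    """Return configured resourceNames that no node advertises."""
    advertised = set()
    for resources in node_allocatable.values():
        advertised.update(resources)  # adds the resource-name keys
    return sorted(name for name in configured if name not in advertised)


# Illustrative data: one entry still uses the legacy T4 slug.
configured = ["nvidia.com/TU104GL_TESLA_T4", "nvidia.com/GA102GL_A10"]
node_allocatable = {
    "gpu-node-1": {"cpu": "64", "nvidia.com/TU104GL_T4": "1"},
    "gpu-node-2": {"cpu": "64", "nvidia.com/GA102GL_A10": "2"},
}

missing = unadvertised_resource_names(configured, node_allocatable)
```

Any name left in `missing` is one that would be configured in the table but never matched by a node, which is the silent-rejection case the commit message warns about.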

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
content/en/docs/next/virtualization/gpu.md (1)

147-150: ⚡ Quick win

Inconsistent kubectl command pattern.

This command uses kubectl get kubevirt -n cozy-kubevirt -o yaml without specifying the resource name, then indexes into .items[0]. However, line 115 uses kubectl get kubevirt kubevirt with the explicit resource name, which returns the object directly without needing .items[] indexing.

For consistency and clarity, use the same pattern as line 115:

📝 Suggested fix for consistency
-   kubectl get kubevirt -n cozy-kubevirt -o yaml \
-     | yq '.items[0].spec.configuration.permittedHostDevices'
+   kubectl -n cozy-kubevirt get kubevirt kubevirt -o yaml \
+     | yq '.spec.configuration.permittedHostDevices'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@content/en/docs/next/virtualization/gpu.md` around lines 147 - 150, The
kubectl command uses the list-style invocation and then indexes into .items[0],
which is inconsistent with the explicit resource call used earlier; update the
command so it targets the specific KubeVirt resource name (same pattern as the
earlier `kubectl get kubevirt kubevirt`) and remove the need for `.items[0]`
when extracting `.spec.configuration.permittedHostDevices` to keep command style
consistent and clearer.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cc18be5d-d493-4e16-a7de-27048f475ce7

📥 Commits

Reviewing files that changed from the base of the PR and between ef54f10 and 5ba523c.

📒 Files selected for processing (1)
  • content/en/docs/next/virtualization/gpu.md

Specifically, the platform injects:

- `HostDevices` into `spec.configuration.developerConfiguration.featureGates` (current KubeVirt splits this from the `GPU` gate; the admission webhook rejects `domain.devices.hostDevices` without it).
- A starter `spec.configuration.permittedHostDevices.pciHostDevices` table covering common NVIDIA datacenter GPUs — Hopper (H100, H200), Ada Lovelace (L4, L40, L40S), Ampere (A100 PCIe/SXM, A40, A30, A10), Turing (T4), Volta (V100, V100S). PCI vendor:device pairs are stable; `resourceName` slugs follow the `<arch>_<model>_<form>_<mem>` convention `nvidia-sandbox-device-plugin` v25.x emits (e.g. `nvidia.com/GA102GL_A10`). `externalResourceProvider: true` is set on every entry because the resources are advertised by the sandbox plugin, not by KubeVirt's in-tree device plugin.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check the actual resource naming pattern in nvidia-sandbox-device-plugin documentation

# Search for resourceName examples and naming documentation
rg -C5 'resourceName.*nvidia\.com' --type=md

# Also check for any naming convention documentation
rg -C3 'arch.*model.*form.*mem|naming.*convention' --type=md

Repository: cozystack/website

Length of output: 42109


Fix resourceName naming-convention wording (line 110)

  • Line 110 states resourceName slugs follow <arch>_<model>_<form>_<mem>, but the document’s examples only use two components (e.g., nvidia.com/GA102GL_A10, nvidia.com/TU104GL_T4). Update the convention text to match the actual slug format (and clarify when/if <form> and <mem> are present) or adjust the examples accordingly.
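The mismatch is easy to see by counting the components of the example slugs. The helper below is a hypothetical illustration that assumes components are separated by underscores; a model name that itself contains an underscore would be split further, so the count is a rough check rather than a parser for the plugin's naming scheme.

```python
from typing import List


def slug_components(resource_name: str) -> List[str]:
    """Split the part after the vendor prefix on underscores."""
    _, _, slug = resource_name.rpartition("/")
    return slug.split("_")


a10 = slug_components("nvidia.com/GA102GL_A10")
t4 = slug_components("nvidia.com/TU104GL_T4")
```

Both examples yield two components, not the four that `<arch>_<model>_<form>_<mem>` implies.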
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@content/en/docs/next/virtualization/gpu.md` at line 110, Update the wording
around resourceName in the
spec.configuration.permittedHostDevices.pciHostDevices paragraph to reflect the
actual slug format used by nvidia-sandbox-device-plugin (v25.x): state that
resourceName slugs are typically two-component identifiers like
`nvidia.com/GA102GL_A10` or `nvidia.com/TU104GL_T4` and clarify that optional
`<form>` and `<mem>` components may be appended for more specific devices (i.e.,
`<arch>_<model>` is the common case, with optional `_<form>_<mem>` when
present); keep the note about externalResourceProvider: true and mention the
plugin as the source of these resource names.
