Skip to content

feat(backup): incremental NAS backup support for KVM#13074

Open
jmsperu wants to merge 20 commits into
apache:mainfrom
jmsperu:feature/nas-backup-incremental
Open

feat(backup): incremental NAS backup support for KVM#13074
jmsperu wants to merge 20 commits into
apache:mainfrom
jmsperu:feature/nas-backup-incremental

Conversation

@jmsperu
Copy link
Copy Markdown
Collaborator

@jmsperu jmsperu commented Apr 27, 2026

Summary

Implements incremental backup support for the NAS backup provider on KVM, using QEMU dirty bitmaps and libvirt's backup-begin API. RFC: #12899.

For large VMs this reduces daily backup storage 80–95% and shortens backup windows from hours to minutes (e.g. a 500 GB VM with moderate writes goes from ~500 GB/day to ~5–15 GB/day after the initial full backup).

What's in the PR

Commit What
f2a9202d74 RFC document at the RFC comment on issue #12899
1981469099 NASBackupChainKeys constants + zone-scoped nas.backup.full.every ConfigKey (default 10)
fbb916b254 nasbackup.sh mode-aware: full+checkpoint or incremental+rebase via backup-begin
1f2aebca36 Java orchestration: full-vs-incremental decision in provider, chain metadata in backup_details
43e2f7504a On-demand bitmap recreation when CloudStack rebuilt the domain XML on VM restart
39303fbf88 Restore path: relative-path rebase + qemu-img convert flatten for file-based primary
b8d069e127 Cascade delete: RebaseBackupCommand, chain repair for delete-middle, refuse-delete-full-with-children
49edc7f22c Five new smoke tests in test/integration/smoke/test_backup_recovery_nas.py

Full diff: 11 files, +1617 / −30.

Review feedback addressed (all from #12899 thread)

# Reviewer Concern Resolution
1 @JoaoJandre No new columns on backups Chain metadata stored in existing backup_details kv table via NASBackupChainKeys
2 @abh1sar nas.backup.full.interval (days) doesn't fit hourly/ad-hoc Replaced with count-based nas.backup.full.every (default 10)
3 @abh1sar Use backup-begin for full backups too Done — both modes use backup-begin; full omits <incremental>
4 @abh1sar Timestamp-based bitmap names backup-<epoch> (System.currentTimeMillis()/1000)
5 @abh1sar No explicit block-dirty-bitmap-add libvirt manages bitmaps via --checkpointxml; manual bitmap commands removed
6 @abh1sar qemu-img rebase after each incremental Done in nasbackup.sh, with relative backing path so chain survives mount-point churn
7 @abh1sar Stopped VMs Stopped VMs always full; agent emits INCREMENTAL_FALLBACK= if cadence asked for inc
8 @abh1sar Cascade delete behaviour Implemented: middle-inc rebases child onto grandparent; full-with-children refuses unless forced=true
9 @abh1sar Bitmap recreation on VM restart Lazy recreation at next backup attempt — agent checks virsh checkpoint-list, recreates if missing, emits BITMAP_RECREATED=
10 @abh1sar Smoke tests 5 new cases in test_backup_recovery_nas.py
11 @abh1sar Single PR for 4.23 This PR

Backwards compatibility

  • The new -M / --bitmap-* flags on nasbackup.sh are optional. Without them, the script preserves the legacy full-only behaviour exactly (no checkpoint creation, same XML).
  • TakeBackupCommand new fields default to null; LibvirtTakeBackupCommandWrapper only emits the new flags when set, so a 4.22 management server talking to a 4.23 agent still works.
  • Existing backups (no chain_id in backup_details) are treated as standalone fulls by the cascade-delete logic — no migration needed.

Test plan

  • Build (CI)
  • Unit tests (CI; existing NASBackupProviderTest should still pass)
  • Smoke tests in test_backup_recovery_nas.py (5 new cases — require required_hardware="true")
  • Manual: take 5 backups with nas.backup.full.every=3, verify chain pattern FULL, INC, INC, FULL, INC
  • Manual: restore from a tail incremental and diff the restored disk against the live disk at backup time
  • Manual: stop-start a VM between two incrementals and check BITMAP_RECREATED= shows up in agent logs
  • Manual: delete a middle inc and confirm the next restore from the tail still has the deleted backup's blocks

Refs

Adds the design document for incremental NAS backups using QEMU dirty
bitmaps and libvirt's backup-begin API. Reduces daily backup storage
80-95% for large VMs.

Refs: apache#12899
NASBackupChainKeys defines the keys this provider stores under the
existing backup_details kv table (parent_backup_id, bitmap_name,
chain_id, chain_position, type). This keeps the backups table
provider-agnostic per the RFC review.

nas.backup.full.every is a zone-scoped ConfigKey that controls how
often a full backup is taken; the remaining backups in the cycle are
incremental. Counts backups (not days), so it works for hourly,
daily, and ad-hoc schedules. Default 10. Set to 1 to disable
incrementals (every backup is full).

Refs: apache#12899
Adds three new optional CLI flags to nasbackup.sh:
  -M|--mode <full|incremental>
  --bitmap-new <name>          (checkpoint to create with this backup)
  --bitmap-parent <name>       (incremental: parent bitmap to read changes since)
  --parent-path <path>         (incremental: parent backup file for rebase)

Behavior:
  - When -M is omitted, behavior is unchanged (legacy full-only, no checkpoint
    created), so existing callers are not affected.
  - With -M full + --bitmap-new, a full backup is taken AND a libvirt
    checkpoint of that name is registered atomically (via backup-begin's
    --checkpointxml), giving the next incremental its starting bitmap.
  - With -M incremental, libvirt's <incremental> element references the
    parent bitmap; only changed blocks are written. After completion,
    qemu-img rebase wires the new file to its parent so the chain on the
    NAS is self-describing for restore.
  - Stopped VMs cannot use backup-begin; if -M incremental is requested
    while VM is stopped, the script falls back to a full and emits
    INCREMENTAL_FALLBACK= on stderr so the orchestrator can record it
    correctly in the chain.
  - The script echoes BITMAP_CREATED=<name> on success so the Java caller
    can store it under backup_details (NASBackupChainKeys.BITMAP_NAME).

Works across local file, NFS-file, and LINSTOR primary storage. Ceph RBD
running-VM support is a pre-existing limitation of this script, not
affected by this change.

Refs: apache#12899
jmsperu added 5 commits April 27, 2026 19:07
Adds the Java side of the incremental NAS backup feature:

  TakeBackupCommand
    + mode, bitmapNew, bitmapParent, parentPath fields (null for legacy
      callers — script preserves its existing behaviour when these are
      omitted).

  BackupAnswer
    + bitmapCreated (echoed by the agent on success)
    + incrementalFallback (true when an incremental was requested but the
      agent had to fall back to full because the VM was stopped).

  LibvirtTakeBackupCommandWrapper
    - Forwards the new fields to nasbackup.sh.
    - Strips the new BITMAP_CREATED= / INCREMENTAL_FALLBACK= marker lines
      out of stdout before the existing numeric-suffix size parser runs,
      so the script can keep the same "size as last line(s)" contract.
    - Surfaces both markers on the BackupAnswer.

  NASBackupProvider
    - decideChain(vm) walks backup_details (chain_id, chain_position,
      bitmap_name) for the latest BackedUp backup of the VM and decides:
        * Stopped VM      -> full (libvirt backup-begin needs running QEMU)
        * No prior chain  -> full (chain_position=0)
        * chain_position+1 >= nas.backup.full.every -> new full
        * otherwise       -> incremental, parent=last bitmap
    - Generates timestamp-based bitmap names ("backup-<epoch>") matching
      what the script then registers as the libvirt checkpoint name.
    - persistChainMetadata() writes parent_backup_id, bitmap_name,
      chain_id, chain_position, type into the existing backup_details
      key/value table (per the RFC review — no new columns on backups).
    - Honours the agent's INCREMENTAL_FALLBACK= signal: re-records the
      backup as a full and starts a fresh chain.
    - createBackupObject() now takes a type argument so the BackupVO
      reflects the actual decision instead of always being "FULL".

Refs: apache#12899
CloudStack rebuilds the libvirt domain XML on every VM start, which means
persistent QEMU dirty bitmaps don't survive a stop/start cycle. Rather
than hooking into the VM start lifecycle (intrusive across the
orchestration layer), this commit handles the missing bitmap *lazily* at
the next backup attempt:

  nasbackup.sh
    - When -M incremental is requested, the script first checks
      `virsh checkpoint-list` for the parent bitmap. If absent, it
      recreates the checkpoint on the running domain so libvirt accepts
      the <incremental> reference. The next incremental will be larger
      than usual (it captures all writes since recreate, not since the
      previous incremental) but is correct; subsequent ones return to
      normal size.
    - On recreation, emits BITMAP_RECREATED=<name> on stdout for the
      orchestrator to record.

  BackupAnswer
    + bitmapRecreated field surfaced from the agent.

  LibvirtTakeBackupCommandWrapper
    - Strips BITMAP_RECREATED= line from stdout before size parsing.
    - Sets answer.setBitmapRecreated(...).

  NASBackupChainKeys
    + BITMAP_RECREATED key for backup_details.

  NASBackupProvider
    - When the agent reports a recreated bitmap, persists it under
      backup_details and logs an info-level message so operators can
      correlate larger-than-usual incrementals with VM restarts.

This satisfies the bitmap-loss-on-VM-restart concern from the RFC review
without touching VirtualMachineManager / StartCommand / agent lifecycle.

Refs: apache#12899
Two changes that together let an incremental NAS backup be restored
without manual chain assembly:

  scripts/vm/hypervisor/kvm/nasbackup.sh
    - qemu-img rebase now writes a backing-file path that is RELATIVE to
      the new qcow2's directory (e.g. ../<parent-ts>/root.<uuid>.qcow2)
      rather than the absolute path on the current mount point. NAS mount
      points are ephemeral (mktemp -d), so an absolute reference would
      not resolve when the backup is re-mounted at restore time. Relative
      references are resolved by qemu-img against the file's own
      directory, so the chain stays valid no matter where the NAS is
      mounted next.
    - Verifies the parent file exists on the NAS before rebasing.

  LibvirtRestoreBackupCommandWrapper
    - For file-based primary storage (local, NFS-file), the existing
      code rsync'd the source qcow2 to the volume. That copies only the
      differential blocks of an incremental, leaving a volume whose
      backing-file reference points at a path the primary storage host
      doesn't have. Now: detect a backing-chain via qemu-img info JSON
      and flatten via 'qemu-img convert -O qcow2', which follows the
      chain and produces a self-contained qcow2. Full backups continue
      to use rsync (faster, no chain to flatten).
    - The block-storage path (RBD/Linstor) already used qemu-img convert
      via the QemuImg helper, which auto-flattens chains, so that path
      needed no change.

Refs: apache#12899
Adds the delete-with-chain-repair semantics agreed in the RFC review:

  scripts/vm/hypervisor/kvm/nasbackup.sh
    - New '-o rebase' operation: rebases an existing on-NAS qcow2 onto
      a new backing parent. Uses a SAFE rebase (no -u) so the target
      absorbs blocks of the about-to-be-deleted parent before the
      backing pointer is moved up to the grandparent. Writes the new
      backing reference relative to the target's directory so it
      survives mount-point changes.
    - New CLI flags --rebase-target, --rebase-new-backing (both passed
      mount-relative).

  RebaseBackupCommand + LibvirtRebaseBackupCommandWrapper
    - New agent command that wraps the script's rebase operation. The
      provider sends one of these per child that needs re-pointing.

  NASBackupProvider.deleteBackup
    - Now plans the chain repair before touching files via
      computeChainRepair():
        * No chain metadata     -> single-file delete (legacy behaviour)
        * Tail incremental      -> single delete, no rebase
        * Middle incremental    -> rebase immediate child onto our
                                   parent, then delete; shift
                                   chain_position of all later
                                   descendants by -1
        * Full with descendants -> refuse unless forced=true; with
                                   forced=true delete full + every
                                   descendant newest-first
    - Updates parent_backup_id, chain_position metadata in
      backup_details after each rebase so the model in the DB matches
      the on-disk chain.

This implements the cascade-delete behaviour requested in @abh1sar's
review point apache#7.

Refs: apache#12899
Adds five new test cases to test_backup_recovery_nas.py covering the
end-to-end behaviour of the incremental NAS backup feature:

  * test_incremental_chain_cadence
      - Sets nas.backup.full.every=3, takes 5 backups, verifies the
        type pattern is FULL, INC, INC, FULL, INC.

  * test_restore_from_incremental
      - FULL + 2 INCs, each with a marker file. Restores from the
        latest INC and verifies all three markers are present
        (i.e. qemu-img convert flattened the chain correctly).

  * test_delete_middle_incremental_repairs_chain
      - Builds FULL, INC1, INC2; deletes INC1 (no force needed);
        restores from the surviving INC2 and verifies that markers
        from FULL, INC1 (which was deleted), and INC2 are all present
        — proving the rebase merged INC1's blocks into INC2.

  * test_refuse_delete_full_with_children
      - Verifies plain delete of a FULL that has children fails, and
        delete with forced=true succeeds and removes the whole chain.

  * test_stopped_vm_falls_back_to_full
      - Sets cadence to 2, takes one backup (FULL), stops the VM,
        triggers another (cadence would say INC). Verifies the second
        backup is recorded as FULL because the agent fell back when
        backup-begin couldn't run on a stopped VM.

All tests restore nas.backup.full.every to 10 in finally blocks.

Refs: apache#12899
@boring-cyborg boring-cyborg Bot added component:integration-test Python Warning... Python code Ahead! labels Apr 27, 2026
@jmsperu jmsperu changed the title [WIP] feat(backup): incremental NAS backup support for KVM feat(backup): incremental NAS backup support for KVM Apr 27, 2026
@jmsperu jmsperu marked this pull request as ready for review April 27, 2026 16:26
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 29.71014% with 291 lines in your changes missing coverage. Please review.
✅ Project coverage is 18.13%. Comparing base (6f4445c) to head (bedcc2a).
⚠️ Report is 60 commits behind head on main.

Files with missing lines Patch % Lines
...rg/apache/cloudstack/backup/NASBackupProvider.java 38.76% 144 Missing and 25 partials ⚠️
...ource/wrapper/LibvirtTakeBackupCommandWrapper.java 0.00% 41 Missing ⚠️
.../apache/cloudstack/backup/RebaseBackupCommand.java 0.00% 26 Missing ⚠️
...ava/org/apache/cloudstack/backup/BackupAnswer.java 0.00% 18 Missing ⚠️
...rce/wrapper/LibvirtRebaseBackupCommandWrapper.java 5.55% 17 Missing ⚠️
...rg/apache/cloudstack/backup/TakeBackupCommand.java 33.33% 16 Missing ⚠️
...g/apache/cloudstack/backup/NASBackupChainKeys.java 0.00% 3 Missing ⚠️
...ce/wrapper/LibvirtRestoreBackupCommandWrapper.java 87.50% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13074      +/-   ##
============================================
+ Coverage     18.02%   18.13%   +0.10%     
- Complexity    16621    16788     +167     
============================================
  Files          6029     6040      +11     
  Lines        542184   543183     +999     
  Branches      66451    66525      +74     
============================================
+ Hits          97740    98513     +773     
- Misses       433428   433595     +167     
- Partials      11016    11075      +59     
Flag Coverage Δ
uitests 3.51% <ø> (-0.01%) ⬇️
unittests 19.30% <29.71%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@winterhazel winterhazel added this to the 4.23.0 milestone Apr 27, 2026
@sureshanaparti
Copy link
Copy Markdown
Contributor

@jmsperu can you check the build failure. thanks.

@weizhouapache
Copy link
Copy Markdown
Member

@jmsperu
is this ready for review ?

Phase 6 added a hasBackingChain() check before rsync that uses
qemu-img info to detect chained incrementals. The existing
testExecuteWithRsyncFailure test mocks Script.runSimpleBashScriptForExitValue
to return 0 for any command, so the new qemu-img info check
incorrectly evaluates as "has backing chain" and routes the test
through the chain-flatten path instead of rsync — the test then
asserts a failure that never occurs.

Add a clause to the mock that returns 1 (no backing chain) for the
qemu-img info backing-filename probe, so the test continues to
exercise the rsync path it was designed for.
@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented Apr 28, 2026

@weizhouapache yes — ready for review.

@sureshanaparti — apologies, I missed your earlier ping. The build failure was a unit test in LibvirtRestoreBackupCommandWrapperTest.testExecuteWithRsyncFailure (NPE on currentDevice after my new chain-flatten check incorrectly routed the test through the qemu-img convert path).

Fixed in d80ed16: the test's Script.runSimpleBashScriptForExitValue mock now returns 1 (no backing chain) for the new qemu-img info | grep "backing-filename" probe, so the test continues to exercise the rsync path it was designed for.

CI should be green on the next run. Cc @abh1sar @JoaoJandre @harikrishna-patnala in case you also want to take a look.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds incremental backup-chain support to the NAS backup provider for KVM by leveraging libvirt backup-begin with checkpoints/dirty-bitmaps, plus restore/flatten and chain-aware delete/repair semantics.

Changes:

  • Introduces backup-chain metadata keys (NASBackupChainKeys) and zone-scoped cadence config nas.backup.full.every, with orchestration logic to choose full vs incremental and persist chain details in backup_details.
  • Extends the KVM agent + nasbackup.sh to support full-with-checkpoint and incremental-with-rebase, plus a new “rebase” operation used for chain repair during delete.
  • Updates restore logic to detect qcow2 backing chains and flatten via qemu-img convert, and adds new integration smoke tests for incremental-chain behavior.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
test/integration/smoke/test_backup_recovery_nas.py Adds incremental-chain smoke tests (cadence, restore, delete-middle repair, forced delete behavior, stopped-VM fallback).
scripts/vm/hypervisor/kvm/nasbackup.sh Adds mode-aware backup (full/incremental), checkpoint creation, incremental rebase, and a new rebase operation for delete-middle chain repair.
plugins/hypervisors/kvm/src/test/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtRestoreBackupCommandWrapperTest.java Extends restore wrapper tests to exercise the “no backing chain => rsync” path.
plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtTakeBackupCommandWrapper.java Passes incremental args to nasbackup.sh and parses bitmap/fallback markers from script output.
plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtRestoreBackupCommandWrapper.java Detects qcow2 backing chains and flattens incrementals during restore using qemu-img convert.
plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtRebaseBackupCommandWrapper.java New wrapper to run nasbackup.sh -o rebase for chain repair.
plugins/backup/nas/src/main/java/org/apache/cloudstack/backup/NASBackupProvider.java Implements full-vs-incremental decisions, stores chain metadata in backup_details, and adds chain-aware delete/repair logic.
plugins/backup/nas/src/main/java/org/apache/cloudstack/backup/NASBackupChainKeys.java Defines backup_details keys for chain id/position/type/bitmap/parent linkage.
docs/rfcs/incremental-nas-backup.md Adds an RFC document describing incremental NAS backup approach (needs alignment with final implementation).
core/src/main/java/org/apache/cloudstack/backup/TakeBackupCommand.java Adds optional incremental-mode fields (mode/bitmap names/parent path).
core/src/main/java/org/apache/cloudstack/backup/RebaseBackupCommand.java New agent command to rebase a backup qcow2 onto a new backing file for chain repair.
core/src/main/java/org/apache/cloudstack/backup/BackupAnswer.java Adds fields to return bitmap creation/recreation and incremental-fallback markers back to orchestration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/rfcs/incremental-nas-backup.md Outdated
Comment on lines +276 to +283
# All tests set nas.backup.full.every to a small value (3) so a chain
# forms quickly without needing many backup iterations. They restore
# the original value at teardown.

def _set_full_every(self, value):
Configurations.update(self.apiclient, name='nas.backup.full.every',
value=str(value), zoneid=self.zone.id)

Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The incremental tests always reset nas.backup.full.every to 10 in finally, but they never read/preserve the original zone value. If a test environment has a non-default value configured, these tests will leave the zone config changed. Consider capturing the current value once (e.g., via Configurations.list(...) in setup) and restoring that exact value in each finally (or in tearDown).

Copilot uses AI. Check for mistakes.
Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
# to full and signal the fallback so the orchestrator can record it as a full
# in the chain.
if [[ "$MODE" == "incremental" ]]; then
echo "INCREMENTAL_FALLBACK=full (VM stopped — incremental requires running VM)" >&2
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INCREMENTAL_FALLBACK=... is printed to stderr (>&2), but Script.executePipedCommands(...) only captures stdout. As a result, the management server will never see this marker and can't reliably record a stopped-VM backup as FULL / start a new chain. Emit the fallback marker on stdout (or change the Java wrapper to capture stderr too).

Suggested change
echo "INCREMENTAL_FALLBACK=full (VM stopped — incremental requires running VM)" >&2
echo "INCREMENTAL_FALLBACK=full (VM stopped — incremental requires running VM)"

Copilot uses AI. Check for mistakes.
command.setBitmapParent(decision.bitmapParent);
command.setParentPath(decision.parentPath);

if (VirtualMachine.State.Stopped.equals(vm.getState())) {
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For stopped VMs decideChain returns fullStart(newBitmap) and takeBackup always sets command.mode="full" / bitmapNew. But nasbackup.sh's stopped-VM path doesn’t create a checkpoint/bitmap (no BITMAP_CREATED=), so persisting nas.bitmap_name from the requested bitmap can cause the next backup to attempt an incremental against a bitmap that was never created. Consider clearing mode/bitmapNew (legacy full) for stopped VMs and/or only persisting nas.bitmap_name when the agent confirms it via BITMAP_CREATED=.

Suggested change
if (VirtualMachine.State.Stopped.equals(vm.getState())) {
if (VirtualMachine.State.Stopped.equals(vm.getState())) {
// Stopped-VM backups use the offline path and do not create checkpoints/bitmaps.
// Clear chain metadata so a full backup does not imply a bitmap was created.
command.setMode(null);
command.setBitmapNew(null);
command.setBitmapParent(null);
command.setParentPath(null);

Copilot uses AI. Check for mistakes.
Comment on lines +269 to +287
if [[ "$effective_mode" == "incremental" ]]; then
volUuid="${fullpath##*/}"
if [[ "$fullpath" == /dev/drbd/by-res/* ]]; then
volUuid=$(get_linstor_uuid_from_path "$fullpath")
fi
# PARENT_PATH from the orchestrator is the parent backup's path relative to the
# NAS mount root (e.g. "i-2-X/2026.04.27.12.00.00/root.UUID.qcow2"). Convert it to
# a path relative to THIS new qcow2's directory so the backing reference resolves
# correctly the next time the NAS is mounted (mount points are ephemeral).
local parent_abs="$mount_point/$PARENT_PATH"
if [[ ! -f "$parent_abs" ]]; then
echo "Parent backup file does not exist on NAS: $parent_abs"
cleanup
exit 1
fi
local parent_rel
parent_rel=$(realpath --relative-to="$dest" "$parent_abs")
if ! qemu-img rebase -u -b "$parent_rel" -F qcow2 "$dest/$name.$volUuid.qcow2" >> "$logFile" 2> >(cat >&2); then
echo "qemu-img rebase failed for $dest/$name.$volUuid.qcow2 onto $parent_rel"
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In incremental mode, each exported qcow2 (root.* and datadisk.*) is rebased onto the same PARENT_PATH. For VMs with multiple volumes this will rebase data-disk incrementals onto the root-disk parent file, corrupting the chain for non-root volumes. PARENT_PATH needs to be per-disk (or pass the parent backup directory and derive the correct parent filename for each volUuid/disk role).

Copilot uses AI. Check for mistakes.
Comment on lines +116 to +136
// Strip out our incremental marker lines before parsing size, so the legacy
// numeric-suffix parser keeps working.
String stdout = result.second().trim();
String bitmapCreated = null;
String bitmapRecreated = null;
boolean incrementalFallback = false;
StringBuilder filtered = new StringBuilder();
for (String line : stdout.split("\n")) {
String trimmed = line.trim();
if (trimmed.startsWith("BITMAP_CREATED=")) {
bitmapCreated = trimmed.substring("BITMAP_CREATED=".length());
continue;
}
if (trimmed.startsWith("BITMAP_RECREATED=")) {
bitmapRecreated = trimmed.substring("BITMAP_RECREATED=".length());
continue;
}
if (trimmed.startsWith("INCREMENTAL_FALLBACK=")) {
incrementalFallback = true;
continue;
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wrapper tries to detect INCREMENTAL_FALLBACK= from result.second(), but Script.executePipedCommands(...) only returns stdout from the process pipeline (it does not capture stderr). Since nasbackup.sh currently prints the fallback marker to stderr, incrementalFallback will always remain false here. Align the script/wrapper so the marker is emitted on stdout (or update the execution helper to merge/capture stderr).

Copilot uses AI. Check for mistakes.
Comment on lines +456 to +460
echo "Incremental backup options (running VMs only; requires QEMU >= 4.2 and libvirt >= 7.2):"
echo " -M|--mode full Take a full backup AND create a checkpoint (--bitmap-new required) for future incrementals."
echo " -M|--mode incremental Take an incremental backup since --bitmap-parent and create new checkpoint --bitmap-new."
echo " Requires --bitmap-parent, --bitmap-new, and --parent-path (parent backup file for rebase)."
echo " Without -M, behaves as legacy full-only backup with no checkpoint creation."
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage text says “Without -M, behaves as legacy full-only backup with no checkpoint creation”, but the script still runs sanity_checks unconditionally (QEMU>=4.2/libvirt>=7.2) even for legacy callers and non-backup ops (delete/stats/rebase). To preserve the documented legacy behavior/backward compatibility, gate the version checks to only the code paths that actually require backup-begin/incremental features (or only when MODE is set).

Copilot uses AI. Check for mistakes.
Comment thread docs/rfcs/incremental-nas-backup.md Outdated
@bernardodemarco pointed out that design docs / RFCs go in the project
wiki or as a separate issue rather than into the source tree. The RFC
content has been posted as a comment on the existing tracking issue
apache#12899 (which is where the design discussion already lives), and the
docs/rfcs/ directory is removed from this PR.
@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented Apr 28, 2026

@bernardodemarco thanks — good point. Done in 9764025:

PR is now purely the implementation. Updated PR description to drop the doc reference.

jmsperu added 2 commits May 16, 2026 19:45
Per @abh1sar's review at NASBackupProvider.java:251, the "find the latest
backup with a bitmap" heuristic in decideChain breaks after a restore — the
restored disk image has no QEMU dirty-bitmap, so taking the last-backed-up
row as parent for the next incremental produces a chain whose first link
references blocks that no longer exist.

Track the active checkpoint per-VM in vm_instance_details under the new
nas.active_checkpoint_id key:

* takeBackup writes it after every successful backup (cleared on the
  stopped-VM offline path where no bitmap is created).
* restoreVMFromBackup / restoreBackupToVM clear it on success, so the next
  backup is automatically a full and starts a fresh chain.
* decideChain reads it first: null => full; non-null => find the matching
  bitmap_name on a BackedUp backup row; if none, fall back to full.

The new VM_ACTIVE_CHECKPOINT_ID key lives next to the existing chain keys
in NASBackupChainKeys (it's a vm_instance_details detail, not a backup
detail, so no schema migration is needed).
…tals

Per @abh1sar's review at NASBackupProvider.java:696, the previous
rebase-the-child / chain-repair logic is hard to reason about and not
consistent with how CloudStack handles incremental-snapshot deletion
elsewhere. Replace it with the same pattern used by
DefaultSnapshotStrategy#deleteSnapshotChain:

* Delete request for backup B1:
  * If B1 has live children, mark B1 with nas.delete_pending=true and
    return. The on-NAS file and DB row stay until the last descendant
    is gone.
  * If B1 has no live children, physically delete its qcow2 directory
    and DB row, then walk up the chain — for each ancestor that is
    delete-pending and now childless, sweep it too.
* forced=true cascades the entire subtree leaf-first in one call (for
  operators who want a chain gone immediately).

Drops the rebase code path entirely from NASBackupProvider (the old
ChainRepairPlan / RebaseStep / PositionShift inner classes and the
RebaseBackupCommand dispatch). The RebaseBackupCommand class and its
libvirt wrapper remain in tree as dead code for now — separate cleanup.

Tests:
* New deleteWithLiveChildMarksDeletePendingAndPreservesFile verifies the
  tombstone-on-busy-parent path: no agent traffic, no DB removal, a
  DELETE_PENDING detail is persisted.
* New deletingLeafSweepsUpDeletePendingParent verifies the sweep: deleting
  the last child cascades up and physically deletes the tombstoned parent.
* Existing testDeleteBackup still passes — without a CHAIN_ID detail the
  backup hits the no-chain fast path.
@harikrishna-patnala
Copy link
Copy Markdown
Member

@jmsperu can you please fix one unnecessary stubbing

_Error:  Errors: 
Error:    NASBackupProviderTest.unnecessary Mockito stubbings » UnnecessaryStubbing 
Unn...
[INFO] 
Error:  Tests run: 14, Failures: 0, Errors: 1, Skipped: 0_

jmsperu added 4 commits May 22, 2026 23:37
… test

The deleteWithLiveChildMarksDeletePendingAndPreservesFile test asserts no
agent traffic (verified via agentManager.send is never called) because the
live-child branch only writes a DELETE_PENDING tombstone in backup_details.
host.getId() is therefore never invoked, so stubbing it triggered an
UnnecessaryStubbingException under MockitoJUnitRunner's strict mode.

Fixes the test error introduced in b7b74c4 and unblocks CI for apache#13074.
Address abh1sar's review at NASBackupProvider.java:90 — add an explicit
boolean ConfigKey `nas.backup.incremental.enabled` (zone-scoped, dynamic,
default true) so operators have a single toggle to disable incrementals
without having to set nas.backup.full.every=1.

decideChain() consults the flag first, before any other logic. When
disabled, every backup is taken as a fresh full anchor regardless of VM
state, chain length, or active checkpoint. Existing chain metadata stays
on each backup row, so already-taken incrementals remain restorable; the
next backup with the flag back on starts a brand-new chain.

Added decideChainReturnsFullWhenIncrementalDisabled test covering the
override path via ReflectionTestUtils on the @SPY provider.
Address abh1sar's review at nasbackup.sh:513 — before this change, a backup
taken while the VM was stopped left no parent bitmap on the qcow2, so the
NEXT backup (taken after the VM was started) had to fall back to full
regardless of mode. abh1sar's suggested fix: use qemu-img bitmap --add to
seed a persistent bitmap on the qcow2 during the offline path so the next
backup can find a parent.

backup_stopped_vm() now runs qemu-img bitmap --add per file-backed disk
after the qemu-img convert succeeds, using the bitmap name the orchestrator
passed via --bitmap-new. Persistent + enabled is the qemu-img default since
QEMU 5.0 (which our minimum-version check enforces anyway). RBD/LINSTOR
sources are skipped because qemu-img bitmap is not the correct primitive
there. On any qemu-img bitmap failure we log a warning and continue — the
backup itself succeeds, just the next one falls back to full.

When bitmap-seeding succeeds the function now emits BITMAP_CREATED= so the
orchestrator can persist it as the VM's active_checkpoint_id (same path as
online backups). decideChain then picks this backup as the parent on the
next attempt.

Also taught the recreate-checkpoint path on the online side to tolerate a
pre-seeded bitmap. If libvirt's checkpoint-create refuses because the named
bitmap already exists on the qcow2 (older libvirt — pre 7.2 reuse
semantics), we remove the orphan bitmap and retry. Either path ends with
libvirt's view in sync with the qcow2.
Address abh1sar's review at NASBackupProvider.java:340 — backup files on the
NAS are named after each volume's own UUID (root.<uuid>.qcow2 and
datadisk.<uuid>.qcow2), so a single parentPath shared across every disk in
the loop would have rebased every data disk's new qcow2 onto the *root*
parent file. Mirroring how RestoreBackupCommand passes a list of per-volume
backup files, TakeBackupCommand now carries a parentPaths list (one per VM
volume in deviceId order) and the script rebases each disk against the
matching entry.

Wire changes:
* TakeBackupCommand: replace parentPath:String with parentPaths:List<String>
* LibvirtTakeBackupCommandWrapper: pass --parent-paths <csv> (was --parent-path)
* nasbackup.sh: parse --parent-paths into an array, index by disk position
  in the rebase loop, fail loudly if the list is shorter than DISK_PATHS
* NASBackupProvider: new composeParentBackupPaths(parent, vmId) reads the
  parent's backedUpVolumes list (uuid order = deviceId order at parent time)
  and emits the correctly-named file per position. Returns null when the
  current VM's volume count differs from the parent — decideChain then
  falls back to a fresh full rather than risk a misaligned rebase.
* ChainDecision: parentPath:String -> parentPaths:List<String>

All 14 unit tests still pass. The bash script's --parent-path arg is
renamed to --parent-paths (no orchestrator outside this PR ever populated
the old single-path arg, so no compatibility shim needed).
@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 22, 2026

Thanks @harikrishna-patnala @abh1sar — pushed 4 commits addressing the feedback:

  • 9f4d61f31c — drop unused host.getId() stub in deleteWithLiveChildMarksDeletePendingAndPreservesFile (fixes the UnnecessaryStubbingException Harikrishna flagged)
  • 691931de23 — add nas.backup.incremental.enabled ConfigKey (abh1sar review thread @ line 90)
  • 0bdcdb1484 — pre-seed bitmap on stopped-VM backup via qemu-img bitmap --add so the next backup can be incremental (abh1sar review thread @ nasbackup.sh:513)
  • 5be1910ff0 — per-volume parent paths: TakeBackupCommand.parentPath -> parentPaths: List<String>, script rebases each disk onto its own parent file, falls back to fresh full when current/parent volume counts don't match (abh1sar review thread @ line 340)

All 14 unit tests pass locally. Could one of you trigger @blueorangutan test once GHA goes green? Happy to address any further review comments.

@DaanHoogland
Copy link
Copy Markdown
Contributor

@jmsperu , will you answer the rest of co-pilot’s remarks?

@DaanHoogland
Copy link
Copy Markdown
Contributor

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18035

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

Comment on lines +360 to +362
if [[ "$MODE" == "incremental" ]]; then
echo "INCREMENTAL_FALLBACK=full (VM stopped — incremental requires running VM)" >&2
fi
Comment on lines +320 to +322
finally:
self._set_full_every(10)
self.backup_offering.removeOffering(self.apiclient, self.vm.id)
Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
jmsperu added 2 commits May 26, 2026 20:53
* nasbackup.sh: emit INCREMENTAL_FALLBACK= and BITMAP_CREATED= on stdout
  (Script.executePipedCommands in LibvirtTakeBackupCommandWrapper reads
  stdout only, not stderr — so these markers were never parsed). Fixes
  stopped-VM fallback recording and bitmap persistence.

* nasbackup.sh: fix `2>&1 > /dev/null` → `> /dev/null 2>&1` (the original
  ordering leaves stderr pointed at the now-redirected stdout, dropping
  virsh error details).

* nasbackup.sh: capture qemu-agent thaw response correctly (was being
  swallowed by `> /dev/null`).

* nasbackup.sh: usage text now reflects the implemented --parent-paths
  (plural, comma-separated) flag instead of misleading --parent-path.

* NASBackupProvider: default nas.backup.incremental.enabled to false so
  existing zones keep legacy full-only behavior on upgrade; opt in
  per-zone when ready. Description updated to make this explicit.

* NASBackupProvider.composeParentBackupPaths: sort current VM volumes by
  deviceId before positional comparison with parent's backed-up volumes,
  and verify per-position UUID alignment. If any disk was detached and a
  different one attached in its place, return null and force a full
  instead of silently rebasing onto the wrong parent file.

* NASBackupChainKeys.TYPE: doc now explicitly notes the lowercase casing
  difference from Backup.Status uppercase enum values.

* test_backup_recovery_nas.py: capture original zone-scoped
  nas.backup.full.every via Configurations.list at the start of each
  incremental test and restore that exact value in finally, instead of
  hardcoding 10 (which leaked into shared environments).
Two more Copilot review items on PR apache#13074:

* persistChainMetadata: only store nas.bitmap_name when the agent
  confirms it via BITMAP_CREATED=. Previously fell back to
  decision.bitmapNew if the agent didn't emit it, which could anchor
  the next incremental on a bitmap that doesn't exist (stopped-VM
  pre-seed failure was the trigger Copilot flagged, but the same risk
  applies any time the marker is lost). Leaving it empty correctly
  forces the next backup to a fresh full instead.

* nasbackup.sh: sanity_checks (QEMU >= 4.2 / libvirt >= 7.2) is only
  needed for the incremental backup-begin path. Gate it to
  `OP=backup && -n MODE` so legacy full-only callers and delete /
  stats / rebase operations still work on older host versions.
@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 26, 2026

Thanks for the thorough review @copilot-pull-request-reviewer (and @DaanHoogland for the nudge 🙏). Working through the comments — here's what's in the latest pushes (3e2e144fc6 and 4f3375a918):

Agent ↔ management server signalling (the most impactful — these would silently break the feature):

  • INCREMENTAL_FALLBACK= and BITMAP_CREATED= (both the stopped-VM and the warn paths) now emit on stdout, matching what Script.executePipedCommands actually captures in LibvirtTakeBackupCommandWrapper. The original stderr emission meant management never saw these markers.
  • Fixed the redirection order in virsh backup-begin and qemu-agent-command calls — 2>&1 > /dev/null was leaking stderr and swallowing the agent's thaw response. Now > /dev/null 2>&1 for the discard-everything cases, and the thaw response is captured properly.

Per-volume parent paths + identity check (mostly already done in 5be1910, plus the alignment fix):

  • composeParentBackupPaths now sorts current VM volumes by deviceId before positional comparison, and verifies the UUID at each position matches the parent's recorded volume UUID. If a disk was detached and a different one attached in its place, the chain can't be safely continued — returns null so the caller forces a full instead of silently rebasing onto the wrong parent file.

Defaults & docs:

  • nas.backup.incremental.enabled now defaults to false. Existing zones keep legacy full-only behavior on upgrade; opt in per-zone when ready to use chains. Description in the ConfigKey updated to be explicit about this.
  • NASBackupChainKeys.TYPE Javadoc now documents the lowercase-vs-uppercase distinction from Backup.Status instead of implying they match.
  • nasbackup.sh usage text now shows --parent-paths (plural, comma-separated) to match the implemented flag.
  • sanity_checks (QEMU >= 4.2 / libvirt >= 7.2) is now gated to OP=backup && -n MODE so legacy full-only callers and the delete / stats / rebase ops still work on older host versions.

Stopped-VM bitmap persistence:

  • persistChainMetadata only stores nas.bitmap_name when the agent confirms it via BITMAP_CREATED=. Previously fell back to the requested bitmap, which could anchor the next incremental on a bitmap that doesn't exist if pre-seeding failed.

Tests:

  • The incremental tests now capture the original zone-scoped nas.backup.full.every via Configurations.list in each test method and restore that exact value in finally, instead of hard-resetting to 10 (which leaked into shared environments).

Already addressed in earlier commits, noting for completeness: per-volume parent paths refactor (5be1910), chain repair per disk (b7b74c4), stopped-VM bitmap pre-seed (0bdcdb1), incremental.enabled master switch (691931d), RFC moved out of repo (9764025).

Will keep an eye on CI.

@DaanHoogland
Copy link
Copy Markdown
Contributor

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

After defaulting nas.backup.incremental.enabled to false (per the
PR-stated opt-in promise, addressed in earlier review-fix commit),
decideChainReturnsFullWhenVmHasNoActiveCheckpoint started failing
the Mockito strict-stubbing check — the new master-switch gate
short-circuits decideChain before vm.getState() and the
VM_ACTIVE_CHECKPOINT_ID lookup are reached, leaving three stubs
flagged as "unnecessary".

Mirror the same ReflectionTestUtils.setField pattern that
decideChainReturnsFullWhenIncrementalDisabled uses, but with
value="true" so we get past the master-switch gate and actually
exercise the active-checkpoint-id branch this test is meant to
cover.
@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 18046

@abh1sar
Copy link
Copy Markdown
Contributor

abh1sar commented May 27, 2026

Hi @jmsperu
The Test Plan needs to be expanded for such a complicated PR. Please update the Test Plan and also update the testing results when done.

@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 27, 2026

On the failed SL-JID 18046 packaging (✖ el8/el9/debian/suse15) — the diff between that and the green SL-JID 18035 is purely Java + bash + Python (touches only core/, plugins/backup/nas/, plugins/hypervisors/kvm/, scripts/vm/hypervisor/kvm/nasbackup.sh, and the smoke test). No packaging files (packaging/, debian/, .spec, pom.xml) were modified. GitHub Actions build is green across all matrix entries, nasbackup.sh passes bash -n, and the script isn't explicitly enumerated in any RPM/deb spec.

That points to a Jenkins-side transient (build agent / mirror / disk) rather than a code regression. @DaanHoogland or anyone with rights — could we get a re-run of @blueorangutan package when convenient? Happy to also bisect locally if it fails again with the same signature.

@blueorangutan
Copy link
Copy Markdown

@jmsperu a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18053

@DaanHoogland
Copy link
Copy Markdown
Contributor

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC] Incremental NAS Backup Support for KVM Hypervisor

10 participants