Checkpoint tests should fail, not skip, when required cuda.bindings symbols are missing #2149

@rwgk

The CUDA checkpoint test availability guard currently masks a class of cuda-bindings regressions as unsupported-environment skips.

This mattered for the CUDA 13.3.0 CUcheckpointRestoreArgs regression fixed by PR #2144. In that release, cuda.bindings.driver.CUcheckpointRestoreArgs was missing because a CUDA 13.3 header layout change caused generation to silently omit the restore-argument binding. cuda.core.checkpoint._get_driver() correctly detects this as a missing required binding attribute and raises:

CUDA checkpointing requires cuda.bindings with CUDA checkpoint API support. Missing: CUcheckpointRestoreArgs

However, the checkpoint tests use _checkpoint_available() as a broad skip gate. That helper catches any RuntimeError from checkpoint._get_driver() and returns False, so a missing required binding symbol is treated the same as a truly unsupported driver/platform:

  • Skip gate:
    def _checkpoint_available():
        """Return True if the checkpoint API is usable on this system."""
        try:
            checkpoint._get_driver()
            return True
        except RuntimeError:
            return False

    needs_checkpoint = pytest.mark.skipif(
        sys.platform != "linux" or not _checkpoint_available(),
        reason="CUDA checkpoint API requires Linux and a supported driver/bindings",
    )
  • Checkpoint tests guarded by that skip marker:
    @needs_checkpoint
    class TestCheckpointLifecycle:
        def test_initial_state_is_running(self):
            _run_checkpoint_scenario_or_skip("initial_state_is_running")

        def test_restore_thread_id_is_positive(self):
            _run_checkpoint_scenario_or_skip("restore_thread_id_is_positive")

        def test_lock_unlock(self):
            _run_checkpoint_scenario_or_skip("lock_unlock")

        def test_lock_default_timeout(self):
            """lock() with the default timeout_ms=0 (no timeout)."""
            _run_checkpoint_scenario_or_skip("lock_default_timeout")

        def test_lock_with_timeout(self):
            _run_checkpoint_scenario_or_skip("lock_with_timeout")

        def test_full_cycle_no_migration(self):
            """lock -> checkpoint -> restore -> unlock, verify state at each step."""
            _run_checkpoint_scenario_or_skip("full_cycle_no_migration")

    # -- GPU migration (>= 2 same-chip GPUs, real driver) ---------------------
    @needs_checkpoint
  • Missing required binding detection in checkpoint._get_driver():
    def _get_driver():
        global _driver_capability_checked
        if _driver_capability_checked:
            return _driver
        binding_ver = _binding_version()
        if not _binding_version_supports_checkpoint(binding_ver):
            raise RuntimeError(
                "CUDA checkpointing requires cuda.bindings with CUDA checkpoint API support. "
                f"Found cuda.bindings {'.'.join(str(part) for part in binding_ver[:3])}."
            )
        missing = [name for name in _REQUIRED_BINDING_ATTRS if not hasattr(_driver, name)]
        if missing:
            raise RuntimeError(
                f"CUDA checkpointing requires cuda.bindings with CUDA checkpoint API support. Missing: {', '.join(missing)}"
            )

As a result, the tests that would have exercised checkpoint restore were skipped instead of failing, even though the installed cuda-bindings version was expected to provide checkpoint API support.
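The masking can be reproduced without a GPU or cuda.bindings installed. The sketch below is illustrative only: `get_driver_stub` and `checkpoint_available` are hypothetical stand-ins for `checkpoint._get_driver()` and `_checkpoint_available()`, written to mirror the shape of the snippets quoted above rather than copied from the repository.

```python
def get_driver_stub(missing):
    """Stand-in for checkpoint._get_driver(): raises when required attrs are missing."""
    if missing:
        raise RuntimeError(
            "CUDA checkpointing requires cuda.bindings with CUDA checkpoint API support. "
            f"Missing: {', '.join(missing)}"
        )
    return object()


def checkpoint_available(get_driver):
    """Same shape as the skip gate above: any RuntimeError becomes False."""
    try:
        get_driver()
        return True
    except RuntimeError:
        return False


# A binding regression looks exactly like an unsupported environment to the gate.
regressed = checkpoint_available(lambda: get_driver_stub(["CUcheckpointRestoreArgs"]))
print(regressed)  # False, so a skipif built on this value skips every guarded test
```

Because the gate discards the exception message, nothing downstream can tell "bindings are incomplete" apart from "driver is too old".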

Suggested fix:

  • Keep skipping true unsupported environments, such as non-Linux platforms, drivers older than checkpoint API support, or hardware/driver combinations where runtime checkpoint scenarios cannot complete.
  • Fail when checkpoint._get_driver() reports missing required cuda-bindings attributes for a cuda-bindings version that is expected to support checkpointing.
  • Consider adding a lower-level cuda-bindings API completeness test that asserts the required checkpoint symbols are present when the binding version advertises checkpoint support. This would catch omissions before cuda-core runtime tests depend on them.
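One possible shape for the first two suggestions, plus a helper for the third, is sketched below. This is a minimal sketch under stated assumptions, not the project's implementation: `checkpoint_skip_reason` and `missing_binding_attrs` are hypothetical names, and the classification relies on the `Missing:` text in the error message quoted in this issue. A dedicated exception subclass raised by `_get_driver()` would likely be more robust than string matching.

```python
import sys
from types import SimpleNamespace

# Assumption: the marker matches the "Missing: ..." message quoted in this issue.
_MISSING_MARKER = "Missing:"


def checkpoint_skip_reason(get_driver, platform=sys.platform):
    """Return a skip reason for unsupported environments, or None to run the tests.

    A RuntimeError that reports missing binding attributes is re-raised, so the
    test run fails loudly instead of skipping.
    """
    if platform != "linux":
        return "CUDA checkpoint API requires Linux"
    try:
        get_driver()
    except RuntimeError as exc:
        if _MISSING_MARKER in str(exc):
            raise  # incomplete bindings: a regression, not an unsupported environment
        return f"CUDA checkpoint API unavailable: {exc}"
    return None


def missing_binding_attrs(driver_module, required_names):
    """Return the required attribute names that driver_module does not define."""
    return [name for name in required_names if not hasattr(driver_module, name)]


# Usage with stand-in objects (no CUDA installation needed):
fake_driver = SimpleNamespace(CUcheckpointLockArgs=object())
print(missing_binding_attrs(fake_driver, ["CUcheckpointLockArgs", "CUcheckpointRestoreArgs"]))
```

In a real test module, `checkpoint_skip_reason(checkpoint._get_driver)` could feed `pytest.mark.skipif` (skip when the result is not `None`), and a cuda-bindings completeness test could assert that `missing_binding_attrs(...)` returns an empty list whenever the binding version advertises checkpoint support.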

Related:

Labels: P0, bug, cuda.core
