The CUDA checkpoint test availability guard currently masks a class of cuda-bindings regressions as unsupported-environment skips.
This mattered for the CUDA 13.3.0 CUcheckpointRestoreArgs regression fixed by PR #2144. In that release, cuda.bindings.driver.CUcheckpointRestoreArgs was missing because a CUDA 13.3 header layout change caused generation to silently omit the restore-argument binding. cuda.core.checkpoint._get_driver() correctly detects this as a missing required binding attribute and raises:
CUDA checkpointing requires cuda.bindings with CUDA checkpoint API support. Missing: CUcheckpointRestoreArgs
However, the checkpoint tests use _checkpoint_available() as a broad skip gate. That helper catches any RuntimeError from checkpoint._get_driver() and returns False, so a missing required binding symbol is treated the same as a truly unsupported driver/platform:
- Skip gate (cuda-python/cuda_core/tests/test_checkpoint.py, lines 31 to 43 in cc50515):

      def _checkpoint_available():
          """Return True if the checkpoint API is usable on this system."""
          try:
              checkpoint._get_driver()
              return True
          except RuntimeError:
              return False


      needs_checkpoint = pytest.mark.skipif(
          sys.platform != "linux" or not _checkpoint_available(),
          reason="CUDA checkpoint API requires Linux and a supported driver/bindings",
      )

- Checkpoint tests guarded by that skip marker (cuda-python/cuda_core/tests/test_checkpoint.py, lines 397 to 423 in cc50515):

      @needs_checkpoint
      class TestCheckpointLifecycle:
          def test_initial_state_is_running(self):
              _run_checkpoint_scenario_or_skip("initial_state_is_running")

          def test_restore_thread_id_is_positive(self):
              _run_checkpoint_scenario_or_skip("restore_thread_id_is_positive")

          def test_lock_unlock(self):
              _run_checkpoint_scenario_or_skip("lock_unlock")

          def test_lock_default_timeout(self):
              """lock() with the default timeout_ms=0 (no timeout)."""
              _run_checkpoint_scenario_or_skip("lock_default_timeout")

          def test_lock_with_timeout(self):
              _run_checkpoint_scenario_or_skip("lock_with_timeout")

          def test_full_cycle_no_migration(self):
              """lock -> checkpoint -> restore -> unlock, verify state at each step."""
              _run_checkpoint_scenario_or_skip("full_cycle_no_migration")


      # -- GPU migration (>= 2 same-chip GPUs, real driver) ---------------------


      @needs_checkpoint

- Missing required binding detection in checkpoint._get_driver() (cuda-python/cuda_core/cuda/core/checkpoint.py, lines 133 to 149 in cc50515):

      def _get_driver():
          global _driver_capability_checked
          if _driver_capability_checked:
              return _driver

          binding_ver = _binding_version()
          if not _binding_version_supports_checkpoint(binding_ver):
              raise RuntimeError(
                  "CUDA checkpointing requires cuda.bindings with CUDA checkpoint API support. "
                  f"Found cuda.bindings {'.'.join(str(part) for part in binding_ver[:3])}."
              )

          missing = [name for name in _REQUIRED_BINDING_ATTRS if not hasattr(_driver, name)]
          if missing:
              raise RuntimeError(
                  f"CUDA checkpointing requires cuda.bindings with CUDA checkpoint API support. Missing: {', '.join(missing)}"
              )

As a result, the tests that would have exercised checkpoint restore were skipped instead of failing, even though the installed cuda-bindings version was expected to provide checkpoint API support.
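The masking can be reproduced without a GPU. The following is a minimal sketch, not the project's code: the two `_get_driver_*` functions are hypothetical stand-ins for the two failure modes of checkpoint._get_driver(), and the version number in the first message is illustrative only. Both failures collapse into the same `False`, which is what turns a binding regression into a skip.

```python
def _get_driver_unsupported():
    # Hypothetical stand-in: the binding version predates checkpoint support.
    # The version number here is illustrative, not taken from the issue.
    raise RuntimeError(
        "CUDA checkpointing requires cuda.bindings with CUDA checkpoint API support. "
        "Found cuda.bindings 12.0.0."
    )


def _get_driver_missing_symbol():
    # Hypothetical stand-in: the version should support checkpointing,
    # but a required symbol was dropped during generation.
    raise RuntimeError(
        "CUDA checkpointing requires cuda.bindings with CUDA checkpoint API support. "
        "Missing: CUcheckpointRestoreArgs"
    )


def checkpoint_available(get_driver):
    """Same shape as the skip gate: any RuntimeError means 'not available'."""
    try:
        get_driver()
        return True
    except RuntimeError:
        return False


results = {
    "unsupported": checkpoint_available(_get_driver_unsupported),
    "missing_symbol": checkpoint_available(_get_driver_missing_symbol),
}
print(results)  # {'unsupported': False, 'missing_symbol': False}
```

Because the gate returns the same value in both cases, pytest.mark.skipif cannot tell a regression from an unsupported environment.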
Suggested fix:
- Keep skipping true unsupported environments, such as non-Linux platforms, drivers older than checkpoint API support, or hardware/driver combinations where runtime checkpoint scenarios cannot complete.
- Fail when checkpoint._get_driver() reports missing required cuda-bindings attributes for a cuda-bindings version that is expected to support checkpointing.
- Consider adding a lower-level cuda-bindings API completeness test that asserts the required checkpoint symbols are present when the binding version advertises checkpoint support. This would catch omissions before cuda-core runtime tests depend on them.
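One possible shape for the first two points, plus a helper that a completeness test could build on, is sketched below. This is an illustration under stated assumptions rather than a proposed patch: it tells the two cases apart by the "Missing:" text from the error message quoted above (a dedicated exception type would be more robust), `get_driver` is passed in so the sketch runs without cuda-core installed, and the attribute names in the usage example are only illustrative.

```python
import types

# Assumption: the marker is taken from the error message quoted in this issue.
MISSING_MARKER = "Missing:"


def checkpoint_available(get_driver):
    """Return False for unsupported environments; propagate missing-binding errors."""
    try:
        get_driver()
    except RuntimeError as exc:
        if MISSING_MARKER in str(exc):
            # The binding version claims checkpoint support but lacks a
            # required symbol: let the error surface instead of skipping.
            raise
        return False
    return True


def missing_binding_attrs(driver_module, required):
    """List the required attribute names that the driver module does not expose."""
    return [name for name in required if not hasattr(driver_module, name)]


# Usage with a fake driver module; the attribute names are illustrative only.
fake_driver = types.SimpleNamespace(cuCheckpointProcessLock=object())
required = ("cuCheckpointProcessLock", "CUcheckpointRestoreArgs")
print(missing_binding_attrs(fake_driver, required))  # ['CUcheckpointRestoreArgs']
```

If a gate like this is evaluated inside pytest.mark.skipif at import time, the propagated error would show up as a collection error rather than as a failure of an individual test. That is still visible in CI, but a fixture that fails explicitly may report it more clearly.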
Related:
- CUcheckpointRestoreArgs generation/layout issue.