
Make checkpoint tests fail on missing required binding symbols #2150

Draft
rwgk wants to merge 2 commits into NVIDIA:main from rwgk:CUcheckpointRestoreArgs_error_masking_skip


@rwgk rwgk commented May 28, 2026

Closes #2149

Summary

  • Tighten the cuda.core checkpoint test availability guard so it still skips true unsupported environments, but no longer skips missing required cuda.bindings symbols.
  • Add a focused cuda.bindings completeness test for the checkpoint symbols required by cuda.core.checkpoint, including CUcheckpointRestoreArgs.
  • Cover the skip-gate behavior directly so old cuda-bindings versions and unsupported installed drivers remain skippable, while missing required bindings fail.
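
The completeness check described in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the test added by this PR: `missing_symbols`, `REQUIRED_SYMBOLS`, and `fake_driver` are hypothetical names, the symbol list is an illustrative subset, and a `SimpleNamespace` stands in for the real bindings module so the snippet runs without CUDA installed.

```python
from types import SimpleNamespace

# Illustrative subset only (assumption); the real list of required names is
# defined by cuda.core.checkpoint and is not reproduced here.
REQUIRED_SYMBOLS = ("CUcheckpointLockArgs", "CUcheckpointRestoreArgs")


def missing_symbols(module, required=REQUIRED_SYMBOLS):
    """Return the required names that `module` does not define, in order."""
    return [name for name in required if not hasattr(module, name)]


# Hypothetical stand-in for a bindings module whose generator dropped one struct.
fake_driver = SimpleNamespace(CUcheckpointLockArgs=object)

missing = missing_symbols(fake_driver)
print(missing)  # ['CUcheckpointRestoreArgs']
```

A test built on this idea would assert that the returned list is empty, so any dropped symbol shows up by name in the assertion message.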

Context

This is a follow-up to #2144 and fixes the test coverage gap tracked in #2149.

The CUDA 13.3.0 CUcheckpointRestoreArgs generation issue fixed by #2144 could pass the existing test flow because the cuda.core checkpoint tests treated every RuntimeError raised by checkpoint._get_driver() as a sign of an unsupported environment and skipped. That included the missing-symbol error:

CUDA checkpointing requires cuda.bindings with CUDA checkpoint API support. Missing: CUcheckpointRestoreArgs

This PR keeps the intended skips for genuinely unsupported configurations, but lets missing required binding attributes propagate as test failures.
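
One way such a guard could separate the two cases is sketched below. It assumes the guard can tell a missing-symbol error apart by the "Missing:" text quoted above; the actual PR may use a different mechanism. `checkpoint_available`, `MISSING_MARKER`, and the three fake driver functions are hypothetical names used only for illustration.

```python
# Assumption: the marker is taken from the error text quoted above. The real
# guard may classify errors differently.
MISSING_MARKER = "Missing:"


def checkpoint_available(get_driver):
    """Return False for unsupported environments; re-raise on missing bindings."""
    try:
        get_driver()
    except RuntimeError as exc:
        if MISSING_MARKER in str(exc):
            raise  # required binding symbol is absent: surface as a failure
        return False  # genuinely unsupported environment: caller may skip
    return True


def driver_ok():
    return object()


def driver_unsupported():
    # Hypothetical message for an unsupported installed driver.
    raise RuntimeError("installed driver does not support checkpointing")


def driver_missing_symbol():
    raise RuntimeError(
        "CUDA checkpointing requires cuda.bindings with CUDA checkpoint API "
        "support. Missing: CUcheckpointRestoreArgs"
    )


print(checkpoint_available(driver_ok))           # True
print(checkpoint_available(driver_unsupported))  # False
try:
    checkpoint_available(driver_missing_symbol)
except RuntimeError as exc:
    print("propagated:", exc)
```

The previous behavior corresponds to returning False from every `except RuntimeError` branch, which is what let the missing symbol hide behind a skip.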

Validation

On the pre-#2144 base, these focused tests now expose the breakage:

pytest cuda_core/tests/test_checkpoint.py

fails during collection with:

RuntimeError: CUDA checkpointing requires cuda.bindings with CUDA checkpoint API support. Missing: CUcheckpointRestoreArgs

and:

pytest cuda_bindings/tests/test_cuda.py::test_cuCheckpoint_required_bindings_present

fails with:

missing == ['CUcheckpointRestoreArgs']

After PR #2144 lands and this branch is rebased onto it, the focused checkpoint tests should pass and demonstrate that the original generation issue is fixed while the error-masking skip is closed.

Related

Ensure checkpoint tests distinguish missing required cuda.bindings symbols from genuinely unsupported environments.
@rwgk rwgk added this to the cuda.bindings next milestone May 28, 2026
@rwgk rwgk self-assigned this May 28, 2026
@rwgk rwgk added the bug (Something isn't working), P0 (High priority - Must do!), cuda.bindings (Everything related to the cuda.bindings module), and cuda.core (Everything related to the cuda.core module) labels May 28, 2026

copy-pr-bot Bot commented May 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


rwgk commented May 28, 2026

/ok to test


rwgk commented May 28, 2026

PR 2150 first CI failure analysis

Workflow: https://github.com/NVIDIA/cuda-python/actions/runs/26591678170

Commit: 293258d

Workflow result: failed.

High-level result

The build and non-test infrastructure mostly passed:

  • Build jobs passed.
  • Docs passed.
  • pre-commit.ci passed.
  • The final Check job status job failed because matrix test jobs failed.

The failures are concentrated in test matrix jobs. There were 37 failed test jobs plus the final status aggregation job.

Failure counts by CUDA version:

  • CUDA 13.3.0: 24 failed test jobs.
  • CUDA 12.9.1: 13 failed test jobs.
  • CUDA 13.0.2: no failures observed in the failed-job list.

Failure counts by platform:

  • linux-64: 17 failed test jobs.
  • linux-aarch64: 12 failed test jobs.
  • win-64: 8 failed test jobs.
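
As a quick consistency check, both breakdowns above should add up to the 37 failed test jobs. The snippet simply re-adds the counts reported in this comment; the variable names are illustrative.

```python
# Counts copied from the two breakdowns above.
by_cuda = {"13.3.0": 24, "12.9.1": 13, "13.0.2": 0}
by_platform = {"linux-64": 17, "linux-aarch64": 12, "win-64": 8}

total_by_cuda = sum(by_cuda.values())
total_by_platform = sum(by_platform.values())

print(total_by_cuda, total_by_platform)  # 37 37
```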

Failure mode 1: CUDA 13.3 missing CUcheckpointRestoreArgs

This is the expected proof-of-coverage failure for the issue fixed by PR #2144.

Representative failed jobs:

Observed failure:

tests/test_cuda.py::test_cuCheckpoint_required_bindings_present FAILED
E       AssertionError: assert ['CUcheckpointRestoreArgs'] == []
E         Left contains one more item: 'CUcheckpointRestoreArgs'

This shows the new cuda_bindings/tests/test_cuda.py::test_cuCheckpoint_required_bindings_present test catches the missing generated binding directly. This is exactly the failure PR #2150 was intended to expose before PR #2144 is merged/rebased in.

Failure mode 2: Linux CUDA 12.9 missing CUcheckpointGpuPair

This is a separate checkpoint binding gap surfaced by tightening the cuda.core checkpoint availability guard.

Representative failed jobs:

Observed failure:

ERROR collecting tests/test_checkpoint.py
RuntimeError: CUDA checkpointing requires cuda.bindings with CUDA checkpoint API support. Missing: CUcheckpointGpuPair
binding_ver = (12, 9, 7)
missing    = ['CUcheckpointGpuPair']

This happens during cuda_core/tests/test_checkpoint.py collection on Linux. Windows does not hit this path because the checkpoint tests are platform-skipped before _checkpoint_available() is evaluated.

This is not the CUDA 13.3 CUcheckpointRestoreArgs issue. It means the tightened guard also exposes that CUDA 12.9 Linux bindings report enough checkpoint API surface to reach cuda.core.checkpoint._get_driver(), but still lack CUcheckpointGpuPair, which cuda.core considers required.
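
The platform difference comes down to evaluation order. The sketch below assumes the skip condition is a platform check combined with the availability probe by short-circuit `or`; the real test module may express it differently, and `should_skip` and `raising_probe` are hypothetical names.

```python
def should_skip(platform, checkpoint_available):
    """Skip on Windows without probing; otherwise defer to the probe."""
    # `or` short-circuits, so the probe never runs when the platform check
    # already decides the outcome.
    return platform == "win32" or not checkpoint_available()


calls = []


def raising_probe():
    # Stand-in for an availability probe that hits a missing required symbol.
    calls.append("probed")
    raise RuntimeError(
        "CUDA checkpointing requires cuda.bindings with CUDA checkpoint API "
        "support. Missing: CUcheckpointGpuPair"
    )


windows_result = should_skip("win32", raising_probe)
print(windows_result, calls)  # True []

try:
    should_skip("linux", raising_probe)
    linux_error = None
except RuntimeError as exc:
    linux_error = str(exc)
print(calls)  # ['probed']
```

On Linux the probe runs and its error propagates, which matches the collection-time failure shown above.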

Interpretation

CI behaved as intended for the main goal: PR #2150 converts the previous skip-masked CUDA 13.3 binding regression into clear test failures.

The run also identifies a follow-up decision for CUDA 12.9:

  • If CUDA 12.9 should support the cuda.core.checkpoint mapping helpers, then CUcheckpointGpuPair needs to be present in the CUDA 12.9 bindings.
  • If CUDA 12.9 should not support that surface, the cuda.core availability check needs to classify this specific older-binding condition as skippable rather than as a binding regression.

After PR #2144 is merged and PR #2150 is rebased onto it, the CUDA 13.3 CUcheckpointRestoreArgs failures should disappear. The CUDA 12.9 CUcheckpointGpuPair failures may remain unless they are handled separately.


rwgk commented May 28, 2026

I looked into the CUDA 12.9 failures from the first PR #2150 CI run.

The short version: these failures look separate from the CUDA 13.3 CUcheckpointRestoreArgs regression that PR #2144 fixes.

In /usr/local/cuda-12.9, I do not see CUcheckpointGpuPair at all. Even

grep -r -i GpuPair /usr/local/cuda-12.9

returns no matches.

The CUDA 12.9 cuda.h checkpoint restore args are still the older reserved-only layout:

typedef struct CUcheckpointRestoreArgs_st {
    cuuint64_t reserved[8]; /**< Reserved for future use, must be zeroed */
} CUcheckpointRestoreArgs;

That matches the CUDA 12.9 CI failure mode from https://github.com/NVIDIA/cuda-python/actions/runs/26591678170: Linux CUDA 12.9 jobs now fail during cuda_core/tests/test_checkpoint.py collection with:

RuntimeError: CUDA checkpointing requires cuda.bindings with CUDA checkpoint API support. Missing: CUcheckpointGpuPair
binding_ver = (12, 9, 7)
missing    = ['CUcheckpointGpuPair']

So my current interpretation is:

  • The CUDA 13.3 failures are expected and useful: PR #2150 proves that a missing CUcheckpointRestoreArgs now fails loudly instead of being skip-masked.
  • The CUDA 12.9 failures are a separate compatibility issue surfaced by the tighter guard.
  • Since CUcheckpointGpuPair does not appear to exist in the CUDA 12.9 headers, this is not a missing 12.9 Python binding. It is more likely that cuda.core.checkpoint is treating CUcheckpointGpuPair as required too broadly for CUDA 12.9.

Possible follow-up direction: keep missing required symbols as failures for APIs that should exist in the active CUDA version, but treat the CUDA 12.9/no-CUcheckpointGpuPair path as an older checkpoint API shape that should remain skippable or should avoid enabling the GPU remapping surface.

Keep baseline CUDA checkpoint coverage active for CUDA versions whose headers do not expose GPU remapping structs, while still failing when required base checkpoint bindings such as CUcheckpointRestoreArgs are missing. Gate only the GPU migration path on CUcheckpointGpuPair so CUDA 12.9 can exercise state, lock, checkpoint, restore-without-mapping, and unlock.
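
The split between base and optional symbols described above could look roughly like this. It is a sketch under stated assumptions, not the code in this PR: `BASE_REQUIRED`, `GPU_REMAP_REQUIRED`, and `probe_checkpoint_features` are hypothetical names, the base list is an illustrative subset, and `SimpleNamespace` objects stand in for the bindings of each CUDA version.

```python
from types import SimpleNamespace

# Illustrative subsets (assumption); the real lists live in cuda.core.checkpoint.
BASE_REQUIRED = ("CUcheckpointRestoreArgs",)
GPU_REMAP_REQUIRED = ("CUcheckpointGpuPair",)


def probe_checkpoint_features(driver):
    """Fail on missing base symbols; report optional GPU remapping support."""
    missing_base = [name for name in BASE_REQUIRED if not hasattr(driver, name)]
    if missing_base:
        raise RuntimeError(
            "CUDA checkpointing requires cuda.bindings with CUDA checkpoint "
            "API support. Missing: " + ", ".join(missing_base)
        )
    gpu_remap = all(hasattr(driver, name) for name in GPU_REMAP_REQUIRED)
    return {"gpu_remap": gpu_remap}


# Hypothetical stand-ins for the bindings of two CUDA versions.
cuda_12_9_like = SimpleNamespace(CUcheckpointRestoreArgs=object)
cuda_13_like = SimpleNamespace(
    CUcheckpointRestoreArgs=object, CUcheckpointGpuPair=object
)

print(probe_checkpoint_features(cuda_12_9_like))  # {'gpu_remap': False}
print(probe_checkpoint_features(cuda_13_like))    # {'gpu_remap': True}
```

With this shape, tests for state, lock, checkpoint, restore-without-mapping, and unlock stay active whenever the base symbols exist, and only the GPU migration tests consult the `gpu_remap` flag.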

rwgk commented May 28, 2026

/ok to test



Development

Successfully merging this pull request may close these issues.

Checkpoint tests should fail, not skip, when required cuda.bindings symbols are missing
