Skip to content

[PyTorch] Isolate CP pool worker stdout from NCCL/library banners#3074

Open
sudhakarsingh27 wants to merge 1 commit into
NVIDIA:mainfrom
sudhakarsingh27:sudhakars/cp_pool_stdout_isolation
Open

[PyTorch] Isolate CP pool worker stdout from NCCL/library banners#3074
sudhakarsingh27 wants to merge 1 commit into
NVIDIA:mainfrom
sudhakarsingh27:sudhakars/cp_pool_stdout_isolation

Conversation

@sudhakarsingh27
Copy link
Copy Markdown
Member

Problem

The CP attention test pool worker (added in #2993) uses rank-0 stdout as a line-delimited JSON protocol channel between the worker and the parent pytest process. PoolWorker._submit_once reads one line and json.loads it.

When NCCL_DEBUG is set to VERSION or INFO, NCCL writes a banner line — NCCL version 2.x.y+cudaA.B\n — to stdout (fd 1), not stderr. That banner reaches the parent ahead of the first JSON response, json.loads raises, and because the cp_pool fixture is session-scoped and the pool is killed on first failure, every subsequent CP test in the file fails with:

AssertionError: pool worker JSON protocol broke:
JSONDecodeError('Expecting value: line 1 column 1 (char 0)');
line='NCCL version 2.30.5+cuda13.3\n'

CI runners that export NCCL_DEBUG=VERSION (e.g. the cuDNN nightly GB200 job) hit this on all ~200 non-skipped test_cp_with_flash_attention / test_cp_with_fused_attention cases. With NCCL_DEBUG unset the banner isn't emitted, so the suite passes — which is why the regression only shows up on those runners.

The existing mitigation only redirected non-rank-0 stdout to /dev/null, so rank 0's own banner still corrupted the stream.

Fix (both sides)

Worker — run_attention_with_cp_pool.py: before import torch, dup() rank-0's original fd 1 into a private stream reserved for JSON, then dup2(2, 1) so fd 1 points at stderr. Any banner from NCCL / cuBLAS / cuDNN / torch — Python-level or C-level, import-time or runtime — now lands on stderr (still drained into CI logs by the parent's stderr thread) instead of the protocol pipe. Responses are written through the private stream, re-prefixed with a [CP_POOL_RESP] sentinel.

Parent — test_attention_with_cp.py: read until a sentinel-prefixed line, skipping (and echoing to stderr) any non-protocol chatter. This covers the one stdout source the worker-side fd redirect cannot touch: the torchrun agent process, which shares the same pipe. The select() deadline bounds the whole read loop, not just one readline.

Defense in depth: fd-level isolation keeps the pipe clean even under verbose NCCL_DEBUG=INFO, and the sentinel framing tolerates any residual launcher noise.

Validation

8×H100, TE built from this commit:

Scenario Result
NCCL_DEBUG=VERSION, flash p2p + all_gather + a2a (one session, pool reused across the banner) ✅ 3 passed
NCCL_DEBUG=INFO (verbose), flash p2p ✅ 1 passed
NCCL_DEBUG=VERSION, fused p2p ✅ 1 passed
NCCL_DEBUG unset (control) ✅ 1 passed
Without this fix, NCCL_DEBUG=VERSION pool worker JSON protocol broke (regression reproduced)

Type of change

  • Bug fix (test infrastructure; no library/runtime code change)

Checklist

  • Contributing guidelines followed
  • Changes commented where non-obvious
  • No new warnings
  • Existing test suite serves as input + validation
  • Existing tests pass locally (8×H100, NCCL_DEBUG=VERSION/INFO/unset)

@sudhakarsingh27 sudhakarsingh27 self-assigned this Jun 2, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 2, 2026

Greptile Summary

This PR fixes a CI regression where NCCL_DEBUG=VERSION caused NCCL to write a banner line to fd 1 (stdout) of the rank-0 pool worker, corrupting the line-delimited JSON protocol channel and cascading failures across all ~200 CP attention tests in the session.

  • Worker fix (run_attention_with_cp_pool.py): Before import torch, rank 0 dup()s its original stdout fd into a private _JSON_STDOUT file object reserved for JSON writes, then dup2(2, 1) redirects fd 1 to stderr. All import-time and runtime library banners (NCCL, cuBLAS, cuDNN) now land on stderr instead of the protocol pipe; _send_response is updated to write through _JSON_STDOUT.
  • Parent update (test_attention_with_cp.py): Comment-only changes that reflect the updated isolation guarantee; the select → readline → json.loads logic is unchanged.

Confidence Score: 5/5

Safe to merge — changes are confined to test infrastructure with no library/runtime impact.

The worker-side fd redirect is correctly ordered before import torch, ensuring all C-level and Python-level import-time writes go to stderr. The private _JSON_STDOUT dup is a module-level global that lives for the process lifetime, so there is no premature-close risk. Non-rank-0 workers return early from the redirect function and are unaffected. The parent read path is unchanged. No functional issues found in the changed code.

No files require special attention.

Important Files Changed

Filename Overview
tests/pytorch/attention/run_attention_with_cp_pool.py Adds fd-level stdout isolation before import torch: rank 0 dups fd 1 into a private _JSON_STDOUT stream for the JSON protocol, then redirects fd 1 to stderr so NCCL/cuBLAS banners can't reach the protocol pipe. Logic is correct and the module-level call order achieves the intended effect.
tests/pytorch/attention/test_attention_with_cp.py Comment-only updates in _submit_once to reflect the new worker-side isolation; no behavioral changes to the parent read path (selectreadlinejson.loads).

Sequence Diagram

sequenceDiagram
    participant P as pytest (parent)
    participant TR as torchrun agent
    participant R0 as rank-0 worker
    participant RN as rank 1..N workers

    Note over R0: module load (before import torch)<br/>os.dup(1) → _JSON_STDOUT (pipe)<br/>os.dup2(2, 1) → fd1 = stderr<br/>sys.stdout = sys.stderr

    P->>TR: "Popen(stdout=PIPE)"
    TR->>R0: spawn (inherits pipe as fd 1)
    TR->>RN: spawn (inherits pipe as fd 1)
    RN-->>RN: "os.dup2(devnull, 1) → fd1 = /dev/null"

    loop each test case
        P->>TR: stdin.write(JSON request)
        TR->>R0: broadcast via dist
        R0->>RN: dist.broadcast_object_list
        RN-->>R0: dist.gather_object
        R0->>P: _JSON_STDOUT.write(JSON resp)
        Note over P: select() → readline() → json.loads()
    end

    P->>TR: "stdin.write({op: shutdown})"
    TR-->>P: process exits
Loading

Reviews (2): Last reviewed commit: "[PyTorch] Isolate CP pool worker stdout ..." | Re-trigger Greptile

Comment on lines 80 to 83
_RESP_PREFIX = "[CP_POOL_RESP] "


def _send_response(rank: int, payload: dict) -> None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _RESP_PREFIX duplicated across worker and parent without a shared source

The sentinel string "[CP_POOL_RESP] " is defined independently in both run_attention_with_cp_pool.py and test_attention_with_cp.py. If they ever drift (e.g. a typo during a rename), the parent will silently spin through every line until it times out, emitting a confusing "timed out" failure rather than a protocol-mismatch error. A brief comment calling out the coupling would make this risk explicit.

Suggested change
_RESP_PREFIX = "[CP_POOL_RESP] "
def _send_response(rank: int, payload: dict) -> None:
# NOTE: this string must stay identical to PoolWorker._RESP_PREFIX in
# test_attention_with_cp.py; the parent uses it to identify response lines.
_RESP_PREFIX = "[CP_POOL_RESP] "
def _send_response(rank: int, payload: dict) -> None:

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved by removing the sentinel entirely in e543ef89 — the rank-0 fd redirect keeps the pipe clean on its own (verified: 0 non-protocol lines under NCCL_DEBUG=INFO), so there's no longer a duplicated _RESP_PREFIX to keep in sync across the two files.

Copy link
Copy Markdown
Member Author

@sudhakarsingh27 sudhakarsingh27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Targeted fix, mostly minimal. Main asks: drop the unreachable _JSON_STDOUT is None raise in _send_response and the dead re-entry guard in _redirect_rank0_stdout_to_stderr. The fd-redirect + sentinel pair is justified (worker process vs torchrun agent process — different coverage), but confirm the agent actually writes to this pipe; if that's speculative the sentinel alone suffices. fd-level ops are correct and the import-time placement is right. Comments are on the verbose side but explain real non-obvious intent, so fine to keep.

sys.stdout.write(json.dumps(payload) + "\n")
sys.stdout.flush()
if _JSON_STDOUT is None:
raise RuntimeError("rank-0 JSON stdout was not initialized")
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unreachable guard. _send_response only writes when rank == 0, and on rank 0 _redirect_rank0_stdout_to_stderr() (called unconditionally at import, when RANK=="0") always sets _JSON_STDOUT. The only way _JSON_STDOUT stays None on rank 0 is if RANK is unset/non-zero at import but rank resolves to 0 in main() — which can't happen since both read the same env var. Drop the check; it adds noise without covering a real path.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — guard removed in e543ef89. _send_response now writes directly to _JSON_STDOUT.

# still reaches the protocol pipe. The rank-0 fd redirect handles this worker's
# own library/Python writes, but the torchrun *agent* process shares the same
# pipe and is not covered by it; the prefix lets the parser pick out responses.
_RESP_PREFIX = "[CP_POOL_RESP] "
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two mechanisms now solve overlapping problems: the rank-0 fd redirect (covers worker process writes) and the sentinel prefix (covers the torchrun agent process, which the worker can't touch). That split is genuinely justified — keep both. But the redirect makes the worker's contribution to pipe noise effectively zero, so the only remaining producer the sentinel protects against is the torchrun agent. Worth stating that explicitly in one place. Also consider whether agent-process output on this pipe was actually observed or is speculative — if speculative, the sentinel alone is the cheaper standalone fix.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested the speculative part: instrumented the parent to count non-sentinel lines reaching the pipe and ran 3 cases under NCCL_DEBUG=INFO (max verbosity) — 0 lines skipped. With the rank-0 fd redirect in place, the torchrun agent writes nothing to this stdout pipe (its logs go to stderr). So the sentinel wasn't load-bearing.

Removed the sentinel + parent skip-loop entirely in e543ef89. The fd redirect alone fully isolates the pipe, and it's actually better under NCCL_DEBUG=INFO (banner volume goes to stderr instead of forcing the parent to read+skip hundreds of lines within the deadline). The parent reverts to a single read + json.loads.

sys.stdout = sys.stderr


_redirect_rank0_stdout_to_stderr()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import-time side-effect call is correct here and must stay at module scope — the redirect has to happen before import torch to catch C-level/import-time writes, so it can't move into main(). Good. Minor: _redirect_rank0_stdout_to_stderr is only ever called once, immediately below its def; the _JSON_STDOUT is not None re-entry guard is dead. Either inline the body (no function) or drop the guard. The function wrapper does aid readability, so keeping the function and dropping the guard is the cleaner choice.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — dropped the _JSON_STDOUT is not None re-entry guard, kept the function. e543ef89.

return
_JSON_STDOUT = os.fdopen(os.dup(1), "w", buffering=1)
os.dup2(2, 1)
sys.stdout = sys.stderr
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fd ops are idiomatic and correct: os.dup(1) snapshots the original stdout, os.fdopen(..., buffering=1) gives a line-buffered text wrapper, os.dup2(2, 1) repoints fd 1 at stderr. One asymmetry worth a comment: rank 0's library output now goes to stderr (preserved in CI logs), but _silence_non_rank0_stdout sends rank>0 output to /dev/null (discarded). That's defensible (rank 0 is representative) but the two functions now implement deliberately different policies — note the rationale so a future reader doesn't "unify" them.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note in the _silence_non_rank0_stdout docstring explaining the deliberate asymmetry: rank 0 keeps a dup of fd 1 for the JSON channel and sends the rest to stderr (preserved in CI logs), while rank>0 has nothing to preserve so it goes to /dev/null. e543ef89.

break
# Non-protocol line — echo to stderr so it stays in CI logs.
sys.stderr.write(line)
sys.stderr.flush()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read loop is clean and the deadline correctly bounds the whole loop. Two edge cases: (1) select() says ready but readline() can still block if the worker emitted a partial line without a newline — pre-existing, low risk over a pipe, fine to leave. (2) A pathological worker that emits unbounded non-sentinel lines faster than the deadline would keep this loop echoing forever-ish until the deadline; acceptable since the deadline still caps it. No change required, just confirming the loop is sound.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moot now — the skip-loop was removed (see the sentinel thread). With the fd redirect the pipe carries only the rank-0 JSON line, so the parent is back to a single select + readline + json.loads, no loop. The two edge cases you noted no longer apply.

The CP attention test pool worker (PR NVIDIA#2993) uses rank-0 stdout as a
line-delimited JSON protocol channel to the parent pytest process. When
NCCL_DEBUG is set to VERSION or INFO, NCCL writes an "NCCL version ...\n"
banner to stdout (fd 1); that banner reaches the parent ahead of the
first JSON response, json.loads raises, and because the pool fixture is
session-scoped and killed on first failure, every subsequent CP test in
the file fails. CI runners that export NCCL_DEBUG hit this on all ~200
non-skipped cases.

The prior mitigation only redirected non-rank-0 stdout to /dev/null, so
rank 0's own banner still corrupted the stream. Fix it at the source:
in the worker, before importing torch, dup rank-0's original fd 1 into a
private stream reserved for JSON, then point fd 1 at stderr. Any banner
from NCCL/cuBLAS/cuDNN/torch (Python or C level, import-time or runtime)
now lands on stderr — still drained into CI logs by the parent's stderr
thread — instead of the protocol pipe. Combined with the existing
non-rank-0 /dev/null redirect, the pipe carries only rank-0 JSON, so the
parent's single-line read needs no change.

Validated on 8xH100 (TE built from this commit). With NCCL_DEBUG=VERSION
and =INFO, flash (p2p/all_gather/a2a) and fused cases pass and zero
non-protocol lines reach the pipe; without the fix the same cases fail
with "pool worker JSON protocol broke". Control with NCCL_DEBUG unset
also passes.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the sudhakars/cp_pool_stdout_isolation branch from 8332034 to e543ef8 Compare June 2, 2026 00:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant