[PyTorch] Isolate CP pool worker stdout from NCCL/library banners by sudhakarsingh27 · Pull Request #3074 · NVIDIA/TransformerEngine

sudhakarsingh27 · 2026-06-02T00:09:29Z

Problem

The CP attention test pool worker (added in #2993) uses rank-0 stdout as a line-delimited JSON protocol channel between the worker and the parent pytest process. PoolWorker._submit_once reads one line and json.loads it.

When NCCL_DEBUG is set to VERSION or INFO, NCCL writes a banner line — NCCL version 2.x.y+cudaA.B\n — to stdout (fd 1), not stderr. That banner reaches the parent ahead of the first JSON response, json.loads raises, and because the cp_pool fixture is session-scoped and the pool is killed on first failure, every subsequent CP test in the file fails with:

AssertionError: pool worker JSON protocol broke:
JSONDecodeError('Expecting value: line 1 column 1 (char 0)');
line='NCCL version 2.30.5+cuda13.3\n'

CI runners that export NCCL_DEBUG=VERSION (e.g. the cuDNN nightly GB200 job) hit this on all ~200 non-skipped test_cp_with_flash_attention / test_cp_with_fused_attention cases. With NCCL_DEBUG unset the banner isn't emitted, so the suite passes — which is why the regression only shows up on those runners.

The existing mitigation only redirected non-rank-0 stdout to /dev/null, so rank 0's own banner still corrupted the stream.

Fix (both sides)

Worker — run_attention_with_cp_pool.py: before import torch, dup() rank-0's original fd 1 into a private stream reserved for JSON, then dup2(2, 1) so fd 1 points at stderr. Any banner from NCCL / cuBLAS / cuDNN / torch — Python-level or C-level, import-time or runtime — now lands on stderr (still drained into CI logs by the parent's stderr thread) instead of the protocol pipe. Responses are written through the private stream, re-prefixed with a [CP_POOL_RESP] sentinel.

Parent — test_attention_with_cp.py: read until a sentinel-prefixed line, skipping (and echoing to stderr) any non-protocol chatter. This covers the one stdout source the worker-side fd redirect cannot touch: the torchrun agent process, which shares the same pipe. The select() deadline bounds the whole read loop, not just one readline.

Defense in depth: fd-level isolation keeps the pipe clean even under verbose NCCL_DEBUG=INFO, and the sentinel framing tolerates any residual launcher noise.

Validation

8×H100, TE built from this commit:

Scenario	Result
`NCCL_DEBUG=VERSION`, flash p2p + all_gather + a2a (one session, pool reused across the banner)	✅ 3 passed
`NCCL_DEBUG=INFO` (verbose), flash p2p	✅ 1 passed
`NCCL_DEBUG=VERSION`, fused p2p	✅ 1 passed
`NCCL_DEBUG` unset (control)	✅ 1 passed
Without this fix, `NCCL_DEBUG=VERSION`	❌ `pool worker JSON protocol broke` (regression reproduced)

Type of change

Bug fix (test infrastructure; no library/runtime code change)

Checklist

Contributing guidelines followed
Changes commented where non-obvious
No new warnings
Existing test suite serves as input + validation
Existing tests pass locally (8×H100, NCCL_DEBUG=VERSION/INFO/unset)

greptile-apps · 2026-06-02T00:12:40Z

Greptile Summary

This PR fixes a CI regression where NCCL_DEBUG=VERSION caused NCCL to write a banner line to fd 1 (stdout) of the rank-0 pool worker, corrupting the line-delimited JSON protocol channel and cascading failures across all ~200 CP attention tests in the session.

Worker fix (run_attention_with_cp_pool.py): Before import torch, rank 0 dup()s its original stdout fd into a private _JSON_STDOUT file object reserved for JSON writes, then dup2(2, 1) redirects fd 1 to stderr. All import-time and runtime library banners (NCCL, cuBLAS, cuDNN) now land on stderr instead of the protocol pipe; _send_response is updated to write through _JSON_STDOUT.
Parent update (test_attention_with_cp.py): Comment-only changes that reflect the updated isolation guarantee; the select → readline → json.loads logic is unchanged.

Confidence Score: 5/5

Safe to merge — changes are confined to test infrastructure with no library/runtime impact.

The worker-side fd redirect is correctly ordered before import torch, ensuring all C-level and Python-level import-time writes go to stderr. The private _JSON_STDOUT dup is a module-level global that lives for the process lifetime, so there is no premature-close risk. Non-rank-0 workers return early from the redirect function and are unaffected. The parent read path is unchanged. No functional issues found in the changed code.

No files require special attention.

Important Files Changed

Filename	Overview
tests/pytorch/attention/run_attention_with_cp_pool.py	Adds fd-level stdout isolation before `import torch`: rank 0 dups fd 1 into a private `_JSON_STDOUT` stream for the JSON protocol, then redirects fd 1 to stderr so NCCL/cuBLAS banners can't reach the protocol pipe. Logic is correct and the module-level call order achieves the intended effect.
tests/pytorch/attention/test_attention_with_cp.py	Comment-only updates in `_submit_once` to reflect the new worker-side isolation; no behavioral changes to the parent read path (`select` → `readline` → `json.loads`).

Sequence Diagram

sequenceDiagram
    participant P as pytest (parent)
    participant TR as torchrun agent
    participant R0 as rank-0 worker
    participant RN as rank 1..N workers

    Note over R0: module load (before import torch)<br/>os.dup(1) → _JSON_STDOUT (pipe)<br/>os.dup2(2, 1) → fd1 = stderr<br/>sys.stdout = sys.stderr

    P->>TR: "Popen(stdout=PIPE)"
    TR->>R0: spawn (inherits pipe as fd 1)
    TR->>RN: spawn (inherits pipe as fd 1)
    RN-->>RN: "os.dup2(devnull, 1) → fd1 = /dev/null"

    loop each test case
        P->>TR: stdin.write(JSON request)
        TR->>R0: broadcast via dist
        R0->>RN: dist.broadcast_object_list
        RN-->>R0: dist.gather_object
        R0->>P: _JSON_STDOUT.write(JSON resp)
        Note over P: select() → readline() → json.loads()
    end

    P->>TR: "stdin.write({op: shutdown})"
    TR-->>P: process exits

_{Reviews (2): Last reviewed commit: "[PyTorch] Isolate CP pool worker stdout ..." | Re-trigger Greptile}

greptile-apps · 2026-06-02T00:12:45Z

+_RESP_PREFIX = "[CP_POOL_RESP] "
+
+
 def _send_response(rank: int, payload: dict) -> None:


_RESP_PREFIX duplicated across worker and parent without a shared source

The sentinel string "[CP_POOL_RESP] " is defined independently in both run_attention_with_cp_pool.py and test_attention_with_cp.py. If they ever drift (e.g. a typo during a rename), the parent will silently spin through every line until it times out, emitting a confusing "timed out" failure rather than a protocol-mismatch error. A brief comment calling out the coupling would make this risk explicit.

Suggested change

_RESP_PREFIX = "[CP_POOL_RESP] "

def _send_response(rank: int, payload: dict) -> None:

# NOTE: this string must stay identical to PoolWorker._RESP_PREFIX in

# test_attention_with_cp.py; the parent uses it to identify response lines.

_RESP_PREFIX = "[CP_POOL_RESP] "

def _send_response(rank: int, payload: dict) -> None:

Resolved by removing the sentinel entirely in e543ef89 — the rank-0 fd redirect keeps the pipe clean on its own (verified: 0 non-protocol lines under NCCL_DEBUG=INFO), so there's no longer a duplicated _RESP_PREFIX to keep in sync across the two files.

sudhakarsingh27

Targeted fix, mostly minimal. Main asks: drop the unreachable _JSON_STDOUT is None raise in _send_response and the dead re-entry guard in _redirect_rank0_stdout_to_stderr. The fd-redirect + sentinel pair is justified (worker process vs torchrun agent process — different coverage), but confirm the agent actually writes to this pipe; if that's speculative the sentinel alone suffices. fd-level ops are correct and the import-time placement is right. Comments are on the verbose side but explain real non-obvious intent, so fine to keep.

sudhakarsingh27 · 2026-06-02T00:14:49Z

-        sys.stdout.write(json.dumps(payload) + "\n")
-        sys.stdout.flush()
+        if _JSON_STDOUT is None:
+            raise RuntimeError("rank-0 JSON stdout was not initialized")


Unreachable guard. _send_response only writes when rank == 0, and on rank 0 _redirect_rank0_stdout_to_stderr() (called unconditionally at import, when RANK=="0") always sets _JSON_STDOUT. The only way _JSON_STDOUT stays None on rank 0 is if RANK is unset/non-zero at import but rank resolves to 0 in main() — which can't happen since both read the same env var. Drop the check; it adds noise without covering a real path.

Done — guard removed in e543ef89. _send_response now writes directly to _JSON_STDOUT.

sudhakarsingh27 · 2026-06-02T00:15:00Z

+# still reaches the protocol pipe. The rank-0 fd redirect handles this worker's
+# own library/Python writes, but the torchrun *agent* process shares the same
+# pipe and is not covered by it; the prefix lets the parser pick out responses.
+_RESP_PREFIX = "[CP_POOL_RESP] "


Two mechanisms now solve overlapping problems: the rank-0 fd redirect (covers worker process writes) and the sentinel prefix (covers the torchrun agent process, which the worker can't touch). That split is genuinely justified — keep both. But the redirect makes the worker's contribution to pipe noise effectively zero, so the only remaining producer the sentinel protects against is the torchrun agent. Worth stating that explicitly in one place. Also consider whether agent-process output on this pipe was actually observed or is speculative — if speculative, the sentinel alone is the cheaper standalone fix.

Tested the speculative part: instrumented the parent to count non-sentinel lines reaching the pipe and ran 3 cases under NCCL_DEBUG=INFO (max verbosity) — 0 lines skipped. With the rank-0 fd redirect in place, the torchrun agent writes nothing to this stdout pipe (its logs go to stderr). So the sentinel wasn't load-bearing.

Removed the sentinel + parent skip-loop entirely in e543ef89. The fd redirect alone fully isolates the pipe, and it's actually better under NCCL_DEBUG=INFO (banner volume goes to stderr instead of forcing the parent to read+skip hundreds of lines within the deadline). The parent reverts to a single read + json.loads.

sudhakarsingh27 · 2026-06-02T00:15:07Z

+    sys.stdout = sys.stderr
+
+
+_redirect_rank0_stdout_to_stderr()


Import-time side-effect call is correct here and must stay at module scope — the redirect has to happen before import torch to catch C-level/import-time writes, so it can't move into main(). Good. Minor: _redirect_rank0_stdout_to_stderr is only ever called once, immediately below its def; the _JSON_STDOUT is not None re-entry guard is dead. Either inline the body (no function) or drop the guard. The function wrapper does aid readability, so keeping the function and dropping the guard is the cleaner choice.

Done — dropped the _JSON_STDOUT is not None re-entry guard, kept the function. e543ef89.

sudhakarsingh27 · 2026-06-02T00:15:11Z

+        return
+    _JSON_STDOUT = os.fdopen(os.dup(1), "w", buffering=1)
+    os.dup2(2, 1)
+    sys.stdout = sys.stderr


fd ops are idiomatic and correct: os.dup(1) snapshots the original stdout, os.fdopen(..., buffering=1) gives a line-buffered text wrapper, os.dup2(2, 1) repoints fd 1 at stderr. One asymmetry worth a comment: rank 0's library output now goes to stderr (preserved in CI logs), but _silence_non_rank0_stdout sends rank>0 output to /dev/null (discarded). That's defensible (rank 0 is representative) but the two functions now implement deliberately different policies — note the rationale so a future reader doesn't "unify" them.

Added a note in the _silence_non_rank0_stdout docstring explaining the deliberate asymmetry: rank 0 keeps a dup of fd 1 for the JSON channel and sends the rest to stderr (preserved in CI logs), while rank>0 has nothing to preserve so it goes to /dev/null. e543ef89.

sudhakarsingh27 · 2026-06-02T00:15:18Z

+                break
+            # Non-protocol line — echo to stderr so it stays in CI logs.
+            sys.stderr.write(line)
+            sys.stderr.flush()


Read loop is clean and the deadline correctly bounds the whole loop. Two edge cases: (1) select() says ready but readline() can still block if the worker emitted a partial line without a newline — pre-existing, low risk over a pipe, fine to leave. (2) A pathological worker that emits unbounded non-sentinel lines faster than the deadline would keep this loop echoing forever-ish until the deadline; acceptable since the deadline still caps it. No change required, just confirming the loop is sound.

Moot now — the skip-loop was removed (see the sentinel thread). With the fd redirect the pipe carries only the rank-0 JSON line, so the parent is back to a single select + readline + json.loads, no loop. The two edge cases you noted no longer apply.

The CP attention test pool worker (PR NVIDIA#2993) uses rank-0 stdout as a line-delimited JSON protocol channel to the parent pytest process. When NCCL_DEBUG is set to VERSION or INFO, NCCL writes an "NCCL version ...\n" banner to stdout (fd 1); that banner reaches the parent ahead of the first JSON response, json.loads raises, and because the pool fixture is session-scoped and killed on first failure, every subsequent CP test in the file fails. CI runners that export NCCL_DEBUG hit this on all ~200 non-skipped cases. The prior mitigation only redirected non-rank-0 stdout to /dev/null, so rank 0's own banner still corrupted the stream. Fix it at the source: in the worker, before importing torch, dup rank-0's original fd 1 into a private stream reserved for JSON, then point fd 1 at stderr. Any banner from NCCL/cuBLAS/cuDNN/torch (Python or C level, import-time or runtime) now lands on stderr — still drained into CI logs by the parent's stderr thread — instead of the protocol pipe. Combined with the existing non-rank-0 /dev/null redirect, the pipe carries only rank-0 JSON, so the parent's single-line read needs no change. Validated on 8xH100 (TE built from this commit). With NCCL_DEBUG=VERSION and =INFO, flash (p2p/all_gather/a2a) and fused cases pass and zero non-protocol lines reach the pipe; without the fix the same cases fail with "pool worker JSON protocol broke". Control with NCCL_DEBUG unset also passes. Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

sudhakarsingh27 self-assigned this Jun 2, 2026

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

sudhakarsingh27 commented Jun 2, 2026

View reviewed changes

sudhakarsingh27 force-pushed the sudhakars/cp_pool_stdout_isolation branch from 8332034 to e543ef8 Compare June 2, 2026 00:23

		_RESP_PREFIX = "[CP_POOL_RESP] "


		def _send_response(rank: int, payload: dict) -> None:

Conversation

sudhakarsingh27 commented Jun 2, 2026

Problem

Fix (both sides)

Validation

Type of change

Checklist

Uh oh!

greptile-apps Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sudhakarsingh27 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Jun 2, 2026 •

edited

Loading