[PyTorch] Isolate CP pool worker stdout from NCCL/library banners#3074
[PyTorch] Isolate CP pool worker stdout from NCCL/library banners#3074sudhakarsingh27 wants to merge 1 commit into
Conversation
Greptile SummaryThis PR fixes a CI regression where
Confidence Score: 5/5Safe to merge — changes are confined to test infrastructure with no library/runtime impact. The worker-side fd redirect is correctly ordered before No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant P as pytest (parent)
participant TR as torchrun agent
participant R0 as rank-0 worker
participant RN as rank 1..N workers
Note over R0: module load (before import torch)<br/>os.dup(1) → _JSON_STDOUT (pipe)<br/>os.dup2(2, 1) → fd1 = stderr<br/>sys.stdout = sys.stderr
P->>TR: "Popen(stdout=PIPE)"
TR->>R0: spawn (inherits pipe as fd 1)
TR->>RN: spawn (inherits pipe as fd 1)
RN-->>RN: "os.dup2(devnull, 1) → fd1 = /dev/null"
loop each test case
P->>TR: stdin.write(JSON request)
TR->>R0: broadcast via dist
R0->>RN: dist.broadcast_object_list
RN-->>R0: dist.gather_object
R0->>P: _JSON_STDOUT.write(JSON resp)
Note over P: select() → readline() → json.loads()
end
P->>TR: "stdin.write({op: shutdown})"
TR-->>P: process exits
Reviews (2): Last reviewed commit: "[PyTorch] Isolate CP pool worker stdout ..." | Re-trigger Greptile |
| _RESP_PREFIX = "[CP_POOL_RESP] " | ||
|
|
||
|
|
||
| def _send_response(rank: int, payload: dict) -> None: |
There was a problem hiding this comment.
_RESP_PREFIX duplicated across worker and parent without a shared source
The sentinel string "[CP_POOL_RESP] " is defined independently in both run_attention_with_cp_pool.py and test_attention_with_cp.py. If they ever drift (e.g. a typo during a rename), the parent will silently spin through every line until it times out, emitting a confusing "timed out" failure rather than a protocol-mismatch error. A brief comment calling out the coupling would make this risk explicit.
| _RESP_PREFIX = "[CP_POOL_RESP] " | |
| def _send_response(rank: int, payload: dict) -> None: | |
| # NOTE: this string must stay identical to PoolWorker._RESP_PREFIX in | |
| # test_attention_with_cp.py; the parent uses it to identify response lines. | |
| _RESP_PREFIX = "[CP_POOL_RESP] " | |
| def _send_response(rank: int, payload: dict) -> None: |
There was a problem hiding this comment.
Resolved by removing the sentinel entirely in e543ef89 — the rank-0 fd redirect keeps the pipe clean on its own (verified: 0 non-protocol lines under NCCL_DEBUG=INFO), so there's no longer a duplicated _RESP_PREFIX to keep in sync across the two files.
sudhakarsingh27
left a comment
There was a problem hiding this comment.
Targeted fix, mostly minimal. Main asks: drop the unreachable _JSON_STDOUT is None raise in _send_response and the dead re-entry guard in _redirect_rank0_stdout_to_stderr. The fd-redirect + sentinel pair is justified (worker process vs torchrun agent process — different coverage), but confirm the agent actually writes to this pipe; if that's speculative the sentinel alone suffices. fd-level ops are correct and the import-time placement is right. Comments are on the verbose side but explain real non-obvious intent, so fine to keep.
| sys.stdout.write(json.dumps(payload) + "\n") | ||
| sys.stdout.flush() | ||
| if _JSON_STDOUT is None: | ||
| raise RuntimeError("rank-0 JSON stdout was not initialized") |
There was a problem hiding this comment.
Unreachable guard. _send_response only writes when rank == 0, and on rank 0 _redirect_rank0_stdout_to_stderr() (called unconditionally at import, when RANK=="0") always sets _JSON_STDOUT. The only way _JSON_STDOUT stays None on rank 0 is if RANK is unset/non-zero at import but rank resolves to 0 in main() — which can't happen since both read the same env var. Drop the check; it adds noise without covering a real path.
There was a problem hiding this comment.
Done — guard removed in e543ef89. _send_response now writes directly to _JSON_STDOUT.
| # still reaches the protocol pipe. The rank-0 fd redirect handles this worker's | ||
| # own library/Python writes, but the torchrun *agent* process shares the same | ||
| # pipe and is not covered by it; the prefix lets the parser pick out responses. | ||
| _RESP_PREFIX = "[CP_POOL_RESP] " |
There was a problem hiding this comment.
Two mechanisms now solve overlapping problems: the rank-0 fd redirect (covers worker process writes) and the sentinel prefix (covers the torchrun agent process, which the worker can't touch). That split is genuinely justified — keep both. But the redirect makes the worker's contribution to pipe noise effectively zero, so the only remaining producer the sentinel protects against is the torchrun agent. Worth stating that explicitly in one place. Also consider whether agent-process output on this pipe was actually observed or is speculative — if speculative, the sentinel alone is the cheaper standalone fix.
There was a problem hiding this comment.
Tested the speculative part: instrumented the parent to count non-sentinel lines reaching the pipe and ran 3 cases under NCCL_DEBUG=INFO (max verbosity) — 0 lines skipped. With the rank-0 fd redirect in place, the torchrun agent writes nothing to this stdout pipe (its logs go to stderr). So the sentinel wasn't load-bearing.
Removed the sentinel + parent skip-loop entirely in e543ef89. The fd redirect alone fully isolates the pipe, and it's actually better under NCCL_DEBUG=INFO (banner volume goes to stderr instead of forcing the parent to read+skip hundreds of lines within the deadline). The parent reverts to a single read + json.loads.
| sys.stdout = sys.stderr | ||
|
|
||
|
|
||
| _redirect_rank0_stdout_to_stderr() |
There was a problem hiding this comment.
Import-time side-effect call is correct here and must stay at module scope — the redirect has to happen before import torch to catch C-level/import-time writes, so it can't move into main(). Good. Minor: _redirect_rank0_stdout_to_stderr is only ever called once, immediately below its def; the _JSON_STDOUT is not None re-entry guard is dead. Either inline the body (no function) or drop the guard. The function wrapper does aid readability, so keeping the function and dropping the guard is the cleaner choice.
There was a problem hiding this comment.
Done — dropped the _JSON_STDOUT is not None re-entry guard, kept the function. e543ef89.
| return | ||
| _JSON_STDOUT = os.fdopen(os.dup(1), "w", buffering=1) | ||
| os.dup2(2, 1) | ||
| sys.stdout = sys.stderr |
There was a problem hiding this comment.
fd ops are idiomatic and correct: os.dup(1) snapshots the original stdout, os.fdopen(..., buffering=1) gives a line-buffered text wrapper, os.dup2(2, 1) repoints fd 1 at stderr. One asymmetry worth a comment: rank 0's library output now goes to stderr (preserved in CI logs), but _silence_non_rank0_stdout sends rank>0 output to /dev/null (discarded). That's defensible (rank 0 is representative) but the two functions now implement deliberately different policies — note the rationale so a future reader doesn't "unify" them.
There was a problem hiding this comment.
Added a note in the _silence_non_rank0_stdout docstring explaining the deliberate asymmetry: rank 0 keeps a dup of fd 1 for the JSON channel and sends the rest to stderr (preserved in CI logs), while rank>0 has nothing to preserve so it goes to /dev/null. e543ef89.
| break | ||
| # Non-protocol line — echo to stderr so it stays in CI logs. | ||
| sys.stderr.write(line) | ||
| sys.stderr.flush() |
There was a problem hiding this comment.
Read loop is clean and the deadline correctly bounds the whole loop. Two edge cases: (1) select() says ready but readline() can still block if the worker emitted a partial line without a newline — pre-existing, low risk over a pipe, fine to leave. (2) A pathological worker that emits unbounded non-sentinel lines faster than the deadline would keep this loop echoing forever-ish until the deadline; acceptable since the deadline still caps it. No change required, just confirming the loop is sound.
There was a problem hiding this comment.
Moot now — the skip-loop was removed (see the sentinel thread). With the fd redirect the pipe carries only the rank-0 JSON line, so the parent is back to a single select + readline + json.loads, no loop. The two edge cases you noted no longer apply.
The CP attention test pool worker (PR NVIDIA#2993) uses rank-0 stdout as a line-delimited JSON protocol channel to the parent pytest process. When NCCL_DEBUG is set to VERSION or INFO, NCCL writes an "NCCL version ...\n" banner to stdout (fd 1); that banner reaches the parent ahead of the first JSON response, json.loads raises, and because the pool fixture is session-scoped and killed on first failure, every subsequent CP test in the file fails. CI runners that export NCCL_DEBUG hit this on all ~200 non-skipped cases. The prior mitigation only redirected non-rank-0 stdout to /dev/null, so rank 0's own banner still corrupted the stream. Fix it at the source: in the worker, before importing torch, dup rank-0's original fd 1 into a private stream reserved for JSON, then point fd 1 at stderr. Any banner from NCCL/cuBLAS/cuDNN/torch (Python or C level, import-time or runtime) now lands on stderr — still drained into CI logs by the parent's stderr thread — instead of the protocol pipe. Combined with the existing non-rank-0 /dev/null redirect, the pipe carries only rank-0 JSON, so the parent's single-line read needs no change. Validated on 8xH100 (TE built from this commit). With NCCL_DEBUG=VERSION and =INFO, flash (p2p/all_gather/a2a) and fused cases pass and zero non-protocol lines reach the pipe; without the fix the same cases fail with "pool worker JSON protocol broke". Control with NCCL_DEBUG unset also passes. Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
8332034 to
e543ef8
Compare
Problem
The CP attention test pool worker (added in #2993) uses rank-0 stdout as a line-delimited JSON protocol channel between the worker and the parent pytest process.
PoolWorker._submit_oncereads one line andjson.loadsit.When
NCCL_DEBUGis set toVERSIONorINFO, NCCL writes a banner line —NCCL version 2.x.y+cudaA.B\n— to stdout (fd 1), not stderr. That banner reaches the parent ahead of the first JSON response,json.loadsraises, and because thecp_poolfixture is session-scoped and the pool is killed on first failure, every subsequent CP test in the file fails with:CI runners that export
NCCL_DEBUG=VERSION(e.g. the cuDNN nightly GB200 job) hit this on all ~200 non-skippedtest_cp_with_flash_attention/test_cp_with_fused_attentioncases. WithNCCL_DEBUGunset the banner isn't emitted, so the suite passes — which is why the regression only shows up on those runners.The existing mitigation only redirected non-rank-0 stdout to
/dev/null, so rank 0's own banner still corrupted the stream.Fix (both sides)
Worker —
run_attention_with_cp_pool.py: beforeimport torch,dup()rank-0's original fd 1 into a private stream reserved for JSON, thendup2(2, 1)so fd 1 points at stderr. Any banner from NCCL / cuBLAS / cuDNN / torch — Python-level or C-level, import-time or runtime — now lands on stderr (still drained into CI logs by the parent's stderr thread) instead of the protocol pipe. Responses are written through the private stream, re-prefixed with a[CP_POOL_RESP]sentinel.Parent —
test_attention_with_cp.py: read until a sentinel-prefixed line, skipping (and echoing to stderr) any non-protocol chatter. This covers the one stdout source the worker-side fd redirect cannot touch: the torchrun agent process, which shares the same pipe. Theselect()deadline bounds the whole read loop, not just onereadline.Defense in depth: fd-level isolation keeps the pipe clean even under verbose
NCCL_DEBUG=INFO, and the sentinel framing tolerates any residual launcher noise.Validation
8×H100, TE built from this commit:
NCCL_DEBUG=VERSION, flash p2p + all_gather + a2a (one session, pool reused across the banner)NCCL_DEBUG=INFO(verbose), flash p2pNCCL_DEBUG=VERSION, fused p2pNCCL_DEBUGunset (control)NCCL_DEBUG=VERSIONpool worker JSON protocol broke(regression reproduced)Type of change
Checklist
NCCL_DEBUG=VERSION/INFO/unset)