feat : deepseek r1 accuracy support#340
Conversation
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
| monkeypatch.setattr( | ||
| scoring_mod, "_lcb_ws_evaluate", lambda url, codes, timeout: None | ||
| ) | ||
| score, n_repeats = self._scorer(dataset, staged, project).score() |
| monkeypatch.setattr( | ||
| scoring_mod, "_lcb_ws_evaluate", lambda url, codes, timeout: None | ||
| ) | ||
| score, n_repeats = self._scorer(dataset, staged, project).score() |
There was a problem hiding this comment.
Code Review
This pull request introduces an end-to-end example and evaluation pipeline for DeepSeek-R1 (FP4, ModelOpt) on TensorRT-LLM, including an isolated subproject for the MLPerf accuracy evaluator and a new DeepSeekR1Scorer that integrates with a WebSocket-based sandbox for LiveCodeBench. It also optimizes dataset transforms by lazily initializing the Harmonizer to prevent offline loading failures. The review feedback provides critical improvements, including guarding against a potential NaN float conversion crash in the scorer, addressing a security risk by disabling trust_remote_code during tokenizer loading, and ensuring temporary files are cleaned up in the client execution script.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if em is None: # subset failed to grade (status != "ok") | ||
| text_complete = False | ||
| continue |
There was a problem hiding this comment.
If a text subset fails to grade or has no valid samples, its exact_match (em) value in the results can be NaN (represented as nan in Python). Checking only em is None is insufficient, as round(em / 100.0 * n) will raise a ValueError: cannot convert float NaN to integer when em is nan, crashing the finalization phase of the benchmark. Guarding against NaN using np.isnan(em) prevents this crash.
| if em is None: # subset failed to grade (status != "ok") | |
| text_complete = False | |
| continue | |
| if em is None or np.isnan(em): # subset failed to grade (status != "ok") | |
| text_complete = False | |
| continue |
| # basis for tokens_per_sample. Count with the DeepSeek tokenizer so the | ||
| # number matches MLPerf token accounting. | ||
| logger.info("Loading tokenizer: %s", args.tokenizer) | ||
| tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True) |
There was a problem hiding this comment.
Using trust_remote_code=True when loading a tokenizer from an arbitrary or user-specified path poses a security risk, as it allows the execution of arbitrary code present in the model repository. Since DeepSeek-R1 uses a standard tokenizer architecture that does not require custom code execution, it is safer to set trust_remote_code=False.
| tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True) | |
| tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=False) |
|
|
||
| # from-config takes the endpoint from the YAML, so render a run copy with this | ||
| # server's endpoint substituted in. | ||
| RUN_YAML="$(mktemp /tmp/dsr1_run_XXXX.yaml)" |
There was a problem hiding this comment.
The temporary YAML file RUN_YAML created in /tmp/ is never cleaned up, which can clutter the system's temporary directory over multiple runs. Adding an EXIT trap ensures that the temporary file is automatically deleted when the script terminates.
| RUN_YAML="$(mktemp /tmp/dsr1_run_XXXX.yaml)" | |
| RUN_YAML="$(mktemp /tmp/dsr1_run_XXXX.yaml)" | |
| trap 'rm -f "${RUN_YAML}"' EXIT |
arekay-nv
left a comment
There was a problem hiding this comment.
Review Council — Claude Code Review
Found 6 issues (Codex review unavailable — workspace auth error; Claude-only).
|
|
||
| if combined is None: | ||
| return None, n_repeats | ||
| return float(combined), n_repeats |
There was a problem hiding this comment.
[HIGH] Partial score returned without caller-visible flag when LCB is unreachable.
score() returns float(combined) here even when the LCB container was unreachable and only text subsets were graded. The caller in execute.py:913 stores this as "score": score in results.json with no indication of partial coverage. The complete=False flag lives only in the separate deepseek_eval/*_results.json artifact — consumers of results.json cannot distinguish a full 5-subset score from a text-only 4-subset score.
Consider returning None when not lcb_ok (forcing callers to treat it as unscored), or adding a complete field to the accuracy_scores dict in execute.py so the incompleteness is surfaced in the top-level report.
| if em is None: # subset failed to grade (status != "ok") | ||
| text_complete = False | ||
| continue | ||
| text_correct += round(em / 100.0 * n) |
There was a problem hiding this comment.
[MEDIUM] Floating-point round-trip may reconstruct incorrect sample count.
text_correct += round(em / 100.0 * n) recovers a per-subset correct count by inverting the runner's em = 100.0 * k / n formula. This round-trips through JSON serialization in the subprocess, where float precision may cause round() to yield k±1 for certain (k, n) pairs. A more reliable fix is to have the runner also emit "correct_samples": k in each per_dataset entry so the aggregation is exact integer arithmetic.
| total_samples, | ||
| ) | ||
| return None | ||
| passed = sum(sum(code_passed) for code_passed in per_problem.values()) |
There was a problem hiding this comment.
[MEDIUM] Unvalidated WebSocket response structure — malformed response raises TypeError instead of returning None.
sum(sum(code_passed) for code_passed in per_problem.values()) assumes every value in per_problem is an iterable of booleans. If the lcb-service returns a non-list for any question key (e.g. a scalar or dict), sum(code_passed) raises TypeError that propagates unhandled out of _score_lcb_via_container, causing score() to raise rather than fall back gracefully via None. Wrap this block in a try/except (TypeError, ValueError) or validate the schema of per_problem before iterating.
| raise FileNotFoundError( | ||
| f"eval_accuracy.py not found at {eval_dir}. Run setup_eval.sh first." | ||
| ) | ||
| sys.path.insert(0, str(eval_dir)) |
There was a problem hiding this comment.
[MEDIUM] sys.path.insert(0, ...) with a user-supplied --eval-dir allows arbitrary code execution.
sys.path.insert(0, str(eval_dir)) followed by import eval_accuracy executes whichever eval_accuracy.py is found first on sys.path. Because --eval-dir is an exposed CLI argument, pointing it at a directory containing a malicious eval_accuracy.py leads to arbitrary code execution in the runner subprocess. Inserting at index 0 also permanently pollutes sys.path for the process lifetime.
Replace with importlib.util.spec_from_file_location loading from the explicit absolute path eval_dir / 'eval_accuracy.py', which avoids both the path-injection risk and the sys.path side-effect.
| (default ``question``); both are passed through ``accuracy_config.extras``. | ||
| """ | ||
|
|
||
| REQUIRES_EXTRACTOR: ClassVar[bool] = False |
There was a problem hiding this comment.
[LOW] REQUIRES_EXTRACTOR = False but LCB extraction is hardcoded inside _score_lcb_via_container.
Setting REQUIRES_EXTRACTOR = False prevents the scorer framework from injecting an extractor, but _score_lcb_via_container hardcodes PythonCodeExtractor.extract directly. This means accuracy_config.extractor cannot override the LCB code-extraction strategy, inconsistent with how LiveCodeBenchScorer handles the same concern via its injected extractor. At minimum, add a comment explaining why the extractor is bypassed here; ideally respect self.extractor when set.
| text_correct = 0 | ||
| text_n = 0 | ||
| text_complete = True | ||
| for sub, d in per_dataset.items(): |
There was a problem hiding this comment.
[LOW] Customised lcb_subset key may not match the runner's emitted key, causing text-subset mis-classification.
if sub == self.lcb_subset: continue relies on self.lcb_subset matching the key the runner writes to per_dataset. If lcb_subset is set to a non-default value (e.g. "livecodebench_v3") while the runner still emits "livecodebench", the runner's entry passes the guard and is treated as a text subset. Its exact_match=None sets text_complete = False, marking the combined score incomplete even when all text subsets graded successfully. Consider making the skip-condition match on the actual runner entry key, or validate at construction that lcb_subset matches the runner's expected output.
Review Council — Multi-AI Code ReviewReviewed by: Claude | Depth: thorough | Codex: unavailable (workspace auth error) Found 6 issues across 2 files: 1 high · 3 medium · 2 low 🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
|
nv-alicheng
left a comment
There was a problem hiding this comment.
Review Council — Multi-AI Code Review (thorough)
Reviewed by Codex + Claude. Posting 6 inline findings that are not already covered by the existing automated review on this PR (the headline partial-score issue at scoring.py:1588, the rounding round-trip at 1519, the sys.path RCE at deepseek_eval_runner.py:61, and the temp-file leak at run_client.sh:43 are already flagged — see the synthesis comment for how they fit together).
|
|
||
| model_params: | ||
| # Served model id = the checkpoint dir basename trtllm-serve registers. | ||
| name: "${SERVED_MODEL_NAME:-deepseek_r1-torch-fp4-v2}" |
There was a problem hiding this comment.
[Codex] medium · bug: The smoke config defaults model_params.name to deepseek_r1-torch-fp4-v2, but this PR's own docs state that the -v2 checkpoint does not load with the stock TRT-LLM builds and that deepseek_r1-torch-fp4 is the working default. A user who runs the smoke path without exporting SERVED_MODEL_NAME will send requests for a model id the server can't serve, and the run fails immediately — i.e. the advertised quick-validation path is broken out of the box. Default to deepseek_r1-torch-fp4.
| `extras` are forwarded to the scorer. **`lcb_websocket_port` defaults to 13835**, | ||
| so if the `lcb_serve` container is reachable on the client node the livecodebench | ||
| subset is scored **in-run** (one number, no follow-up job); if it's unreachable | ||
| the scorer logs and falls back to grading LCB in-process. The LCB keys are |
There was a problem hiding this comment.
[Claude] medium · documentation: This sentence says that when the lcb_serve container is unreachable "the scorer logs and falls back to grading LCB in-process," but that is not what the code does for the common case (port configured, container down): _score_lcb_via_container returns None and the scorer leaves livecodebench unscored — see scoring.py:1527-1543 ("Do NOT re-run the in-process executor"). In-process grading happens only when lcb_websocket_port is null (the use_container == False path). README lines 201-202 below correctly describe the unscored/complete: false behavior, so the doc contradicts both itself and the implementation. A reader following this line will expect an LCB score that never appears.
| monkeypatch.setattr(scoring_mod.subprocess, "run", fake_run) | ||
| return calls | ||
|
|
||
| def _scorer(self, dataset, staged, project): |
There was a problem hiding this comment.
[Claude] medium · testing: The container tests assert complete in the sidecar *_results.json (test_container_unreachable_leaves_lcb_unscored, test_failed_text_subset_marks_incomplete) but never assert anything about the returned headline score — score, n_repeats = ...score() is captured and ignored. The actual data-integrity gap (already flagged at scoring.py:1588: a partial combined is returned and stored in results.json with no completeness marker) is therefore not guarded by any test. Two further branches are untested: the no-container path with LCB rows present (scoring.py:1483 reached with n_lcb>0 but lcb_websocket_port=None), and the LCB count-divergence warning (scoring.py:1562). _scorer always passes lcb_websocket_port=13835.
| ws = websocket.create_connection( | ||
| url, timeout=7200, ping_interval=30, ping_timeout=10 | ||
| ) | ||
| except (ConnectionRefusedError, OSError, Exception) as e: # noqa: BLE001 |
There was a problem hiding this comment.
[Claude] low · error-handling: except (ConnectionRefusedError, OSError, Exception) as e: — Exception already subsumes ConnectionRefusedError and OSError, so the tuple is misleading and, more importantly, this handler now treats any exception (including programming errors like AttributeError/TypeError raised inside create_connection usage) as a benign "connect failed" and returns None. Catch the specific network/protocol exceptions, or use a bare except Exception with the explanatory comment and let it be honest about what it swallows.
| ) | ||
| args = ap.parse_args() | ||
|
|
||
| df = pd.read_pickle(args.source) |
There was a problem hiding this comment.
[Claude] low · security: df = pd.read_pickle(args.source) deserializes an arbitrary path supplied via --source / $DEEPSEEK_R1_DATASET_PKL. pickle load executes arbitrary code, so pointing this at an untrusted .pkl is RCE. The official MLCommons dataset is the expected input and this is an operator-run script, so severity is low — but a one-line caveat in the script/docstring ("only load the official MLCommons .pkl; pickle deserialization executes arbitrary code") would be prudent since there is currently no such warning.
|
|
||
| @property | ||
| def harmonizer(self) -> Harmonizer: | ||
| if self._harmonizer is None: |
There was a problem hiding this comment.
[Claude] low · design: The harmonizer lazy property is not thread-safe: if self._harmonizer is None: self._harmonizer = Harmonizer(...) lets two threads both observe None and construct two Harmonizer instances. This is benign today because dataset transforms run single-threaded, but the lazy-init pattern silently bakes in that assumption (the previous eager construction made the attribute set-once). If a transform is ever shared across a thread pool (e.g. df.apply parallelism), this becomes a real double-construction race. Flagging so the single-threaded assumption is intentional, not accidental.
Review Council — Multi-AI Code ReviewReviewed by: Codex (gpt-5.4) + Claude | Depth: thorough Posted 6 inline findings. This PR already carries an automated review with several solid catches, so the council deliberately did not re-post issues that bot already covers — instead the synthesis below shows how everything fits together. 1 additional latent issue is out of diff scope and noted at the bottom. 🔴 Must Fix (critical/high)The single most important issue is already flagged by the existing review and is not re-posted:
🟡 Should Fix (medium)
🔵 Consider (low)
Already flagged by the existing bot review (not re-posted)
|
nvzhihanj
left a comment
There was a problem hiding this comment.
Review Council — Multi-AI Code Review (second pass)
Reviewed by: Codex (--yolo) + Claude · Depth: thorough
This PR already has a prior council pass + author self-review covering the high/medium issues. This pass posts only net-new findings not already commented; see the summary for the confirmed/deduped set.
| tail -n 20 "${RESULTS}.raw" >&2 | ||
| exit 1 | ||
| fi | ||
| echo "Fold into the aggregate: full_exact_match = (text_subset_correct + LCB passed_samples) / 4388." |
There was a problem hiding this comment.
[Claude] medium · bug — The fold-in hint hardcodes the full-set denominator:
full_exact_match = (text_subset_correct + LCB passed_samples) / 4388.
But OUTPUTS_PARQUET is user-overridable and the README documents ~385-sample subset runs via offline_deepseek_r1_accuracy_subset.yaml. A user who scores LCB for a subset run and follows this formula divides by 4388 instead of their actual sample count, silently understating accuracy by ~10×. Derive the denominator from the run's actual total sample count (e.g. len(df) of the outputs parquet) rather than hardcoding 4388.
| # trtllm-serve registers the model under the checkpoint dir's basename; the | ||
| # /v1/completions `model` field (model_params.name in the YAML) must match it. | ||
| SERVED_MODEL_NAME="${SERVED_MODEL_NAME:-$(basename "${MODEL_DIR}")}" | ||
| SOURCE_PKL="${SOURCE_PKL:?Set SOURCE_PKL to the MLPerf DeepSeek-R1 source dataset .pkl}" |
There was a problem hiding this comment.
[Codex] medium · bug — SOURCE_PKL="${SOURCE_PKL:?…}" aborts the launch unconditionally when SOURCE_PKL is unset. But in the documented pre-prepared-parquet / SERVER_ONLY=1 workflow the source pickle isn't needed — the later if [ ! -f $PARQUET ] check is meant to skip conversion. This :? expansion fires first, so a run with data/deepseek_r1_eval.parquet already present fails even though no conversion is required. Only require (and mount) SOURCE_PKL when the parquet is missing.
| `<report_dir>/deepseek_eval/<name>_outputs.parquet`, then runs: | ||
|
|
||
| ```bash | ||
| uv run --project <this dir> python deepseek_eval_runner.py \ |
There was a problem hiding this comment.
[Claude] low · documentation — This documented invocation omits --external-subsets:
uv run --project <this dir> python deepseek_eval_runner.py --input … --output … --tokenizer …
In the default (container-reachable) path DeepSeekR1Scorer._run_eval_subprocess passes --external-subsets livecodebench (scoring.py:1500), which changes the runner's behavior (LCB tokenized but not graded, status=external). The documented command matches only the no-container fallback — note both invocation modes so readers understand when --external-subsets applies.
Review Council — Summary (second pass)Reviewed by: Codex ( This PR already has a prior review-council pass + the author's self-review + 🟡 Net-new — posted inline
🔵 Net-new — out-of-diff (can't anchor inline; noted here)
✅ Confirmed but already commented (deduped — not re-posted)The council independently reproduced these already-raised findings (cross-validation, not new):
Codex review ran via |
What does this PR do?
Adds accuracy support for Deepseek-r1 using an ensemble evaluator that will produce a combined score.
Needs #335 to push the container image.
Type of change
Related issues
Testing
Checklist