Skip to content

feat : deepseek r1 accuracy support#340

Open
arekay-nv wants to merge 8 commits into
mainfrom
arekay/deepseek_r1_accuracy
Open

feat : deepseek r1 accuracy support#340
arekay-nv wants to merge 8 commits into
mainfrom
arekay/deepseek_r1_accuracy

Conversation

@arekay-nv
Copy link
Copy Markdown
Collaborator

What does this PR do?

Adds accuracy support for Deepseek-r1 using an ensemble evaluator that will produce a combined score.
Needs #335 to push the container image.

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

arekay-nv added 5 commits June 2, 2026 21:59
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@arekay-nv arekay-nv requested a review from a team June 5, 2026 21:21
@github-actions github-actions Bot requested a review from nvzhihanj June 5, 2026 21:21
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 5, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

monkeypatch.setattr(
scoring_mod, "_lcb_ws_evaluate", lambda url, codes, timeout: None
)
score, n_repeats = self._scorer(dataset, staged, project).score()
monkeypatch.setattr(
scoring_mod, "_lcb_ws_evaluate", lambda url, codes, timeout: None
)
score, n_repeats = self._scorer(dataset, staged, project).score()
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an end-to-end example and evaluation pipeline for DeepSeek-R1 (FP4, ModelOpt) on TensorRT-LLM, including an isolated subproject for the MLPerf accuracy evaluator and a new DeepSeekR1Scorer that integrates with a WebSocket-based sandbox for LiveCodeBench. It also optimizes dataset transforms by lazily initializing the Harmonizer to prevent offline loading failures. The review feedback provides critical improvements, including guarding against a potential NaN float conversion crash in the scorer, addressing a security risk by disabling trust_remote_code during tokenizer loading, and ensuring temporary files are cleaned up in the client execution script.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +1516 to +1518
if em is None: # subset failed to grade (status != "ok")
text_complete = False
continue
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If a text subset fails to grade or has no valid samples, its exact_match (em) value in the results can be NaN (represented as nan in Python). Checking only em is None is insufficient, as round(em / 100.0 * n) will raise a ValueError: cannot convert float NaN to integer when em is nan, crashing the finalization phase of the benchmark. Guarding against NaN using np.isnan(em) prevents this crash.

Suggested change
if em is None: # subset failed to grade (status != "ok")
text_complete = False
continue
if em is None or np.isnan(em): # subset failed to grade (status != "ok")
text_complete = False
continue

# basis for tokens_per_sample. Count with the DeepSeek tokenizer so the
# number matches MLPerf token accounting.
logger.info("Loading tokenizer: %s", args.tokenizer)
tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Using trust_remote_code=True when loading a tokenizer from an arbitrary or user-specified path poses a security risk, as it allows the execution of arbitrary code present in the model repository. Since DeepSeek-R1 uses a standard tokenizer architecture that does not require custom code execution, it is safer to set trust_remote_code=False.

Suggested change
tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=False)


# from-config takes the endpoint from the YAML, so render a run copy with this
# server's endpoint substituted in.
RUN_YAML="$(mktemp /tmp/dsr1_run_XXXX.yaml)"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The temporary YAML file RUN_YAML created in /tmp/ is never cleaned up, which can clutter the system's temporary directory over multiple runs. Adding an EXIT trap ensures that the temporary file is automatically deleted when the script terminates.

Suggested change
RUN_YAML="$(mktemp /tmp/dsr1_run_XXXX.yaml)"
RUN_YAML="$(mktemp /tmp/dsr1_run_XXXX.yaml)"
trap 'rm -f "${RUN_YAML}"' EXIT

Copy link
Copy Markdown
Collaborator Author

@arekay-nv arekay-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — Claude Code Review

Found 6 issues (Codex review unavailable — workspace auth error; Claude-only).


if combined is None:
return None, n_repeats
return float(combined), n_repeats
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[HIGH] Partial score returned without caller-visible flag when LCB is unreachable.

score() returns float(combined) here even when the LCB container was unreachable and only text subsets were graded. The caller in execute.py:913 stores this as "score": score in results.json with no indication of partial coverage. The complete=False flag lives only in the separate deepseek_eval/*_results.json artifact — consumers of results.json cannot distinguish a full 5-subset score from a text-only 4-subset score.

Consider returning None when not lcb_ok (forcing callers to treat it as unscored), or adding a complete field to the accuracy_scores dict in execute.py so the incompleteness is surfaced in the top-level report.

if em is None: # subset failed to grade (status != "ok")
text_complete = False
continue
text_correct += round(em / 100.0 * n)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Floating-point round-trip may reconstruct incorrect sample count.

text_correct += round(em / 100.0 * n) recovers a per-subset correct count by inverting the runner's em = 100.0 * k / n formula. This round-trips through JSON serialization in the subprocess, where float precision may cause round() to yield k±1 for certain (k, n) pairs. A more reliable fix is to have the runner also emit "correct_samples": k in each per_dataset entry so the aggregation is exact integer arithmetic.

total_samples,
)
return None
passed = sum(sum(code_passed) for code_passed in per_problem.values())
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Unvalidated WebSocket response structure — malformed response raises TypeError instead of returning None.

sum(sum(code_passed) for code_passed in per_problem.values()) assumes every value in per_problem is an iterable of booleans. If the lcb-service returns a non-list for any question key (e.g. a scalar or dict), sum(code_passed) raises TypeError that propagates unhandled out of _score_lcb_via_container, causing score() to raise rather than fall back gracefully via None. Wrap this block in a try/except (TypeError, ValueError) or validate the schema of per_problem before iterating.

raise FileNotFoundError(
f"eval_accuracy.py not found at {eval_dir}. Run setup_eval.sh first."
)
sys.path.insert(0, str(eval_dir))
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] sys.path.insert(0, ...) with a user-supplied --eval-dir allows arbitrary code execution.

sys.path.insert(0, str(eval_dir)) followed by import eval_accuracy executes whichever eval_accuracy.py is found first on sys.path. Because --eval-dir is an exposed CLI argument, pointing it at a directory containing a malicious eval_accuracy.py leads to arbitrary code execution in the runner subprocess. Inserting at index 0 also permanently pollutes sys.path for the process lifetime.

Replace with importlib.util.spec_from_file_location loading from the explicit absolute path eval_dir / 'eval_accuracy.py', which avoids both the path-injection risk and the sys.path side-effect.

(default ``question``); both are passed through ``accuracy_config.extras``.
"""

REQUIRES_EXTRACTOR: ClassVar[bool] = False
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] REQUIRES_EXTRACTOR = False but LCB extraction is hardcoded inside _score_lcb_via_container.

Setting REQUIRES_EXTRACTOR = False prevents the scorer framework from injecting an extractor, but _score_lcb_via_container hardcodes PythonCodeExtractor.extract directly. This means accuracy_config.extractor cannot override the LCB code-extraction strategy, inconsistent with how LiveCodeBenchScorer handles the same concern via its injected extractor. At minimum, add a comment explaining why the extractor is bypassed here; ideally respect self.extractor when set.

text_correct = 0
text_n = 0
text_complete = True
for sub, d in per_dataset.items():
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] Customised lcb_subset key may not match the runner's emitted key, causing text-subset mis-classification.

if sub == self.lcb_subset: continue relies on self.lcb_subset matching the key the runner writes to per_dataset. If lcb_subset is set to a non-default value (e.g. "livecodebench_v3") while the runner still emits "livecodebench", the runner's entry passes the guard and is treated as a text subset. Its exact_match=None sets text_complete = False, marking the combined score incomplete even when all text subsets graded successfully. Consider making the skip-condition match on the actual runner entry key, or validate at construction that lcb_subset matches the runner's expected output.

@arekay-nv
Copy link
Copy Markdown
Collaborator Author

Review Council — Multi-AI Code Review

Reviewed by: Claude | Depth: thorough | Codex: unavailable (workspace auth error)

Found 6 issues across 2 files: 1 high · 3 medium · 2 low

🔴 Must Fix (high)

# File Line Category Summary
1 src/inference_endpoint/evaluation/scoring.py 1588 data-integrity Partial text-only score returned as float when LCB container unreachable — callers cannot distinguish it from a full 5-subset score

🟡 Should Fix (medium)

# File Line Category Summary
2 src/inference_endpoint/evaluation/scoring.py 1519 bug Float round-trip round(em / 100.0 * n) may reconstruct k±1 correct count after JSON serialization
3 src/inference_endpoint/evaluation/scoring.py 1433 bug sum(sum(code_passed)) raises TypeError on malformed WebSocket response instead of returning None
4 examples/10_DeepSeekR1_Example/accuracy/deepseek_eval_runner.py 61 security sys.path.insert(0, eval_dir) with user-controlled --eval-dir enables arbitrary code execution

🔵 Consider (low)

# File Line Category Summary
5 src/inference_endpoint/evaluation/scoring.py 1265 design REQUIRES_EXTRACTOR=False but LCB extraction hardcodes PythonCodeExtractoraccuracy_config.extractor cannot override it
6 src/inference_endpoint/evaluation/scoring.py 1511 bug Customised lcb_subset key mismatch with runner output marks combined score incomplete when all text subsets succeed

Copy link
Copy Markdown
Collaborator

@nv-alicheng nv-alicheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — Multi-AI Code Review (thorough)

Reviewed by Codex + Claude. Posting 6 inline findings that are not already covered by the existing automated review on this PR (the headline partial-score issue at scoring.py:1588, the rounding round-trip at 1519, the sys.path RCE at deepseek_eval_runner.py:61, and the temp-file leak at run_client.sh:43 are already flagged — see the synthesis comment for how they fit together).


model_params:
# Served model id = the checkpoint dir basename trtllm-serve registers.
name: "${SERVED_MODEL_NAME:-deepseek_r1-torch-fp4-v2}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex] medium · bug: The smoke config defaults model_params.name to deepseek_r1-torch-fp4-v2, but this PR's own docs state that the -v2 checkpoint does not load with the stock TRT-LLM builds and that deepseek_r1-torch-fp4 is the working default. A user who runs the smoke path without exporting SERVED_MODEL_NAME will send requests for a model id the server can't serve, and the run fails immediately — i.e. the advertised quick-validation path is broken out of the box. Default to deepseek_r1-torch-fp4.

`extras` are forwarded to the scorer. **`lcb_websocket_port` defaults to 13835**,
so if the `lcb_serve` container is reachable on the client node the livecodebench
subset is scored **in-run** (one number, no follow-up job); if it's unreachable
the scorer logs and falls back to grading LCB in-process. The LCB keys are
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium · documentation: This sentence says that when the lcb_serve container is unreachable "the scorer logs and falls back to grading LCB in-process," but that is not what the code does for the common case (port configured, container down): _score_lcb_via_container returns None and the scorer leaves livecodebench unscored — see scoring.py:1527-1543 ("Do NOT re-run the in-process executor"). In-process grading happens only when lcb_websocket_port is null (the use_container == False path). README lines 201-202 below correctly describe the unscored/complete: false behavior, so the doc contradicts both itself and the implementation. A reader following this line will expect an LCB score that never appears.

monkeypatch.setattr(scoring_mod.subprocess, "run", fake_run)
return calls

def _scorer(self, dataset, staged, project):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium · testing: The container tests assert complete in the sidecar *_results.json (test_container_unreachable_leaves_lcb_unscored, test_failed_text_subset_marks_incomplete) but never assert anything about the returned headline scorescore, n_repeats = ...score() is captured and ignored. The actual data-integrity gap (already flagged at scoring.py:1588: a partial combined is returned and stored in results.json with no completeness marker) is therefore not guarded by any test. Two further branches are untested: the no-container path with LCB rows present (scoring.py:1483 reached with n_lcb>0 but lcb_websocket_port=None), and the LCB count-divergence warning (scoring.py:1562). _scorer always passes lcb_websocket_port=13835.

ws = websocket.create_connection(
url, timeout=7200, ping_interval=30, ping_timeout=10
)
except (ConnectionRefusedError, OSError, Exception) as e: # noqa: BLE001
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low · error-handling: except (ConnectionRefusedError, OSError, Exception) as e:Exception already subsumes ConnectionRefusedError and OSError, so the tuple is misleading and, more importantly, this handler now treats any exception (including programming errors like AttributeError/TypeError raised inside create_connection usage) as a benign "connect failed" and returns None. Catch the specific network/protocol exceptions, or use a bare except Exception with the explanatory comment and let it be honest about what it swallows.

)
args = ap.parse_args()

df = pd.read_pickle(args.source)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low · security: df = pd.read_pickle(args.source) deserializes an arbitrary path supplied via --source / $DEEPSEEK_R1_DATASET_PKL. pickle load executes arbitrary code, so pointing this at an untrusted .pkl is RCE. The official MLCommons dataset is the expected input and this is an operator-run script, so severity is low — but a one-line caveat in the script/docstring ("only load the official MLCommons .pkl; pickle deserialization executes arbitrary code") would be prudent since there is currently no such warning.


@property
def harmonizer(self) -> Harmonizer:
if self._harmonizer is None:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low · design: The harmonizer lazy property is not thread-safe: if self._harmonizer is None: self._harmonizer = Harmonizer(...) lets two threads both observe None and construct two Harmonizer instances. This is benign today because dataset transforms run single-threaded, but the lazy-init pattern silently bakes in that assumption (the previous eager construction made the attribute set-once). If a transform is ever shared across a thread pool (e.g. df.apply parallelism), this becomes a real double-construction race. Flagging so the single-threaded assumption is intentional, not accidental.

@nv-alicheng
Copy link
Copy Markdown
Collaborator

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.4) + Claude | Depth: thorough

Posted 6 inline findings. This PR already carries an automated review with several solid catches, so the council deliberately did not re-post issues that bot already covers — instead the synthesis below shows how everything fits together. 1 additional latent issue is out of diff scope and noted at the bottom.

🔴 Must Fix (critical/high)

The single most important issue is already flagged by the existing review and is not re-posted:

File Line Category Reviewer(s) Summary
evaluation/scoring.py 1588 data-integrity Both (already posted) score() returns the partial combined even when complete=False; finalize_benchmark (execute.py:976-985) stores only the score float in results.json with no completeness marker, so a run that left LiveCodeBench unscored (container down) or had a failed text subset publishes a partial number as the headline 5-subset exact_match. The Scorer.score() contract has no channel for complete. Fix: return None when incomplete, or propagate complete into accuracy_scores.

🟡 Should Fix (medium)

# File Line Category Reviewer(s) Summary
1 examples/.../offline_deepseek_r1_accuracy_smoke.yaml 12 bug Codex Default model deepseek_r1-torch-fp4-v2 is documented in this same PR as not loadable with stock TRT-LLM; smoke path fails out of the box without SERVED_MODEL_NAME. Default to deepseek_r1-torch-fp4.
2 examples/10_DeepSeekR1_Example/README.md 186 documentation Claude Self-contradiction: claims the scorer "falls back to grading LCB in-process" when the container is unreachable, but for the configured-port-but-down case the code leaves LCB unscored (scoring.py:1527-1543); README:201-202 itself says complete: false.
3 tests/unit/evaluation/test_deepseek_r1_scorer.py 307 testing Claude Container tests assert sidecar complete but never assert the returned headline score is gated when incomplete (the scoring.py:1588 gap is untested); the no-container+LCB-rows branch (scoring.py:1483, port None) and the count-divergence warning (scoring.py:1562) are also untested.

🔵 Consider (low)

# File Line Category Reviewer(s) Summary
4 evaluation/scoring.py 338 error-handling Claude except (ConnectionRefusedError, OSError, Exception)Exception subsumes the others and swallows programming errors as "connect failed".
5 examples/.../prepare_dataset.py 75 security Claude pd.read_pickle(args.source) on an env/CLI-supplied path is arbitrary-code-execution on load; add a "official MLCommons pkl only" caveat.
6 dataset_manager/transforms.py 192 design Claude Lazy harmonizer property has a double-construct race if ever shared across threads; benign today (single-threaded transforms) but the assumption is now implicit.

Already flagged by the existing bot review (not re-posted)

scoring.py:1588 (partial score, above) · scoring.py:1519 (float round-trip → wrong count) · scoring.py:1518 (NaN em crashes round()) · scoring.py:1433 (unvalidated WS response) · deepseek_eval_runner.py:61 (sys.path.insert RCE) · deepseek_eval_runner.py:109 (trust_remote_code) · run_client.sh:43 (temp-file leak) · scoring.py:1265/1511 (extractor / lcb_subset key).

⚠️ Out-of-diff advisory (not posted inline — unchanged line)

evaluation/scoring.py:583 references module inference_endpoint.dataset_manager.predefined.livecodebench.lcb_serve, but lcb_serve.py actually lives at inference_endpoint.evaluation.livecodebench.lcb_serve (the predefined/livecodebench/ package contains only presets.py/__init__.py). The subprocess fallback would fail with ModuleNotFoundError. Pre-existing, so out of this PR's scope, but worth a follow-up since the new LCB work documents a subprocess fallback story.


Commit hygiene: 6 commits, 0 fixups — fine. Build check skipped: only the example subproject pyproject.toml changed (isolated uv project), root package build is unaffected.

Copy link
Copy Markdown
Collaborator

@nvzhihanj nvzhihanj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — Multi-AI Code Review (second pass)

Reviewed by: Codex (--yolo) + Claude · Depth: thorough

This PR already has a prior council pass + author self-review covering the high/medium issues. This pass posts only net-new findings not already commented; see the summary for the confirmed/deduped set.

tail -n 20 "${RESULTS}.raw" >&2
exit 1
fi
echo "Fold into the aggregate: full_exact_match = (text_subset_correct + LCB passed_samples) / 4388."
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium · bug — The fold-in hint hardcodes the full-set denominator:

full_exact_match = (text_subset_correct + LCB passed_samples) / 4388.

But OUTPUTS_PARQUET is user-overridable and the README documents ~385-sample subset runs via offline_deepseek_r1_accuracy_subset.yaml. A user who scores LCB for a subset run and follows this formula divides by 4388 instead of their actual sample count, silently understating accuracy by ~10×. Derive the denominator from the run's actual total sample count (e.g. len(df) of the outputs parquet) rather than hardcoding 4388.

# trtllm-serve registers the model under the checkpoint dir's basename; the
# /v1/completions `model` field (model_params.name in the YAML) must match it.
SERVED_MODEL_NAME="${SERVED_MODEL_NAME:-$(basename "${MODEL_DIR}")}"
SOURCE_PKL="${SOURCE_PKL:?Set SOURCE_PKL to the MLPerf DeepSeek-R1 source dataset .pkl}"
Copy link
Copy Markdown
Collaborator

@nvzhihanj nvzhihanj Jun 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex] medium · bug — SOURCE_PKL="${SOURCE_PKL:?…}" aborts the launch unconditionally when SOURCE_PKL is unset. But in the documented pre-prepared-parquet / SERVER_ONLY=1 workflow the source pickle isn't needed — the later if [ ! -f $PARQUET ] check is meant to skip conversion. This :? expansion fires first, so a run with data/deepseek_r1_eval.parquet already present fails even though no conversion is required. Only require (and mount) SOURCE_PKL when the parquet is missing.

`<report_dir>/deepseek_eval/<name>_outputs.parquet`, then runs:

```bash
uv run --project <this dir> python deepseek_eval_runner.py \
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low · documentation — This documented invocation omits --external-subsets:

uv run --project <this dir> python deepseek_eval_runner.py --input … --output … --tokenizer …

In the default (container-reachable) path DeepSeekR1Scorer._run_eval_subprocess passes --external-subsets livecodebench (scoring.py:1500), which changes the runner's behavior (LCB tokenized but not graded, status=external). The documented command matches only the no-container fallback — note both invocation modes so readers understand when --external-subsets applies.

@nvzhihanj
Copy link
Copy Markdown
Collaborator

Review Council — Summary (second pass)

Reviewed by: Codex (--yolo) + Claude · Depth: thorough

This PR already has a prior review-council pass + the author's self-review + gemini/github-code-quality bot comments. To avoid duplicate noise, this pass posts only net-new findings and records the dedup explicitly.

🟡 Net-new — posted inline

# File Line Severity Category Reviewer Summary
A examples/10_DeepSeekR1_Example/score_livecodebench.sh 77 medium bug Claude Fold-in hint hardcodes / 4388; README supports ~385-sample subset runs → wrong denominator, understates accuracy ~10×.
B examples/10_DeepSeekR1_Example/launch_and_run.sh 67 medium bug Codex SOURCE_PKL:? aborts unconditionally, so SERVER_ONLY=1 / pre-prepared-parquet runs fail before the [ ! -f $PARQUET ] skip.
E examples/10_DeepSeekR1_Example/accuracy/RUNBOOK.md 90 low documentation Claude Documented deepseek_eval_runner.py invocation omits --external-subsets livecodebench, which the container path actually passes.

🔵 Net-new — out-of-diff (can't anchor inline; noted here)

# File:Line Severity Category Reviewer Summary
C src/inference_endpoint/evaluation/scoring.py:206 low bug Claude Base Scorer.score() does len(scores) // self.dataset.num_samples() with no div-by-zero guard, while the PR's new scorers (DeepSeekR1Scorer, VBenchScorer) all guard with if num_samples else 0. Empty dataset → ZeroDivisionError. Pre-existing; the inconsistency is newly visible.
D src/inference_endpoint/config/schema.py:353 low api-contract Claude AccuracyConfig.extras is forwarded verbatim as **eval_cfg.extras into the scorer ctor (execute.py:981). A typo'd key (e.g. lcb_websocket_prt) surfaces as TypeError: unexpected keyword argument only after a multi-hour run, not at config-load time. (Field is pre-existing infra; this PR is its first heavy user — subset_column/question_column/lcb_websocket_port.) Consider an unknown-key check.

✅ Confirmed but already commented (deduped — not re-posted)

The council independently reproduced these already-raised findings (cross-validation, not new):

  • Partial DeepSeek score reported as complete — author [HIGH] scoring.py:1588 (also names the execute.py complete-field fix); gemini [high] :1518 (the round(NaN) crash variant). Flagged by both Codex (P1) and Claude here.
  • Round-trip count fragility scoring.py:1519 — author [MEDIUM].
  • lcb_subset key mismatch / subset mis-classification scoring.py:1511 — author [LOW].
  • Test gaps (count-divergence + no-container-with-LCB branches) — prior council [Claude] test_deepseek_r1_scorer.py:307.
  • Smoke config -v2 model default offline_deepseek_r1_accuracy_smoke.yaml:12 — prior council [Codex].
  • Redundant except (…, Exception) scoring.py:338 — prior council [Claude].

Codex review ran via codex --yolo (now working in this environment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants