feat : deepseek r1 accuracy support by arekay-nv · Pull Request #340 · mlcommons/endpoints

arekay-nv · 2026-06-05T21:21:26Z

What does this PR do?

Adds accuracy support for Deepseek-r1 using an ensemble evaluator that will produce a combined score.
Needs #335 to push the container image.

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

github-actions · 2026-06-05T21:21:36Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

+        monkeypatch.setattr(
+            scoring_mod, "_lcb_ws_evaluate", lambda url, codes, timeout: None
+        )
+        score, n_repeats = self._scorer(dataset, staged, project).score()


+        monkeypatch.setattr(
+            scoring_mod, "_lcb_ws_evaluate", lambda url, codes, timeout: None
+        )
+        score, n_repeats = self._scorer(dataset, staged, project).score()


gemini-code-assist

Code Review

This pull request introduces an end-to-end example and evaluation pipeline for DeepSeek-R1 (FP4, ModelOpt) on TensorRT-LLM, including an isolated subproject for the MLPerf accuracy evaluator and a new DeepSeekR1Scorer that integrates with a WebSocket-based sandbox for LiveCodeBench. It also optimizes dataset transforms by lazily initializing the Harmonizer to prevent offline loading failures. The review feedback provides critical improvements, including guarding against a potential NaN float conversion crash in the scorer, addressing a security risk by disabling trust_remote_code during tokenizer loading, and ensuring temporary files are cleaned up in the client execution script.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-05T21:23:16Z

+            if em is None:  # subset failed to grade (status != "ok")
+                text_complete = False
+                continue


If a text subset fails to grade or has no valid samples, its exact_match (em) value in the results can be NaN (represented as nan in Python). Checking only em is None is insufficient, as round(em / 100.0 * n) will raise a ValueError: cannot convert float NaN to integer when em is nan, crashing the finalization phase of the benchmark. Guarding against NaN using np.isnan(em) prevents this crash.

Suggested change

if em is None: # subset failed to grade (status != "ok")

text_complete = False

continue

if em is None or np.isnan(em): # subset failed to grade (status != "ok")

text_complete = False

continue

gemini-code-assist · 2026-06-05T21:23:16Z

+    # basis for tokens_per_sample. Count with the DeepSeek tokenizer so the
+    # number matches MLPerf token accounting.
+    logger.info("Loading tokenizer: %s", args.tokenizer)
+    tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)


Using trust_remote_code=True when loading a tokenizer from an arbitrary or user-specified path poses a security risk, as it allows the execution of arbitrary code present in the model repository. Since DeepSeek-R1 uses a standard tokenizer architecture that does not require custom code execution, it is safer to set trust_remote_code=False.

Suggested change

tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)

tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=False)

gemini-code-assist · 2026-06-05T21:23:16Z

+
+# from-config takes the endpoint from the YAML, so render a run copy with this
+# server's endpoint substituted in.
+RUN_YAML="$(mktemp /tmp/dsr1_run_XXXX.yaml)"


The temporary YAML file RUN_YAML created in /tmp/ is never cleaned up, which can clutter the system's temporary directory over multiple runs. Adding an EXIT trap ensures that the temporary file is automatically deleted when the script terminates.

Suggested change

RUN_YAML="$(mktemp /tmp/dsr1_run_XXXX.yaml)"

RUN_YAML="$(mktemp /tmp/dsr1_run_XXXX.yaml)"

trap 'rm -f "${RUN_YAML}"' EXIT

arekay-nv

Review Council — Claude Code Review

Found 6 issues (Codex review unavailable — workspace auth error; Claude-only).

arekay-nv · 2026-06-05T21:34:24Z

+
+        if combined is None:
+            return None, n_repeats
+        return float(combined), n_repeats


[HIGH] Partial score returned without caller-visible flag when LCB is unreachable.

score() returns float(combined) here even when the LCB container was unreachable and only text subsets were graded. The caller in execute.py:913 stores this as "score": score in results.json with no indication of partial coverage. The complete=False flag lives only in the separate deepseek_eval/*_results.json artifact — consumers of results.json cannot distinguish a full 5-subset score from a text-only 4-subset score.

Consider returning None when not lcb_ok (forcing callers to treat it as unscored), or adding a complete field to the accuracy_scores dict in execute.py so the incompleteness is surfaced in the top-level report.

arekay-nv · 2026-06-05T21:34:24Z

+            if em is None:  # subset failed to grade (status != "ok")
+                text_complete = False
+                continue
+            text_correct += round(em / 100.0 * n)


[MEDIUM] Floating-point round-trip may reconstruct incorrect sample count.

text_correct += round(em / 100.0 * n) recovers a per-subset correct count by inverting the runner's em = 100.0 * k / n formula. This round-trips through JSON serialization in the subprocess, where float precision may cause round() to yield k±1 for certain (k, n) pairs. A more reliable fix is to have the runner also emit "correct_samples": k in each per_dataset entry so the aggregation is exact integer arithmetic.

arekay-nv · 2026-06-05T21:34:24Z

+                total_samples,
+            )
+            return None
+        passed = sum(sum(code_passed) for code_passed in per_problem.values())


[MEDIUM] Unvalidated WebSocket response structure — malformed response raises TypeError instead of returning None.

sum(sum(code_passed) for code_passed in per_problem.values()) assumes every value in per_problem is an iterable of booleans. If the lcb-service returns a non-list for any question key (e.g. a scalar or dict), sum(code_passed) raises TypeError that propagates unhandled out of _score_lcb_via_container, causing score() to raise rather than fall back gracefully via None. Wrap this block in a try/except (TypeError, ValueError) or validate the schema of per_problem before iterating.

arekay-nv · 2026-06-05T21:34:24Z

+        raise FileNotFoundError(
+            f"eval_accuracy.py not found at {eval_dir}. Run setup_eval.sh first."
+        )
+    sys.path.insert(0, str(eval_dir))


[MEDIUM] sys.path.insert(0, ...) with a user-supplied --eval-dir allows arbitrary code execution.

sys.path.insert(0, str(eval_dir)) followed by import eval_accuracy executes whichever eval_accuracy.py is found first on sys.path. Because --eval-dir is an exposed CLI argument, pointing it at a directory containing a malicious eval_accuracy.py leads to arbitrary code execution in the runner subprocess. Inserting at index 0 also permanently pollutes sys.path for the process lifetime.

Replace with importlib.util.spec_from_file_location loading from the explicit absolute path eval_dir / 'eval_accuracy.py', which avoids both the path-injection risk and the sys.path side-effect.

arekay-nv · 2026-06-05T21:34:24Z

+    (default ``question``); both are passed through ``accuracy_config.extras``.
+    """
+
+    REQUIRES_EXTRACTOR: ClassVar[bool] = False


[LOW] REQUIRES_EXTRACTOR = False but LCB extraction is hardcoded inside _score_lcb_via_container.

Setting REQUIRES_EXTRACTOR = False prevents the scorer framework from injecting an extractor, but _score_lcb_via_container hardcodes PythonCodeExtractor.extract directly. This means accuracy_config.extractor cannot override the LCB code-extraction strategy, inconsistent with how LiveCodeBenchScorer handles the same concern via its injected extractor. At minimum, add a comment explaining why the extractor is bypassed here; ideally respect self.extractor when set.

arekay-nv · 2026-06-05T21:34:24Z

+        text_correct = 0
+        text_n = 0
+        text_complete = True
+        for sub, d in per_dataset.items():


[LOW] Customised lcb_subset key may not match the runner's emitted key, causing text-subset mis-classification.

if sub == self.lcb_subset: continue relies on self.lcb_subset matching the key the runner writes to per_dataset. If lcb_subset is set to a non-default value (e.g. "livecodebench_v3") while the runner still emits "livecodebench", the runner's entry passes the guard and is treated as a text subset. Its exact_match=None sets text_complete = False, marking the combined score incomplete even when all text subsets graded successfully. Consider making the skip-condition match on the actual runner entry key, or validate at construction that lcb_subset matches the runner's expected output.

arekay-nv · 2026-06-05T21:35:11Z

Review Council — Multi-AI Code Review

Reviewed by: Claude | Depth: thorough | Codex: unavailable (workspace auth error)

Found 6 issues across 2 files: 1 high · 3 medium · 2 low

🔴 Must Fix (high)

#	File	Line	Category	Summary
1	`src/inference_endpoint/evaluation/scoring.py`	1588	data-integrity	Partial text-only score returned as `float` when LCB container unreachable — callers cannot distinguish it from a full 5-subset score

🟡 Should Fix (medium)

#	File	Line	Category	Summary
2	`src/inference_endpoint/evaluation/scoring.py`	1519	bug	Float round-trip `round(em / 100.0 * n)` may reconstruct k±1 correct count after JSON serialization
3	`src/inference_endpoint/evaluation/scoring.py`	1433	bug	`sum(sum(code_passed))` raises `TypeError` on malformed WebSocket response instead of returning `None`
4	`examples/10_DeepSeekR1_Example/accuracy/deepseek_eval_runner.py`	61	security	`sys.path.insert(0, eval_dir)` with user-controlled `--eval-dir` enables arbitrary code execution

🔵 Consider (low)

#	File	Line	Category	Summary
5	`src/inference_endpoint/evaluation/scoring.py`	1265	design	`REQUIRES_EXTRACTOR=False` but LCB extraction hardcodes `PythonCodeExtractor` — `accuracy_config.extractor` cannot override it
6	`src/inference_endpoint/evaluation/scoring.py`	1511	bug	Customised `lcb_subset` key mismatch with runner output marks combined score incomplete when all text subsets succeed

nv-alicheng

Review Council — Multi-AI Code Review (thorough)

Reviewed by Codex + Claude. Posting 6 inline findings that are not already covered by the existing automated review on this PR (the headline partial-score issue at scoring.py:1588, the rounding round-trip at 1519, the sys.path RCE at deepseek_eval_runner.py:61, and the temp-file leak at run_client.sh:43 are already flagged — see the synthesis comment for how they fit together).

nv-alicheng · 2026-06-05T23:02:13Z

+
+model_params:
+  # Served model id = the checkpoint dir basename trtllm-serve registers.
+  name: "${SERVED_MODEL_NAME:-deepseek_r1-torch-fp4-v2}"


[Codex] medium · bug: The smoke config defaults model_params.name to deepseek_r1-torch-fp4-v2, but this PR's own docs state that the -v2 checkpoint does not load with the stock TRT-LLM builds and that deepseek_r1-torch-fp4 is the working default. A user who runs the smoke path without exporting SERVED_MODEL_NAME will send requests for a model id the server can't serve, and the run fails immediately — i.e. the advertised quick-validation path is broken out of the box. Default to deepseek_r1-torch-fp4.

nv-alicheng · 2026-06-05T23:02:13Z

+`extras` are forwarded to the scorer. **`lcb_websocket_port` defaults to 13835**,
+so if the `lcb_serve` container is reachable on the client node the livecodebench
+subset is scored **in-run** (one number, no follow-up job); if it's unreachable
+the scorer logs and falls back to grading LCB in-process. The LCB keys are


[Claude] medium · documentation: This sentence says that when the lcb_serve container is unreachable "the scorer logs and falls back to grading LCB in-process," but that is not what the code does for the common case (port configured, container down): _score_lcb_via_container returns None and the scorer leaves livecodebench unscored — see scoring.py:1527-1543 ("Do NOT re-run the in-process executor"). In-process grading happens only when lcb_websocket_port is null (the use_container == False path). README lines 201-202 below correctly describe the unscored/complete: false behavior, so the doc contradicts both itself and the implementation. A reader following this line will expect an LCB score that never appears.

nv-alicheng · 2026-06-05T23:02:13Z

+        monkeypatch.setattr(scoring_mod.subprocess, "run", fake_run)
+        return calls
+
+    def _scorer(self, dataset, staged, project):


[Claude] medium · testing: The container tests assert complete in the sidecar *_results.json (test_container_unreachable_leaves_lcb_unscored, test_failed_text_subset_marks_incomplete) but never assert anything about the returned headline score — score, n_repeats = ...score() is captured and ignored. The actual data-integrity gap (already flagged at scoring.py:1588: a partial combined is returned and stored in results.json with no completeness marker) is therefore not guarded by any test. Two further branches are untested: the no-container path with LCB rows present (scoring.py:1483 reached with n_lcb>0 but lcb_websocket_port=None), and the LCB count-divergence warning (scoring.py:1562). _scorer always passes lcb_websocket_port=13835.

nv-alicheng · 2026-06-05T23:02:13Z

+        ws = websocket.create_connection(
+            url, timeout=7200, ping_interval=30, ping_timeout=10
+        )
+    except (ConnectionRefusedError, OSError, Exception) as e:  # noqa: BLE001


[Claude] low · error-handling: except (ConnectionRefusedError, OSError, Exception) as e: — Exception already subsumes ConnectionRefusedError and OSError, so the tuple is misleading and, more importantly, this handler now treats any exception (including programming errors like AttributeError/TypeError raised inside create_connection usage) as a benign "connect failed" and returns None. Catch the specific network/protocol exceptions, or use a bare except Exception with the explanatory comment and let it be honest about what it swallows.

nv-alicheng · 2026-06-05T23:02:13Z

+    )
+    args = ap.parse_args()
+
+    df = pd.read_pickle(args.source)


[Claude] low · security: df = pd.read_pickle(args.source) deserializes an arbitrary path supplied via --source / $DEEPSEEK_R1_DATASET_PKL. pickle load executes arbitrary code, so pointing this at an untrusted .pkl is RCE. The official MLCommons dataset is the expected input and this is an operator-run script, so severity is low — but a one-line caveat in the script/docstring ("only load the official MLCommons .pkl; pickle deserialization executes arbitrary code") would be prudent since there is currently no such warning.

nv-alicheng · 2026-06-05T23:02:13Z

+
+    @property
+    def harmonizer(self) -> Harmonizer:
+        if self._harmonizer is None:


[Claude] low · design: The harmonizer lazy property is not thread-safe: if self._harmonizer is None: self._harmonizer = Harmonizer(...) lets two threads both observe None and construct two Harmonizer instances. This is benign today because dataset transforms run single-threaded, but the lazy-init pattern silently bakes in that assumption (the previous eager construction made the attribute set-once). If a transform is ever shared across a thread pool (e.g. df.apply parallelism), this becomes a real double-construction race. Flagging so the single-threaded assumption is intentional, not accidental.

nv-alicheng · 2026-06-05T23:07:03Z

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.4) + Claude | Depth: thorough

Posted 6 inline findings. This PR already carries an automated review with several solid catches, so the council deliberately did not re-post issues that bot already covers — instead the synthesis below shows how everything fits together. 1 additional latent issue is out of diff scope and noted at the bottom.

🔴 Must Fix (critical/high)

The single most important issue is already flagged by the existing review and is not re-posted:

File	Line	Category	Reviewer(s)	Summary
`evaluation/scoring.py`	1588	data-integrity	Both (already posted)	`score()` returns the partial `combined` even when `complete=False`; `finalize_benchmark` (`execute.py:976-985`) stores only the `score` float in `results.json` with no completeness marker, so a run that left LiveCodeBench unscored (container down) or had a failed text subset publishes a partial number as the headline 5-subset `exact_match`. The `Scorer.score()` contract has no channel for `complete`. Fix: return `None` when incomplete, or propagate `complete` into `accuracy_scores`.

🟡 Should Fix (medium)

#	File	Line	Category	Reviewer(s)	Summary
1	`examples/.../offline_deepseek_r1_accuracy_smoke.yaml`	12	bug	Codex	Default model `deepseek_r1-torch-fp4-v2` is documented in this same PR as not loadable with stock TRT-LLM; smoke path fails out of the box without `SERVED_MODEL_NAME`. Default to `deepseek_r1-torch-fp4`.
2	`examples/10_DeepSeekR1_Example/README.md`	186	documentation	Claude	Self-contradiction: claims the scorer "falls back to grading LCB in-process" when the container is unreachable, but for the configured-port-but-down case the code leaves LCB unscored (`scoring.py:1527-1543`); README:201-202 itself says `complete: false`.
3	`tests/unit/evaluation/test_deepseek_r1_scorer.py`	307	testing	Claude	Container tests assert sidecar `complete` but never assert the returned headline `score` is gated when incomplete (the `scoring.py:1588` gap is untested); the no-container+LCB-rows branch (`scoring.py:1483`, port `None`) and the count-divergence warning (`scoring.py:1562`) are also untested.

🔵 Consider (low)

#	File	Line	Category	Reviewer(s)	Summary
4	`evaluation/scoring.py`	338	error-handling	Claude	`except (ConnectionRefusedError, OSError, Exception)` — `Exception` subsumes the others and swallows programming errors as "connect failed".
5	`examples/.../prepare_dataset.py`	75	security	Claude	`pd.read_pickle(args.source)` on an env/CLI-supplied path is arbitrary-code-execution on load; add a "official MLCommons pkl only" caveat.
6	`dataset_manager/transforms.py`	192	design	Claude	Lazy `harmonizer` property has a double-construct race if ever shared across threads; benign today (single-threaded transforms) but the assumption is now implicit.

Already flagged by the existing bot review (not re-posted)

scoring.py:1588 (partial score, above) · scoring.py:1519 (float round-trip → wrong count) · scoring.py:1518 (NaN em crashes round()) · scoring.py:1433 (unvalidated WS response) · deepseek_eval_runner.py:61 (sys.path.insert RCE) · deepseek_eval_runner.py:109 (trust_remote_code) · run_client.sh:43 (temp-file leak) · scoring.py:1265/1511 (extractor / lcb_subset key).

⚠️ Out-of-diff advisory (not posted inline — unchanged line)

evaluation/scoring.py:583 references module inference_endpoint.dataset_manager.predefined.livecodebench.lcb_serve, but lcb_serve.py actually lives at inference_endpoint.evaluation.livecodebench.lcb_serve (the predefined/livecodebench/ package contains only presets.py/__init__.py). The subprocess fallback would fail with ModuleNotFoundError. Pre-existing, so out of this PR's scope, but worth a follow-up since the new LCB work documents a subprocess fallback story.

Commit hygiene: 6 commits, 0 fixups — fine. Build check skipped: only the example subproject pyproject.toml changed (isolated uv project), root package build is unaffected.

nvzhihanj

Review Council — Multi-AI Code Review (second pass)

Reviewed by: Codex (--yolo) + Claude · Depth: thorough

This PR already has a prior council pass + author self-review covering the high/medium issues. This pass posts only net-new findings not already commented; see the summary for the confirmed/deduped set.

nvzhihanj · 2026-06-05T23:49:37Z

+  tail -n 20 "${RESULTS}.raw" >&2
+  exit 1
+fi
+echo "Fold into the aggregate: full_exact_match = (text_subset_correct + LCB passed_samples) / 4388."


[Claude] medium · bug — The fold-in hint hardcodes the full-set denominator:

full_exact_match = (text_subset_correct + LCB passed_samples) / 4388.

But OUTPUTS_PARQUET is user-overridable and the README documents ~385-sample subset runs via offline_deepseek_r1_accuracy_subset.yaml. A user who scores LCB for a subset run and follows this formula divides by 4388 instead of their actual sample count, silently understating accuracy by ~10×. Derive the denominator from the run's actual total sample count (e.g. len(df) of the outputs parquet) rather than hardcoding 4388.

nvzhihanj · 2026-06-05T23:49:37Z

+# trtllm-serve registers the model under the checkpoint dir's basename; the
+# /v1/completions `model` field (model_params.name in the YAML) must match it.
+SERVED_MODEL_NAME="${SERVED_MODEL_NAME:-$(basename "${MODEL_DIR}")}"
+SOURCE_PKL="${SOURCE_PKL:?Set SOURCE_PKL to the MLPerf DeepSeek-R1 source dataset .pkl}"


[Codex] medium · bug — SOURCE_PKL="${SOURCE_PKL:?…}" aborts the launch unconditionally when SOURCE_PKL is unset. But in the documented pre-prepared-parquet / SERVER_ONLY=1 workflow the source pickle isn't needed — the later if [ ! -f $PARQUET ] check is meant to skip conversion. This :? expansion fires first, so a run with data/deepseek_r1_eval.parquet already present fails even though no conversion is required. Only require (and mount) SOURCE_PKL when the parquet is missing.

nvzhihanj · 2026-06-05T23:49:37Z

+`<report_dir>/deepseek_eval/<name>_outputs.parquet`, then runs:
+
+```bash
+uv run --project <this dir> python deepseek_eval_runner.py \


[Claude] low · documentation — This documented invocation omits --external-subsets:

uv run --project <this dir> python deepseek_eval_runner.py --input … --output … --tokenizer …

In the default (container-reachable) path DeepSeekR1Scorer._run_eval_subprocess passes --external-subsets livecodebench (scoring.py:1500), which changes the runner's behavior (LCB tokenized but not graded, status=external). The documented command matches only the no-container fallback — note both invocation modes so readers understand when --external-subsets applies.

nvzhihanj · 2026-06-05T23:51:12Z

Review Council — Summary (second pass)

Reviewed by: Codex (--yolo) + Claude · Depth: thorough

This PR already has a prior review-council pass + the author's self-review + gemini/github-code-quality bot comments. To avoid duplicate noise, this pass posts only net-new findings and records the dedup explicitly.

🟡 Net-new — posted inline

#	File	Line	Severity	Category	Reviewer	Summary
A	`examples/10_DeepSeekR1_Example/score_livecodebench.sh`	77	medium	bug	Claude	Fold-in hint hardcodes `/ 4388`; README supports ~385-sample subset runs → wrong denominator, understates accuracy ~10×.
B	`examples/10_DeepSeekR1_Example/launch_and_run.sh`	67	medium	bug	Codex	`SOURCE_PKL:?` aborts unconditionally, so `SERVER_ONLY=1` / pre-prepared-parquet runs fail before the `[ ! -f $PARQUET ]` skip.
E	`examples/10_DeepSeekR1_Example/accuracy/RUNBOOK.md`	90	low	documentation	Claude	Documented `deepseek_eval_runner.py` invocation omits `--external-subsets livecodebench`, which the container path actually passes.

🔵 Net-new — out-of-diff (can't anchor inline; noted here)

#	File:Line	Severity	Category	Reviewer	Summary
C	`src/inference_endpoint/evaluation/scoring.py:206`	low	bug	Claude	Base `Scorer.score()` does `len(scores) // self.dataset.num_samples()` with no div-by-zero guard, while the PR's new scorers (`DeepSeekR1Scorer`, `VBenchScorer`) all guard with `if num_samples else 0`. Empty dataset → `ZeroDivisionError`. Pre-existing; the inconsistency is newly visible.
D	`src/inference_endpoint/config/schema.py:353`	low	api-contract	Claude	`AccuracyConfig.extras` is forwarded verbatim as `*eval_cfg.extras` into the scorer ctor (`execute.py:981`). A typo'd key (e.g. `lcb_websocket_prt`) surfaces as `TypeError: unexpected keyword argument` only after* a multi-hour run, not at config-load time. (Field is pre-existing infra; this PR is its first heavy user — `subset_column`/`question_column`/`lcb_websocket_port`.) Consider an unknown-key check.

✅ Confirmed but already commented (deduped — not re-posted)

The council independently reproduced these already-raised findings (cross-validation, not new):

Partial DeepSeek score reported as complete — author [HIGH] scoring.py:1588 (also names the execute.py complete-field fix); gemini [high] :1518 (the round(NaN) crash variant). Flagged by both Codex (P1) and Claude here.
Round-trip count fragility scoring.py:1519 — author [MEDIUM].
lcb_subset key mismatch / subset mis-classification scoring.py:1511 — author [LOW].
Test gaps (count-divergence + no-container-with-LCB branches) — prior council [Claude] test_deepseek_r1_scorer.py:307.
Smoke config -v2 model default offline_deepseek_r1_accuracy_smoke.yaml:12 — prior council [Codex].
Redundant except (…, Exception) scoring.py:338 — prior council [Claude].

Codex review ran via codex --yolo (now working in this environment).

arekay-nv added 5 commits June 2, 2026 21:59

DSR1 accuracy support

f66f039

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

Example files

cd8c210

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

updates.

a83a184

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

Update instr.

c4e44f7

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

Cleanup

7e3a083

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

arekay-nv requested a review from a team June 5, 2026 21:21

github-actions Bot requested a review from nvzhihanj June 5, 2026 21:21

github-code-quality Bot found potential problems Jun 5, 2026

View reviewed changes

gemini-code-assist Bot reviewed Jun 5, 2026

View reviewed changes

arekay-nv commented Jun 5, 2026

View reviewed changes

arekay-nv added 2 commits June 5, 2026 16:44

Merge branch 'main' into arekay/deepseek_r1_accuracy

374a972

Merge branch 'main' into arekay/deepseek_r1_accuracy

eace2b4

nv-alicheng reviewed Jun 5, 2026

View reviewed changes

nvzhihanj reviewed Jun 5, 2026

View reviewed changes

Merge branch 'main' into arekay/deepseek_r1_accuracy

619ecb1

	tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)
	tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=False)

	RUN_YAML="$(mktemp /tmp/dsr1_run_XXXX.yaml)"
	RUN_YAML="$(mktemp /tmp/dsr1_run_XXXX.yaml)"
	trap 'rm -f "${RUN_YAML}"' EXIT

Conversation

arekay-nv commented Jun 5, 2026

What does this PR do?

Type of change

Related issues

Testing

Checklist

Uh oh!

github-actions Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

arekay-nv left a comment

Choose a reason for hiding this comment

Review Council — Claude Code Review

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arekay-nv commented Jun 5, 2026

Review Council — Multi-AI Code Review

🔴 Must Fix (high)

🟡 Should Fix (medium)

🔵 Consider (low)

Uh oh!

nv-alicheng left a comment

Choose a reason for hiding this comment

Review Council — Multi-AI Code Review (thorough)

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nv-alicheng commented Jun 5, 2026

Review Council — Multi-AI Code Review

🔴 Must Fix (critical/high)

🟡 Should Fix (medium)

🔵 Consider (low)

Already flagged by the existing bot review (not re-posted)

⚠️ Out-of-diff advisory (not posted inline — unchanged line)

Uh oh!

nvzhihanj left a comment

Choose a reason for hiding this comment

Review Council — Multi-AI Code Review (second pass)

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nvzhihanj Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 5, 2026 •

edited

Loading

nvzhihanj Jun 5, 2026 •

edited

Loading