feat(multiturn): add kimi agentic benchmark controls#331
Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request introduces inline accuracy scoring for multi-turn performance replay logs, updates the multi-turn load strategy to support trajectory budgets, repeating conversations with unique logical IDs, cache salting, and performance tracking stops, and updates the associated configurations and documentation. The review feedback highlights two robustness issues in the new multi_turn_inline_accuracy.py file: potential KeyError or TypeError exceptions due to unsafe dictionary access on the turn field, and potential crashes when decoding malformed lines in the event log.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
arekay-nv
left a comment
There was a problem hiding this comment.
Review Council — Multi-AI Code Review
Reviewed by: Codex (gpt-5.4) + Claude · Depth: thorough
Posting 10 inline findings. See the pinned summary comment for the tiered breakdown.
| (ds for ds in ctx.config.datasets if ds.type == DatasetType.PERFORMANCE), | ||
| None, | ||
| ) | ||
| if ( |
There was a problem hiding this comment.
[Codex + Claude] 🔴 critical · bug: _inline_score_multi_turn_if_enabled() is called unconditionally for every MultiTurnDataset run from finalize_benchmark (line 954), but this guard raises InputValidationError whenever inline accuracy is not enabled — and multi_turn.inline_accuracy defaults to False.
The function name (..._if_enabled) and its return None for non-multi-turn datasets show the intent is to no-op when disabled. As written, every existing multi-turn benchmark that doesn't opt into inline accuracy now crashes at finalization, after the run has already completed — discarding results.
if (
perf_cfg is None
or perf_cfg.multi_turn is None
or not perf_cfg.multi_turn.inline_accuracy
or perf_cfg.path is None
):
raise InputValidationError(...)Fix: return None (skip scoring) when inline accuracy is disabled / path missing. Only raise if inline accuracy was explicitly requested but a prerequisite is missing (e.g. inline_accuracy: true but no path).
| return row | ||
|
|
||
|
|
||
| def score_report( |
There was a problem hiding this comment.
[Claude] 🟡 high · testing: This entire new ~533-line scorer module has zero test coverage — grep across tests/ finds no reference to score_report, inline_accuracy, or multi_turn_inline, and there is no tests/unit/evaluation/test_multi_turn_inline_accuracy.py. AGENTS.md targets >90% coverage for new code, and you specifically asked to ensure test coverage.
The untested logic is exactly the kind that's error-prone: shell-command canonicalization, tool-call merging (streamed vs complete), multiset IoU, repeat-set validity, and the valid/invalid_reason computation. Notably, the existing CodeRabbit comments on this same file already flag two robustness crashes (row["turn"] KeyError; unguarded msgspec.DecodeError on partial events.jsonl) — both in code paths that tests would have caught.
Fix: add a tests/unit/evaluation/test_multi_turn_inline_accuracy.py covering exe extraction, intent extraction, repeat-id splitting, multiset scoring, and the validity/invalid_reason branches (with @pytest.mark.unit).
| if getattr(self._thread_local, "tokenizer", None) is None: | ||
| self._thread_local.tokenizer = AutoTokenizer.from_pretrained( | ||
| self._tokenizer_name | ||
| self._tokenizer_name, trust_remote_code=True |
There was a problem hiding this comment.
[Codex] 🟡 medium · security: This PR flips tokenizer loading to trust_remote_code=True by default at two sites — here (the hot-path tokenizer pool) and commands/benchmark/execute.py:329 (_precompute_isl_for_multi_turn):
self._thread_local.tokenizer = AutoTokenizer.from_pretrained(
self._tokenizer_name, trust_remote_code=True
)trust_remote_code=True executes arbitrary Python from the model repo on load. Previously these paths only loaded standard tokenizer assets, so this broadens the security posture for every benchmark run — a mistyped or untrusted HF repo id now runs remote code. It's intentional and even has a test asserting the kwarg, but it's enabled globally rather than opt-in and is undocumented.
Suggestion: gate remote-code execution behind an explicit config flag (default False), or at minimum document the new default and its implication in the README so operators pointing at untrusted repos are aware.
| turn_timeout_s: float = Field(default=300.0, gt=0) | ||
| use_dataset_history: bool = True | ||
| turn_timeout_s: float = Field( | ||
| default=86400.0, |
There was a problem hiding this comment.
[Claude] 🟡 medium · documentation: turn_timeout_s's default was changed from 300.0 to 86400.0 (24 hours) and the field was removed from both the example YAML and the README in this PR:
turn_timeout_s: float = Field(
default=86400.0,
gt=0,A 24-hour default effectively disables the per-turn safety deadline — if a server hangs mid-turn, the run now blocks for a day instead of 5 minutes. That's a meaningful behavioral change to hide from the example config. use_dataset_history (default True) was likewise dropped from the docs.
Fix: document the new turn_timeout_s default and its implication in examples/09_MultiTurn/README.md, and either surface use_dataset_history or note why it's intentionally omitted.
| "multi_turn.inline_accuracy: true and a dataset path" | ||
| ) | ||
|
|
||
| from inference_endpoint.evaluation.multi_turn_inline_accuracy import score_report |
There was a problem hiding this comment.
[Claude] 🟡 medium · design: Lazy import inside a function violates the project's "No lazy imports" rule (AGENTS.md):
from inference_endpoint.evaluation.multi_turn_inline_accuracy import score_reportmulti_turn_inline_accuracy.py only imports from core.record / core.types, so there is no circular-import justification (the only sanctioned exceptions are circular-import avoidance with a comment, optional-dependency try/except, and sandboxing).
Fix: move this import to the top of the file alongside the other inference_endpoint.evaluation imports.
| if self._store_in_history: | ||
| sys_content = self._dataset_metadata.system_prompts_by_conv.get(source_id) | ||
| if sys_content is not None and self._enable_salt: | ||
| if sys_content.startswith("[salt: "): |
There was a problem hiding this comment.
[Claude] 🔵 low · bug: The store-in-history salt strip here only handles the [salt: prefix:
if sys_content.startswith("[salt: "):
marker_end = sys_content.find("]\n\n")but the sibling _messages_with_logical_id_salt (line ~472) strips both [salt: and the legacy [cache_salt: prefixes. If a system_prompts_by_conv value carries a [cache_salt: ...] marker, this path won't strip it and will then prepend a fresh [salt: ...], producing doubled/garbled markers in the rendered system prompt.
Fix: make both strip routines handle the same set of prefixes (factor out a shared helper).
| "valid": valid, | ||
| } | ||
| if not valid: | ||
| score_entry["invalid_reason"] = score_result["invalid_reason"] |
There was a problem hiding this comment.
[Claude] 🔵 low · error-handling: score_entry["invalid_reason"] = score_result["invalid_reason"] (and the mirror at line ~1039) rely on an unguarded cross-module dict contract: that score_report always populates invalid_reason when valid is False. That invariant currently holds, so this isn't a live bug — but a future change to score_report that returns valid=False without a reason would raise KeyError at finalization and discard an otherwise-successful run.
Suggestion: use score_result.get("invalid_reason", "") to make the contract failure-tolerant at this boundary.
| ) | ||
| result["invalid_reason"] = "; ".join(reasons) | ||
|
|
||
| if out_path is not None: |
There was a problem hiding this comment.
[Claude] 🔵 low · data-integrity: scores.json serializes the full per_turn list (one row per expected-turn × repeat) via a single in-memory json.dumps(result, indent=2). For the documented target (990 trajectories × many turns) this artifact can grow very large. The standalone main() already strips per_turn before printing.
Suggestion: gate the verbose per_turn dump (summary-only by default, full dump behind a flag) to bound the artifact size.
| path: /path/to/agentic_dataset.jsonl | ||
| - name: agentic_combined | ||
| type: performance # do not change. | ||
| path: <path-to-agentic-combined-jsonl> |
There was a problem hiding this comment.
[Claude] 🔵 low · documentation: path: <path-to-agentic-combined-jsonl> is a literal placeholder. Combined with inline_accuracy: true # do not change and num_trajectories_to_issue: 990, a user who copies this file and only edits the endpoint will hit the inline-accuracy finalization path with an unreadable ground-truth JSONL and fail (after the run).
Suggestion: use a clearer placeholder convention (e.g. /path/to/agentic_dataset.jsonl) plus a one-line note that this must be a real local/mounted file because inline accuracy reads it at finalization.
|
|
||
| @pytest.mark.unit | ||
| def test_multi_turn_rejects_removed_stop_on_first_empty_slot_config(self): | ||
| with pytest.raises(ValueError, match="stop_on_first_empty_slot"): |
There was a problem hiding this comment.
[Claude] 🔵 low · testing: test_multi_turn_rejects_removed_stop_on_first_empty_slot_config only passes because MultiTurnConfig uses extra="forbid" — it would pass for any unknown key, so it doesn't actually guard the removal of this specific field. It gives false confidence.
with pytest.raises(ValueError, match="stop_on_first_empty_slot"):Suggestion: either drop it (generic extra="forbid" behavior is already covered elsewhere) or, if guarding a real migration, document that intent in the test.
Review Council — Multi-AI Code ReviewReviewed by: Codex (gpt-5.4) + Claude · Depth: thorough Found 10 issues across 7 files. 10 inline comments posted on the relevant lines. 🔴 Must Fix (critical/high)Will cause incorrect behavior or data loss in production.
🟡 Should Fix (medium)Real issues / behavioral & posture changes that should be addressed.
🔵 Consider (low)Valid improvements / follow-ups.
On the three things you asked us to weight:
The critical finalization crash (#1) is the blocker — it breaks existing multi-turn runs regardless of this PR's new features. 🤖 Generated with Claude Code · Review Council |
What does this PR do?
Type of change
Related issues
Testing
Checklist