feat(multiturn): add kimi agentic benchmark controls by hvagadia · Pull Request #331 · mlcommons/endpoints

hvagadia · 2026-06-03T18:31:13Z

What does this PR do?

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

github-actions · 2026-06-03T18:31:29Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces inline accuracy scoring for multi-turn performance replay logs, updates the multi-turn load strategy to support trajectory budgets, repeating conversations with unique logical IDs, cache salting, and performance tracking stops, and updates the associated configurations and documentation. The review feedback highlights two robustness issues in the new multi_turn_inline_accuracy.py file: potential KeyError or TypeError exceptions due to unsafe dictionary access on the turn field, and potential crashes when decoding malformed lines in the event log.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

arekay-nv

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.4) + Claude · Depth: thorough

Posting 10 inline findings. See the pinned summary comment for the tiered breakdown.

arekay-nv · 2026-06-03T21:45:05Z

+        (ds for ds in ctx.config.datasets if ds.type == DatasetType.PERFORMANCE),
+        None,
+    )
+    if (


[Codex + Claude] 🔴 critical · bug: _inline_score_multi_turn_if_enabled() is called unconditionally for every MultiTurnDataset run from finalize_benchmark (line 954), but this guard raises InputValidationError whenever inline accuracy is not enabled — and multi_turn.inline_accuracy defaults to False.

The function name (..._if_enabled) and its return None for non-multi-turn datasets show the intent is to no-op when disabled. As written, every existing multi-turn benchmark that doesn't opt into inline accuracy now crashes at finalization, after the run has already completed — discarding results.

if ( perf_cfg is None or perf_cfg.multi_turn is None or not perf_cfg.multi_turn.inline_accuracy or perf_cfg.path is None ): raise InputValidationError(...)

Fix: return None (skip scoring) when inline accuracy is disabled / path missing. Only raise if inline accuracy was explicitly requested but a prerequisite is missing (e.g. inline_accuracy: true but no path).

arekay-nv · 2026-06-03T21:45:05Z

+    return row
+
+
+def score_report(


[Claude] 🟡 high · testing: This entire new ~533-line scorer module has zero test coverage — grep across tests/ finds no reference to score_report, inline_accuracy, or multi_turn_inline, and there is no tests/unit/evaluation/test_multi_turn_inline_accuracy.py. AGENTS.md targets >90% coverage for new code, and you specifically asked to ensure test coverage.

The untested logic is exactly the kind that's error-prone: shell-command canonicalization, tool-call merging (streamed vs complete), multiset IoU, repeat-set validity, and the valid/invalid_reason computation. Notably, the existing CodeRabbit comments on this same file already flag two robustness crashes (row["turn"] KeyError; unguarded msgspec.DecodeError on partial events.jsonl) — both in code paths that tests would have caught.

Fix: add a tests/unit/evaluation/test_multi_turn_inline_accuracy.py covering exe extraction, intent extraction, repeat-id splitting, multiset scoring, and the validity/invalid_reason branches (with @pytest.mark.unit).

arekay-nv · 2026-06-03T21:45:05Z

        if getattr(self._thread_local, "tokenizer", None) is None:
            self._thread_local.tokenizer = AutoTokenizer.from_pretrained(
-                self._tokenizer_name
+                self._tokenizer_name, trust_remote_code=True


[Codex] 🟡 medium · security: This PR flips tokenizer loading to trust_remote_code=True by default at two sites — here (the hot-path tokenizer pool) and commands/benchmark/execute.py:329 (_precompute_isl_for_multi_turn):

self._thread_local.tokenizer = AutoTokenizer.from_pretrained( self._tokenizer_name, trust_remote_code=True )

trust_remote_code=True executes arbitrary Python from the model repo on load. Previously these paths only loaded standard tokenizer assets, so this broadens the security posture for every benchmark run — a mistyped or untrusted HF repo id now runs remote code. It's intentional and even has a test asserting the kwarg, but it's enabled globally rather than opt-in and is undocumented.

Suggestion: gate remote-code execution behind an explicit config flag (default False), or at minimum document the new default and its implication in the README so operators pointing at untrusted repos are aware.

arekay-nv · 2026-06-03T21:45:05Z

-    turn_timeout_s: float = Field(default=300.0, gt=0)
-    use_dataset_history: bool = True
+    turn_timeout_s: float = Field(
+        default=86400.0,


[Claude] 🟡 medium · documentation: turn_timeout_s's default was changed from 300.0 to 86400.0 (24 hours) and the field was removed from both the example YAML and the README in this PR:

turn_timeout_s: float = Field( default=86400.0, gt=0,

A 24-hour default effectively disables the per-turn safety deadline — if a server hangs mid-turn, the run now blocks for a day instead of 5 minutes. That's a meaningful behavioral change to hide from the example config. use_dataset_history (default True) was likewise dropped from the docs.

Fix: document the new turn_timeout_s default and its implication in examples/09_MultiTurn/README.md, and either surface use_dataset_history or note why it's intentionally omitted.

arekay-nv · 2026-06-03T21:45:05Z

+            "multi_turn.inline_accuracy: true and a dataset path"
+        )
+
+    from inference_endpoint.evaluation.multi_turn_inline_accuracy import score_report


[Claude] 🟡 medium · design: Lazy import inside a function violates the project's "No lazy imports" rule (AGENTS.md):

from inference_endpoint.evaluation.multi_turn_inline_accuracy import score_report

multi_turn_inline_accuracy.py only imports from core.record / core.types, so there is no circular-import justification (the only sanctioned exceptions are circular-import avoidance with a comment, optional-dependency try/except, and sandboxing).

Fix: move this import to the top of the file alongside the other inference_endpoint.evaluation imports.

arekay-nv · 2026-06-03T21:45:05Z

+        if self._store_in_history:
+            sys_content = self._dataset_metadata.system_prompts_by_conv.get(source_id)
+            if sys_content is not None and self._enable_salt:
+                if sys_content.startswith("[salt: "):


[Claude] 🔵 low · bug: The store-in-history salt strip here only handles the [salt: prefix:

if sys_content.startswith("[salt: "): marker_end = sys_content.find("]\n\n")

but the sibling _messages_with_logical_id_salt (line ~472) strips both [salt: and the legacy [cache_salt: prefixes. If a system_prompts_by_conv value carries a [cache_salt: ...] marker, this path won't strip it and will then prepend a fresh [salt: ...], producing doubled/garbled markers in the rendered system prompt.

Fix: make both strip routines handle the same set of prefixes (factor out a shared helper).

arekay-nv · 2026-06-03T21:45:05Z

+        "valid": valid,
+    }
+    if not valid:
+        score_entry["invalid_reason"] = score_result["invalid_reason"]


[Claude] 🔵 low · error-handling: score_entry["invalid_reason"] = score_result["invalid_reason"] (and the mirror at line ~1039) rely on an unguarded cross-module dict contract: that score_report always populates invalid_reason when valid is False. That invariant currently holds, so this isn't a live bug — but a future change to score_report that returns valid=False without a reason would raise KeyError at finalization and discard an otherwise-successful run.

Suggestion: use score_result.get("invalid_reason", "") to make the contract failure-tolerant at this boundary.

arekay-nv · 2026-06-03T21:45:05Z

+            )
+        result["invalid_reason"] = "; ".join(reasons)
+
+    if out_path is not None:


[Claude] 🔵 low · data-integrity: scores.json serializes the full per_turn list (one row per expected-turn × repeat) via a single in-memory json.dumps(result, indent=2). For the documented target (990 trajectories × many turns) this artifact can grow very large. The standalone main() already strips per_turn before printing.

Suggestion: gate the verbose per_turn dump (summary-only by default, full dump behind a flag) to bound the artifact size.

arekay-nv · 2026-06-03T21:45:05Z

-    path: /path/to/agentic_dataset.jsonl
+  - name: agentic_combined
+    type: performance # do not change.
+    path: <path-to-agentic-combined-jsonl>


[Claude] 🔵 low · documentation: path: <path-to-agentic-combined-jsonl> is a literal placeholder. Combined with inline_accuracy: true # do not change and num_trajectories_to_issue: 990, a user who copies this file and only edits the endpoint will hit the inline-accuracy finalization path with an unreadable ground-truth JSONL and fail (after the run).

Suggestion: use a clearer placeholder convention (e.g. /path/to/agentic_dataset.jsonl) plus a one-line note that this must be a real local/mounted file because inline accuracy reads it at finalization.

arekay-nv · 2026-06-03T21:45:05Z


+    @pytest.mark.unit
+    def test_multi_turn_rejects_removed_stop_on_first_empty_slot_config(self):
+        with pytest.raises(ValueError, match="stop_on_first_empty_slot"):


[Claude] 🔵 low · testing: test_multi_turn_rejects_removed_stop_on_first_empty_slot_config only passes because MultiTurnConfig uses extra="forbid" — it would pass for any unknown key, so it doesn't actually guard the removal of this specific field. It gives false confidence.

with pytest.raises(ValueError, match="stop_on_first_empty_slot"):

Suggestion: either drop it (generic extra="forbid" behavior is already covered elsewhere) or, if guarding a real migration, document that intent in the test.

arekay-nv · 2026-06-03T21:45:35Z

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.4) + Claude · Depth: thorough

Found 10 issues across 7 files. 10 inline comments posted on the relevant lines.

🔴 Must Fix (critical/high)

Will cause incorrect behavior or data loss in production.

#	File	Line	Category	Reviewer(s)	Summary
1	`commands/benchmark/execute.py`	905	bug	Both	`_inline_score_multi_turn_if_enabled()` is called unconditionally but raises when `inline_accuracy` is disabled (the default) → every multi-turn run that doesn't opt in crashes at finalization, discarding results. Should `return None` when disabled.
2	`evaluation/multi_turn_inline_accuracy.py`	387	testing	Claude	The new ~533-line scorer module has zero tests. AGENTS.md targets >90% on new code. The two existing CodeRabbit comments flag crashes (`row["turn"]` KeyError, `msgspec.DecodeError`) that tests would have caught.

🟡 Should Fix (medium)

Real issues / behavioral & posture changes that should be addressed.

#	File	Line	Category	Reviewer(s)	Summary
3	`async_utils/.../token_metrics.py`	124	security	Codex	`trust_remote_code=True` now the default at two sites (here + `execute.py:329`) — executes arbitrary repo code on every run. Should be opt-in and/or documented.
4	`config/schema.py`	259	documentation	Claude	`turn_timeout_s` default changed `300 → 86400` (24h, effectively disables the per-turn deadline) and was removed from the README + example YAML. `use_dataset_history` also undocumented.
5	`commands/benchmark/execute.py`	916	design	Claude	Lazy `import score_report` mid-function violates the project's "No lazy imports" rule; no circular-import justification. Move to top of file.

🔵 Consider (low)

Valid improvements / follow-ups.

#	File	Line	Category	Reviewer(s)	Summary
6	`load_generator/multi_turn_strategy.py`	295	bug	Claude	Salt strip handles only `[salt:` here, but the sibling routine strips both `[salt:` and `[cache_salt:` → potential doubled/garbled markers.
7	`commands/benchmark/execute.py`	930	error-handling	Claude	Unguarded `score_result["invalid_reason"]` (also ~1039) — safe today, but a fragile cross-module contract; use `.get(...)`.
8	`evaluation/multi_turn_inline_accuracy.py`	493	data-integrity	Claude	`scores.json` dumps the full `per_turn` list in one `json.dumps`; unbounded for 990× trajectories. Consider summary-only by default.
9	`examples/09_MultiTurn/kimi_agentic_benchmark.yaml`	17	documentation	Claude	`path: <path-to-agentic-combined-jsonl>` placeholder + `inline_accuracy: true` → copy-paste users fail at finalization. Clarify placeholder + add a note.
10	`tests/unit/config/test_schema.py`	497	testing	Claude	`test_..._rejects_removed_stop_on_first_empty_slot` passes only due to `extra="forbid"` — would pass for any unknown key; false confidence.

On the three things you asked us to weight:

Test coverage — the four touched modules with new tests are well covered (schema 47, dataset 49, strategy 30, token_metrics 12 tests), but the scorer module has none ([fix] Update review and branch name #2) and one schema test is hollow (Perf: system bottleneck tests #10).
Documentation — the README/YAML drifted from the schema: turn_timeout_s default change is hidden (Feat: Add accuracy evaluation for LLMs (GPQA, AIME, HLE etc.) #4), use_dataset_history is undocumented (Feat: Add accuracy evaluation for LLMs (GPQA, AIME, HLE etc.) #4), and the example path is a non-runnable placeholder (Perf: Analayze the roofline of the inference endpoints #9).
Conformance — one clear convention violation (lazy import Feat: define and implement the submission ruleset interaction with config #5); the rest matches existing patterns.

The critical finalization crash (#1) is the blocker — it breaks existing multi-turn runs regardless of this PR's new features.

_{🤖 Generated with Claude Code · Review Council}

feat(multiturn): add kimi agentic benchmark controls

36cd0b2

gemini-code-assist Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/inference_endpoint/evaluation/multi_turn_inline_accuracy.py Outdated

Comment thread src/inference_endpoint/evaluation/multi_turn_inline_accuracy.py Outdated

feat: add multi-turn benchmark controls

3425267

hvagadia changed the title ~~[Do Not Review] feat(multiturn): add kimi agentic benchmark controls~~ feat(multiturn): add kimi agentic benchmark controls Jun 3, 2026

fix: skip disabled multi-turn inline accuracy

0a13672

arekay-nv reviewed Jun 3, 2026

View reviewed changes

Conversation

hvagadia commented Jun 3, 2026

What does this PR do?

Type of change

Related issues

Testing

Checklist

Uh oh!

github-actions Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

arekay-nv left a comment

Choose a reason for hiding this comment

Review Council — Multi-AI Code Review

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arekay-nv commented Jun 3, 2026

Review Council — Multi-AI Code Review

🔴 Must Fix (critical/high)

🟡 Should Fix (medium)

🔵 Consider (low)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented Jun 3, 2026 •

edited

Loading