Skip to content

feat(multiturn): add kimi agentic benchmark controls#331

Draft
hvagadia wants to merge 3 commits into
mlcommons:mainfrom
hvagadia:hvagadia/kimi-agentic-benchmark-updates
Draft

feat(multiturn): add kimi agentic benchmark controls#331
hvagadia wants to merge 3 commits into
mlcommons:mainfrom
hvagadia:hvagadia/kimi-agentic-benchmark-updates

Conversation

@hvagadia
Copy link
Copy Markdown
Collaborator

@hvagadia hvagadia commented Jun 3, 2026

What does this PR do?

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces inline accuracy scoring for multi-turn performance replay logs, updates the multi-turn load strategy to support trajectory budgets, repeating conversations with unique logical IDs, cache salting, and performance tracking stops, and updates the associated configurations and documentation. The review feedback highlights two robustness issues in the new multi_turn_inline_accuracy.py file: potential KeyError or TypeError exceptions due to unsafe dictionary access on the turn field, and potential crashes when decoding malformed lines in the event log.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/inference_endpoint/evaluation/multi_turn_inline_accuracy.py Outdated
Comment thread src/inference_endpoint/evaluation/multi_turn_inline_accuracy.py Outdated
@hvagadia hvagadia changed the title [Do Not Review] feat(multiturn): add kimi agentic benchmark controls feat(multiturn): add kimi agentic benchmark controls Jun 3, 2026
Copy link
Copy Markdown
Collaborator

@arekay-nv arekay-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.4) + Claude · Depth: thorough

Posting 10 inline findings. See the pinned summary comment for the tiered breakdown.

(ds for ds in ctx.config.datasets if ds.type == DatasetType.PERFORMANCE),
None,
)
if (
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex + Claude] 🔴 critical · bug: _inline_score_multi_turn_if_enabled() is called unconditionally for every MultiTurnDataset run from finalize_benchmark (line 954), but this guard raises InputValidationError whenever inline accuracy is not enabled — and multi_turn.inline_accuracy defaults to False.

The function name (..._if_enabled) and its return None for non-multi-turn datasets show the intent is to no-op when disabled. As written, every existing multi-turn benchmark that doesn't opt into inline accuracy now crashes at finalization, after the run has already completed — discarding results.

    if (
        perf_cfg is None
        or perf_cfg.multi_turn is None
        or not perf_cfg.multi_turn.inline_accuracy
        or perf_cfg.path is None
    ):
        raise InputValidationError(...)

Fix: return None (skip scoring) when inline accuracy is disabled / path missing. Only raise if inline accuracy was explicitly requested but a prerequisite is missing (e.g. inline_accuracy: true but no path).

return row


def score_report(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] 🟡 high · testing: This entire new ~533-line scorer module has zero test coveragegrep across tests/ finds no reference to score_report, inline_accuracy, or multi_turn_inline, and there is no tests/unit/evaluation/test_multi_turn_inline_accuracy.py. AGENTS.md targets >90% coverage for new code, and you specifically asked to ensure test coverage.

The untested logic is exactly the kind that's error-prone: shell-command canonicalization, tool-call merging (streamed vs complete), multiset IoU, repeat-set validity, and the valid/invalid_reason computation. Notably, the existing CodeRabbit comments on this same file already flag two robustness crashes (row["turn"] KeyError; unguarded msgspec.DecodeError on partial events.jsonl) — both in code paths that tests would have caught.

Fix: add a tests/unit/evaluation/test_multi_turn_inline_accuracy.py covering exe extraction, intent extraction, repeat-id splitting, multiset scoring, and the validity/invalid_reason branches (with @pytest.mark.unit).

if getattr(self._thread_local, "tokenizer", None) is None:
self._thread_local.tokenizer = AutoTokenizer.from_pretrained(
self._tokenizer_name
self._tokenizer_name, trust_remote_code=True
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex] 🟡 medium · security: This PR flips tokenizer loading to trust_remote_code=True by default at two sites — here (the hot-path tokenizer pool) and commands/benchmark/execute.py:329 (_precompute_isl_for_multi_turn):

            self._thread_local.tokenizer = AutoTokenizer.from_pretrained(
                self._tokenizer_name, trust_remote_code=True
            )

trust_remote_code=True executes arbitrary Python from the model repo on load. Previously these paths only loaded standard tokenizer assets, so this broadens the security posture for every benchmark run — a mistyped or untrusted HF repo id now runs remote code. It's intentional and even has a test asserting the kwarg, but it's enabled globally rather than opt-in and is undocumented.

Suggestion: gate remote-code execution behind an explicit config flag (default False), or at minimum document the new default and its implication in the README so operators pointing at untrusted repos are aware.

turn_timeout_s: float = Field(default=300.0, gt=0)
use_dataset_history: bool = True
turn_timeout_s: float = Field(
default=86400.0,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] 🟡 medium · documentation: turn_timeout_s's default was changed from 300.0 to 86400.0 (24 hours) and the field was removed from both the example YAML and the README in this PR:

    turn_timeout_s: float = Field(
        default=86400.0,
        gt=0,

A 24-hour default effectively disables the per-turn safety deadline — if a server hangs mid-turn, the run now blocks for a day instead of 5 minutes. That's a meaningful behavioral change to hide from the example config. use_dataset_history (default True) was likewise dropped from the docs.

Fix: document the new turn_timeout_s default and its implication in examples/09_MultiTurn/README.md, and either surface use_dataset_history or note why it's intentionally omitted.

"multi_turn.inline_accuracy: true and a dataset path"
)

from inference_endpoint.evaluation.multi_turn_inline_accuracy import score_report
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] 🟡 medium · design: Lazy import inside a function violates the project's "No lazy imports" rule (AGENTS.md):

    from inference_endpoint.evaluation.multi_turn_inline_accuracy import score_report

multi_turn_inline_accuracy.py only imports from core.record / core.types, so there is no circular-import justification (the only sanctioned exceptions are circular-import avoidance with a comment, optional-dependency try/except, and sandboxing).

Fix: move this import to the top of the file alongside the other inference_endpoint.evaluation imports.

if self._store_in_history:
sys_content = self._dataset_metadata.system_prompts_by_conv.get(source_id)
if sys_content is not None and self._enable_salt:
if sys_content.startswith("[salt: "):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] 🔵 low · bug: The store-in-history salt strip here only handles the [salt: prefix:

                if sys_content.startswith("[salt: "):
                    marker_end = sys_content.find("]\n\n")

but the sibling _messages_with_logical_id_salt (line ~472) strips both [salt: and the legacy [cache_salt: prefixes. If a system_prompts_by_conv value carries a [cache_salt: ...] marker, this path won't strip it and will then prepend a fresh [salt: ...], producing doubled/garbled markers in the rendered system prompt.

Fix: make both strip routines handle the same set of prefixes (factor out a shared helper).

"valid": valid,
}
if not valid:
score_entry["invalid_reason"] = score_result["invalid_reason"]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] 🔵 low · error-handling: score_entry["invalid_reason"] = score_result["invalid_reason"] (and the mirror at line ~1039) rely on an unguarded cross-module dict contract: that score_report always populates invalid_reason when valid is False. That invariant currently holds, so this isn't a live bug — but a future change to score_report that returns valid=False without a reason would raise KeyError at finalization and discard an otherwise-successful run.

Suggestion: use score_result.get("invalid_reason", "") to make the contract failure-tolerant at this boundary.

)
result["invalid_reason"] = "; ".join(reasons)

if out_path is not None:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] 🔵 low · data-integrity: scores.json serializes the full per_turn list (one row per expected-turn × repeat) via a single in-memory json.dumps(result, indent=2). For the documented target (990 trajectories × many turns) this artifact can grow very large. The standalone main() already strips per_turn before printing.

Suggestion: gate the verbose per_turn dump (summary-only by default, full dump behind a flag) to bound the artifact size.

path: /path/to/agentic_dataset.jsonl
- name: agentic_combined
type: performance # do not change.
path: <path-to-agentic-combined-jsonl>
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] 🔵 low · documentation: path: <path-to-agentic-combined-jsonl> is a literal placeholder. Combined with inline_accuracy: true # do not change and num_trajectories_to_issue: 990, a user who copies this file and only edits the endpoint will hit the inline-accuracy finalization path with an unreadable ground-truth JSONL and fail (after the run).

Suggestion: use a clearer placeholder convention (e.g. /path/to/agentic_dataset.jsonl) plus a one-line note that this must be a real local/mounted file because inline accuracy reads it at finalization.


@pytest.mark.unit
def test_multi_turn_rejects_removed_stop_on_first_empty_slot_config(self):
with pytest.raises(ValueError, match="stop_on_first_empty_slot"):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] 🔵 low · testing: test_multi_turn_rejects_removed_stop_on_first_empty_slot_config only passes because MultiTurnConfig uses extra="forbid" — it would pass for any unknown key, so it doesn't actually guard the removal of this specific field. It gives false confidence.

        with pytest.raises(ValueError, match="stop_on_first_empty_slot"):

Suggestion: either drop it (generic extra="forbid" behavior is already covered elsewhere) or, if guarding a real migration, document that intent in the test.

@arekay-nv
Copy link
Copy Markdown
Collaborator

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.4) + Claude · Depth: thorough

Found 10 issues across 7 files. 10 inline comments posted on the relevant lines.

🔴 Must Fix (critical/high)

Will cause incorrect behavior or data loss in production.

# File Line Category Reviewer(s) Summary
1 commands/benchmark/execute.py 905 bug Both _inline_score_multi_turn_if_enabled() is called unconditionally but raises when inline_accuracy is disabled (the default) → every multi-turn run that doesn't opt in crashes at finalization, discarding results. Should return None when disabled.
2 evaluation/multi_turn_inline_accuracy.py 387 testing Claude The new ~533-line scorer module has zero tests. AGENTS.md targets >90% on new code. The two existing CodeRabbit comments flag crashes (row["turn"] KeyError, msgspec.DecodeError) that tests would have caught.

🟡 Should Fix (medium)

Real issues / behavioral & posture changes that should be addressed.

# File Line Category Reviewer(s) Summary
3 async_utils/.../token_metrics.py 124 security Codex trust_remote_code=True now the default at two sites (here + execute.py:329) — executes arbitrary repo code on every run. Should be opt-in and/or documented.
4 config/schema.py 259 documentation Claude turn_timeout_s default changed 300 → 86400 (24h, effectively disables the per-turn deadline) and was removed from the README + example YAML. use_dataset_history also undocumented.
5 commands/benchmark/execute.py 916 design Claude Lazy import score_report mid-function violates the project's "No lazy imports" rule; no circular-import justification. Move to top of file.

🔵 Consider (low)

Valid improvements / follow-ups.

# File Line Category Reviewer(s) Summary
6 load_generator/multi_turn_strategy.py 295 bug Claude Salt strip handles only [salt: here, but the sibling routine strips both [salt: and [cache_salt: → potential doubled/garbled markers.
7 commands/benchmark/execute.py 930 error-handling Claude Unguarded score_result["invalid_reason"] (also ~1039) — safe today, but a fragile cross-module contract; use .get(...).
8 evaluation/multi_turn_inline_accuracy.py 493 data-integrity Claude scores.json dumps the full per_turn list in one json.dumps; unbounded for 990× trajectories. Consider summary-only by default.
9 examples/09_MultiTurn/kimi_agentic_benchmark.yaml 17 documentation Claude path: <path-to-agentic-combined-jsonl> placeholder + inline_accuracy: true → copy-paste users fail at finalization. Clarify placeholder + add a note.
10 tests/unit/config/test_schema.py 497 testing Claude test_..._rejects_removed_stop_on_first_empty_slot passes only due to extra="forbid" — would pass for any unknown key; false confidence.

On the three things you asked us to weight:

The critical finalization crash (#1) is the blocker — it breaks existing multi-turn runs regardless of this PR's new features.

🤖 Generated with Claude Code · Review Council

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants