Merge 3 existing PR related to OnnxDiscrepancyCheck + llama.cpp integration by xadupre · Pull Request #2546 · microsoft/Olive

xadupre · 2026-06-29T11:38:07Z

Describe your changes

Merges #2536, #2535, #2534.

Additionally adds llama.cpp integration and other improvements to OnnxDiscrepancyCheck:

New llama_cpp flag (bool, default False) on OnnxDiscrepancyCheck — when enabled, converts the reference HuggingFace model to GGUF format using convert_hf_to_gguf.py from llama.cpp and compares inference with llama.cpp.
New llama_cpp_env_path parameter (Optional[str]) — path to the llama_env virtual environment where llama-cpp-python and convert_hf_to_gguf.py are installed (defaults to "llama_env" relative to cwd). The virtual environment also isolates potentially conflicting versions of torch, transformers, etc.
compare_llama_cpp() method — saves the reference model and tokenizer to output_dir/hf_model (alongside the saved test model/report output) using save_pretrained (standard HuggingFace format), then:
1. Calls convert_hf_to_gguf.py (the official llama.cpp conversion CLI) inside llama_env via subprocess to produce a GGUF F32 file at output_dir/model.gguf.
2. Runs a self-contained inference helper script inside llama_env via subprocess.run to measure first-token latency with llama_cpp.Llama.
  Results include first-token match vs PyTorch, llama_cpp_ttft_s, llama_cpp_ttfn_s, llama_cpp_total_time_s, llama_cpp_speedup_vs_pytorch, and llama_cpp_speedup_vs_onnx. All llama_cpp/gguf imports are strictly isolated to the subprocess — the main Olive process never imports them.
New --test_llama_path CLI option — specifies the path to the llama_env virtual environment when running olive optimize --test. When provided, the injected OnnxDiscrepancyCheck pass automatically enables llama_cpp=True and forwards the path as llama_cpp_env_path. Using --test_llama_path without --test emits a warning.
Improved --test_metrics parsing — now accepts both space-separated (--test_metrics mae speedup) and comma-separated (--test_metrics mae,speedup) forms. A _parse_test_metrics type function handles splitting and validation per token, and a _flatten_test_metrics helper normalises the result before it is forwarded to the pass.
Fixed add_discrepancy_check_pass update-in-place — when a config was previously generated by olive optimize --dry_run --test, the OnnxDiscrepancyCheck pass was already present and was silently skipped on subsequent olive run --test --test_metrics … calls. The function now updates the existing pass in-place, refreshing reference_model_path (resolved to absolute), report_output_dir, metric settings (max_mae / timing_iterations), and llama.cpp settings — so --test_metrics, --output_path, and --test_llama_path from olive run always take effect.
Fixed test model persistence across engine cache hits — ModelBuilder now copies the test model directory to reference_hf_model/ alongside the generated ONNX output in the engine cache. OnnxDiscrepancyCheck falls back to this cached copy when reference_model_path (e.g. out/tiny-test) no longer exists on disk, fixing the OSError: Repo id must be in the form 'repo_name' or 'namespace/repo_name' error that occurred on subsequent olive run calls that hit the model-builder cache.
New SaveTestModelConfig pass (olive/passes/pytorch/save_test_model_config.py) — a new Olive pass injected at the beginning of the passes list when --test is active. It takes an HfModelHandler, writes config.json (with the reduced hidden-layer count) and the Olive test-model marker file to test_model_path, and returns the model unchanged. This ensures test_model_path always exists as a config-only directory before ModelBuilder or any downstream pass needs it, replacing the previous _save_test_model_config_for_dry_run standalone function.
CI workflow (test-model-fast.yml) — new step that creates a llama_env virtual environment, installs gguf, safetensors, llama-cpp-python (from pre-built CPU wheels at https://abetlen.github.io/llama-cpp-python/whl/cpu), transformers, sentencepiece, and protobuf, and downloads convert_hf_to_gguf.py from the llama.cpp GitHub repository.
Updated documentation (cli-fast-test.md) — clarifies WHERE the 2-layer reduction happens (_apply_test_model_config in olive/common/hf/utils.py, called during the model-builder pass of olive run) and WHEN out/tiny-test is created (by the SaveTestModelConfig pass on first olive run, completed with weights by ModelBuilder); documents the cache fallback behaviour; explains the --test_llama_path option and llama_env setup; and clarifies that --test_metrics is always respected even when the config was generated by olive optimize --dry_run --test.

Checklist before requesting a review

Add unit tests for this change.
Make sure all tests can pass.
Update documents if necessary.
Lint and apply fixes to your code by running lintrunner -a
Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

(Optional) Issue link

… xadupre/merged

Copilot

Pull request overview

This PR consolidates three earlier changes around OnnxDiscrepancyCheck to improve test-mode metrics: it exposes latency values alongside speedup, adds time-to-first-token style generation metrics, and extends olive run --test so users can opt into additional metrics via --test_metrics.

Changes:

Extend OnnxDiscrepancyCheck.compare_generation to return a metrics dict including TTFT / time-to-first-N for both transformers and ORT GenAI.
Update speedup measurement to return and persist average PyTorch/ONNX latencies plus the computed speedup.
Add CLI support for selecting --test metrics (--test_metrics mae|speedup) and update/extend tests + docs accordingly.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`olive/passes/onnx/discrepancy_check.py`	Adds/returns latency metrics for generation and speedup runs; logs formatted latency values.
`olive/cli/base.py`	Introduces `--test_metrics`, injects discrepancy pass based on selected metrics, and warns when `--test_metrics` is unused.
`olive/cli/run.py`	Wires `--test_metrics` through to the injected discrepancy pass and warning helper.
`test/passes/onnx/test_discrepancy_check.py`	Updates unit tests for new generation result shape and new speedup return tuple.
`test/cli/test_cli.py`	Adjusts CLI workflow test config to reflect timing being disabled in certain cases.
`test/cli/test_base.py`	Adds focused tests for default metric selection behavior and warning behavior.
`docs/source/how-to/cli/cli-fast-test.md`	Documents metric selection for `--test` runs.

xadupre · 2026-06-29T13:06:41Z

@copilot in job Ubuntu test model fast, add a step which create a virtual environment called llama_env and install llama-cpp-python, then in OnnxDiscrepancy check, add a flag called llama-cpp which converts the dumped model into gguf format, measures the first token difference and the latencies and speed up, every use of llama-cpp is done through the virtual environment

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

… CI step

xadupre · 2026-07-01T07:00:35Z

@copilot command line python -m olive optimize
--model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider
--precision fp32 --output_path out/tiny --dry_run
--test out/tiny-test --test_metrics mae,speedup
--log_level 2
does not save test_metrics

… explicitly

Copilot · 2026-07-01T07:09:27Z

@copilot command line python -m olive optimize
--model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider
--precision fp32 --output_path out/tiny ...

Fixed in the latest commit. Two changes in add_discrepancy_check_pass:

New passes now write timing_iterations explicitly — instead of relying on the pass default when speedup is requested, the config now always gets an explicit timing_iterations: 5 (speedup enabled) or timing_iterations: 0 (speedup disabled). This makes the generated config.json self-contained and correct at a glance.
Update-in-place no longer clobbers metric settings when --test_metrics is absent — a subsequent olive run --config ... --test (without --test_metrics) now only refreshes reference_model_path and report_output_dir, leaving timing_iterations and max_mae from the dry-run config untouched. Metric settings are only overwritten when --test_metrics is explicitly provided.

xadupre · 2026-07-01T07:13:56Z

@copilot I want test_metrics to be saved, they should not be needed a second time with command run

…explicitly on update

Copilot · 2026-07-01T07:29:52Z

@copilot I want test_metrics to be saved, they should not be needed a second time with command run

Done. Two changes in the latest commit:

test_metrics are now always explicitly saved — the update-in-place path in add_discrepancy_check_pass now always writes timing_iterations explicitly (5 for speedup, 0 otherwise) instead of popping it. This ensures the config file always contains a self-contained, human-readable timing_iterations value after olive optimize --dry_run --test --test_metrics mae,speedup, and olive run --config ... --test will honour those saved settings without needing --test_metrics again.
Fixed all 5 pylint E1125 linter errors — compare_generation and compare_llama_cpp tests were missing the mandatory ref_model_path keyword argument (the CI failures). All calls now pass ref_model_path=config.reference_model_path.

xadupre · 2026-07-01T07:32:56Z

@copilot I ran python -m olive optimize --model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider --precision fp32 --output_path out/tiny --dry_run --test out/tiny-test --test_metrics speedup --log_level 2 and no test_metrics in config.json

…nerated config

Copilot · 2026-07-01T07:49:51Z

@copilot I ran python -m olive optimize --model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider --precision fp32 --output_path out/tiny --dry_run ...

Fixed in the latest commit. The root cause was that --test_metrics speedup was being stored as the opaque timing_iterations: 5 field, so searching config.json for "test_metrics" found nothing.

The generated config now contains the human-readable field directly:

"discrepancy_check": {
    "type": "OnnxDiscrepancyCheck",
    "reference_model_path": "...",
    "report_output_dir": "...",
    "test_metrics": ["speedup"]
}

Changes made:

Added test_metrics: Optional[list[str]] parameter to OnnxDiscrepancyCheck._default_config with a description explaining it controls both speedup timing and MAE threshold.
In _run_for_config, when test_metrics is set it takes precedence: "speedup" enables timing, "mae" enforces the MAE threshold. The old timing_iterations/max_mae fields remain supported for backward compatibility with hand-crafted configs.
add_discrepancy_check_pass now writes test_metrics instead of the implicit timing_iterations/max_mae fields, so the intent is always visible in config.json.

After olive optimize --dry_run --test out/tiny-test --test_metrics speedup, you will see "test_metrics": ["speedup"] in the generated config and olive run --config ... --test will respect it without requiring --test_metrics to be passed again.

xadupre · 2026-07-01T07:52:23Z

@copilot File "/home/xadupre/github/Olive/olive/passes/onnx/discrepancy_check.py", line 561, in _run_for_config
failures.append(f"Max absolute error {max_abs_error:.6f} exceeds threshold {config.max_mae:.6f}")

Copilot · 2026-07-01T07:54:07Z

@copilot File "/home/xadupre/github/Olive/olive/passes/onnx/discrepancy_check.py", line 561, in _run_for_config
failures.append(f"Max absolute error {max_abs_error:.6f} exceeds threshold {confi...

Fixed. The failure message at line 561 was using config.max_mae directly, which is None when the MAE threshold is derived from test_metrics (e.g. ["mae"]). Changed to use effective_max_mae — the already-resolved value — so formatting never hits None.

xadupre · 2026-07-01T08:02:27Z

@copilot eager implementation is the default? it should be attn_impl

…ntion implementation

Copilot · 2026-07-01T08:05:36Z

@copilot eager implementation is the default? it should be attn_impl

Added an attn_impl parameter (Optional[str], default None) to OnnxDiscrepancyCheck._default_config. It is forwarded as attn_implementation to AutoModelForCausalLM.from_pretrained; when None the model's own default is used.

xadupre · 2026-07-01T08:09:49Z

@copilot comand run overwrite olive config.json with transformers config.json

xadupre · 2026-07-01T08:10:09Z

@copilot test_model_path is not used?

xadupre · 2026-07-01T08:11:20Z

@copilot increase the number of logged info message in OnnxDiscrepancyCheck, every time is metric is computed, every time a file is saved

xadupre and others added 20 commits June 22, 2026 13:45

expose latencies with the speedup in OnnxDiscrepancyCheck

b21b4f6

Extend discrepancy check unit test for latency tuple

72c5d4e

add time to first token in OnnxDiscrepancyCheck

bf0a978

Add return type annotation to _measure_speedup

804bb92

Add latency key assertions to fully matching discrepancy test

1bdee25

Handle zero max_new_tokens in generation metrics

142ddea

Use single measured transformers generation for latency metrics

39cac1c

extend command line --test to trigger speedup measure

6b5b652

Document --test_metrics speedup usage

15287d8

Fix default test metrics to be mae-only, make speedup opt-in

89d98c4

Fix test to match new default mae-only behavior

7490e55

Fix test_cli.py expected pass config to include timing_iterations=0

bf96e3f

Merge branch 'main' into xadupre/lat

d34ecd5

Merge branch 'main' into xadupre/tts

141f35a

Merge branch 'main' into xadupre/cmd

1e8c020

Merge branch 'main' into xadupre/cmd

3370626

Merge branch 'main' into xadupre/tts

9c7365d

Merge branch 'main' into xadupre/lat

922089a

Merge branch 'xadupre/tts' of https://github.com/microsoft/Olive into…

0741621

… xadupre/merged

Merge branch 'xadupre/lat' of https://github.com/microsoft/Olive into…

4c7938b

… xadupre/merged

xadupre marked this pull request as ready for review June 29, 2026 13:02

Copilot AI review requested due to automatic review settings June 29, 2026 13:02

Copilot started reviewing on behalf of xadupre June 29, 2026 13:03 View session

Copilot AI reviewed Jun 29, 2026

View reviewed changes

Comment thread olive/cli/base.py

Comment thread docs/source/how-to/cli/cli-fast-test.md Outdated

Comment thread olive/passes/onnx/discrepancy_check.py Outdated

Copilot started work on behalf of xadupre June 29, 2026 13:07 View session

xadupre and others added 4 commits June 29, 2026 15:29

Potential fix for pull request finding

fc5c372

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

f1077b7

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

c1e75e6

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

feat: add llama-cpp integration to OnnxDiscrepancyCheck and llama_env…

4c73d0a

… CI step

Copilot finished work on behalf of xadupre June 30, 2026 19:42

Copilot started work on behalf of xadupre July 1, 2026 07:01 View session

Fix test_metrics not saved in dry_run: always write timing_iterations…

df9602e

… explicitly

Copilot finished work on behalf of xadupre July 1, 2026 07:10

Copilot started work on behalf of xadupre July 1, 2026 07:14 View session

Fix missing ref_model_path in test calls and write timing_iterations …

34a08e5

…explicitly on update

Copilot finished work on behalf of xadupre July 1, 2026 07:30

Copilot started work on behalf of xadupre July 1, 2026 07:33 View session

Add test_metrics parameter to OnnxDiscrepancyCheck and store it in ge…

5bf4087

…nerated config

Copilot finished work on behalf of xadupre July 1, 2026 07:50

Copilot started work on behalf of xadupre July 1, 2026 07:53 View session

Fix crash when formatting MAE threshold with test_metrics

b97c17e

Copilot finished work on behalf of xadupre July 1, 2026 07:54

Copilot started work on behalf of xadupre July 1, 2026 08:03 View session

Add attn_impl parameter to OnnxDiscrepancyCheck for configurable atte…

da0d77d

…ntion implementation

Copilot finished work on behalf of xadupre July 1, 2026 08:06

Copilot started work on behalf of xadupre July 1, 2026 08:10 View session

Uh oh!

Conversation

xadupre commented Jun 29, 2026 • edited by Copilot AI Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe your changes

Checklist before requesting a review

(Optional) Issue link

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

xadupre commented Jun 29, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

Copilot AI commented Jul 1, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

Copilot AI commented Jul 1, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

Copilot AI commented Jul 1, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

Copilot AI commented Jul 1, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

Copilot AI commented Jul 1, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

xadupre commented Jun 29, 2026 •

edited by Copilot AI

Loading