Skip to content

Merge 3 existing PR related to OnnxDiscrepancyCheck + llama.cpp integration#2546

Open
xadupre wants to merge 51 commits into
mainfrom
xadupre/merged
Open

Merge 3 existing PR related to OnnxDiscrepancyCheck + llama.cpp integration#2546
xadupre wants to merge 51 commits into
mainfrom
xadupre/merged

Conversation

@xadupre

@xadupre xadupre commented Jun 29, 2026

Copy link
Copy Markdown
Member

Describe your changes

Merges #2536, #2535, #2534.

Additionally adds llama.cpp integration and other improvements to OnnxDiscrepancyCheck:

  • New llama_cpp flag (bool, default False) on OnnxDiscrepancyCheck — when enabled, converts the reference HuggingFace model to GGUF format using convert_hf_to_gguf.py from llama.cpp and compares inference with llama.cpp.
  • New llama_cpp_env_path parameter (Optional[str]) — path to the llama_env virtual environment where llama-cpp-python and convert_hf_to_gguf.py are installed (defaults to "llama_env" relative to cwd). The virtual environment also isolates potentially conflicting versions of torch, transformers, etc.
  • compare_llama_cpp() method — saves the reference model and tokenizer to output_dir/hf_model (alongside the saved test model/report output) using save_pretrained (standard HuggingFace format), then:
    1. Calls convert_hf_to_gguf.py (the official llama.cpp conversion CLI) inside llama_env via subprocess to produce a GGUF F32 file at output_dir/model.gguf.
    2. Runs a self-contained inference helper script inside llama_env via subprocess.run to measure first-token latency with llama_cpp.Llama.
      Results include first-token match vs PyTorch, llama_cpp_ttft_s, llama_cpp_ttfn_s, llama_cpp_total_time_s, llama_cpp_speedup_vs_pytorch, and llama_cpp_speedup_vs_onnx. All llama_cpp/gguf imports are strictly isolated to the subprocess — the main Olive process never imports them.
  • New --test_llama_path CLI option — specifies the path to the llama_env virtual environment when running olive optimize --test. When provided, the injected OnnxDiscrepancyCheck pass automatically enables llama_cpp=True and forwards the path as llama_cpp_env_path. Using --test_llama_path without --test emits a warning.
  • Improved --test_metrics parsing — now accepts both space-separated (--test_metrics mae speedup) and comma-separated (--test_metrics mae,speedup) forms. A _parse_test_metrics type function handles splitting and validation per token, and a _flatten_test_metrics helper normalises the result before it is forwarded to the pass.
  • Fixed add_discrepancy_check_pass update-in-place — when a config was previously generated by olive optimize --dry_run --test, the OnnxDiscrepancyCheck pass was already present and was silently skipped on subsequent olive run --test --test_metrics … calls. The function now updates the existing pass in-place, refreshing reference_model_path (resolved to absolute), report_output_dir, metric settings (max_mae / timing_iterations), and llama.cpp settings — so --test_metrics, --output_path, and --test_llama_path from olive run always take effect.
  • Fixed test model persistence across engine cache hitsModelBuilder now copies the test model directory to reference_hf_model/ alongside the generated ONNX output in the engine cache. OnnxDiscrepancyCheck falls back to this cached copy when reference_model_path (e.g. out/tiny-test) no longer exists on disk, fixing the OSError: Repo id must be in the form 'repo_name' or 'namespace/repo_name' error that occurred on subsequent olive run calls that hit the model-builder cache.
  • New SaveTestModelConfig pass (olive/passes/pytorch/save_test_model_config.py) — a new Olive pass injected at the beginning of the passes list when --test is active. It takes an HfModelHandler, writes config.json (with the reduced hidden-layer count) and the Olive test-model marker file to test_model_path, and returns the model unchanged. This ensures test_model_path always exists as a config-only directory before ModelBuilder or any downstream pass needs it, replacing the previous _save_test_model_config_for_dry_run standalone function.
  • CI workflow (test-model-fast.yml) — new step that creates a llama_env virtual environment, installs gguf, safetensors, llama-cpp-python (from pre-built CPU wheels at https://abetlen.github.io/llama-cpp-python/whl/cpu), transformers, sentencepiece, and protobuf, and downloads convert_hf_to_gguf.py from the llama.cpp GitHub repository.
  • Updated documentation (cli-fast-test.md) — clarifies WHERE the 2-layer reduction happens (_apply_test_model_config in olive/common/hf/utils.py, called during the model-builder pass of olive run) and WHEN out/tiny-test is created (by the SaveTestModelConfig pass on first olive run, completed with weights by ModelBuilder); documents the cache fallback behaviour; explains the --test_llama_path option and llama_env setup; and clarifies that --test_metrics is always respected even when the config was generated by olive optimize --dry_run --test.

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

(Optional) Issue link

@xadupre xadupre marked this pull request as ready for review June 29, 2026 13:02
Copilot AI review requested due to automatic review settings June 29, 2026 13:02

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR consolidates three earlier changes around OnnxDiscrepancyCheck to improve test-mode metrics: it exposes latency values alongside speedup, adds time-to-first-token style generation metrics, and extends olive run --test so users can opt into additional metrics via --test_metrics.

Changes:

  • Extend OnnxDiscrepancyCheck.compare_generation to return a metrics dict including TTFT / time-to-first-N for both transformers and ORT GenAI.
  • Update speedup measurement to return and persist average PyTorch/ONNX latencies plus the computed speedup.
  • Add CLI support for selecting --test metrics (--test_metrics mae|speedup) and update/extend tests + docs accordingly.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
olive/passes/onnx/discrepancy_check.py Adds/returns latency metrics for generation and speedup runs; logs formatted latency values.
olive/cli/base.py Introduces --test_metrics, injects discrepancy pass based on selected metrics, and warns when --test_metrics is unused.
olive/cli/run.py Wires --test_metrics through to the injected discrepancy pass and warning helper.
test/passes/onnx/test_discrepancy_check.py Updates unit tests for new generation result shape and new speedup return tuple.
test/cli/test_cli.py Adjusts CLI workflow test config to reflect timing being disabled in certain cases.
test/cli/test_base.py Adds focused tests for default metric selection behavior and warning behavior.
docs/source/how-to/cli/cli-fast-test.md Documents metric selection for --test runs.

Comment thread olive/cli/base.py
Comment thread docs/source/how-to/cli/cli-fast-test.md Outdated
Comment thread olive/passes/onnx/discrepancy_check.py Outdated
@xadupre

xadupre commented Jun 29, 2026

Copy link
Copy Markdown
Member Author

@copilot in job Ubuntu test model fast, add a step which create a virtual environment called llama_env and install llama-cpp-python, then in OnnxDiscrepancy check, add a flag called llama-cpp which converts the dumped model into gguf format, measures the first token difference and the latencies and speed up, every use of llama-cpp is done through the virtual environment

xadupre and others added 4 commits June 29, 2026 15:29
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot command line python -m olive optimize
--model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider
--precision fp32 --output_path out/tiny --dry_run
--test out/tiny-test --test_metrics mae,speedup
--log_level 2
does not save test_metrics

Copilot AI commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@copilot command line python -m olive optimize
--model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider
--precision fp32 --output_path out/tiny ...

Fixed in the latest commit. Two changes in add_discrepancy_check_pass:

  1. New passes now write timing_iterations explicitly — instead of relying on the pass default when speedup is requested, the config now always gets an explicit timing_iterations: 5 (speedup enabled) or timing_iterations: 0 (speedup disabled). This makes the generated config.json self-contained and correct at a glance.

  2. Update-in-place no longer clobbers metric settings when --test_metrics is absent — a subsequent olive run --config ... --test (without --test_metrics) now only refreshes reference_model_path and report_output_dir, leaving timing_iterations and max_mae from the dry-run config untouched. Metric settings are only overwritten when --test_metrics is explicitly provided.

@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot I want test_metrics to be saved, they should not be needed a second time with command run

Copilot AI commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@copilot I want test_metrics to be saved, they should not be needed a second time with command run

Done. Two changes in the latest commit:

  1. test_metrics are now always explicitly saved — the update-in-place path in add_discrepancy_check_pass now always writes timing_iterations explicitly (5 for speedup, 0 otherwise) instead of popping it. This ensures the config file always contains a self-contained, human-readable timing_iterations value after olive optimize --dry_run --test --test_metrics mae,speedup, and olive run --config ... --test will honour those saved settings without needing --test_metrics again.

  2. Fixed all 5 pylint E1125 linter errorscompare_generation and compare_llama_cpp tests were missing the mandatory ref_model_path keyword argument (the CI failures). All calls now pass ref_model_path=config.reference_model_path.

@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot I ran python -m olive optimize --model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider --precision fp32 --output_path out/tiny --dry_run --test out/tiny-test --test_metrics speedup --log_level 2 and no test_metrics in config.json

Copilot AI commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@copilot I ran python -m olive optimize --model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider --precision fp32 --output_path out/tiny --dry_run ...

Fixed in the latest commit. The root cause was that --test_metrics speedup was being stored as the opaque timing_iterations: 5 field, so searching config.json for "test_metrics" found nothing.

The generated config now contains the human-readable field directly:

"discrepancy_check": {
    "type": "OnnxDiscrepancyCheck",
    "reference_model_path": "...",
    "report_output_dir": "...",
    "test_metrics": ["speedup"]
}

Changes made:

  • Added test_metrics: Optional[list[str]] parameter to OnnxDiscrepancyCheck._default_config with a description explaining it controls both speedup timing and MAE threshold.
  • In _run_for_config, when test_metrics is set it takes precedence: "speedup" enables timing, "mae" enforces the MAE threshold. The old timing_iterations/max_mae fields remain supported for backward compatibility with hand-crafted configs.
  • add_discrepancy_check_pass now writes test_metrics instead of the implicit timing_iterations/max_mae fields, so the intent is always visible in config.json.

After olive optimize --dry_run --test out/tiny-test --test_metrics speedup, you will see "test_metrics": ["speedup"] in the generated config and olive run --config ... --test will respect it without requiring --test_metrics to be passed again.

@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot File "/home/xadupre/github/Olive/olive/passes/onnx/discrepancy_check.py", line 561, in _run_for_config
failures.append(f"Max absolute error {max_abs_error:.6f} exceeds threshold {config.max_mae:.6f}")

Copilot AI commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@copilot File "/home/xadupre/github/Olive/olive/passes/onnx/discrepancy_check.py", line 561, in _run_for_config
failures.append(f"Max absolute error {max_abs_error:.6f} exceeds threshold {confi...

Fixed. The failure message at line 561 was using config.max_mae directly, which is None when the MAE threshold is derived from test_metrics (e.g. ["mae"]). Changed to use effective_max_mae — the already-resolved value — so formatting never hits None.

@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot eager implementation is the default? it should be attn_impl

Copilot AI commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@copilot eager implementation is the default? it should be attn_impl

Added an attn_impl parameter (Optional[str], default None) to OnnxDiscrepancyCheck._default_config. It is forwarded as attn_implementation to AutoModelForCausalLM.from_pretrained; when None the model's own default is used.

@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot comand run overwrite olive config.json with transformers config.json

@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot test_model_path is not used?

@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot increase the number of logged info message in OnnxDiscrepancyCheck, every time is metric is computed, every time a file is saved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants