Merge 3 existing PR related to OnnxDiscrepancyCheck + llama.cpp integration#2546
Merge 3 existing PR related to OnnxDiscrepancyCheck + llama.cpp integration#2546xadupre wants to merge 51 commits into
Conversation
… xadupre/merged
… xadupre/merged
There was a problem hiding this comment.
Pull request overview
This PR consolidates three earlier changes around OnnxDiscrepancyCheck to improve test-mode metrics: it exposes latency values alongside speedup, adds time-to-first-token style generation metrics, and extends olive run --test so users can opt into additional metrics via --test_metrics.
Changes:
- Extend
OnnxDiscrepancyCheck.compare_generationto return a metrics dict including TTFT / time-to-first-N for both transformers and ORT GenAI. - Update speedup measurement to return and persist average PyTorch/ONNX latencies plus the computed speedup.
- Add CLI support for selecting
--testmetrics (--test_metrics mae|speedup) and update/extend tests + docs accordingly.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
olive/passes/onnx/discrepancy_check.py |
Adds/returns latency metrics for generation and speedup runs; logs formatted latency values. |
olive/cli/base.py |
Introduces --test_metrics, injects discrepancy pass based on selected metrics, and warns when --test_metrics is unused. |
olive/cli/run.py |
Wires --test_metrics through to the injected discrepancy pass and warning helper. |
test/passes/onnx/test_discrepancy_check.py |
Updates unit tests for new generation result shape and new speedup return tuple. |
test/cli/test_cli.py |
Adjusts CLI workflow test config to reflect timing being disabled in certain cases. |
test/cli/test_base.py |
Adds focused tests for default metric selection behavior and warning behavior. |
docs/source/how-to/cli/cli-fast-test.md |
Documents metric selection for --test runs. |
|
@copilot in job Ubuntu test model fast, add a step which create a virtual environment called llama_env and install llama-cpp-python, then in OnnxDiscrepancy check, add a flag called llama-cpp which converts the dumped model into gguf format, measures the first token difference and the latencies and speed up, every use of llama-cpp is done through the virtual environment |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
@copilot command line python -m olive optimize |
Fixed in the latest commit. Two changes in
|
|
@copilot I want test_metrics to be saved, they should not be needed a second time with command run |
…explicitly on update
Done. Two changes in the latest commit:
|
|
@copilot I ran python -m olive optimize --model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider --precision fp32 --output_path out/tiny --dry_run --test out/tiny-test --test_metrics speedup --log_level 2 and no test_metrics in config.json |
Fixed in the latest commit. The root cause was that The generated config now contains the human-readable field directly: "discrepancy_check": {
"type": "OnnxDiscrepancyCheck",
"reference_model_path": "...",
"report_output_dir": "...",
"test_metrics": ["speedup"]
}Changes made:
After |
|
@copilot File "/home/xadupre/github/Olive/olive/passes/onnx/discrepancy_check.py", line 561, in _run_for_config |
Fixed. The failure message at line 561 was using |
|
@copilot eager implementation is the default? it should be attn_impl |
…ntion implementation
Added an |
|
@copilot comand run overwrite olive config.json with transformers config.json |
|
@copilot test_model_path is not used? |
|
@copilot increase the number of logged info message in OnnxDiscrepancyCheck, every time is metric is computed, every time a file is saved |
Describe your changes
Merges #2536, #2535, #2534.
Additionally adds llama.cpp integration and other improvements to
OnnxDiscrepancyCheck:llama_cppflag (bool, defaultFalse) onOnnxDiscrepancyCheck— when enabled, converts the reference HuggingFace model to GGUF format usingconvert_hf_to_gguf.pyfrom llama.cpp and compares inference with llama.cpp.llama_cpp_env_pathparameter (Optional[str]) — path to thellama_envvirtual environment wherellama-cpp-pythonandconvert_hf_to_gguf.pyare installed (defaults to"llama_env"relative to cwd). The virtual environment also isolates potentially conflicting versions oftorch,transformers, etc.compare_llama_cpp()method — saves the reference model and tokenizer tooutput_dir/hf_model(alongside the saved test model/report output) usingsave_pretrained(standard HuggingFace format), then:convert_hf_to_gguf.py(the official llama.cpp conversion CLI) insidellama_envvia subprocess to produce a GGUF F32 file atoutput_dir/model.gguf.llama_envviasubprocess.runto measure first-token latency withllama_cpp.Llama.Results include first-token match vs PyTorch,
llama_cpp_ttft_s,llama_cpp_ttfn_s,llama_cpp_total_time_s,llama_cpp_speedup_vs_pytorch, andllama_cpp_speedup_vs_onnx. Allllama_cpp/ggufimports are strictly isolated to the subprocess — the main Olive process never imports them.--test_llama_pathCLI option — specifies the path to thellama_envvirtual environment when runningolive optimize --test. When provided, the injectedOnnxDiscrepancyCheckpass automatically enablesllama_cpp=Trueand forwards the path asllama_cpp_env_path. Using--test_llama_pathwithout--testemits a warning.--test_metricsparsing — now accepts both space-separated (--test_metrics mae speedup) and comma-separated (--test_metrics mae,speedup) forms. A_parse_test_metricstype function handles splitting and validation per token, and a_flatten_test_metricshelper normalises the result before it is forwarded to the pass.add_discrepancy_check_passupdate-in-place — when a config was previously generated byolive optimize --dry_run --test, theOnnxDiscrepancyCheckpass was already present and was silently skipped on subsequentolive run --test --test_metrics …calls. The function now updates the existing pass in-place, refreshingreference_model_path(resolved to absolute),report_output_dir, metric settings (max_mae/timing_iterations), and llama.cpp settings — so--test_metrics,--output_path, and--test_llama_pathfromolive runalways take effect.ModelBuildernow copies the test model directory toreference_hf_model/alongside the generated ONNX output in the engine cache.OnnxDiscrepancyCheckfalls back to this cached copy whenreference_model_path(e.g.out/tiny-test) no longer exists on disk, fixing theOSError: Repo id must be in the form 'repo_name' or 'namespace/repo_name'error that occurred on subsequentolive runcalls that hit the model-builder cache.SaveTestModelConfigpass (olive/passes/pytorch/save_test_model_config.py) — a new Olive pass injected at the beginning of the passes list when--testis active. It takes anHfModelHandler, writesconfig.json(with the reduced hidden-layer count) and the Olive test-model marker file totest_model_path, and returns the model unchanged. This ensurestest_model_pathalways exists as a config-only directory beforeModelBuilderor any downstream pass needs it, replacing the previous_save_test_model_config_for_dry_runstandalone function.test-model-fast.yml) — new step that creates allama_envvirtual environment, installsgguf,safetensors,llama-cpp-python(from pre-built CPU wheels athttps://abetlen.github.io/llama-cpp-python/whl/cpu),transformers,sentencepiece, andprotobuf, and downloadsconvert_hf_to_gguf.pyfrom the llama.cpp GitHub repository.cli-fast-test.md) — clarifies WHERE the 2-layer reduction happens (_apply_test_model_configinolive/common/hf/utils.py, called during the model-builder pass ofolive run) and WHENout/tiny-testis created (by theSaveTestModelConfigpass on firstolive run, completed with weights byModelBuilder); documents the cache fallback behaviour; explains the--test_llama_pathoption andllama_envsetup; and clarifies that--test_metricsis always respected even when the config was generated byolive optimize --dry_run --test.Checklist before requesting a review
lintrunner -a(Optional) Issue link