KV prefix cache #23
Merged
Previously every completion cleared the context's KV cache and prefilled the entire prompt (~45s for a 20k-token agent prompt at ~450 tok/s). Now, when a request sets cache_prompt, the runtime keeps a per-model record of the tokens last decoded into the context (ModelRuntime::LoadedModelState::kvCacheTokens), computes the longest common prefix with the new prompt, erases only the divergent tail (llama_memory_seq_rm) and prefills just the suffix.

- Record covers prompt + generated tokens, so the typical agent turn (previous prompt + reply + new tool results) reuses nearly all of it
- At least one prompt token is always re-decoded so sampling has fresh logits at the final position
- Record is cleared at inference start and only repopulated on success; errors/aborts fall back to a full clear on the next request
- Embedding requests share the context, so they invalidate the record
- Guarded by the existing inference mutex; record dies with the model state on unload/swap
- No behaviour change when cache_prompt is absent/false

Tests: kvPrefixReuseLength unit cases in llamaProviderTests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
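For readers who want the prefix-reuse rule in concrete form, here is a minimal sketch of the computation described above. It is not the code from this PR: the name kvPrefixReuseLength is taken from the test description, but its signature here is an assumption, as are the Token alias (a stand-in for llama_token) and the hypothetical planPrefill helper. The sketch only models the token record; in the real runtime the truncation step would correspond to erasing the divergent tail with llama_memory_seq_rm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Token = std::int32_t;  // assumption: stand-in for llama_token

// Assumed signature. Returns how many leading tokens of `prompt` can be
// kept in the KV cache, given the tokens last decoded into the context.
std::size_t kvPrefixReuseLength(const std::vector<Token>& cached,
                                const std::vector<Token>& prompt) {
    const std::size_t limit = std::min(cached.size(), prompt.size());
    std::size_t common = 0;
    while (common < limit && cached[common] == prompt[common]) {
        ++common;
    }
    // Always leave at least one prompt token to decode, so sampling has
    // fresh logits at the final position.
    if (!prompt.empty() && common == prompt.size()) {
        --common;
    }
    return common;
}

// Hypothetical helper: truncates the record to the reusable prefix and
// returns the suffix that still has to be prefilled. Truncating `cached`
// mirrors erasing positions [keep, end) from the context's KV cache.
std::vector<Token> planPrefill(std::vector<Token>& cached,
                               const std::vector<Token>& prompt) {
    const std::size_t keep = kvPrefixReuseLength(cached, prompt);
    assert(keep <= cached.size());
    cached.resize(keep);
    return std::vector<Token>(
        prompt.begin() + static_cast<std::ptrdiff_t>(keep), prompt.end());
}
```

With a record of {1, 2, 3} and a new prompt of {1, 2, 3, 4, 5}, this keeps three tokens and prefills two. If the prompt is identical to the record, it keeps all but the last token, which is decoded again to produce logits.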
The route builds CompletionRequest field-by-field and was silently dropping the cache_prompt flag, so KV prefix reuse never engaged end-to-end.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
No description provided.