
Kv prefix cache #23

Merged
caseymcc merged 2 commits into main from kv-prefix-cache
Jun 11, 2026

Conversation

@caseymcc

Owner

No description provided.

AVA Agent and others added 2 commits June 11, 2026 10:06
Previously every completion cleared the context's KV cache and
prefilled the entire prompt (~45s for a 20k-token agent prompt at
~450 tok/s).  Now, when a request sets cache_prompt, the runtime keeps
a per-model record of the tokens last decoded into the context
(ModelRuntime::LoadedModelState::kvCacheTokens), computes the longest
common prefix with the new prompt, erases only the divergent tail
(llama_memory_seq_rm) and prefills just the suffix.

- Record covers prompt + generated tokens, so the typical agent turn
  (previous prompt + reply + new tool results) reuses nearly all of it
- At least one prompt token is always re-decoded so sampling has fresh
  logits at the final position
- Record is cleared at inference start and only repopulated on success;
  errors/aborts fall back to a full clear on the next request
- Embedding requests share the context, so they invalidate the record
- Guarded by the existing inference mutex; record dies with the model
  state on unload/swap
- No behaviour change when cache_prompt is absent/false

Tests: kvPrefixReuseLength unit cases in llamaProviderTests.
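
A minimal sketch of the prefix-reuse calculation described above, not the code from this PR. The name kvPrefixReuseLength appears in the commit message, but its real signature is not shown here, so the parameter types, the `Token` alias and the return type below are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: tokens are 32-bit integers, as llama_token is in llama.cpp.
using Token = std::int32_t;

// Hypothetical signature. Returns how many leading tokens of `prompt` can
// keep their KV entries, given the tokens last decoded into the context.
std::size_t kvPrefixReuseLength(const std::vector<Token>& cached,
                                const std::vector<Token>& prompt)
{
    // Longest common prefix of the cached record and the new prompt.
    const std::size_t limit = std::min(cached.size(), prompt.size());
    std::size_t common = 0;
    while (common < limit && cached[common] == prompt[common]) {
        ++common;
    }

    // Always leave at least one prompt token to re-decode, so sampling has
    // fresh logits at the final position.
    if (!prompt.empty() && common == prompt.size()) {
        common = prompt.size() - 1;
    }

    assert(common <= prompt.size());
    return common;
}
```

With a result of `n`, a caller would erase cache positions from `n` onward (the commit uses llama_memory_seq_rm for this) and prefill only `prompt[n..]`. When the new prompt is identical to the cached record, the result is one less than the prompt length, so the last token is decoded again.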

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The route builds CompletionRequest field-by-field and was silently
dropping the cache_prompt flag, so KV prefix reuse never engaged
end-to-end.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@caseymcc caseymcc merged commit f7c3cbf into main Jun 11, 2026
1 check passed
@caseymcc caseymcc deleted the kv-prefix-cache branch June 11, 2026 19:51
