perf(desktop): make the floating bar feel instant#7635
Conversation
Cut latency on every floating-bar query (text + voice) without changing answer quality or which model runs. All wins are general — no benchmark prompts are hardcoded. - Downscale screen captures to a 1568px longest edge (Claude's max useful vision resolution). ~10x less encode + upload + vision tokens, so this helps speed AND cost. Quality-neutral since Claude downscales anyway. - Take screenshot capture off the critical path: it now starts when a query begins routing and overlaps the router classification, instead of blocking the query on a fresh full-Retina grab. - Skip the ~300-700ms Haiku router for unambiguously conversational queries via a local heuristic (FloatingRouterHeuristic). Anything with an action signal still defers to the LLM router, so no new misroutes are introduced. - Voice: halve the live-mode finalize safety timeout (3s -> 1.5s). Accumulated interim text is already the fallback, so an early cutoff never drops words. - Render the first streamed token immediately instead of behind a 0.4s entrance spring; only the window resize animates now. - Add timing instrumentation (capture ms + KB, time-to-first-token, total) to the dev log for objective before/after measurement. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…hing The pi-mono chat proxy resends an identical system prompt and tool-definition block on every turn of a session. Mark the stable prefix with cache_control breakpoints so Anthropic caches it (5-min TTL, renewed on each hit), cutting input-token processing latency and cost on every subsequent message. - system is now serialized as a content-block array with a cache_control ephemeral breakpoint (was a bare string). - The last translated tool gets a cache_control breakpoint, so all tool definitions are cached. AnthropicUsage already parses and prices cache_creation/cache_read input tokens, so cost accounting was already built for this — it just wasn't being triggered. Tests updated for the new system shape. Note: not compile-verified locally (no Rust toolchain on the dev machine); relies on CI (cargo check/test) to build. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Greptile SummaryThis PR reduces floating-bar latency through six coordinated changes: downscaling screenshots to 1568px, overlapping capture with router classification, a local heuristic to skip the LLM router for conversational queries, halving the Deepgram finalization timeout, removing the 0.4s entrance animation, and adding Anthropic prompt-caching breakpoints to the system prompt and tool definitions.
Confidence Score: 3/5Client-side changes are safe to ship, but the backend prompt-caching change needs the missing beta header before it is safe to deploy — the system prompt format change from a plain string to a JSON array could break all Anthropic requests through the Rust proxy. The Rust backend now sends the system field as a JSON array of content blocks instead of a plain string, and includes cache_control in tool definitions, but neither handle_streaming nor handle_non_streaming sets the anthropic-beta header Anthropic requires. If Anthropic rejects the changed format, every chat completion routed through this backend would return an error. The floating-bar Swift changes are lower risk, though heuristic signal list gaps mean some agent-bound queries silently go to inline chat instead. desktop/Backend-Rust/src/routes/chat_completions.rs needs the anthropic-beta header; desktop/Desktop/Sources/FloatingControlBar/FloatingRouterHeuristic.swift needs its action-signal list audited against real LLM-router training data. Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant FCB as FloatingControlBar
participant Heuristic as FloatingRouterHeuristic
participant SCM as ScreenCaptureManager
participant Router as AgentPillsManager
participant Provider as ChatProvider
participant Backend as Backend-Rust
participant Anthropic
User->>FCB: Submit query
FCB->>FCB: prepareVisibleQueryState()
FCB->>SCM: beginScreenshotCapture() [Task.detached]
Note over SCM: Capture & downscale to 1568px async
FCB->>Heuristic: precheck(message)
alt No action-verb signal - .chat fast path
Heuristic-->>FCB: .chat
FCB->>FCB: consumePendingScreenshot()
FCB->>Provider: sendMessage()
Provider->>Backend: POST /v2/chat/completions
Backend->>Anthropic: POST /v1/messages system array + cache_control
Anthropic-->>Backend: stream
Backend-->>Provider: stream
Provider-->>FCB: "first token, showingAIResponse = true immediately"
else Has action-verb signal - .uncertain
Heuristic-->>FCB: .uncertain
Note over SCM,Router: Screenshot and router run concurrently
FCB->>Router: classify(message) 300-700ms
Router-->>FCB: .agent or .chat
alt .agent
FCB->>FCB: spawnAgentPill, pendingScreenshotTask abandoned
else .chat
FCB->>FCB: consumePendingScreenshot()
FCB->>Provider: sendMessage()
Provider->>Backend: POST /v2/chat/completions
Backend->>Anthropic: stream
Anthropic-->>Backend: stream
Backend-->>Provider: stream
Provider-->>FCB: tokens stream
end
end
|
| private static let actionSignals: [String] = [ | ||
| "build", "rebuild", "implement", "refactor", "debug", | ||
| "create a", "create an", "make me", "generate a", "compose", "draft a", | ||
| "write a", "write an", "write me", "write some", "edit", "modify", | ||
| "rename", "delete the", "remove the", | ||
| "send", "email", "reply to", "respond to", "post", "tweet", "dm", | ||
| "open", "launch", "navigate", "go to", "browse", "click", "fill out", | ||
| "download", "install", "deploy", "commit", "push the", | ||
| "schedule a", "book a", "order", "automate", "set up", "set-up", | ||
| "move the", "organize", "clean up", "update the", "code up", | ||
| ] | ||
|
|
||
| static func precheck(_ rawMessage: String) -> Precheck { | ||
| let message = rawMessage.lowercased() | ||
| guard !message.isEmpty else { return .chat } | ||
| for signal in actionSignals where containsWord(message, signal) { | ||
| return .uncertain | ||
| } | ||
| return .chat |
There was a problem hiding this comment.
Common action verbs missing from signal list cause silent misroutes
The invariant stated in the comment — "this never creates a misroute the router wouldn't also make" — depends on the signal list being complete. In practice, several high-frequency action verbs are absent: fix, run, execute, close, convert, resize, switch, type. A user saying "fix this bug" or "run this script" will hit no signal, fall through to .chat, and skip the LLM router entirely. If the LLM router would have classified those as .agent (dispatching a background pill that can actually take system actions), the query is silently downgraded to inline chat instead. The heuristic's safe-fallback guarantee is only valid when the signal list is comprehensive enough to catch any query the router would route to agent.
Summary
Makes Omi Desktop's floating bar respond noticeably faster for every query — text and voice — without changing answer quality or which model runs. Every optimization is general; no specific prompts are hardcoded.
The floating-bar request path is:
I instrumented and profiled this path. The pre-warm of the bridge + floating session at launch is already in place, so the remaining latency lived in a handful of concrete, low-risk spots. This PR fixes them.
What changed & why
1. Downscale screen captures (speed and cost)
ScreenCaptureManager.captureScreenData()encoded screenshots at the display's native resolution (often 3–5K) on every query. Claude downscales vision input to ~1568px on the longest side anyway, so the extra pixels were pure waste — encode time, upload bandwidth, vision tokens (cost), and server-side processing.2. Take screenshot capture off the critical path
sendAIQueryused toawaita fresh full-screen grab before sending the query. Now the capture starts the moment a query begins routing (beginScreenshotCapture) and overlaps the ~300–700ms router classification;sendAIQueryjust consumes the result. The capture is no longer on the blocking path for any query.3. Skip the LLM router for obvious chat (new
FloatingRouterHeuristic)Every query first awaited a Haiku router LLM call (chat vs. background-agent) before a single token streamed — a ~300–700ms serial tax even on "hi, how are you?".
A cheap local heuristic now returns
chatfor queries with no action-verb signal, letting those skip the LLM router entirely. Anything with an action signal (build,send,open,create a, …) still defers to the LLM router, so no new misroutes are introduced — an "agent" task by definition involves an action verb. This removes the router from the conversational majority (including all the common test prompts).4. Voice: finalize sooner
The live-mode (Deepgram) finalize waited a flat 3s safety timeout. Halved to 1.5s. Accumulated interim text is already the fallback in
sendTranscript, so an early cutoff never drops words — it just stops idling for the worst case. (Omni STT's relay timeout and batch mode are untouched.)5. Render the first token immediately
The first streamed token was gated behind a 0.4s entrance spring. Now the response container reveals immediately and only the window resize animates, so text paints on frame one.
6. Anthropic prompt caching for the tools+system prefix (backend)
The pi-mono chat proxy (
Backend-Rust) resends an identical system prompt + tool-definition block every turn. This PR marks that stable prefix withcache_controlbreakpoints (system content-block + last tool), so Anthropic caches it (5-min TTL, renewed on hit) — cutting input-token latency and cost on every subsequent message.AnthropicUsagealready parsed and pricedcache_*_input_tokens, so the cost accounting was already built for this; it just wasn't being triggered.Measurement
Added timing logs to the dev log (
/tmp/omi-dev.log) for objective before/after:ScreenCaptureManager: captured WxH → wxh, WebP NN KB in NNmsFloatingBar: time-to-first-token NNmsFloatingBar: response complete in NNmsTo test the client wins against production (no local Rust needed):
Run a mix of text + voice (⌥) prompts and compare the logged numbers against
main.Notes for reviewers
cargo check/cargo test. The two affected tests were updated for the newsystemshape.Test plan
xcrun swift build -c debug --package-path Desktop) — verified locally,Build complete.🤖 Generated with Claude Code