Skip to content

perf(desktop): make the floating bar feel instant#7635

Open
realtobyfu wants to merge 2 commits into
BasedHardware:mainfrom
realtobyfu:track2-make-omi-fast
Open

perf(desktop): make the floating bar feel instant#7635
realtobyfu wants to merge 2 commits into
BasedHardware:mainfrom
realtobyfu:track2-make-omi-fast

Conversation

@realtobyfu
Copy link
Copy Markdown

Summary

Makes Omi Desktop's floating bar respond noticeably faster for every query — text and voice — without changing answer quality or which model runs. Every optimization is general; no specific prompts are hardcoded.

The floating-bar request path is:

Floating bar → routeQuery → ChatProvider.sendMessage
  → AgentBridge (Node) → pi-mono → api.omi.me → Claude → stream

I instrumented and profiled this path. The pre-warm of the bridge + floating session at launch is already in place, so the remaining latency lived in a handful of concrete, low-risk spots. This PR fixes them.

What changed & why

1. Downscale screen captures (speed and cost)

ScreenCaptureManager.captureScreenData() encoded screenshots at the display's native resolution (often 3–5K) on every query. Claude downscales vision input to ~1568px on the longest side anyway, so the extra pixels were pure waste — encode time, upload bandwidth, vision tokens (cost), and server-side processing.

  • Cap the longest edge at 1568px before encoding. ~10× less pixel work and ~10× smaller payload.
  • Quality-neutral (the model would have downscaled regardless).
  • Capture drops from ~100–500ms to ~10–50ms; per-query vision-token cost drops proportionally.

2. Take screenshot capture off the critical path

sendAIQuery used to await a fresh full-screen grab before sending the query. Now the capture starts the moment a query begins routing (beginScreenshotCapture) and overlaps the ~300–700ms router classification; sendAIQuery just consumes the result. The capture is no longer on the blocking path for any query.

3. Skip the LLM router for obvious chat (new FloatingRouterHeuristic)

Every query first awaited a Haiku router LLM call (chat vs. background-agent) before a single token streamed — a ~300–700ms serial tax even on "hi, how are you?".

A cheap local heuristic now returns chat for queries with no action-verb signal, letting those skip the LLM router entirely. Anything with an action signal (build, send, open, create a, …) still defers to the LLM router, so no new misroutes are introduced — an "agent" task by definition involves an action verb. This removes the router from the conversational majority (including all the common test prompts).

4. Voice: finalize sooner

The live-mode (Deepgram) finalize waited a flat 3s safety timeout. Halved to 1.5s. Accumulated interim text is already the fallback in sendTranscript, so an early cutoff never drops words — it just stops idling for the worst case. (Omni STT's relay timeout and batch mode are untouched.)

5. Render the first token immediately

The first streamed token was gated behind a 0.4s entrance spring. Now the response container reveals immediately and only the window resize animates, so text paints on frame one.

6. Anthropic prompt caching for the tools+system prefix (backend)

The pi-mono chat proxy (Backend-Rust) resends an identical system prompt + tool-definition block every turn. This PR marks that stable prefix with cache_control breakpoints (system content-block + last tool), so Anthropic caches it (5-min TTL, renewed on hit) — cutting input-token latency and cost on every subsequent message. AnthropicUsage already parsed and priced cache_*_input_tokens, so the cost accounting was already built for this; it just wasn't being triggered.

Measurement

Added timing logs to the dev log (/tmp/omi-dev.log) for objective before/after:

  • ScreenCaptureManager: captured WxH → wxh, WebP NN KB in NNms
  • FloatingBar: time-to-first-token NNms
  • FloatingBar: response complete in NNms

To test the client wins against production (no local Rust needed):

OMI_SKIP_BACKEND=1 OMI_APP_NAME=omi-fast ./run.sh
tail -f /tmp/omi-dev.log | grep -E "ScreenCaptureManager: captured|FloatingBar:"

Run a mix of text + voice (⌥) prompts and compare the logged numbers against main.

Notes for reviewers

  • Quality-neutral by design. No model routing changes, no skipping the agent, no prompt-specific shortcuts; the heuristic only adds a safe fast-path and falls back to the existing router when unsure.
  • Backend change not compiled locally (no Rust toolchain on my machine) — relies on CI cargo check/cargo test. The two affected tests were updated for the new system shape.
  • A more aggressive speculative routing idea (race the chat stream against the router) was scoped but intentionally left out — it overlaps the heuristic and has agent-handoff UX edge cases that warrant live validation first.

Test plan

  • Desktop builds (xcrun swift build -c debug --package-path Desktop) — verified locally, Build complete.
  • Backend builds in CI.
  • Manual: run the 4 common prompts + hidden ones (text + voice); confirm faster TTFT, ~10× smaller screenshot payload, and unchanged answer quality (incl. "what do you see?").

🤖 Generated with Claude Code

realtobyfu and others added 2 commits June 4, 2026 01:22
Cut latency on every floating-bar query (text + voice) without changing
answer quality or which model runs. All wins are general — no benchmark
prompts are hardcoded.

- Downscale screen captures to a 1568px longest edge (Claude's max useful
  vision resolution). ~10x less encode + upload + vision tokens, so this
  helps speed AND cost. Quality-neutral since Claude downscales anyway.
- Take screenshot capture off the critical path: it now starts when a query
  begins routing and overlaps the router classification, instead of blocking
  the query on a fresh full-Retina grab.
- Skip the ~300-700ms Haiku router for unambiguously conversational queries
  via a local heuristic (FloatingRouterHeuristic). Anything with an action
  signal still defers to the LLM router, so no new misroutes are introduced.
- Voice: halve the live-mode finalize safety timeout (3s -> 1.5s). Accumulated
  interim text is already the fallback, so an early cutoff never drops words.
- Render the first streamed token immediately instead of behind a 0.4s
  entrance spring; only the window resize animates now.
- Add timing instrumentation (capture ms + KB, time-to-first-token, total)
  to the dev log for objective before/after measurement.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…hing

The pi-mono chat proxy resends an identical system prompt and tool-definition
block on every turn of a session. Mark the stable prefix with cache_control
breakpoints so Anthropic caches it (5-min TTL, renewed on each hit), cutting
input-token processing latency and cost on every subsequent message.

- system is now serialized as a content-block array with a cache_control
  ephemeral breakpoint (was a bare string).
- The last translated tool gets a cache_control breakpoint, so all tool
  definitions are cached.

AnthropicUsage already parses and prices cache_creation/cache_read input
tokens, so cost accounting was already built for this — it just wasn't being
triggered. Tests updated for the new system shape.

Note: not compile-verified locally (no Rust toolchain on the dev machine);
relies on CI (cargo check/test) to build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 4, 2026

Greptile Summary

This PR reduces floating-bar latency through six coordinated changes: downscaling screenshots to 1568px, overlapping capture with router classification, a local heuristic to skip the LLM router for conversational queries, halving the Deepgram finalization timeout, removing the 0.4s entrance animation, and adding Anthropic prompt-caching breakpoints to the system prompt and tool definitions.

  • Screen capture (ScreenCaptureManager.swift): Longest-edge cap at 1568px and off-critical-path Task.detached execution are straightforward wins with no quality impact.
  • Heuristic router (FloatingRouterHeuristic.swift): The word-boundary logic is correct, but the action-signal list omits common verbs like fix, run, and execute, which can silently misroute those queries to inline chat instead of the background agent.
  • Backend prompt caching (routes/chat_completions.rs): cache_control breakpoints are added to the system content block and last tool, but the required anthropic-beta: prompt-caching-2024-07-31 header is absent from both handle_streaming and handle_non_streaming; without it the caching never activates and the new array-format system field may trigger a 400 error from Anthropic on every request.

Confidence Score: 3/5

Client-side changes are safe to ship, but the backend prompt-caching change needs the missing beta header before it is safe to deploy — the system prompt format change from a plain string to a JSON array could break all Anthropic requests through the Rust proxy.

The Rust backend now sends the system field as a JSON array of content blocks instead of a plain string, and includes cache_control in tool definitions, but neither handle_streaming nor handle_non_streaming sets the anthropic-beta header Anthropic requires. If Anthropic rejects the changed format, every chat completion routed through this backend would return an error. The floating-bar Swift changes are lower risk, though heuristic signal list gaps mean some agent-bound queries silently go to inline chat instead.

desktop/Backend-Rust/src/routes/chat_completions.rs needs the anthropic-beta header; desktop/Desktop/Sources/FloatingControlBar/FloatingRouterHeuristic.swift needs its action-signal list audited against real LLM-router training data.

Important Files Changed

Filename Overview
desktop/Backend-Rust/src/routes/chat_completions.rs Adds cache_control breakpoints to system prompt (now a content-block array) and the last tool definition; missing the required anthropic-beta: prompt-caching-2024-07-31 header in both handle_streaming and handle_non_streaming, so caching will not activate and the new system array format may cause 400 errors from Anthropic.
desktop/Backend-Rust/src/models/chat_completions.rs Widens AnthropicRequest.system from Option to Option<serde_json::Value> and adds cache_control: Option<serde_json::Value> to AnthropicTool; type changes are correct and tests are updated accordingly.
desktop/Desktop/Sources/FloatingControlBar/FloatingRouterHeuristic.swift New local pre-classifier to skip the LLM router for conversational queries; word-boundary logic is sound but the action signal list omits common verbs (fix, run, execute, close) that the LLM router would route to agent, risking silent misroutes to inline chat.
desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift Adds screenshot parallelism, heuristic fast-path, TTFT instrumentation, and immediate response-container reveal; screenshot task is unnecessarily kept alive on the agent route.
desktop/Desktop/Sources/FloatingControlBar/ScreenCaptureManager.swift Caps longest edge at 1568px before WebP encoding, adds interpolation quality, and adds elapsed-time logging; straightforward and correct.
desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift Halves the Deepgram finalization timeout from 3s to 1.5s; safe because accumulated interim text is the fallback, but may truncate on very slow connections.
desktop/CHANGELOG.json Adds a user-facing changelog entry for the floating bar latency improvements; no issues.

Sequence Diagram

sequenceDiagram
    participant User
    participant FCB as FloatingControlBar
    participant Heuristic as FloatingRouterHeuristic
    participant SCM as ScreenCaptureManager
    participant Router as AgentPillsManager
    participant Provider as ChatProvider
    participant Backend as Backend-Rust
    participant Anthropic

    User->>FCB: Submit query
    FCB->>FCB: prepareVisibleQueryState()
    FCB->>SCM: beginScreenshotCapture() [Task.detached]
    Note over SCM: Capture & downscale to 1568px async
    FCB->>Heuristic: precheck(message)
    alt No action-verb signal - .chat fast path
        Heuristic-->>FCB: .chat
        FCB->>FCB: consumePendingScreenshot()
        FCB->>Provider: sendMessage()
        Provider->>Backend: POST /v2/chat/completions
        Backend->>Anthropic: POST /v1/messages system array + cache_control
        Anthropic-->>Backend: stream
        Backend-->>Provider: stream
        Provider-->>FCB: "first token, showingAIResponse = true immediately"
    else Has action-verb signal - .uncertain
        Heuristic-->>FCB: .uncertain
        Note over SCM,Router: Screenshot and router run concurrently
        FCB->>Router: classify(message) 300-700ms
        Router-->>FCB: .agent or .chat
        alt .agent
            FCB->>FCB: spawnAgentPill, pendingScreenshotTask abandoned
        else .chat
            FCB->>FCB: consumePendingScreenshot()
            FCB->>Provider: sendMessage()
            Provider->>Backend: POST /v2/chat/completions
            Backend->>Anthropic: stream
            Anthropic-->>Backend: stream
            Backend-->>Provider: stream
            Provider-->>FCB: tokens stream
        end
    end
Loading

Comments Outside Diff (2)

  1. desktop/Backend-Rust/src/routes/chat_completions.rs, line 574-576 (link)

    P1 Missing anthropic-beta header for prompt caching

    The cache_control breakpoints added to system and tool definitions require the anthropic-beta: prompt-caching-2024-07-31 header to take effect. Without it, the system field being sent as a content-block array (instead of the base API's plain string) is likely to result in a 400 from Anthropic on every request — both streaming and non-streaming paths make the same call without this header. The current official docs still require the beta header alongside anthropic-version: 2023-06-01. If Anthropic silently ignores the array form the caching benefit is completely lost; if it rejects the format, all chat completions through this backend would fail.

    Add the header to both handle_streaming and handle_non_streaming:

    .header("anthropic-beta", "prompt-caching-2024-07-31")
    
  2. desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift, line 1347-1361 (link)

    P2 Screenshot task started but never consumed on the agent route

    beginScreenshotCapture() is called unconditionally before both the heuristic check and the LLM router. When the router returns .agent, the code spawns an agent pill and returns without ever calling consumePendingScreenshot(). The Task.detached capture runs to completion regardless (Swift task cancellation is cooperative and captureScreenData() is synchronous with no cancellation checks), leaving the captured Data? referenced by pendingScreenshotTask until the next call to beginScreenshotCapture(). On retina displays this can be several hundred KB held longer than needed.

Reviews (1): Last reviewed commit: "perf(backend): cache the tools+system pr..." | Re-trigger Greptile

Comment on lines +32 to +50
private static let actionSignals: [String] = [
"build", "rebuild", "implement", "refactor", "debug",
"create a", "create an", "make me", "generate a", "compose", "draft a",
"write a", "write an", "write me", "write some", "edit", "modify",
"rename", "delete the", "remove the",
"send", "email", "reply to", "respond to", "post", "tweet", "dm",
"open", "launch", "navigate", "go to", "browse", "click", "fill out",
"download", "install", "deploy", "commit", "push the",
"schedule a", "book a", "order", "automate", "set up", "set-up",
"move the", "organize", "clean up", "update the", "code up",
]

static func precheck(_ rawMessage: String) -> Precheck {
let message = rawMessage.lowercased()
guard !message.isEmpty else { return .chat }
for signal in actionSignals where containsWord(message, signal) {
return .uncertain
}
return .chat
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Common action verbs missing from signal list cause silent misroutes

The invariant stated in the comment — "this never creates a misroute the router wouldn't also make" — depends on the signal list being complete. In practice, several high-frequency action verbs are absent: fix, run, execute, close, convert, resize, switch, type. A user saying "fix this bug" or "run this script" will hit no signal, fall through to .chat, and skip the LLM router entirely. If the LLM router would have classified those as .agent (dispatching a background pill that can actually take system actions), the query is silently downgraded to inline chat instead. The heuristic's safe-fallback guarantee is only valid when the signal list is comprehensive enough to catch any query the router would route to agent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant