caseymcc · caseymcc · Jun 11, 2026 · May 23, 2026 · Jun 10, 2026 · Jun 10, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,103 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## What this is
+
+ArbiterAI is a C++17 library providing a unified, embeddable interface across multiple LLM
+providers (OpenAI, Anthropic, DeepSeek, OpenRouter, llama.cpp local models, and a Mock provider for
+testing). It also ships a standalone OpenAI-compatible HTTP server (`arbiterAI-server`) with model
+lifecycle management, telemetry, and a live dashboard.
+
+## Build, test, run — everything goes through Docker
+
+All building, testing, and running happens **inside the Docker container** (`docker/Dockerfile`). The
+host is not guaranteed to have the toolchain (CMake + vcpkg + llama.cpp) or dependencies. Dependencies
+are managed by vcpkg (`vcpkg.json`).
+
+```bash
+./runDocker.sh                  # start/attach the container (bind-mounts repo at /app)
+./runDocker.sh ./build.sh       # build (runs CMake automatically if cmake files changed)
+./runDocker.sh ./build.sh --rebuild        # clean rebuild of the app
+./runDocker.sh ./build.sh --rebuild-cmake  # nuke CMake dir + re-run CMake (only if cmake is broken)
+./runDocker.sh --rebuild        # rebuild the Docker *image* (only when Dockerfile changes)
+./runDocker.sh --stop           # stop and remove the container
+```
+
+Build output: `build/${OS}_${ARCH}_${BUILD_TYPE}`, default `build/linux_x64_debug/`.
+Targets: `arbiterai` (library), `arbiterai_tests`, `arbiterAI-cli`, `arbiterAI-proxy`, `arbiterAI-server`.
+
+### Tests (Google Test)
+
+```bash
+./runDocker.sh ./build/linux_x64_debug/arbiterai_tests
+./runDocker.sh ./build/linux_x64_debug/arbiterai_tests --gtest_filter='ModelManager*'   # single suite/test
+```
+
+### Working rules
+
+- Run binaries/commands through `./runDocker.sh ...`. Do **not** use host `python`/`pip`/`pytest` or host virtualenvs — the container is the environment.
+- Do **not** launch `arbiterAI-server` yourself; ask the user to launch it so it doesn't occupy the agent terminal.
+- Avoid `2>&1` redirection — the user needs to see live output.
+
+## Configuration model
+
+Model/provider configs are JSON, loaded by `ModelManager` (singleton) with schema validation
+(`schemas/`). The default configs live in the **`arbiterAI_config` git submodule** (`arbiterAI_config/configs/defaults/{models,backends}/`).
+`ArbiterAI::initialize()` takes a list of config directories. The server merges these with runtime-injected
+configs (added/updated/removed via REST without restart) and can persist them via an override path.
+
+## Architecture
+
+Layered, strategy-pattern core (see `docs/developer.md` for the full API reference):
+
+```
+ArbiterAI (singleton factory + lifecycle)   ── src/arbiterAI/arbiterAI.{h,cpp}
+  ├─ createChatClient() → ChatClient (stateful per-session: history, tools, cache, stats)
+  ├─ owns ModelManager (singleton: config load, schema validation, model lookup, ConfigDownloader)
+  └─ stateless convenience: completion(), streamingCompletion(), batchCompletion(), getEmbeddings()
+        │ delegates to
+   BaseProvider (abstract)  ── src/arbiterAI/providers/baseProvider.h
+     OpenAI · Anthropic · DeepSeek · OpenRouter · Llama (local) · Mock
+```
+
+- **Providers** are instantiated by a `switch` in `arbiterAI.cpp` keyed on the provider string (`createProvider`-style factory). To add a provider: create `providers/<name>.{h,cpp}` subclassing `BaseProvider`, add it to that switch, add the source to `CMakeLists.txt`, and add a model config JSON.
+- **Error handling is error-code based** (`ErrorCode` enum), not exceptions — follow this; avoid try/catch where an error code works.
+
+### Local model subsystem (llama.cpp)
+
+Distinct from the cloud providers, this is the heavier piece:
+
+- **`ModelRuntime`** (`modelRuntime.{h,cpp}`) — multi-model loading into VRAM/RAM, swap queueing, LRU eviction, GGUF-aware load-failure classification (`LoadFailureReason`/`LoadErrorDetail`).
+- **`InferenceScheduler`** (`inferenceScheduler.{h,cpp}`) — request pipeline with stages (Queued → Tokenizing → WaitingAccelerator → Inferring → Complete), and `TokenChannel` for streaming tokens from the accelerator thread to the HTTP thread.
+- **`HardwareDetector`** — GPU/VRAM/RAM/CPU detection; **`ModelFitCalculator`** — whether a model fits available hardware.
+- **`ModelDownloader`** / **`StorageManager`** — download GGUF files (libgit2 / HTTP), track storage, hot-ready/protected flags, cleanup.
+- **`TelemetryCollector`** — inference stats and system snapshots, surfaced by the server.
+
+### Server (`src/server/`)
+
+Separate CMake target linking `arbiterai` + cpp-httplib (httplib is a server-only dependency, kept out of
+the core library). `routes.cpp` defines the OpenAI-compatible endpoints (`/v1/chat/completions`,
+`/v1/models`, `/v1/embeddings`; streaming responses use SSE), model management, telemetry (`/api/stats`), and config
+injection. `dashboard.h`/`dashboardConfig.h` are embedded HTML/JS for the `/dashboard` UI. The server takes a
+single required config file: `arbiterAI-server -c <config.json>`. See `docs/server.md`.
+
+### Testing without API keys
+
+The **Mock provider** (`providers/mock.{h,cpp}`) returns deterministic responses driven by `<echo>...</echo>`
+tags in messages — no network or keys. Use `"provider": "mock"` in a model config. See `docs/testing.md`.
+
+## Code style (from `.roo/rules-code/` and `.github/instructions/`)
+
+- Files: **camelCase** names, `.h`/`.cpp`/`.inl`. Header guards `_PROJECT_FILENAME_EXT_`, **no `#pragma once`**.
+- Braces: open brace on a **new line** for namespaces/functions/control blocks; **same line** for struct/class definitions in headers.
+- Naming: Types `PascalCase`; functions/methods `camelCase`; class members `m_camelCase`; locals/struct vars `camelCase`; macros `UPPER_CASE`.
+- Spacing: no space around `=`, `::`, unary operators, or between a keyword/function name and `(`; spaces around comparison/logical operators; comma after, not before.
+- Pointers/refs bind to the variable: `type *var`, `type &var`. Minimize `auto`. Minimize comments — none for obvious code.
+- Includes: `""` for local files, `<>` for libraries. Namespaces: prefer explicit qualification over `using` directives; aliases allowed.
+
+## Docs map
+
+`docs/developer.md` (architecture + API), `docs/server.md` (server API), `docs/testing.md` (mock/echo),
+`docs/project.md` (goals/providers), `docs/tasks/` (active task plans). The `docs/old/` and
+`docs/development/tasks/completed/` dirs are historical.
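Review note on the "Code style" section of `CLAUDE.md`: the rules are easier to check against a concrete file. Below is a minimal sketch of a header written to those rules as I read them. The `example` namespace, the `TokenCounter` class, and the guard name are hypothetical and do not come from the repository; treating `+=` like `=` for spacing is also an assumption.

```cpp
#ifndef _EXAMPLE_TOKENCOUNTER_H_
#define _EXAMPLE_TOKENCOUNTER_H_

#include <cstddef>
#include <string>

namespace example
{

// Hypothetical type, used only to illustrate the style rules.
class TokenCounter {
public:
    // Adds the number of whitespace-separated words in text to the running total.
    void add(const std::string &text)
    {
        m_total+=countWords(text);
    }

    std::size_t total() const
    {
        return m_total;
    }

private:
    static std::size_t countWords(const std::string &text)
    {
        std::size_t words=0;
        bool inWord=false;

        for(std::size_t i=0; i < text.size(); ++i)
        {
            char c=text[i];
            bool isSpace=(c == ' ' || c == '\t' || c == '\n');

            if(!isSpace && !inWord)
            {
                ++words;
            }
            inWord=!isSpace;
        }
        return words;
    }

    std::size_t m_total=0;
};

}//namespace example

#endif//_EXAMPLE_TOKENCOUNTER_H_
```

The sketch shows the include guard in place of `#pragma once`, the class brace on the same line with function and control-block braces on new lines, `m_camelCase` members, references bound to the variable name, and explicit types instead of `auto`.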
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -75,6 +75,8 @@ set(arbiterai_src
     ./src/arbiterAI/modelRuntime.cpp
     ./src/arbiterAI/telemetryCollector.h
     ./src/arbiterAI/telemetryCollector.cpp
+    ./src/arbiterAI/inferenceScheduler.h
+    ./src/arbiterAI/inferenceScheduler.cpp
     ./src/arbiterAI/storageManager.h
     ./src/arbiterAI/storageManager.cpp
     ./src/arbiterAI/providers/baseProvider.h
@@ -141,6 +143,7 @@ target_link_libraries(arbiterai
         tests/hardwareDetectorTests.cpp
         tests/modelRuntimeTests.cpp
         tests/telemetryCollectorTests.cpp
+        tests/inferenceSchedulerTests.cpp
         tests/llamaProviderTests.cpp
         tests/storageManagerTests.cpp
         tests/serverConnectTests.cpp

diff --git a/arbiterAI_config b/arbiterAI_config
diff --git a/docker/Dockerfile b/docker/Dockerfile
@@ -1,5 +1,5 @@
 # syntax=docker/dockerfile:1
-ARG DOCKER_VERSION=1.2.1
+ARG DOCKER_VERSION=1.2.2
 FROM ubuntu:24.04
 
 # Install basic build tools, Python 3, and GPU libraries.
@@ -33,6 +33,7 @@ RUN apt-get update && apt-get install -y \
     vulkan-tools \
     libvulkan-dev \
     mesa-vulkan-drivers \
+    spirv-headers \
     glslc \
     glslang-tools \
     wget \

diff --git a/docs/developer.md b/docs/developer.md
@@ -53,14 +53,15 @@ ArbiterAI follows a layered architecture:
 - **[`ModelManager`](../src/arbiterAI/modelManager.h)** — Singleton that loads and manages model configurations from JSON files with schema validation.
 - **Utility Components** — Cross-cutting functionality including caching ([`CacheManager`](../src/arbiterAI/cacheManager.h)), cost tracking ([`CostManager`](../src/arbiterAI/costManager.h)), model downloading ([`ModelDownloader`](../src/arbiterAI/modelDownloader.h)), and file verification ([`FileVerifier`](../src/arbiterAI/fileVerifier.h)).
 
-### Planned Components
+### Local Model Components
 
-See [Local Model Management Task](tasks/local_model_management.md) for upcoming additions:
+Components supporting local (llama.cpp) models — see [Local Model Management Task](tasks/local_model_management.md) for background:
 
-- **`HardwareDetector`** — GPU/RAM/CPU detection (NVML + Vulkan)
-- **`ModelRuntime`** — Multi-model loading, swap queueing, LRU eviction (refactor of `LlamaInterface`)
-- **`TelemetryCollector`** — Inference stats and system snapshots
-- **Standalone Server** — Separate `arbiterAI-server` application providing an OpenAI-compatible API, model management endpoints, and a live stats dashboard
+- **[`HardwareDetector`](../src/arbiterAI/hardwareDetector.h)** — GPU/RAM/CPU detection (NVML + Vulkan)
+- **[`ModelRuntime`](../src/arbiterAI/modelRuntime.h)** — Multi-model loading, swap queueing, LRU eviction, load-failure classification
+- **[`InferenceScheduler`](../src/arbiterAI/inferenceScheduler.h)** — Inference pipeline used by the server for local models. HTTP threads submit jobs; a tokenizer thread loads the model and pre-tokenizes the prompt; per-accelerator worker threads run inference. Streaming tokens flow back to the HTTP thread through a `TokenChannel`, and jobs are cancelled on client disconnect. Active jobs are exposed at `/api/scheduler/jobs`.
+- **[`TelemetryCollector`](../src/arbiterAI/telemetryCollector.h)** — Inference stats and system snapshots
+- **Standalone Server** — Separate `arbiterAI-server` application providing an OpenAI-compatible API, model management endpoints, and a live stats dashboard (see [Server Guide](server.md))
 
 ---
 

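Review note on the `InferenceScheduler` entry in `docs/developer.md`: the hand-off it describes (a worker thread produces tokens, the HTTP thread consumes them, and a client disconnect cancels the job) can be sketched roughly as below. `SketchTokenChannel` and `streamThroughChannel` are hypothetical names for illustration; this is not the repository's `TokenChannel`, whose interface is not shown in this diff.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a producer/consumer token hand-off.
class SketchTokenChannel {
public:
    // Producer side (worker thread). Returns false once the consumer has cancelled.
    bool push(const std::string &token)
    {
        std::lock_guard<std::mutex> lock(m_mutex);

        if(m_cancelled)
        {
            return false;
        }
        m_tokens.push(token);
        m_ready.notify_one();
        return true;
    }

    // Producer side: signals that no more tokens will be pushed.
    void close()
    {
        std::lock_guard<std::mutex> lock(m_mutex);

        m_closed=true;
        m_ready.notify_all();
    }

    // Consumer side (HTTP thread): called when the client disconnects.
    void cancel()
    {
        std::lock_guard<std::mutex> lock(m_mutex);

        m_cancelled=true;
        m_ready.notify_all();
    }

    // Consumer side: blocks for the next token. Returns false when the
    // channel is cancelled, or closed with nothing left to read.
    bool pop(std::string &token)
    {
        std::unique_lock<std::mutex> lock(m_mutex);

        m_ready.wait(lock, [this]() { return !m_tokens.empty() || m_closed || m_cancelled; });
        if(m_cancelled || m_tokens.empty())
        {
            return false;
        }
        token=m_tokens.front();
        m_tokens.pop();
        return true;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_ready;
    std::queue<std::string> m_tokens;
    bool m_closed=false;
    bool m_cancelled=false;
};

// Streams tokens from a producer thread and collects them on the calling thread.
std::vector<std::string> streamThroughChannel(const std::vector<std::string> &tokens)
{
    SketchTokenChannel channel;
    std::thread producer([&channel, &tokens]()
    {
        for(std::size_t i=0; i < tokens.size(); ++i)
        {
            if(!channel.push(tokens[i]))
            {
                break;
            }
        }
        channel.close();
    });

    std::vector<std::string> received;
    std::string token;

    while(channel.pop(token))
    {
        received.push_back(token);
    }
    producer.join();
    return received;
}
```

In this sketch a failed `push` is how the producer learns about cancellation, so the inference loop can stop generating as soon as the consumer goes away.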
diff --git a/docs/server.md b/docs/server.md
@@ -31,7 +31,7 @@ The server supports:
 - **Model lifecycle management** — Load, unload, pin, and download models at runtime
 - **Runtime model config injection** — Add, update, or remove model configurations via REST without restarting
 - **Storage management** — Track downloaded model files, set hot ready / protected flags, configure automated cleanup, monitor disk usage and download progress with speed and ETA
-- **Telemetry** — System snapshots, inference history, swap history, and hardware info
+- **Telemetry** — System snapshots, inference history, swap history, active scheduler jobs, and hardware info
 - **Live dashboard** — Browser-based UI at `/dashboard` with storage bar, download progress, and model management
 - **CORS** — All responses include permissive CORS headers
 
@@ -738,6 +738,8 @@ Inference history within a time window.
   {
     "model": "gpt-4",
     "variant": "",
+    "job_id": 17,
+    "cancelled": false,
     "tokens_per_second": 45.2,
     "prompt_tokens": 120,
     "completion_tokens": 80,
@@ -747,6 +749,40 @@ Inference history within a time window.
 ]
 ```
 
+#### `GET /api/scheduler/jobs`
+
+Active inference scheduler jobs (local models only). Jobs flow through the
+pipeline stages `queued` → `tokenizing` → `waiting` → `inferring`; completed
+and cancelled jobs are not listed. Returns `[]` when the scheduler is not
+running.
+
+**Response:**
+
+```json
+[
+  {
+    "id": 17,
+    "model": "my-local-model",
+    "stage": "inferring",
+    "streaming": true,
+    "prompt_tokens": 120,
+    "completion_tokens": 34,
+    "queue_position": 0,
+    "elapsed_ms": 2150.0
+  }
+]
+```
+
+| Field | Description |
+|-------|-------------|
+| `id` | Scheduler job ID (matches `job_id` in inference history) |
+| `stage` | `queued`, `tokenizing`, `waiting`, or `inferring` |
+| `streaming` | Whether the request is a streaming completion |
+| `prompt_tokens` | Prompt token count (available once tokenized) |
+| `completion_tokens` | Tokens generated so far (streaming jobs only) |
+| `queue_position` | Position in the accelerator queue (`0` = running) |
+| `elapsed_ms` | Time since the job was submitted |
+
 #### `GET /api/stats/swaps`
 
 Model swap history.

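Review note on `GET /api/scheduler/jobs`: `CLAUDE.md` names the pipeline stages `Queued`, `Tokenizing`, `WaitingAccelerator`, `Inferring`, and `Complete`, while this endpoint reports lowercase strings and omits finished jobs. A sketch of that mapping follows, assuming a hypothetical `JobStage` enum; the real type, and the `"complete"` string, are not shown in this diff.

```cpp
#include <string>

// Hypothetical enum mirroring the stage names in the docs.
enum class JobStage {
    Queued,
    Tokenizing,
    WaitingAccelerator,
    Inferring,
    Complete
};

// Returns the lowercase stage string documented for the JSON response.
std::string stageName(JobStage stage)
{
    switch(stage)
    {
    case JobStage::Queued:
        return "queued";
    case JobStage::Tokenizing:
        return "tokenizing";
    case JobStage::WaitingAccelerator:
        return "waiting";
    case JobStage::Inferring:
        return "inferring";
    case JobStage::Complete:
        return "complete"; // assumed; completed jobs are never listed
    }
    return "unknown";
}

// Completed jobs are dropped from the listing, so only active stages are reported.
bool isListed(JobStage stage)
{
    return stage != JobStage::Complete;
}
```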
diff --git a/schemas/model_config.schema.json b/schemas/model_config.schema.json
@@ -299,6 +299,12 @@
               "enum": ["vulkan", "rocm", "cuda"]
             },
             "uniqueItems": true
+          },
+          "api_format": {
+            "type": "string",
+            "description": "Output format produced by the model. When set (e.g. 'harmony'), the server converts the model's native output to standard OpenAI API format so clients don't need to understand the model's native format.",
+            "enum": ["", "harmony"],
+            "default": ""
           }
         }
       }

diff --git a/src/arbiterAI/arbiterAI.cpp b/src/arbiterAI/arbiterAI.cpp
@@ -254,6 +254,13 @@ ErrorCode ArbiterAI::completion(const CompletionRequest &request, CompletionResp
 
 ErrorCode ArbiterAI::streamingCompletion(const CompletionRequest &request,
     std::function<void(const std::string &)> callback)
+{
+    return streamingCompletion(request, callback, nullptr);
+}
+
+ErrorCode ArbiterAI::streamingCompletion(const CompletionRequest &request,
+    std::function<void(const std::string &)> callback,
+    std::function<void()> waitCallback)
 {
     if (!ArbiterAI::instance().initialized)
     {
@@ -273,6 +280,10 @@ ErrorCode ArbiterAI::streamingCompletion(const CompletionRequest &request,
         return ErrorCode::UnsupportedProvider;
     }
 
+    if(waitCallback)
+    {
+        return provider->streamingCompletion(request, callback, waitCallback);
+    }
     return provider->streamingCompletion(request, callback);
 }
 

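Review note on the new `streamingCompletion` overload: the dispatch takes the three-argument provider path only when a non-empty `waitCallback` is supplied. The self-contained sketch below reproduces that shape with stand-in types. `SketchErrorCode`, `SketchProvider`, and `dispatchStreaming` are hypothetical, and the provider's behaviour of calling the wait callback three times before streaming is invented for the demonstration.

```cpp
#include <functional>
#include <string>

// Stand-in for the library's ErrorCode; only the values needed here.
enum class SketchErrorCode {
    Success,
    UnsupportedProvider
};

using ChunkCallback=std::function<void(const std::string &)>;
using WaitCallback=std::function<void()>;

// Hypothetical provider: the three-argument path pretends to wait in a queue
// for three ticks, then streams the same two chunks as the two-argument path.
class SketchProvider {
public:
    SketchErrorCode streamingCompletion(const ChunkCallback &callback)
    {
        callback("Hello");
        callback(" world");
        return SketchErrorCode::Success;
    }

    SketchErrorCode streamingCompletion(const ChunkCallback &callback, const WaitCallback &waitCallback)
    {
        for(int tick=0; tick < 3; ++tick)
        {
            waitCallback();
        }
        return streamingCompletion(callback);
    }
};

// Same dispatch shape as the diff: an empty wait callback falls back to the
// two-argument provider call.
SketchErrorCode dispatchStreaming(SketchProvider *provider, const ChunkCallback &callback,
    const WaitCallback &waitCallback)
{
    if(provider == nullptr)
    {
        return SketchErrorCode::UnsupportedProvider;
    }
    if(waitCallback)
    {
        return provider->streamingCompletion(callback, waitCallback);
    }
    return provider->streamingCompletion(callback);
}
```

A server-side caller could use the wait callback to send keep-alive output while a request sits in the queue; callers that pass `nullptr` keep the existing two-argument behaviour.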
diff --git a/src/arbiterAI/arbiterAI.h b/src/arbiterAI/arbiterAI.h
@@ -76,7 +76,9 @@ enum class ErrorCode
     ModelLoadError,
     ModelDownloading,
     ModelDownloadFailed,
-    InsufficientStorage
+    InsufficientStorage,
+    ServerOverloaded,
+    Cancelled
 };
 
 /**
@@ -616,6 +618,17 @@ class ArbiterAI
     ErrorCode streamingCompletion(const CompletionRequest &request,
         std::function<void(const std::string &)> callback);
 
+    /**
+     * @brief Perform streaming completion with queue wait notification
+     * @param request Completion parameters
+     * @param callback Function to receive streaming chunks
+     * @param waitCallback Called periodically while the request waits for backend availability; may be empty (nullptr), in which case no wait notifications are sent
+     * @return ErrorCode indicating success or failure
+     */
+    ErrorCode streamingCompletion(const CompletionRequest &request,
+        std::function<void(const std::string &)> callback,
+        std::function<void()> waitCallback);
+
     /**
      * @brief Process multiple completion requests in batch
      * @param requests Vector of completion requests