103 changes: 103 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,103 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What this is

ArbiterAI is a C++17 library providing a unified, embeddable interface across multiple LLM
providers (OpenAI, Anthropic, DeepSeek, OpenRouter, llama.cpp local models, and a Mock provider for
testing). It also ships a standalone OpenAI-compatible HTTP server (`arbiterAI-server`) with model
lifecycle management, telemetry, and a live dashboard.

## Build, test, run — everything goes through Docker

All building, testing, and running happens **inside the Docker container** (`docker/Dockerfile`). The
host is not guaranteed to have the toolchain (CMake + vcpkg + llama.cpp) or dependencies. Dependencies
are managed by vcpkg (`vcpkg.json`).

```bash
./runDocker.sh # start/attach the container (bind-mounts repo at /app)
./runDocker.sh ./build.sh # build (runs CMake automatically if cmake files changed)
./runDocker.sh ./build.sh --rebuild # clean rebuild of the app
./runDocker.sh ./build.sh --rebuild-cmake # nuke CMake dir + re-run CMake (only if cmake is broken)
./runDocker.sh --rebuild # rebuild the Docker *image* (only when Dockerfile changes)
./runDocker.sh --stop # stop and remove the container
```

Build output: `build/${OS}_${ARCH}_${BUILD_TYPE}`, default `build/linux_x64_debug/`.
Targets: `arbiterai` (library), `arbiterai_tests`, `arbiterAI-cli`, `arbiterAI-proxy`, `arbiterAI-server`.

### Tests (Google Test)

```bash
./runDocker.sh ./build/linux_x64_debug/arbiterai_tests
./runDocker.sh ./build/linux_x64_debug/arbiterai_tests --gtest_filter='ModelManager*' # single suite/test
```

### Working rules

- Run binaries/commands through `./runDocker.sh ...`. Do **not** use host `python`/`pip`/`pytest` or host virtualenvs — the container is the environment.
- Do **not** launch `arbiterAI-server` yourself; ask the user to launch it so it doesn't occupy the agent terminal.
- Avoid `2>&1` redirection — the user needs to see live output.

## Configuration model

Model/provider configs are JSON, loaded by `ModelManager` (singleton) with schema validation
(`schemas/`). The default configs live in the **`arbiterAI_config` git submodule** (`arbiterAI_config/configs/defaults/{models,backends}/`).
`ArbiterAI::initialize()` takes a list of config directories. The server merges these with runtime-injected
configs (added/updated/removed via REST without restart) and can persist them via an override path.

## Architecture

Layered, strategy-pattern core (see `docs/developer.md` for the full API reference):

```
ArbiterAI (singleton factory + lifecycle) ── src/arbiterAI/arbiterAI.{h,cpp}
├─ createChatClient() → ChatClient (stateful per-session: history, tools, cache, stats)
├─ owns ModelManager (singleton: config load, schema validation, model lookup, ConfigDownloader)
└─ stateless convenience: completion(), streamingCompletion(), batchCompletion(), getEmbeddings()
│ delegates to
BaseProvider (abstract) ── src/arbiterAI/providers/baseProvider.h
OpenAI · Anthropic · DeepSeek · OpenRouter · Llama (local) · Mock
```

- **Providers** are instantiated by a `switch` in `arbiterAI.cpp` keyed on the provider string (`createProvider`-style factory). To add a provider: create `providers/<name>.{h,cpp}` subclassing `BaseProvider`, add it to that switch, add the source to `CMakeLists.txt`, and add a model config JSON.
- **Error handling is error-code based** (`ErrorCode` enum), not exceptions — follow this; avoid try/catch where an error code works.
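
The factory-plus-error-code pattern described above can be sketched roughly as follows. This is a simplified illustration, not the code in `arbiterAI.cpp`: `ExampleProvider`, `ProviderKind`, `parseProviderKind`, `runCompletion`, the reduced `BaseProvider` interface, and `ErrorCode::Success` are all hypothetical stand-ins. Only `ErrorCode::UnsupportedProvider` and the `BaseProvider` name come from the repository.

```cpp
#include <memory>
#include <string>

// Hypothetical, reduced stand-ins for the real types in arbiterAI.h
// and providers/baseProvider.h.
enum class ErrorCode { Success, UnsupportedProvider };

class BaseProvider {
public:
    virtual ~BaseProvider()=default;
    virtual ErrorCode completion(const std::string &prompt, std::string &response)=0;
};

// Hypothetical provider used only for this sketch.
class ExampleProvider : public BaseProvider {
public:
    ErrorCode completion(const std::string &prompt, std::string &response) override
    {
        response="example: "+prompt;
        return ErrorCode::Success;
    }
};

enum class ProviderKind { Example, Unknown };

// C++ cannot switch on a std::string, so the name is mapped to an enum first.
ProviderKind parseProviderKind(const std::string &name)
{
    if(name == "example")
    {
        return ProviderKind::Example;
    }
    return ProviderKind::Unknown;
}

// Returns nullptr for an unknown provider name instead of throwing.
std::unique_ptr<BaseProvider> createProvider(const std::string &name)
{
    switch(parseProviderKind(name))
    {
    case ProviderKind::Example:
        return std::make_unique<ExampleProvider>();
    case ProviderKind::Unknown:
        break;
    }
    return nullptr;
}

// Failures are reported through the return code; the response is an out-parameter.
ErrorCode runCompletion(const std::string &providerName, const std::string &prompt, std::string &response)
{
    std::unique_ptr<BaseProvider> provider=createProvider(providerName);

    if(!provider)
    {
        return ErrorCode::UnsupportedProvider;
    }
    return provider->completion(prompt, response);
}
```

Under these assumptions, adding a provider means adding one subclass and one `case`; callers keep checking the returned `ErrorCode` rather than wrapping calls in try/catch.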

### Local model subsystem (llama.cpp)

Distinct from the cloud providers, this is the heavier piece:

- **`ModelRuntime`** (`modelRuntime.{h,cpp}`) — multi-model loading into VRAM/RAM, swap queueing, LRU eviction, GGUF-aware load-failure classification (`LoadFailureReason`/`LoadErrorDetail`).
- **`InferenceScheduler`** (`inferenceScheduler.{h,cpp}`) — request pipeline with stages (Queued → Tokenizing → WaitingAccelerator → Inferring → Complete), and `TokenChannel` for streaming tokens from the accelerator thread to the HTTP thread.
- **`HardwareDetector`** — GPU/VRAM/RAM/CPU detection; **`ModelFitCalculator`** — whether a model fits available hardware.
- **`ModelDownloader`** / **`StorageManager`** — download GGUF files (libgit2 / HTTP), track storage, hot-ready/protected flags, cleanup.
- **`TelemetryCollector`** — inference stats and system snapshots, surfaced by the server.
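
To make the streaming hand-off concrete, here is a minimal sketch of a token channel between a producer thread (standing in for the accelerator thread) and a consumer (standing in for the HTTP thread). It is not the project's `TokenChannel`: the class name `SketchTokenChannel`, its `push`/`close`/`pop` methods, and the helper `streamTokens` are assumptions made for illustration.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal channel: a mutex-guarded queue plus a "closed" flag.
class SketchTokenChannel {
public:
    void push(std::string token)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_tokens.push(std::move(token));
        }
        m_ready.notify_one();
    }

    // Signals that no more tokens will arrive.
    void close()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_closed=true;
        }
        m_ready.notify_all();
    }

    // Blocks until a token is available; returns std::nullopt once the
    // channel is closed and fully drained.
    std::optional<std::string> pop()
    {
        std::unique_lock<std::mutex> lock(m_mutex);

        m_ready.wait(lock, [this]() { return !m_tokens.empty() || m_closed; });
        if(m_tokens.empty())
        {
            return std::nullopt;
        }
        std::string token=std::move(m_tokens.front());
        m_tokens.pop();
        return token;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_ready;
    std::queue<std::string> m_tokens;
    bool m_closed=false;
};

// Producer thread pushes tokens in order; the calling thread concatenates them.
std::string streamTokens(const std::vector<std::string> &tokens)
{
    SketchTokenChannel channel;
    std::thread producer([&channel, &tokens]()
    {
        for(const std::string &token : tokens)
        {
            channel.push(token);
        }
        channel.close();
    });

    std::string received;
    while(std::optional<std::string> token=channel.pop())
    {
        received+=*token;
    }
    producer.join();
    return received;
}
```

A real channel would also need a cancellation path for client disconnects; that is left out here to keep the sketch short.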

### Server (`src/server/`)

Separate CMake target linking `arbiterai` + cpp-httplib (httplib is a server-only dependency, kept out of
the core library). `routes.cpp` defines the OpenAI-compatible endpoints (`/v1/chat/completions`,
`/v1/models`, `/v1/embeddings`; SSE for streaming responses), model management, telemetry (`/api/stats`), and config
injection. `dashboard.h`/`dashboardConfig.h` are embedded HTML/JS for the `/dashboard` UI. The server takes a
single required config file: `arbiterAI-server -c <config.json>`. See `docs/server.md`.

### Testing without API keys

The **Mock provider** (`providers/mock.{h,cpp}`) returns deterministic responses driven by `<echo>...</echo>`
tags in messages — no network or keys. Use `"provider": "mock"` in a model config. See `docs/testing.md`.
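
As a rough illustration of how an echo-driven response could be derived from a message, the sketch below extracts the text between the first `<echo>` and `</echo>` pair. The function name `extractEcho` and its exact behavior (first tag only, no nesting) are assumptions; the actual Mock provider may parse tags differently, so `docs/testing.md` remains the reference.

```cpp
#include <optional>
#include <string>

// Hypothetical helper: returns the text inside the first <echo>...</echo>
// pair, or std::nullopt if no complete pair is present.
std::optional<std::string> extractEcho(const std::string &message)
{
    const std::string openTag="<echo>";
    const std::string closeTag="</echo>";

    std::string::size_type start=message.find(openTag);
    if(start == std::string::npos)
    {
        return std::nullopt;
    }
    start+=openTag.size();

    std::string::size_type end=message.find(closeTag, start);
    if(end == std::string::npos)
    {
        return std::nullopt;
    }
    return message.substr(start, end-start);
}
```

Because the result depends only on the message text, a test built this way needs no network access or API keys.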

## Code style (from `.roo/rules-code/` and `.github/instructions/`)

- Files: **camelCase** names, `.h`/`.cpp`/`.inl`. Header guards `_PROJECT_FILENAME_EXT_`, **no `#pragma once`**.
- Braces: open brace on a **new line** for namespaces/functions/control blocks; **same line** for struct/class definitions in headers.
- Naming: Types `PascalCase`; functions/methods `camelCase`; class members `m_camelCase`; locals/struct vars `camelCase`; macros `UPPER_CASE`.
- Spacing: no space around `=`, `::`, unary operators, or between a keyword/function name and `(`; spaces around comparison/logical operators; a space after each comma, not before.
- Pointers/refs bind to the variable: `type *var`, `type &var`. Minimize `auto`. Minimize comments — none for obvious code.
- Includes: `""` for local files, `<>` for libraries. Namespaces: prefer explicit qualification over `using` directives; aliases allowed.
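
A small header written to the rules above might look like the following. The file name (`tokenCounter.h`), the `example` namespace, the guard spelling `_EXAMPLE_TOKENCOUNTER_H_`, and every identifier in it are made up for illustration; in particular, how `_PROJECT_FILENAME_EXT_` expands for a real file should be checked against existing headers.

```cpp
#ifndef _EXAMPLE_TOKENCOUNTER_H_
#define _EXAMPLE_TOKENCOUNTER_H_

#include <string>

namespace example
{

// Class definition in a header: opening brace stays on the same line.
class TokenCounter {
public:
    // Function body: opening brace on a new line; reference binds to the variable.
    void add(const std::string &token)
    {
        // No space between `if` and `(`; spaces around comparison/logical operators.
        if(!token.empty() && token != " ")
        {
            m_tokenCount++;
        }
    }

    int tokenCount() const
    {
        return m_tokenCount;
    }

private:
    int m_tokenCount=0; // member: m_camelCase, no space around `=`
};

// Pointer binds to the variable name; explicit types instead of `auto`.
inline int countTokens(const std::string *tokens, int size)
{
    TokenCounter counter;

    for(int i=0; i < size; i++)
    {
        counter.add(tokens[i]);
    }
    return counter.tokenCount();
}

}

#endif
```

The comments here are denser than the style guide would normally allow; they exist only to label which rule each line demonstrates.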

## Docs map

`docs/developer.md` (architecture + API), `docs/server.md` (server API), `docs/testing.md` (mock/echo),
`docs/project.md` (goals/providers), `docs/tasks/` (active task plans). The `docs/old/` and
`docs/development/tasks/completed/` dirs are historical.
3 changes: 3 additions & 0 deletions CMakeLists.txt
@@ -75,6 +75,8 @@ set(arbiterai_src
./src/arbiterAI/modelRuntime.cpp
./src/arbiterAI/telemetryCollector.h
./src/arbiterAI/telemetryCollector.cpp
./src/arbiterAI/inferenceScheduler.h
./src/arbiterAI/inferenceScheduler.cpp
./src/arbiterAI/storageManager.h
./src/arbiterAI/storageManager.cpp
./src/arbiterAI/providers/baseProvider.h
@@ -141,6 +143,7 @@ target_link_libraries(arbiterai
tests/hardwareDetectorTests.cpp
tests/modelRuntimeTests.cpp
tests/telemetryCollectorTests.cpp
tests/inferenceSchedulerTests.cpp
tests/llamaProviderTests.cpp
tests/storageManagerTests.cpp
tests/serverConnectTests.cpp
2 changes: 1 addition & 1 deletion arbiterAI_config
3 changes: 2 additions & 1 deletion docker/Dockerfile
@@ -1,5 +1,5 @@
# syntax=docker/dockerfile:1
-ARG DOCKER_VERSION=1.2.1
+ARG DOCKER_VERSION=1.2.2
FROM ubuntu:24.04

# Install basic build tools, Python 3, and GPU libraries.
@@ -33,6 +33,7 @@ RUN apt-get update && apt-get install -y \
vulkan-tools \
libvulkan-dev \
mesa-vulkan-drivers \
spirv-headers \
glslc \
glslang-tools \
wget \
13 changes: 7 additions & 6 deletions docs/developer.md
@@ -53,14 +53,15 @@ ArbiterAI follows a layered architecture:
- **[`ModelManager`](../src/arbiterAI/modelManager.h)** — Singleton that loads and manages model configurations from JSON files with schema validation.
- **Utility Components** — Cross-cutting functionality including caching ([`CacheManager`](../src/arbiterAI/cacheManager.h)), cost tracking ([`CostManager`](../src/arbiterAI/costManager.h)), model downloading ([`ModelDownloader`](../src/arbiterAI/modelDownloader.h)), and file verification ([`FileVerifier`](../src/arbiterAI/fileVerifier.h)).

-### Planned Components
+### Local Model Components

-See [Local Model Management Task](tasks/local_model_management.md) for upcoming additions:
+Components supporting local (llama.cpp) models — see [Local Model Management Task](tasks/local_model_management.md) for background:

-- **`HardwareDetector`** — GPU/RAM/CPU detection (NVML + Vulkan)
-- **`ModelRuntime`** — Multi-model loading, swap queueing, LRU eviction (refactor of `LlamaInterface`)
-- **`TelemetryCollector`** — Inference stats and system snapshots
-- **Standalone Server** — Separate `arbiterAI-server` application providing an OpenAI-compatible API, model management endpoints, and a live stats dashboard
+- **[`HardwareDetector`](../src/arbiterAI/hardwareDetector.h)** — GPU/RAM/CPU detection (NVML + Vulkan)
+- **[`ModelRuntime`](../src/arbiterAI/modelRuntime.h)** — Multi-model loading, swap queueing, LRU eviction, load-failure classification
+- **[`InferenceScheduler`](../src/arbiterAI/inferenceScheduler.h)** — Inference pipeline used by the server for local models. HTTP threads submit jobs; a tokenizer thread loads the model and pre-tokenizes the prompt; per-accelerator worker threads run inference. Streaming tokens flow back to the HTTP thread through a `TokenChannel`, and jobs are cancelled on client disconnect. Active jobs are exposed at `/api/scheduler/jobs`.
+- **[`TelemetryCollector`](../src/arbiterAI/telemetryCollector.h)** — Inference stats and system snapshots
+- **Standalone Server** — Separate `arbiterAI-server` application providing an OpenAI-compatible API, model management endpoints, and a live stats dashboard (see [Server Guide](server.md))

---

38 changes: 37 additions & 1 deletion docs/server.md
@@ -31,7 +31,7 @@ The server supports:
- **Model lifecycle management** — Load, unload, pin, and download models at runtime
- **Runtime model config injection** — Add, update, or remove model configurations via REST without restarting
- **Storage management** — Track downloaded model files, set hot ready / protected flags, configure automated cleanup, monitor disk usage and download progress with speed and ETA
-- **Telemetry** — System snapshots, inference history, swap history, and hardware info
+- **Telemetry** — System snapshots, inference history, swap history, active scheduler jobs, and hardware info
- **Live dashboard** — Browser-based UI at `/dashboard` with storage bar, download progress, and model management
- **CORS** — All responses include permissive CORS headers

@@ -738,6 +738,8 @@ Inference history within a time window.
{
"model": "gpt-4",
"variant": "",
"job_id": 17,
"cancelled": false,
"tokens_per_second": 45.2,
"prompt_tokens": 120,
"completion_tokens": 80,
@@ -747,6 +749,40 @@ Inference history within a time window.
]
```

#### `GET /api/scheduler/jobs`

Active inference scheduler jobs (local models only). Jobs flow through the
pipeline stages `queued` → `tokenizing` → `waiting` → `inferring`; completed
and cancelled jobs are not listed. Returns `[]` when the scheduler is not
running.

**Response:**

```json
[
{
"id": 17,
"model": "my-local-model",
"stage": "inferring",
"streaming": true,
"prompt_tokens": 120,
"completion_tokens": 34,
"queue_position": 0,
"elapsed_ms": 2150.0
}
]
```

| Field | Description |
|-------|-------------|
| `id` | Scheduler job ID (matches `job_id` in inference history) |
| `stage` | `queued`, `tokenizing`, `waiting`, or `inferring` |
| `streaming` | Whether the request is a streaming completion |
| `prompt_tokens` | Prompt token count (available once tokenized) |
| `completion_tokens` | Tokens generated so far (streaming jobs only) |
| `queue_position` | Position in the accelerator queue (`0` = running) |
| `elapsed_ms` | Time since the job was submitted |
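
One way to picture the `stage` field is as the string form of an internal pipeline enum. The sketch below is an assumption about that mapping, not the scheduler's code: `JobStage`, `stageName`, and `isListed` are hypothetical names, the pairing of `WaitingAccelerator` with `"waiting"` is inferred from the stage lists rather than confirmed, and the `"complete"` string is invented because completed jobs never appear in this endpoint.

```cpp
#include <string>

// Hypothetical enum mirroring the pipeline stages named in the docs.
enum class JobStage { Queued, Tokenizing, WaitingAccelerator, Inferring, Complete };

// Maps a stage to the lowercase string used in the JSON response.
const char *stageName(JobStage stage)
{
    switch(stage)
    {
    case JobStage::Queued:
        return "queued";
    case JobStage::Tokenizing:
        return "tokenizing";
    case JobStage::WaitingAccelerator:
        return "waiting";
    case JobStage::Inferring:
        return "inferring";
    case JobStage::Complete:
        return "complete"; // assumed; never returned by /api/scheduler/jobs
    }
    return "unknown";
}

// Only jobs that are still in flight are reported by the endpoint.
bool isListed(JobStage stage)
{
    return stage != JobStage::Complete;
}
```

A client polling the endpoint can treat a job's disappearance from the list as "finished or cancelled" and look up the outcome by `job_id` in the inference history.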

#### `GET /api/stats/swaps`

Model swap history.
6 changes: 6 additions & 0 deletions schemas/model_config.schema.json
@@ -299,6 +299,12 @@
"enum": ["vulkan", "rocm", "cuda"]
},
"uniqueItems": true
},
"api_format": {
"type": "string",
"description": "Output format produced by the model. When set (e.g. 'harmony'), the server converts the model's native output to standard OpenAI API format so clients don't need to understand the model's native format.",
"enum": ["", "harmony"],
"default": ""
}
}
}
11 changes: 11 additions & 0 deletions src/arbiterAI/arbiterAI.cpp
@@ -254,6 +254,13 @@ ErrorCode ArbiterAI::completion(const CompletionRequest &request, CompletionResp

ErrorCode ArbiterAI::streamingCompletion(const CompletionRequest &request,
std::function<void(const std::string &)> callback)
{
return streamingCompletion(request, callback, nullptr);
}

ErrorCode ArbiterAI::streamingCompletion(const CompletionRequest &request,
std::function<void(const std::string &)> callback,
std::function<void()> waitCallback)
{
if (!ArbiterAI::instance().initialized)
{
@@ -273,6 +280,10 @@ ErrorCode ArbiterAI::streamingCompletion(const CompletionRequest &request,
return ErrorCode::UnsupportedProvider;
}

if(waitCallback)
{
return provider->streamingCompletion(request, callback, waitCallback);
}
return provider->streamingCompletion(request, callback);
}

15 changes: 14 additions & 1 deletion src/arbiterAI/arbiterAI.h
@@ -76,7 +76,9 @@ enum class ErrorCode
ModelLoadError,
ModelDownloading,
ModelDownloadFailed,
-InsufficientStorage
+InsufficientStorage,
ServerOverloaded,
Cancelled
};

/**
@@ -616,6 +618,17 @@ class ArbiterAI
ErrorCode streamingCompletion(const CompletionRequest &request,
std::function<void(const std::string &)> callback);

/**
* @brief Perform streaming completion with queue wait notification
* @param request Completion parameters
* @param callback Function to receive streaming chunks
* @param waitCallback Called periodically while waiting for backend availability
* @return ErrorCode indicating success or failure
*/
ErrorCode streamingCompletion(const CompletionRequest &request,
std::function<void(const std::string &)> callback,
std::function<void()> waitCallback);

/**
* @brief Process multiple completion requests in batch
* @param requests Vector of completion requests