1 change: 1 addition & 0 deletions examples/README.md
Original file line number Diff line number Diff line change
@@ -20,6 +20,7 @@ service keys.
| [Swap an OCR model with one identifier change](./document-ocr) | Driving recognition (VLM-OCR), structured extraction (Donut), and zero-shot NER (GLiNER) through the same `extract` call by swapping the model ID | `extract` | Docker Compose plus Node UI, no API key required, hosted version on [Hugging Face Spaces](https://huggingface.co/spaces/superlinked/document-ocr) | Runnable demo |
| [A Stripe Link checkout with an SIE fraud-risk gate](./stripe-link-fraud) | Wiring all three SIE primitives into a pre-authorization fraud-risk gate that runs in the same round-trip as the Stripe PaymentIntent | `extract`, `encode`, `score` | Docker Compose plus Node UI; Stripe test-mode keys optional (runs in mock mode without them) | Runnable demo |
| [Vision-first document RAG](./vision-doc-rag) | Retrieving and answering questions over a multi-tenant page corpus by looking at page images — including scanned drawings — with OCR kept out of the score path | `encode`, `chat/completions`, `score` (optional) | GPU SIE deployment required: ColQwen2.5 retriever + Qwen3.5-4B answer model (runs on the generation bundle) | Runnable demo |
| [Multi-model contract review with the OpenAI Agents SDK](./contract-review-agent) | Running an OpenAI Agents SDK agent whose every model call — triage, orchestration, vision, OCR, embeddings, rerank, entity extraction, text-to-SQL, reasoning, and a safety guardrail — is served by one SIE cluster, each step on the right catalog model, with per-model observability | `chat/completions`, `encode`, `score`, `extract` | GPU SIE deployment required; standalone `uv` project; real contracts fetched from CUAD (CC BY 4.0) | Runnable demo |

For docs publishing, lead with the quickest runnable demos, then use the
benchmark and evaluation examples for deeper technical users.
4 changes: 4 additions & 0 deletions examples/contract-review-agent/.env.example
@@ -0,0 +1,4 @@
# Point these at any SIE deployment that serves /v1/chat/completions (a GPU
# cluster — see README "Run it"). Defaults match a local CUDA container.
SIE_CLUSTER_URL=http://localhost:8080
SIE_API_KEY=
8 changes: 8 additions & 0 deletions examples/contract-review-agent/.gitignore
@@ -0,0 +1,8 @@
.venv/
__pycache__/
.env
*.log
uv.lock
.ruff_cache/
# Generated sample artifacts (recreate with `uv run make-sample`)
contract_review_agent/data/generated/
111 changes: 111 additions & 0 deletions examples/contract-review-agent/README.md
@@ -0,0 +1,111 @@
# Contract review with the OpenAI Agents SDK, on one SIE cluster

A multi-agent contract reviewer built with the [OpenAI Agents SDK](https://openai.github.io/openai-agents-python/) where **every model call is served by SIE** — no `api.openai.com`, no per-token bill. An **investigator** agent autonomously calls tools to gather grounded facts, then a **synthesizer** agent turns them into a structured review — each step running on the **right model from the SIE catalog**: a fast triage model, a vision model that reads the scanned signature page, a reasoning sub-agent for clause risk, a text-to-SQL specialist, an OCR model, embedding + reranker models for clause search, a zero-shot entity extractor, and a safety guardrail. Ten specialized jobs, one cluster, one request.

This is the "one cluster powers every model your agent calls" idea from the [SIE landing page](https://superlinked.com), made real and runnable.

## The catalog: the right model for each job

Every value below is a real model in the [SIE catalog](https://superlinked.com/models). Swap any line in `config.yaml` to try another — nothing else changes.

| Role in the agent | SIE model | SIE function |
|---|---|---|
| Triage — classify the document type | `Qwen/Qwen3-0.6B` | chat |
| **Orchestrator** — plan, call tools, assemble the review | `Qwen/Qwen3-4B-Instruct-2507` (alias `code`) | chat + tools + JSON schema |
| Vision — read the scanned signature page | `Qwen/Qwen3.5-4B` | chat + image |
| Reasoning sub-agent — clause-risk analysis | `Qwen/Qwen3-4B-Instruct-2507` (↑ `Qwen3.5-4B` / `Qwen3.6-27B` where served) | chat |
| Text-to-SQL — query the obligations DB | `defog/sqlcoder-7b-2` | completions |
| Guardrail — safety / prompt-injection | `ibm-granite/granite-guardian-3.0-2b` (alias `guard`) | chat |
| OCR — scanned page → markdown | `lightonai/LightOnOCR-2-1B` | extract |
| Clause search — dense embeddings | `BAAI/bge-m3` | encode |
| Clause rerank — cross-encoder | `Qwen/Qwen3-Reranker-4B` | score |
| Entity extraction — parties, dates, amounts | `urchade/gliner_large-v2.1` | extract |

## How it works

The whole trick is one idea: **the Agents SDK speaks the OpenAI wire protocol, and SIE serves an OpenAI-compatible `/v1` endpoint.** So we point the SDK at SIE and force chat completions (`contract_review_agent/runtime.py`):

```python
from agents import set_default_openai_api, set_default_openai_client, set_tracing_disabled
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
set_default_openai_client(client)           # every agent talks to SIE...
set_default_openai_api("chat_completions")  # ...over chat completions, not the Responses API...
set_tracing_disabled(True)                  # ...and we never phone home with traces.
```

After that, each `Agent` just names the SIE model it should run on:

```python
Agent(name="Risk Analyst", model=OpenAIChatCompletionsModel("Qwen/Qwen3-4B-Instruct-2507", openai_client=client), ...)
```

The flow is **two agents** (which is what keeps a small open model reliable):

1. An **investigator** (on `Qwen3-4B-Instruct`) with seven tools and **no** structured `output_type` — so it can't short-circuit to a hallucinated answer and instead must call tools to learn anything about the contract:
   - `classify_document` (triage) · `read_signature_page` (vision) · `analyze_clause_risks` (delegates to the reasoning **sub-agent**): generative LLMs
   - `ocr_signature_page` · `extract_entities` (`extract`), `search_clauses` (`encode` + `score`), `query_obligations_db` (`completions`): retrieval & extraction
   - a `granite-guardian` **input guardrail** screens the request first (and fails open, logged, if the guard model is unavailable).
2. A **synthesizer** (structured `output_type=ContractReview`, no tools) turns the investigator's grounded findings into the final review — parties, dates, governing law, executed?, key obligations, risk flags with severity + redlines, recommendation — via SIE's JSON-schema-constrained generation.
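
The `search_clauses` tool pairs `encode` with `score`. The sketch below shows that retrieve-then-rerank shape using only the standard library. It is not the example's actual implementation: the toy vectors, the `rerank_score` callable, and the function names are illustrative assumptions, and the two `top_k_*` parameters simply mirror the `search` tunables in `config.yaml`.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_clauses_sketch(
    query_vec: Sequence[float],
    clauses: list[tuple[str, Sequence[float]]],
    rerank_score: Callable[[str], float],  # stand-in for a cross-encoder `score` call
    top_k_candidates: int = 12,
    top_k_results: int = 4,
) -> list[str]:
    """Stage 1 keeps the nearest clauses by embedding similarity; stage 2 reranks them."""
    candidates = sorted(
        clauses, key=lambda c: cosine(query_vec, c[1]), reverse=True
    )[:top_k_candidates]
    reranked = sorted(candidates, key=lambda c: rerank_score(c[0]), reverse=True)
    return [text for text, _ in reranked[:top_k_results]]
```

The cheap embedding pass narrows the corpus so the more expensive cross-encoder only scores a dozen candidates.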

> Why two agents? With a structured `output_type`, a small model tends to emit the schema immediately and skip the tools (it will even hallucinate the fields). Splitting "gather with tools" from "format the result" keeps the fan-out real and the output grounded.
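
The fail-open guardrail from step 1 can be sketched as a single decision function. This is a minimal illustration, not the code in `guardrails.py`: the name `should_trip`, the use of `None` to signal an unreachable guard model, and reading "clears the threshold" as `>=` are all assumptions. The default of 0.5 matches `guard.threshold` in `config.yaml`.

```python
import logging
from typing import Optional

log = logging.getLogger("guardrail_sketch")


def should_trip(p_unsafe: Optional[float], threshold: float = 0.5) -> bool:
    """Return True when the request should be blocked.

    p_unsafe is the guard model's probability of "unsafe", or None when the
    guard model could not be reached (hypothetical convention). In that case
    the request is allowed through and the event is logged (fail open).
    """
    if p_unsafe is None:
        log.warning("guard model unavailable; failing open")
        return False
    # Assumption: a score exactly at the threshold counts as unsafe.
    return p_unsafe >= threshold
```

Failing open keeps the demo usable on clusters that do not serve the guard model; a production deployment might prefer to fail closed.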

## Run it

You need Python 3.12 and a **GPU-backed SIE deployment** — the generative models run on SIE's generation bundle (CUDA), so the `latest-cpu-default` image can't serve them.

```bash
# 1. SIE on a local NVIDIA GPU, or point SIE_CLUSTER_URL / SIE_API_KEY at a managed GPU cluster.
docker run --gpus all -p 8080:8080 -v sie-hf-cache:/app/.cache/huggingface \
ghcr.io/superlinked/sie-server:latest-cuda12-default

cd examples/contract-review-agent
cp .env.example .env # edit SIE_CLUSTER_URL / SIE_API_KEY if not localhost
uv sync

# 2. Fetch a handful of real contracts from CUAD (CC BY 4.0). Downloads a ~18 MB archive once.
uv run fetch-contracts # or: uv run make-sample (offline synthetic contracts)

# 3. Review the first contract and watch the model fan-out.
uv run review # uv run review --list to see available contracts
uv run review --contract <slug> # review a specific one
```
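
Step 1 lets `SIE_CLUSTER_URL` / `SIE_API_KEY` redirect the demo to a managed cluster. A sketch of how such an override could be resolved, assuming non-empty environment variables take precedence over the `cluster` block in `config.yaml` (the helper name `resolve_cluster` and the precedence rule are assumptions, not taken from `runtime.py`):

```python
import os
from typing import Mapping


def resolve_cluster(
    cfg: Mapping[str, str], env: Mapping[str, str] = os.environ
) -> dict[str, str]:
    """Start from the config's cluster block; non-empty env vars win."""
    return {
        "url": env.get("SIE_CLUSTER_URL") or cfg.get("url", "http://localhost:8080"),
        "api_key": env.get("SIE_API_KEY") or cfg.get("api_key", ""),
    }
```

Treating an empty variable as unset matters here because `.env.example` ships `SIE_API_KEY=` with no value.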

> **GPU sizing.** `reasoning` defaults to `Qwen/Qwen3-4B-Instruct-2507` (reliable, fast) so the demo
> runs on a single mid-size GPU; swap in the newer `Qwen/Qwen3.5-4B` or the stronger `Qwen/Qwen3.6-27B` (H100/RTX PRO 6000) where the cluster serves them. A cold
> cluster pays a one-time load per model on first use; the agent retries the "still
> provisioning" responses under `cluster.provision_timeout_s`. Keep bundles warm
> (`minReplicas: 1`) to skip the wait — and any model the cluster can't serve degrades
> gracefully (logged in the ledger) instead of failing the run.
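
The retry behaviour described in the sizing note can be sketched as a deadline-bounded loop. This is a sketch under assumptions: `StillProvisioning` is a hypothetical exception standing in for however the client surfaces a "still provisioning" response, and the polling interval is invented for illustration. Only the 900-second default comes from `cluster.provision_timeout_s`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class StillProvisioning(Exception):
    """Hypothetical marker for a "model is still loading" response."""


def call_with_provision_retry(
    call: Callable[[], T],
    provision_timeout_s: float = 900.0,
    poll_interval_s: float = 5.0,  # assumed value, not from the example
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry `call` while it raises StillProvisioning, until the deadline passes."""
    deadline = clock() + provision_timeout_s
    while True:
        try:
            return call()
        except StillProvisioning:
            if clock() >= deadline:
                raise TimeoutError("model did not finish provisioning in time")
            sleep(poll_interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.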

## What you'll see

`uv run review` prints the model catalog, runs the agent, then prints the structured review **plus a per-model observability ledger** — each step's model, SIE function, **cold-start warm-up**, warm latency, data sent, and **warm throughput (tokens/s)** — so you can watch one cluster fan a single request across the catalog and see how each model performed. (Warm-up is shown separately from throughput for the generative calls; the `encode`/`score`/`extract` calls go through the SIE SDK, which provisions internally, so those show total latency.) Try `--instruction "..."` to change the ask, or feed the guardrail a malicious prompt to watch `granite-guardian` trip the tripwire.
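
To make the warm-up versus throughput split concrete, here is a sketch of one ledger row. The `LedgerEntry` class and its field names are hypothetical, not the example's actual data model. The point is only that tokens per second is computed over warm latency, so a one-time model load does not distort it.

```python
from dataclasses import dataclass


@dataclass
class LedgerEntry:
    """Hypothetical ledger row; field names are illustrative."""

    role: str
    model: str
    warmup_s: float        # one-time cold-start load, reported separately
    warm_latency_s: float  # latency of the call once the model is loaded
    completion_tokens: int

    @property
    def warm_tokens_per_s(self) -> float:
        """Throughput over warm latency only, so cold starts do not skew it."""
        if self.warm_latency_s <= 0:
            return 0.0
        return self.completion_tokens / self.warm_latency_s
```

For example, 40 tokens generated in 0.5 s of warm latency gives 80 tokens/s regardless of how long the first load took.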

## Swapping models (the point of the catalog)

`config.yaml` maps each role to a model id. Change a string, rerun — no code edits:

```yaml
models:
  reasoning: "Qwen/Qwen3.6-27B"               # default 4B runs anywhere; bump to 27B on an H100-class cluster
  ocr: "opendatalab/MinerU2.5-Pro-2604-1.2B"  # try a different OCR model
```

Alternatively, resolve roles **server-side** with SIE's gateway aliases — set
`SIE_GATEWAY_MODEL_ALIASES='{"vision":"Qwen/Qwen3.5-4B","ocr":"lightonai/LightOnOCR-2-1B"}'`
and reference `vision` / `ocr` (the built-ins `code`, `sql`, `guard` already ship).
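
The alias value is a JSON object packed into one environment variable, so it is easy to get the quoting wrong by hand. A small sketch that builds the string from a Python dict (the helper name `aliases_env_value` is illustrative; only the variable name and the two mappings come from the text above):

```python
import json


def aliases_env_value(aliases: dict[str, str]) -> str:
    """Serialize role -> model id mappings as compact single-line JSON."""
    # Compact separators reproduce the no-whitespace form shown above.
    return json.dumps(aliases, separators=(",", ":"))


if __name__ == "__main__":
    value = aliases_env_value(
        {"vision": "Qwen/Qwen3.5-4B", "ocr": "lightonai/LightOnOCR-2-1B"}
    )
    print(f"SIE_GATEWAY_MODEL_ALIASES='{value}'")
```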

## Data

The default corpus is **[CUAD](https://www.atticusprojectai.org/cuad/)** (Contract Understanding Atticus Dataset) — 510 real commercial contracts filed with the SEC, released by The Atticus Project under **CC BY 4.0**. `fetch-contracts` downloads CUAD's ~18 MB archive once (from the [Atticus Project repo](https://github.com/TheAtticusProject/cuad)), parses the SQuAD-format contract text, writes a curated handful as the corpus, renders one page to an image for the OCR/vision step, and seeds a small SQLite obligations database that references the contracts pulled.

> CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review. Dan Hendrycks, Collin Burns, Anya Chen, Spencer Ball. arXiv:2103.06268. Licensed CC BY 4.0.

`uv run make-sample` builds a fully synthetic, offline alternative (an Acme MSA, an NDA, and an SOW) so the demo runs with no network.

## Notes

- Chat completions, tool calling, JSON-schema structured output, vision, and `/v1/completions` (for `sqlcoder`) are all served over SIE's OpenAI-compatible API.
- `sqlcoder-7b-2` is a completion model used with its native text-to-SQL template; for higher accuracy you can instead point `sql` at the `code`-aliased instruct model.
- This is a demo of inference orchestration, **not legal advice**.

Apache-2.0, like the rest of SIE.
36 changes: 36 additions & 0 deletions examples/contract-review-agent/config.yaml
@@ -0,0 +1,36 @@
# SIE deployment. /v1/chat/completions is served by SIE's generation bundle
# (GPU), so use a CUDA image locally:
# docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default
# or override with SIE_CLUSTER_URL / SIE_API_KEY to target a managed GPU cluster.
cluster:
  url: "http://localhost:8080"
  api_key: ""
  gpu: ""                   # only set for managed multi-GPU clusters (e.g. "l4-spot"); ignored locally
  provision_timeout_s: 900  # cold clusters scale from zero; the first call to each model pays a load

# ── The heart of the demo: one cluster, many models — the right one for each job. ──
# Every value is a real model in the SIE catalog (https://superlinked.com/models).
# Swap any line to try a different model; no other code changes.
models:
# Generative LLMs — the agent "brains" (chat / tools / structured output):
  triage: "Qwen/Qwen3-0.6B"                     # fast, cheap doc-type classifier (no tools)
  orchestrator: "Qwen/Qwen3-4B-Instruct-2507"   # plans, calls tools, assembles output (alias: code)
  vision: "Qwen/Qwen3.5-4B"                     # reads scanned / signature pages (text + image)
  reasoning: "Qwen/Qwen3-4B-Instruct-2507"      # clause-risk analysis (reliable, fast); swap to newer Qwen/Qwen3.5-4B or stronger Qwen/Qwen3.6-27B where the cluster serves them
  sql: "defog/sqlcoder-7b-2"                    # text-to-SQL specialist (completions + native template)
  guard: "ibm-granite/granite-guardian-3.0-2b"  # safety / prompt-injection guardrail (alias: guard)
# Retrieval + extraction tools (encode / score / extract):
  ocr: "lightonai/LightOnOCR-2-1B"       # scanned PDF / image -> markdown (latest OCR model)
  embed: "BAAI/bge-m3"                   # dense embeddings for clause search
  rerank: "Qwen/Qwen3-Reranker-4B"       # cross-encoder rerank of retrieved clauses
  entities: "urchade/gliner_large-v2.1"  # zero-shot entity extraction (parties, dates, amounts)

# Tunables for the SIE-backed tools.
search:
  top_k_candidates: 12  # clauses retrieved by embedding similarity
  top_k_results: 4      # clauses kept after rerank

guard:
  # granite-guardian emits a "yes" (unsafe) / "no" (safe) verdict. Trip the
  # guardrail when P(unsafe) clears this threshold. 0.5 is recall-biased.
  threshold: 0.5
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
"""Contract review with the OpenAI Agents SDK, served entirely by SIE.

One orchestrator agent drives specialist sub-agents and SIE-backed tools, each
running on a different model from the SIE catalog — the "one cluster powers
every model your agent calls" story, made runnable.
"""
114 changes: 114 additions & 0 deletions examples/contract-review-agent/contract_review_agent/app.py
@@ -0,0 +1,114 @@
"""Assemble the multi-agent app: an orchestrator on one model, a risk-analyst
sub-agent on another, SIE-backed tools, a safety guardrail, and a structured
output type."""

from __future__ import annotations

from typing import Any

from agents import Agent, RunResult, Runner
from pydantic import BaseModel

from .guardrails import safety_guardrail
from .runtime import AppContext, model_for
from .tools import ALL_TOOLS


class RiskFlag(BaseModel):
    clause: str
    issue: str
    severity: str  # low | medium | high
    suggested_redline: str


class ContractReview(BaseModel):
    """The structured deliverable the orchestrator must produce."""

    document_type: str
    parties: list[str]
    effective_date: str  # "unknown" if not stated
    renewal_terms: str
    governing_law: str  # "unknown" if not stated
    executed: bool  # is the signature page signed and dated?
    key_obligations: list[str]
    risk_flags: list[RiskFlag]
    recommendation: str


# The investigator has NO output_type on purpose: a structured output_type gives a
# weak model an escape hatch to emit the schema immediately instead of using tools.
# With only tools available, it must call them to do its job.
_INVESTIGATOR_INSTRUCTIONS = """\
You are a contract investigator. You have NO prior knowledge of this contract — the
ONLY way to learn anything is to CALL YOUR TOOLS. Investigate thoroughly: call EVERY
one of these tools, one after another, before you write anything.

- classify_document() — the document type
- ocr_signature_page() — read the executed signature page (signatories, titles, date)
- extract_entities() — parties, dates, amounts, governing law
- read_signature_page("Are both parties' signatures present and dated?") — visual execution check
- search_clauses("automatic renewal"), then search_clauses("limitation of liability"),
then search_clauses("indemnification"), then search_clauses("termination")
- analyze_clause_risks(<the clause text you found>) — risk analysis with severities
- query_obligations_db("upcoming obligations with due dates and amounts") — deadlines

Do NOT write your report until you have called them all. Then write a thorough,
factual findings report that cites ONLY what the tools returned. Never invent a party,
date, number, or clause — if a tool failed, say so."""

_SYNTHESIZER_INSTRUCTIONS = """\
You turn a contract investigator's findings into a structured ContractReview. Use
ONLY the findings provided — never add facts. If the findings don't establish a
field, use "unknown" (or false for `executed`). Make key_obligations and risk_flags
specific and grounded in the findings, and give a clear recommendation."""


def build_reasoning_agent(cfg: dict[str, Any], client: Any) -> Agent:
    return Agent(
        name="Risk Analyst",
        instructions=(
            "You are a senior contracts attorney. Given contract clauses, identify "
            "risks to the Customer. For each, state the clause, the issue, a severity "
            "(low/medium/high), and a concrete one-line redline. Be specific and brief."
        ),
        model=model_for(cfg["models"]["reasoning"], client),
    )


def build_investigator(cfg: dict[str, Any], client: Any) -> Agent:
    """Autonomous tool-using agent (no output_type) that gathers grounded findings."""
    return Agent(
        name="Contract Investigator",
        instructions=_INVESTIGATOR_INSTRUCTIONS,
        model=model_for(cfg["models"]["orchestrator"], client),
        tools=ALL_TOOLS,
        input_guardrails=[safety_guardrail],
    )


def build_synthesizer(cfg: dict[str, Any], client: Any) -> Agent:
    """Structured-output agent (no tools) that formats the findings into a review."""
    return Agent(
        name="Contract Reviewer",
        instructions=_SYNTHESIZER_INSTRUCTIONS,
        model=model_for(cfg["models"]["orchestrator"], client),
        output_type=ContractReview,
    )


async def run_review(
    app: AppContext, investigator: Agent, synthesizer: Agent, instruction: str
) -> tuple[RunResult, RunResult]:
    """Investigate with tools (autonomous fan-out), then synthesize the structured review."""
    gather = await Runner.run(
        investigator,
        f"{instruction}\n\nInvestigate the contract using your tools, then report your findings.",
        context=app,
        max_turns=20,
    )
    synth = await Runner.run(
        synthesizer,
        f"Investigator findings:\n\n{gather.final_output}\n\nProduce the ContractReview.",
        context=app,
    )
    return gather, synth