examples(contract-review-agent): multi-model contract review with the OpenAI Agents SDK by svonava · Pull Request #197 · superlinked/sie

svonava · 2026-06-23T09:28:15Z

Multi-model contract review with the OpenAI Agents SDK

A new runnable example under examples/contract-review-agent/: a contract reviewer built on the OpenAI Agents SDK where every model call is served by one SIE cluster — no api.openai.com, no per-token bill. It demonstrates the SIE model catalog by fanning a single request across ~9 specialised models, and it doubles as a per-model observability tool.

This is the "one cluster powers every model your agent calls" idea from the landing page, made real, runnable, and grounded on real contracts.

The catalog — the right model for each job

| Role | SIE model | SIE function |
| --- | --- | --- |
| Triage: classify document | `Qwen/Qwen3-0.6B` | chat |
| Orchestration (investigator) | `Qwen/Qwen3-4B-Instruct-2507` | chat + tools |
| Vision: read scanned signature page | `Qwen/Qwen3.5-4B` | chat + image |
| Reasoning sub-agent: clause risk | `Qwen/Qwen3-4B-Instruct-2507` (↑ `Qwen3.5-4B` / `Qwen3.6-27B` where served) | chat |
| Text-to-SQL: obligations DB | `defog/sqlcoder-7b-2` | completions |
| Guardrail: safety / injection | `ibm-granite/granite-guardian-3.0-2b` | chat |
| OCR: scan → markdown | `lightonai/LightOnOCR-2-1B` | extract |
| Clause search: embeddings | `BAAI/bge-m3` | encode |
| Clause rerank | `Qwen/Qwen3-Reranker-4B` | score |
| Entities: parties/dates/$ | `urchade/gliner_large-v2.1` | extract |

Each role is one line in config.yaml; swap a string to try another catalog model.
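As a rough illustration of the "swap a string" idea, the role-to-model mapping can be pictured as a small record with one field per role. This is a sketch only: the `Roles` class and its field names are hypothetical, and the real keys in config.yaml may differ. The model names are taken from the table above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Roles:
    """Hypothetical role -> model mapping; the real config.yaml keys may differ."""
    triage: str = "Qwen/Qwen3-0.6B"
    orchestration: str = "Qwen/Qwen3-4B-Instruct-2507"
    reasoning: str = "Qwen/Qwen3-4B-Instruct-2507"
    embeddings: str = "BAAI/bge-m3"
    rerank: str = "Qwen/Qwen3-Reranker-4B"


defaults = Roles()

# Swapping one string changes which catalog model serves that role;
# every other role keeps its default.
stronger = replace(defaults, reasoning="Qwen/Qwen3.6-27B")
```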

Architecture — two agents

The whole wiring is one idea: the Agents SDK speaks the OpenAI wire protocol and SIE serves an OpenAI-compatible /v1, so we point the SDK at SIE (set_default_openai_client + set_default_openai_api("chat_completions") + set_tracing_disabled) and each Agent names a SIE model.
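A minimal sketch of that wiring, using the three SDK functions named above. The helper names `sie_base_url` and `point_agents_sdk_at_sie` are hypothetical, the host defaults to the port from the `docker run` command in this description, and the API key is a placeholder (whether SIE checks it is an assumption). The SDK imports are done lazily so the module still loads where `openai-agents` is not installed.

```python
def sie_base_url(host: str) -> str:
    """Build the OpenAI-compatible base URL, assuming SIE serves it at <host>/v1."""
    return host.rstrip("/") + "/v1"


def point_agents_sdk_at_sie(
    host: str = "http://localhost:8080",
    api_key: str = "placeholder",  # assumption: a non-empty dummy value is enough
) -> None:
    """Route every Agents SDK model call to a SIE server instead of api.openai.com."""
    # Lazy imports: requires the `openai-agents` and `openai` packages at call time.
    from agents import (
        set_default_openai_api,
        set_default_openai_client,
        set_tracing_disabled,
    )
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url=sie_base_url(host), api_key=api_key)
    set_default_openai_client(client)           # all Agents use this client by default
    set_default_openai_api("chat_completions")  # chat completions rather than the Responses API
    set_tracing_disabled(True)                  # do not export traces to OpenAI
```

After calling `point_agents_sdk_at_sie()` once at startup, an `Agent(model="Qwen/Qwen3-0.6B", ...)` would resolve that model name against the SIE server rather than OpenAI.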

The flow is deliberately two agents:

- An investigator (tools, no output_type) autonomously calls tools to gather grounded facts. It is guarded by a granite-guardian input guardrail and delegates clause-risk analysis to a reasoning sub-agent.
- A synthesizer (output_type=ContractReview, no tools) formats the findings into structured output via SIE's JSON-schema-constrained generation.

Why split? With a structured output_type, a small open model emits the schema immediately and skips the tools (it will even hallucinate the fields). Separating "gather with tools" from "format the result" keeps the multi-model fan-out real and the output grounded. (tool_choice="required" would be another lever, but SIE returns 400 for forced tool calls on this model — only auto works.)
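The split can be sketched as two `Agent` definitions: one with tools and no `output_type`, one with an `output_type` and no tools. This is an illustration under assumptions, not the example's actual code: the `ContractReview` fields, the `lookup_clause` tool, and the instruction strings are all hypothetical, and the input guardrail and risk sub-agent are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class ContractReview:
    """Hypothetical fields; the example's real schema is not shown in this description."""
    buyer: str
    risks: list[str]


def lookup_clause(query: str) -> str:
    """Stand-in tool body; the real tools call SIE encode/score/extract."""
    return f"no clause found for {query!r}"


def build_agents(model: str = "Qwen/Qwen3-4B-Instruct-2507"):
    """Return (investigator, synthesizer). Requires the `openai-agents` package."""
    from agents import Agent, function_tool  # lazy import

    investigator = Agent(
        name="investigator",
        instructions="Call tools to gather facts. Report only what the tools return.",
        model=model,
        tools=[function_tool(lookup_clause)],  # tools, no output_type
    )
    synthesizer = Agent(
        name="synthesizer",
        instructions="Format the findings you are given. Do not add facts.",
        model=model,
        output_type=ContractReview,  # structured output, no tools
    )
    return investigator, synthesizer
```

Running the pair would then be two sequential `Runner.run` calls, with the investigator's `final_output` passed as the synthesizer's input, so the schema-constrained step only ever sees facts that came from tools.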

Observability

A normal uv run review prints a per-model ledger: model, SIE function, cold-start warm-up, warm latency, data sent, and warm throughput (tokens/s) — warm-up shown separately from throughput so a cold model's numbers aren't blended into a meaningless "1 tok/s".
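To see why warm-up is kept apart from throughput, here is a small sketch of a ledger that tags each call as cold or warm. The `Ledger` and `Call` classes are hypothetical, the numbers are illustrative rather than measured, and how the real ledger detects a cold call is not shown in this description.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Call:
    seconds: float
    units: int   # tokens, items, or docs produced by the call
    cold: bool   # True when the call included model warm-up


class Ledger:
    def __init__(self) -> None:
        self.calls: dict[str, list[Call]] = defaultdict(list)

    def record(self, model: str, seconds: float, units: int, cold: bool = False) -> None:
        self.calls[model].append(Call(seconds, units, cold))

    def warm_up(self, model: str) -> float:
        """Total seconds spent in cold calls."""
        return sum(c.seconds for c in self.calls[model] if c.cold)

    def warm_throughput(self, model: str) -> float:
        """Units per second over warm calls only; 0.0 if there are none."""
        warm = [c for c in self.calls[model] if not c.cold]
        secs = sum(c.seconds for c in warm)
        return sum(c.units for c in warm) / secs if secs else 0.0

    def blended_throughput(self, model: str) -> float:
        """The misleading figure: all units over all seconds, cold included."""
        secs = sum(c.seconds for c in self.calls[model])
        return sum(c.units for c in self.calls[model]) / secs if secs else 0.0


ledger = Ledger()
ledger.record("triage", seconds=100.0, units=50, cold=True)  # illustrative cold start
ledger.record("triage", seconds=0.5, units=50)               # illustrative warm call
```

With these illustrative numbers the warm throughput is 100 units/s, while the blended figure falls just below 1 unit/s because the 100-second cold start dominates the denominator.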

Data

Default corpus is CUAD (Contract Understanding Atticus Dataset) — 510 real SEC-filed commercial contracts, CC BY 4.0. uv run fetch-contracts downloads CUAD's ~18 MB archive once and parses the contract text; uv run make-sample builds a fully synthetic offline alternative. The text-to-SQL obligations DB is seeded from the fetched contracts.

Run it

docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default
cd examples/contract-review-agent && cp .env.example .env && uv sync
uv run fetch-contracts        # or: uv run make-sample (offline)
uv run review

Validated end-to-end against a GPU SIE cluster

Run against a live GPU cluster, the investigator autonomously fanned out across 6–7 distinct models and produced a grounded review of a real CUAD contract (a Shenzhen LOHAS Supply Contract). It correctly extracted the buyer and flagged both the contradictory 30%-vs-5% late-delivery penalty and the force-majeure certificate issue. In a run where the OCR/vision tools were unavailable, it reported execution as uncertain "due to tool failures" instead of hallucinating.

Sample observability ledger from a live run:

| Model | Warm-up | Warm latency | Throughput |
| --- | --- | --- | --- |
| `Qwen3-0.6B` (triage) | 105.7s | 0.73s | 82 tok/s |
| `Qwen3-4B-Instruct` (risk sub-agent, this run) | - (warm) | 3.4s | 132 tok/s |
| `bge-m3` (embed) | - | 0.5–1.4s | 12–36 items/s |
| `Qwen3-Reranker-4B` (rerank) | - | 0.2–0.5s | 22–57 docs/s |
| `gliner_large` (entities) | - | 1.6s | 7.7 ent/s |
| `Qwen3.5-4B` (vision) | ~140s | ~28s for a 161 KB page image | image-prefill bound, not a text-decode rate |

The committed default uses Qwen3-4B-Instruct-2507 for reasoning — the model that produced the grounded review above. Qwen3.5-4B (newest) and Qwen3.6-27B (strongest) are one-line swaps for clusters that serve them well.

Engineering decisions surfaced by live testing

- Two-agent split to stop a 4B model short-circuiting/hallucinating past its tools (the key fix).
- Two OpenAI clients: tool calls use our own provisioning-retry (max_retries=0); Agents-SDK calls, which we can't wrap, use a hard-retrying client (max_retries=12) so they survive a model being evicted mid-run on a busy cluster.
- Resilience: the guardrail fails open (logged) if the guard model is down; tool/model failures degrade gracefully rather than crashing the run; warm-up retries cover scale-from-zero cold starts.
- Cache fix: the Agents SDK hands each tool a shallow copy of the context, so the clause-embedding cache lives in a shared dict (mutated), not a reassigned attribute.
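The cache fix is easiest to see in isolation. The sketch below uses `copy.copy` as a stand-in for the shallow copy described above (how the SDK actually makes the copy is not shown here), and the `ReviewContext` class and its fields are hypothetical.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ReviewContext:
    """Hypothetical context object; field names are illustrative."""
    clause_cache: dict[str, list[float]] = field(default_factory=dict)


def tool_reassigns(ctx: ReviewContext) -> None:
    # Rebinding the attribute only changes the copy this tool received.
    ctx.clause_cache = {"clause-1": [0.1, 0.2]}


def tool_mutates(ctx: ReviewContext) -> None:
    # Mutating the dict changes the one object that every shallow copy points to.
    ctx.clause_cache["clause-1"] = [0.1, 0.2]


shared = ReviewContext()

tool_reassigns(copy.copy(shared))  # simulate the shallow copy a tool receives
lost = dict(shared.clause_cache)   # the write never reached `shared`

tool_mutates(copy.copy(shared))
kept = dict(shared.clause_cache)   # the write is visible through `shared`
```

A shallow copy duplicates the outer object but shares its attribute values, so only in-place mutation of the dict survives across tool calls.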

Notes / caveats

- Dependencies: openai-agents is pinned <0.14 because newer releases require websockets>=15 (for realtime/voice, unused here), which conflicts with sie-sdk's websockets<15.
- CUAD fetch downloads data.zip directly (the datasets 5.x release dropped script-based datasets, so load_dataset no longer works for it).
- The example targets a GPU cluster. On the test cluster, Qwen3.6-27B, sqlcoder-7b-2, granite-guardian, LightOnOCR, and (intermittently) the vision model Qwen3.5-4B did not provision reliably. Those steps degraded gracefully (logged in the ledger); they run when the cluster is fully provisioned. The reasoning role therefore defaults to the reliable Qwen3-4B-Instruct-2507; Qwen3.5-4B (newest) and Qwen3.6-27B (strongest) are documented one-line swaps.

… OpenAI Agents SDK

An OpenAI Agents SDK agent whose every model call is served by one SIE cluster. An autonomous 'investigator' agent fans out across the SIE catalog (no structured output_type, so it must use its tools) and a 'synthesizer' agent produces a grounded, structured ContractReview:

- triage Qwen3-0.6B, orchestration Qwen3-4B-Instruct, vision Qwen3.5-4B, risk-analysis sub-agent Qwen3-4B-Instruct (newer Qwen3.5-4B / stronger Qwen3.6-27B where the cluster serves them), text-to-SQL sqlcoder-7b-2, OCR LightOnOCR-2-1B, embeddings bge-m3, rerank Qwen3-Reranker-4B, entities gliner_large; granite-guardian input guardrail.
- Real contracts from CUAD (CC BY 4.0), with a synthetic offline fallback.
- Per-model observability: cold-start warm-up vs warm throughput, per call.
- Resilient: fail-open guardrail, graceful tool degradation, and provisioning retries for cold/evicted models.

Validated end-to-end against a GPU SIE cluster.

svonava force-pushed the daniel/contract-review-agent branch from 2c77c31 to 6912f3e on June 23, 2026 13:59

svonava wants to merge 1 commit into main from daniel/contract-review-agent
