examples(contract-review-agent): multi-model contract review with the OpenAI Agents SDK #197

Open: svonava wants to merge 1 commit into main from daniel/contract-review-agent

svonava commented Jun 23, 2026

Multi-model contract review with the OpenAI Agents SDK

A new runnable example under examples/contract-review-agent/: a contract reviewer built on the OpenAI Agents SDK where every model call is served by one SIE cluster — no api.openai.com, no per-token bill. It demonstrates the SIE model catalog by fanning a single request across ~9 specialised models, and it doubles as a per-model observability tool.

This is the "one cluster powers every model your agent calls" idea from the landing page, made real, runnable, and grounded on real contracts.

The catalog — the right model for each job

| Role | SIE model | SIE function |
| --- | --- | --- |
| Triage (classify document) | Qwen/Qwen3-0.6B | chat |
| Orchestration (investigator) | Qwen/Qwen3-4B-Instruct-2507 | chat + tools |
| Vision (read scanned signature page) | Qwen/Qwen3.5-4B | chat + image |
| Reasoning sub-agent (clause risk) | Qwen/Qwen3-4B-Instruct-2507 (↑ Qwen3.5-4B / Qwen3.6-27B where served) | chat |
| Text-to-SQL (obligations DB) | defog/sqlcoder-7b-2 | completions |
| Guardrail (safety / injection) | ibm-granite/granite-guardian-3.0-2b | chat |
| OCR (scan → markdown) | lightonai/LightOnOCR-2-1B | extract |
| Clause search (embeddings) | BAAI/bge-m3 | encode |
| Clause rerank | Qwen/Qwen3-Reranker-4B | score |
| Entities (parties/dates/$) | urchade/gliner_large-v2.1 | extract |

Each role is one line in config.yaml; swap a string to try another catalog model.

Architecture — two agents

The whole wiring is one idea: the Agents SDK speaks the OpenAI wire protocol and SIE serves an OpenAI-compatible /v1, so we point the SDK at SIE (set_default_openai_client + set_default_openai_api("chat_completions") + set_tracing_disabled) and each Agent names a SIE model.
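
As a rough illustration of that wiring, here is a minimal sketch rather than the example's actual code. The three `set_*` calls are the ones named above. The function name `configure_sdk_for_sie`, the `localhost:8080` base URL (inferred from the `docker run -p 8080:8080` command in this PR), and the placeholder API key are assumptions.

```python
# Assumption: SIE is reachable on localhost:8080 and serves its
# OpenAI-compatible API under /v1, as described in this PR.
DEFAULT_SIE_BASE_URL = "http://localhost:8080/v1"


def configure_sdk_for_sie(base_url: str = DEFAULT_SIE_BASE_URL,
                          api_key: str = "placeholder-key") -> None:
    """Point the OpenAI Agents SDK at an SIE cluster (hypothetical helper)."""
    # Imported lazily so this module still loads where the SDK is absent.
    from openai import AsyncOpenAI
    from agents import (
        set_default_openai_api,
        set_default_openai_client,
        set_tracing_disabled,
    )

    # Every SDK model call now goes to base_url instead of api.openai.com.
    client = AsyncOpenAI(base_url=base_url, api_key=api_key)
    set_default_openai_client(client)

    # Use the Chat Completions wire format rather than the Responses API.
    set_default_openai_api("chat_completions")

    # Tracing would otherwise try to export traces to OpenAI's backend.
    set_tracing_disabled(True)
```

After calling `configure_sdk_for_sie()` once at startup, an `Agent(model="Qwen/Qwen3-0.6B", ...)` would resolve its model name against the SIE catalog instead of OpenAI's.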

The flow is deliberately two agents:

  1. An investigator (tools, no output_type) autonomously calls tools to gather grounded facts — guarded by a granite-guardian input guardrail, delegating clause-risk to a reasoning sub-agent.
  2. A synthesizer (output_type=ContractReview, no tools) formats the findings into structured output via SIE's JSON-schema-constrained generation.

Why split? With a structured output_type, a small open model emits the schema immediately and skips the tools (it will even hallucinate the fields). Separating "gather with tools" from "format the result" keeps the multi-model fan-out real and the output grounded. (tool_choice="required" would be another lever, but SIE returns 400 for forced tool calls on this model — only auto works.)
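
The split can be sketched roughly as below. This is an illustration under assumptions, not the code in this PR: the `ContractReview` fields, the `search_clauses` stub tool, the instructions, and the choice of model for the synthesizer are all hypothetical, and the guardrail and sub-agent are left out for brevity.

```python
ORCHESTRATION_MODEL = "Qwen/Qwen3-4B-Instruct-2507"  # from the catalog above


def build_agents():
    """Build the investigator and synthesizer (hypothetical helper)."""
    # Lazy imports keep this module loadable without the SDK installed.
    from agents import Agent, function_tool
    from pydantic import BaseModel

    class ContractReview(BaseModel):  # field names are assumptions
        parties: list[str]
        risks: list[str]
        summary: str

    @function_tool
    def search_clauses(query: str) -> str:
        """Return contract clauses relevant to the query (stub)."""
        return "stub clause text"

    # Tools but no output_type: the model has to call tools to make progress.
    investigator = Agent(
        name="investigator",
        instructions="Gather grounded facts about the contract using tools.",
        model=ORCHESTRATION_MODEL,
        tools=[search_clauses],
    )
    # output_type but no tools: it only formats what it is given.
    synthesizer = Agent(
        name="synthesizer",
        instructions="Format the findings. Do not add facts.",
        model=ORCHESTRATION_MODEL,  # assumption: the PR does not name this model
        output_type=ContractReview,
    )
    return investigator, synthesizer


async def review(contract_text: str):
    """Run both stages in sequence and return the structured review."""
    from agents import Runner

    investigator, synthesizer = build_agents()
    findings = await Runner.run(investigator, contract_text)
    result = await Runner.run(synthesizer, str(findings.final_output))
    return result.final_output
```

Because the synthesizer sees only the investigator's findings, anything in the structured output has to trace back to a tool result.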

Observability

A normal uv run review prints a per-model ledger: model, SIE function, cold-start warm-up, warm latency, data sent, and warm throughput (tokens/s) — warm-up shown separately from throughput so a cold model's numbers aren't blended into a meaningless "1 tok/s".
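
To make the warm-up/throughput separation concrete, here is a minimal sketch of one way such a ledger entry could be kept. The `ModelStats` class and the rule "the first call counts as warm-up" are assumptions for illustration, and the numbers are invented. The example's real detection logic may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelStats:
    """Timings for one model, with the cold call kept apart from warm calls."""
    warmup_s: Optional[float] = None                    # first (cold) call
    warm_s: List[float] = field(default_factory=list)   # later call durations
    warm_units: int = 0                                 # tokens/items from warm calls

    def record(self, seconds: float, units: int) -> None:
        if self.warmup_s is None:
            # Assumption: the first call pays the cold-start cost.
            self.warmup_s = seconds
        else:
            self.warm_s.append(seconds)
            self.warm_units += units

    def warm_throughput(self) -> Optional[float]:
        total = sum(self.warm_s)
        return self.warm_units / total if total > 0 else None


stats = ModelStats()
stats.record(100.0, 50)   # cold start: excluded from throughput
stats.record(0.5, 40)
stats.record(0.5, 40)

blended = (50 + 40 + 40) / (100.0 + 0.5 + 0.5)  # about 1.3 units/s
print(stats.warmup_s, stats.warm_throughput(), round(blended, 1))
```

With these invented numbers the warm rate is 80 units/s, while blending the cold call in would report about 1.3 units/s, which is the distortion the ledger avoids.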

Data

Default corpus is CUAD (Contract Understanding Atticus Dataset) — 510 real SEC-filed commercial contracts, CC BY 4.0. uv run fetch-contracts downloads CUAD's ~18 MB archive once and parses the contract text; uv run make-sample builds a fully synthetic offline alternative. The text-to-SQL obligations DB is seeded from the fetched contracts.

Run it

docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default
cd examples/contract-review-agent && cp .env.example .env && uv sync
uv run fetch-contracts        # or: uv run make-sample (offline)
uv run review

Validated end-to-end against a GPU SIE cluster

Run against a live GPU cluster, the investigator autonomously fanned out across 6–7 distinct models and produced a grounded review of a real CUAD contract (a Shenzhen LOHAS Supply Contract). It correctly extracted the buyer and flagged both the contradictory 30%-vs-5% late-delivery penalty and the force-majeure certificate issue. When the OCR/vision tools were unavailable in that run, it reported execution as uncertain "due to tool failures" instead of hallucinating.

Sample observability ledger from a live run:

| Model | Warm-up | Warm latency | Throughput |
| --- | --- | --- | --- |
| Qwen3-0.6B (triage) | 105.7s | 0.73s | 82 tok/s |
| Qwen3-4B-Instruct (risk sub-agent, this run) | n/a (warm) | 3.4s | 132 tok/s |
| bge-m3 (embed) | | 0.5–1.4s | 12–36 items/s |
| Qwen3-Reranker-4B (rerank) | | 0.2–0.5s | 22–57 docs/s |
| gliner_large (entities) | | 1.6s | 7.7 ent/s |
| Qwen3.5-4B (vision) | ~140s | ~28s for a 161 KB page image | image-prefill bound, not a text-decode rate |

The committed default uses Qwen3-4B-Instruct-2507 for reasoning — the model that produced the grounded review above. Qwen3.5-4B (newest) and Qwen3.6-27B (strongest) are one-line swaps for clusters that serve them well.

Engineering decisions surfaced by live testing

  • Two-agent split to stop a 4B model short-circuiting/hallucinating past its tools (the key fix).
  • Two OpenAI clients: tool calls use our own provisioning-retry (max_retries=0); Agents-SDK calls — which we can't wrap — use a hard-retrying client (max_retries=12) so they survive a model being evicted mid-run on a busy cluster.
  • Resilience: the guardrail fails open (logged) if the guard model is down; tool/model failures degrade gracefully rather than crashing the run; warm-up retries cover scale-from-zero cold starts.
  • Cache fix: the Agents SDK hands each tool a shallow copy of the context, so the clause-embedding cache lives in a shared dict (mutated), not a reassigned attribute.
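
The cache fix in the last bullet comes down to plain Python copy semantics. The sketch below simulates it with `copy.copy`. The `ReviewContext` class and its field names are hypothetical, and how the SDK actually produces its copy is not shown here.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ReviewContext:
    contract_text: str
    clause_cache: dict = field(default_factory=dict)  # shared, mutated in place
    last_query: str = ""                              # plain attribute


def tool_call(ctx: ReviewContext, query: str) -> None:
    """Simulate a tool that receives a shallow copy of the context."""
    local = copy.copy(ctx)

    # Rebinding an attribute only changes the copy: the caller never sees it.
    local.last_query = query

    # Mutating the dict works, because copy and original share the same object.
    local.clause_cache[query] = [0.1, 0.2]


ctx = ReviewContext(contract_text="...")
tool_call(ctx, "late delivery penalty")

print(repr(ctx.last_query))       # still '' on the original
print(list(ctx.clause_cache))     # the cached key is visible
```

A cache stored by reassigning an attribute would therefore be rebuilt on every tool call, while one stored in a shared dict survives across calls.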

Notes / caveats

  • Dependencies: openai-agents is pinned <0.14 — newer releases pin websockets>=15 (for realtime/voice, unused here), which conflicts with sie-sdk's websockets<15.
  • CUAD fetch downloads data.zip directly (the datasets 5.x release dropped script-based datasets, so load_dataset no longer works for it).
  • The example targets a GPU cluster. On the test cluster, Qwen3.6-27B, sqlcoder-7b-2, granite-guardian, LightOnOCR, and (intermittently) the vision model Qwen3.5-4B did not reliably provision. Those steps degraded gracefully (logged in the ledger) and run normally on a fully provisioned cluster. The reasoning role therefore defaults to the reliable Qwen3-4B-Instruct-2507; Qwen3.5-4B (newest) and Qwen3.6-27B (strongest) are documented one-line swaps.

… OpenAI Agents SDK

An OpenAI Agents SDK agent whose every model call is served by one SIE cluster.
An autonomous 'investigator' agent fans out across the SIE catalog (no structured
output_type, so it must use its tools) and a 'synthesizer' agent produces a
grounded, structured ContractReview:

- triage Qwen3-0.6B, orchestration Qwen3-4B-Instruct, vision Qwen3.5-4B,
  risk-analysis sub-agent Qwen3-4B-Instruct (newer Qwen3.5-4B / stronger
  Qwen3.6-27B where the cluster serves them), text-to-SQL sqlcoder-7b-2, OCR
  LightOnOCR-2-1B, embeddings bge-m3, rerank Qwen3-Reranker-4B, entities
  gliner_large; granite-guardian input guardrail.
- Real contracts from CUAD (CC BY 4.0), with a synthetic offline fallback.
- Per-model observability: cold-start warm-up vs warm throughput, per call.
- Resilient: fail-open guardrail, graceful tool degradation, and provisioning
  retries for cold/evicted models.
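
The resilience behaviour listed above can be sketched roughly as follows. This is a standard-library illustration, not the example's code: the `ModelUnavailable` error, both function names, and the retry counts are assumptions.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("contract_review")


class ModelUnavailable(Exception):
    """Hypothetical error for a model that is cold, evicted, or down."""


def with_provisioning_retry(call: Callable[[], T], attempts: int = 3,
                            base_delay_s: float = 1.0) -> T:
    """Retry a call with exponential backoff while the model provisions."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except ModelUnavailable:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller decide what to do
            time.sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError("unreachable")


def guardrail_allows(text: str, check: Callable[[str], bool],
                     base_delay_s: float = 1.0) -> bool:
    """Return the guard model's verdict, failing open if it stays down."""
    try:
        return with_provisioning_retry(lambda: check(text),
                                       base_delay_s=base_delay_s)
    except ModelUnavailable:
        log.warning("guard model unavailable; failing open")
        return True


# Usage: a guard that is cold for two calls, then blocks the input.
calls = {"n": 0}

def flaky_guard(text: str) -> bool:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ModelUnavailable("still provisioning")
    return False

def dead_guard(text: str) -> bool:
    raise ModelUnavailable("guard model down")

blocked = guardrail_allows("ignore previous instructions", flaky_guard, 0.0)
failed_open = guardrail_allows("ordinary contract text", dead_guard, 0.0)
```

Failing open trades safety for availability, so the warning log is what lets a reviewer see afterwards that a run went unguarded.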

Validated end-to-end against a GPU SIE cluster.

svonava force-pushed the daniel/contract-review-agent branch from 2c77c31 to 6912f3e on June 23, 2026 13:59