Evaluate and improve AI agents from the runs they already produce.
agent-eval turns agent outputs, traces, judge scores, and production feedback into a decision packet: did this change help, what failed, what should ship, and what needs more data?
Use it when you need to:
- compare a candidate agent/prompt/model against a baseline,
- turn production traces or human feedback into eval results,
- run a gated self-improvement loop,
- explain failures by cluster, cost, judge disagreement, and release risk.
It is a library, not a SaaS requirement. TypeScript is first-class; Python can call the same wire protocol through agent-eval-rpc.
pnpm add @tangle-network/agent-eval

Python clients can use the RPC package:

pip install agent-eval-rpc

Start here if you already have production logs, benchmark rows, human ratings, or agent run records.
import { analyzeRuns } from '@tangle-network/agent-eval/contract'

const report = await analyzeRuns({
  runs, // RunRecord[]
  baselineRuns,
})
console.log(report.recommendations)
console.log(report.lift)
console.log(report.failureClusters)

The output includes score distributions, lift confidence intervals, failure modes, cost-quality tradeoffs, judge agreement, contamination checks, and release recommendations when the input supports them.
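To make "lift confidence intervals" concrete, here is a minimal sketch of a percentile paired bootstrap over per-scenario score differences. It is an illustration of the general technique under stated assumptions, not the library's implementation: `pairedBootstrapLift`, `LiftEstimate`, and the seeded `mulberry32` generator are hypothetical names introduced only for this example, and the scores are made up.

```typescript
// Illustrative sketch only; every name here is hypothetical, not the library API.

/** Small seeded PRNG (mulberry32) so the resampling is reproducible. */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface LiftEstimate {
  lift: number;  // mean of (candidate - baseline) over paired scenarios
  lower: number; // 2.5th percentile of the bootstrap means
  upper: number; // 97.5th percentile of the bootstrap means
}

function pairedBootstrapLift(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  seed = 42,
): LiftEstimate {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error('need equal-length, non-empty paired scores');
  }
  // Pairing matters: each difference compares the two agents on the same scenario.
  const diffs = candidate.map((score, i) => score - baseline[i]);
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < diffs.length; i++) {
      // Resample the differences with replacement.
      sum += diffs[Math.floor(rand() * diffs.length)];
    }
    means.push(sum / diffs.length);
  }
  means.sort((x, y) => x - y);
  const lift = diffs.reduce((s, d) => s + d, 0) / diffs.length;
  return {
    lift,
    lower: means[Math.floor(0.025 * (resamples - 1))],
    upper: means[Math.ceil(0.975 * (resamples - 1))],
  };
}

// Made-up scores for six paired scenarios.
const baselineScores = [0.6, 0.5, 0.7, 0.4, 0.65, 0.55];
const candidateScores = [0.7, 0.6, 0.75, 0.5, 0.7, 0.65];
const estimate = pairedBootstrapLift(baselineScores, candidateScores);
console.log(estimate);
```

An interval that lies entirely above zero suggests the candidate helped on these scenarios; one that straddles zero is the kind of case where more data is needed.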
Use this when you have scenarios, a runnable agent, and judges.
import { selfImprove } from '@tangle-network/agent-eval/contract'

const result = await selfImprove({
  scenarios,
  dispatch: async ({ scenario }) => myAgent.run(scenario),
  judges: [myJudge],
  baselineSurface: { systemPrompt: currentPrompt },
})
console.log(result.gateDecision)
console.log(result.winnerSurface)
console.log(result.insight.recommendations)

selfImprove() evaluates candidates on held-out scenarios before recommending a winner.
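The held-out idea can be sketched independently of the library: pick the best challenger on one split, then confirm it only if it also beats the baseline on scenarios it was never selected on. The code below is a rough illustration under assumptions of my own. `assignSplit`, `pickWinner`, the `Candidate` shape, and the 30% holdout fraction are all hypothetical and do not describe how selfImprove() is actually implemented.

```typescript
// Illustrative sketch only; every name here is hypothetical, not the library API.

/** FNV-1a over UTF-16 code units: a stable 32-bit hash of a scenario id. */
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

type Split = 'search' | 'holdout';

/** The same id always lands in the same split, across runs and machines. */
function assignSplit(scenarioId: string, holdoutFraction = 0.3): Split {
  return fnv1a(scenarioId) / 0x100000000 < holdoutFraction ? 'holdout' : 'search';
}

interface Candidate {
  name: string;
  scoreOn(scenarioId: string): number; // stand-in for dispatch + judge
}

function meanScore(candidate: Candidate, ids: string[]): number {
  return ids.reduce((sum, id) => sum + candidate.scoreOn(id), 0) / ids.length;
}

function pickWinner(
  searchIds: string[],
  holdoutIds: string[],
  baseline: Candidate,
  challengers: Candidate[],
): string {
  // Without evidence on both splits, keep the baseline.
  if (searchIds.length === 0 || holdoutIds.length === 0) return baseline.name;
  // Step 1: select the best candidate on the search split only.
  let best = baseline;
  for (const challenger of challengers) {
    if (meanScore(challenger, searchIds) > meanScore(best, searchIds)) best = challenger;
  }
  // Step 2: confirm on scenarios that played no part in the selection.
  const confirmed = meanScore(best, holdoutIds) > meanScore(baseline, holdoutIds);
  return confirmed ? best.name : baseline.name;
}

const ids = Array.from({ length: 20 }, (_, i) => `scenario-${i}`);
const searchIds = ids.filter((id) => assignSplit(id) === 'search');
const holdoutIds = ids.filter((id) => assignSplit(id) === 'holdout');

const baseline: Candidate = { name: 'baseline', scoreOn: () => 0.5 };
// A challenger that only looks good on the scenarios it was tuned against.
const overfit: Candidate = {
  name: 'overfit',
  scoreOn: (id) => (assignSplit(id) === 'search' ? 1 : 0.2),
};

const winner = pickWinner(searchIds, holdoutIds, baseline, [overfit]);
console.log(winner);
```

The overfit challenger wins the search split but is rejected at the confirmation step, which is the failure mode a held-out check exists to catch.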
import { analyzeRuns, fromFeedbackTable, fromOtelSpans } from '@tangle-network/agent-eval/contract'

const { runs, raterScores } = fromFeedbackTable({
  ratings: parseYourFeedbackTable(),
})
const traceRuns = fromOtelSpans({ spans: yourOtelSpans })
await analyzeRuns({ runs: [...runs, ...traceRuns], raterScores })

- RunRecord: the durable row for one agent run: model, prompt/config hashes, split, cost, tokens, outcome.
- Scenario: one task or case the agent attempts.
- Judge: a scoring function, rule-based or model-based.
- InsightReport: the decision packet returned by analyzeRuns() and embedded in selfImprove().
- Gate: the policy that decides ship, hold, or need_more_data.
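The vocabulary above can be pictured with a small sketch of a rule-based judge and a three-way gate. The shapes below are hypothetical stand-ins written for illustration. They are not the Scenario, JudgeConfig, or Gate types the library exports, and the thresholds and the mapping from interval to decision are assumptions of this example.

```typescript
// Illustrative sketch only; these types are hypothetical stand-ins for the
// concepts above, not the shapes exported by the library.

interface Scenario {
  id: string;
  input: string;
  expected: string;
}

/** A judge maps one scenario and the agent's output to a score in [0, 1]. */
type Judge = (scenario: Scenario, output: string) => number;

/** Rule-based judge: full credit if the expected answer appears in the output. */
const containsExpected: Judge = (scenario, output) =>
  output.toLowerCase().includes(scenario.expected.toLowerCase()) ? 1 : 0;

type GateDecision = 'ship' | 'hold' | 'need_more_data';

interface LiftSummary {
  sampleSize: number;
  lower: number; // lower bound of the lift confidence interval
  upper: number; // upper bound of the lift confidence interval
}

/** Assumed policy: ship on a clear win, hold on a clear loss, otherwise wait. */
function simpleGate(summary: LiftSummary, minSamples = 30): GateDecision {
  if (summary.sampleSize < minSamples) return 'need_more_data';
  if (summary.lower > 0) return 'ship';
  if (summary.upper < 0) return 'hold';
  return 'need_more_data'; // interval straddles zero
}

const scenario: Scenario = { id: 'capital-fr', input: 'Capital of France?', expected: 'Paris' };
console.log(containsExpected(scenario, 'The capital is Paris.'));
console.log(simpleGate({ sampleSize: 120, lower: 0.02, upper: 0.09 }));
```

Keeping the judge and the gate as separate, pure functions means the same scores can be replayed through a stricter or looser policy without re-running the agent.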
| Journey | Example | Who it's for |
|---|---|---|
| Closed loop: improve a prompt under statistical confidence | examples/selfimprove-quickstart/ | Teams with scenarios + judges + agent in hand |
| Multi-rater feedback corpus: turn Obsidian/Sheets/CSV ratings into actionable insights | examples/customer-feedback-loop/ | Teams reviewing AI outputs by hand who want to compress that taste into per-member LLM judges + close the loop |
| Production OTel traces: analyze logs you already have, no closed loop required | examples/customer-otel-traces/ | Teams running agents in prod with observability, no eval discipline yet |
Each example contains a README.md and a single index.ts runnable via pnpm tsx, and prints the resulting InsightReport to stdout.
| Subpath | What it gives you |
|---|---|
| …/contract | The headline, frozen surface; new code starts here. selfImprove, analyzeRuns, runEval, runCampaign, runImprovementLoop, diffRuns; intake adapters (fromFeedbackTable, fromOtelSpans); drivers (gepaDriver, evolutionaryDriver); gates (defaultProductionGate, heldOutGate, paretoSignificanceGate, composeGate); the deployment-outcome store; storage; and the five core types Scenario / Dispatch / JudgeConfig / Mutator / Gate. |
| …/hosted | createHostedClient / hostedClientFromEnv + the wire types to ship eval-run events + trace spans to a hosted orchestrator (ours or your own implementation of the spec) |
| …/adapters/otel | createOtelBridge: forwards OpenTelemetry-shape spans into the hosted-tier ingest, no @opentelemetry/* dependency |
| …/adapters/langchain | Wrap any LangChain Runnable as a Dispatch (or JudgeConfig), no @langchain/core peer dep |
| …/adapters/http | httpDispatch + runDispatchServer: run a campaign's worker on another machine (multi-region, driver-as-a-service) |
| …/campaign | The measurement + improvement engine (@experimental): runProfileMatrix, compareDrivers, every driver (gepaDriver, haloDriver, skillOptDriver, aceDriver, memoryCurationDriver, …), the gates, storage backends, and loop provenance. /contract re-exports the stable subset. |
| …/rl | RL bridge from eval artifacts to training signal: verifiable rewards, preferences, OPE, PRM, tournaments, contamination, compute curves, plus the durable corpus + buildRlDataset / datasheet bundle |
| …/reporting | Release-decision statistics: pairedBootstrap, benjaminiHochberg, anytime-valid sequential e-values, evaluateReleaseConfidence, and the report renderers |
| …/analyst | The trace-analyst surface: AnalystRegistry + buildDefaultAnalystRegistry (run the failure-clustering panel), FindingsStore, and the LLM chat transports |
| …/traces | Trace stores + emitters, OTLP-JSONL deterministic replay, analyzeTraces, and the traceAnalystOnRunComplete hook |
| …/control | Agent control loop: runAgentControlLoop (observe → validate → decide → act), action policy, propose/review |
| …/matrix | runAgentMatrix: an N-axis cartesian over caller-supplied substrate values, per-axis pass/score/cost/duration |
| …/multishot | N-shot persona × shot matrix runner (runMultishot / runMultishotMatrix) |
| …/wire | The cross-language HTTP/RPC server + Zod schemas (the source-of-truth protocol the Python client speaks) + the built-in rubric registry |
| …/benchmarks | BenchmarkAdapter contract + deterministicSplit + the bundled routing reference benchmark |
Specialized surfaces (subpath-only): …/prm (process-reward grading + best-of-N), …/meta-eval (judge calibration + the deployment-outcome store), …/pipelines (trace-diagnostic views: budget breach, failure cluster, stuck loop, …), …/governance (EU AI Act / NIST AI RMF / SOC2 reports), …/knowledge (knowledge-readiness gating before a run), …/builder-eval (code-generator three-layer eval), …/storyboard (trace → watchable replay), …/authenticity (anti-Goodhart "real or convincing BS" scorer over produced files), …/workflow (workflow-trace eval + partner export), …/telemetry (Workers-safe telemetry client).
The root export remains available for backward compatibility; new code should prefer the focused subpaths above — /contract first.
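The /reporting row names benjaminiHochberg among its release-decision statistics. For readers unfamiliar with the procedure, here is a minimal sketch of the standard Benjamini-Hochberg step-up rule for controlling the false discovery rate across several comparisons. It is written from the textbook definition, not copied from the library, and `benjaminiHochbergSketch` is a hypothetical name whose signature may differ from the exported function.

```typescript
// Illustrative sketch only; this is the textbook procedure under a hypothetical
// name, and the library's own function may have a different signature.

/**
 * Returns, for each p-value (in input order), whether its hypothesis is
 * rejected at false discovery rate `alpha`.
 */
function benjaminiHochbergSketch(pValues: number[], alpha = 0.05): boolean[] {
  const m = pValues.length;
  // Sort ascending while remembering each p-value's original position.
  const order = pValues
    .map((p, index) => ({ p, index }))
    .sort((x, y) => x.p - y.p);

  // Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
  let cutoff = -1;
  for (let rank = 0; rank < m; rank++) {
    if (order[rank].p <= ((rank + 1) / m) * alpha) cutoff = rank;
  }

  // Step-up: reject every hypothesis ranked at or below the cutoff, even if
  // an individual p-value in between missed its own threshold.
  const rejected = new Array<boolean>(m).fill(false);
  for (let rank = 0; rank <= cutoff; rank++) {
    rejected[order[rank].index] = true;
  }
  return rejected;
}

// Four made-up p-values, e.g. one per metric being compared.
const decisions = benjaminiHochbergSketch([0.001, 0.03, 0.035, 0.5]);
console.log(decisions);
```

The correction matters when one release is judged on many metrics or clusters at once: without it, the chance that at least one comparison looks significant by luck grows with every metric added.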
agent-eval is the bottom of the layering: consumers depend on it, it depends on none of them.
- agent-runtime: runs agents (chat turns, one-shot tasks, multi-attempt loops), captures every run as a trace, and calls optimizePrompt / runImprovementLoop. Produces the RunRecords + traces agent-eval scores. Depends on agent-eval.
- agent-eval (this repo): selfImprove, analyzeRuns, runCampaign + drivers (gepaDriver, …), the gates (heldOutGate, defaultProductionGate, paretoSignificanceGate), the InsightReport decision packet, the RL bridge, the wire protocol. Depends on neither consumer.
- agent-knowledge: proposeKnowledgeWrites / applyKnowledgeWriteBlocks. agent-eval's analyst findings feed it; the knowledge gate consumes them. Depends on agent-eval.
- sandbox: AgentProfile, Sandbox.create, streamPrompt. The execution surface the runtime's loops run on; agent-eval scores what comes back.
The rule: agent-eval has zero upward dependencies on a consumer. A concept that makes sense without a running agent loop — a verdict, a run record, a scenario, a judge score — is substrate and lives here; a runtime-shaped one (a sandbox profile, a validation context with an abort signal) lives in agent-runtime. When in doubt, lean substrate.
- docs/concepts.md: the three top-level functions, the layering rule, and the wire-protocol contract (the five core contract types are documented in the /contract barrel itself)
- docs/insight-report.md: annotated walkthrough of every section of the decision packet
- docs/customer-journeys.md: three end-to-end journeys with code + expected output
- docs/adapters-observability.md: composing agent-eval with LangSmith, Langfuse, Phoenix, OpenLLMetry, TraceAI
- docs/wire-protocol.md: the HTTP/RPC contract Python (and any future language) speaks
- docs/hosted-ingest-spec.md: the hosted-tier wire format, frozen at 2026-05-26.v1
- docs/design/: RFCs + architectural notes
The .claude/skills/agent-eval/SKILL.md skill ships embedded directives so LLM agents writing integration code don't reintroduce historical bug classes.
Wire your loop to a hosted orchestrator (ours, or your own implementation of the spec) with one config:
await selfImprove({
  scenarios, dispatch, judges, baselineSurface,
  hostedTenant: {
    endpoint: 'https://intelligence.tangle.tools',
    apiKey: process.env.TANGLE_API_KEY!,
    tenantId: 'your-tenant',
  },
})

The substrate runs the loop in your process. Only the eval-run events + (optional) trace spans go to the orchestrator. Your scenarios, your judges, and your raw data are never sent. Spec at docs/hosted-ingest-spec.md; reference receiver at examples/hosted-ingest-server/.
Run an example:
pnpm tsx examples/selfimprove-quickstart/index.ts
pnpm tsx examples/customer-feedback-loop/index.ts
pnpm tsx examples/customer-otel-traces/index.ts

Run the test suite:
pnpm install
pnpm build
pnpm test

The /contract surface is the stability contract: its barrel freezes the API, so a 0.x minor only adds; nothing there changes shape or disappears. Depend on /contract (and the documented subpaths) rather than the root barrel.
In the deeper subpaths, @stable / @experimental JSDoc markers (visible in IDE hover + .d.ts) call out what may still move — most granularly in /rl (tagged per export) and /campaign (whole barrel @experimental, since /contract re-exports only its settled subset).
| Tag | Meaning |
|---|---|
| @stable | API frozen at this major. Breaking changes require a major bump. |
| @experimental | Interface may evolve before becoming @stable. Pin the patch version if you depend on it. |
| @internal | Not part of the public contract. Use the documented subpath instead. |
CHANGELOG.md tracks every release with what's new / additive / breaking.
MIT. See LICENSE.