Can an agent train a good embedding model on its own? autoembed is a benchmark
for agentic embedding-model training: a CLI agent (Claude Code, Codex, Gemini…)
is given a fixed base model, a wall-clock budget on one GPU, and a dev eval to
query — and must produce the best embedding model it can. The harness then scores
the agent's final_model/ on a frozen eval, including a held-out set the agent
never sees.
The measurement is frozen; the method is free.
run_task.sh <agent>seeds a fresh workdir with the agent-facing files only (task.py,instructions.md,timer.sh), sets a deadline, and launches the agent viaagents/<agent>/solve.sh.- The agent reads
instructions.md, writes its own training code (using the helpers intask.py), self-evaluates on the dev proxy (task.evaluate), paces itself withbash timer.sh, and saves its best model tofinal_model/. - When the agent stops,
score.pyruns outside the workdir and scoresfinal_model/on the dev proxy, the hidden held-out set, and a contamination audit.
Integrity is structural, not honor-system: the scorer (score.py), which defines
and runs the held-out set, never enters the agent's workdir, and the harness scores
from its own copy — so the agent can't see or tamper with the official metric.
task.py |
fixed base model, dev eval, contamination check (the agent gets a copy) |
instructions.md |
the prompt given to the agent |
agents/<agent>/solve.sh |
per-agent launchers (claude · codex · gemini) |
run_task.sh |
orchestrator: seed workdir → run agent → score |
timer.sh |
remaining-budget query the agent calls |
score.py |
harness scoring: dev + hidden held-out + contamination; defines the held-out tasks (harness-only) |
reference.py |
reference-model anchors on the dev proxy |
Dockerfile |
container env; run with docker run --gpus all |
Needs a local NVIDIA GPU, the agent CLI installed + authenticated (API key or
subscription), and uv.
uv sync
./run_task.sh claude # native: run Claude Code for the default budget
HOURS=10 ./run_task.sh codex # 10-hour budget with Codex
MODE=docker ./run_task.sh gemini # isolated container run (needs the agent CLI in the image)GPU note: pyproject.toml pins torch to a CUDA 12.8 wheel index; adjust it to
your driver (see the comment there).