1 change: 1 addition & 0 deletions Cargo.lock

28 changes: 19 additions & 9 deletions README.md
@@ -149,19 +149,20 @@ provider-backed ELF evidence was required.
mem0, OpenViking, and claude-mem remained typed non-pass states. OpenViking now
reaches its pinned Docker local embedding path and is reported as `wrong_result`
when same-corpus evidence terms are missed; setup failures remain `incomplete`.
-- Real-world agent memory aggregate after the P1 benchmark batch: 38 fixture-backed
-jobs across 11 suites, 36 pass, 0 incomplete, 2 blocked, 0 wrong-result,
+- Real-world agent memory aggregate after the P1 benchmark batch: 40 fixture-backed
+jobs across 11 suites, 38 pass, 0 incomplete, 2 blocked, 0 wrong-result,
0 not-encoded, and 0 unsupported-claim results. The remaining non-pass jobs are
production-ops operator boundaries, not hidden benchmark wins.
- Full-suite live real-world adapter sweep after XY-899: ELF and qmd emit
-Docker-isolated `live_real_world` records for all 38 encoded jobs across 11 suites
+Docker-isolated `live_real_world` records for all 40 encoded jobs across 11 suites
through `cargo make real-world-memory-live-adapters`. Both keep the original
targeted `work_resume`, `retrieval`, and `project_decisions` slice passing, but the
-full sweep is not a full-suite pass. The fresh ELF sweep reports 18 pass,
-5 wrong_result, 2 blocked, and 13 not_encoded jobs. The fresh qmd sweep reports
-17 pass, 6 wrong_result, 2 blocked, and 13 not_encoded jobs. The difference is the
-delete/TTL tombstone case; qmd remains the local retrieval-debug UX reference, and
-no broad ELF-over-qmd claim is allowed.
+full sweep is not a full-suite pass. The fresh ELF sweep reports 22 pass,
+5 wrong_result, 2 blocked, and 11 not_encoded jobs. The fresh qmd sweep reports
+17 pass, 6 wrong_result, 2 blocked, and 15 not_encoded jobs. The differences are
+the delete/TTL tombstone case plus ELF-only capture/write-policy live self-checks;
+qmd remains the local retrieval-debug UX reference, and no broad ELF-over-qmd claim
+is allowed.
- Live operator-debugging slice after XY-932: `cargo make
real-world-job-operator-ux-live-adapters` emits narrow Docker-isolated
`live_real_world` records for ELF and qmd over the operator-debugging fixtures.
@@ -194,6 +195,12 @@ provider-backed ELF evidence was required.
for local SDK export-style parity, `blocked` for OpenMemory UI/export, and
`non_goal` for hosted Platform export and optional graph memory in the local OSS
lane.
+- Capture/write-policy live follow-up after XY-933: ELF now passes 4/4 live
+`capture_integration` jobs with zero redaction leaks, source ids preserved in
+source refs, write-policy redaction audit counts, evidence binding, and no secret
+leakage. qmd remains `not_encoded` for this suite. agentmemory capture comparison is
+blocked by mocked/in-memory storage, and claude-mem hook/viewer capture remains
+untested, so no broad capture-breadth superiority claim is allowed.
- The benchmark runner and report publisher are checked in and Docker-isolated:
`cargo make baseline-live-docker`, `cargo make baseline-backfill-docker`,
`cargo make baseline-production-private-addendum`,
@@ -216,6 +223,7 @@ Detailed evidence and interpretation:
- [ELF/qmd Trace Replay Diagnostics Report - June 11, 2026](docs/guide/benchmarking/2026-06-11-elf-qmd-trace-replay-diagnostics-report.md)
- [Graph/RAG Scored Smoke Adapter Report - June 11, 2026](docs/guide/benchmarking/2026-06-11-graph-rag-scored-smoke-adapter-report.md)
- [mem0/OpenMemory History and UI Export Report - June 11, 2026](docs/guide/benchmarking/2026-06-11-mem0-openmemory-history-ui-export-report.md)
+- [Capture/Write-Policy Live Report - June 11, 2026](docs/guide/benchmarking/2026-06-11-capture-write-policy-live-report.md)
- [Live Baseline Benchmark Runbook](docs/guide/benchmarking/live_baseline_benchmark.md)
- [Single-User Production Runbook](docs/guide/single_user_production.md)
- Benchmark contract:
@@ -238,7 +246,8 @@ Evidence-backed position after the June 11 real-world reports:
typed non-pass states, while ELF has the stronger service and provenance contract.
- ELF is still behind or not yet proven on full-suite live real-world pass parity,
private-corpus production quality, credentialed production-ops gates,
-qmd-style local debug knobs, agentmemory/claude-mem/OpenMemory-style continuity UX,
+qmd-style local debug knobs, agentmemory/claude-mem/OpenMemory-style capture and
+continuity UX,
OpenViking-style context trajectory, and hosted managed memory.

Quick comparison snapshot (objective/high-level).
@@ -292,6 +301,7 @@ Detailed comparison, mechanism-level analysis, and source map:
- [ELF/qmd Trace Replay Diagnostics Report - June 11, 2026](docs/guide/benchmarking/2026-06-11-elf-qmd-trace-replay-diagnostics-report.md)
- [Graph/RAG Scored Smoke Adapter Report - June 11, 2026](docs/guide/benchmarking/2026-06-11-graph-rag-scored-smoke-adapter-report.md)
- [mem0/OpenMemory History and UI Export Report - June 11, 2026](docs/guide/benchmarking/2026-06-11-mem0-openmemory-history-ui-export-report.md)
+- [Capture/Write-Policy Live Report - June 11, 2026](docs/guide/benchmarking/2026-06-11-capture-write-policy-live-report.md)
- [Live Baseline Benchmark Runbook](docs/guide/benchmarking/live_baseline_benchmark.md)
- [Real-World Agent Memory Benchmark](docs/guide/benchmarking/real_world_agent_memory_benchmark.md)
- [External Memory Improvement Plan](docs/guide/research/external_memory_improvement_plan.md)
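
Review note: the status tallies quoted in the README hunks above (and repeated in the manifest evidence strings) can be sanity-checked by confirming that each set of per-status counts sums to its stated job total. The sketch below does only that arithmetic. The dictionary layout and the names `TALLIES`, `tally_gap`, and `inconsistent` are illustrative assumptions, not the runner's actual report schema; the numbers are copied from this diff.

```python
# Status tallies copied from the before/after text in this diff.
# The structure is hypothetical; it is not the benchmark runner's report format.
TALLIES = {
    "fixture (before)": {"total": 38, "pass": 36, "incomplete": 0, "blocked": 2,
                         "wrong_result": 0, "not_encoded": 0, "unsupported_claim": 0},
    "fixture (after)": {"total": 40, "pass": 38, "incomplete": 0, "blocked": 2,
                        "wrong_result": 0, "not_encoded": 0, "unsupported_claim": 0},
    "ELF live (before)": {"total": 38, "pass": 18, "wrong_result": 5,
                          "incomplete": 0, "blocked": 2, "not_encoded": 13},
    "ELF live (after)": {"total": 40, "pass": 22, "wrong_result": 5,
                         "incomplete": 0, "blocked": 2, "not_encoded": 11},
    "qmd live (before)": {"total": 38, "pass": 17, "wrong_result": 6,
                          "incomplete": 0, "blocked": 2, "not_encoded": 13},
    "qmd live (after)": {"total": 40, "pass": 17, "wrong_result": 6,
                         "incomplete": 0, "blocked": 2, "not_encoded": 15},
}


def tally_gap(record):
    """Return the declared total minus the sum of per-status counts (0 means consistent)."""
    counted = sum(value for key, value in record.items() if key != "total")
    return record["total"] - counted


def inconsistent(tallies):
    """Return the names of tallies whose status counts do not add up to the total."""
    return [name for name, record in tallies.items() if tally_gap(record) != 0]


if __name__ == "__main__":
    for name, record in TALLIES.items():
        print(f"{name}: gap={tally_gap(record)}")
```

A check like this only confirms internal consistency of the quoted numbers; it says nothing about whether the live sweep actually produced them.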
1 change: 1 addition & 0 deletions apps/elf-eval/Cargo.toml
@@ -22,6 +22,7 @@ uuid = { workspace = true }
elf-chunking = { workspace = true }
elf-cli = { workspace = true }
elf-config = { workspace = true }
+elf-domain = { workspace = true }
elf-service = { workspace = true }
elf-storage = { workspace = true }
elf-testkit = { workspace = true }
Original file line number Diff line number Diff line change
@@ -29,7 +29,7 @@
},
"run": {
"status": "blocked",
-"evidence": "The current fixture set reports 38 jobs, 36 pass, 0 incomplete, 2 blocked, 0 wrong_result, 0 not_encoded, and 0 unsupported_claim.",
+"evidence": "The current fixture set reports 40 jobs, 38 pass, 0 incomplete, 2 blocked, 0 wrong_result, 0 not_encoded, and 0 unsupported_claim.",
"command": "cargo make real-world-memory",
"artifact": "tmp/real-world-memory/real-world-memory-report.json"
},
@@ -99,7 +99,7 @@
{
"suite_id": "capture_integration",
"status": "pass",
-"evidence": "The redaction and capture-boundary fixture is encoded and passing."
+"evidence": "Four redaction, exclusion, source-id, evidence-binding, and capture-boundary fixtures are encoded and passing."
},
{
"suite_id": "production_ops",
@@ -146,13 +146,13 @@
},
"run": {
"status": "wrong_result",
-"evidence": "ELF materializes 38 real_world_job adapter_response objects through ElfService, worker indexing, and search_raw before scoring; the full sweep includes typed wrong_result, blocked, and not_encoded job records.",
+"evidence": "ELF materializes 40 real_world_job adapter_response objects through ElfService, worker indexing, search_raw, and live capture/write-policy ingestion before scoring; the full sweep includes typed wrong_result, blocked, and not_encoded job records.",
"command": "cargo make real-world-memory-live-adapters",
"artifact": "tmp/real-world-memory/live-adapters/elf-report.json"
},
"result": {
"status": "wrong_result",
-"evidence": "The fresh full live sweep scores 38 jobs across all 11 encoded suites: 18 pass, 5 wrong_result, 0 incomplete, 2 blocked, and 13 not_encoded. This is not a full-suite live pass.",
+"evidence": "The fresh full live sweep scores 40 jobs across all 11 encoded suites: 22 pass, 5 wrong_result, 0 incomplete, 2 blocked, and 11 not_encoded. This is not a full-suite live pass.",
"command": "cargo make real-world-memory-live-adapters",
"artifact": "tmp/real-world-memory/live-adapters/elf-report.md"
},
@@ -175,7 +175,7 @@
{
"capability": "full_suite_live_sweep",
"status": "wrong_result",
-"evidence": "The runner now emits per-job and per-suite live records for all 38 encoded jobs, but memory_evolution is wrong_result and several non-answer-generation suites remain typed non-pass."
+"evidence": "The runner now emits per-job and per-suite live records for all 40 encoded jobs, but memory_evolution is wrong_result and several non-answer-generation suites remain typed non-pass."
},
{
"capability": "full_suite_live_pass",
@@ -231,8 +231,8 @@
},
{
"suite_id": "capture_integration",
-"status": "not_encoded",
-"evidence": "The live adapter sweep does not exercise capture integrations or write-policy redaction boundaries."
+"status": "pass",
+"evidence": "The live adapter passes 4/4 capture_integration jobs through Docker-local ELF ingestion, including capture-boundary classification, excluded evidence ids, source ids in source_ref, write_policy redaction audit counts, evidence binding, and zero secret leakage."
},
{
"suite_id": "production_ops",
@@ -245,6 +245,18 @@
"evidence": "The live adapter retrieved the scoped preference evidence and passed the personalization job."
}
],
+"scenarios": [
+{
+"scenario_id": "live_capture_write_policy",
+"suite_id": "capture_integration",
+"status": "pass",
+"elf_position": "ties",
+"comparison_outcome": "tie",
+"evidence": "ELF live capture/write-policy jobs pass for redaction, exclusions, source ids, evidence binding, and no secret leakage. This is an ELF self-check, not a win over external hook systems.",
+"command": "cargo make real-world-memory-live-adapters",
+"artifact": "tmp/real-world-memory/live-adapters/elf-materialization.json"
+}
+],
"evidence": [
{
"kind": "fixture_dir",
@@ -359,13 +371,13 @@
},
"run": {
"status": "wrong_result",
-"evidence": "qmd materializes 38 real_world_job adapter_response objects through collection add, update, embed, and query --json before scoring; the full sweep includes typed wrong_result, blocked, and not_encoded job records.",
+"evidence": "qmd materializes 40 real_world_job adapter_response objects through collection add, update, embed, and query --json before scoring; the full sweep includes typed wrong_result, blocked, and not_encoded job records.",
"command": "cargo make real-world-memory-live-adapters",
"artifact": "tmp/real-world-memory/live-adapters/qmd-report.json"
},
"result": {
"status": "wrong_result",
-"evidence": "The fresh full qmd live sweep scores 38 jobs across all 11 encoded suites: 17 pass, 6 wrong_result, 0 incomplete, 2 blocked, and 13 not_encoded. This is not a full-suite live pass.",
+"evidence": "The fresh full qmd live sweep scores 40 jobs across all 11 encoded suites: 17 pass, 6 wrong_result, 0 incomplete, 2 blocked, and 15 not_encoded. This is not a full-suite live pass.",
"command": "cargo make real-world-memory-live-adapters",
"artifact": "tmp/real-world-memory/live-adapters/qmd-report.md"
},
@@ -388,7 +400,7 @@
{
"capability": "full_suite_live_sweep",
"status": "wrong_result",
-"evidence": "The runner now emits per-job and per-suite live records for all 38 encoded jobs, but memory_evolution is wrong_result and several non-answer-generation suites remain typed non-pass."
+"evidence": "The runner now emits per-job and per-suite live records for all 40 encoded jobs, but memory_evolution is wrong_result and several non-answer-generation suites remain typed non-pass."
},
{
"capability": "full_suite_live_pass",
@@ -445,7 +457,7 @@
{
"suite_id": "capture_integration",
"status": "not_encoded",
-"evidence": "The qmd live adapter sweep does not exercise capture integrations or write-policy redaction boundaries."
+"evidence": "The qmd live adapter sweep does not exercise capture integrations or write-policy redaction boundaries; all capture_integration jobs remain typed not_encoded for qmd."
},
{
"suite_id": "production_ops",
@@ -838,6 +850,15 @@
"elf_position": "untested",
"evidence": "agentmemory's relevant strength is durable coding-agent continuity and capture, but the Docker harness has not proven a persistent session/capture path. Keep work_resume and capture claims blocked until a durable local adapter path exists.",
"artifact": "apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json"
+},
+{
+"scenario_id": "capture_write_policy_hooks",
+"suite_id": "capture_integration",
+"status": "blocked",
+"elf_position": "untested",
+"comparison_outcome": "blocked",
+"evidence": "agentmemory capture breadth is blocked for comparison because the current Docker baseline uses a process-local StateKV Map and in-memory index; no durable local session/capture path stores source ids, exclusions, write-policy audit, or evidence-bound capture output.",
+"artifact": "apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json"
}
],
"evidence": [
@@ -1353,7 +1374,7 @@
"suite_id": "capture_integration",
"status": "not_encoded",
"elf_position": "untested",
-"evidence": "The Docker baseline uses repository classes only. claude-mem hooks, viewer, timeline, and observation workflows are not executed by the runner.",
+"evidence": "The Docker baseline uses repository classes only. claude-mem hooks, timeline, observations, viewer capture, and automatic capture review workflows are not executed by the runner, so capture breadth remains untested rather than an ELF win/loss.",
"artifact": "apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json"
}
],
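
Review note: the scenario records in the manifest hunks above carry `status`, `elf_position`, and `comparison_outcome` fields, and their evidence strings repeatedly say that a tie, a blocked comparison, or an untested path must not be read as an ELF win or loss. A minimal sketch of that gate follows. It is an assumption about how a reader might apply the fields, not code from the runner: the function name and the `"win"`/`"loss"` outcome values are hypothetical and do not appear in this diff.

```python
# Scenario records abridged from the manifest hunks in this diff.
# The third record's scenario_id is not visible in the hunk, so it is left out.
SCENARIOS = [
    {"scenario_id": "live_capture_write_policy", "suite_id": "capture_integration",
     "status": "pass", "elf_position": "ties", "comparison_outcome": "tie"},
    {"scenario_id": "capture_write_policy_hooks", "suite_id": "capture_integration",
     "status": "blocked", "elf_position": "untested", "comparison_outcome": "blocked"},
    {"suite_id": "capture_integration", "status": "not_encoded",
     "elf_position": "untested"},
]

# Assumption: only a directional outcome could back a superiority claim.
# These two values are hypothetical; the diff only shows "tie" and "blocked".
DIRECTIONAL_OUTCOMES = {"win", "loss"}


def supports_directional_claim(scenario):
    """True only for a passing scenario whose outcome is directional."""
    return (
        scenario.get("status") == "pass"
        and scenario.get("comparison_outcome") in DIRECTIONAL_OUTCOMES
    )


claimable = [s for s in SCENARIOS if supports_directional_claim(s)]

if __name__ == "__main__":
    print(f"scenarios supporting a directional claim: {len(claimable)}")
```

Under these assumptions none of the three capture scenarios supports a directional claim, which matches the "no broad capture-breadth superiority claim" wording in the README hunk.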
Original file line number Diff line number Diff line change
@@ -6,11 +6,34 @@
"corpus": {
"corpus_id": "real-world-memory-capture-2026-06-09",
"profile": "synthetic",
+"capture_behaviors": {
+"real": [
+"ELF live add_note capture can persist public evidence with source ids and skip excluded evidence ids through the Docker live adapter."
+],
+"fixture_backed": [
+"The fixture encodes public capture, write-policy audit evidence, and a private excluded span as a negative trap."
+],
+"blocked": [
+"agentmemory hook breadth remains blocked until a durable local session/capture adapter replaces the in-memory mock.",
+"claude-mem hook and viewer capture remain not encoded because the Docker baseline does not execute hooks, timeline, or viewer workflows."
+],
+"not_encoded": [
+"Host-global capture hooks, Slack/browser capture, and credentialed tool capture are outside this Docker-scoped benchmark."
+],
+"notes": [
+"Live ELF scoring must not store or retrieve the private-excluded-text evidence id."
+]
+},
"items": [
{
"evidence_id": "public-captured-decision",
"kind": "message",
"text": "Public memory candidate: benchmark reports should include evidence coverage, unsupported claim count, stale retrieval count, scope correctness, and redaction leak count.",
+"capture": {
+"action": "store",
+"source_id": "capture:linear-comment-933",
+"evidence_binding": "source_ref"
+},
"source_ref": {
"schema": "source_ref/v1",
"resolver": "real_world_job_fixture/v1",
@@ -28,6 +51,11 @@
"evidence_id": "write-policy-audit",
"kind": "trace",
"text": "Write policy audit: private spans were excluded before evidence binding, and the stored note only retained the public benchmark-report requirement.",
+"capture": {
+"action": "store",
+"source_id": "capture:write-policy-audit-933",
+"evidence_binding": "source_ref"
+},
"source_ref": {
"schema": "source_ref/v1",
"resolver": "real_world_job_fixture/v1",
@@ -45,6 +73,11 @@
"evidence_id": "private-excluded-text",
"kind": "message",
"text": "Private excluded text: operator's home server label is cobalt-ridge.",
+"capture": {
+"action": "exclude",
+"source_id": "capture:excluded-private-span-933",
+"evidence_binding": "negative_trap"
+},
"source_ref": {
"schema": "source_ref/v1",
"resolver": "real_world_job_fixture/v1",
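
Review note: the new `capture` blocks in this fixture give each item an `action` of `store` or `exclude`, and the corpus notes say live ELF scoring must not store or retrieve the `private-excluded-text` evidence id. The sketch below shows one way a scorer could split items by action and flag a leak. It is an illustrative assumption: `partition_by_capture_action` and `leaked_ids` are hypothetical helpers, not functions from `elf-eval`, and the item records are abridged from the hunks above.

```python
# Fixture items abridged from this diff (only the fields the sketch needs).
ITEMS = [
    {"evidence_id": "public-captured-decision",
     "capture": {"action": "store", "source_id": "capture:linear-comment-933",
                 "evidence_binding": "source_ref"}},
    {"evidence_id": "write-policy-audit",
     "capture": {"action": "store", "source_id": "capture:write-policy-audit-933",
                 "evidence_binding": "source_ref"}},
    {"evidence_id": "private-excluded-text",
     "capture": {"action": "exclude", "source_id": "capture:excluded-private-span-933",
                 "evidence_binding": "negative_trap"}},
]


def partition_by_capture_action(items):
    """Split evidence ids into (stored, excluded) lists, preserving fixture order."""
    stored, excluded = [], []
    for item in items:
        action = item["capture"]["action"]
        if action == "store":
            stored.append(item["evidence_id"])
        elif action == "exclude":
            excluded.append(item["evidence_id"])
        else:
            # Only the two actions seen in the fixture are handled here.
            raise ValueError(f"unknown capture action: {action!r}")
    return stored, excluded


def leaked_ids(retrieved_ids, excluded_ids):
    """Return excluded evidence ids that nevertheless appear in a retrieval result."""
    return sorted(set(retrieved_ids) & set(excluded_ids))


stored, excluded = partition_by_capture_action(ITEMS)

if __name__ == "__main__":
    print("stored:", stored)
    print("excluded:", excluded)
    print("leaks if only stored ids are retrieved:", leaked_ids(stored, excluded))
```

A non-empty result from `leaked_ids` would correspond to the negative trap firing, that is, the excluded private span showing up in what the adapter stored or returned.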