From c8d6d33bc01e7cf2a3fee579170c6325bf02fb30 Mon Sep 17 00:00:00 2001 From: Yvette Carlisle Date: Thu, 11 Jun 2026 17:06:47 +0800 Subject: [PATCH] {"schema":"decodex/commit/1","summary":"Publish competitor-strength adoption report","authority":"XY-901"} --- ...-11-competitor-strength-adoption-report.md | 131 +++++++ docs/guide/benchmarking/index.md | 4 + ...1-competitor-strength-adoption-report.json | 354 ++++++++++++++++++ 3 files changed, 489 insertions(+) create mode 100644 docs/guide/benchmarking/2026-06-11-competitor-strength-adoption-report.md create mode 100644 docs/research/2026-06-11-competitor-strength-adoption-report.json diff --git a/docs/guide/benchmarking/2026-06-11-competitor-strength-adoption-report.md b/docs/guide/benchmarking/2026-06-11-competitor-strength-adoption-report.md new file mode 100644 index 00000000..e46ba1f7 --- /dev/null +++ b/docs/guide/benchmarking/2026-06-11-competitor-strength-adoption-report.md @@ -0,0 +1,131 @@ +# Competitor-Strength Adoption Report - June 11, 2026 + +Goal: Publish the final benchmark vNext adoption decision and scenario matrix for +ELF against tracked open-source memory, RAG, graph, and agent-continuity projects. +Read this when: You need the current production-adoption answer, the scenario-level +win/tie/loss/not-tested matrix, or the optimization queue behind future ELF work. +Inputs: `2026-06-11-measurement-coverage-audit.md`, +`2026-06-11-first-generation-oss-adapter-promotion-report.md`, +`2026-06-11-qmd-openviking-strength-profile-report.md`, +`2026-06-11-temporal-history-competitor-gap-report.md`, +`2026-06-11-graph-rag-scored-smoke-adapter-report.md`, and +`2026-06-10-production-adoption-refresh.md`. +Depends on: `docs/spec/real_world_agent_memory_benchmark_v1.md` and the current +external adapter manifest. 
+Outputs: Adoption decision, evidence-class boundaries, scenario matrix, follow-up +optimization queue, and the machine-readable companion file +`docs/research/2026-06-11-competitor-strength-adoption-report.json`. + +## Adoption Decision + +ELF is adoptable for bounded personal production use. + +The verdict is `adopt_with_bounded_caveats`, not broad competitor superiority. The +supporting evidence is strongest where ELF was designed to be strong: source-of-truth +discipline, evidence-bound writes, rebuildable Qdrant derivations, backup/restore, +backfill, and typed benchmark reporting. Those properties are stronger than the +measured alternatives in the current evidence set. + +The remaining caveats are material: + +- Full-suite live real-world pass parity is not proven. +- Live temporal reconciliation is still a measured loss: five of six + `memory_evolution` jobs are `wrong_result`. +- Private-corpus production quality is blocked until an operator-owned manifest + exists. +- Credentialed provider production-ops gates are blocked until explicit provider + setup exists. +- Several competitor strengths remain `not_tested`: qmd replay/debug UX, + mem0/OpenMemory history/UI, OpenViking trajectory, Letta core-vs-archival memory, + and graph/RAG navigation. + +## Evidence Classes + +This report keeps evidence classes separate. Do not convert fixture passes, +same-corpus smokes, research gates, blocked setup, unsupported shapes, wrong +results, or lifecycle failures into one aggregate leaderboard. + +| Evidence class | Meaning | +| --- | --- | +| `fixture_backed` | Checked-in real-world fixtures pass through the benchmark runner. | +| `live_baseline_only` | Docker same-corpus or lifecycle checks ran, but not full real-world jobs. | +| `live_real_world` | A runtime or CLI adapter produced scored real-world job records. | +| `smoke_only` | A tiny setup or output-shape smoke ran. | +| `research_gate` | Source/setup/resource/output-contract evidence exists only as research. 
| +| `blocked` | A credential, private input, provider, or setup boundary is missing. | +| `unsupported` | The project shape is not comparable for the scenario. | +| `not_encoded` | The benchmark does not yet cover the scenario. | +| `wrong_result` | The system ran but produced the wrong memory answer or evidence. | +| `lifecycle_fail` | Update/delete/reload/persistence behavior failed. | + +## Source Artifacts + +| Command or run | Artifact | Supported claim | +| --- | --- | --- | +| `cargo make real-world-memory` | `2026-06-11-measurement-coverage-audit.md` | ELF fixture aggregate covers 38 jobs across 11 suites with 36 pass and 2 blocked production-ops operator boundaries. | +| `cargo make real-world-memory-live-adapters` | `2026-06-11-measurement-coverage-audit.md` | ELF live service adapter reports 18 pass, 5 wrong_result, 2 blocked, and 13 not_encoded jobs; qmd reports 17 pass, 6 wrong_result, 2 blocked, and 13 not_encoded jobs. | +| `ELF_BASELINE_PROJECTS=ELF,agentmemory,mem0,memsearch,claude-mem cargo make baseline-live-docker` | `2026-06-11-first-generation-oss-adapter-promotion-report.md` | mem0/OpenMemory and memsearch pass basic local baseline smokes; agentmemory remains lifecycle_fail and claude-mem remains wrong_result. | +| `ELF_GRAPHITI_ZEP_SMOKE_START=1 ELF_GRAPHITI_ZEP_SMOKE_RUN=1 cargo make graphiti-zep-docker-temporal-smoke` | `2026-06-11-temporal-history-competitor-gap-report.md` | Graphiti/Zep temporal smoke remains blocked by `provider_api_key_missing`. | +| `cargo make graphify-docker-graph-report-smoke` | `2026-06-11-graph-rag-scored-smoke-adapter-report.md` | graphify reaches tiny Docker graph/report scoring but remains wrong_result. 
| +| `cargo make baseline-production-synthetic`, `cargo make baseline-backfill-docker`, backup/restore, Qdrant rebuild proof | `2026-06-10-production-adoption-refresh.md` | ELF has provider synthetic, stress, backfill, restore, and rebuild evidence; private-corpus proof is blocked by missing operator-owned manifest. | + +## Scenario Matrix + +| Scenario | ELF outcome | Evidence classes | Measured claim | Follow-up | +| --- | --- | --- | --- | --- | +| Source-of-truth rebuild and evidence-bound writes | `win` | `fixture_backed`, `live_real_world`, `live_baseline_only` | ELF has the strongest measured source-of-truth and rebuild story: Postgres is authoritative, Qdrant is rebuildable, trust-source jobs pass, and production restore/rebuild proof exists. | None | +| Work resume and coding-agent continuity | `tie` | `fixture_backed`, `live_real_world`, `live_baseline_only`, `blocked`, `not_encoded` | ELF and qmd both pass encoded live `work_resume` jobs; agentmemory, claude-mem, and OpenViking continuity strengths remain blocked or not encoded. | XY-925, XY-928 | +| Project decisions and reversals | `tie` | `fixture_backed`, `live_real_world`, `research_gate`, `not_encoded` | ELF and qmd both pass encoded `project_decisions` jobs; Letta-style core/archival decision memory is not tested. | XY-927 | +| Retrieval quality | `tie` | `fixture_backed`, `live_real_world`, `live_baseline_only` | ELF and qmd both pass encoded live retrieval and stress/same-corpus retrieval evidence. | XY-923 | +| Retrieval quality and local debug UX | `not_tested` | `live_baseline_only`, `research_gate`, `not_encoded` | qmd remains the local retrieval-debug UX reference, but no scored rule compares qmd top-10/replay artifacts with ELF trace/admin bundle surfaces. 
| XY-923 | +| Memory evolution and temporal history | `loss` | `fixture_backed`, `live_real_world`, `wrong_result`, `blocked` | ELF fixture memory evolution passes, but live ELF passes only delete/TTL and reports five wrong_result jobs where current-vs-historical state is not reconciled. | XY-905 | +| Consolidation/proposal review | `not_tested` | `fixture_backed`, `not_encoded` | ELF fixture consolidation passes, but live consolidation proposal generation and review-action scoring are not encoded. | XY-926 | +| Knowledge page compilation | `not_tested` | `fixture_backed`, `live_real_world`, `wrong_result`, `research_gate`, `not_encoded` | ELF fixture knowledge pages pass, but live knowledge compilation is not encoded; graphify reaches a tiny scored smoke and remains wrong_result. | XY-926, XY-929 | +| Operator debugging/viewer UX | `not_tested` | `fixture_backed`, `not_encoded`, `research_gate` | ELF fixture operator-debugging UX passes, but live trace/viewer scoring and qmd/OpenMemory/claude-mem UX comparisons are unscored. | XY-923, XY-926 | +| Capture/write policy and redaction | `not_tested` | `fixture_backed`, `live_baseline_only`, `blocked`, `not_encoded` | ELF fixture capture/write-policy jobs pass, but live capture integration and agentmemory/claude-mem capture hooks are not comparable yet. | XY-925, XY-926 | +| Production ops, restore, backfill, and rebuild | `win` | `live_baseline_only`, `blocked` | ELF has the strongest measured local production-operation story: provider synthetic, stress, resumable backfill, backup/restore, and Qdrant rebuild evidence. | XY-930 | +| Private corpus and provider boundaries | `blocked` | `blocked` | Private production profile fails closed without an operator-owned manifest; provider-backed production-ops gates require explicit credentials. 
| XY-930 | +| Personalization and scoped preferences | `tie` | `fixture_backed`, `live_real_world`, `not_encoded` | ELF and qmd both pass the single encoded live personalization job; mem0/OpenMemory and Letta personalization/history are not encoded. | XY-924, XY-927 | +| Context trajectory and hierarchical retrieval | `not_tested` | `live_baseline_only`, `research_gate`, `wrong_result`, `not_encoded` | OpenViking reaches the pinned Docker local embedding path but misses expected same-corpus evidence; staged trajectory/hierarchy scoring is not encoded. | XY-928 | +| Core-vs-archival memory | `not_tested` | `research_gate`, `not_encoded` | ELF has core block semantics in the service contract, but comparable core-vs-archival jobs and a contained Letta export path are not encoded. | XY-927 | +| Graph/RAG navigation and citations | `not_tested` | `smoke_only`, `research_gate`, `blocked`, `wrong_result`, `not_encoded` | Graph/RAG smokes produce scored or typed non-pass adapter reports where possible, but broad graph/RAG navigation and citation quality are not tested. | XY-929 | + +## Follow-Up Queue + +| Issue | Priority | State | Gap | +| --- | --- | --- | --- | +| XY-905 | P0 | Backlog | Live temporal reconciliation answer and trace contract. | +| XY-923 | P0 | Backlog | qmd trace-level replay and wrong-result diagnostics. | +| XY-924 | P0 | Backlog | mem0/OpenMemory history and UI-export comparison. | +| XY-925 | P1 | Backlog | First-generation OSS continuity and source-store adapters. | +| XY-926 | P1 | Backlog | Live operator-debugging, capture, consolidation, and knowledge-page suites. | +| XY-927 | P1 | Backlog | Letta-style core-vs-archival memory comparison. | +| XY-928 | P1 | Backlog | OpenViking context-trajectory and hierarchy benchmark. | +| XY-929 | P2 | Backlog | Graph/RAG adapters beyond scored smokes. | +| XY-930 | P1 | Backlog | Private-corpus and credentialed production gates after operator inputs exist. 
| +| XY-906 | Ops | Todo | Decodex registered-project review-config schema drift blocks Decodex loading of ELF. | + +## Allowed Claims + +- ELF is adoptable for bounded personal production use with caveats. +- ELF has the strongest measured source-of-truth, rebuild, restore, and backfill + evidence among the tracked systems. +- ELF ties qmd on encoded live retrieval, work-resume, project-decisions, and + personalization slices. +- ELF has a live temporal reconciliation loss against the benchmark expectation: + five memory-evolution jobs remain `wrong_result`. +- Most competitor strengths outside qmd retrieval are `not_tested`, `blocked`, + `smoke_only`, or `research_gate`. + +## Claims Not Allowed + +- Do not claim ELF broadly beats qmd. +- Do not claim ELF beats mem0/OpenMemory on history, UI/export, hosted behavior, or + graph memory. +- Do not claim ELF beats OpenViking on staged context trajectory. +- Do not claim ELF beats Letta on core-vs-archival memory. +- Do not claim graph/RAG parity from smoke-only evidence. +- Do not promote `fixture_backed`, `live_baseline_only`, `smoke_only`, + `research_gate`, `blocked`, `wrong_result`, `lifecycle_fail`, `unsupported`, or + `not_encoded` states into a generic pass/fail score. + diff --git a/docs/guide/benchmarking/index.md b/docs/guide/benchmarking/index.md index b6ab2b53..b462818e 100644 --- a/docs/guide/benchmarking/index.md +++ b/docs/guide/benchmarking/index.md @@ -84,6 +84,10 @@ cleanup, use `docs/guide/single_user_production.md`. Graphiti/Zep, and graphify smoke contracts into scored or typed non-pass `real_world_job` adapter reports without converting smoke evidence into quality claims. +- `2026-06-11-competitor-strength-adoption-report.md`: XY-901 final + competitor-strength adoption report with the bounded personal-production decision, + scenario-level win/tie/loss/not-tested matrix, claim boundaries, and optimization + issue queue. 
- `real_world_agent_memory_benchmark.md`: operator overview for the v1 real-world agent memory benchmark contract, including suite taxonomy, typed report states, knowledge-compilation fixture tasks, and the production-ops fixture target. diff --git a/docs/research/2026-06-11-competitor-strength-adoption-report.json b/docs/research/2026-06-11-competitor-strength-adoption-report.json new file mode 100644 index 00000000..e9fbb3e6 --- /dev/null +++ b/docs/research/2026-06-11-competitor-strength-adoption-report.json @@ -0,0 +1,354 @@ +{ + "schema": "elf.competitor_strength_adoption_report/v1", + "report_id": "xy-901-competitor-strength-adoption-report-2026-06-11", + "authority": "XY-901", + "created_at": "2026-06-11T00:00:00Z", + "adoption_decision": { + "personal_production_adoptable": true, + "verdict": "adopt_with_bounded_caveats", + "summary": "ELF is currently adoptable for bounded personal production use because source-of-truth, evidence-bound writes, rebuild/backfill/restore, and typed benchmark evidence are stronger than the measured alternatives. It is not a broad competitor-superiority claim.", + "remaining_caveats": [ + "Full-suite live real-world pass parity is not proven.", + "Live temporal reconciliation remains wrong_result for five of six memory_evolution jobs.", + "Private-corpus production quality is blocked until an operator-owned manifest exists.", + "Credentialed provider production-ops gates are blocked until explicit provider setup exists.", + "Several competitor strengths remain not_tested: qmd replay/debug UX, mem0/OpenMemory history/UI, OpenViking trajectory, Letta core-vs-archival memory, and graph/RAG navigation." 
+ ] + }, + "evidence_class_terms": [ + "fixture_backed", + "live_baseline_only", + "live_real_world", + "smoke_only", + "research_gate", + "blocked", + "unsupported", + "not_encoded", + "wrong_result", + "lifecycle_fail" + ], + "outcome_terms": [ + "win", + "tie", + "loss", + "not_tested", + "blocked", + "non_goal" + ], + "source_artifacts": [ + { + "command": "cargo make real-world-memory", + "artifact": "docs/guide/benchmarking/2026-06-11-measurement-coverage-audit.md", + "claim": "ELF fixture aggregate covers 38 jobs across 11 suites with 36 pass and 2 blocked production-ops operator boundaries." + }, + { + "command": "cargo make real-world-memory-live-adapters", + "artifact": "docs/guide/benchmarking/2026-06-11-measurement-coverage-audit.md", + "claim": "ELF live service adapter reports 18 pass, 5 wrong_result, 2 blocked, and 13 not_encoded jobs; qmd reports 17 pass, 6 wrong_result, 2 blocked, and 13 not_encoded jobs." + }, + { + "command": "ELF_BASELINE_PROJECTS=ELF,agentmemory,mem0,memsearch,claude-mem cargo make baseline-live-docker", + "artifact": "docs/guide/benchmarking/2026-06-11-first-generation-oss-adapter-promotion-report.md", + "claim": "mem0/OpenMemory and memsearch pass basic local baseline smokes; agentmemory remains lifecycle_fail and claude-mem remains wrong_result on same-corpus retrieval." + }, + { + "command": "ELF_GRAPHITI_ZEP_SMOKE_START=1 ELF_GRAPHITI_ZEP_SMOKE_RUN=1 cargo make graphiti-zep-docker-temporal-smoke", + "artifact": "docs/guide/benchmarking/2026-06-11-temporal-history-competitor-gap-report.md", + "claim": "Graphiti/Zep temporal smoke remains blocked by provider_api_key_missing when live provider execution is explicitly enabled without credentials." 
+ }, + { + "command": "cargo make graphify-docker-graph-report-smoke", + "artifact": "docs/guide/benchmarking/2026-06-11-graph-rag-scored-smoke-adapter-report.md", + "claim": "graphify reaches tiny Docker graph/report scoring but remains wrong_result; broad graph/RAG quality is not tested." + }, + { + "command": "cargo make baseline-production-synthetic, cargo make baseline-backfill-docker, backup/restore plus Qdrant rebuild proof", + "artifact": "docs/guide/benchmarking/2026-06-10-production-adoption-refresh.md", + "claim": "ELF has provider synthetic, stress, backfill, restore, and rebuild evidence, while private-corpus proof remains blocked by missing operator-owned manifest." + } + ], + "scenario_outcomes": [ + { + "scenario_id": "source_of_truth_rebuild_evidence_writes", + "title": "Source-of-truth rebuild and evidence-bound writes", + "outcome": "win", + "evidence_classes": ["fixture_backed", "live_real_world", "live_baseline_only"], + "measured_claim": "ELF has the strongest measured source-of-truth and rebuild story: Postgres is authoritative, Qdrant is rebuildable, trust_source_of_truth passes in fixture and live sweeps, and production restore/rebuild proof exists.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-11-measurement-coverage-audit.md", + "docs/guide/benchmarking/2026-06-10-production-adoption-refresh.md" + ], + "follow_up_issues": [], + "caveat": "memsearch canonical Markdown reindex/reload is a useful ergonomics reference, but real-world source-of-truth prompts are not encoded." + }, + { + "scenario_id": "work_resume_coding_agent_continuity", + "title": "Work resume and coding-agent continuity", + "outcome": "tie", + "evidence_classes": ["fixture_backed", "live_real_world", "live_baseline_only", "blocked", "not_encoded"], + "measured_claim": "ELF and qmd both pass the encoded live work_resume jobs. 
agentmemory, claude-mem, and OpenViking continuity strengths remain blocked or not encoded.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-11-measurement-coverage-audit.md", + "docs/guide/benchmarking/2026-06-11-first-generation-oss-adapter-promotion-report.md" + ], + "follow_up_issues": ["XY-925", "XY-928"], + "caveat": "The tie is only for encoded live work_resume behavior, not for broad capture hooks or staged context." + }, + { + "scenario_id": "project_decisions_reversals", + "title": "Project decisions and reversals", + "outcome": "tie", + "evidence_classes": ["fixture_backed", "live_real_world", "research_gate", "not_encoded"], + "measured_claim": "ELF and qmd both pass encoded project_decisions jobs. Letta-style core/archival decision memory is not tested.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-11-measurement-coverage-audit.md" + ], + "follow_up_issues": ["XY-927"], + "caveat": "No Letta comparison exists until a contained export path is selected." + }, + { + "scenario_id": "retrieval_quality", + "title": "Retrieval quality", + "outcome": "tie", + "evidence_classes": ["fixture_backed", "live_real_world", "live_baseline_only"], + "measured_claim": "ELF and qmd both pass the encoded live retrieval suite and both pass stress/same-corpus retrieval evidence.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-11-qmd-openviking-strength-profile-report.md", + "docs/guide/benchmarking/2026-06-11-measurement-coverage-audit.md" + ], + "follow_up_issues": ["XY-923"], + "caveat": "Retrieval correctness is separate from debug/replay ergonomics." 
+ }, + { + "scenario_id": "local_debug_replay_ux", + "title": "Retrieval quality and local debug UX", + "outcome": "not_tested", + "evidence_classes": ["live_baseline_only", "research_gate", "not_encoded"], + "measured_claim": "qmd remains the local retrieval-debug UX reference, but no scored rule compares qmd top-10/replay artifacts with ELF trace/admin bundle surfaces.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-11-qmd-openviking-strength-profile-report.md", + "docs/guide/benchmarking/2026-06-11-elf-qmd-retrieval-debug-profile.md" + ], + "follow_up_issues": ["XY-923"], + "caveat": "No ELF loss is claimed until comparable replay and candidate-diagnosis evidence is scored." + }, + { + "scenario_id": "memory_evolution_temporal_history", + "title": "Memory evolution and temporal history", + "outcome": "loss", + "evidence_classes": ["fixture_backed", "live_real_world", "wrong_result", "blocked"], + "measured_claim": "ELF fixture memory_evolution passes, but live ELF passes only the delete/TTL job and reports five wrong_result jobs where evidence is retrieved but current-vs-historical state is not reconciled.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-11-temporal-history-competitor-gap-report.md", + "docs/research/2026-06-11-temporal-history-competitor-gap-report.json" + ], + "follow_up_issues": ["XY-905"], + "caveat": "Graphiti/Zep remains a temporal-validity reference, but its local provider-backed smoke is blocked by provider_api_key_missing." 
+ }, + { + "scenario_id": "consolidation_proposal_review", + "title": "Consolidation/proposal review", + "outcome": "not_tested", + "evidence_classes": ["fixture_backed", "not_encoded"], + "measured_claim": "ELF fixture consolidation passes, but live consolidation proposal generation and review-action scoring are not encoded.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-11-measurement-coverage-audit.md" + ], + "follow_up_issues": ["XY-926"], + "caveat": "Fixture evidence cannot be promoted into live proposal-quality proof." + }, + { + "scenario_id": "knowledge_page_compilation", + "title": "Knowledge page compilation", + "outcome": "not_tested", + "evidence_classes": ["fixture_backed", "live_real_world", "wrong_result", "research_gate", "not_encoded"], + "measured_claim": "ELF fixture knowledge pages pass, but live knowledge compilation is not encoded. graphify reaches a tiny scored smoke and remains wrong_result.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-11-measurement-coverage-audit.md", + "docs/guide/benchmarking/2026-06-11-graph-rag-scored-smoke-adapter-report.md" + ], + "follow_up_issues": ["XY-926", "XY-929"], + "caveat": "llm-wiki, gbrain, GraphRAG, and graphify remain references until representative citation/lint jobs are scored." + }, + { + "scenario_id": "operator_debugging_viewer_ux", + "title": "Operator debugging/viewer UX", + "outcome": "not_tested", + "evidence_classes": ["fixture_backed", "not_encoded", "research_gate"], + "measured_claim": "ELF fixture operator-debugging UX passes, but live trace/viewer scoring is not encoded and qmd/OpenMemory/claude-mem UX comparisons are unscored.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-11-measurement-coverage-audit.md", + "docs/guide/benchmarking/2026-06-11-qmd-openviking-strength-profile-report.md" + ], + "follow_up_issues": ["XY-923", "XY-926"], + "caveat": "No raw-SQL-avoidance or repair-action live benchmark exists yet." 
+ }, + { + "scenario_id": "capture_write_policy_redaction", + "title": "Capture/write policy and redaction", + "outcome": "not_tested", + "evidence_classes": ["fixture_backed", "live_baseline_only", "blocked", "not_encoded"], + "measured_claim": "ELF fixture capture/write-policy jobs pass, but live capture integration remains not encoded and agentmemory/claude-mem capture hooks are not comparable yet.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-11-measurement-coverage-audit.md", + "docs/guide/benchmarking/2026-06-11-first-generation-oss-adapter-promotion-report.md" + ], + "follow_up_issues": ["XY-925", "XY-926"], + "caveat": "Future evidence must prove redaction, exclusions, evidence binding, and no secret leakage." + }, + { + "scenario_id": "production_ops_restore_backfill", + "title": "Production ops, restore, backfill, and rebuild", + "outcome": "win", + "evidence_classes": ["live_baseline_only", "blocked"], + "measured_claim": "ELF has the strongest measured local production-operation story: provider synthetic, stress, resumable backfill, backup/restore, and Qdrant rebuild evidence are checked in.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md", + "docs/guide/benchmarking/2026-06-10-production-adoption-refresh.md" + ], + "follow_up_issues": ["XY-930"], + "caveat": "Private-corpus and credentialed provider gates remain blocked, so this is not private production quality proof." 
+ }, + { + "scenario_id": "private_corpus_provider_boundaries", + "title": "Private corpus and provider boundaries", + "outcome": "blocked", + "evidence_classes": ["blocked"], + "measured_claim": "The private production profile fails closed without an operator-owned manifest, and provider-backed production-ops gates require explicit credentials.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md", + "docs/guide/benchmarking/2026-06-10-production-adoption-refresh.md" + ], + "follow_up_issues": ["XY-930"], + "caveat": "The blocker is an input boundary, not a hidden benchmark pass or loss." + }, + { + "scenario_id": "personalization_scoped_preferences", + "title": "Personalization and scoped preferences", + "outcome": "tie", + "evidence_classes": ["fixture_backed", "live_real_world", "not_encoded"], + "measured_claim": "ELF and qmd both pass the single encoded live personalization job. mem0/OpenMemory and Letta personalization/history are not encoded.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-11-measurement-coverage-audit.md" + ], + "follow_up_issues": ["XY-924", "XY-927"], + "caveat": "The tie does not prove entity history, UI readback, or long-term preference evolution." + }, + { + "scenario_id": "context_trajectory_hierarchical_retrieval", + "title": "Context trajectory and hierarchical retrieval", + "outcome": "not_tested", + "evidence_classes": ["live_baseline_only", "research_gate", "wrong_result", "not_encoded"], + "measured_claim": "OpenViking reaches the pinned Docker local embedding path but misses expected same-corpus evidence, and staged trajectory/hierarchy scoring is not encoded.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-11-qmd-openviking-strength-profile-report.md" + ], + "follow_up_issues": ["XY-928"], + "caveat": "ELF only has a narrow precondition win over OpenViking, not a trajectory win." 
+ }, + { + "scenario_id": "core_vs_archival_memory", + "title": "Core-vs-archival memory", + "outcome": "not_tested", + "evidence_classes": ["research_gate", "not_encoded"], + "measured_claim": "ELF has core block semantics in the service contract, but comparable core-vs-archival benchmark jobs and a contained Letta export path are not encoded.", + "command_artifacts": [ + "docs/spec/system_elf_memory_service_v2.md", + "docs/guide/benchmarking/2026-06-11-temporal-history-competitor-gap-report.md" + ], + "follow_up_issues": ["XY-927"], + "caveat": "No ELF-over-Letta claim is allowed." + }, + { + "scenario_id": "graph_rag_navigation_citations", + "title": "Graph/RAG navigation and citations", + "outcome": "not_tested", + "evidence_classes": ["smoke_only", "research_gate", "blocked", "wrong_result", "not_encoded"], + "measured_claim": "Graph/RAG smokes now produce scored or typed non-pass adapter reports where possible, but broad graph/RAG navigation and citation quality are not tested.", + "command_artifacts": [ + "docs/guide/benchmarking/2026-06-11-graph-rag-scored-smoke-adapter-report.md" + ], + "follow_up_issues": ["XY-929"], + "caveat": "RAGFlow, LightRAG, GraphRAG, Graphiti/Zep, llm-wiki, and gbrain remain blocked, research_gate, or not_encoded; graphify only has a tiny wrong_result smoke." + } + ], + "follow_up_queue": [ + { + "issue": "XY-905", + "priority": "P0", + "state": "Backlog", + "gap": "Live temporal reconciliation answer and trace contract." + }, + { + "issue": "XY-923", + "priority": "P0", + "state": "Backlog", + "gap": "qmd trace-level replay and wrong-result diagnostics." + }, + { + "issue": "XY-924", + "priority": "P0", + "state": "Backlog", + "gap": "mem0/OpenMemory history and UI-export comparison." + }, + { + "issue": "XY-925", + "priority": "P1", + "state": "Backlog", + "gap": "First-generation OSS continuity and source-store adapters." 
+ }, + { + "issue": "XY-926", + "priority": "P1", + "state": "Backlog", + "gap": "Live operator-debugging, capture, consolidation, and knowledge-page suites." + }, + { + "issue": "XY-927", + "priority": "P1", + "state": "Backlog", + "gap": "Letta-style core-vs-archival memory comparison." + }, + { + "issue": "XY-928", + "priority": "P1", + "state": "Backlog", + "gap": "OpenViking context-trajectory and hierarchy benchmark." + }, + { + "issue": "XY-929", + "priority": "P2", + "state": "Backlog", + "gap": "Graph/RAG adapters beyond scored smokes." + }, + { + "issue": "XY-930", + "priority": "P1", + "state": "Backlog", + "gap": "Private-corpus and credentialed production gates after operator inputs exist." + }, + { + "issue": "XY-906", + "priority": "ops", + "state": "Todo", + "gap": "Decodex registered-project review-config schema drift blocks Decodex loading of elf." + } + ], + "claim_boundaries": { + "allowed": [ + "ELF is adoptable for bounded personal production use with caveats.", + "ELF has the strongest measured source-of-truth, rebuild, restore, and backfill evidence among the tracked systems.", + "ELF ties qmd on encoded live retrieval, work_resume, project_decisions, and personalization slices.", + "ELF has a live temporal reconciliation loss against the benchmark expectation: five memory_evolution jobs remain wrong_result.", + "Most competitor strengths outside qmd retrieval are not_tested, blocked, smoke_only, or research_gate." 
+ ], + "not_allowed": [ + "Do not claim ELF broadly beats qmd.", + "Do not claim ELF beats mem0/OpenMemory on history, UI/export, hosted behavior, or graph memory.", + "Do not claim ELF beats OpenViking on staged context trajectory.", + "Do not claim ELF beats Letta on core-vs-archival memory.", + "Do not claim graph/RAG parity from smoke-only evidence.", + "Do not promote fixture_backed, live_baseline_only, smoke_only, research_gate, blocked, wrong_result, lifecycle_fail, unsupported, or not_encoded states into a generic pass/fail score." + ] + } +}