Merged
14 changes: 9 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
@@ -153,12 +153,15 @@ provider-backed ELF evidence was required.
jobs across 11 suites, 36 pass, 0 incomplete, 2 blocked, 0 wrong-result,
0 not-encoded, and 0 unsupported-claim results. The remaining non-pass jobs are
production-ops operator boundaries, not hidden benchmark wins.
- Full-suite live real-world adapter sweep after XY-880: ELF and qmd now emit
- Full-suite live real-world adapter sweep after XY-899: ELF and qmd emit
Docker-isolated `live_real_world` records for all 38 encoded jobs across 11 suites
through `cargo make real-world-memory-live-adapters`. Both keep the original
targeted `work_resume`, `retrieval`, and `project_decisions` slice passing, but the
full sweep is not a full-suite pass: each adapter reports 18 pass, 5 wrong_result,
1 incomplete, 2 blocked, and 12 not_encoded jobs.
full sweep is not a full-suite pass. The fresh ELF sweep reports 18 pass,
5 wrong_result, 2 blocked, and 13 not_encoded jobs. The fresh qmd sweep reports
17 pass, 6 wrong_result, 2 blocked, and 13 not_encoded jobs. The difference is the
delete/TTL tombstone case; qmd remains the local retrieval-debug UX reference, and
no broad ELF-over-qmd claim is allowed.
- Expanded adapter-pack coverage after XY-834: the real-world external adapter
manifest now includes `research_gate` records for RAGFlow, LightRAG, GraphRAG,
Graphiti/Zep, Letta, LangGraph, nanograph, llm-wiki, gbrain, and deeper
@@ -191,6 +194,7 @@ Detailed evidence and interpretation:
- [Real-World Comparison Report - June 10, 2026](docs/guide/benchmarking/2026-06-10-real-world-comparison-report.md)
- [Live Real-World Adapter Sweep Report - June 10, 2026](docs/guide/benchmarking/2026-06-10-live-real-world-sweep-report.md)
- [Post-Adapter Production Adoption Refresh - June 10, 2026](docs/guide/benchmarking/2026-06-10-production-adoption-refresh.md)
- [qmd and OpenViking Strength-Profile Report - June 11, 2026](docs/guide/benchmarking/2026-06-11-qmd-openviking-strength-profile-report.md)
- [Graph/RAG Scored Smoke Adapter Report - June 11, 2026](docs/guide/benchmarking/2026-06-11-graph-rag-scored-smoke-adapter-report.md)
- [Live Baseline Benchmark Runbook](docs/guide/benchmarking/live_baseline_benchmark.md)
- [Single-User Production Runbook](docs/guide/single_user_production.md)
@@ -204,7 +208,7 @@ Detailed evidence and interpretation:
live sweep, but that sweep still contains typed non-pass states and is not
full-suite parity.

Evidence-backed position after the June 10 real-world report:
Evidence-backed position after the June 11 real-world reports:

- ELF is better evidenced than the tested alternatives on evidence-bound writes,
deterministic ingestion boundaries, Postgres source-of-truth plus rebuildable Qdrant
@@ -276,7 +280,7 @@ Detailed comparison, mechanism-level analysis, and source map:
- [RAG/Graph Adapter Feasibility Research Run](docs/research/2026-06-10-xy-882-rag-graph-adapter-feasibility.json)

Latest real-world benchmark report: June 11, 2026. Latest external research refresh:
June 10, 2026.
June 11, 2026.
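
The per-adapter tallies quoted in the live-sweep bullet earlier in this README hunk can be sanity-checked against the 38 encoded jobs. The following is a minimal sketch, not project code: the counts are copied from the README text, while the dictionary layout and the helper names `tally_gaps` and `status_deltas` are assumptions made for illustration.

```python
# Status tallies copied from the README text in this diff. The dict layout
# and both helper names are illustrative assumptions, not project code.
ENCODED_JOBS = 38

SWEEPS = {
    "elf": {"pass": 18, "wrong_result": 5, "blocked": 2, "not_encoded": 13},
    "qmd": {"pass": 17, "wrong_result": 6, "blocked": 2, "not_encoded": 13},
}


def tally_gaps(sweeps, expected_total):
    """Return {adapter: reported_total} for adapters whose counts miss expected_total."""
    return {
        name: sum(counts.values())
        for name, counts in sweeps.items()
        if sum(counts.values()) != expected_total
    }


def status_deltas(a, b):
    """Return {status: a_count - b_count} for statuses whose counts differ."""
    statuses = sorted(set(a) | set(b))
    return {
        s: a.get(s, 0) - b.get(s, 0)
        for s in statuses
        if a.get(s, 0) != b.get(s, 0)
    }


if __name__ == "__main__":
    print(tally_gaps(SWEEPS, ENCODED_JOBS))
    print(status_deltas(SWEEPS["elf"], SWEEPS["qmd"]))
```

Both tallies sum to 38, and the only difference between the two adapters is one job that moves from `pass` to `wrong_result`, which matches the single delete/TTL tombstone case the README names.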

## Documentation

Original file line number Diff line number Diff line change
@@ -290,7 +290,7 @@
},
"result": {
"status": "pass",
"evidence": "The current evidence is same-corpus live-baseline evidence only; no real_world_job qmd adapter is encoded yet.",
"evidence": "This live_baseline_only record is same-corpus evidence only; cite qmd_live_real_world for the full live real-world sweep.",
"artifact": "docs/guide/benchmarking/live_baseline_benchmark.md"
},
"capabilities": [
@@ -314,7 +314,7 @@
{
"suite_id": "retrieval",
"status": "not_encoded",
"evidence": "qmd is a retrieval-debug reference, but no real_world_job retrieval adapter run is encoded."
"evidence": "This live_baseline_only record does not execute real_world_job retrieval prompts; cite qmd_live_real_world for the live retrieval adapter run."
},
{
"suite_id": "memory_evolution",
@@ -425,7 +425,7 @@
{
"suite_id": "memory_evolution",
"status": "wrong_result",
"evidence": "qmd passed the delete/TTL case but failed five current-versus-historical conflict jobs because retrieval-backed answers did not provide the required historical conflict evidence links."
"evidence": "qmd failed all six memory-evolution jobs in the fresh June 11 diagnostic, including the delete/TTL tombstone job where qmd retrieved only the current plan and missed the tombstone evidence."
},
{
"suite_id": "consolidation",
@@ -1036,11 +1036,12 @@
},
"run": {
"status": "not_encoded",
"evidence": "No expanded qmd stress or real_world_job deep-profile artifact is checked in for this adapter-pack gate."
"evidence": "The XY-899 strength-profile report is checked in, but no new live qmd deep-profile adapter artifact is claimed from it."
},
"result": {
"status": "not_encoded",
"evidence": "qmd deep retrieval-debug evidence remains a planned profile, not a new pass claim."
"evidence": "The XY-899 report records qmd scenario-level retrieval/debug/replay outcomes and wrong-result diagnosis taxonomy, while expansion/fusion/rerank scoring remains not_encoded.",
"artifact": "docs/research/2026-06-11-qmd-openviking-strength-profile-report.json"
},
"capabilities": [
{
@@ -1051,7 +1052,7 @@
{
"capability": "real_world_job_adapter",
"status": "not_encoded",
"evidence": "The qmd live real-world slice covers representative jobs only; expanded retrieval-debug suites need their own materialized adapter run."
"evidence": "The qmd live real-world sweep covers the current encoded fixture corpus; expanded retrieval-debug strength suites still need their own materialized adapter run."
},
{
"capability": "host_global_install_boundary",
@@ -1107,7 +1108,7 @@
{
"adapter_id": "openviking_deep_profile_gate",
"project": "OpenViking",
"adapter_kind": "docker_local_embed_deep_profile_gate",
"adapter_kind": "docker_local_embed_context_trajectory_gate",
"evidence_class": "research_gate",
"docker_default": true,
"host_global_installs_required": false,
@@ -1120,11 +1121,12 @@
},
"run": {
"status": "not_encoded",
"evidence": "The adapter cannot fairly exercise hierarchical trajectory behavior until same-corpus add_resource/find returns evidence-bearing results."
"evidence": "The XY-899 strength-profile report records staged retrieval, hierarchy selection, recursive/context expansion, and missed-term evidence as typed not_tested or wrong_result states; no new live trajectory adapter artifact is claimed."
},
"result": {
"status": "not_encoded",
"evidence": "No OpenViking deep context-trajectory result is claimed from the current wrong-result smoke run."
"evidence": "No OpenViking deep context-trajectory result is claimed from the current wrong-result smoke run; the XY-899 report preserves the trajectory surfaces as not_tested.",
"artifact": "docs/research/2026-06-11-qmd-openviking-strength-profile-report.json"
},
"capabilities": [
{
@@ -1135,7 +1137,7 @@
{
"capability": "hierarchical_context_trajectory",
"status": "not_encoded",
"evidence": "Stage trajectory scoring is not encoded until setup reaches runnable OpenViking APIs."
"evidence": "Stage trajectory scoring remains not encoded until the smoke adapter returns evidence-bearing same-corpus output instead of the current wrong_result missed-term evidence."
},
{
"capability": "host_global_install_boundary",
Expand Down
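
The manifest hunks above pair each `status` with an `evidence` string, and the README prose spells some states with hyphens (`wrong-result`) while the JSON uses underscores (`wrong_result`). A reviewer could guard against that drift with a check along these lines. This is a sketch under stated assumptions: the status set is assembled only from values visible in this diff and may be incomplete, `invalid_records` is a hypothetical helper, and the sample evidence strings are abridged or made up for the example.

```python
import json

# Status values visible in this diff. The real manifest schema may allow
# others, so treat this set as an assumption.
KNOWN_STATUSES = {
    "pass", "wrong_result", "incomplete", "blocked", "not_encoded", "not_tested",
}


def invalid_records(records):
    """Return (index, reason) pairs for records with an unknown status or empty evidence."""
    problems = []
    for i, rec in enumerate(records):
        status = rec.get("status")
        if status not in KNOWN_STATUSES:
            problems.append((i, f"unknown status: {status!r}"))
        elif not str(rec.get("evidence", "")).strip():
            problems.append((i, "missing evidence"))
    return problems


# Sample records shaped like the suite entries in the diff. The second one
# uses the hyphenated spelling on purpose to trigger the check.
SAMPLE = json.loads("""
[
  {"suite_id": "retrieval", "status": "not_encoded",
   "evidence": "This live_baseline_only record does not execute real_world_job retrieval prompts."},
  {"suite_id": "memory_evolution", "status": "wrong-result",
   "evidence": "hyphenated on purpose"}
]
""")

if __name__ == "__main__":
    for index, reason in invalid_records(SAMPLE):
        print(index, reason)
```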