21 changes: 15 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -168,7 +168,8 @@ provider-backed ELF evidence was required.
does not create a managed-memory parity claim. The new `proactive_brief` fixture
scores 5 jobs, with 4 pass and 1 blocked private-corpus case; it does not create
Pulse or hosted managed-memory parity.
- Dreaming competitor-strength closeout after XY-955: the June 17 retest keeps ELF
- Dreaming competitor-strength closeout after XY-955: the June 17 competitor-strength closeout
retest keeps ELF
locally and partially stronger only. The aggregate fixture retest remains 53 pass
and 7 typed blockers, the representative graph/RAG slice remains typed non-pass,
first-generation OSS fixture coverage remains 4 pass and 2 blocked, and the fresh
Expand Down Expand Up @@ -216,6 +217,11 @@ provider-backed ELF evidence was required.
boundaries. This upgrades ELF's own knowledge-page evidence from fixture-only to
service-native proof, but it does not claim llm-wiki, gbrain, GraphRAG, RAGFlow,
LightRAG, or graphify parity without comparable contained adapter outputs.
- Knowledge Workspace version diffs after XY-1019: the June 20 follow-up adds
`elf.knowledge_page.version_diff/v1` readback under knowledge page rebuild metadata
and surfaces it as `page_version_diff` in benchmark artifacts. The live command now
reports `version_diff_coverage = 1.000` while preserving deterministic page content
hashes and `source_mutation_allowed = false`.
- Operator-approved public-proxy addendum after XY-930: the June 19 follow-up runs
`cargo make baseline-production-private-addendum` with a simulated/public-proxy
production corpus manifest approved for this stage. The run records 12 documents,
Expand Down Expand Up @@ -342,6 +348,8 @@ Detailed evidence and interpretation:
- [Service-Native Dreaming Readback Report - June 19, 2026](docs/evidence/benchmarking/2026-06-19-service-native-dreaming-readback-report.md)
- [OpenMemory UI/Export Product Readback Report - June 19, 2026](docs/evidence/benchmarking/2026-06-19-openmemory-ui-export-product-readback-report.md)
- [Operator-Approved Public-Proxy Production-Private Addendum - June 19, 2026](docs/evidence/benchmarking/2026-06-19-operator-approved-public-proxy-production-private-addendum.md)
- [Knowledge Workspace Version-Diff Report - June 20, 2026](docs/evidence/benchmarking/2026-06-20-knowledge-workspace-version-diff-report.md)
- [Live Knowledge-Page Rebuild/Lint Report - June 20, 2026](docs/evidence/benchmarking/2026-06-20-live-knowledge-page-rebuild-lint-report.md)
- [Live Baseline Benchmark Runbook](docs/runbook/benchmarking/live_baseline_benchmark.md)
- [Single-User Production Runbook](docs/runbook/single_user_production.md)
- Benchmark contract:
Expand Down Expand Up @@ -443,11 +451,12 @@ Detailed comparison, mechanism-level analysis, and source map:
- [Dreaming Product Surface Follow-Up Research](docs/research/dreaming_product_surface_followup.md)

Latest real-world benchmark report: June 20, 2026. Latest external research refresh:
June 11, 2026; June 20 adds the Live Knowledge-Page Rebuild/Lint Report - June 20, 2026
after the June 19 XY-930 operator-approved public-proxy production addendum and
service-native Dreaming readback, the qmd debug-ergonomics Dreaming retest, the
June 17 competitor-strength closeout, and the June 16 temporal reconciliation,
live consolidation self-check, proactive-brief, and scheduled-memory scoring evidence.
June 11, 2026; June 20 adds the Knowledge Workspace Version-Diff Report - June 20, 2026
and the Live Knowledge-Page Rebuild/Lint Report - June 20, 2026 after the June 19
XY-930 operator-approved public-proxy production addendum and service-native Dreaming
readback, the qmd debug-ergonomics Dreaming retest, the June 17 competitor-strength
closeout, and the June 16 temporal reconciliation, live consolidation self-check,
proactive-brief, and scheduled-memory scoring evidence.

## Documentation

Expand Down
51 changes: 48 additions & 3 deletions apps/elf-eval/src/bin/real_world_job_benchmark.rs
Original file line number Diff line number Diff line change
Expand Up @@ -460,6 +460,8 @@ struct DerivedPageArtifact {
lint_findings: Vec<DerivedPageLintFinding>,
#[serde(skip_serializing_if = "Option::is_none")]
rebuild: Option<DerivedPageRebuild>,
#[serde(skip_serializing_if = "Option::is_none")]
page_version_diff: Option<Value>,
}

#[derive(Clone, Debug, Deserialize, Serialize)]
Expand Down Expand Up @@ -1271,10 +1273,12 @@ struct KnowledgeSummary {
section_count: usize,
backlink_count: usize,
pages_with_backlinks: usize,
pages_with_version_diff: usize,
citation_coverage: f64,
stale_claim_detection: f64,
rebuild_determinism: f64,
backlink_coverage: f64,
version_diff_coverage: f64,
page_usefulness: f64,
unsupported_summary_count: usize,
untraced_section_count: usize,
Expand Down Expand Up @@ -1459,6 +1463,7 @@ struct KnowledgeJobMetrics {
unsupported_summary_count: usize,
backlink_count: usize,
pages_with_backlinks: usize,
pages_with_version_diff: usize,
stale_trap_count: usize,
stale_traps_detected: usize,
rebuild_page_count: usize,
Expand All @@ -1469,6 +1474,7 @@ struct KnowledgeJobMetrics {
stale_claim_detection: f64,
rebuild_determinism: f64,
backlink_coverage: f64,
version_diff_coverage: f64,
page_usefulness: f64,
}

Expand Down Expand Up @@ -2195,6 +2201,23 @@ fn validate_page_artifact(
page.page_id
));
}
if let Some(diff) = &page.page_version_diff {
if !diff.is_object() {
return Err(eyre::eyre!(
"{} page {} previous-version diff must be a JSON object.",
path.display(),
page.page_id
));
}
if diff.get("schema").and_then(Value::as_str) != Some("elf.knowledge_page.version_diff/v1")
{
return Err(eyre::eyre!(
"{} page {} previous-version diff has an unexpected schema.",
path.display(),
page.page_id
));
}
}

Ok(())
}
Expand Down Expand Up @@ -3854,6 +3877,7 @@ fn knowledge_metrics(job: &RealWorldJob, answer: &ProducedAnswer) -> Option<Know
ratio_or_full(metrics.stale_traps_detected, metrics.stale_trap_count);
metrics.rebuild_determinism = ratio(metrics.deterministic_rebuild_count, metrics.page_count);
metrics.backlink_coverage = ratio(metrics.pages_with_backlinks, metrics.page_count);
metrics.version_diff_coverage = ratio(metrics.pages_with_version_diff, metrics.page_count);
metrics.page_usefulness = round3(
(metrics.citation_coverage
+ metrics.stale_claim_detection
Expand All @@ -3876,6 +3900,9 @@ fn accumulate_page_metrics(page: &DerivedPageArtifact, metrics: &mut KnowledgeJo
if !page.backlinks.is_empty() {
metrics.pages_with_backlinks += 1;
}
if page_has_version_diff(page) {
metrics.pages_with_version_diff += 1;
}

metrics.backlink_count += page.backlinks.len();

Expand Down Expand Up @@ -3911,6 +3938,13 @@ fn accumulate_page_metrics(page: &DerivedPageArtifact, metrics: &mut KnowledgeJo
metrics.rebuild_page_count += 1;
}

fn page_has_version_diff(page: &DerivedPageArtifact) -> bool {
page.page_version_diff.as_ref().is_some_and(|diff| {
diff.get("schema").and_then(Value::as_str) == Some("elf.knowledge_page.version_diff/v1")
&& diff.get("available").and_then(Value::as_bool).unwrap_or(false)
})
}

fn section_is_traced(section: &DerivedPageSection) -> bool {
!section.evidence_ids.is_empty() || !section.timeline_event_ids.is_empty()
}
Expand Down Expand Up @@ -5804,6 +5838,8 @@ fn knowledge_summary(jobs: &[JobReport]) -> Option<KnowledgeSummary> {
let backlink_count = knowledge_jobs.iter().map(|metrics| metrics.backlink_count).sum::<usize>();
let pages_with_backlinks =
knowledge_jobs.iter().map(|metrics| metrics.pages_with_backlinks).sum::<usize>();
let pages_with_version_diff =
knowledge_jobs.iter().map(|metrics| metrics.pages_with_version_diff).sum::<usize>();
let page_usefulness = round3(
knowledge_jobs.iter().map(|metrics| metrics.page_usefulness).sum::<f64>()
/ job_count as f64,
Expand All @@ -5815,10 +5851,12 @@ fn knowledge_summary(jobs: &[JobReport]) -> Option<KnowledgeSummary> {
section_count,
backlink_count,
pages_with_backlinks,
pages_with_version_diff,
citation_coverage: ratio(traced_section_count, section_count),
stale_claim_detection: ratio_or_full(stale_traps_detected, stale_trap_count),
rebuild_determinism: ratio(deterministic_rebuild_count, rebuild_page_count),
backlink_coverage: ratio(pages_with_backlinks, page_count),
version_diff_coverage: ratio(pages_with_version_diff, page_count),
page_usefulness,
unsupported_summary_count: knowledge_jobs
.iter()
Expand Down Expand Up @@ -6810,6 +6848,10 @@ fn render_markdown_optional_summary_metrics(out: &mut String, summary: &ReportSu
"- Backlinks: `{}` total, `{:.3}` page coverage\n",
knowledge.backlink_count, knowledge.backlink_coverage
));
out.push_str(&format!(
"- Version diff coverage: `{:.3}`\n",
knowledge.version_diff_coverage
));
out.push_str(&format!("- Page usefulness: `{:.3}`\n", knowledge.page_usefulness));
out.push_str(&format!(
"- Unsupported summary count: `{}`\n",
Expand Down Expand Up @@ -7296,22 +7338,25 @@ fn render_markdown_knowledge(out: &mut String, report: &RealWorldReport) {
}

out.push_str("## Knowledge Page Metrics\n\n");
out.push_str("| Job | Pages | Sections | Citation Coverage | Stale Claim Detection | Rebuild Determinism | Page Usefulness | Backlinks | Unsupported Summaries | Untraced Sections | Allowed Variance |\n");
out.push_str("| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |\n");
out.push_str("| Job | Pages | Sections | Citation Coverage | Stale Claim Detection | Rebuild Determinism | Version Diff Coverage | Page Usefulness | Backlinks | Unsupported Summaries | Untraced Sections | Allowed Variance |\n");
out.push_str(
"| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |\n",
);

for job in knowledge_jobs {
let Some(knowledge) = &job.knowledge else {
continue;
};

out.push_str(&format!(
"| {} | {} | {} | `{:.3}` | `{:.3}` | `{:.3}` | `{:.3}` | {} | {} | {} | {} |\n",
"| {} | {} | {} | `{:.3}` | `{:.3}` | `{:.3}` | `{:.3}` | `{:.3}` | {} | {} | {} | {} |\n",
md_cell(job.job_id.as_str()),
knowledge.page_count,
knowledge.section_count,
knowledge.citation_coverage,
knowledge.stale_claim_detection,
knowledge.rebuild_determinism,
knowledge.version_diff_coverage,
knowledge.page_usefulness,
knowledge.backlink_count,
knowledge.unsupported_summary_count,
Expand Down
9 changes: 9 additions & 0 deletions apps/elf-eval/src/bin/real_world_live_adapter.rs
Original file line number Diff line number Diff line change
Expand Up @@ -350,6 +350,7 @@ struct KnowledgeMaterializationEvidence {
unsupported_claim_count: usize,
citation_count: usize,
source_ref_count: usize,
version_diff_available: bool,
}

#[derive(Clone, Debug, Default, Serialize)]
Expand Down Expand Up @@ -3455,6 +3456,7 @@ fn knowledge_page_artifact(
"sections": sections,
"backlinks": source_backlinks(ingested),
"lint_findings": lint_findings_for_page(loaded, ingested, lint),
"page_version_diff": second.page.previous_version_diff.clone(),
"rebuild": {
"first_hash": first.page.content_hash.clone(),
"second_hash": second.page.content_hash.clone(),
Expand Down Expand Up @@ -3485,6 +3487,13 @@ fn knowledge_materialization_evidence(
unsupported_claim_count,
citation_count: page.sections.iter().map(|section| section.citation_count).sum(),
source_ref_count: page.source_refs.len(),
version_diff_available: page
.page
.previous_version_diff
.as_ref()
.and_then(|diff| diff.get("available"))
.and_then(serde_json::Value::as_bool)
.unwrap_or(false),
}
}

Expand Down
23 changes: 23 additions & 0 deletions apps/elf-eval/tests/real_world_job_benchmark.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2348,6 +2348,16 @@ fn live_knowledge_page_rebuild_lint_has_dedicated_docker_task() -> Result<()> {
fs::read_to_string(workspace.join("scripts/real-world-knowledge-live-adapter.sh"))?;
let live_adapter =
fs::read_to_string(workspace.join("apps/elf-eval/src/bin/real_world_live_adapter.rs"))?;
let knowledge_spec = fs::read_to_string(
workspace.join("docs").join("spec").join("system_knowledge_pages_v1.md"),
)?;
let version_diff_report = fs::read_to_string(
workspace
.join("docs")
.join("evidence")
.join("benchmarking")
.join("2026-06-20-knowledge-workspace-version-diff-report.md"),
)?;
let benchmark_runbook = fs::read_to_string(
workspace
.join("docs")
Expand Down Expand Up @@ -2380,16 +2390,29 @@ fn live_knowledge_page_rebuild_lint_has_dedicated_docker_task() -> Result<()> {
assert!(live_script.contains("knowledge_page_lint"));
assert!(live_script.contains("knowledge_pages_search"));
assert!(live_script.contains("pages remain derived benchmark artifacts"));
assert!(live_adapter.contains("\"page_version_diff\""));
assert!(live_adapter.contains("version_diff_available"));
assert!(live_adapter.contains("fn materialize_elf_knowledge("));
assert!(live_adapter.contains("KnowledgePageRebuildRequest"));
assert!(live_adapter.contains("KnowledgePageLintRequest"));
assert!(live_adapter.contains("KnowledgePageSearchRequest"));
assert!(
fs::read_to_string(workspace.join("apps/elf-eval/src/bin/real_world_job_benchmark.rs"))?
.contains("version_diff_coverage")
);
assert!(knowledge_spec.contains("elf.knowledge_page.version_diff/v1"));
assert!(
version_diff_report.contains("Knowledge Workspace Version-Diff Report - June 20, 2026")
);
assert!(version_diff_report.contains("version_diff_coverage = 1.000"));
assert!(benchmark_runbook.contains("Current live knowledge-page rebuild/lint increment"));
assert!(benchmark_runbook.contains("cargo make real-world-memory-live-knowledge"));
assert!(benchmark_runbook.contains("tmp/real-world-memory/live-knowledge/summary.json"));
assert!(live_runbook.contains("cargo make real-world-memory-live-knowledge"));
assert!(benchmarking_index.contains("2026-06-20-live-knowledge-page-rebuild-lint-report.md"));
assert!(benchmarking_index.contains("2026-06-20-knowledge-workspace-version-diff-report.md"));
assert!(readme.contains("Live Knowledge-Page Rebuild/Lint Report - June 20, 2026"));
assert!(readme.contains("Knowledge Workspace Version-Diff Report - June 20, 2026"));

Ok(())
}
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
---
type: Evidence
title: "Knowledge Workspace Version-Diff Report - June 20, 2026"
description: "Checked-in benchmark evidence record: Knowledge Workspace Version-Diff Report - June 20, 2026."
resource: docs/evidence/benchmarking/2026-06-20-knowledge-workspace-version-diff-report.md
status: active
authority: current_state
owner: evidence
last_verified: 2026-06-20
tags:
- docs
- evidence
- benchmarking
---
# Knowledge Workspace Version-Diff Report - June 20, 2026

Goal: Close XY-1019's product-quality Knowledge Workspace increment by proving
derived pages expose previous-version diffs while preserving citations, lint,
rebuild determinism, search readback, and source-of-truth boundaries.
Read this when: You need to know whether ELF knowledge pages now show rebuild diffs
without turning derived pages into authoritative memory.
Inputs: `cargo make real-world-memory-live-knowledge`,
`packages/elf-service/src/knowledge.rs`,
`apps/elf-eval/src/bin/real_world_live_adapter.rs`, and
`apps/elf-eval/src/bin/real_world_job_benchmark.rs`.
Outputs: Service and benchmark evidence for `elf.knowledge_page.version_diff/v1`.

## Executive Judgment

ELF Knowledge Workspace pages now expose previous-version diff metadata under
`rebuild_metadata.previous_version_diff` and surface it as `page_version_diff` in
live benchmark artifacts. The diff records previous/new content and source hashes,
title/source/content change booleans, section added/removed/changed/unchanged counts,
section key lists, a summary, and `source_mutation_allowed = false`.
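
The section added/removed/changed/unchanged counts described above can be sketched as a comparison of two section maps. This is a minimal illustration, not the service implementation in `packages/elf-service/src/knowledge.rs`: the `SectionDelta` type, the `section_delta` and `page` functions, and the section keys and hashes are all hypothetical, and it assumes each page version can be reduced to a map from section key to section content hash.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical per-section delta counts; field names are illustrative only.
#[derive(Debug, Default, PartialEq, Eq)]
struct SectionDelta {
    added: usize,
    removed: usize,
    changed: usize,
    unchanged: usize,
}

/// Compare two page versions, each given as section key -> section content hash.
fn section_delta(
    previous: &BTreeMap<String, String>,
    current: &BTreeMap<String, String>,
) -> SectionDelta {
    let mut delta = SectionDelta::default();
    // Walk the union of section keys so every section is classified exactly once.
    let keys: BTreeSet<&String> = previous.keys().chain(current.keys()).collect();
    for key in keys {
        match (previous.get(key), current.get(key)) {
            (None, Some(_)) => delta.added += 1,
            (Some(_), None) => delta.removed += 1,
            (Some(old), Some(new)) if old == new => delta.unchanged += 1,
            (Some(_), Some(_)) => delta.changed += 1,
            (None, None) => unreachable!("key came from one of the maps"),
        }
    }
    delta
}

/// Build a hypothetical page version from (section key, content hash) pairs.
fn page(sections: &[(&str, &str)]) -> BTreeMap<String, String> {
    sections.iter().map(|(k, h)| (k.to_string(), h.to_string())).collect()
}

fn main() {
    let previous = page(&[("summary", "h1"), ("timeline", "h2"), ("risks", "h3")]);
    let current = page(&[("summary", "h1"), ("timeline", "h2b"), ("sources", "h4")]);
    println!("{:?}", section_delta(&previous, &current));
}
```

Because the comparison only reads the two versions, a diff built this way stays readback metadata and has no path to mutate source memory.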

This is a product-quality readback improvement for ELF's derived knowledge pages. It
does not claim broad llm-wiki, gbrain, GraphRAG, RAGFlow, LightRAG, or graphify parity.
External comparisons still need contained adapters with comparable page sections,
source ids, citation mappings, lint findings, previous-version diffs, and typed
statuses.

## Command Evidence

| Command | Result |
| --- | --- |
| `cargo test -p elf-service knowledge::tests::previous_version_diff_records_delta_without_changing_content_hash -- --nocapture` | Passed; proves diff metadata does not perturb page content hashes. |
| `cargo test -p elf-eval --test real_world_job_benchmark live_knowledge_page_rebuild_lint_has_dedicated_docker_task -- --nocapture` | Passed; proves the live adapter and benchmark report keep the version-diff contract wired. |
| `cargo make real-world-memory-live-knowledge` | Passed; Docker-contained live materialization reports `version_diff_coverage = 1.000`. |

## Current Live Metrics

From `tmp/real-world-memory/live-knowledge/elf-report.json`:

| Metric | Value |
| --- | ---: |
| Knowledge jobs | 2 |
| Pages | 2 |
| Pages with version diff | 2 |
| Version diff coverage | 1.000 |
| Rebuild determinism | 1.000 |
| Stale claim detection | 1.000 |
| Backlink coverage | 1.000 |
| Page usefulness | 0.938 |
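
The version diff coverage row is the share of pages that carry an available diff. The sketch below shows that arithmetic under stated assumptions: this `ratio` function is a hypothetical stand-in for the benchmark helper of the same name, whose body is not shown in this report, and both the zero-denominator behavior and the three-decimal rounding are assumptions.

```rust
/// Hypothetical stand-in for the benchmark's `ratio` helper.
/// Assumptions: an empty denominator yields 0.0, and results are rounded to three decimals.
fn ratio(numerator: usize, denominator: usize) -> f64 {
    if denominator == 0 {
        return 0.0;
    }
    let value = numerator as f64 / denominator as f64;
    (value * 1000.0).round() / 1000.0
}

fn main() {
    // Values from the table above: 2 pages, 2 of them with a version diff.
    println!("version_diff_coverage = {:.3}", ratio(2, 2));
    // A hypothetical run in which one of three pages lacks a diff.
    println!("version_diff_coverage = {:.3}", ratio(2, 3));
}
```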

## Contract Boundary

| Allowed claim | Boundary |
| --- | --- |
| ELF derived pages expose previous-version diff metadata after repeated rebuilds. | The diff is readback metadata only; it must not mutate source memory. |
| Search and benchmark artifacts can show `page_version_diff`. | Page snippets remain derived artifacts and must carry citations/lint/source coverage. |
| Rebuild determinism remains stable when diff metadata is present. | The page content hash excludes previous-version diff metadata. |
| External knowledge-product comparison remains future work. | Competitors need comparable contained artifacts before any parity or win/loss claim. |
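
The third boundary row, that the page content hash excludes previous-version diff metadata, can be illustrated with a small sketch. The `Page` struct and `content_hash` function are hypothetical, and `DefaultHasher` from the standard library is used only to keep the example self-contained; the real service presumably uses its own hashing scheme, which is not shown here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical page shape; the real service types are not shown in this report.
struct Page {
    title: String,
    sections: Vec<String>,
    /// Readback-only metadata, for example a serialized previous-version diff.
    previous_version_diff: Option<String>,
}

/// Hash only the fields that define page content; diff metadata is deliberately skipped.
fn content_hash(page: &Page) -> u64 {
    let mut hasher = DefaultHasher::new();
    page.title.hash(&mut hasher);
    page.sections.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let first = Page {
        title: "Deploy notes".to_string(),
        sections: vec!["summary".to_string(), "timeline".to_string()],
        previous_version_diff: None,
    };
    let second = Page {
        title: "Deploy notes".to_string(),
        sections: vec!["summary".to_string(), "timeline".to_string()],
        previous_version_diff: Some("{\"available\":true}".to_string()),
    };
    // Same content, different metadata: the hashes match, so the rebuild counts as deterministic.
    println!("hashes match: {}", content_hash(&first) == content_hash(&second));
    println!("second page carries a diff: {}", second.previous_version_diff.is_some());
}
```

Keeping the diff out of the hashed fields is what lets a second rebuild attach diff metadata without registering as a content change.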

## Follow-Up Queue

| Follow-up | Reason |
| --- | --- |
| XY-1020 | Temporal graph-lite facts can now feed cited pages without making pages source truth. |
| XY-1021 | Dreaming review queue can propose page rebuilds using source-backed diffs and lint. |
| Graph/RAG contained adapters | External comparison needs comparable version-diff and citation/lint outputs. |

Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ Goal: Close XY-935 by moving ELF knowledge-page rebuild/lint scoring from fixtur
evidence into a Docker-contained service materialization command.
Read this when: You need to know whether ELF has service-native evidence for
derived knowledge pages, citation coverage, stale-source lint, unsupported sections,
rebuild metadata, backlinks, and page search.
rebuild metadata, previous-version diffs, backlinks, and page search.
Inputs: `cargo make real-world-memory-knowledge`,
`cargo make real-world-memory-live-knowledge`,
`apps/elf-eval/fixtures/real_world_memory/knowledge/`, and
Expand All @@ -37,7 +37,7 @@ This improves ELF's own knowledge-page authority from fixture-only page artifact
service-backed rebuild/lint/search evidence. It does not prove parity or superiority
against llm-wiki, gbrain, GraphRAG, RAGFlow, LightRAG, or graphify. Those comparisons
remain valid only when a contained adapter emits comparable page sections, source ids,
citation mappings, lint findings, and typed benchmark statuses.
citation mappings, lint findings, previous-version diffs, and typed benchmark statuses.

## Command Evidence

Expand Down Expand Up @@ -68,6 +68,7 @@ The command is intentionally Docker-scoped. Host execution is refused unless
| Stale-source lint | Stale source updates after rebuild produce lint findings instead of silently rewriting truth. |
| Unsupported sections | Unsupported summaries remain visible as unsupported, not hidden claims. |
| Rebuild metadata | First and second rebuild hashes, deterministic status, and allowed variance remain explicit. |
| Previous-version diff | Repeated rebuilds expose `elf.knowledge_page.version_diff/v1` metadata without changing page content hashes. |
| Backlinks and search | Page artifacts expose backlinks, and `knowledge_pages_search` returns the materialized page surface. |
| Source-of-truth boundary | Knowledge pages remain derived benchmark artifacts and do not replace Memory Notes or source records. |

Expand Down Expand Up @@ -95,6 +96,8 @@ The command is intentionally Docker-scoped. Host execution is refused unless
command for the checked-in `knowledge_compilation` fixture pack.
- The command exercises `knowledge_page_rebuild`, `knowledge_page_lint`, and
`knowledge_pages_search` before scoring.
- The current service-native artifact includes previous-version diff metadata and
reports `version_diff_coverage = 1.000`.
- ELF's own knowledge-page evidence is stronger than fixture-only proof for this
narrow slice.

Expand Down