feat(belief-state): add phase0 runtime measurement #228
✅ No Blockers —
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 86 | 86 | 86 |
| Confidence | 75 | 75 | 75 |
| Correctness | 86 | 86 | 86 |
| Security | 86 | 86 | 86 |
| Testing | 86 | 86 | 86 |
| Architecture | 86 | 86 | 86 |
Full multi-shot audit completed 3/3 planned shots over 5 changed files. Global verifier still owns final merge decision.
🟡 LOW split metadata requirement inconsistency with roadmap — .evolve/pursuits/2026-06-04-belief-state-agent-eval.md
Line 30: next-action promotion requirements dropped 'split metadata' from the list (old: '>= 200 labeled decision points, split metadata, integrity checks...', new: '>= 200 labeled decision points, integrity checks...'). However the roadmap at docs/research/belief-state-agent-eval-roadmap.md:228 still lists 'Every row has train/dev/holdout split' as a Phase 0 completion criterion. The code (phase0-measurement.ts:104,162) captures split metadata from RunRecord, but the research evidence gates (research-evidence.ts) do not currently enforce it. This may be intentional — reflecting that split is data-property not gate — but the inconsistency
🟡 LOW Test map row is aspirational, not verified — docs/research/belief-state-agent-eval-roadmap.md
Line 572 adds a test-map row for `phase0-measurement.test.ts` describing expected behavior ('joins runtime producer decisions...without fabricating missing joins or propensities'). This is a planning artifact: the test file exists in the PR's code changes (outside this shot's scope), but the roadmap assertion about what it tests is only as reliable as the test implementation. No action needed for a docs-only shot, but the global verifier should confirm the test file matches this description.
🟡 LOW Tests don't exercise label-to-point probability propagation — src/belief-state/phase0-measurement.test.ts
No test verifies that RuntimeBeliefDecisionLabel.behaviorProb or .targetProb propagate through to BeliefDecisionPoint.behaviorProb/.targetProb. The counterfactual test only asserts their absence. The underlying runtimeDecisionPointToBeliefDecisionPoint handles this correctly (tested in runtime-hooks.test.ts), but the Phase 0 integration path isn't covered. Add a test with labels carrying behaviorProb=0.3 and targetProb=0.5 and assert withBehaviorProb and withTargetProb summary counts.
🟡 LOW Loose options spread passes runtime fields to downstream packet builder — src/belief-state/phase0-measurement.ts
Lines 127-130: `buildBeliefDecisionResearchEvidencePacket({ ...options, points })` spreads all BuildRuntimeBeliefPhase0MeasurementOptions fields, including runs, decisions, events, and labels, into the packet builder, which passes them to analyzeBeliefDecisionCorpus. JS ignores unknown properties at runtime, so there is no functional bug, but the spread invites future field name collisions and makes the call harder to audit. Fix: destructure only the fields that BuildBeliefDecisionResearchEvidencePacketOptions accepts.
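A minimal sketch of the suggested fix, assuming hypothetical option shapes: `PacketOptions`, `MeasurementOptions`, `corpusId`, `minLabeledPoints`, and `toPacketOptions` are illustrative stand-ins, not the PR's real interfaces or field names. It shows picking the packet-builder fields explicitly instead of spreading the whole options object.

```typescript
// Hypothetical stand-in for BuildBeliefDecisionResearchEvidencePacketOptions.
interface PacketOptions {
  corpusId?: string;
  minLabeledPoints?: number;
}

// Hypothetical stand-in for BuildRuntimeBeliefPhase0MeasurementOptions:
// packet fields plus the runtime-only inputs.
interface MeasurementOptions extends PacketOptions {
  runs: unknown[];
  decisions: unknown[];
  events: unknown[];
  labels: unknown[];
}

// Forward only the fields the packet builder accepts; runs, decisions,
// events, and labels never reach the downstream call.
function toPacketOptions(options: MeasurementOptions): PacketOptions {
  const { corpusId, minLabeledPoints } = options;
  return { corpusId, minLabeledPoints };
}

const forwarded = toPacketOptions({
  corpusId: "phase0",
  minLabeledPoints: 200,
  runs: [],
  decisions: [],
  events: [],
  labels: [],
});
console.log(Object.keys(forwarded));
```

Listing the fields by name means a new runtime-only option cannot leak into the packet builder by accident; the trade-off is that each new packet option must be added to the pick list.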
🟡 LOW compactMetadata duplicated across phase0-measurement.ts and runtime-hooks.ts — src/belief-state/phase0-measurement.ts
Lines 175-178 define `compactMetadata` identically to `runtime-hooks.ts:381-384`. Same signature, same filter-from-entries logic. It should be extracted to a shared internal utility (e.g. `./internal/compact-metadata.ts`) or re-exported from runtime-hooks. Minor DRY violation that increases maintenance surface.
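One possible shape for the shared helper, as a sketch only. The finding says the logic filters from entries but does not show the body, so the signature and the rule used here (drop entries whose value is `undefined`) are assumptions rather than the PR's actual implementation.

```typescript
// Hypothetical shared helper, e.g. the body of ./internal/compact-metadata.ts
// (it would be exported from that module).
// Assumption: "compact" means removing keys whose value is undefined.
function compactMetadata(
  metadata: Record<string, unknown>,
): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(metadata).filter(([, value]) => value !== undefined),
  );
}

// null is kept; only undefined is dropped under this assumption.
const compacted = compactMetadata({ split: "dev", seed: undefined, note: null });
console.log(compacted);
```

Both `phase0-measurement.ts` and `runtime-hooks.ts` could then import the single definition, so a change to the filtering rule happens in one place.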
🟡 LOW labelJoinRate conflates label join success with downstream validation failures — src/belief-state/phase0-measurement.ts
Line 157: `labelJoinRate: ratio(points.length, producerDecisionCount)`. The numerator is `points.length` (points that passed ALL downstream validation via `runtimeDecisionPointToBeliefDecisionPoint`), not `producerDecisionCount - missingRunRecordCount - missingLabelCount`. If a label join succeeds but `runtimeDecisionPointToBeliefDecisionPoint` returns no point (e.g. `chosenAction` missing, unsupported `kind`), `labelJoinRate` drops without any diagnostic explaining the mismatch between `missingLabelCount` and the actual completed count. The `runJoinRate` on [line 156](https://github.com/tangle-network/agent-eval/blob/ab2a6f7ad415e3b4a6c0d76309be8a8750b7
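
The distinction raised in this finding can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `JoinCounts`, `summarizeJoins`, `pointConversionRate`, and `droppedAfterJoinCount` are hypothetical names, the local `ratio` helper returns 0 for a zero denominator (the real helper may behave differently), and the missing-run and missing-label counts are assumed to be disjoint.

```typescript
// Hypothetical counters mirroring the names used in the finding.
interface JoinCounts {
  producerDecisionCount: number;
  missingRunRecordCount: number;
  missingLabelCount: number;
  pointCount: number; // points that survived downstream conversion
}

// Assumed helper: safe division that yields 0 when the denominator is 0.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

// Report join success and conversion success as separate numbers so a
// drop in one is not mistaken for a drop in the other.
function summarizeJoins(counts: JoinCounts) {
  const joinedCount =
    counts.producerDecisionCount -
    counts.missingRunRecordCount -
    counts.missingLabelCount;
  return {
    labelJoinRate: ratio(joinedCount, counts.producerDecisionCount),
    pointConversionRate: ratio(counts.pointCount, joinedCount),
    droppedAfterJoinCount: joinedCount - counts.pointCount,
  };
}

const joinSummary = summarizeJoins({
  producerDecisionCount: 10,
  missingRunRecordCount: 1,
  missingLabelCount: 2,
  pointCount: 6,
});
console.log(joinSummary);
```

With these inputs, seven decisions join successfully and one of them is dropped during conversion; the extra counter makes that one dropped point visible instead of folding it into the join rate.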
tangletools · 2026-06-06T13:55:29Z · trace
tangletools
left a comment
✅ Approved — 6 non-blocking findings — ab2a6f7a
Full multi-shot audit completed 3/3 planned shots over 5 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-06T13:55:29Z · immutable trace
tangletools
left a comment
✅ Refreshed approval after new commits — bcf09d0f
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-06T17:28:50Z
Summary
- `buildRuntimeBeliefPhase0Measurement()` for joining runtime producer decisions, lifecycle evidence, labels, and run split metadata
- `BeliefDecisionPoint` rows plus a `BeliefDecisionResearchEvidencePacket` and coverage summary
- `tests/belief-state/` because it exercises cross-module join behavior rather than a single local unit

Verification
- `pnpm exec vitest run tests/belief-state/phase0-measurement.test.ts`
- `pnpm exec vitest run src/belief-state`
- `pnpm typecheck`
- `pnpm lint`
- `pnpm test`
- `pnpm build`
- `pnpm verify:package`

Notes:
- `pnpm lint` exits cleanly with two pre-existing warnings outside this patch.