feat(retrieval): confidence-gate + dedup page-based citations #32
FinanceBench (n=8) showed the page-based strategy's loss is commitment,
not retrieval: a confident single citation scored f1=1.0/hit=1.0
(id_00807, id_00941), but when unsure the agent sprayed ~5 page ranges
and missed all of them — once returning the same section id five times
(id_00499: sec_363... x5), deflating precision.
This makes the final citation set tight:
- Dedup: every terminal cited-range set now flows through one
selectCitedRanges chokepoint that clamps + dedups by range, so a
range cited five times collapses to one and a section id never
repeats in SelectedIDs.
- Confidence gate: the done action takes an optional confidence (0-1),
parsed from a top-level field or a rich cited_pages object form
({"pages":[5,7],"confidence":0.9}). Surfaced on Result.Confidence,
Result.Confidences, the done event, and the API response.
- Cap: when done cites more than retrieval.pageindex.max_citations
(default 3) distinct ranges, keep the top-N by confidence (ties ->
emission order). Mechanical backstop even when the prompt doesn't
fully tame the spray. max_hops is unchanged — this bounds the FINAL
citation set, not navigation.
- Citations are now built from the final cited ranges (deduped+capped)
rather than every page read, so a confident single pick surfaces ONE
citation even when several pages were skimmed; falls back to the
PagesRead footprint on a refusal / hop-capped run.
- System prompt gains a CITATION DISCIPLINE block: cite the fewest
ranges (ideally one), framing a spray of uncertain ranges as worse
than committing, and keeping the single best pick even at low
confidence (annotate, don't suppress).
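The dedup, cap, and citation-fallback bullets above can be sketched in Go roughly as follows. This is a minimal illustration, not the PR's code. `PageRange`, `citedRange`, `selectCited`, and `citationsFor` are hypothetical names; the real chokepoint is `selectCitedRanges`, whose signature the PR does not show. The sketch also assumes that a missing confidence counts as 0, that a duplicate keeps its higher confidence, that survivors come back in emission order, that a cap of 0 means "no cap", and that `PagesRead` is a list of page numbers.

```go
package main

import (
	"fmt"
	"sort"
)

// PageRange is a hypothetical inclusive page span [Start, End].
type PageRange struct {
	Start, End int
}

// citedRange is one range from the done action, with the model's confidence
// (0 when none was given, an assumption) and its emission position.
type citedRange struct {
	Range      PageRange
	Confidence float64
	order      int
}

// clamp trims r to [1, numPages]; a range lying entirely outside the
// document ends up with Start > End.
func clamp(r PageRange, numPages int) PageRange {
	if r.Start < 1 {
		r.Start = 1
	}
	if r.End > numPages {
		r.End = numPages
	}
	return r
}

// selectCited clamps and dedups by range (a duplicate keeps the higher
// confidence). If more than maxCitations distinct ranges remain, it keeps the
// top-N by confidence, with ties going to the earlier emission. Survivors are
// returned in emission order. maxCitations <= 0 disables the cap here.
func selectCited(in []citedRange, numPages, maxCitations int) []citedRange {
	index := map[PageRange]int{} // clamped range -> position in out
	var out []citedRange
	for i, c := range in {
		r := clamp(c.Range, numPages)
		if r.Start > r.End {
			continue // nothing left after clamping
		}
		if j, ok := index[r]; ok {
			if c.Confidence > out[j].Confidence {
				out[j].Confidence = c.Confidence
			}
			continue
		}
		index[r] = len(out)
		out = append(out, citedRange{Range: r, Confidence: c.Confidence, order: i})
	}
	if maxCitations > 0 && len(out) > maxCitations {
		// A stable sort keeps emission order among equal confidences.
		sort.SliceStable(out, func(a, b int) bool { return out[a].Confidence > out[b].Confidence })
		out = out[:maxCitations]
		sort.Slice(out, func(a, b int) bool { return out[a].order < out[b].order })
	}
	return out
}

// citationsFor builds citations[] from the final ranges, falling back to the
// pages-read footprint when nothing was cited (refusal / hop-capped run).
func citationsFor(final []citedRange, pagesRead []int) []PageRange {
	var out []PageRange
	for _, c := range final {
		out = append(out, c.Range)
	}
	if len(out) == 0 {
		for _, p := range pagesRead {
			out = append(out, PageRange{Start: p, End: p})
		}
	}
	return out
}

func main() {
	spray := []citedRange{
		{Range: PageRange{Start: 12, End: 13}, Confidence: 0.4},
		{Range: PageRange{Start: 12, End: 13}, Confidence: 0.4}, // duplicate
		{Range: PageRange{Start: 40, End: 41}, Confidence: 0.7},
		{Range: PageRange{Start: 5, End: 5}, Confidence: 0.2},
		{Range: PageRange{Start: 70, End: 72}, Confidence: 0.4},
		{Range: PageRange{Start: 98, End: 105}, Confidence: 0.5}, // clamped to 98-100
	}
	for _, c := range selectCited(spray, 100, 3) {
		fmt.Printf("%d-%d (%.1f)\n", c.Range.Start, c.Range.End, c.Confidence)
	}
	// prints 12-13 (0.4), 40-41 (0.7), 98-100 (0.5), one per line

	fmt.Println(citationsFor(nil, []int{4, 9})) // prints [{4 4} {9 9}]
}
```

Sorting on confidence alone with a stable sort is what makes "ties -> emission order" fall out for free, because the deduped slice is already in first-emission order.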
Config: RetrievalConfig.PageIndex.MaxCitations (default 3); env
VLE_/VLS_RETRIEVAL_PAGEINDEX_MAX_CITATIONS, forwarded in internal/config.
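One way the cap could be read from the environment and validated, as a sketch. The PR states only the field name, the default of 3, the two env prefixes, and (in the test plan) that `Validate` rejects a negative value. The `PageIndexConfig` struct, `loadMaxCitations`, checking `VLS_` before `VLE_`, and injecting `getenv` for testability are all assumptions; how `internal/config` actually forwards the variable is not shown.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// PageIndexConfig mirrors RetrievalConfig.PageIndex.MaxCitations
// (hypothetical struct shape).
type PageIndexConfig struct {
	MaxCitations int
}

const defaultMaxCitations = 3

// loadMaxCitations reads the cap from the environment. Accepting either
// prefix, with VLS_ checked first, is an assumption of this sketch.
func loadMaxCitations(getenv func(string) string) (PageIndexConfig, error) {
	cfg := PageIndexConfig{MaxCitations: defaultMaxCitations}
	for _, prefix := range []string{"VLS_", "VLE_"} {
		key := prefix + "RETRIEVAL_PAGEINDEX_MAX_CITATIONS"
		raw := getenv(key)
		if raw == "" {
			continue
		}
		n, err := strconv.Atoi(raw)
		if err != nil {
			return cfg, fmt.Errorf("%s: %w", key, err)
		}
		cfg.MaxCitations = n
		break
	}
	return cfg, cfg.Validate()
}

// Validate rejects a negative cap.
func (c PageIndexConfig) Validate() error {
	if c.MaxCitations < 0 {
		return errors.New("retrieval.pageindex.max_citations must be >= 0")
	}
	return nil
}

func main() {
	cfg, err := loadMaxCitations(os.Getenv)
	fmt.Println(cfg.MaxCitations, err)
}
```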
Tests: dedup (5 cites incl dupes -> <=3 distinct, no repeated id),
confident-single preserved, confidence-aware cap, low-confidence still
commits one, clamp, configurable cap, refusal carries no confidence,
parser confidence + rich cited_pages, plus API-layer dedup/cap and
single-citation proofs. Existing tests stay green.
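The parser cases in the test list (a top-level confidence given as a number or a string, plus the rich `cited_pages` object) could look roughly like this in Go. `doneArgs`, `parsedDone`, `parseDone`, and `parseConfidence` are hypothetical; only the JSON keys `cited_pages`, `pages`, and `confidence` come from the PR. Clamping out-of-range values into [0, 1] and letting the rich object's confidence take precedence over a top-level one are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// doneArgs holds the two done-action fields this sketch cares about.
type doneArgs struct {
	CitedPages json.RawMessage `json:"cited_pages"`
	Confidence json.RawMessage `json:"confidence"`
}

// parsedDone is a hypothetical parse result; Confidence is nil when absent.
type parsedDone struct {
	Pages      []int
	Confidence *float64
}

// parseConfidence accepts a JSON number or a numeric string and clamps the
// value into [0, 1] (clamping rather than rejecting is an assumption).
func parseConfidence(raw json.RawMessage) (*float64, error) {
	if len(raw) == 0 || string(raw) == "null" {
		return nil, nil
	}
	var f float64
	if err := json.Unmarshal(raw, &f); err != nil {
		var s string
		if json.Unmarshal(raw, &s) != nil {
			return nil, fmt.Errorf("confidence: %w", err)
		}
		if f, err = strconv.ParseFloat(s, 64); err != nil {
			return nil, fmt.Errorf("confidence: %w", err)
		}
	}
	if f < 0 {
		f = 0
	}
	if f > 1 {
		f = 1
	}
	return &f, nil
}

// parseDone reads cited_pages as either a plain array or the rich
// {"pages":[...],"confidence":x} object.
func parseDone(data []byte) (parsedDone, error) {
	var a doneArgs
	if err := json.Unmarshal(data, &a); err != nil {
		return parsedDone{}, err
	}
	var out parsedDone
	conf := a.Confidence
	if len(a.CitedPages) > 0 {
		if err := json.Unmarshal(a.CitedPages, &out.Pages); err != nil {
			var rich struct {
				Pages      []int           `json:"pages"`
				Confidence json.RawMessage `json:"confidence"`
			}
			if json.Unmarshal(a.CitedPages, &rich) != nil {
				return parsedDone{}, fmt.Errorf("cited_pages: %w", err)
			}
			out.Pages = rich.Pages
			if len(rich.Confidence) > 0 {
				conf = rich.Confidence // rich form wins (assumption)
			}
		}
	}
	c, err := parseConfidence(conf)
	if err != nil {
		return parsedDone{}, err
	}
	out.Confidence = c
	return out, nil
}

func main() {
	for _, in := range []string{
		`{"cited_pages":[5,7],"confidence":0.9}`,
		`{"cited_pages":[5],"confidence":"0.6"}`,
		`{"cited_pages":{"pages":[5,7],"confidence":0.9}}`,
		`{"cited_pages":[12]}`,
	} {
		d, err := parseDone([]byte(in))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if d.Confidence == nil {
			fmt.Println(d.Pages, "no confidence")
		} else {
			fmt.Println(d.Pages, *d.Confidence)
		}
	}
}
```

Keeping `Confidence` as a pointer lets a refusal or an unannotated pick stay distinguishable from an explicit confidence of 0.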
The signal (FinanceBench, n=8)
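The page-based strategy (`/v1/answer/pageindex`) shows a clean, exploitable split:
- Confident, single citation: f1=1.0 / hit=1.0 (id_00807, id_00941).
- Unsure: the agent sprays ~5 page ranges and misses all of them, once returning the same section id five times (id_00499: `sec_363...` x5), deflating precision.

The retrieval mechanism works; the loss is confidence calibration + dedup.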
What changed
1. Dedup (correctness). Every terminal cited-range set now flows through a single `selectCitedRanges` chokepoint that clamps + dedups by range. A range cited 5x collapses to one; a section id never repeats in `SelectedIDs`. The `sec_363... x5` case becomes one citation.
2. Confidence gating. The `done` action takes an optional `confidence` (0-1), parsed from a top-level field or a rich `cited_pages` object form (`{"pages":[5,7],"confidence":0.9}`). Surfaced on `Result.Confidence`, `Result.Confidences`, the `done` event, and the API response. A low confidence does not suppress the answer; the agent still returns its single best pick.
3. Cap. When `done` cites more than `retrieval.pageindex.max_citations` (default 3) distinct ranges, keep the top-N by confidence (ties -> emission order). This is a mechanical backstop in case the prompt doesn't fully tame the spray. `max_hops` is unchanged; this bounds the FINAL citation set, not navigation.
4. Tight `citations[]`. The response's `citations[]` is now built from the final cited ranges (deduped + capped) instead of every page read, so a confident single pick surfaces ONE citation even after skimming several pages. Falls back to the `PagesRead` footprint on a refusal / hop-capped run.
5. Prompt. A `CITATION DISCIPLINE` block: cite the fewest ranges (ideally one), framing a spray of uncertain ranges as worse than committing, and keeping the single best pick even at low confidence (annotate, don't suppress). Prompt wording is high-leverage, so please review it.

Config: `RetrievalConfig.PageIndex.MaxCitations` (default 3). Env `VLE_/VLS_RETRIEVAL_PAGEINDEX_MAX_CITATIONS`, forwarded in `internal/config`.

Test plan
- … (`MaxCitations=1` -> exactly one).
- … `confidence` (number/string) + rich `cited_pages` object form.
- … `VLS_`/`VLE_` forwarding, negative rejected by `Validate`.
- `go build ./...`, `go vet ./...`, full `go test ./...` green.

Do not merge yet.