Normalize safe output outcome evaluation and demote existence-only fallback #35040

Open
Copilot wants to merge 5 commits into main from copilot/define-outcome-model
Copilot AI commented May 26, 2026

This change introduces a shared, normalized outcome model for safe output evaluation and updates the existence-only fallback to report an unknown status with weak evidence instead of accepted. It also propagates the normalized fields through JSONL and telemetry output, and adds summary buckets so dashboards can distinguish strong outcomes from fallback-only signals.

  • Shared outcome model

    • Adds OutcomeStatus with: accepted, rejected, pending, ignored, skipped, unknown
    • Adds EvidenceStrength with: strong, medium, weak
    • Adds shared OutcomeEvaluation fields:
      • outcome_status
      • evidence_strength
      • signal
  • Fallback semantics

    • Changes the generic existence-only evaluator from implicit acceptance to:
      • outcome_status: unknown
      • evidence_strength: weak
      • signal: target_exists_only
    • Prevents existence-only fallbacks from inflating accepted metrics
  • Reporting and telemetry

    • Emits normalized fields in outcome JSONL
    • Carries normalized fields into OTLP outcome spans
    • Keeps legacy result alongside normalized fields for compatibility where needed
  • Summary / dashboard fields

    • Adds evidence-strength acceptance buckets:
      • accepted_strong
      • accepted_medium
      • accepted_weak
    • Adds fallback tracking:
      • fallback_exists_only_count
  • Coverage

    • Adds focused Go tests for:
      • normalized outcome derivation
      • existence-only fallback behavior
      • summary exclusion of fallback-only outcomes from accepted counts
      • JSONL emission of normalized fields
    • Adds focused JS tests for normalization and OTLP span export fields
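
The shared model and the demoted fallback listed above could be sketched roughly as follows. This is a minimal JavaScript illustration, not the code in this PR: `OUTCOME_STATUSES`, `EVIDENCE_STRENGTHS`, `makeOutcomeEvaluation`, and `evaluateExistenceOnly` are hypothetical names, and only the field names and allowed values are taken from the description.

```javascript
// Allowed values, as listed in the PR description.
const OUTCOME_STATUSES = ["accepted", "rejected", "pending", "ignored", "skipped", "unknown"];
const EVIDENCE_STRENGTHS = ["strong", "medium", "weak"];

// Hypothetical helper: build an object with the three shared fields,
// rejecting values outside the enumerations above.
function makeOutcomeEvaluation(outcomeStatus, evidenceStrength, signal) {
  if (!OUTCOME_STATUSES.includes(outcomeStatus)) {
    throw new Error(`invalid outcome_status: ${outcomeStatus}`);
  }
  if (!EVIDENCE_STRENGTHS.includes(evidenceStrength)) {
    throw new Error(`invalid evidence_strength: ${evidenceStrength}`);
  }
  return { outcome_status: outcomeStatus, evidence_strength: evidenceStrength, signal };
}

// Hypothetical existence-only evaluator. Knowing only that the target still
// exists is reported as unknown/weak rather than as an implicit acceptance.
// What to report when the target is missing is not covered by the
// description, so this sketch simply returns null in that case (assumption).
function evaluateExistenceOnly(targetExists) {
  if (!targetExists) {
    return null;
  }
  return makeOutcomeEvaluation("unknown", "weak", "target_exists_only");
}
```

Under these assumptions, `evaluateExistenceOnly(true)` yields an evaluation whose `outcome_status` is `unknown`, so it cannot be counted toward accepted metrics.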

Example of the normalized payload shape:

{
  "result": "unknown",
  "outcome_status": "unknown",
  "evidence_strength": "weak",
  "signal": "target_exists_only",
  "detail": "object still exists"
}
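
As a rough illustration of how a payload like this might travel through JSONL, the sketch below writes one evaluation per line and reads the lines back. `toJsonlLine` and `parseJsonl` are hypothetical helpers introduced for this example; the PR's actual emitters live in `pkg/cli/outcome_eval_jsonl.go` and `actions/setup/js/evaluate_outcomes.cjs` and may differ.

```javascript
// Hypothetical helper: serialize one evaluation as a single JSONL line.
// The legacy "result" field is kept alongside the normalized fields.
function toJsonlLine(evaluation) {
  return JSON.stringify(evaluation) + "\n";
}

// Hypothetical helper: parse JSONL text back into objects, skipping blank lines.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter(line => line.trim() !== "")
    .map(line => JSON.parse(line));
}

// The payload shape from the example above.
const examplePayload = {
  result: "unknown",
  outcome_status: "unknown",
  evidence_strength: "weak",
  signal: "target_exists_only",
  detail: "object still exists",
};

// Round trip: two records in, two records out.
const exampleJsonl = toJsonlLine(examplePayload) + toJsonlLine(examplePayload);
const exampleRecords = parseJsonl(exampleJsonl);
```

Each record occupies exactly one line, so a consumer that only knows the legacy `result` field can keep reading it while newer consumers pick up the normalized fields.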

pr-sous-chef: updated branch on https://github.com/github/gh-aw/actions/runs/26481536311

Generated by 👨‍🍳 PR Sous Chef · gpt54 17.2M ·


pr-sous-chef: branch updated during run https://github.com/github/gh-aw/actions/runs/26484943570

Generated by 👨‍🍳 PR Sous Chef · gpt54 7.1M ·

Copilot AI and others added 4 commits May 26, 2026 20:31
Co-authored-by: mnkiefer <8320933+mnkiefer@users.noreply.github.com>
Co-authored-by: mnkiefer <8320933+mnkiefer@users.noreply.github.com>
Co-authored-by: mnkiefer <8320933+mnkiefer@users.noreply.github.com>
Co-authored-by: mnkiefer <8320933+mnkiefer@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement normalized outcome model and fallback semantics" to "Normalize safe output outcome evaluation and demote existence-only fallback" May 26, 2026
Copilot AI requested a review from mnkiefer May 26, 2026 20:45
@mnkiefer mnkiefer marked this pull request as ready for review May 26, 2026 21:15
Copilot AI review requested due to automatic review settings May 26, 2026 21:15

Copilot AI left a comment

Pull request overview

This PR introduces a shared, normalized outcome model for safe output evaluation (status + evidence strength + signal), changes the existence-only fallback to report weak/unknown instead of accepted, and propagates the normalized fields through JSONL/telemetry outputs so dashboards can distinguish strong outcomes from fallback-only signals.

Changes:

  • Add normalized outcome fields (outcome_status, evidence_strength, signal) and derive them from legacy result/detail when needed.
  • Demote existence-only fallback from accepted to weak/unknown and track it explicitly in summaries.
  • Emit normalized fields in JSONL and include them in OTLP spans/summary attributes, with tests covering the new behavior.
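
One way the summary bucketing could work is sketched below. `summarizeOutcomes` is a hypothetical function written for this example; the bucket names (`accepted_strong`, `accepted_medium`, `accepted_weak`, `fallback_exists_only_count`) come from the PR description, while the counting rules are an assumption about how they are filled.

```javascript
// Hypothetical aggregation over normalized evaluations.
function summarizeOutcomes(evaluations) {
  const summary = {
    accepted_strong: 0,
    accepted_medium: 0,
    accepted_weak: 0,
    fallback_exists_only_count: 0,
  };
  for (const ev of evaluations) {
    // Fallback-only signals are tracked in their own counter.
    if (ev.signal === "target_exists_only") {
      summary.fallback_exists_only_count += 1;
    }
    // Only outcomes with an accepted status land in an acceptance bucket,
    // split by evidence strength. Unrecognized strengths are not counted.
    if (ev.outcome_status === "accepted") {
      if (ev.evidence_strength === "strong") summary.accepted_strong += 1;
      else if (ev.evidence_strength === "medium") summary.accepted_medium += 1;
      else if (ev.evidence_strength === "weak") summary.accepted_weak += 1;
    }
  }
  return summary;
}
```

Because the existence-only fallback now carries `outcome_status: "unknown"`, it increments only `fallback_exists_only_count` in this sketch and never an accepted bucket.
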
Summary per file:

| File | Description |
| --- | --- |
| pkg/cli/outcome_evaluation.go | Adds normalized outcome model types and normalization logic. |
| pkg/cli/outcome_eval.go | Embeds normalized fields in reports; updates summary aggregation to use normalized status/strength and adds new buckets. |
| pkg/cli/outcome_eval_test.go | Adds Go tests for normalization, existence-only fallback behavior, summary bucketing, and JSONL emission. |
| pkg/cli/outcome_eval_jsonl.go | Emits normalized fields in JSONL and adds write error handling. |
| pkg/cli/outcome_eval_generic.go | Changes generic existence-only fallback to weak/unknown and makes GH API getter injectable for tests. |
| actions/setup/js/evaluate_outcomes.cjs | Adds JS normalization helper and propagates normalized fields into JSONL + fleet summary counters. |
| actions/setup/js/evaluate_outcomes.test.cjs | Adds JS tests for normalization behavior. |
| actions/setup/js/emit_outcome_spans.cjs | Adds normalized span attributes + status mapping and summary attributes for new buckets. |
| actions/setup/js/emit_outcome_spans.test.cjs | Updates span export tests for normalized attributes and new summary fields. |

Copilot's findings

  • Files reviewed: 9/9 changed files
  • Comments generated: 2

Comment on lines +142 to +146
// Fall back to the legacy result field so older JSONL artifacts still render
// useful spans while newer artifacts carry explicit normalized fields.
const outcomeStatus = typeof eval_.outcome_status === "string" ? eval_.outcome_status : result;
const evidenceStrength = typeof eval_.evidence_strength === "string" ? eval_.evidence_strength : "weak";
const signal = typeof eval_.signal === "string" ? eval_.signal : "";
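
To make the fallback in this excerpt easier to reason about, here is a self-contained sketch that wraps the same three expressions in a function. `normalizeForSpan` is a hypothetical name, and the way `result` is derived (defaulting to `"unknown"` when the legacy field is missing) is an assumption, since the excerpt does not show where `result` comes from.

```javascript
// Hypothetical wrapper around the fallback logic quoted above.
function normalizeForSpan(eval_) {
  // Assumption: the legacy result defaults to "unknown" when absent.
  const result = typeof eval_.result === "string" ? eval_.result : "unknown";
  // Prefer explicit normalized fields; otherwise fall back to legacy data.
  const outcomeStatus = typeof eval_.outcome_status === "string" ? eval_.outcome_status : result;
  const evidenceStrength = typeof eval_.evidence_strength === "string" ? eval_.evidence_strength : "weak";
  const signal = typeof eval_.signal === "string" ? eval_.signal : "";
  return { outcomeStatus, evidenceStrength, signal };
}
```

With this sketch, an older artifact such as `{ result: "accepted" }` maps to status `accepted` with `weak` evidence and an empty signal, while a newer artifact's explicit fields pass through unchanged.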
Comment on lines 228 to +232
const out = {
result: "pending",
outcome_status: "pending",
evidence_strength: "medium",
signal: "pending",
@github-actions
Please resolve the unstable merge/check state so Copilot can keep this moving.

pr-sous-chef: https://github.com/github/gh-aw/actions/runs/26478924995

Generated by 👨‍🍳 PR Sous Chef · gpt54 33.6M ·

@github-actions
@copilot review all comments and address the unresolved review feedback.

Generated by 👨‍🍳 PR Sous Chef · gpt54 7.1M ·

Development

Successfully merging this pull request may close these issues.

[plan] Define shared outcome model and fallback semantics for safe output evaluation

3 participants