refactor(eval): consolidate version-specific datasets into prompts.yaml#1908

Open
jiahaog wants to merge 6 commits into
a2ui-project:mainfrom
jiahaog:refactor-eval-dataset-design

@jiahaog jiahaog commented Jul 2, 2026

Collaborator

Refactoring Prompts for Version-Agnostic Evaluation

We consolidated the two per-protocol-version prompt datasets (v0_9_prompts.yaml and v1_0_prompts.yaml) into a single dataset (prompts.yaml). The same evaluation cases now run against both protocol versions (v0.9.1 and v1.0), so there are no duplicate datasets to maintain.

Key Changes

1. Phrasing Prompts in a Protocol-Agnostic Way

Older prompts explicitly instructed the model to output specific protocol message types, such as createSurface or updateComponents. The new prompts describe the required action or UI component at a high level.

  • Delete Surface Example:

    • Before (v0.9): Generate a JSON message containing a deleteSurface for the surface 'dashboard-surface-1'.
    • Before (v1.0): Generate a deleteSurface for the surface 'dashboard-surface-1'.
    • After: Delete the surface 'dashboard-surface-1'.
  • Create Form Example (Login form):

    • Before (v0.9): Generate a 'createSurface' message and a 'updateComponents' message with surfaceId 'main' for a login form.
    • Before (v1.0): Generate the layout for the surface 'main' describing a login form.
    • After: Create a UI on surface 'main' for a login form.

2. Generalizing Validation Targets

The validation targets previously asserted the exact structure of the protocol messages (e.g., that they must contain a list of specific messages). The new targets focus on verifying the layout components and their properties, regardless of how they are wrapped.

  • Login Form Target:

    • Before (v0.9): A valid A2UI payload with surfaceId 'main' containing 'createSurface' and 'updateComponents' messages for a login form. It must feature...
    • Before (v1.0): A valid A2UI payload with surfaceId 'main' containing a 'createSurface' message with inline components for a login form. It must feature...
    • After: A valid A2UI payload with surfaceId 'main' for a login form. It must feature...
  • Data Update Target:

    • Before (v0.9): The payload should contain a createSurface message with surfaceId 'main', followed by an updateDataModel message. The updateDataModel message must set...
    • Before (v1.0): The payload should contain a createSurface message with surfaceId 'main', followed by inline data model. The inline data model must set...
    • After: The payload should target surfaceId 'main' and contain data model updates. The payload must set...
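With both the prompt and the target phrased without reference to message types, a single case can be paired with each protocol version. The following is a minimal sketch of that pairing, not the PR's actual loader: `EvalCase`, `expand`, and the field names are hypothetical, and the prompt and target strings are copied from the login form example above (the target is left truncated, as in the description).

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical shape of one version-agnostic dataset entry."""
    name: str
    prompt: str
    target: str


# The two protocol versions named in the PR description.
PROTOCOL_VERSIONS = ("0.9.1", "1.0")

CASES = [
    EvalCase(
        name="login_form",
        prompt="Create a UI on surface 'main' for a login form.",
        target="A valid A2UI payload with surfaceId 'main' for a login form. It must feature...",
    ),
]


def expand(cases, versions):
    """Pair every case with every protocol version, case-major order."""
    return list(product(cases, versions))


for case, version in expand(CASES, PROTOCOL_VERSIONS):
    print(f"{case.name} @ {version}: {case.prompt}")
```

Because nothing in the case mentions createSurface or updateComponents, the same entry is reused unchanged for each version; only the version passed alongside it differs.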

3. Parameterizing Catalog Paths

Prompts that require a specific catalog previously hard-coded a version-specific directory path. These paths now contain a {version} placeholder that is resolved dynamically.

  • Catalog Path Resolution:
    • Before (v0.9): catalog: specification/v0_9/catalogs/basic/catalog.json
    • After: catalog: specification/{version}/catalogs/basic/catalog.json

The evaluation framework resolves the {version} placeholder at runtime based on the task configuration (e.g., converting 0.9.1 to v0_9_1 or 1.0 to v1_0).
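The resolution step can be sketched roughly as follows. This is an illustration under stated assumptions rather than the framework's code: the helper names `version_to_dir` and `resolve_catalog_path` are hypothetical, and the mapping rule (prefix `v`, dots replaced by underscores) is inferred from the two conversions given above.

```python
def version_to_dir(version: str) -> str:
    """Map a dotted protocol version such as '0.9.1' to a directory name such as 'v0_9_1'."""
    return "v" + version.replace(".", "_")


def resolve_catalog_path(template: str, version: str) -> str:
    """Substitute the literal '{version}' token in a catalog path template."""
    return template.replace("{version}", version_to_dir(version))


template = "specification/{version}/catalogs/basic/catalog.json"
for version in ("0.9.1", "1.0"):
    print(resolve_catalog_path(template, version))
```

For the two versions in this PR, the sketch yields `specification/v0_9_1/catalogs/basic/catalog.json` and `specification/v1_0/catalogs/basic/catalog.json`.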

TAG=agy
CONV=434bb731-15b5-4e15-8474-68d4f4d14636

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request parameterizes the evaluation strategies with a version string, allowing solvers to dynamically resolve version-specific catalog paths and initialize schema managers with the correct version. It also consolidates the datasets into a single prompts.yaml file. The review feedback suggests replacing .format() with .replace() when substituting the {version} placeholder in catalog paths to prevent potential formatting errors, and adding a safety check in subagent_tool.py to handle cases where the catalog path is missing from the store.
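To illustrate the two suggestions in the summary above, here is a small sketch. The function names and the extra `{name}` placeholder are hypothetical and do not come from the reviewed files. It shows that `str.format()` raises `KeyError` when a path contains any brace-delimited field other than `version`, whereas `str.replace()` only touches the literal `{version}` token, and it shows one possible guard for a missing catalog path.

```python
from typing import Optional


def resolve_with_format(template: str, version_dir: str) -> str:
    """Substitute via str.format(); every {field} in the template must be supplied."""
    return template.format(version=version_dir)


def resolve_with_replace(template: Optional[str], version_dir: str) -> Optional[str]:
    """Substitute only the literal '{version}' token; return None if no path is set."""
    if not template:
        # Safety check: the catalog path may be missing from the store.
        return None
    return template.replace("{version}", version_dir)


# Hypothetical path with a second brace-delimited field.
tricky = "specification/{version}/catalogs/{name}/catalog.json"

try:
    resolve_with_format(tricky, "v1_0")
    format_error = None
except KeyError as exc:
    format_error = exc  # format() has no value for the 'name' field

print("format() failed with:", repr(format_error))
print("replace() result:", resolve_with_replace(tricky, "v1_0"))
print("missing path:", resolve_with_replace(None, "v1_0"))
```

Returning `None` for a missing path is only one option; raising a descriptive error may suit a caller that cannot proceed without a catalog.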

Comment thread eval/a2ui_eval/strategies/direct.py Outdated
Comment thread eval/a2ui_eval/strategies/express.py Outdated
Comment thread eval/a2ui_eval/strategies/express.py Outdated
Comment thread eval/a2ui_eval/strategies/subagent_tool.py Outdated
jiahaog added 2 commits July 2, 2026 04:42

jiahaog commented Jul 2, 2026

Collaborator Author

All review comments have been addressed and the fixes have been pushed. I have resolved the discussion threads.

jiahaog added 3 commits July 2, 2026 09:30
…mat code
