refactor(eval): consolidate version-specific datasets into prompts.yaml #1908
Open
jiahaog wants to merge 6 commits into
TAG=agy CONV=434bb731-15b5-4e15-8474-68d4f4d14636
Contributor
Code Review
This pull request parameterizes the evaluation strategies with a version string, allowing solvers to dynamically resolve version-specific catalog paths and initialize schema managers with the correct version. It also consolidates the datasets into a single prompts.yaml file. The review feedback suggests replacing .format() with .replace() when substituting the {version} placeholder in catalog paths to prevent potential formatting errors, and adding a safety check in subagent_tool.py to handle cases where the catalog path is missing from the store.
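To illustrate the two review suggestions, here is a minimal sketch, not the code from this PR. The function names, the `catalog_path` store key, and the second template with a `{name}` token are all hypothetical. It shows why `.replace()` is the safer way to fill in `{version}`, and what a guard for a missing catalog path could look like.

```python
from typing import Mapping, Optional

# Template as it appears in the consolidated dataset.
CATALOG_TEMPLATE = "specification/{version}/catalogs/basic/catalog.json"


def resolve_catalog_path(template: str, version_dir: str) -> str:
    """Substitute only the literal {version} token; other braces are left untouched."""
    return template.replace("{version}", version_dir)


def get_catalog_path(
    store: Mapping[str, str], key: str = "catalog_path"  # hypothetical store key
) -> Optional[str]:
    """Return the catalog path from the store, or None when it is missing or empty."""
    path = store.get(key)
    return path or None


# str.format() treats every brace pair as a field, so any unrelated
# placeholder in the path raises KeyError instead of being passed through.
tricky = "specification/{version}/catalogs/{name}/catalog.json"  # hypothetical template
try:
    tricky.format(version="v1_0")
    format_failed = False
except KeyError:
    format_failed = True

resolved = resolve_catalog_path(tricky, "v1_0")
```

A caller in the style of subagent_tool.py would check the result of `get_catalog_path` for `None` and return a clear error rather than passing a missing path on to the schema manager.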
Collaborator
Author
All review comments have been addressed and the fixes have been pushed. I have resolved the discussion threads.
…mat code TAG=agy CONV=434bb731-15b5-4e15-8474-68d4f4d14636
Refactoring Prompts for Version-Agnostic Evaluation
We consolidated the protocol version-specific prompt datasets (v0_9_prompts.yaml and v1_0_prompts.yaml) into a single, unified dataset (prompts.yaml). This change allows the same evaluation cases to run against both protocol versions (v0.9.1 and v1.0) without maintaining duplicate datasets.

Key Changes
1. Phrasing Prompts in a Protocol-Agnostic Way
Older prompts explicitly instructed the model to output specific protocol message types, such as createSurface or updateComponents. The new prompts describe the required action or UI component at a high level.

Delete Surface Example:
Old: Generate a JSON message containing a deleteSurface for the surface 'dashboard-surface-1'.
Old: Generate a deleteSurface for the surface 'dashboard-surface-1'.
New: Delete the surface 'dashboard-surface-1'.

Create Form Example (Login form):
Old: Generate a 'createSurface' message and a 'updateComponents' message with surfaceId 'main' for a login form.
Old: Generate the layout for the surface 'main' describing a login form.
New: Create a UI on surface 'main' for a login form.

2. Generalizing Validation Targets
The validation targets previously asserted the exact structure of the protocol messages (e.g., that they must contain a list of specific messages). The new targets focus on verifying the layout components and their properties, regardless of how they are wrapped.
Login Form Target:
Old: A valid A2UI payload with surfaceId 'main' containing 'createSurface' and 'updateComponents' messages for a login form. It must feature...
Old: A valid A2UI payload with surfaceId 'main' containing a 'createSurface' message with inline components for a login form. It must feature...
New: A valid A2UI payload with surfaceId 'main' for a login form. It must feature...

Data Update Target:
Old: The payload should contain a createSurface message with surfaceId 'main', followed by an updateDataModel message. The updateDataModel message must set...
Old: The payload should contain a createSurface message with surfaceId 'main', followed by inline data model. The inline data model must set...
New: The payload should target surfaceId 'main' and contain data model updates. The payload must set...

3. Parameterizing Catalog Paths
Prompts that require specific catalogs previously used version-specific directory paths. These paths are now parameterized with {version} so they can be resolved dynamically.

Old: catalog: specification/v0_9/catalogs/basic/catalog.json
New: catalog: specification/{version}/catalogs/basic/catalog.json

The evaluation framework resolves the {version} placeholder at runtime based on the task configuration (e.g., converting 0.9.1 to v0_9_1 or 1.0 to v1_0).

TAG=agy
CONV=434bb731-15b5-4e15-8474-68d4f4d14636
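The version-to-directory conversion from section 3 can be sketched as follows. This is only an illustration under assumptions: `version_to_dir`, `expand_case`, and the shape of the `CASE` entry are hypothetical names, not the framework's actual API. It shows how a single dataset entry could be expanded into one resolved case per protocol version.

```python
from typing import Dict, List

# Hypothetical dataset entry; the prompt and catalog template come from the
# examples above, while the "name" field and dict layout are assumptions.
CASE: Dict[str, str] = {
    "name": "login-form",
    "prompt": "Create a UI on surface 'main' for a login form.",
    "catalog": "specification/{version}/catalogs/basic/catalog.json",
}


def version_to_dir(version: str) -> str:
    """Map a protocol version such as '0.9.1' to a directory name such as 'v0_9_1'."""
    return "v" + version.replace(".", "_")


def expand_case(case: Dict[str, str], versions: List[str]) -> List[Dict[str, str]]:
    """Return one copy of the case per version with the catalog path resolved."""
    expanded = []
    for version in versions:
        resolved = dict(case)  # shallow copy so the template entry stays unchanged
        resolved["version"] = version
        resolved["catalog"] = case["catalog"].replace(
            "{version}", version_to_dir(version)
        )
        expanded.append(resolved)
    return expanded


cases = expand_case(CASE, ["0.9.1", "1.0"])
for c in cases:
    print(c["version"], "->", c["catalog"])
```

Because the template entry is copied rather than mutated, the same prompt and target text are reused for every version and only the catalog path differs.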