build(deps): bump dev tooling + wire gooddata-eval into tox#1676

Open
hkad98 wants to merge 4 commits into gooddata:master from hkad98:jkd/deps

@hkad98 hkad98 commented Jul 1, 2026

Contributor

Bumps development/test tooling to current releases and, along the way, fixes a gap where gooddata-eval was never running in the test harness.

Commits

build(deps) — bump dev tooling

  • ty 0.0.40 → 0.0.55
  • ruff 0.15.15 → 0.15.20
  • pre-commit 4.0.1 → 4.6.0
  • tox 4.32.0 → 4.56.1, tox-uv 1.29.0 → 1.35.2
  • vcrpy 8.0.0 → 8.2.1 (root + sdk/pandas/fdw test groups)
  • moto 5.1.22 → 5.2.2 (re-lock within existing >=5.1.6 spec)

ty 0.0.55's improved inference flags the cast in MetricValueFilter.description as redundant, so it (and the now-unused typing.cast import) is removed. Other runtime deps left capped as-is; urllib3 stays pinned to the OpenAPI generator.
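To illustrate what a redundant cast looks like, here is a minimal sketch. It is not the actual `MetricValueFilter.description` code; the function names and the `list[float]` type are assumptions chosen only to show the pattern a type checker can flag.

```python
from typing import cast


def describe_with_cast(values: list[float]) -> str:
    # `values` is already declared as list[float], so the cast adds no
    # information; a checker with good inference can report it as redundant.
    numbers = cast(list[float], values)
    return ", ".join(str(v) for v in numbers)


def describe(values: list[float]) -> str:
    # Same behavior with the value used directly (hypothetical rewrite).
    # Once the last cast is gone, the `typing.cast` import can go too.
    return ", ".join(str(v) for v in values)
```

Both functions return the same string at runtime, since `cast` only affects static analysis.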

chore(eval) — fix lint violations flagged by ruff

Ruff autofixes across gooddata-eval: remove unused imports, re-sort import blocks, and add the missing # noqa: PLC0415 to the deliberately function-local timezone as _tz imports. No behavior change.
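A minimal sketch of the function-local import pattern described above, assuming a hypothetical helper named `utc_timestamp`; only the `_dt`/`_tz` aliases and the `# noqa: PLC0415` marker come from the PR text.

```python
def utc_timestamp() -> str:
    # The imports are deliberately function-local, so ruff's PLC0415
    # (import outside top level) is suppressed on each line.
    from datetime import datetime as _dt  # noqa: PLC0415
    from datetime import timezone as _tz  # noqa: PLC0415

    # Return the current time as an ISO 8601 string in UTC.
    return _dt.now(_tz.utc).isoformat()
```

Because `# noqa` applies per line, each import statement needs its own marker.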

test(eval) — wire gooddata-eval into tox and fix stale unit tests

gooddata-eval was the only package without a tox.ini, so its tests never ran via make test/CI. The suite had silently drifted from the source across recent refactors, accumulating 6 deterministic failures.

  • Add a tox.ini mirroring the other packages, with extras = llm-judge so the optional openai dependency is installed.
  • Add pytest-json-report to the test group (shared pytest command emits a JSON report).
  • Update the 6 stale tests to match current, intentional source behavior: vis_agentic test-kind inference, provider-prefixed JSON report run keys, required ItemReport.question, and run_agentic_visualization only deleting conversations it created.
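The provider-prefixed run keys can be sketched as follows. This is an illustration only: `run_key`, `index_runs`, and the provider and model names are hypothetical and not taken from the gooddata-eval source.

```python
def run_key(provider: str, model: str) -> str:
    """Build a "Provider/model" key (hypothetical helper)."""
    return f"{provider}/{model}"


def index_runs(runs: list[tuple[str, str, float]]) -> dict[str, float]:
    """Index (provider, model, score) triples by provider-prefixed key."""
    return {run_key(provider, model): score for provider, model, score in runs}


# Two providers exposing the same model name no longer collide.
runs = [("ProviderA", "model-x", 0.9), ("ProviderB", "model-x", 0.7)]
report = index_runs(runs)
```

Keying on the model name alone would have kept only one of the two entries.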

All 216 eval tests now pass via tox.

Validation

make lint, make format, make type-check green across all packages; sdk (488), pandas (343), fdw (43), pipelines (193), eval (216) test suites pass.

Summary by CodeRabbit

  • New Features

    • Added richer test reporting with per-environment JSON output and JSON reports for test runs.
  • Bug Fixes

    • Updated CLI JSON output to use provider-prefixed model keys consistently.
    • Improved conversation handling to preserve caller-supplied conversation IDs.
    • Refined visualization-related item classification.
  • Chores

    • Updated test and workspace development dependencies (including tox, linting, and vcrpy).
    • Aligned import/formatting patterns across evaluation components and tests.

hkad98 added 3 commits June 30, 2026 16:24
Bump development and test tooling to current releases:
- ty 0.0.40 -> 0.0.55
- ruff 0.15.15 -> 0.15.20
- pre-commit 4.0.1 -> 4.6.0
- tox 4.32.0 -> 4.56.1, tox-uv 1.29.0 -> 1.35.2
- vcrpy 8.0.0 -> 8.2.1 (root + sdk/pandas/fdw test groups)
- moto 5.1.22 -> 5.2.2 (re-lock within existing >=5.1.6 spec)

ty 0.0.55's improved inference flags the cast in MetricValueFilter.description
as redundant, so drop it (and the now-unused typing.cast import). All other
runtime deps left capped as-is; urllib3 stays pinned to the OpenAPI generator.

Validated: make lint, format, type-check green; sdk/pandas/fdw/pipelines tests pass.

jira: trivial
risk: low

Apply ruff autofixes surfaced by `make lint-fix` across gooddata-eval:
remove unused imports and re-sort import blocks, and add the missing
`# noqa: PLC0415` to the deliberately function-local `timezone as _tz`
imports (matching the `_dt` line directly above them).

No behavior change. These violations went unnoticed because gooddata-eval
is not yet wired into the test/lint harness (see follow-up commit).

jira: trivial
risk: nonprod

gooddata-eval was the only package without a tox.ini, so its tests never
ran via `make test`/CI. As a result the suite silently drifted from the
source as the package was refactored (notably "move all agentic evaluation
logic into gooddata_eval SDK"), accumulating 6 deterministic failures.

Add a tox.ini mirroring the other packages, with `extras = llm-judge` so
the optional `openai` dependency is installed (9 tests import it). Add
pytest-json-report to the test group since the shared pytest command emits
a JSON report.

Update the 6 stale tests to match current, intentional source behavior:
- langfuse test_kind inference now returns "vis_agentic" for visualization
  expected_output
- JSON report run keys are provider-prefixed ("Provider/model") to stay
  collision-free across providers
- ItemReport now requires a `question` field
- run_agentic_visualization only deletes conversations it created; a
  caller-supplied initial_conversation_id is left intact
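
The conversation-ownership rule in the last bullet can be sketched as below. This is an assumption-laden illustration, not the real `run_agentic_visualization`: `FakeClient`, its methods, and `run_with_conversation` are hypothetical stand-ins; only the `initial_conversation_id` parameter name comes from the commit message.

```python
from typing import Optional


class FakeClient:
    """Hypothetical in-memory client used only for this sketch."""

    def __init__(self) -> None:
        self.deleted: list[str] = []
        self._counter = 0

    def create_conversation(self) -> str:
        self._counter += 1
        return f"conv-{self._counter}"

    def delete_conversation(self, conversation_id: str) -> None:
        self.deleted.append(conversation_id)


def run_with_conversation(
    client: FakeClient, initial_conversation_id: Optional[str] = None
) -> str:
    # Remember whether this call created the conversation itself.
    created_here = initial_conversation_id is None
    conversation_id = (
        client.create_conversation() if created_here else initial_conversation_id
    )
    try:
        return conversation_id  # stand-in for the real evaluation work
    finally:
        # Only clean up what this call created; caller-supplied IDs stay intact.
        if created_here:
            client.delete_conversation(conversation_id)
```

Calling `run_with_conversation(client, "caller-1")` leaves `client.deleted` empty, while calling it without an ID creates and then deletes `conv-1`.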

All 216 eval tests pass via tox.

jira: trivial
risk: nonprod
@hkad98 hkad98 requested review from lupko and pcerny as code owners July 1, 2026 05:59

codecov Bot commented Jul 1, 2026


Codecov Report

❌ Patch coverage is 46.15385% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.74%. Comparing base (f9639cb) to head (10ea9f2).
⚠️ Report is 53 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...eval/src/gooddata_eval/core/agentic/alert_skill.py | 33.33% | 2 Missing ⚠️ |
| ...val/src/gooddata_eval/core/agentic/conversation.py | 0.00% | 2 Missing ⚠️ |
| ...src/gooddata_eval/core/agentic/general_question.py | 0.00% | 2 Missing ⚠️ |
| ...a-eval/src/gooddata_eval/core/agentic/guardrail.py | 0.00% | 2 Missing ⚠️ |
| ...val/src/gooddata_eval/core/agentic/metric_skill.py | 33.33% | 2 Missing ⚠️ |
| ...eval/src/gooddata_eval/core/agentic/search_tool.py | 0.00% | 2 Missing ⚠️ |
| ...al/src/gooddata_eval/core/agentic/visualization.py | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1676      +/-   ##
==========================================
- Coverage   79.10%   77.74%   -1.36%     
==========================================
  Files         231      271      +40     
  Lines       15718    18570    +2852     
==========================================
+ Hits        12433    14437    +2004     
- Misses       3285     4133     +848     
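
The percentages in the report follow from the hit and line counts in the diff above; a short check of the arithmetic is sketched here. The patch total of 26 lines is an inference from "46.15385% with 14 lines missing", not a figure the report states.

```python
# Figures copied from the coverage diff above.
base_hits, base_lines = 12433, 15718
head_hits, head_lines = 14437, 18570

base_cov = round(100 * base_hits / base_lines, 2)
head_cov = round(100 * head_hits / head_lines, 2)
delta = round(head_cov - base_cov, 2)

# Inferred: 14 missing lines at 46.15385% implies 26 changed lines, 12 covered.
patch_lines, patch_missing = 26, 14
patch_cov = round(100 * (patch_lines - patch_missing) / patch_lines, 5)

print(base_cov, head_cov, delta, patch_cov)
```

The drop in project coverage reflects the 40 newly measured files: hits rise by 2004 while lines rise by 2852.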


coderabbitai Bot commented Jul 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f6572f35-153e-47ea-9f10-54e6e893f6d1

📥 Commits

Reviewing files that changed from the base of the PR and between ecd1f66 and 10ea9f2.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • pyproject.toml

📝 Walkthrough

This PR adds tox.ini and pytest-json-report to gooddata-eval, splits combined datetime imports across agentic evaluation modules, refactors runner imports and a TypedDict definition, updates tests for narrowed imports and changed CLI/visualization expectations, bumps dependency versions, and removes an unused cast in filter.py.

Changes

gooddata-eval test tooling and CI reporting

| Layer / File(s) | Summary |
|---|---|
| Tox configuration and JSON test reporting: packages/gooddata-eval/tox.ini, packages/gooddata-eval/pyproject.toml | Adds a tox test matrix for Python 3.10-3.14 with wheel packaging, llm-judge extra, and pytest commands producing coverage XML and JSON reports; adds pytest-json-report dependency. |

Agentic eval import cleanups and runner refactor

| Layer / File(s) | Summary |
|---|---|
| Runner import and TypedDict refactor: .../cli/agentic_runner.py, .../cli/main.py | Removes unused HttpxLangfuseClient import, converts _LfKw to a class-based TypedDict, and repositions an import in main.py. |
| Datetime import splits across agentic skill modules: .../core/agentic/alert_skill.py, .../conversation.py, .../general_question.py, .../guardrail.py, .../metric_skill.py, .../search_tool.py, .../visualization.py | Splits combined datetime/timezone imports into separate statements and reorders minor imports. |
| Test import narrowing for agentic skills: .../tests/test_agentic_alert_skill.py, .../tests/test_agentic_conversation.py, .../tests/test_agentic_general_question.py, .../tests/test_agentic_guardrail.py, .../tests/test_agentic_search_tool.py | Narrows test imports to match reduced module exports. |
| Visualization conversation-deletion test updates: .../tests/test_agentic_visualization.py | Updates assertions so caller-supplied conversations are preserved while internally created ones are deleted. |
| CLI test updates for provider-prefixed keys and thread-safety: .../tests/test_cli.py, .../tests/test_langfuse_source.py | Expects provider-prefixed model keys, streamlines thread-safety test imports and ItemReport construction, and updates expected test_kind to vis_agentic. |

Dependency version bumps and filter.py cast removal

| Layer / File(s) | Summary |
|---|---|
| Dependency version bumps: packages/gooddata-fdw/pyproject.toml, packages/gooddata-pandas/pyproject.toml, packages/gooddata-sdk/pyproject.toml, pyproject.toml | Bumps vcrpy, pre-commit, ruff, ty, tox, and tox-uv versions. |
| Filter description cast removal: packages/gooddata-sdk/src/gooddata_sdk/compute/model/filter.py | Removes cast import and replaces cast(...) with a direct values assignment. |

Estimated code review effort: 2 (Simple) | ~12 minutes

Poem

A rabbit hopped through imports neat,
Split datetimes clean, no more repeat,
Tox hums along with JSON in tow,
vcrpy bumped, versions aglow,
Tests now dance to provider's beat! 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main changes: development dependency bumps and adding gooddata-eval to the tox test harness. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

scripts/docs/python_ref_builder.py imports jinja2, but jinja2 was never
declared in the workspace test group -- it was only present transitively.
Bumping the dev tooling pruned that transitive provider, breaking the
docs-scripts-tests CI job (uv sync --group test --locked; make
test-docs-scripts) with "ModuleNotFoundError: No module named 'jinja2'".

Add jinja2~=3.1 (matching scripts/script-requirements.txt) to the test
group alongside the other scripts/docs dependencies so the synced env
provides it.

jira: trivial
risk: nonprod
@hkad98 hkad98 enabled auto-merge July 1, 2026 07:54
2 participants