test(telemetry/e2e): make TestTelemetryE2E deterministic + deflake retry tests under merge-queue load by vikrantpuppala · Pull Request #812 · databricks/databricks-sql-python

vikrantpuppala · 2026-05-27T13:02:56Z

Summary

Two flake fixes that surfaced from chasing the same incident — both are about asserting on the connector's intent rather than its async side effects, and about isolating the warehouse from concurrent CI load.

1. `tests/e2e/test_telemetry_e2e.py::TestTelemetryE2E` (original scope)

Chronically flaky (see git log on the file — multiple "Fix CI test failure" commits over the last few months). Root cause: the tests asserted "telemetry round-tripped to the server" by intercepting TelemetryClient._telemetry_request_callback and counting completed futures.

def callback_wrapper(self_client, future, sent_count):
    captured_futures.append(future)         # ← appended HERE
    original_callback(self_client, future, sent_count)

The callback only fires after the HTTP request completes, so captured_futures.append(future) happens later than the actual submission. Two failure modes:

- Append-lag race. If the test reached wait(captured_futures, ...) before all callbacks had run, captured_futures was still empty (or partial). The wait saw nothing (or too little) to wait on and returned early with an empty or short done set. Result: assert 0 == 2 (or assert 1 == 2).
- Executor-shutdown race. On the last connection's close, TelemetryClientFactory.close() shuts the shared executor down with wait=False (intentional for production connection-close latency). Not-yet-started submissions are dropped silently. So even with correct draining we'd be measuring "did the server receive the request in time" rather than "did the connector dispatch it."
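The append-lag race can be reproduced without the connector at all. The following is a minimal sketch using only the standard library; `fake_http_request` and the `release` event are hypothetical stand-ins for the telemetry POST, not names from this repository. It shows that `wait()` on a still-empty list returns immediately even though a request is in flight.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

captured_futures = []          # filled only by the completion callback
release = threading.Event()    # lets the sketch hold the fake request open


def fake_http_request():
    # Hypothetical stand-in for the telemetry POST; blocks until released.
    release.wait(timeout=5)
    return "ok"


with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(fake_http_request)
    # Like the old callback wrapper: record the future only once it completes.
    future.add_done_callback(captured_futures.append)

    # The request is still in flight, so nothing has been recorded yet.
    # wait() on an empty collection returns immediately, ignoring the timeout.
    done, not_done = wait(captured_futures, timeout=10)
    early_done_count = len(done)

    release.set()  # let the request finish; the callback runs afterwards

# Leaving the with-block joins the worker, so the callback has fired by here.
late_captured_count = len(captured_futures)
print(early_done_count, late_captured_count)  # prints: 0 1
```

The future was submitted and did complete, but the assertion point saw zero completed futures, which is the `assert 0 == 2` shape described above.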

Fix: intercept at _send_telemetry (the connector's submission point) instead of _telemetry_request_callback. The submission is synchronous on the same thread that runs _export_event → _flush(). By the time verify_events() runs, every batch the connector intended to submit is in captured_submissions. No async wait, no sleep, no race against executor shutdown.

The renamed variable (captured_futures → captured_submissions) makes the contract clearer: each item is a batch the connector submitted, regardless of what the executor did with it afterward.
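To illustrate why capturing at the submission point is deterministic, here is a small sketch built on a toy client. `ToyTelemetryClient`, its `batch_size` parameter, and the `_send_telemetry(self, events)` signature are assumptions made for illustration; the real connector's signatures may differ, and the real test may still call through to the original method.

```python
from unittest.mock import patch


class ToyTelemetryClient:
    """Hypothetical stand-in: queue events, flush per batch size, submit."""

    def __init__(self, batch_size):
        self._batch_size = batch_size
        self._events = []

    def _export_event(self, event):
        self._events.append(event)
        if len(self._events) >= self._batch_size:
            self._flush()

    def _flush(self):
        batch, self._events = self._events, []
        if batch:
            self._send_telemetry(batch)

    def _send_telemetry(self, events):
        # The real method would hand the batch to an executor; never reached here.
        raise RuntimeError("network disabled in this sketch")


captured_submissions = []


def record_submission(self_client, events):
    # Runs synchronously on the thread that called _export_event.
    captured_submissions.append(list(events))


with patch.object(ToyTelemetryClient, "_send_telemetry", record_submission):
    client = ToyTelemetryClient(batch_size=2)
    for name in ("open", "execute", "fetch", "close"):
        client._export_event(name)

print(captured_submissions)  # prints: [['open', 'execute'], ['fetch', 'close']]
```

Because the recording happens inside the call chain of `_export_event`, the list is complete as soon as the last event has been exported; there is nothing left to wait for.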

Verification: 5 consecutive runs on dogfood, all green, ~20s each.

2. `tests/e2e/test_driver.py::TestPySQLCoreSuite::test_oserror_retries` (added after the first merge_group run)

The first merge_group attempt of this PR hit a different flake: test_oserror_retries failed with assert mock_validate_conn.call_count == 6 because unexpected /telemetry-ext requests were counted alongside the intended session-endpoint retries.

Two interacting causes:

- _isolated_from_telemetry() only patches TelemetryClientFactory.initialize_telemetry_client. That covers new connections created during the test, but not real TelemetryClient instances that slip in via stale module state, paths that bypass initialize_telemetry_client, or clients created before the context entered.
- The merge queue ran multiple entries in parallel against the same warehouse. The previous concurrency group was keyed on github.ref (per-PR in queue: gh-readonly-queue/main/pr-N-…), so the queue entries for PR #810 and PR #812 ran concurrently. The warehouse was fine for individual entries but couldn't handle two simultaneous loads: telemetry/retry paths started intermittently failing on /telemetry-ext, and those failures got counted by mock_validate_conn.

Fix 2a (retry_test_mixins.py): three-layer telemetry isolation in _isolated_from_telemetry(). Layer 1 (factory swap) was already there. Added layer 2 (_send_telemetry no-op) and layer 3 (_export_event no-op) as defensive backstops at the class level. test_oserror_retries now passes 5/5 locally.
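The layering in Fix 2a can be sketched as a context manager that stacks three patches. Everything named `Toy*` below is a hypothetical stand-in, and the helper is written without the leading underscore to mark it as an illustration rather than the helper in retry_test_mixins.py. The point it shows is that class-level patches also neutralise instances created before the context was entered.

```python
from contextlib import ExitStack, contextmanager
from unittest.mock import patch


class ToyTelemetryClient:
    sent = []  # class-level record of everything that reached the "network"

    def _export_event(self, event):
        self._send_telemetry([event])

    def _send_telemetry(self, events):
        ToyTelemetryClient.sent.append(events)


class ToyNoopClient:
    def _export_event(self, event):
        pass


class ToyFactory:
    @staticmethod
    def initialize_telemetry_client():
        return ToyTelemetryClient()


def _noop(self, *args, **kwargs):
    return None


@contextmanager
def isolated_from_telemetry():
    with ExitStack() as stack:
        # Layer 1: new connections get a no-op client from the factory.
        stack.enter_context(patch.object(
            ToyFactory, "initialize_telemetry_client",
            staticmethod(lambda: ToyNoopClient())))
        # Layers 2 and 3: class-level no-ops also cover pre-existing instances.
        stack.enter_context(patch.object(ToyTelemetryClient, "_send_telemetry", _noop))
        stack.enter_context(patch.object(ToyTelemetryClient, "_export_event", _noop))
        yield


stale_client = ToyTelemetryClient()  # created before the context is entered

with isolated_from_telemetry():
    new_client = ToyFactory.initialize_telemetry_client()
    new_client._export_event("from-new-connection")
    stale_client._export_event("from-stale-client")
    sent_inside = list(ToyTelemetryClient.sent)

stale_client._export_event("after-exit")  # patches are undone on exit
sent_after = list(ToyTelemetryClient.sent)
```

Inside the context nothing is sent by either client; after exit the stale client sends again, which confirms the patches are scoped to the test.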

Fix 2b (code-coverage.yml): serialise merge_group runs under a single fixed concurrency group (e2e-mq-serial). Only one queue entry runs the suite at a time. PR-event runs keep per-ref + cancel-in-progress for fast author feedback. Trade-off: queue throughput drops to one ~17-min run at a time. Acceptable for this repo's PR volume.

Why this is the right boundary to test

The connector's responsibility is: given a telemetry event, queue it, flush per batch policy, submit. The tests should verify that. What happens after submission — HTTP latency, server availability, executor lifecycle — isn't behavior of the code under test and shouldn't be coupled into these assertions.

What's NOT in this PR

- TelemetryClientFactory.close() still uses shutdown(wait=False). That's intentional, kept for production connection-close latency.
- test_retry_max_count_not_exceeded still fails on this branch, but it also fails on baseline main with 'SimpleHttpResponse' object has no attribute 'version_string'. Pre-existing unrelated issue, out of scope.

Test plan

- CI passes on this PR.
- PR can be queued and merges cleanly even when there's another PR in the queue (queue serialisation kicks in once this lands).
- Telemetry e2e tests and retry tests no longer flake on merge_group runs over the next ~week.

This pull request and its description were written by Isaac.

The previous tests asserted "telemetry round-tripped to the server" by intercepting TelemetryClient._telemetry_request_callback and counting completed futures. That recording lags the actual work: the callback fires asynchronously after the HTTP request completes, and on the *last* connection close TelemetryClientFactory.close() shuts the shared executor down with wait=False (intentional, for connection-close latency in production).

Two consequences:

1. A `wait(captured_futures, timeout=10)` call right after `with conn:` can return before any callbacks have fired, so the wait is "waiting on" an empty list, returns immediately, and the assertion `assert len(done) == expected_count` fails non-deterministically with `assert 0 == 2` or `assert 1 == 2`.
2. The shared-executor shutdown(wait=False) can drop in-flight submissions that haven't started running yet, so even if we drained correctly we'd be testing whether the server happened to receive the request in time, not whether the connector correctly dispatched it.

Switch interception from `_telemetry_request_callback` to `_send_telemetry`. That captures the connector's *intent to submit* synchronously, which is what we actually want to test: the connector either decided to send a batch or it didn't, regardless of what happens to the future afterward. No sleep needed, no timeout-based wait needed, no race against the executor shutdown.

5 consecutive local runs pass deterministically in ~20s each (down from ~17 min when the flake hit).

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

samikshya-db

Thanks for this.


Two compounding fixes that surfaced on PR #812's first merge_group run, where test_oserror_retries failed with `assert mock_validate_conn.call_count == 6`: unexpected `/telemetry-ext` requests had been counted alongside the intended session-endpoint retries.

1. tests/e2e/common/retry_test_mixins.py: strengthen `_isolated_from_telemetry()` with two additional defensive patches:
   - TelemetryClient._send_telemetry → no-op
   - TelemetryClient._export_event → no-op

   The existing factory swap installs NoopTelemetryClient for new connections, but doesn't cover real TelemetryClient instances that slip in via other paths (stale module-global, code that bypasses initialize_telemetry_client, anything created before the context entered). Patching at the class level catches all of them.

2. .github/workflows/code-coverage.yml: serialise merge_group runs. Previous concurrency group keyed on github.ref, which is per-PR in the queue (gh-readonly-queue/main/pr-N-…). That allowed multiple queue entries to hammer the same warehouse in parallel, stressing telemetry / retry paths that single-PR runs don't exercise. Group merge_group + workflow_dispatch under a single fixed name (e2e-mq-serial) so they run one at a time. PR-event runs keep per-ref grouping + cancel-in-progress for fast author feedback. Trade-off: queue throughput drops to one ~17-min run at a time.

Folded into PR #812 so the telemetry-test rewrite and the retry-test deflake ship as a single unit.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

The retry tests in PySQLRetryTestsMixin patch urllib3 globally (HTTPSConnectionPool._get_conn / _validate_conn) and assert the connection's own request was retried exactly N times. With telemetry on (the default), the same connection makes two kinds of side-channel HTTP calls through the same mocked pool that inflate the count and flake the assertion (e.g. `assert call_count == 6` -> AssertionError under merge-queue load):

1. A synchronous feature-flag GET during Connection.__init__ (is_telemetry_enabled -> FeatureFlagsContext._refresh_flags), and
2. Async telemetry POSTs to /telemetry-ext from the background executor and the periodic flush thread.

The prior mitigation (_isolated_from_telemetry, #812) only attacked source #2 and imperfectly, so the tests kept flaking.

Fix (test-only):

- Add enable_telemetry=False to the shared _retry_policy dict. It short-circuits is_telemetry_enabled before the feature-flag fetch (kills #1) and installs a NoopTelemetryClient (kills #2), removing both sources at their origin without changing the retry behavior under test. Every test in the mixin merges _retry_policy, so this one line covers them all.
- Harden _isolated_from_telemetry with a backstop that no-ops the actual HTTP boundary (_send_with_unified_client), closing the already-submitted-future gap for any other caller of the helper.
- Add a version_string forward-compat shim to the SimpleHttpResponse mock so the suite also runs on urllib3 >= 2.3 (harmless on the 2.2.x pinned in CI).

Verified against dogfood: test_oserror_retries and test_retry_max_count_not_exceeded pass 3/3 deterministically for both Thrift and SEA; the full PySQLRetryTestsMixin (36 tests) passes.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>


vikrantpuppala temporarily deployed to azure-prod May 27, 2026 13:03 — with GitHub Actions Inactive

vikrantpuppala requested a review from samikshya-db May 27, 2026 13:04

samikshya-db approved these changes May 27, 2026


vikrantpuppala enabled auto-merge May 27, 2026 13:33

vikrantpuppala disabled auto-merge May 27, 2026 13:38

vikrantpuppala enabled auto-merge May 27, 2026 13:41

Merge remote-tracking branch 'origin/main' into vikrant/telemetry-tes…

f65d3d8

…t-fix-flakiness

vikrantpuppala temporarily deployed to azure-prod May 27, 2026 13:44 — with GitHub Actions Inactive

vikrantpuppala added this pull request to the merge queue May 27, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 27, 2026

vikrantpuppala added this pull request to the merge queue May 27, 2026

vikrantpuppala removed this pull request from the merge queue due to a manual request May 27, 2026

vikrantpuppala mentioned this pull request May 27, 2026

ci/test: deflake retry tests under merge-queue load #813

Closed

3 tasks

vikrantpuppala temporarily deployed to azure-prod May 27, 2026 17:02 — with GitHub Actions Inactive

vikrantpuppala changed the title ~~test(telemetry/e2e): make TestTelemetryE2E deterministic by capturing submissions, not callbacks~~ test(telemetry/e2e): make TestTelemetryE2E deterministic + deflake retry tests under merge-queue load May 27, 2026

vikrantpuppala enabled auto-merge May 27, 2026 17:03

vikrantpuppala added this pull request to the merge queue May 27, 2026

Merged via the queue into main with commit 0309e7c May 27, 2026
46 checks passed

vikrantpuppala mentioned this pull request Jun 3, 2026

test(retry/e2e): deflake retry-count tests by disabling telemetry #822

Merged

vikrantpuppala merged 3 commits into main from vikrant/telemetry-test-fix-flakiness
