test(telemetry/e2e): make TestTelemetryE2E deterministic + deflake retry tests under merge-queue load#812
Merged
Conversation
The previous tests asserted "telemetry round-tripped to the server"
by intercepting TelemetryClient._telemetry_request_callback and
counting completed futures. That recording lags the actual work —
the callback fires asynchronously after the HTTP request completes,
and on the *last* connection close TelemetryClientFactory.close()
shuts the shared executor down with wait=False (intentional, for
connection-close latency in production). Two consequences:
1. A `wait(captured_futures, timeout=10)` call right after `with
conn:` can return before any callbacks have fired — so the
wait is "waiting on" an empty list, returns immediately, and
the assertion `assert len(done) == expected_count` fails non-
deterministically with `assert 0 == 2` or `assert 1 == 2`.
2. The shared-executor shutdown(wait=False) can drop in-flight
submissions that haven't started running yet, so even if we
drained correctly we'd be testing whether the server happened
to receive the request in time, not whether the connector
correctly dispatched it.
Switch interception from `_telemetry_request_callback` to
`_send_telemetry`. That captures the connector's *intent to submit*
synchronously, which is what we actually want to test — the
connector either decided to send a batch or it didn't, regardless
of what happens to the future afterward. No sleep needed, no
timeout-based wait needed, no race against the executor shutdown.
5 consecutive local runs pass deterministically in ~20s each (down
from ~17 min when the flake hit).
Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
3 tasks
Two compounding fixes that surfaced on PR #812's first merge_group run, where test_oserror_retries failed with `assert mock_validate_conn.call_count == 6` — unexpected `/telemetry-ext` requests had been counted alongside the intended session-endpoint retries. 1. tests/e2e/common/retry_test_mixins.py — strengthen `_isolated_from_telemetry()` with two additional defensive patches: - TelemetryClient._send_telemetry → no-op - TelemetryClient._export_event → no-op The existing factory swap installs NoopTelemetryClient for new connections, but doesn't cover real TelemetryClient instances that slip in via other paths (stale module-global, code that bypasses initialize_telemetry_client, anything created before the context entered). Patching at the class level catches all of them. 2. .github/workflows/code-coverage.yml — serialise merge_group runs. Previous concurrency group keyed on github.ref, which is per-PR in the queue (gh-readonly-queue/main/pr-N-…). That allowed multiple queue entries to hammer the same warehouse in parallel, stressing telemetry / retry paths that single-PR runs don't exercise. Group merge_group + workflow_dispatch under a single fixed name (e2e-mq-serial) so they run one at a time. PR-event runs keep per-ref grouping + cancel-in-progress for fast author feedback. Trade-off: queue throughput drops to one ~17-min run at a time. Folded into PR #812 so the telemetry-test rewrite and the retry-test deflake ship as a single unit. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two flake fixes that surfaced from chasing the same incident — both are about asserting on the connector's intent rather than its async side effects, and about isolating the warehouse from concurrent CI load.
1.
tests/e2e/test_telemetry_e2e.py::TestTelemetryE2E(original scope)Chronically flaky (see
git logon the file — multiple "Fix CI test failure" commits over the last few months). Root cause: the tests asserted "telemetry round-tripped to the server" by interceptingTelemetryClient._telemetry_request_callbackand counting completed futures.The callback only fires after the HTTP request completes, so
captured_futures.append(future)happens later than the actual submission. Two failure modes:Append-lag race. If the test reached
wait(captured_futures, ...)before all callbacks had run,captured_futureswas still empty (or partial). Thewaitsaw nothing to wait on and returned instantly withdone={}. Result:assert 0 == 2.Executor-shutdown race. On the last connection's close,
TelemetryClientFactory.close()shuts the shared executor down withwait=False(intentional for production connection-close latency). Not-yet-started submissions are dropped silently. So even with correct draining we'd be measuring "did the server receive the request in time" rather than "did the connector dispatch it."Fix: intercept at
_send_telemetry(the connector's submission point) instead of_telemetry_request_callback. The submission is synchronous on the same thread that runs_export_event→_flush(). By the timeverify_events()runs, every batch the connector intended to submit is incaptured_submissions. No async wait, no sleep, no race against executor shutdown.The renamed variable (
captured_futures→captured_submissions) makes the contract clearer: each item is a batch the connector submitted, regardless of what the executor did with it afterward.Verification: 5 consecutive runs on dogfood, all green, ~20s each.
2.
tests/e2e/test_driver.py::TestPySQLCoreSuite::test_oserror_retries(added after the first merge_group run)The first merge_group attempt of this PR hit a different flake:
test_oserror_retriesfailed withassert mock_validate_conn.call_count == 6because unexpected/telemetry-extrequests were counted alongside the intended session-endpoint retries.Two interacting causes:
_isolated_from_telemetry()only patchesTelemetryClientFactory.initialize_telemetry_client. That covers new connections created during the test, but not realTelemetryClientinstances that slip in via stale module state, paths that bypassinitialize_telemetry_client, or clients created before the context entered.The merge queue ran multiple entries in parallel against the same warehouse. The previous concurrency group was keyed on
github.ref(per-PR in queue:gh-readonly-queue/main/pr-N-…), so PR ci(code-coverage): move push:main trigger to merge_group #810's and PR test(telemetry/e2e): make TestTelemetryE2E deterministic + deflake retry tests under merge-queue load #812's queue entries ran concurrently. The warehouse was fine for individual entries but couldn't handle two simultaneous loads — telemetry/retry paths started intermittently failing on/telemetry-ext, and those failures got counted bymock_validate_conn.Fix 2a (
retry_test_mixins.py): three-layer telemetry isolation in_isolated_from_telemetry(). Layer 1 (factory swap) was already there. Added layer 2 (_send_telemetryno-op) and layer 3 (_export_eventno-op) as defensive backstops at the class level.test_oserror_retriesnow passes 5/5 locally.Fix 2b (
code-coverage.yml): serialisemerge_groupruns under a single fixed concurrency group (e2e-mq-serial). Only one queue entry runs the suite at a time. PR-event runs keep per-ref + cancel-in-progress for fast author feedback. Trade-off: queue throughput drops to one ~17-min run at a time. Acceptable for this repo's PR volume.Why this is the right boundary to test
The connector's responsibility is: given a telemetry event, queue it, flush per batch policy, submit. The tests should verify that. What happens after submission — HTTP latency, server availability, executor lifecycle — isn't behavior of the code under test and shouldn't be coupled into these assertions.
What's NOT in this PR
TelemetryClientFactory.close()still usesshutdown(wait=False). That's intentional behaviour by design, kept for production connection-close latency.test_retry_max_count_not_exceededstill fails on this branch, but it also fails on baseline main with'SimpleHttpResponse' object has no attribute 'version_string'. Pre-existing unrelated issue, out of scope.Test plan
This pull request and its description were written by Isaac.