Skip to content

test(telemetry/e2e): make TestTelemetryE2E deterministic + deflake retry tests under merge-queue load#812

Merged
vikrantpuppala merged 3 commits into
mainfrom
vikrant/telemetry-test-fix-flakiness
May 27, 2026
Merged

test(telemetry/e2e): make TestTelemetryE2E deterministic + deflake retry tests under merge-queue load#812
vikrantpuppala merged 3 commits into
mainfrom
vikrant/telemetry-test-fix-flakiness

Conversation

@vikrantpuppala
Copy link
Copy Markdown
Contributor

@vikrantpuppala vikrantpuppala commented May 27, 2026

Summary

Two flake fixes that surfaced from chasing the same incident — both are about asserting on the connector's intent rather than its async side effects, and about isolating the warehouse from concurrent CI load.

1. tests/e2e/test_telemetry_e2e.py::TestTelemetryE2E (original scope)

Chronically flaky (see git log on the file — multiple "Fix CI test failure" commits over the last few months). Root cause: the tests asserted "telemetry round-tripped to the server" by intercepting TelemetryClient._telemetry_request_callback and counting completed futures.

def callback_wrapper(self_client, future, sent_count):
    captured_futures.append(future)         # ← appended HERE
    original_callback(self_client, future, sent_count)

The callback only fires after the HTTP request completes, so captured_futures.append(future) happens later than the actual submission. Two failure modes:

  1. Append-lag race. If the test reached wait(captured_futures, ...) before all callbacks had run, captured_futures was still empty (or partial). The wait saw nothing to wait on and returned instantly with done={}. Result: assert 0 == 2.

  2. Executor-shutdown race. On the last connection's close, TelemetryClientFactory.close() shuts the shared executor down with wait=False (intentional for production connection-close latency). Not-yet-started submissions are dropped silently. So even with correct draining we'd be measuring "did the server receive the request in time" rather than "did the connector dispatch it."

Fix: intercept at _send_telemetry (the connector's submission point) instead of _telemetry_request_callback. The submission is synchronous on the same thread that runs _export_event_flush(). By the time verify_events() runs, every batch the connector intended to submit is in captured_submissions. No async wait, no sleep, no race against executor shutdown.

The renamed variable (captured_futurescaptured_submissions) makes the contract clearer: each item is a batch the connector submitted, regardless of what the executor did with it afterward.

Verification: 5 consecutive runs on dogfood, all green, ~20s each.

2. tests/e2e/test_driver.py::TestPySQLCoreSuite::test_oserror_retries (added after the first merge_group run)

The first merge_group attempt of this PR hit a different flake: test_oserror_retries failed with assert mock_validate_conn.call_count == 6 because unexpected /telemetry-ext requests were counted alongside the intended session-endpoint retries.

Two interacting causes:

  1. _isolated_from_telemetry() only patches TelemetryClientFactory.initialize_telemetry_client. That covers new connections created during the test, but not real TelemetryClient instances that slip in via stale module state, paths that bypass initialize_telemetry_client, or clients created before the context entered.

  2. The merge queue ran multiple entries in parallel against the same warehouse. The previous concurrency group was keyed on github.ref (per-PR in queue: gh-readonly-queue/main/pr-N-…), so PR ci(code-coverage): move push:main trigger to merge_group #810's and PR test(telemetry/e2e): make TestTelemetryE2E deterministic + deflake retry tests under merge-queue load #812's queue entries ran concurrently. The warehouse was fine for individual entries but couldn't handle two simultaneous loads — telemetry/retry paths started intermittently failing on /telemetry-ext, and those failures got counted by mock_validate_conn.

Fix 2a (retry_test_mixins.py): three-layer telemetry isolation in _isolated_from_telemetry(). Layer 1 (factory swap) was already there. Added layer 2 (_send_telemetry no-op) and layer 3 (_export_event no-op) as defensive backstops at the class level. test_oserror_retries now passes 5/5 locally.

Fix 2b (code-coverage.yml): serialise merge_group runs under a single fixed concurrency group (e2e-mq-serial). Only one queue entry runs the suite at a time. PR-event runs keep per-ref + cancel-in-progress for fast author feedback. Trade-off: queue throughput drops to one ~17-min run at a time. Acceptable for this repo's PR volume.

Why this is the right boundary to test

The connector's responsibility is: given a telemetry event, queue it, flush per batch policy, submit. The tests should verify that. What happens after submission — HTTP latency, server availability, executor lifecycle — isn't behavior of the code under test and shouldn't be coupled into these assertions.

What's NOT in this PR

  • TelemetryClientFactory.close() still uses shutdown(wait=False). That's intentional behaviour by design, kept for production connection-close latency.
  • test_retry_max_count_not_exceeded still fails on this branch, but it also fails on baseline main with 'SimpleHttpResponse' object has no attribute 'version_string'. Pre-existing unrelated issue, out of scope.

Test plan

  • CI passes on this PR.
  • PR can be queued and merges cleanly even when there's another PR in the queue (queue serialisation kicks in once this lands).
  • Telemetry e2e tests and retry tests no longer flake on merge_group runs over the next ~week.

This pull request and its description were written by Isaac.

The previous tests asserted "telemetry round-tripped to the server"
by intercepting TelemetryClient._telemetry_request_callback and
counting completed futures. That recording lags the actual work —
the callback fires asynchronously after the HTTP request completes,
and on the *last* connection close TelemetryClientFactory.close()
shuts the shared executor down with wait=False (intentional, for
connection-close latency in production). Two consequences:

  1. A `wait(captured_futures, timeout=10)` call right after `with
     conn:` can return before any callbacks have fired — so the
     wait is "waiting on" an empty list, returns immediately, and
     the assertion `assert len(done) == expected_count` fails non-
     deterministically with `assert 0 == 2` or `assert 1 == 2`.

  2. The shared-executor shutdown(wait=False) can drop in-flight
     submissions that haven't started running yet, so even if we
     drained correctly we'd be testing whether the server happened
     to receive the request in time, not whether the connector
     correctly dispatched it.

Switch interception from `_telemetry_request_callback` to
`_send_telemetry`. That captures the connector's *intent to submit*
synchronously, which is what we actually want to test — the
connector either decided to send a batch or it didn't, regardless
of what happens to the future afterward. No sleep needed, no
timeout-based wait needed, no race against the executor shutdown.

5 consecutive local runs pass deterministically in ~20s each (down
from ~17 min when the flake hit).

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
Copy link
Copy Markdown
Contributor

@samikshya-db samikshya-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this.

@vikrantpuppala vikrantpuppala enabled auto-merge May 27, 2026 13:33
@vikrantpuppala vikrantpuppala disabled auto-merge May 27, 2026 13:38
@vikrantpuppala vikrantpuppala enabled auto-merge May 27, 2026 13:41
@vikrantpuppala vikrantpuppala added this pull request to the merge queue May 27, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 27, 2026
@vikrantpuppala vikrantpuppala added this pull request to the merge queue May 27, 2026
@vikrantpuppala vikrantpuppala removed this pull request from the merge queue due to a manual request May 27, 2026
Two compounding fixes that surfaced on PR #812's first merge_group
run, where test_oserror_retries failed with
`assert mock_validate_conn.call_count == 6` — unexpected
`/telemetry-ext` requests had been counted alongside the intended
session-endpoint retries.

1. tests/e2e/common/retry_test_mixins.py — strengthen
   `_isolated_from_telemetry()` with two additional defensive patches:

     - TelemetryClient._send_telemetry → no-op
     - TelemetryClient._export_event  → no-op

   The existing factory swap installs NoopTelemetryClient for new
   connections, but doesn't cover real TelemetryClient instances that
   slip in via other paths (stale module-global, code that bypasses
   initialize_telemetry_client, anything created before the context
   entered). Patching at the class level catches all of them.

2. .github/workflows/code-coverage.yml — serialise merge_group runs.

   Previous concurrency group keyed on github.ref, which is per-PR in
   the queue (gh-readonly-queue/main/pr-N-…). That allowed multiple
   queue entries to hammer the same warehouse in parallel, stressing
   telemetry / retry paths that single-PR runs don't exercise.

   Group merge_group + workflow_dispatch under a single fixed name
   (e2e-mq-serial) so they run one at a time. PR-event runs keep
   per-ref grouping + cancel-in-progress for fast author feedback.
   Trade-off: queue throughput drops to one ~17-min run at a time.

Folded into PR #812 so the telemetry-test rewrite and the retry-test
deflake ship as a single unit.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
@vikrantpuppala vikrantpuppala changed the title test(telemetry/e2e): make TestTelemetryE2E deterministic by capturing submissions, not callbacks test(telemetry/e2e): make TestTelemetryE2E deterministic + deflake retry tests under merge-queue load May 27, 2026
@vikrantpuppala vikrantpuppala enabled auto-merge May 27, 2026 17:03
@vikrantpuppala vikrantpuppala added this pull request to the merge queue May 27, 2026
Merged via the queue into main with commit 0309e7c May 27, 2026
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants