Skip to content

adapter: apply group commits off the coordinator loop#37363

Draft
aljoscha wants to merge 4 commits into
MaterializeInc:mainfrom
aljoscha:sql-447-group-commit-oracle-off-loop
Draft

adapter: apply group commits off the coordinator loop#37363
aljoscha wants to merge 4 commits into
MaterializeInc:mainfrom
aljoscha:sql-447-group-commit-oracle-off-loop

Conversation

@aljoscha

@aljoscha aljoscha commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

What

This PR follows #37355. It contains the remaining SQL-447 work: taking periodic group-commit timestamp-oracle round trips off the coordinator loop.

The branch has four commits:

  1. adapter: apply group commits off the coordinator loop (Fixes SQL-447)

    • Adds a long-lived background group committer that owns shareable handles needed for group commits.
    • The coordinator stages pending writes and sends the batch to the committer, then resumes on a small completion message.
    • Timestamp-oracle write_ts / apply_write work and builtin-table/table appends happen off the coordinator loop.
    • The committer uses the just-applied write timestamp as the local read timestamp for read-hold downgrades, avoiding batching-oracle read_ts calls during shutdown.
  2. adapter: retry txns-shard writes on conflict instead of locking

    • Removes the pessimistic shard_advance_lock approach.
    • Table registration, table forget, and table appends surface StorageError::InvalidUppers instead of panicking when a stale timestamp races with another txns-shard write.
    • Coordinator DDL paths retry with a fresh oracle timestamp on those conflicts.
  3. storage: register tables in the txns shard as one isolated step

    • Makes txns-shard table registration an explicit all-or-nothing storage-controller operation.
    • create_collections / alter_table_desc set up storage collections only. register_table_collections is the sole path that registers tables in the txns shard on both first attempt and retry.
    • The storage layer decides which catalog items belong in the txns shard by checking authoritative DataSource::Table, so source-fed tables and webhooks are not incorrectly registered.
  4. adapter: advance group commit catalog upper off-loop

    • Moves group-commit catalog-upper advancement into the background committer.
    • Makes catalog write timestamps late-bound at the durable catalog boundary. After acquiring the catalog storage mutex, durable catalog writes commit at max(requested_ts, current_catalog_upper).
    • Makes table-collection catalog upper advancement choose its final readable timestamp after acquiring the catalog storage mutex. If optional empty advancement has already passed the requested timestamp, the table operation advances again from the current upper and uses that timestamp for txns-shard registration.
    • Removes the extra catalog_timestamp_guard. The catalog storage mutex is now the boundary that orders empty upper advancement with real catalog writes.

Why

The coordinator is single-threaded. Before this PR, the periodic group commit could perform timestamp-oracle round trips inline on that loop. When the oracle backend is slow, those waits stall unrelated sessions, including statements that do not need an oracle timestamp themselves.

The goal is to keep the coordinator loop responsive while preserving the ordering guarantees around group commits, catalog transactions, table txns-shard state, and response retirement.

Correctness notes

  • Group commits still serialize through one group-commit worker. The worker is the serialization point for staged writes.
  • Statement retirement and statement-log timestamp assignment remain on the coordinator loop. The coordinator sets timestamps before retiring responses.
  • Catalog upper advancement runs off-loop, but durable catalog writes and table-collection setup tolerate racing empty upper advancement by choosing their final commit or registration timestamp while holding the catalog storage mutex.
  • The background empty catalog-upper advance remains optional and idempotent. It only keeps mz_catalog_raw readable at the oracle read timestamp.
  • Txns-shard conflicts are handled optimistically. If registration, forget, or append observes stale uppers, the coordinator re-allocates a fresh timestamp and retries the affected operation.
  • Table registration is all-or-nothing from the coordinator's perspective. Retries do not depend on resuming half-finished state from create_collections.
  • Source-fed tables are filtered in the storage controller using DataSource::Table, not by adapter-side assumptions about which catalog item kinds are txns-writable.

Tests

This PR relies primarily on existing coverage for group commit, table DDL, txns-shard registration, and catalog/storage integration. Local validation also exercised:

  • read-your-writes and cross-session read-after-write behavior,
  • concurrent CREATE/INSERT/DROP table bursts with both fast and delayed oracle responses,
  • source-fed tables from load generators,
  • statement logging while group commits run off-loop,
  • SUBSCRIBE table read/write timestamp behavior.

Manual smoke/perf validation with a Cockroach timestamp oracle behind toxiproxy:

  • With 500ms injected oracle latency and 8 concurrent INSERT workers, unrelated SELECT 1 stayed responsive: p50 2ms, p95 3ms, max 7ms.
  • Read-your-writes, cross-session read-after-write, DDL-under-load at 500ms and 0ms oracle latency, and statement logging all passed.
  • In the rough 100ms-latency perf gauge, victim SELECT 1 p95 dropped to 2.0ms and max to 3.3ms after moving the catalog-upper advance off the loop.

Additional validation after replacing the catalog timestamp guard with late-bound catalog timestamps:

  • cargo +1.96.0 fmt --check
  • cargo +1.96.0 check -p mz-adapter
  • cargo +1.96.0 clippy -p mz-adapter --all-targets -- -D warnings
  • bin/sqllogictest --optimized -- --auto-index-selects test/sqllogictest/bounded_staleness.slt
  • bin/sqllogictest --optimized -- test/sqllogictest/replacement-materialized-views.slt
  • bin/sqllogictest --optimized -- --auto-transactions --auto-index-tables --enable-table-keys --replica-size=scale=1,workers=2 --replicas=2 test/sqllogictest/sqlite/test/index/commute/10/slt_good_12.test

Fixes SQL-447

@aljoscha aljoscha force-pushed the sql-447-group-commit-oracle-off-loop branch 8 times, most recently from 70d6efa to f49e081 Compare July 1, 2026 18:22
aljoscha added 4 commits July 2, 2026 09:11
The periodic group commit and the user-write path did their timestamp-oracle
round trips inline on the coordinator loop: `peek_write_ts` and `write_ts` when
allocating the write timestamp, and `read_ts` when downgrading read holds
afterwards. The keepalive commit fires every `default_timestamp_interval` (1s)
regardless of load, so under a slow oracle backend the loop spent most of its
time blocked on these round trips and stalled every other session, including
pure reads that never touch the oracle.

Move the oracle round trips and the table append off the loop into a single,
long-lived group committer task. The loop only does the cheap, state-touching
work: `stage_group_commit` drains pending writes, validates and acquires write
locks, builds the append batch, and hands a `GroupCommitRequest` to the
committer. The committer does the throttle peek, allocates the write timestamp,
appends to the tables, applies the write to the oracle, and reads the local read
ts. It then hands the parts that need coordinator state back to the loop via
`Message::GroupCommitApplied`: advance the catalog upper, record statement
execution timestamps, retire client responses, and downgrade read holds.

Processing one request at a time makes the committer the serialization point for
group commits, so appends reach the single table-write worker in timestamp
order. The committer holds only shareable handles, captured once at startup: the
oracle is never replaced in-process, and the table appender (a new
`StorageController::table_appender` returning a cloneable handle over the txns
worker) stays valid for the controller's lifetime. Promotion out of read-only
mode is a process restart, so a fresh committer is spawned then.

Two ordering constraints keep the finalization on the loop rather than in the
committer:

  - The catalog upper must be advanced in step with the oracle read ts so reads
    of `mz_catalog_raw` at that ts do not block. This has to be serialized with
    on-loop catalog transactions, whose commit timestamp is fixed when the
    transaction begins, so the committer cannot advance it off the loop without
    tripping the single-writer assert. The loop advances it in
    `GroupCommitApplied` using a new tolerant
    `DurableCatalogState::try_advance_upper`, which is a no-op if an on-loop DDL
    transaction already advanced the upper past this commit's timestamp. The
    strict `advance_upper` (with its linearizability soft-panic) stays for the
    on-loop DDL callers, whose allocate-and-advance is atomic on the loop.

  - Statement execution timestamps must be recorded before the client responses
    are retired, because retiring ends the statement execution and drops its
    logging record. Both happen in `GroupCommitApplied`, in that order.

The txns shard is advanced by on-loop DDL (register / forget / alter of tables)
as well as by the off-loop committer, and both must reach it in timestamp order.
`shard_advance_lock` serializes `[allocate write ts + txns-shard write]` across
the committer and the on-loop DDL sites. Reads never take this lock, so they are
never blocked by it. On-loop DDL can briefly wait for the committer's lock hold
(one oracle round trip), which is acceptable: DDL is infrequent and already did
its own oracle round trips on the loop.

`execute` (DDL builtin-table writes) and bootstrap now go through the committer
too: `execute` stages the write and returns the notify, and bootstrap awaits it
rather than applying the timestamp itself.

Fixes SQL-447

Under an injected 500ms oracle latency with 8 concurrent workers, a victim
`SELECT 1` (no oracle write of its own) goes from multiple seconds to p50 6ms,
p95 9ms while the workers hammer INSERT. Read-your-writes, cross-session
read-after-write, statement-logging writes, and CREATE/DROP/INSERT DDL stress at
both 500ms and 0ms oracle latency all pass with no errors or panics.
The off-loop group committer and on-loop DDL (register / forget / alter of
tables) both write to the txns shard at oracle-allocated timestamps, and both
must reach it in timestamp order. The previous commit serialized them with a
`shard_advance_lock` held around `[allocate write ts + txns-shard write]`. That
couples the two paths: on-loop DDL can wait a full oracle round trip for the
committer's lock hold, and the lock is an all-or-nothing coupling that cannot be
relaxed on just one side.

Replace the lock with optimistic concurrency. Each side allocates a timestamp
from the oracle (which is monotonic) and attempts its txns-shard write. If the
other side advanced the txns upper past it, the write conflicts and the caller
re-allocates a fresh timestamp (necessarily beyond the txns upper) and retries.
The committer's append already did this on `InvalidUppers`. It is now the
primary mechanism rather than a defensive backstop, and the lock is gone.

To make DDL retryable:

  - The table write worker returns the conflict as `InvalidUppers` instead of
    panicking, rolling back its `write_handles` bookkeeping so a retry at a
    fresh timestamp can re-register or re-forget cleanly.

  - `create_collections` and `alter_table_desc` register tables in the txns
    shard as their last step. On conflict they return `InvalidUppers` with all
    other setup already done. The new `register_table_collections` re-opens
    write handles from the stored collection metadata (the originals are
    consumed by the txns register) and registers them, so the coordinator can
    retry it in a loop with fresh timestamps. `drop_tables` forgets
    synchronously and returns the conflict for the same retry treatment.

  - The coordinator drives the retry loops for CREATE / ALTER / DROP TABLE,
    re-allocating the write timestamp and re-advancing the catalog upper (which
    also re-checks leadership, see materialize#28216) each attempt.

Reads never take part in the txns-shard advance, so they are never blocked by
either side, and a reader never observes a partially registered table. The
oracle read timestamp is only advanced past the register timestamp
(`apply_local_write`) after a successful register.

At bootstrap no group commits run concurrently, so the register cannot conflict
there and stays a single attempt.

Under an injected 500ms oracle latency with concurrent inserters and DDL, a
victim `SELECT 1` stays at p50 8ms / p95 10ms. A fast-oracle stress with three
inserters and three concurrent CREATE/DROP TABLE loops drove 112 register and
10 forget conflicts, all retried to success with read-your-writes and
cross-session reads intact and no panics.
`create_collections` used to register tables in the txns shard as its last step
on the happy path, and return `InvalidUppers` with everything-but-the-register
done when the off-loop group committer raced it. The caller then finished the
registration through a different method, `register_table_collections`. So there
were two registration code paths, and `create_collections` had a "returns
half-done, call the other method to finish" contract.

Make `register_table_collections` the sole path that registers tables in the
txns shard, used identically for the first attempt and for retries, by both
bootstrap and runtime DDL. `create_collections` and `alter_table_desc` now only
set up storage and never touch the txns shard. The caller (coordinator or
bootstrap) always registers as a separate, retryable step, re-allocating a fresh
oracle timestamp on `InvalidUppers`.

`register_ts` still flows into `create_collections`, but only for the read side:
the initial since downgrade that makes a table "spring into existence" at that
timestamp. That downgrade happens on the freshly-opened since handle before it
is handed to the storage-collections background task, so it belongs in
`create_collections`. The write side, txns registration, is what needs to be
retryable, and that is now fully separated out.

`register_table_collections` re-opens write handles from the stored collection
metadata (a conflicting register consumes the originals) and opens them
concurrently, since bootstrap registers many tables at once. It also folds in
the read-only filtering that `register_table_writes` used to do: in read-only
mode only migrated tables are registered. `register_table_writes` is gone.

Bootstrap collects the table ids across all dependency layers and registers them
once, after all layers are set up. Bootstrap runs no group commits concurrently,
so this cannot conflict.

Validated with the fast-oracle DDL stress: 110 register and 9 forget conflicts,
all retried to success, read-your-writes and cross-session reads intact, no
panics, and a clean read-write bootstrap.
Move the catalog upper advance for completed group commits into the background committer. The committer uses a cloned Catalog, which shares the durable storage mutex with coordinator catalog transactions, so the advance remains serialized with DDL commits without blocking unrelated coordinator work.\n\nThis removes the remaining coordinator-loop stalls for unrelated statements when the timestamp oracle backend is slow.\n\nFollow-up to SQL-447.
@aljoscha aljoscha force-pushed the sql-447-group-commit-oracle-off-loop branch 5 times, most recently from 6e68cfd to 0b75ba2 Compare July 3, 2026 12:21
@aljoscha aljoscha added the ci-nightly PR CI control: also trigger Nightly label Jul 3, 2026
@aljoscha aljoscha changed the title adapter: take timestamp-oracle round trips off the coordinator loop (SQL-425, SQL-447) adapter: apply group commits off the coordinator loop Jul 3, 2026
@aljoscha aljoscha force-pushed the sql-447-group-commit-oracle-off-loop branch 4 times, most recently from 171a704 to 5e4b7f1 Compare July 4, 2026 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-nightly PR CI control: also trigger Nightly

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant