adapter: apply group commits off the coordinator loop by aljoscha · Pull Request #37363 · MaterializeInc/materialize

aljoscha · 2026-06-30T16:02:04Z

What

This PR follows #37355. It contains the remaining SQL-447 work: taking periodic group-commit timestamp-oracle round trips off the coordinator loop.

The branch has four commits:

adapter: apply group commits off the coordinator loop (Fixes SQL-447)
- Adds a long-lived background group committer that owns shareable handles needed for group commits.
- The coordinator stages pending writes and sends the batch to the committer, then resumes on a small completion message.
- Timestamp-oracle write_ts / apply_write work and builtin-table/table appends happen off the coordinator loop.
- The committer uses the just-applied write timestamp as the local read timestamp for read-hold downgrades, avoiding batching-oracle read_ts calls during shutdown.
adapter: retry txns-shard writes on conflict instead of locking
- Removes the pessimistic shard_advance_lock approach.
- Table registration, table forget, and table appends surface StorageError::InvalidUppers instead of panicking when a stale timestamp races with another txns-shard write.
- Coordinator DDL paths retry with a fresh oracle timestamp on those conflicts.
storage: register tables in the txns shard as one isolated step
- Makes txns-shard table registration an explicit all-or-nothing storage-controller operation.
- create_collections / alter_table_desc set up storage collections only. register_table_collections is the sole path that registers tables in the txns shard on both first attempt and retry.
- The storage layer decides which catalog items belong in the txns shard by checking authoritative DataSource::Table, so source-fed tables and webhooks are not incorrectly registered.
adapter: advance group commit catalog upper off-loop
- Moves group-commit catalog-upper advancement into the background committer.
- Makes catalog write timestamps late-bound at the durable catalog boundary. After acquiring the catalog storage mutex, durable catalog writes commit at max(requested_ts, current_catalog_upper).
- Makes table-collection catalog upper advancement choose its final readable timestamp after acquiring the catalog storage mutex. If optional empty advancement has already passed the requested timestamp, the table operation advances again from the current upper and uses that timestamp for txns-shard registration.
- Removes the extra catalog_timestamp_guard. The catalog storage mutex is now the boundary that orders empty upper advancement with real catalog writes.

Why

The coordinator is single-threaded. Before this PR, the periodic group commit could perform timestamp-oracle round trips inline on that loop. When the oracle backend is slow, those waits stall unrelated sessions, including statements that do not need an oracle timestamp themselves.

The goal is to keep the coordinator loop responsive while preserving the ordering guarantees around group commits, catalog transactions, table txns-shard state, and response retirement.

Correctness notes

Group commits still serialize through one group-commit worker. The worker is the serialization point for staged writes.
Statement retirement and statement-log timestamp assignment remain on the coordinator loop. The coordinator sets timestamps before retiring responses.
Catalog upper advancement runs off-loop, but durable catalog writes and table-collection setup tolerate racing empty upper advancement by choosing their final commit or registration timestamp while holding the catalog storage mutex.
The background empty catalog-upper advance remains optional and idempotent. It only keeps mz_catalog_raw readable at the oracle read timestamp.
Txns-shard conflicts are handled optimistically. If registration, forget, or append observes stale uppers, the coordinator re-allocates a fresh timestamp and retries the affected operation.
Table registration is all-or-nothing from the coordinator's perspective. Retries do not depend on resuming half-finished state from create_collections.
Source-fed tables are filtered in the storage controller using DataSource::Table, not by adapter-side assumptions about which catalog item kinds are txns-writable.

Tests

This PR relies primarily on existing coverage for group commit, table DDL, txns-shard registration, and catalog/storage integration. Local validation also exercised:

read-your-writes and cross-session read-after-write behavior,
concurrent CREATE/INSERT/DROP table bursts with both fast and delayed oracle responses,
source-fed tables from load generators,
statement logging while group commits run off-loop,
SUBSCRIBE table read/write timestamp behavior.

Manual smoke/perf validation with a Cockroach timestamp oracle behind toxiproxy:

With 500ms injected oracle latency and 8 concurrent INSERT workers, unrelated SELECT 1 stayed responsive: p50 2ms, p95 3ms, max 7ms.
Read-your-writes, cross-session read-after-write, DDL-under-load at 500ms and 0ms oracle latency, and statement logging all passed.
In the rough 100ms-latency perf gauge, victim SELECT 1 p95 dropped to 2.0ms and max to 3.3ms after moving the catalog-upper advance off the loop.

Additional validation after replacing the catalog timestamp guard with late-bound catalog timestamps:

cargo +1.96.0 fmt --check
cargo +1.96.0 check -p mz-adapter
cargo +1.96.0 clippy -p mz-adapter --all-targets -- -D warnings
bin/sqllogictest --optimized -- --auto-index-selects test/sqllogictest/bounded_staleness.slt
bin/sqllogictest --optimized -- test/sqllogictest/replacement-materialized-views.slt
bin/sqllogictest --optimized -- --auto-transactions --auto-index-tables --enable-table-keys --replica-size=scale=1,workers=2 --replicas=2 test/sqllogictest/sqlite/test/index/commute/10/slt_good_12.test

Fixes SQL-447

The periodic group commit and the user-write path did their timestamp-oracle round trips inline on the coordinator loop: `peek_write_ts` and `write_ts` when allocating the write timestamp, and `read_ts` when downgrading read holds afterwards. The keepalive commit fires every `default_timestamp_interval` (1s) regardless of load, so under a slow oracle backend the loop spent most of its time blocked on these round trips and stalled every other session, including pure reads that never touch the oracle. Move the oracle round trips and the table append off the loop into a single, long-lived group committer task. The loop only does the cheap, state-touching work: `stage_group_commit` drains pending writes, validates and acquires write locks, builds the append batch, and hands a `GroupCommitRequest` to the committer. The committer does the throttle peek, allocates the write timestamp, appends to the tables, applies the write to the oracle, and reads the local read ts. It then hands the parts that need coordinator state back to the loop via `Message::GroupCommitApplied`: advance the catalog upper, record statement execution timestamps, retire client responses, and downgrade read holds. Processing one request at a time makes the committer the serialization point for group commits, so appends reach the single table-write worker in timestamp order. The committer holds only shareable handles, captured once at startup: the oracle is never replaced in-process, and the table appender (a new `StorageController::table_appender` returning a cloneable handle over the txns worker) stays valid for the controller's lifetime. Promotion out of read-only mode is a process restart, so a fresh committer is spawned then. Two ordering constraints keep the finalization on the loop rather than in the committer: - The catalog upper must be advanced in step with the oracle read ts so reads of `mz_catalog_raw` at that ts do not block. This has to be serialized with on-loop catalog transactions, whose commit timestamp is fixed when the transaction begins, so the committer cannot advance it off the loop without tripping the single-writer assert. The loop advances it in `GroupCommitApplied` using a new tolerant `DurableCatalogState::try_advance_upper`, which is a no-op if an on-loop DDL transaction already advanced the upper past this commit's timestamp. The strict `advance_upper` (with its linearizability soft-panic) stays for the on-loop DDL callers, whose allocate-and-advance is atomic on the loop. - Statement execution timestamps must be recorded before the client responses are retired, because retiring ends the statement execution and drops its logging record. Both happen in `GroupCommitApplied`, in that order. The txns shard is advanced by on-loop DDL (register / forget / alter of tables) as well as by the off-loop committer, and both must reach it in timestamp order. `shard_advance_lock` serializes `[allocate write ts + txns-shard write]` across the committer and the on-loop DDL sites. Reads never take this lock, so they are never blocked by it. On-loop DDL can briefly wait for the committer's lock hold (one oracle round trip), which is acceptable: DDL is infrequent and already did its own oracle round trips on the loop. `execute` (DDL builtin-table writes) and bootstrap now go through the committer too: `execute` stages the write and returns the notify, and bootstrap awaits it rather than applying the timestamp itself. Fixes SQL-447 Under an injected 500ms oracle latency with 8 concurrent workers, a victim `SELECT 1` (no oracle write of its own) goes from multiple seconds to p50 6ms, p95 9ms while the workers hammer INSERT. Read-your-writes, cross-session read-after-write, statement-logging writes, and CREATE/DROP/INSERT DDL stress at both 500ms and 0ms oracle latency all pass with no errors or panics.

The off-loop group committer and on-loop DDL (register / forget / alter of tables) both write to the txns shard at oracle-allocated timestamps, and both must reach it in timestamp order. The previous commit serialized them with a `shard_advance_lock` held around `[allocate write ts + txns-shard write]`. That couples the two paths: on-loop DDL can wait a full oracle round trip for the committer's lock hold, and the lock is an all-or-nothing coupling that cannot be relaxed on just one side. Replace the lock with optimistic concurrency. Each side allocates a timestamp from the oracle (which is monotonic) and attempts its txns-shard write. If the other side advanced the txns upper past it, the write conflicts and the caller re-allocates a fresh timestamp (necessarily beyond the txns upper) and retries. The committer's append already did this on `InvalidUppers`. It is now the primary mechanism rather than a defensive backstop, and the lock is gone. To make DDL retryable: - The table write worker returns the conflict as `InvalidUppers` instead of panicking, rolling back its `write_handles` bookkeeping so a retry at a fresh timestamp can re-register or re-forget cleanly. - `create_collections` and `alter_table_desc` register tables in the txns shard as their last step. On conflict they return `InvalidUppers` with all other setup already done. The new `register_table_collections` re-opens write handles from the stored collection metadata (the originals are consumed by the txns register) and registers them, so the coordinator can retry it in a loop with fresh timestamps. `drop_tables` forgets synchronously and returns the conflict for the same retry treatment. - The coordinator drives the retry loops for CREATE / ALTER / DROP TABLE, re-allocating the write timestamp and re-advancing the catalog upper (which also re-checks leadership, see materialize#28216) each attempt. Reads never take part in the txns-shard advance, so they are never blocked by either side, and a reader never observes a partially registered table. The oracle read timestamp is only advanced past the register timestamp (`apply_local_write`) after a successful register. At bootstrap no group commits run concurrently, so the register cannot conflict there and stays a single attempt. Under an injected 500ms oracle latency with concurrent inserters and DDL, a victim `SELECT 1` stays at p50 8ms / p95 10ms. A fast-oracle stress with three inserters and three concurrent CREATE/DROP TABLE loops drove 112 register and 10 forget conflicts, all retried to success with read-your-writes and cross-session reads intact and no panics.

`create_collections` used to register tables in the txns shard as its last step on the happy path, and return `InvalidUppers` with everything-but-the-register done when the off-loop group committer raced it. The caller then finished the registration through a different method, `register_table_collections`. So there were two registration code paths, and `create_collections` had a "returns half-done, call the other method to finish" contract. Make `register_table_collections` the sole path that registers tables in the txns shard, used identically for the first attempt and for retries, by both bootstrap and runtime DDL. `create_collections` and `alter_table_desc` now only set up storage and never touch the txns shard. The caller (coordinator or bootstrap) always registers as a separate, retryable step, re-allocating a fresh oracle timestamp on `InvalidUppers`. `register_ts` still flows into `create_collections`, but only for the read side: the initial since downgrade that makes a table "spring into existence" at that timestamp. That downgrade happens on the freshly-opened since handle before it is handed to the storage-collections background task, so it belongs in `create_collections`. The write side, txns registration, is what needs to be retryable, and that is now fully separated out. `register_table_collections` re-opens write handles from the stored collection metadata (a conflicting register consumes the originals) and opens them concurrently, since bootstrap registers many tables at once. It also folds in the read-only filtering that `register_table_writes` used to do: in read-only mode only migrated tables are registered. `register_table_writes` is gone. Bootstrap collects the table ids across all dependency layers and registers them once, after all layers are set up. Bootstrap runs no group commits concurrently, so this cannot conflict. Validated with the fast-oracle DDL stress: 110 register and 9 forget conflicts, all retried to success, read-your-writes and cross-session reads intact, no panics, and a clean read-write bootstrap.

Move the catalog upper advance for completed group commits into the background committer. The committer uses a cloned Catalog, which shares the durable storage mutex with coordinator catalog transactions, so the advance remains serialized with DDL commits without blocking unrelated coordinator work.\n\nThis removes the remaining coordinator-loop stalls for unrelated statements when the timestamp oracle backend is slow.\n\nFollow-up to SQL-447.

aljoscha force-pushed the sql-447-group-commit-oracle-off-loop branch 8 times, most recently from 70d6efa to f49e081 Compare July 1, 2026 18:22

aljoscha added 4 commits July 2, 2026 09:11

aljoscha force-pushed the sql-447-group-commit-oracle-off-loop branch 5 times, most recently from 6e68cfd to 0b75ba2 Compare July 3, 2026 12:21

aljoscha added the ci-nightly PR CI control: also trigger Nightly label Jul 3, 2026

aljoscha changed the title ~~adapter: take timestamp-oracle round trips off the coordinator loop (SQL-425, SQL-447)~~ adapter: apply group commits off the coordinator loop Jul 3, 2026

aljoscha force-pushed the sql-447-group-commit-oracle-off-loop branch 4 times, most recently from 171a704 to 5e4b7f1 Compare July 4, 2026 13:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

adapter: apply group commits off the coordinator loop#37363

adapter: apply group commits off the coordinator loop#37363
aljoscha wants to merge 4 commits into
MaterializeInc:mainfrom
aljoscha:sql-447-group-commit-oracle-off-loop

aljoscha commented Jun 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

aljoscha commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

Correctness notes

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aljoscha commented Jun 30, 2026 •

edited

Loading