fix(server): decouple asyncify thread pool from COOP_TASKRUN for old-kernel compat #3517

Open
mfyuce wants to merge 3 commits into apache:master from mfyuce:fix/coop-taskrun-thread-pool

@mfyuce mfyuce commented Jun 20, 2026

Summary

IORING_SETUP_COOP_TASKRUN requires Linux ≥ 5.19. On kernels 5.10–5.18 the
shard io_uring setup fails with EINVAL even though the main runtime starts
fine, preventing server boot entirely. These three commits fix that in order:
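
As a rough illustration of the version boundary described above, the following sketch decides from a kernel release string whether COOP_TASKRUN can be requested. Both function names are hypothetical and are not part of this PR; the PR itself relies on an env var rather than runtime detection.

```rust
/// Hypothetical helper (not in the PR): parse the leading "major.minor"
/// from a kernel release string such as "5.15.0-91-generic".
fn parse_kernel_version(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    // The minor component may carry a suffix (e.g. "19-rc1"), so keep only
    // its leading digits before parsing.
    let digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor: u32 = digits.parse().ok()?;
    Some((major, minor))
}

/// IORING_SETUP_COOP_TASKRUN was introduced in Linux 5.19, so anything
/// older (or unparseable) is treated as unsupported.
fn supports_coop_taskrun(release: &str) -> bool {
    matches!(parse_kernel_version(release), Some(v) if v >= (5, 19))
}
```

Under this sketch, a release string like `5.15.0-91-generic` is reported as unsupported, while `5.19.0` and `6.8.0` are reported as supported.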

  1. Gate COOP_TASKRUN flags behind an env var (IGGY_SHARD_RUNTIME_COOP_TASKRUN,
    default true = unchanged). Set it to false to run on 5.10–5.18 kernels
    at a small latency cost. (feat(server))

  2. Keep asyncify worker pool when COOP_TASKRUN is off. With the flag off,
    compio routes fs/JWT-storage ops through the asyncify thread pool.
    thread_pool_limit(0) then panics "thread pool is needed but no worker
    thread is running" on shard 0. Gate the thread_pool_limit(0) call behind
    the same flag. (fix(server))

  3. Decouple thread_pool_limit from COOP_TASKRUN entirely. TCP, HTTP, and
    WebSocket transports dispatch some ops through the asyncify pool even when
    COOP_TASKRUN=true, so tying thread_pool_limit(0) to the flag still
    panics on 6.8+ kernels with those transports active. Add a
    keep_worker_pool: bool parameter to create_shard_executor; the pool is
    only dropped when COOP_TASKRUN=true and no TCP/HTTP/WS transport is
    enabled. Both server and server-ng derive keep_worker_pool from their
    loaded config. (fix(server))
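
The rule in step 3 can be sketched as two small pure functions. This is a minimal model under assumptions: the `Transports` struct and both function names are hypothetical stand-ins for the loaded config and for the logic inside `create_shard_executor`, and `Some(0)` merely models the `thread_pool_limit(0)` call rather than invoking compio.

```rust
/// Hypothetical stand-in for the transport section of the loaded config.
#[derive(Clone, Copy, Debug, Default)]
struct Transports {
    tcp: bool,
    http: bool,
    websocket: bool,
}

/// The asyncify pool must stay if any transport that dispatches ops
/// through it is enabled.
fn derive_keep_worker_pool(t: Transports) -> bool {
    t.tcp || t.http || t.websocket
}

/// Returns `Some(0)` when the pool may be dropped (modelling
/// `thread_pool_limit(0)`), or `None` to leave the default pool in place.
fn shard_thread_pool_limit(coop_taskrun: bool, keep_worker_pool: bool) -> Option<usize> {
    if coop_taskrun && !keep_worker_pool {
        Some(0)
    } else {
        None
    }
}
```

With TCP enabled, `derive_keep_worker_pool` yields `true`, so the pool is kept regardless of the COOP_TASKRUN setting; the pool is dropped only in the single case where COOP_TASKRUN is on and no transport needs it.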

After these changes, operators can set IGGY_SHARD_RUNTIME_COOP_TASKRUN=true
on Linux 6.8+ with TCP transport enabled and get lower io_uring latency without
the worker-pool panic. On ≤5.18 kernels they set it to false and the server
boots normally.
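
The env-var gate from step 1 might look roughly like the sketch below. The parsing rule (only `false` or `0`, case-insensitive, disables the flags) is an assumption, as are the function names; the PR does not spell out which values it accepts. The two flag constants carry the bit values from the Linux UAPI header `<linux/io_uring.h>`.

```rust
// Bit values as defined in the Linux UAPI header <linux/io_uring.h>.
const IORING_SETUP_COOP_TASKRUN: u32 = 1 << 8;
const IORING_SETUP_TASKRUN_FLAG: u32 = 1 << 9;

/// Assumed parsing rule: an unset variable keeps the default (true);
/// only "false" or "0" (case-insensitive, trimmed) turns the flags off.
fn coop_taskrun_enabled(raw: Option<&str>) -> bool {
    match raw.map(|s| s.trim().to_ascii_lowercase()) {
        Some(v) => !matches!(v.as_str(), "false" | "0"),
        None => true,
    }
}

/// Extra io_uring setup flags to request for a shard ring.
fn shard_setup_flags(coop_taskrun: bool) -> u32 {
    if coop_taskrun {
        IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG
    } else {
        0
    }
}
```

A caller could feed it the real variable with `coop_taskrun_enabled(std::env::var("IGGY_SHARD_RUNTIME_COOP_TASKRUN").ok().as_deref())`. Keeping the parser free of direct environment access makes it easy to unit-test.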

Files changed

  • core/server_common/src/executor.rs: create_shard_executor(keep_worker_pool: bool), flag-gated COOP_TASKRUN/TASKRUN_FLAG
  • core/server/src/main.rs: derive keep_worker_pool from config
  • core/server-ng/src/main.rs, core/server-ng/src/bootstrap.rs: same

Test plan

  • cargo clippy -p server -p server-ng -- -D warnings passes
  • Server boots on Linux 5.15 with IGGY_SHARD_RUNTIME_COOP_TASKRUN=false + TCP transport
  • Server boots on Linux 6.8 with IGGY_SHARD_RUNTIME_COOP_TASKRUN=true + TCP transport (no worker-pool panic)
  • cargo test -p server_common passes

🤖 Generated with Claude Code

https://claude.ai/code/session_018duZYBkbguQ2pn8RJ82PUw

mfyuce and others added 3 commits June 20, 2026 14:35
Shard executors hardcoded IORING_SETUP_COOP_TASKRUN + TASKRUN_FLAG, which
require Linux >= 5.19. On 5.15 the shard io_uring setup fails with EINVAL
even though the default-flag main runtime starts fine, so the server can't
boot at all. Gate the flags behind IGGY_SHARD_RUNTIME_COOP_TASKRUN (default
true = unchanged behavior); set it to false to run on 5.10 through 5.18
kernels at a small latency cost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QBxbPbdKXzoMdvLBeugNBX
With COOP_TASKRUN off (old-kernel fallback), compio routes some ops (fs,
JWT storage) through the asyncify thread pool; thread_pool_limit(0) then
panics 'thread pool is needed but no worker thread is running' and the
HTTP server task dies on shard 0. Gate thread_pool_limit(0) behind the
same flag so the default worker pool stays when COOP_TASKRUN is off.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QBxbPbdKXzoMdvLBeugNBX
TCP, HTTP and WebSocket transports dispatch some ops through the asyncify
thread pool even when COOP_TASKRUN is on, so thread_pool_limit(0) cannot
be tied to the COOP_TASKRUN flag alone: enabling COOP_TASKRUN on a 6.8+
kernel still panics with "thread pool is needed" when those transports
are active.

Add a keep_worker_pool parameter to create_shard_executor. The asyncify
pool is only dropped when COOP_TASKRUN is true AND the caller signals no
TCP/HTTP/WS transport is active. Both server and server-ng derive the
flag from their loaded config; the server-ng bootstrap runtime passes
true because it runs before config is available.

This lets operators set IGGY_SHARD_RUNTIME_COOP_TASKRUN=true on
Linux 6.8+ even with TCP transports enabled, gaining the lower
io_uring latency without the worker-pool panic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018duZYBkbguQ2pn8RJ82PUw
@github-actions

Thanks for the PR. It is labeled S-waiting-on-review and queued for review.

Slash commands (own line, regular comment) move it around the queue:

  • /ready - back to S-waiting-on-review after addressing feedback
  • /author - flip to S-waiting-on-author while you finish changes
  • /request-review @user-or-team - request a reviewer

See CONTRIBUTING.md for details.

github-actions bot added the S-waiting-on-review label Jun 20, 2026
mfyuce added a commit to mfyuce/iggy that referenced this pull request Jun 21, 2026
Mark COOP_TASKRUN PR apache#3517 as submitted; clear TOBEDECIDED.md.
Both apache#3516 and apache#3517 are now S-waiting-on-review on apache/iggy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QBxbPbdKXzoMdvLBeugNBX
mfyuce added a commit to mfyuce/iggy that referenced this pull request Jun 21, 2026
- AGENTS.md: 104→75 lines. Removed redundant repo structure (derivable
  by ls), collapsed principles to iggy-specific rules only, merged
  Jenkins/QW infra into Infra section, updated handover block.
- TODO.md: replaced stale checked items with 4 open PRs (apache#3516 apache#3517
  apache#3523 apache#3525) + QW 0.9 upgrade task.
- DONE.md: added sessions 5-10 block (QW sink pipeline, collector
  cutover, InvalidOffset bug + fix).
- quickwit_sink/src/lib.rs: cargo fmt reformatting only.
@numinnex

Contributor

I am not sure about this. I think we would rather not allow our server to run on kernels < 6.8 than add extra startup flags to disable COOP_TASKRUN.
