Skip to content

<fix>[ceph]: cap & parallelize host conn check#4145

Open
zstack-robot-2 wants to merge 1 commit into
5.5.22from
sync/jin.ma/fix/ZSTAC-85421
Open

<fix>[ceph]: cap & parallelize host conn check#4145
zstack-robot-2 wants to merge 1 commit into
5.5.22from
sync/jin.ma/fix/ZSTAC-85421

Conversation

@zstack-robot-2
Copy link
Copy Markdown
Collaborator

问题

同一 ceph primary storage 下,每台 host 的存储连通性检查 (CheckHostStorageConnectionMsg) 共用一条按 PS 串行的 chain(sync 签名只含 psUuid,不含 host)。当某台 host 的 in-flight check 卡住(吃默认 1800s http 超时),同 PS 其它所有 host 的 check 在队头被堵满 ~30min。

现网表现:MN 重启后并发 reconnect,一台 host 的 ceph check 被孤立(kvmagent 重启 + drain 60s cushion 漏过),空等 1800s,拖垮同 PS 其它 host 的恢复。

修复(两条,正交)

  1. 并行化 — check chain getSyncLevel() 由默认 1 提到 10,同 PS 不同 host 的 check 可并发,单台卡死只占 1/10 槽位,不再独占队头。
  2. 限超时 — 给 ceph check 的 KVMHostAsyncHttpCallMsg 单独设 5min 超时(CHECK_HOST_STORAGE_CONNECTION_TIMEOUT),替代默认 1800s。通过新增的 per-message timeout 透传实现:KVMHost.executeAsyncHttpCallmsg.getTimeout()(复用 NeedReplyMessage.timeout,-1=默认)传给内部 Http>0 时覆盖 timeoutManager.getTimeout()

未按 isNoStatusCheck() 一刀切——storage-device 的 RAID 自检/FC 扫描等同样是 noStatusCheck 的长操作,会被误杀。只精准限 ceph check 这一条。

合起来:最坏情况从「N 台 × 30min」压到「单台 5min、互不牵连」。

测试

新增 IT CephHostStorageCheckCase(ceph PS + 2 host):

  • testCheckNotSerializedAcrossHosts — 卡住 host1 的 check,断言 host2 的 check 仍并发返回、host1 仍 pending
  • testPerMessageTimeoutHonored — 发带 setTimeout(3s) 的 KVMHostAsyncHttpCallMsg 到 stuck simulator,断言 ~3s 超时失败而非默认 1800s

本地隔离 .m2 串行实跑:

  • Tests run: 1, Failures: 0, Errors: 0 — BUILD SUCCESS
  • Test1 #1 日志实证 Async Http Timeout ... timeout after 3000 MILLISECONDS
  • tao.yang/zstack-ZSV-3564@@2 #2 负向对照:syncLevel 改回 1 → reply2Done.await(15s) 失败(host2 被串行堵死)→ BUILD FAILURE,证明测试守得住修复

影响

  • 无 API / DB / GlobalConfig 变更(per-message timeout 复用既有 NeedReplyMessage.timeout 字段)
  • 单仓 zstack,无需 @@2

Resolves: ZSTAC-85421

sync from gitlab !10049

Per-host ceph connectivity check shared one serial chain per
primary storage. A stuck in-flight check (riding the default
1800s http timeout) blocked every other host's check on the
same PS for ~30min.

- raise check chain syncLevel 1 -> 10 so hosts run concurrently
- give the check call a 5min timeout instead of default 1800s,
  honored via a per-message timeout on KVMHostAsyncHttpCallMsg

Resolves: ZSTAC-85421

Change-Id: I717a746f676c6f776168646f6f776c64636e6870
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 29, 2026

Review Change Stack

概述

此 PR 为 Ceph 主存储的主机存储连接检查添加了可配置的按消息粒度超时支持。在 KVM 主机的 HTTP 调用框架中实现了 timeout 字段和传播机制,Ceph 存储层定义了 5 分钟的检查超时常量并应用到异步调用,最后通过两个集成测试用例验证了并发检查的非串行化行为和超时强制执行。

变更

主存储连接检查超时支持

Layer / File(s) 概述
Http 超时字段和异步 HTTP 应用
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
Http 类新增 timeout 字段与 timeout(long) 链式方法,用于携带自定义超时。异步 HTTP POST 调用时,当 timeout > 0 时使用该值,否则回退到默认的 timeoutManager.getTimeout()。executeAsyncHttpCall 在创建 Http 实例时注入 msg.getTimeout() 以完成消息级超时的传播。
Ceph 存储检查超时常量和同步等级
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
新增 CHECK_HOST_STORAGE_CONNECTION_TIMEOUT 常量(5 分钟)。CheckHostStorageConnectionMsg 处理器的 ChainTask 中增加 getSyncLevel() 覆盖返回同步等级 10。KVMHostAsyncHttpCallMsg 在构造时应用该超时常量。
并发检查和超时强制执行的集成测试
test/src/test/groovy/org/zstack/test/integration/storage/ceph/CephHostStorageCheckCase.groovy
新增 CephHostStorageCheckCase 测试类,配置含两台 KVM 主机的 Ceph 存储环境。sendCheckMsg() 辅助方法用于发送 CheckHostStorageConnectionMsg。testCheckNotSerializedAcrossHosts() 验证同一存储上向多主机的并发检查不被串行化阻塞,其中一台主机延迟响应时不影响另一台完成。testPerMessageTimeoutHonored() 验证显式 3 秒消息超时在 60 秒延迟响应下被强制执行,实际耗时在预期范围内确保非默认超时行为。

序列图

sequenceDiagram
  participant Client
  participant CephStorage
  participant KVMHost
  participant Http
  participant REST
  
  Client->>CephStorage: CheckHostStorageConnectionMsg
  CephStorage->>CephStorage: ChainTask with syncLevel=10
  CephStorage->>KVMHost: executeAsyncHttpCall(timeout=5min)
  KVMHost->>Http: new Http().timeout(5min)
  Http-->>KVMHost: configured
  KVMHost->>REST: asyncJsonPost with timeout
  REST-->>KVMHost: response or timeout
  KVMHost-->>Client: MessageReply
Loading

代码审查工作量评估

🎯 2 (简单) | ⏱️ ~12 分钟

🐰 主机连接何须久等候,
五分钟限时细节俏,
并发检查各奔东西走,
超时机制保驾护航好!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed PR 标题完全符合格式要求([scope]: ),长度 46 字符(<72),清晰总结了两个核心修复:并行化与超时限制。
Description check ✅ Passed PR 描述详细阐述了问题背景、修复方案、测试验证和影响范围,与变更集高度相关。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/jin.ma/fix/ZSTAC-85421

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.42.3)
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java (1)

4904-4907: ⚡ Quick win

把并发上限提取成具名常量。

这里的 10 是本次修复的关键调参值,直接写裸值会让后续调优和排障时不够直观。建议像超时一样提成常量,例如 CHECK_HOST_STORAGE_CONNECTION_SYNC_LEVEL

可参考的修改
+    private static final int CHECK_HOST_STORAGE_CONNECTION_SYNC_LEVEL = 10;
+
             `@Override`
             protected int getSyncLevel() {
-                return 10;
+                return CHECK_HOST_STORAGE_CONNECTION_SYNC_LEVEL;
             }
As per coding guidelines “避免使用魔法值(Magic Value):直接使用未经定义的数值或字符串应替换为枚举或常量。”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java`
around lines 4904 - 4907, Replace the magic number 10 in
CephPrimaryStorageBase.getSyncLevel() with a named constant (e.g.,
CHECK_HOST_STORAGE_CONNECTION_SYNC_LEVEL) to make the concurrency limit explicit
and configurable; add a static final int
CHECK_HOST_STORAGE_CONNECTION_SYNC_LEVEL = 10 (near other class constants or top
of the class) and return that constant from getSyncLevel() instead of the
literal 10.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@test/src/test/groovy/org/zstack/test/integration/storage/ceph/CephHostStorageCheckCase.groovy`:
- Around line 159-163: 当前断言只验证“失败 + 耗时区间”,需要进一步收紧为超时错误:在
CephHostStorageCheckCase 的断言块(包含 reply、done、start 变量)中,除了 assert
!reply.get().isSuccess() 外,再断言返回的错误类型是 timeout(例如检查 reply.get().getError()
表示为超时,或 error.code/递归方法如 isTimeout() 为 true),以确保 setTimeout(3s) 被真正触发并返回超时错误。
- Around line 111-125: The test currently only checks that host1 was blocked
(host1Entered) and that host2 succeeded, but never asserts host1's final reply;
modify the host1 sendCheckMsg call to capture its MessageReply into an
AtomicReference (e.g., reply1), use reply1Done as the latch callback to set the
reference and count down, then after release.countDown() await reply1Done and
assert reply1.get().isSuccess() (mirroring the reply2 pattern for sendCheckMsg,
reply2, reply2Done).

---

Nitpick comments:
In
`@plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java`:
- Around line 4904-4907: Replace the magic number 10 in
CephPrimaryStorageBase.getSyncLevel() with a named constant (e.g.,
CHECK_HOST_STORAGE_CONNECTION_SYNC_LEVEL) to make the concurrency limit explicit
and configurable; add a static final int
CHECK_HOST_STORAGE_CONNECTION_SYNC_LEVEL = 10 (near other class constants or top
of the class) and return that constant from getSyncLevel() instead of the
literal 10.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: bd3d4d00-3965-4bc8-937c-6df87b34e69a

📥 Commits

Reviewing files that changed from the base of the PR and between f2ead4b and 147ad24.

📒 Files selected for processing (3)
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
  • test/src/test/groovy/org/zstack/test/integration/storage/ceph/CephHostStorageCheckCase.groovy

Comment on lines +111 to +125
CountDownLatch reply1Done = new CountDownLatch(1)
sendCheckMsg(ps.uuid, host1.uuid, { MessageReply r -> reply1Done.countDown() })
assert host1Entered.await(10, TimeUnit.SECONDS)

AtomicReference<MessageReply> reply2 = new AtomicReference<>()
CountDownLatch reply2Done = new CountDownLatch(1)
sendCheckMsg(ps.uuid, host2.uuid, { MessageReply r -> reply2.set(r); reply2Done.countDown() })

assert reply2Done.await(15, TimeUnit.SECONDS)
assert reply2.get().isSuccess()
assert reply1Done.getCount() == 1

release.countDown()
assert reply1Done.await(15, TimeUnit.SECONDS)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

补上 host1 最终结果断言。

这里现在只证明了 host2 没被串行化,但没有验证 host1 在放行后是否成功完成;如果实现变成“并发了但最终失败”,这个用例仍然会通过。建议像 reply2 一样保存 reply1 并断言成功。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@test/src/test/groovy/org/zstack/test/integration/storage/ceph/CephHostStorageCheckCase.groovy`
around lines 111 - 125, The test currently only checks that host1 was blocked
(host1Entered) and that host2 succeeded, but never asserts host1's final reply;
modify the host1 sendCheckMsg call to capture its MessageReply into an
AtomicReference (e.g., reply1), use reply1Done as the latch callback to set the
reference and count down, then after release.countDown() await reply1Done and
assert reply1.get().isSuccess() (mirroring the reply2 pattern for sendCheckMsg,
reply2, reply2Done).

Comment on lines +159 to +163
assert done.await(20, TimeUnit.SECONDS)
long elapsed = System.currentTimeMillis() - start
assert !reply.get().isSuccess()
assert elapsed >= 2000
assert elapsed < 20000
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

把失败类型收紧到 timeout。

当前只断言“失败 + 耗时区间”,这会把路由错误、模拟器异常等非超时失败也算成通过,不能直接证明 setTimeout(3s) 被真正尊重。建议再断言返回的是 timeout 类错误。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@test/src/test/groovy/org/zstack/test/integration/storage/ceph/CephHostStorageCheckCase.groovy`
around lines 159 - 163, 当前断言只验证“失败 + 耗时区间”,需要进一步收紧为超时错误:在
CephHostStorageCheckCase 的断言块(包含 reply、done、start 变量)中,除了 assert
!reply.get().isSuccess() 外,再断言返回的错误类型是 timeout(例如检查 reply.get().getError()
表示为超时,或 error.code/递归方法如 isTimeout() 为 true),以确保 setTimeout(3s) 被真正触发并返回超时错误。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants