device-health-oracle: detect link impairment from monitoring data#3678
device-health-oracle: detect link impairment from monitoring data#3678nikw9944 wants to merge 2 commits into
Conversation
Add bidirectional LinkHealth transitions backed by the existing link_rollup_5m ClickHouse table: - ReadyForService -> Impaired when the most recent rollup bucket has ISIS down or per-direction packet loss above --link-loss-threshold (default 5%). - Impaired -> ReadyForService only after every bucket in the recovery window (reuses --drained-slot-count) is clean. The asymmetry — fast demote, slow recover — keeps borderline links from flapping while still surfacing real impairment quickly. LinkHealth is a signal only; the serviceability program does not gate link.status on it. Refs #2652
Architecture review (HIGH): - Recovery SQL counts duplicate ingested_at rows rather than distinct buckets. Wrap in argMax-per-bucket subquery so a corrected late row cannot keep an Impaired link stuck even after every distinct bucket reads as clean. - LinkHealthRecent had no recency floor — a stale latest bucket (telemetry pipeline broken) could indefinitely demote/keep a link impaired. Return bucket_ts and treat anything older than 15 minutes (3x rollup cadence) as no data. Architecture review (MEDIUM): - Increment MetricCriterionResults symmetrically in checkAllLink so link impairment/recovery rates are graphable. Security review (LOW): - Validate --link-loss-threshold is finite and within [0, 100] at startup; fail fast on misconfiguration. Plus debuggability: include bucket timestamps in impairment fail reasons; surface bad/total bucket counts in recovery debug logs and the failure reason. Refs #2652
ben-dz
left a comment
There was a problem hiding this comment.
Solid implementation that matches #2652. What looks good:
- Evaluator state machine is clean and well-documented; the no-criteria guards preserve existing behavior for deployments without ClickHouse.
argMax(..., ingested_at)dedupe per bucket is the right call for late-arriving rewrites, and the SQL-shape tests actually pin it.- Stale-bucket floor on the impairment side is well-reasoned (15 min = 3× cadence) and tested with an adversarial fixture (stale bucket showing 100% loss must not demote).
- Startup validation of
--link-loss-threshold(NaN/Inf/range) is a nice touch. - Test coverage is genuinely thorough: all six transition cases, threshold boundaries (at-threshold passes, above fails), both no-data modes, bind-arg ordering, db-name quoting.
Main ask is the ClickHouse error-handling comment on checkImpairment (one-line fix plus a test flip); the recovery freshness/coverage comment is worth addressing or explicitly accepting; the other two are minor.
| if err != nil { | ||
| c.log.Error("Failed to query link health recent", | ||
| "link", pubkey, "code", link.Code, "error", err) | ||
| return false, fmt.Sprintf("clickhouse query failed: %v", err) | ||
| } |
There was a problem hiding this comment.
[HIGH] A query error here fails the impairment criterion, and in the ReadyForService branch of LinkHealthEvaluator.Evaluate a failed criterion demotes the link to Impaired. A transient ClickHouse outage (network blip, CH restart, query timeout — anything after the successful startup ping) would therefore demote every RFS link onchain in a single tick, which is the write storm the --link-loss-threshold validation guards against. Once ClickHouse recovers, the recovery check immediately passes (the window's buckets are genuinely clean), so you also get a matching wave of RFS re-writes plus false operator signal in between.
Errors here should be treated like the !found case below: hold current health (return true, ""), keep the error log, and ideally bump a distinct metric so outages don't show up as impairment in MetricCriterionResults{link_health_impairment, fail}.
(The mirrored error path in checkRecovery is correct as-is — there a fail means "stay Impaired", which is the right behavior on error.)
| } | ||
| } | ||
|
|
||
| func TestLinkHealthCriterion_Impairment_QueryError_Fails(t *testing.T) { |
There was a problem hiding this comment.
[HIGH — companion to the checkImpairment error-path comment] TestLinkHealthCriterion_Impairment_QueryError_Fails asserts the demote-on-error behavior flagged in link_health.go — if the error path changes to hold health, this test should assert passed == true instead.
| r, found, err := c.checker.LinkHealthWindowAllClean(ctx, pubkey, start, now, c.lossThreshold) | ||
| if err != nil { | ||
| c.log.Error("Failed to query link health recovery window", | ||
| "link", pubkey, "code", link.Code, "error", err) | ||
| return false, fmt.Sprintf("clickhouse query failed: %v", err) | ||
| } | ||
| if !found { | ||
| return false, "no rollup data in recovery window" | ||
| } |
There was a problem hiding this comment.
[MEDIUM] The recovery gate requires every present bucket in the window to be clean, but doesn't require the window to be covered or recent:
- Frozen pipeline: if rollups stop flowing after an impairment, the dirty buckets age out of the trailing
[DrainedStart, Now]window while older clean buckets remain → the link recovers on stale data, and then stays RFS becausecheckImpairmenttreats the stale latest bucket as no-data. That contradicts the "neither demote nor recover on stale data" intent documented onLinkHealthRecentResult. - Sparse ingestion: a single clean bucket in the 30-minute window satisfies "all clean", so the anti-flap dwell can effectively shrink to one bucket.
expectedMinutes is already computed here; consider requiring approximate coverage (e.g. Total >= expectedMinutes/5 - 1) and/or having the window query also return max(bucket_ts) and applying the same linkHealthRecentMaxAge floor to it.
| @@ -164,6 +180,8 @@ func main() { | |||
| } | |||
| linkEvaluator := &worker.LinkHealthEvaluator{ | |||
| ReadyForServiceCriteria: nil, | |||
There was a problem hiding this comment.
[LOW/suggestion] ReadyForServiceCriteria stays nil, so Pending/Unknown links vacuously promote to RFS even when the latest bucket already shows isis_down or 100% loss — and then demote one tick later (two onchain writes plus a misleading intermediate RFS). Since the impairment criterion already passes on no-data, wiring linkImpairmentCriteria in here would block promotion of known-impaired links without changing behavior for links with no telemetry. The issue spec calls the current promotion path "existing behavior", so fine to defer to a follow-up.
Summary of Changes
LinkHealthtransitions to the device-health-oracle. The DHO previously had zeroLinkCriterions configured — links auto-advanced from Pending to ReadyForService and stayed there forever, even when ISIS was down or every packet was being dropped. This PR adds the demotion path (and a slow recovery path back).ReadyForService → Impairedtriggers on the most recentlink_rollup_5mbucket (fast detection, single bucket).Impaired → ReadyForServicerequires every distinct bucket in the recovery window to be clean (slow recovery, anti-flap).--drained-slot-count(~30 min default) — no new burn-in flag introduced. New--link-loss-thresholdflag (default 5.0) controls the per-direction loss cutoff.LinkHealthis reported as a signal only; the serviceability program does not gatelink.statuson it.Implementation notes
Impairedforever even after the link recovered.argMax(..., ingested_at)per bucket, so a corrected re-write doesn't keep a linkImpairedafter every distinct bucket reads as clean.--link-loss-thresholdis validated at startup (must be finite and in[0, 100]) — a misconfiguration would otherwise produce an onchain write storm.MetricCriterionResultsis now incremented for link criteria too (parity with device criteria) so impairment/recovery rates are graphable.Testing Verification
LinkBurnInhelper.provisioning = falsefilter, double-quoted database name, and?bind parameters.link_rollup_5mdata via the DoubleZero MCP — known impaired linkDzHDqj3cdi77eMLWKemdhfr6YZJeHHGxuysvAdekniCreturnsbad=5/total=5over the last 30 min and the latest bucket readsisis_down=true, a_loss=100, z_loss=100, matching what the issue reports.make go-test,golangci-lint, andgo vetall clean.