device-health-oracle: detect link impairment from monitoring data by nikw9944 · Pull Request #3678 · malbeclabs/doublezero

nikw9944 · 2026-05-06T16:42:25Z

Summary of Changes

Add bidirectional LinkHealth transitions to the device-health-oracle. The DHO previously had zero LinkCriterions configured — links auto-advanced from Pending to ReadyForService and stayed there forever, even when ISIS was down or every packet was being dropped. This PR adds the demotion path (and a slow recovery path back).
ReadyForService → Impaired triggers on the most recent link_rollup_5m bucket (fast detection, single bucket). Impaired → ReadyForService requires every distinct bucket in the recovery window to be clean (slow recovery, anti-flap).
Recovery window reuses the existing --drained-slot-count (~30 min default) — no new burn-in flag introduced. New --link-loss-threshold flag (default 5.0) controls the per-direction loss cutoff.
LinkHealth is reported as a signal only; the serviceability program does not gate link.status on it.
Fixes device-health-oracle: detect link impairment from monitoring data and update link.health #2652

Implementation notes

Stale-data floor: a "latest" bucket older than 15 min (3× rollup cadence) is treated as no data. Without this, a frozen telemetry pipeline at the moment of an ISIS flap would keep the link Impaired forever even after the link recovered.
Recovery query deduplicates late-arriving rows by argMax(..., ingested_at) per bucket, so a corrected re-write doesn't keep a link Impaired after every distinct bucket reads as clean.
--link-loss-threshold is validated at startup (must be finite and in [0, 100]) — a misconfiguration would otherwise produce an onchain write storm.
MetricCriterionResults is now incremented for link criteria too (parity with device criteria) so impairment/recovery rates are graphable.
Backwards compatible: deployments without ClickHouse continue to behave exactly as before — no demotion, no recovery, no impairment criterion is wired up.

Testing Verification

New unit tests cover all six transition cases (Pending/RFS/Impaired × pass/fail), the impairment-mode and recovery-mode criterion paths, the stale-bucket floor, threshold boundaries, no-data handling in both modes, ClickHouse error propagation, and the LinkBurnIn helper.
New ClickHouse tests assert the actual SQL shape: argMax-per-bucket dedupe, provisioning = false filter, double-quoted database name, and ? bind parameters.
Manually validated both queries against mainnet link_rollup_5m data via the DoubleZero MCP — known impaired link DzHDqj3cdi77eMLWKemdhfr6YZJeHHGxuysvAdekniC returns bad=5/total=5 over the last 30 min and the latest bucket reads isis_down=true, a_loss=100, z_loss=100, matching what the issue reports.
make go-test, golangci-lint, and go vet all clean.

Add bidirectional LinkHealth transitions backed by the existing link_rollup_5m ClickHouse table: - ReadyForService -> Impaired when the most recent rollup bucket has ISIS down or per-direction packet loss above --link-loss-threshold (default 5%). - Impaired -> ReadyForService only after every bucket in the recovery window (reuses --drained-slot-count) is clean. The asymmetry — fast demote, slow recover — keeps borderline links from flapping while still surfacing real impairment quickly. LinkHealth is a signal only; the serviceability program does not gate link.status on it. Refs #2652

Architecture review (HIGH): - Recovery SQL counts duplicate ingested_at rows rather than distinct buckets. Wrap in argMax-per-bucket subquery so a corrected late row cannot keep an Impaired link stuck even after every distinct bucket reads as clean. - LinkHealthRecent had no recency floor — a stale latest bucket (telemetry pipeline broken) could indefinitely demote/keep a link impaired. Return bucket_ts and treat anything older than 15 minutes (3x rollup cadence) as no data. Architecture review (MEDIUM): - Increment MetricCriterionResults symmetrically in checkAllLink so link impairment/recovery rates are graphable. Security review (LOW): - Validate --link-loss-threshold is finite and within [0, 100] at startup; fail fast on misconfiguration. Plus debuggability: include bucket timestamps in impairment fail reasons; surface bad/total bucket counts in recovery debug logs and the failure reason. Refs #2652

ben-dz

Solid implementation that matches #2652. What looks good:

Evaluator state machine is clean and well-documented; the no-criteria guards preserve existing behavior for deployments without ClickHouse.
argMax(..., ingested_at) dedupe per bucket is the right call for late-arriving rewrites, and the SQL-shape tests actually pin it.
Stale-bucket floor on the impairment side is well-reasoned (15 min = 3× cadence) and tested with an adversarial fixture (stale bucket showing 100% loss must not demote).
Startup validation of --link-loss-threshold (NaN/Inf/range) is a nice touch.
Test coverage is genuinely thorough: all six transition cases, threshold boundaries (at-threshold passes, above fails), both no-data modes, bind-arg ordering, db-name quoting.

Main ask is the ClickHouse error-handling comment on checkImpairment (one-line fix plus a test flip); the recovery freshness/coverage comment is worth addressing or explicitly accepting; the other two are minor.

ben-dz · 2026-06-11T14:46:35Z

+	if err != nil {
+		c.log.Error("Failed to query link health recent",
+			"link", pubkey, "code", link.Code, "error", err)
+		return false, fmt.Sprintf("clickhouse query failed: %v", err)
+	}


[HIGH] A query error here fails the impairment criterion, and in the ReadyForService branch of LinkHealthEvaluator.Evaluate a failed criterion demotes the link to Impaired. A transient ClickHouse outage (network blip, CH restart, query timeout — anything after the successful startup ping) would therefore demote every RFS link onchain in a single tick, which is the write storm the --link-loss-threshold validation guards against. Once ClickHouse recovers, the recovery check immediately passes (the window's buckets are genuinely clean), so you also get a matching wave of RFS re-writes plus false operator signal in between.

Errors here should be treated like the !found case below: hold current health (return true, ""), keep the error log, and ideally bump a distinct metric so outages don't show up as impairment in MetricCriterionResults{link_health_impairment, fail}.

(The mirrored error path in checkRecovery is correct as-is — there a fail means "stay Impaired", which is the right behavior on error.)

ben-dz · 2026-06-11T14:46:36Z

+	}
+}
+
+func TestLinkHealthCriterion_Impairment_QueryError_Fails(t *testing.T) {


[HIGH — companion to the checkImpairment error-path comment] TestLinkHealthCriterion_Impairment_QueryError_Fails asserts the demote-on-error behavior flagged in link_health.go — if the error path changes to hold health, this test should assert passed == true instead.

ben-dz · 2026-06-11T14:46:36Z

+	r, found, err := c.checker.LinkHealthWindowAllClean(ctx, pubkey, start, now, c.lossThreshold)
+	if err != nil {
+		c.log.Error("Failed to query link health recovery window",
+			"link", pubkey, "code", link.Code, "error", err)
+		return false, fmt.Sprintf("clickhouse query failed: %v", err)
+	}
+	if !found {
+		return false, "no rollup data in recovery window"
+	}


[MEDIUM] The recovery gate requires every present bucket in the window to be clean, but doesn't require the window to be covered or recent:

Frozen pipeline: if rollups stop flowing after an impairment, the dirty buckets age out of the trailing [DrainedStart, Now] window while older clean buckets remain → the link recovers on stale data, and then stays RFS because checkImpairment treats the stale latest bucket as no-data. That contradicts the "neither demote nor recover on stale data" intent documented on LinkHealthRecentResult.

Sparse ingestion: a single clean bucket in the 30-minute window satisfies "all clean", so the anti-flap dwell can effectively shrink to one bucket.

expectedMinutes is already computed here; consider requiring approximate coverage (e.g. Total >= expectedMinutes/5 - 1) and/or having the window query also return max(bucket_ts) and applying the same linkHealthRecentMaxAge floor to it.

ben-dz · 2026-06-11T14:46:36Z

@@ -164,6 +180,8 @@ func main() {
 	}
 	linkEvaluator := &worker.LinkHealthEvaluator{
 		ReadyForServiceCriteria: nil,


[LOW/suggestion] ReadyForServiceCriteria stays nil, so Pending/Unknown links vacuously promote to RFS even when the latest bucket already shows isis_down or 100% loss — and then demote one tick later (two onchain writes plus a misleading intermediate RFS). Since the impairment criterion already passes on no-data, wiring linkImpairmentCriteria in here would block promotion of known-impaired links without changing behavior for links with no telemetry. The issue spec calls the current promotion path "existing behavior", so fine to defer to a follow-up.

nikw9944 added 2 commits May 6, 2026 16:41

nikw9944 mentioned this pull request May 6, 2026

device-health-oracle: detect link impairment from monitoring data and update link.health #2652

Open

ben-dz requested changes Jun 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

device-health-oracle: detect link impairment from monitoring data#3678

device-health-oracle: detect link impairment from monitoring data#3678
nikw9944 wants to merge 2 commits into
mainfrom
nikw9944/doublezero-2652

nikw9944 commented May 6, 2026

Uh oh!

ben-dz left a comment

Uh oh!

ben-dz Jun 11, 2026

Uh oh!

ben-dz Jun 11, 2026

Uh oh!

ben-dz Jun 11, 2026

Uh oh!

ben-dz Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nikw9944 commented May 6, 2026

Summary of Changes

Implementation notes

Testing Verification

Uh oh!

ben-dz left a comment

Choose a reason for hiding this comment

Uh oh!

ben-dz Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

ben-dz Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

ben-dz Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

ben-dz Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants