Skip to content

device-health-oracle: detect link impairment from monitoring data#3678

Draft
nikw9944 wants to merge 2 commits into
mainfrom
nikw9944/doublezero-2652
Draft

device-health-oracle: detect link impairment from monitoring data#3678
nikw9944 wants to merge 2 commits into
mainfrom
nikw9944/doublezero-2652

Conversation

@nikw9944

@nikw9944 nikw9944 commented May 6, 2026

Copy link
Copy Markdown
Contributor

Summary of Changes

  • Add bidirectional LinkHealth transitions to the device-health-oracle. The DHO previously had zero LinkCriterions configured — links auto-advanced from Pending to ReadyForService and stayed there forever, even when ISIS was down or every packet was being dropped. This PR adds the demotion path (and a slow recovery path back).
  • ReadyForService → Impaired triggers on the most recent link_rollup_5m bucket (fast detection, single bucket). Impaired → ReadyForService requires every distinct bucket in the recovery window to be clean (slow recovery, anti-flap).
  • Recovery window reuses the existing --drained-slot-count (~30 min default) — no new burn-in flag introduced. New --link-loss-threshold flag (default 5.0) controls the per-direction loss cutoff.
  • LinkHealth is reported as a signal only; the serviceability program does not gate link.status on it.
  • Fixes device-health-oracle: detect link impairment from monitoring data and update link.health #2652

Implementation notes

  • Stale-data floor: a "latest" bucket older than 15 min (3× rollup cadence) is treated as no data. Without this, a frozen telemetry pipeline at the moment of an ISIS flap would keep the link Impaired forever even after the link recovered.
  • Recovery query deduplicates late-arriving rows by argMax(..., ingested_at) per bucket, so a corrected re-write doesn't keep a link Impaired after every distinct bucket reads as clean.
  • --link-loss-threshold is validated at startup (must be finite and in [0, 100]) — a misconfiguration would otherwise produce an onchain write storm.
  • MetricCriterionResults is now incremented for link criteria too (parity with device criteria) so impairment/recovery rates are graphable.
  • Backwards compatible: deployments without ClickHouse continue to behave exactly as before — no demotion, no recovery, no impairment criterion is wired up.

Testing Verification

  • New unit tests cover all six transition cases (Pending/RFS/Impaired × pass/fail), the impairment-mode and recovery-mode criterion paths, the stale-bucket floor, threshold boundaries, no-data handling in both modes, ClickHouse error propagation, and the LinkBurnIn helper.
  • New ClickHouse tests assert the actual SQL shape: argMax-per-bucket dedupe, provisioning = false filter, double-quoted database name, and ? bind parameters.
  • Manually validated both queries against mainnet link_rollup_5m data via the DoubleZero MCP — known impaired link DzHDqj3cdi77eMLWKemdhfr6YZJeHHGxuysvAdekniC returns bad=5/total=5 over the last 30 min and the latest bucket reads isis_down=true, a_loss=100, z_loss=100, matching what the issue reports.
  • make go-test, golangci-lint, and go vet all clean.

nikw9944 added 2 commits May 6, 2026 16:41
Add bidirectional LinkHealth transitions backed by the existing
link_rollup_5m ClickHouse table:

- ReadyForService -> Impaired when the most recent rollup bucket has
  ISIS down or per-direction packet loss above --link-loss-threshold
  (default 5%).
- Impaired -> ReadyForService only after every bucket in the recovery
  window (reuses --drained-slot-count) is clean.

The asymmetry — fast demote, slow recover — keeps borderline links from
flapping while still surfacing real impairment quickly. LinkHealth is a
signal only; the serviceability program does not gate link.status on it.

Refs #2652
Architecture review (HIGH):
- Recovery SQL counts duplicate ingested_at rows rather than distinct
  buckets. Wrap in argMax-per-bucket subquery so a corrected late row
  cannot keep an Impaired link stuck even after every distinct bucket
  reads as clean.
- LinkHealthRecent had no recency floor — a stale latest bucket
  (telemetry pipeline broken) could indefinitely demote/keep a link
  impaired. Return bucket_ts and treat anything older than 15 minutes
  (3x rollup cadence) as no data.

Architecture review (MEDIUM):
- Increment MetricCriterionResults symmetrically in checkAllLink so
  link impairment/recovery rates are graphable.

Security review (LOW):
- Validate --link-loss-threshold is finite and within [0, 100] at
  startup; fail fast on misconfiguration.

Plus debuggability: include bucket timestamps in impairment fail
reasons; surface bad/total bucket counts in recovery debug logs and
the failure reason.

Refs #2652

@ben-dz ben-dz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid implementation that matches #2652. What looks good:

  • Evaluator state machine is clean and well-documented; the no-criteria guards preserve existing behavior for deployments without ClickHouse.
  • argMax(..., ingested_at) dedupe per bucket is the right call for late-arriving rewrites, and the SQL-shape tests actually pin it.
  • Stale-bucket floor on the impairment side is well-reasoned (15 min = 3× cadence) and tested with an adversarial fixture (stale bucket showing 100% loss must not demote).
  • Startup validation of --link-loss-threshold (NaN/Inf/range) is a nice touch.
  • Test coverage is genuinely thorough: all six transition cases, threshold boundaries (at-threshold passes, above fails), both no-data modes, bind-arg ordering, db-name quoting.

Main ask is the ClickHouse error-handling comment on checkImpairment (one-line fix plus a test flip); the recovery freshness/coverage comment is worth addressing or explicitly accepting; the other two are minor.

Comment on lines +78 to +82
if err != nil {
c.log.Error("Failed to query link health recent",
"link", pubkey, "code", link.Code, "error", err)
return false, fmt.Sprintf("clickhouse query failed: %v", err)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[HIGH] A query error here fails the impairment criterion, and in the ReadyForService branch of LinkHealthEvaluator.Evaluate a failed criterion demotes the link to Impaired. A transient ClickHouse outage (network blip, CH restart, query timeout — anything after the successful startup ping) would therefore demote every RFS link onchain in a single tick, which is the write storm the --link-loss-threshold validation guards against. Once ClickHouse recovers, the recovery check immediately passes (the window's buckets are genuinely clean), so you also get a matching wave of RFS re-writes plus false operator signal in between.

Errors here should be treated like the !found case below: hold current health (return true, ""), keep the error log, and ideally bump a distinct metric so outages don't show up as impairment in MetricCriterionResults{link_health_impairment, fail}.

(The mirrored error path in checkRecovery is correct as-is — there a fail means "stay Impaired", which is the right behavior on error.)

}
}

func TestLinkHealthCriterion_Impairment_QueryError_Fails(t *testing.T) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[HIGH — companion to the checkImpairment error-path comment] TestLinkHealthCriterion_Impairment_QueryError_Fails asserts the demote-on-error behavior flagged in link_health.go — if the error path changes to hold health, this test should assert passed == true instead.

Comment on lines +132 to +140
r, found, err := c.checker.LinkHealthWindowAllClean(ctx, pubkey, start, now, c.lossThreshold)
if err != nil {
c.log.Error("Failed to query link health recovery window",
"link", pubkey, "code", link.Code, "error", err)
return false, fmt.Sprintf("clickhouse query failed: %v", err)
}
if !found {
return false, "no rollup data in recovery window"
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] The recovery gate requires every present bucket in the window to be clean, but doesn't require the window to be covered or recent:

  1. Frozen pipeline: if rollups stop flowing after an impairment, the dirty buckets age out of the trailing [DrainedStart, Now] window while older clean buckets remain → the link recovers on stale data, and then stays RFS because checkImpairment treats the stale latest bucket as no-data. That contradicts the "neither demote nor recover on stale data" intent documented on LinkHealthRecentResult.
  2. Sparse ingestion: a single clean bucket in the 30-minute window satisfies "all clean", so the anti-flap dwell can effectively shrink to one bucket.

expectedMinutes is already computed here; consider requiring approximate coverage (e.g. Total >= expectedMinutes/5 - 1) and/or having the window query also return max(bucket_ts) and applying the same linkHealthRecentMaxAge floor to it.

@@ -164,6 +180,8 @@ func main() {
}
linkEvaluator := &worker.LinkHealthEvaluator{
ReadyForServiceCriteria: nil,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW/suggestion] ReadyForServiceCriteria stays nil, so Pending/Unknown links vacuously promote to RFS even when the latest bucket already shows isis_down or 100% loss — and then demote one tick later (two onchain writes plus a misleading intermediate RFS). Since the impairment criterion already passes on no-data, wiring linkImpairmentCriteria in here would block promotion of known-impaired links without changing behavior for links with no telemetry. The issue spec calls the current promotion path "existing behavior", so fine to defer to a follow-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

device-health-oracle: detect link impairment from monitoring data and update link.health

2 participants