Skip to content

feat(app): implement health#498

Open
iamquang95 wants to merge 4 commits into
mainfrom
feat/app-health
Open

feat(app): implement health#498
iamquang95 wants to merge 4 commits into
mainfrom
feat/app-health

Conversation

@iamquang95

Copy link
Copy Markdown
Collaborator

Port Charon app/health to Rust

Ports Charon's app/health package to crates/app/src/health — a background
service that periodically scrapes process metrics, evaluates a fixed set of
health checks over a rolling window, and publishes per-check pass/fail as the
app_health_checks{severity,name} gauge (plus app_health_metrics_high_cardinality).

What's included

  • Engine (1:1 with Go): metric model, label selectors, series reducers, the 9 checks, and the query function — model.rs, select.rs, reducers.rs, checks.rs, checker.rs. Error strings match Charon's.
  • vise bridge (gatherer.rs): vise/prometheus-client exposes no structured gather, so the gatherer encodes the registry to text in the exact format the exporter serves (OpenMetricsForPrometheus) and parses it back. Queried counter/gauge families are name-validated so stray lines can't be folded in.
  • Metric-name parity: reshaped crates/tracing log metrics to app_log_error_total{topic} / app_log_warn_total{topic} so they match Charon and are picked up by Charon's Grafana dashboards.

Deviations from Charon

  • high_warning_log_rate queries app_log_warn_total (the emitted name) rather than Charon's app_log_warning_total — Charon's check name doesn't match its own producer (this will be removed on later version of Charon)

@iamquang95

Copy link
Copy Markdown
Collaborator Author

@claude review this PR

@claude

claude Bot commented Jun 24, 2026

Copy link
Copy Markdown

Claude finished @iamquang95's task in 3m 54s —— View job


Review: PR #498 — port Charon app/health to Rust

  • Gather context (diff, Go source)
  • Review engine (model, select, reducers, checks, checker)
  • Review vise bridge (gatherer)
  • Review tracing metric-name changes
  • Check parity with Charon app/health
  • Post review feedback

Summary

This is a careful, high-fidelity port. The engine (model/select/reducers/checks/checker) is essentially 1:1 with Charon: error strings, thresholds, selector/reducer semantics, the unanchored regex label matching, the "first scrape after one full period" timer behavior, gauge-clearing on check error, and the high-cardinality re-gather all match. The gatherer.rs text-exposition bridge is a reasonable and well-tested workaround for vise lacking a structured gather API, and the _total / app_log_warn_total parity reasoning checks out against the Go source. Tests are comprehensive and mirror Charon's table tests. Findings below are minor; nothing blocking.

Findings

  • [Low] MetricsLayer reads topic only from spans, not from event fields
    Impact: Existing call sites set topic directly on the event — e.g. tracing::warn!(..., topic = HEAD_EVENT, ...) in crates/app/src/sse/mod.rs:226,262,301,326. event_topic walks only the enclosing span scope (TopicVisitor runs solely in on_new_span), so those events fall into the empty-topic bucket. The total count is still correct (so high_warning_log_rate/high_error_log_rate are unaffected), but the per-topic label is lost for event-field topics, which weakens the Grafana breakdown this PR is aligning to.
    Evidence: crates/tracing/src/layers/metrics.rs:44-66 (span-only scan) vs crates/app/src/sse/mod.rs:226.
    Go reference: Charon derives topic from log.WithTopic(ctx, …) (span-scoped), so this only bites where Pluto attaches topic to the event instead of a span.
    Recommendation: also run a TopicVisitor over the event's own fields in on_event and prefer it over the span topic, or standardize topic-bearing logs onto WithTopic-style spans.

  • [Low] logFilter rate-limiting not ported → noisier failure logs + self-feedback
    Impact: Charon shares one log.Filter() (rate.Every(12s), burst 1) across all 9 checks, so at most one "Health check failed" warn is emitted per scrape. The Rust instrument logs every failing check every tick (acknowledged in the comment at checker.rs:128). Because MetricsLayer now counts every WARN into app_log_warn_total, these self-generated warnings (topic health) feed back into the high_warning_log_rate check — same loop exists in Charon but is dampened by the filter.
    Evidence: crates/app/src/health/checker.rs:121-138.
    Go reference: charon/app/health/checker.go:76-91 + charon/app/log/filter.go:45.
    Recommendation: optional — gate the warn behind a per-checker rate limiter to match Charon's volume.

  • [Low] Checker is not wired into app startup
    Impact: Checker/ViseGatherer are not referenced anywhere outside the module (lib.rs only adds pub mod health), so the service never runs yet. Functionally inert until a follow-up starts it. Flagging so it isn't assumed live.
    Evidence: no Checker::new / ViseGatherer usage outside crates/app/src/health/.
    Recommendation: confirm a follow-up PR spawns Checker::run on a CancellationToken from app wiring.

  • [Low] Regex recompiled per label, per series, per scrape
    Impact: labels_contain calls Regex::new for every candidate label on every query. This matches Go's regexp.MatchString (also recompiles), so parity is preserved, but the checks run over a 10-scrape window every 30s and this is easy to avoid.
    Evidence: crates/app/src/health/select.rs:89.
    Recommendation: optional — precompile the selector regexes once when the check list is built.

Parity matrix

Component Go Rust Match Notes
9 checks (names/severities/thresholds) present present yes identical thresholds incl. quorum-1, 2*numValidators
Error strings expected exactly one metric, bug: … same yes error.rs matches verbatim
high_warning_log_rate metric name app_log_warning_total app_log_warn_total intentional Charon's check never matches its own producer (warn_total); deviation is correct & documented
Log metric names app_log_error_total{topic}, app_log_warn_total{topic} same yes verified against charon/app/log/metrics.go
Timer: first scrape after one period time.NewTicker interval + skip first tick yes
Gauge cleared (set 0) on check error yes yes yes checker.rs:124-137
High-cardinality re-gather yes yes yes incl. skipping app_health_metrics_high_cardinality
Histogram le exclusion in cardinality count implicit (protobuf) explicit filter yes sound adaptation for the text model
logFilter rate-limiting yes no no see findings

Tests

Not executed — cargo test requires an approval not granted in this run. Reviewed statically: the check tests mirror Charon's checks_internal_test.go table cases, and gatherer.rs adds good coverage (format pin, histogram non-pollution, name-mismatch rejection, escape handling, ±Inf/NaN). Please confirm cargo test -p app --lib health and the full gates (fmt, clippy -D warnings) pass in CI.

Open questions

  1. Is wiring Checker into app startup a planned follow-up? (Currently the service never runs.)
  2. Should topic-as-event-field logs (sse module) be captured for the per-topic label, or is the empty-topic bucket acceptable for now?
    · feat/app-health

@varex83 varex83 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Solid work!

The only part I'm a little bit unsure about it gatherer.rs - I think it's good as it is, but curious if it's worth using some library for parsing?

Comment on lines +38 to +40
pub counter: Option<f64>,
/// Gauge value, if this is a gauge sample.
pub gauge: Option<f64>,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can either use enum here with two variants and implement .value() for it, or, based on usage, we can store those values in one field.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants