feat(app): implement health#498
Conversation
|
@claude review this PR |
|
Claude finished @iamquang95's task in 3m 54s —— View job Review: PR #498 — port Charon
|
| Component | Go | Rust | Match | Notes |
|---|---|---|---|---|
| 9 checks (names/severities/thresholds) | present | present | yes | identical thresholds incl. quorum-1, 2*numValidators |
| Error strings | expected exactly one metric, bug: … |
same | yes | error.rs matches verbatim |
high_warning_log_rate metric name |
app_log_warning_total |
app_log_warn_total |
intentional | Charon's check never matches its own producer (warn_total); deviation is correct & documented |
| Log metric names | app_log_error_total{topic}, app_log_warn_total{topic} |
same | yes | verified against charon/app/log/metrics.go |
| Timer: first scrape after one period | time.NewTicker |
interval + skip first tick |
yes | |
| Gauge cleared (set 0) on check error | yes | yes | yes | checker.rs:124-137 |
| High-cardinality re-gather | yes | yes | yes | incl. skipping app_health_metrics_high_cardinality |
Histogram le exclusion in cardinality count |
implicit (protobuf) | explicit filter | yes | sound adaptation for the text model |
| logFilter rate-limiting | yes | no | no | see findings |
Tests
Not executed — cargo test requires an approval not granted in this run. Reviewed statically: the check tests mirror Charon's checks_internal_test.go table cases, and gatherer.rs adds good coverage (format pin, histogram non-pollution, name-mismatch rejection, escape handling, ±Inf/NaN). Please confirm cargo test -p app --lib health and the full gates (fmt, clippy -D warnings) pass in CI.
Open questions
- Is wiring
Checkerinto app startup a planned follow-up? (Currently the service never runs.) - Should
topic-as-event-field logs (sse module) be captured for the per-topic label, or is the empty-topic bucket acceptable for now?
·feat/app-health
varex83
left a comment
There was a problem hiding this comment.
LGTM! Solid work!
The only part I'm a little bit unsure about it gatherer.rs - I think it's good as it is, but curious if it's worth using some library for parsing?
| pub counter: Option<f64>, | ||
| /// Gauge value, if this is a gauge sample. | ||
| pub gauge: Option<f64>, |
There was a problem hiding this comment.
nit: we can either use enum here with two variants and implement .value() for it, or, based on usage, we can store those values in one field.
Port Charon
app/healthto RustPorts Charon's
app/healthpackage tocrates/app/src/health— a backgroundservice that periodically scrapes process metrics, evaluates a fixed set of
health checks over a rolling window, and publishes per-check pass/fail as the
app_health_checks{severity,name}gauge (plusapp_health_metrics_high_cardinality).What's included
model.rs,select.rs,reducers.rs,checks.rs,checker.rs. Error strings match Charon's.gatherer.rs): vise/prometheus-clientexposes no structured gather, so the gatherer encodes the registry to text in the exact format the exporter serves (OpenMetricsForPrometheus) and parses it back. Queried counter/gauge families are name-validated so stray lines can't be folded in.crates/tracinglog metrics toapp_log_error_total{topic}/app_log_warn_total{topic}so they match Charon and are picked up by Charon's Grafana dashboards.Deviations from Charon
high_warning_log_ratequeriesapp_log_warn_total(the emitted name) rather than Charon'sapp_log_warning_total— Charon's check name doesn't match its own producer (this will be removed on later version of Charon)