Implement OTLP metrics bridge#6465
Conversation
|
What's the rationale for doing this? |
|
@guilload Right now, metrics-rs doesn’t support OTLP exporters by default. There is an open PR for it, but I’m not sure when it will be merged. In the previous PR, I used metrics-exporter-otel instead, but it doesn’t seem to be very widely used, so I’m concerned that relying on it could introduce supply chain risk. There is another similar crate, but that also doesn’t seem to be widely adopted. The exporter implementation is relatively simple, so I was thinking it might be better for us to just write it ourselves. What do you think? |
|
Apologies for butting in as a (relative) outsider, but why not use opentelemetry-otlp for metrics export instead? It's already in Cargo.toml, is maintained under official OTel auspices, and already supports metrics export (with a quick demo on its docs.rs page and more examples here). |
fe128d0 to
3e3ccdd
Compare
|
because we need a StatsD exporter for our fork. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3e3ccdd130
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Copyright 2021-Present Datadog, Inc. | ||
| // | ||
| // Licensed under the Apache License, Version 2.0 (the "License"); |
There was a problem hiding this comment.
Retain the vendored MIT attribution
When distributing builds or source from this commit, the in-tree OTLP recorder/storage code appears to vendor substantial portions of metrics-exporter-otel (same recorder, metadata, storage, and instrument structure) while the third-party license entry was removed and these new files carry only the Datadog Apache header. The upstream crate is MIT-licensed, which requires preserving its copyright and permission notice in copies/substantial portions, so please retain the upstream attribution or keep it listed as vendored third-party code.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
@shuheiktgw let's just fork the repo under quickwit-oss, it's cleaner.
Summary
In the previous PR, I used metrics-exporter-otel to bridge the OTel exporter and
metrics-rs. However, the library has a few issues, described below, so I would like to implement the bridge ourselves.Problems (human-written)
1. Delta Temporality for Counters Is Incorrectly Exported as Cumulative
Counter metrics exported by
metrics-exporter-otelcan be incorrect when using delta temporality: they appear to be exported as cumulative values instead of delta values. This is not necessarily a bug inmetrics-exporter-otelitself, but rather appears to be an issue in the OTel SDK.metrics-exporter-oteluses ObservableCounter, which registers a callback that is invoked whenever metrics are exported. However, there seems to be a bug in the OTel SDK when usingObservableCounterwith delta temporality.The issue occurs when metrics are recorded, or exported in the case of
ObservableCounter. TheValueMap#measuremethod is called, and the same tracker is inserted into thetrackershashmap twice: once using the original attributes as the key, and once using the sorted attributes as the key.For example, consider the
quickwit_indexing_processed_bytesmetric. Its labels look like this:The original attribute order is:
Since
docs_processed_statuscomes beforeindexalphabetically, the sorted order is:As a result, the
trackershashmap ends up looking like this:This becomes problematic when computing deltas for
ObservableCounter.Note that the OTel SDK has to compute the delta itself by remembering previously exported values, because
ObservableCounteronly provides the current cumulative value. Here, it computes the delta by subtracting the previously reported value from the current value:However, the problem occurs in collect_and_reset. It drains all entries from
trackers, then uses the attributes from the first occurrence of a given tracker as the key to look up the previously exported value. This can be incorrect because hashmap drain order is not guaranteed.In the example above, we do not know whether the attributes will be returned as:
or:
What makes this worse is that
collect_and_resetswaps the hashmaps betweenself.trackers_for_collectandself.trackersevery time it is called. I believe this is what causes these two attributes to be used alternately, although I am not completely sure about this part and would appreciate additional input.As a result, the previously reported value may not be found, so it falls back to
0. The computed delta therefore becomes the full cumulative value, rather than the actual delta.This actually explains why the same problem does not occur with other counter metrics, such as
quickwit_ingest_docs_bytes_total. It only has one attribute,validity, so the attribute order does not change and the SDK always receives the same attribute set.2. Supply Chain Risk
I used metrics-exporter-otel, but it doesn’t seem to be very widely used, so I’m concerned that relying on it could introduce supply chain risk. There is another similar crate, but that also doesn’t seem to be widely adopted.
Proposed Solution
Implement the bridge ourselves. This new bridge simply uses
self.counter.addinstead ofObservableCounter, so it does not have the same issue described in Problem 1. Since the SDK does not need to store previously exported values internally, it can just export the current value and reset it each time it exports metrics. I have already tested this and confirmed that the delta works correctly.An alternative would be to use another library such as DoumanAsh/metrics-opentelemetry, which uses the
addmethod. However, it does not solve the second problem. (+ DoumanAsh/metrics-opentelemetry misses some interfaces like configuring bucket size so we have to update it)