feat(data-pipeline): ensure large span strings get truncated before encoding#2161
feat(data-pipeline): ensure large span strings get truncated before encoding#2161brettlangdon wants to merge 8 commits into
Conversation
🎉 All green!🧪 All tests passed 🎯 Code Coverage (details) 🔗 Commit SHA: c7caf7f | Docs | Datadog PR Page | Give us feedback! |
📚 Documentation Check Results📦
|
Clippy Allow Annotation ReportComparing clippy allow annotations between branches:
Summary by Rule
Annotation Counts by File
Annotation Stats by Crate
About This ReportThis report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality. |
🔒 Cargo Deny Results📦
|
Artifact Size Benchmark Reportaarch64-alpine-linux-musl
aarch64-unknown-linux-gnu
libdatadog-x64-windows
libdatadog-x86-windows
x86_64-alpine-linux-musl
x86_64-unknown-linux-gnu
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1c59a9a1a4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| /// Fields whose Unicode code-point count exceeds this threshold are truncated. | ||
| pub const MAX_SPAN_STRING_LEN: usize = 25_000; | ||
| /// Length (in Unicode code points) to which over-long fields are truncated, including the suffix. | ||
| pub const TRUNCATED_SPAN_STRING_LEN: usize = 2_500; |
There was a problem hiding this comment.
Why is this way shorter than MAX_SPAN_STRING_LEN, we cut if it's more tham 25000 codepoints but keep only the 1st 2500.
Or is this missing a zero maybe?
There was a problem hiding this comment.
This is what we do in Python, I don't remember the original reason
original PR
There was a problem hiding this comment.
The PR mentions the agent truncation policy... But it doesn't truncate the same 😅
It truncates at 25K bytes (at the nearest utf8 boundary actually), and not unicode codepoints, which can be significantly smaller for non latin characters and adds "..." after the 25K th character
There was a problem hiding this comment.
yeah, I don't mind diverging a bit from what dd-trace-py does, as long as it aligns with what the trace agent does 👍🏻
What does this PR do?
Add truncation of Span strings that are larger than 25k characters long before encoding.
Motivation
In dd-trace-py, this truncation currently happens when we encode Span chunks to hand off to the trace exporter, but as we remove the Python based encoders and give Spans directly to the trace exporter we need to maintain this behavior.
Additional Notes
Anything else we should know when reviewing?
How to test the change?
Describe here in detail how the change can be validated.