[FSTORE-1938] Support chaining of Transformation Functions using a DAG #580
Merged
Pull request overview
Adds documentation for chaining Transformation Functions into a dependency graph (DAG) in the Hopsworks Feature Store docs, including how execution order is resolved, how to visualize the DAG, and how parallel execution behaves for independent branches.
Changes:
- Documented chaining semantics for Transformation Functions, both on-demand (ODT) and model-dependent (MDT), including the rejection of cycles and duplicate output columns.
- Added guidance on visualizing the transformation execution DAG from UI and SDK.
- Added performance/parallelism tuning details via `n_processes`, including defaults and serving-time pool pre-spawn.
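To make the chaining semantics above concrete, here is a minimal sketch of how an execution order could be resolved from declared input and output columns, with duplicate outputs and cycles rejected. It is an illustration only, not the Hopsworks implementation: `resolve_order` and the function and column names in the example are hypothetical.

```python
from collections import deque


def resolve_order(functions):
    """Return an execution order for chained functions.

    `functions` maps a (hypothetical) function name to a pair
    (input_columns, output_columns). Raises ValueError on duplicate
    output columns or on a dependency cycle.
    """
    # Map each output column to the single function that produces it.
    producer = {}
    for name, (_inputs, outputs) in functions.items():
        for col in outputs:
            if col in producer:
                raise ValueError(
                    f"duplicate output column {col!r}: "
                    f"produced by {producer[col]!r} and {name!r}"
                )
            producer[col] = name

    # Build edges: upstream function -> functions that consume its output.
    dependents = {name: set() for name in functions}
    indegree = {name: 0 for name in functions}
    for name, (inputs, _outputs) in functions.items():
        for col in inputs:
            upstream = producer.get(col)  # None means a raw feature
            if upstream is not None and name not in dependents[upstream]:
                dependents[upstream].add(name)
                indegree[name] += 1

    # Kahn's algorithm: repeatedly emit functions with no unmet dependency.
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in sorted(dependents[node]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Any function left over is part of (or downstream of) a cycle.
    if len(order) != len(functions):
        stuck = sorted(set(functions) - set(order))
        raise ValueError(f"cycle detected among: {stuck}")
    return order


# Hypothetical chain: `bucket` consumes the output of `scale`.
chain = {
    "scale": (["amount"], ["amount_scaled"]),
    "bucket": (["amount_scaled"], ["amount_bucket"]),
    "flag": (["country"], ["is_domestic"]),
}
order = resolve_order(chain)
```

In this sketch `scale` always precedes `bucket`, while `flag` is independent of both and could run at any point.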
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| docs/user_guides/fs/transformation_functions.md | Introduces chained transformation DAG concept, DAG visualization, and performance tuning/parallelism behavior. |
| docs/user_guides/fs/feature_view/model-dependent-transformations.md | Adds a section describing chaining model-dependent transformations and links to performance tuning guidance. |
| docs/user_guides/fs/feature_group/on_demand_transformations.md | Adds a section describing chaining on-demand transformations and the cross-DAG path into feature views/MDTs. |
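The parallel behaviour for independent branches mentioned in the overview can be pictured as grouping DAG nodes into levels, where every node in a level depends only on earlier levels. The sketch below shows that grouping under stated assumptions: `parallel_levels` and the node names are hypothetical, and this is one common scheduling approach, not necessarily how `n_processes` is implemented.

```python
def parallel_levels(dependencies):
    """Group DAG nodes into levels that could run concurrently.

    `dependencies` maps each node to the set of nodes it depends on.
    Nodes in the same level have no dependency on each other.
    """
    # Work on a copy so the caller's mapping is not modified.
    remaining = {node: set(deps) for node, deps in dependencies.items()}
    levels = []
    while remaining:
        # Nodes whose dependencies have all been scheduled already.
        ready = sorted(n for n, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle detected: no node is ready to run")
        levels.append(ready)
        for node in ready:
            del remaining[node]
        for deps in remaining.values():
            deps.difference_update(ready)
    return levels


# Hypothetical DAG: `a` and `b` are independent, `c` needs both, `d` needs `c`.
levels = parallel_levels({
    "a": set(),
    "b": set(),
    "c": {"a", "b"},
    "d": {"c"},
})
```

Only the first level here contains more than one node, so a purely sequential chain gains nothing from extra workers. That is consistent with the commit notes on this PR, which say that worker processes pay off for batch and offline calls but rarely for single feature vectors.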
bubriks approved these changes on Jun 12, 2026
…xecution DAG

https://hopsworks.atlassian.net/browse/FSTORE-1938

Document chaining of transformation functions across the user guides: how the output of one function feeds another, how the execution DAG resolves the order, how cycles and duplicate output columns are rejected, and how the DAG is rendered from the UI and from the SDK with visualize_transformations().

A Transformation Functions Performance Tuning subsection in the transformation functions guide covers the node-parallel execution model: the n_processes argument and its defaults per input shape, pool pre-spawning through init_serving and init_batch_scoring, Arrow shared-memory staging, and the HSFS_TF_POOL_START_METHOD override.

The model-dependent transformations guide notes that statistics for chained functions are fit in dependency order on the data each function sees. The on-demand transformations guide covers chains whose intermediate output is dropped from the feature group.

No migration entry is included since the changes are backwards compatible.

Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…xecution DAG

https://hopsworks.atlassian.net/browse/FSTORE-1938

Restructure the performance tuning section so it reads in order: what the n_processes argument is, how parallelism maps to the DAG, when it pays off, online serving specifics, implementation notes. The previous version stated the sequential default three times across the first three paragraphs and placed the practical guidance after the implementation internals.

Content changes: a call-shape distinction in the guidance (batch and offline calls benefit from worker processes, single feature vectors rarely do because the per-call dispatch cost usually exceeds the work), and a note that pre-spawning the pool removes the startup cost but not the per-call dispatch cost. Both reflect the measured behavior of the online batch chaining benchmark in the loadtest repository.

Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@hopsworks.ai>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xecution DAG

https://hopsworks.atlassian.net/browse/FSTORE-1938

Rework the chaining documentation for reading order on all three pages.

The hub page now flows in this order: what chaining is, an example, uniform offline and online behavior, statistics over chains with a link to the model-dependent page, cross-type chaining, and invalid configurations last instead of interleaved.

The model-dependent page gives the statistics-over-chains behavior its own subsection instead of a single dangling sentence after the example, and states that statistics are fit on the train split, each transformation executes once, and the fitted values are persisted for serving.

The on-demand page leads with the example like the other pages, and the example now demonstrates the dropped-column claims it previously only stated: both the raw input and the intermediate are dropped, leaving one stored output.

Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@hopsworks.ai>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No description provided.