Skip to content

Add TermLayout -- term-to-coefficient slice registry#1007

Open
Philipp Kopper (PhilippKopperQC) wants to merge 1 commit into
Quantco:mainfrom
PhilippKopperQC:terms
Open

Add TermLayout -- term-to-coefficient slice registry#1007
Philipp Kopper (PhilippKopperQC) wants to merge 1 commit into
Quantco:mainfrom
PhilippKopperQC:terms

Conversation

@PhilippKopperQC

@PhilippKopperQC Philipp Kopper (PhilippKopperQC) commented Jun 14, 2026

Copy link
Copy Markdown

Why

Splines (mgcv-style s(...) / te(...) smooths) need to map each smooth term to a contiguous block of coefficients so a single lambda can penalize a whole block. glum currently exposes only a flat per-coefficient term_names_ list, with no way to ask "which coefficients does term X own?" without re-scanning that list. This PR adds the registry; later PRs migrate consumers (penalty assembly, Wald-test name lookup, categorical penalty expansion) onto it.
This is the first PR in the splines roadmap -- purely additive, no behavior change.

What

  • New module src/glum/_term_layout.py with two frozen, slotted dataclasses:
    • TermSlice(name, start, stop, kind) -- one term, half-open coef range over the full coefficient vector (intercept-aware).
    • TermLayout -- ordered tuple of TermSlice + has_intercept, with O(1) name lookup, iteration, and membership.
  • GeneralizedLinearRegressorBase._set_up_for_fit populates model.term_layout_ next to the existing term_names_ assignment. Both fit branches (formula and array/DataFrame) flow through the same point.
  • New test file tests/glm/test_term_layout.py -- 20 tests, all AAA-structured and strictly behavioral (partition property, slice -> column round-trip after .fit(), determinism, lookup contracts, edge cases).

Consumers

There are exactly two non-test references to TermLayout after this PR: the import at src/glum/_glm.py:51 and the construction at src/glum/_glm.py:2166. The execution graph during .fit() is:

GeneralizedLinearRegressor.fit(X, y)
  -> _set_up_for_fit(X, y, ...)
       ...
       self.feature_names_ = X.get_names("column")
       self.term_names_    = X.get_names("term")
       self.term_layout_   = TermLayout.from_term_names(...)   # WRITE (this PR)
  -> _solve(...)
       # IRLS / CD inner loops
       # Reads X, y, weights, offset, P1, P2, alpha
       # Does NOT read self.term_layout_
  -> self.coef_, self.intercept_ = ...

The layout is constructed once at the end of fit setup and then sits unread for the entire IRLS/CD inner loop. Zero hot-path cost.

Planned consumers, all outside the IRLS inner loop:

Future consumer Hot path? Lookup pattern
expand_categorical_penalties migration (closes the _glm.py:2039 TODO) No -- once during fit setup layout[term_name] for each user-supplied per-term penalty
Wald-test name lookup at _glm.py:1349 No -- once at model.wald_test(...) time layout[term_name].start:.stop to slice coef_ and the covariance
Penalty / PenaltyTerm assembly (next PR) No -- once per outer lambda update iterate layout.terms, write each S_m block into the global penalty matrix
Smooth construction (s(...) / te(...)) No -- once at fit setup iterate layout.terms, build basis + penalty per smooth term
Smooth-term partial-effect prediction No -- once at predict_smooth(...) time layout[term_name] to slice coef_ for one smooth

TermLayout is metadata that lives at the boundary between user-facing API (term names, formulas) and the numerical solver (column indices), not inside the solver. The hot path (build_hessian_delta, the Cython CD sweep, distribution gradient/hessian calls) keeps operating on assembled numpy / sparse arrays of length n_features, exactly as it does today.

Performance

How to read these numbers

Two things are being measured, and they sit in different places:

  • Build cost (TermLayout.from_term_names) is paid exactly once per .fit(), in fit setup, before the solver runs. It is the only cost this PR adds to a fit.
  • Lookup cost is paid zero times during .fit() because no consumer in this PR reads the layout. It will be paid by future consumers listed in the section above, all of which run outside the IRLS inner loop. The lookup number is reported as a forward-looking budget for those consumers.

So the "Build / fit ratio" rows below are the worst-case overhead this PR introduces -- a closed bound on the fit-time cost.

Source of the numbers

The script that produced these numbers is reproduced verbatim at the bottom of this section. It uses time.perf_counter_ns, runs on Python 3.14 / M-series Mac, with GC disabled inside each timed loop, plus a 1 000-call warm-up before measurement. Each row is n=200 independent repeats; each repeat times an inner loop of inner calls and reports the inner-loop-averaged seconds-per-call. n=80 for .fit() (slower, fewer repeats). 95% CIs are normal-approximation on the mean; the empirical 2.5 / 97.5% columns are direct percentiles of the 200 per-repeat means.

TermLayout.from_term_names build cost

Workload Mean 95% CI on mean Median Std Min / Max Empirical 2.5 / 97.5%
50 single-coef terms (w/ intercept) 32.07 us [32.00, 32.13] 31.99 us 0.48 us 31.43 / 36.69 [31.63, 32.59]
500 cols / 5 cat terms 15.56 us [15.54, 15.58] 15.54 us 0.18 us 15.18 / 16.10 [15.26, 15.90]
500 single-coef terms 314.54 us [313.93, 315.15] 313.63 us 4.41 us 308.07 / 335.57 [309.07, 325.49]
1 000 single-coef terms 639.55 us [637.61, 641.49] 636.60 us 13.97 us 627.48 / 781.24 [628.24, 667.79]
10 000 single-coef terms 6.50 ms [6.48, 6.52] 6.47 ms 0.13 ms 6.35 / 7.24 [6.37, 6.82]

Linear scaling in n_terms confirmed: ~0.65 us per term across all scales. Categorical case is sub-linear because the inner-loop count is n_terms, not n_features.

Lookup by name (forward-looking budget for future consumers)

Workload Mean 95% CI Median Std
50-term layout, name -> slice 0.06 us [0.06, 0.06] 0.06 us <0.01 us

~60 ns. Constant in n_terms (dict lookup).

Reference cost it competes with

Workload Mean 95% CI Median Std
.fit() 1000 x 50, normal 1.04 ms [1.03, 1.05] 1.04 ms 0.06 ms

Overhead in context

Workload Build / fit ratio
50 features (typical small fit) 32 us / 1 040 us = ~3.1%
500 cols / 5 cat terms 16 us / 1 040 us = ~1.5%
10 000 features 6.5 ms / ~3.8 s = <0.2%

The 3% on the small case is the floor and is dominated by n_terms TermSlice dataclass allocations -- intrinsic given the public API. A perf-regression guard test (5 ms ceiling at 1 000 terms vs measured 0.64 ms) fails loudly on accidental quadratic regressions.

Benchmark script

import gc, statistics, time
import numpy as np
from glum._glm import GeneralizedLinearRegressor
from glum._term_layout import TermLayout


def measure(fn, *, inner, repeats):
    times = []
    for _ in range(repeats):
        gc.collect(); gc.disable()
        try:
            t0 = time.perf_counter_ns()
            for _ in range(inner):
                fn()
            t1 = time.perf_counter_ns()
        finally:
            gc.enable()
        times.append((t1 - t0) / inner / 1e9)
    return times

# Warm up
for _ in range(1000):
    TermLayout.from_term_names(["x"], False)

# Build cost
small = [f"x{i}" for i in range(50)]
times = measure(lambda: TermLayout.from_term_names(small, True),
                inner=2000, repeats=200)
# ... see PR for full set of workloads ...

# Lookup cost
layout = TermLayout.from_term_names(small, True)
times = measure(lambda: layout["x42"], inner=20000, repeats=200)

# Reference .fit()
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50)); y = rng.standard_normal(1000)
times = measure(lambda: GeneralizedLinearRegressor(family="normal", alpha=0.1).fit(X, y),
                inner=2, repeats=80)

Compatibility

  • Public API: additive only. New attribute term_layout_ after fit; no existing attribute removed or changed.
  • 284-test formula + base regression suite: unchanged, all green.
  • Pickle: TermLayout is a stdlib dataclass, pickles cleanly. Older pickles that predate this PR will still load -- term_layout_ is just absent.

Follow-ups (not in this PR)

  • Migrate expand_categorical_penalties to use TermLayout, closing the formula+categorical TODO at _glm.py:2039.
  • Migrate Wald-test name lookup at _glm.py:1349 onto TermLayout.
  • Introduce PenaltyTerm / Penalty (per-term lambda vectors, block S matrices) on top of this registry -- the actual prerequisite for splines.

@PhilippKopperQC

Copy link
Copy Markdown
Author

The windows failure is unrelated to this PR.

@PhilippKopperQC

Copy link
Copy Markdown
Author

Martin Stancsics (@stanmart) as discussed many times, wdyt?

@stanmart Martin Stancsics (stanmart) left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all for adding more GAM-like functionality and better spline handling to glum. Defining penalties for blocks/terms instead of coefficients is indeed a good step in that direction. With that said, I'm less sure about this specific PR.

My main issue is that we expand the public API with a new attribute without any concrete consumer. I.e., we can't be sure that this is the precisely correct interface for what we want to do later on. Additionally, there is nothing in from_term_names that can only be performed at fit time -- it operates on attributes that are available any time after fitting. We could just do it on the fly (benchmark also seem to imply that it will be cheap to do so) if/when we add functionality that needs it.

Comment thread src/glum/_glm.py

self.feature_names_ = X.get_names(type="column", missing_prefix="_col_") # type: ignore
self.term_names_ = X.get_names(type="term", missing_prefix="_col_")
self.term_layout_ = TermLayout.from_term_names(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only depends on attributes that are always available after fit. Why do we want to store it instead of just computing it on the fly when we have a use case?

@MarcAntoineSchmidtQC

Copy link
Copy Markdown
Member

The CI failure on the windows runner is due to a switch of OS for the windows runner (windows-latest). I'm working on a fix. In the meantime, if you want to run the CI, you can change "windows-latest" to "windows-2022" in the CI config files.

@stanmart

Copy link
Copy Markdown
Collaborator

Philipp Kopper (@PhilippKopperQC) as a concrete way forward: how would you feel about turning adding TermLayout without exporting it publicly and assigning it to an attribute for now? The class/concept/tests are valuable, and that would let us make this first step without committing to any public change.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to move forward with this, I would like to minimize the memory footprint of this feature. It seems we are storing the terms twice.

It would also be good (but not absolutely necessary) to be able to turn it off. If we want to create a fast solver for the case where K > N, I think the performance hit of this may become a problem.

Comment thread src/glum/_term_layout.py
name: str
start: int
stop: int
kind: str

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does kind mean? What value can it take? Please add a docstring for the attributes that are not obvious.

Comment thread src/glum/_term_layout.py

terms: tuple[TermSlice, ...]
has_intercept: bool
_by_name: dict[str, TermSlice]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we storing the data for the Terms twice? (once in terms and once in _by_name?

@MarcAntoineSchmidtQC

Copy link
Copy Markdown
Member

Also, if the goal is to support splines, let's hold off on merging this to main. I am very open to having a long-lasting branch that sub-branches merge into.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants