Add `TermLayout` -- term-to-coefficient slice registry by PhilippKopperQC · Pull Request #1007 · Quantco/glum

Philipp Kopper (PhilippKopperQC) · 2026-06-14T16:48:22Z

Why

Splines (mgcv-style s(...) / te(...) smooths) need to map each smooth term to a contiguous block of coefficients so a single lambda can penalize a whole block. glum currently exposes only a flat per-coefficient term_names_ list, with no way to ask "which coefficients does term X own?" without re-scanning that list. This PR adds the registry; later PRs migrate consumers (penalty assembly, Wald-test name lookup, categorical penalty expansion) onto it.
This is the first PR in the splines roadmap -- purely additive, no behavior change.

What

New module src/glum/_term_layout.py with two frozen, slotted dataclasses:
- TermSlice(name, start, stop, kind) -- one term, half-open coef range over the full coefficient vector (intercept-aware).
- TermLayout -- ordered tuple of TermSlice + has_intercept, with O(1) name lookup, iteration, and membership.
GeneralizedLinearRegressorBase._set_up_for_fit populates model.term_layout_ next to the existing term_names_ assignment. Both fit branches (formula and array/DataFrame) flow through the same point.
New test file tests/glm/test_term_layout.py -- 20 tests, all AAA-structured and strictly behavioral (partition property, slice -> column round-trip after .fit(), determinism, lookup contracts, edge cases).

Consumers

There are exactly two non-test references to TermLayout after this PR: the import at src/glum/_glm.py:51 and the construction at src/glum/_glm.py:2166. The execution graph during .fit() is:

GeneralizedLinearRegressor.fit(X, y)
  -> _set_up_for_fit(X, y, ...)
       ...
       self.feature_names_ = X.get_names("column")
       self.term_names_    = X.get_names("term")
       self.term_layout_   = TermLayout.from_term_names(...)   # WRITE (this PR)
  -> _solve(...)
       # IRLS / CD inner loops
       # Reads X, y, weights, offset, P1, P2, alpha
       # Does NOT read self.term_layout_
  -> self.coef_, self.intercept_ = ...

The layout is constructed once at the end of fit setup and then sits unread for the entire IRLS/CD inner loop. Zero hot-path cost.

Planned consumers, all outside the IRLS inner loop:

Future consumer	Hot path?	Lookup pattern
`expand_categorical_penalties` migration (closes the `_glm.py:2039` TODO)	No -- once during fit setup	`layout[term_name]` for each user-supplied per-term penalty
Wald-test name lookup at `_glm.py:1349`	No -- once at `model.wald_test(...)` time	`layout[term_name].start:.stop` to slice `coef_` and the covariance
`Penalty` / `PenaltyTerm` assembly (next PR)	No -- once per outer lambda update	iterate `layout.terms`, write each `S_m` block into the global penalty matrix
Smooth construction (`s(...)` / `te(...)`)	No -- once at fit setup	iterate `layout.terms`, build basis + penalty per smooth term
Smooth-term partial-effect prediction	No -- once at `predict_smooth(...)` time	`layout[term_name]` to slice `coef_` for one smooth

TermLayout is metadata that lives at the boundary between user-facing API (term names, formulas) and the numerical solver (column indices), not inside the solver. The hot path (build_hessian_delta, the Cython CD sweep, distribution gradient/hessian calls) keeps operating on assembled numpy / sparse arrays of length n_features, exactly as it does today.

Performance

How to read these numbers

Two things are being measured, and they sit in different places:

Build cost (TermLayout.from_term_names) is paid exactly once per .fit(), in fit setup, before the solver runs. It is the only cost this PR adds to a fit.
Lookup cost is paid zero times during .fit() because no consumer in this PR reads the layout. It will be paid by future consumers listed in the section above, all of which run outside the IRLS inner loop. The lookup number is reported as a forward-looking budget for those consumers.

So the "Build / fit ratio" rows below are the worst-case overhead this PR introduces -- a closed bound on the fit-time cost.

Source of the numbers

The script that produced these numbers is reproduced verbatim at the bottom of this section. It uses time.perf_counter_ns, runs on Python 3.14 / M-series Mac, with GC disabled inside each timed loop, plus a 1 000-call warm-up before measurement. Each row is n=200 independent repeats; each repeat times an inner loop of inner calls and reports the inner-loop-averaged seconds-per-call. n=80 for .fit() (slower, fewer repeats). 95% CIs are normal-approximation on the mean; the empirical 2.5 / 97.5% columns are direct percentiles of the 200 per-repeat means.

`TermLayout.from_term_names` build cost

Workload	Mean	95% CI on mean	Median	Std	Min / Max	Empirical 2.5 / 97.5%
50 single-coef terms (w/ intercept)	32.07 us	[32.00, 32.13]	31.99 us	0.48 us	31.43 / 36.69	[31.63, 32.59]
500 cols / 5 cat terms	15.56 us	[15.54, 15.58]	15.54 us	0.18 us	15.18 / 16.10	[15.26, 15.90]
500 single-coef terms	314.54 us	[313.93, 315.15]	313.63 us	4.41 us	308.07 / 335.57	[309.07, 325.49]
1 000 single-coef terms	639.55 us	[637.61, 641.49]	636.60 us	13.97 us	627.48 / 781.24	[628.24, 667.79]
10 000 single-coef terms	6.50 ms	[6.48, 6.52]	6.47 ms	0.13 ms	6.35 / 7.24	[6.37, 6.82]

Linear scaling in n_terms confirmed: ~0.65 us per term across all scales. Categorical case is sub-linear because the inner-loop count is n_terms, not n_features.

Lookup by name (forward-looking budget for future consumers)

Workload	Mean	95% CI	Median	Std
50-term layout, name -> slice	0.06 us	[0.06, 0.06]	0.06 us	<0.01 us

~60 ns. Constant in n_terms (dict lookup).

Reference cost it competes with

Workload	Mean	95% CI	Median	Std
`.fit()` 1000 x 50, normal	1.04 ms	[1.03, 1.05]	1.04 ms	0.06 ms

Overhead in context

Workload	Build / fit ratio
50 features (typical small fit)	32 us / 1 040 us = ~3.1%
500 cols / 5 cat terms	16 us / 1 040 us = ~1.5%
10 000 features	6.5 ms / ~3.8 s = <0.2%

The 3% on the small case is the floor and is dominated by n_terms TermSlice dataclass allocations -- intrinsic given the public API. A perf-regression guard test (5 ms ceiling at 1 000 terms vs measured 0.64 ms) fails loudly on accidental quadratic regressions.

Benchmark script

import gc, statistics, time
import numpy as np
from glum._glm import GeneralizedLinearRegressor
from glum._term_layout import TermLayout


def measure(fn, *, inner, repeats):
    times = []
    for _ in range(repeats):
        gc.collect(); gc.disable()
        try:
            t0 = time.perf_counter_ns()
            for _ in range(inner):
                fn()
            t1 = time.perf_counter_ns()
        finally:
            gc.enable()
        times.append((t1 - t0) / inner / 1e9)
    return times

# Warm up
for _ in range(1000):
    TermLayout.from_term_names(["x"], False)

# Build cost
small = [f"x{i}" for i in range(50)]
times = measure(lambda: TermLayout.from_term_names(small, True),
                inner=2000, repeats=200)
# ... see PR for full set of workloads ...

# Lookup cost
layout = TermLayout.from_term_names(small, True)
times = measure(lambda: layout["x42"], inner=20000, repeats=200)

# Reference .fit()
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50)); y = rng.standard_normal(1000)
times = measure(lambda: GeneralizedLinearRegressor(family="normal", alpha=0.1).fit(X, y),
                inner=2, repeats=80)

Compatibility

Public API: additive only. New attribute term_layout_ after fit; no existing attribute removed or changed.
284-test formula + base regression suite: unchanged, all green.
Pickle: TermLayout is a stdlib dataclass, pickles cleanly. Older pickles that predate this PR will still load -- term_layout_ is just absent.

Follow-ups (not in this PR)

Migrate expand_categorical_penalties to use TermLayout, closing the formula+categorical TODO at _glm.py:2039.
Migrate Wald-test name lookup at _glm.py:1349 onto TermLayout.
Introduce PenaltyTerm / Penalty (per-term lambda vectors, block S matrices) on top of this registry -- the actual prerequisite for splines.

Philipp Kopper (PhilippKopperQC) · 2026-06-14T17:05:39Z

The windows failure is unrelated to this PR.

Philipp Kopper (PhilippKopperQC) · 2026-06-14T17:05:56Z

Martin Stancsics (@stanmart) as discussed many times, wdyt?

Martin Stancsics (stanmart)

I'm all for adding more GAM-like functionality and better spline handling to glum. Defining penalties for blocks/terms instead of coefficients is indeed a good step in that direction. With that said, I'm less sure about this specific PR.

My main issue is that we expand the public API with a new attribute without any concrete consumer. I.e., we can't be sure that this is the precisely correct interface for what we want to do later on. Additionally, there is nothing in from_term_names that can only be performed at fit time -- it operates on attributes that are available any time after fitting. We could just do it on the fly (benchmark also seem to imply that it will be cheap to do so) if/when we add functionality that needs it.

Martin Stancsics (stanmart) · 2026-06-15T09:23:53Z


        self.feature_names_ = X.get_names(type="column", missing_prefix="_col_")  # type: ignore
        self.term_names_ = X.get_names(type="term", missing_prefix="_col_")
+        self.term_layout_ = TermLayout.from_term_names(


This only depends on attributes that are always available after fit. Why do we want to store it instead of just computing it on the fly when we have a use case?

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) · 2026-06-15T14:55:50Z

The CI failure on the windows runner is due to a switch of OS for the windows runner (windows-latest). I'm working on a fix. In the meantime, if you want to run the CI, you can change "windows-latest" to "windows-2022" in the CI config files.

Martin Stancsics (stanmart) · 2026-06-16T11:13:30Z

Philipp Kopper (@PhilippKopperQC) as a concrete way forward: how would you feel about turning adding TermLayout without exporting it publicly and assigning it to an attribute for now? The class/concept/tests are valuable, and that would let us make this first step without committing to any public change.

Marc-Antoine Schmidt (MarcAntoineSchmidtQC)

If we want to move forward with this, I would like to minimize the memory footprint of this feature. It seems we are storing the terms twice.

It would also be good (but not absolutely necessary) to be able to turn it off. If we want to create a fast solver for the case where K > N, I think the performance hit of this may become a problem.

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) · 2026-06-16T13:59:21Z

+    name: str
+    start: int
+    stop: int
+    kind: str


What does kind mean? What value can it take? Please add a docstring for the attributes that are not obvious.

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) · 2026-06-16T14:05:26Z

+
+    terms: tuple[TermSlice, ...]
+    has_intercept: bool
+    _by_name: dict[str, TermSlice]


Are we storing the data for the Terms twice? (once in terms and once in _by_name?

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) · 2026-06-16T14:13:57Z

Also, if the goal is to support splines, let's hold off on merging this to main. I am very open to having a long-lasting branch that sub-branches merge into.

Terms implementation

f637273

Philipp Kopper (PhilippKopperQC) requested review from Marc-Antoine Schmidt (MarcAntoineSchmidtQC), Jan Tilly (jtilly) and Martin Stancsics (stanmart) as code owners June 14, 2026 16:48

Martin Stancsics (stanmart) reviewed Jun 15, 2026

View reviewed changes

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) reviewed Jun 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `TermLayout` -- term-to-coefficient slice registry#1007

Add `TermLayout` -- term-to-coefficient slice registry#1007
Philipp Kopper (PhilippKopperQC) wants to merge 1 commit into
Quantco:mainfrom
PhilippKopperQC:terms

Philipp Kopper (PhilippKopperQC) commented Jun 14, 2026 •

edited

Loading

Uh oh!

Philipp Kopper (PhilippKopperQC) commented Jun 14, 2026

Uh oh!

Philipp Kopper (PhilippKopperQC) commented Jun 14, 2026

Uh oh!

Martin Stancsics (stanmart) left a comment •

edited by lbittarello

Loading

Uh oh!

Martin Stancsics (stanmart) Jun 15, 2026

Uh oh!

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) commented Jun 15, 2026

Uh oh!

Martin Stancsics (stanmart) commented Jun 16, 2026

Uh oh!

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) left a comment

Uh oh!

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) Jun 16, 2026

Uh oh!

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) Jun 16, 2026

Uh oh!

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Philipp Kopper (PhilippKopperQC) commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

Consumers

Performance

How to read these numbers

Source of the numbers

TermLayout.from_term_names build cost

Lookup by name (forward-looking budget for future consumers)

Reference cost it competes with

Overhead in context

Benchmark script

Compatibility

Follow-ups (not in this PR)

Uh oh!

Philipp Kopper (PhilippKopperQC) commented Jun 14, 2026

Uh oh!

Philipp Kopper (PhilippKopperQC) commented Jun 14, 2026

Uh oh!

Martin Stancsics (stanmart) left a comment • edited by lbittarello Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Martin Stancsics (stanmart) Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) commented Jun 15, 2026

Uh oh!

Martin Stancsics (stanmart) commented Jun 16, 2026

Uh oh!

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) left a comment

Choose a reason for hiding this comment

Uh oh!

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Philipp Kopper (PhilippKopperQC) commented Jun 14, 2026 •

edited

Loading

`TermLayout.from_term_names` build cost

Martin Stancsics (stanmart) left a comment •

edited by lbittarello

Loading