Add TermLayout -- term-to-coefficient slice registry#1007
Add TermLayout -- term-to-coefficient slice registry#1007Philipp Kopper (PhilippKopperQC) wants to merge 1 commit into
TermLayout -- term-to-coefficient slice registry#1007Conversation
|
The windows failure is unrelated to this PR. |
|
Martin Stancsics (@stanmart) as discussed many times, wdyt? |
There was a problem hiding this comment.
I'm all for adding more GAM-like functionality and better spline handling to glum. Defining penalties for blocks/terms instead of coefficients is indeed a good step in that direction. With that said, I'm less sure about this specific PR.
My main issue is that we expand the public API with a new attribute without any concrete consumer. I.e., we can't be sure that this is the precisely correct interface for what we want to do later on. Additionally, there is nothing in from_term_names that can only be performed at fit time -- it operates on attributes that are available any time after fitting. We could just do it on the fly (benchmark also seem to imply that it will be cheap to do so) if/when we add functionality that needs it.
|
|
||
| self.feature_names_ = X.get_names(type="column", missing_prefix="_col_") # type: ignore | ||
| self.term_names_ = X.get_names(type="term", missing_prefix="_col_") | ||
| self.term_layout_ = TermLayout.from_term_names( |
There was a problem hiding this comment.
This only depends on attributes that are always available after fit. Why do we want to store it instead of just computing it on the fly when we have a use case?
|
The CI failure on the windows runner is due to a switch of OS for the windows runner (windows-latest). I'm working on a fix. In the meantime, if you want to run the CI, you can change "windows-latest" to "windows-2022" in the CI config files. |
|
Philipp Kopper (@PhilippKopperQC) as a concrete way forward: how would you feel about turning adding |
Marc-Antoine Schmidt (MarcAntoineSchmidtQC)
left a comment
There was a problem hiding this comment.
If we want to move forward with this, I would like to minimize the memory footprint of this feature. It seems we are storing the terms twice.
It would also be good (but not absolutely necessary) to be able to turn it off. If we want to create a fast solver for the case where K > N, I think the performance hit of this may become a problem.
| name: str | ||
| start: int | ||
| stop: int | ||
| kind: str |
There was a problem hiding this comment.
What does kind mean? What value can it take? Please add a docstring for the attributes that are not obvious.
|
|
||
| terms: tuple[TermSlice, ...] | ||
| has_intercept: bool | ||
| _by_name: dict[str, TermSlice] |
There was a problem hiding this comment.
Are we storing the data for the Terms twice? (once in terms and once in _by_name?
|
Also, if the goal is to support splines, let's hold off on merging this to main. I am very open to having a long-lasting branch that sub-branches merge into. |
Why
Splines (mgcv-style
s(...)/te(...)smooths) need to map each smooth term to a contiguous block of coefficients so a single lambda can penalize a whole block. glum currently exposes only a flat per-coefficientterm_names_list, with no way to ask "which coefficients does term X own?" without re-scanning that list. This PR adds the registry; later PRs migrate consumers (penalty assembly, Wald-test name lookup, categorical penalty expansion) onto it.This is the first PR in the splines roadmap -- purely additive, no behavior change.
What
src/glum/_term_layout.pywith two frozen, slotted dataclasses:TermSlice(name, start, stop, kind)-- one term, half-open coef range over the full coefficient vector (intercept-aware).TermLayout-- ordered tuple ofTermSlice+has_intercept, with O(1) name lookup, iteration, and membership.GeneralizedLinearRegressorBase._set_up_for_fitpopulatesmodel.term_layout_next to the existingterm_names_assignment. Both fit branches (formula and array/DataFrame) flow through the same point.tests/glm/test_term_layout.py-- 20 tests, all AAA-structured and strictly behavioral (partition property, slice -> column round-trip after.fit(), determinism, lookup contracts, edge cases).Consumers
There are exactly two non-test references to
TermLayoutafter this PR: the import atsrc/glum/_glm.py:51and the construction atsrc/glum/_glm.py:2166. The execution graph during.fit()is:The layout is constructed once at the end of fit setup and then sits unread for the entire IRLS/CD inner loop. Zero hot-path cost.
Planned consumers, all outside the IRLS inner loop:
expand_categorical_penaltiesmigration (closes the_glm.py:2039TODO)layout[term_name]for each user-supplied per-term penalty_glm.py:1349model.wald_test(...)timelayout[term_name].start:.stopto slicecoef_and the covariancePenalty/PenaltyTermassembly (next PR)layout.terms, write eachS_mblock into the global penalty matrixs(...)/te(...))layout.terms, build basis + penalty per smooth termpredict_smooth(...)timelayout[term_name]to slicecoef_for one smoothTermLayoutis metadata that lives at the boundary between user-facing API (term names, formulas) and the numerical solver (column indices), not inside the solver. The hot path (build_hessian_delta, the Cython CD sweep, distribution gradient/hessian calls) keeps operating on assemblednumpy/ sparse arrays of lengthn_features, exactly as it does today.Performance
How to read these numbers
Two things are being measured, and they sit in different places:
TermLayout.from_term_names) is paid exactly once per.fit(), in fit setup, before the solver runs. It is the only cost this PR adds to a fit..fit()because no consumer in this PR reads the layout. It will be paid by future consumers listed in the section above, all of which run outside the IRLS inner loop. The lookup number is reported as a forward-looking budget for those consumers.So the "Build / fit ratio" rows below are the worst-case overhead this PR introduces -- a closed bound on the fit-time cost.
Source of the numbers
The script that produced these numbers is reproduced verbatim at the bottom of this section. It uses
time.perf_counter_ns, runs on Python 3.14 / M-series Mac, with GC disabled inside each timed loop, plus a 1 000-call warm-up before measurement. Each row isn=200independent repeats; each repeat times an inner loop ofinnercalls and reports the inner-loop-averaged seconds-per-call.n=80for.fit()(slower, fewer repeats). 95% CIs are normal-approximation on the mean; the empirical 2.5 / 97.5% columns are direct percentiles of the 200 per-repeat means.TermLayout.from_term_namesbuild costLinear scaling in
n_termsconfirmed: ~0.65 us per term across all scales. Categorical case is sub-linear because the inner-loop count isn_terms, notn_features.Lookup by name (forward-looking budget for future consumers)
~60 ns. Constant in
n_terms(dict lookup).Reference cost it competes with
.fit()1000 x 50, normalOverhead in context
The 3% on the small case is the floor and is dominated by
n_termsTermSlicedataclass allocations -- intrinsic given the public API. A perf-regression guard test (5 ms ceiling at 1 000 terms vs measured 0.64 ms) fails loudly on accidental quadratic regressions.Benchmark script
Compatibility
term_layout_after fit; no existing attribute removed or changed.TermLayoutis a stdlib dataclass, pickles cleanly. Older pickles that predate this PR will still load --term_layout_is just absent.Follow-ups (not in this PR)
expand_categorical_penaltiesto useTermLayout, closing the formula+categorical TODO at_glm.py:2039._glm.py:1349ontoTermLayout.PenaltyTerm/Penalty(per-term lambda vectors, block S matrices) on top of this registry -- the actual prerequisite for splines.