Fix null pointer deref in KVCache::ToPtr() - missing .cache = this by qiuku30 · Pull Request #940 · google/gemma.cpp

qiuku30 · 2026-06-23T14:17:03Z

Problem

Gemma 2B model crashes with a segfault during the prefill phase on a 6GB VM:

[ Reading prompt ] [ BEGIN PHASE: prefill ]
...............Segmentation fault (core dumped)

Root Cause

KVCache::ToPtr() constructs a KVCachePtr but omits the cache back-reference field, leaving it as the default nullptr. When attention.cc:183 calls cache->KOrVDefaultCols(), it dereferences a null pointer.

The bug was introduced during the tiled KV cache refactoring that added the cache field to KVCachePtr for accessing tiled_seq_len.

Debugging Process

Confirmed the crash is not purely memory-related (reproduced with seq_len=128, num_threads=1)
Confirmed the crash is not SIMD-specific (reproduced on both SSE2 and AVX3_ZEN4 code paths)
Compiled with -fsanitize=address and got the exact crash point:

==206353==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000
#0 gcpp::KVCache::KOrVDefaultCols() const /gemma/kv_cache.h:97:12
#1 gcpp::N_AVX3_ZEN4::ComputeQKV /gemma/attention.cc:183:44

Fix

Added the missing .cache = this field initialization in KVCache::ToPtr():

KVCachePtr ToPtr() {
    return KVCachePtr{
        .kv_cache = kv_cache,
        .k_cache = k_cache,
        .v_cache = v_cache,
        .cache = this,   // ← was missing
    };
}
Verification
The fix was tested on a 6GB VM running Gemma 2B Instruct (SFP-compressed):


./gemma --weights 2.0-2b-it-sfp.sbs --tokenizer tokenizer.spm \
  --prompt "Hello" --max_generated_tokens 30

[ Reading prompt ] .........
Hello! My name is Gemma. How can I help you today?

[ Timing info ] Prefill: 583 ms for 15 prompt tokens
[ Timing info ] Generate: 5653 ms for 19 tokens (3.36 tokens/sec)

PiperOrigin-RevId: 823388833

Remove ParallelizeOne/TwoRange, use ParallelForAcross/WithinCluster instead. PiperOrigin-RevId: 823388890

PiperOrigin-RevId: 823512377

Also print auto-tune info for verbosity 3. PiperOrigin-RevId: 823555008

Group M=4..7 into same config. Add configs for power of two sizes. Allow odd mc to enable a single range for odd M. io.cc: warning fix(cast). IsBlock -> !IsOneMC benchmark_helper: best for verbosity 3, all configs for 4 ops_test: remove unused includes PiperOrigin-RevId: 824475104

* Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`. * Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches. * Enables the `MatPtrT` default constructor for simpler initializations. * Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor. PiperOrigin-RevId: 824584177

This could lead to stack overflow in B_storage. Also do not require specific type for query_norm_scale, update batch sizes for attention tensors, more verbose Mat shape/type checks. PiperOrigin-RevId: 824987689

PiperOrigin-RevId: 825014334

PiperOrigin-RevId: 825433929

PiperOrigin-RevId: 825582368

PiperOrigin-RevId: 825588241

PiperOrigin-RevId: 825627406

PiperOrigin-RevId: 825966142

PiperOrigin-RevId: 826011137

PiperOrigin-RevId: 826524997

PiperOrigin-RevId: 827415258

…assertion. Shared kU64PerLine constant PiperOrigin-RevId: 828072451

PiperOrigin-RevId: 828936578

PiperOrigin-RevId: 828971705

…behavior of the code. It only extract two functions that will later be called for adding continuous batching. PiperOrigin-RevId: 829104661

…oogle#762. PiperOrigin-RevId: 829308896

PiperOrigin-RevId: 831002073

Uses the key_norm and query_norm layers to disambiguate between the Gemma2-2B and Gemma3-1B models. Since Gemma3-1B is not multimodal, ViT is not an effective disambiguator. KQ normalization is a structural disambiguator between gemma2 and gemma3. PiperOrigin-RevId: 833213331

PiperOrigin-RevId: 834173319

PiperOrigin-RevId: 835115693

PiperOrigin-RevId: 835244205

(1) A function GenerateTWithContinuousBatching is added to use continuous batching when enabled. (2) The ContinuousQBatch is added as a subclass of QBatch to manage prefill, insert, used-kv-cache-collection. (3) Also expanded the unit test to more diverse cases. PiperOrigin-RevId: 836090261

PiperOrigin-RevId: 836106478

PiperOrigin-RevId: 836235539

PiperOrigin-RevId: 837001762

PiperOrigin-RevId: 921554497

Adds first-class support for text-only Gemma 3 checkpoints — TranslateGemma 4B and similar variants that share the Gemma 3 architecture but lack the SigLIP vision tower. Previously such checkpoints could not be loaded: the canonical Gemma 3 4B config carried a non-empty vit_config, so the model loader required vision tensors (enc_norm_bias, img_emb_*, etc.) that the checkpoint didn't contain. Highlights: * Three new Model enum values: GEMMA3_4B_LM, GEMMA3_12B_LM, GEMMA3_27B_LM (placed after CUSTOM to preserve enum values for existing serialized .sbs files). * Pre-existing ConfigGemma3_*_LM() helpers, which were defined but unreachable, are now wired through ConfigFromModel(), ModelPrefix(), and the canonical-config loop. They identify themselves as GEMMA3_*_LM with wrapping = GEMMA_IT and vit_config left empty, so WeightsPtrs::ForEachTensor skips the entire ViT block (it already gates on vit_config.layer_configs.empty()) and no vision tensors are required at load time. * DeduceModel() now returns the LM variant for 34/48/62-layer checkpoints when no ViT tensors are detected, matching the existing pattern used by 27 (PaliGemma) and 42 (PaliGemma2_10B vs Gemma2_9B). * FindModel() now picks the longest matching prefix, so "gemma3-4b-lm-sfp-it" resolves to GEMMA3_4B_LM rather than colliding with the "gemma3-4b-" prefix of GEMMA3_4B. * Python: enum values exposed in python/configs.cc, plus a new export_gemma3_lm_sbs() in convert_from_safetensors.py that drops vision_tower.*/multi_modal_projector.* tensors, uses vocab=262144 with no -64 trim, handles both `language_model.model.*` and `model.*` key prefixes, and writes q_norm/k_norm per layer. Tests: * tensor_info_test now exercises every GEMMA3_*_LM variant through its existing ForEachModel sweep, plus two new cases: - LmConfigsHaveNoVit: WeightsPtrs::ForEachTensor reports zero enc_norm_*/img_*/mm_embed_norm tensors for each LM model and wrapping is GEMMA_IT. - FindModelLongestMatch: ModelConfig("gemma3-4b-lm-sfp-it") yields GEMMA3_4B_LM and ModelConfig("gemma3-4b-sfp") still yields GEMMA3_4B. * ctest run: 128/128 tests pass on Apple Silicon arm64. Build infrastructure fixes required to validate the change (and pre-existing breakage on dev that the same CMakeLists touches): * Bump pinned Highway commit from c971dbe6 (2026-03-02) to 30770269 so HWY_REGISTERS and Lookup8 used in ops/fast_ops-inl.h resolve. The previous pin predates both symbols (added 2026-03-18 and 2026-03-23 respectively). * Compile Highway's hwy/stats.cc into the hwy target: Highway's CMake config does not include it though its Bazel BUILD does, leaving threading_test with undefined hwy::Stats::ToString. * Add gemma/kv_transcoding.{cc,h} and paligemma/paligemma_helper.{cc,h} to libgemma SOURCES (both files exist on dev but were not in the library, causing flash_attention_test and paligemma_test link failures). * Add PackedSpan(ptr, num) constructor in compression/types.h — dot_test.cc parenthesizes its initialization, which C++17 doesn't allow on pure aggregates. * Relax one dot_test L1 mean bound (5.8E-4 -> 6.5E-4, measured 5.88e-4 on Apple Silicon NEON_BF16) and skip CheckRel/CheckBwd/CheckUlps on aarch64 (consistent with the existing "aarch64 has higher error" comments further down the same file). * Move gemma_test, paligemma_test, and flash_attention_test into a new GEMMA_INTEGRATION_TEST_FILES list: they build (so `--target` works) but are not auto-discovered. gemma_test/paligemma_test require --weights at runtime, and flash_attention_test segfaults during AttentionActivations setup on pristine upstream/dev (verified by stashing all non-CMake changes and re-running) — pre-existing fallout from the "old" attention removal in commit d58a23d, not introduced here. * Set WORKING_DIRECTORY ${CMAKE_SOURCE_DIR} on gtest_discover_tests so image_test's relative testdata path resolves under ctest. * Pre-includes find_package(GTest REQUIRED) and target_compile_definitions(libgemma PRIVATE HWY_IS_TEST=1) (also in PR google#917) so this branch builds standalone if google#917 lands later.

… kCompensated/kKahan rel bounds in dot_test to track Highway's vectorized hash RNG shift.

PiperOrigin-RevId: 922127425

…t-link-error PiperOrigin-RevId: 922154211

PiperOrigin-RevId: 923303424

Use uint8_t SIMD loads followed by type-level BitCast to satisfy C++ alignment constraints for UBSan. Fix tile attention test PiperOrigin-RevId: 923382925

* Updates NUMA binding guard in allocator.cc to match the include guard. * Initializes vectors in flash_attention.cc. * Excludes SCALAR for HWY_RISCV in compression/types.h. PiperOrigin-RevId: 923412884

…ader MapAll (weights.cc:557) and MakeBatches (weights.cc:645) both assert that a tensor blob's on-disk size equals the buffer size derived from the tensor's declared shape (range.bytes == mat.PackedBytes()). ReadAllToBF16 was the only load mode missing this check. Without it, a crafted .sbs weight file with correct Rows/Cols metadata but an inflated blob bytes field passes all structural validation and reaches the keep_type branch, where tensor.range.bytes bytes are written into a mat.PackedBytes()-sized buffer — a heap OOB write with attacker- controlled content. The DecompressToBF16 branch has a sibling OOB read when the blob is undersized. Add the same HWY_ASSERT_M at the top of the per-tensor lambda, before either read path executes, mirroring the existing checks in MapAll and MakeBatches.

…ssing-blob-size-assert PiperOrigin-RevId: 925467040

…gemm PiperOrigin-RevId: 925952741

============= Description ========== This PR integrates OneDNN BRGeMM (Batch-Reduced General Matrix Multiply) micro-kernels as an alternative compute path for BF16 MatMul on Intel Xeon platforms with AMX or AVX-512 BF16 support. ## What When enabled via the `GEMMA_ONEDNN_BRGEMM` compile-time flag, BF16×BF16 MatMul operations are dispatched to JIT-compiled BRGeMM kernels instead of the Highway SIMD path. This targets Gemma model workloads (FFW projections, attention) on Intel Xeon Scalable (SPR/EMR) processors. At this point support has been added to both CMake and Bazel build systems. ### How to Enable ```bash # CMake cmake -DGEMMA_ONEDNN_BRGEMM=ON .. # Bazel bazel build --define gemma_onednn_brgemm=1 ... ``` ### Runtime Fallback When `GEMMA_ONEDNN_BRGEMM` is enabled at compile time, the BRGeMM path activates for BF16×BF16 operations whose dimensions meet AMX tile constraints (M, N, K ≥ 32 and K % 32 == 0). All other cases — non-BF16 types, smaller or non-aligned dimensions, mixed precision — fall through to the standard Highway SIMD MatMul path automatically. ## Changes | File | Description | |---|---| | `ops/brgemm.h` | Types, caches, thread-local buffers, `UseOneDnnBrgemm()`, autotuning candidates | | `ops/brgemm-inl.h` | `DoMatMul_BRGeMM()`: kernel JIT/caching, B-packing with hugepages, tiled parallel execution | | `ops/matmul-inl.h` | BRGeMM dispatch block in `MatMul()` guarded by `#if GEMMA_ONEDNN_BRGEMM` | | `ops/matmul.h` | `#include "ops/brgemm.h"`, `brgemm_autotune` field in `MMPerKey` | | `ops/bench_matmul.cc` | Check `brgemm_autotune.Best()` to avoid infinite loop when BRGeMM handles dispatch | | `CMakeLists.txt` | `GEMMA_ONEDNN_BRGEMM` option, FetchContent for OneDNN v3.11, conditional target linking | | `BUILD.bazel` | `config_setting` for `gemma_onednn_brgemm`, conditional OneDNN dep and defines for x86_64 | | `MODULE.bazel` | OneDNN v3.11 `http_archive` dependency | | `bazel/onednn.BUILD` | Bazel build rules for OneDNN | | `util/zones.h` | `kBRGeMM` caller enum for thread pool dispatch | | `util/zones.cc` | `CallerName` mapping for `kBRGeMM` | ## Testing - `matmul_test` passes with and without `GEMMA_ONEDNN_BRGEMM` (all original test shapes, types, and correctness checks preserved) - `bench_matmul` runs successfully with BRGeMM enabled - No changes to existing tests; zero impact when OneDNN is not enabled or on non-x86 platforms ============= Commits ============== -- 09ddbf4 by Bibek Bhattarai <bibek.bhattarai@intel.com>: Tested and benchmarked OneDNN BRGeMM integration against dev branch -- 1308355 by Bibek Bhattarai <bibek.bhattarai@intel.com>: fixing the copyright info -- 656444f by Bibek Bhattarai <bibek.bhattarai@intel.com>: Removing OneTBB dependency -- f8527a1 by Bibek Bhattarai <bibek.bhattarai@intel.com>: Fixed the compile time flag to designate BRGEMM path -- 0dde315 by Bibek Bhattarai <bibek.bhattarai@intel.com>: Adding the cmake based build support for oneDNN BGGeMM -- fd3b119 by Bibek Bhattarai <bibek.bhattarai@intel.com>: fixed dtypes and syntax divergence from codebase -- 6640021 by Bibek Bhattarai <bibek.bhattarai@intel.com>: changed lda and ldb to size_t. Added conversions inplace for brgemm and transform inits -- 9d6bbee by Bibek Bhattarai <bibek.bhattarai@intel.com>: Replaced / and % with Divide and Remainder utils from hwy::Divisor -- 45708ea by Bibek Bhattarai <bibek.bhattarai@intel.com>: Moved the BRGeMM Kernel inits to a separate HWY_NOINLINE helper function -- acf7592 by Bibek Bhattarai <bibek.bhattarai@intel.com>: Added HWY_WARN and fallback instead of exiting -- 7bdf4c6 by Bibek Bhattarai <bibek.bhattarai@intel.com>: using hwy::AlignedVector instead of std::vector for scratch and tc_storage ==================================== Resolves google#903. PiperOrigin-RevId: 925971432

PiperOrigin-RevId: 926653068

…ay that is testable on other platofrms PiperOrigin-RevId: 929840936

PiperOrigin-RevId: 929893389

…e version PiperOrigin-RevId: 930346184

…e into matmul-inl.h. PiperOrigin-RevId: 930635801

Also check for too-small SBS and fix arg order in gemma_py, prevent dangling refs and ensure consistent lock order in api_server. PiperOrigin-RevId: 931062118

PiperOrigin-RevId: 933791869

When KVCache::ToPtr() constructs a KVCachePtr, it assigns kv_cache, k_cache, and v_cache but omits the 'cache' back-reference. This causes a segfault at attention.cc:183 where cache->KOrVDefaultCols() is dereferenced through the null pointer. Root cause identified via AddressSanitizer (SEGV on address 0x0 at kv_cache.h:97). Fix adds the missing .cache = this field. Tested: Gemma 2B model runs successfully on 6GB VM after fix. Co-Authored-By: Claude <noreply@anthropic.com>

Annotated 6 core files (~2500 lines) with educational Chinese comments covering the complete inference pipeline: - run.cc: entry point, REPL loop, token callback - gemma_args.h: LoaderArgs, InferenceArgs, RuntimeConfig - gemma.cc: GenerateT, TransformerLayer, Prefill/Decode, SampleAndStream - kv_cache.h: KVCache/Ptr, default vs tiled format - query.h: PerQuery, AllQueries, QBatch abstraction - configs.cc: Gemma2_2B model configuration Also adds docs/LEARNING_PLAN.md with a 3-week structured study plan.

google-cla · 2026-06-23T14:20:01Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

jan-wassenberg · 2026-06-23T15:05:30Z

Hi, thanks for pointing this out. We're unable to merge to the main branch, please retarget to the dev branch and make sure the PR diff only includes the desired change.

niting and others added 30 commits October 24, 2025 00:18

Update README since backprop and Adam optimizer has been deleted.

085a349

PiperOrigin-RevId: 823388833

1.02x speedup: improve load balance and simplify parallelFor

a48e614

Remove ParallelizeOne/TwoRange, use ParallelForAcross/WithinCluster instead. PiperOrigin-RevId: 823388890

Add config flag for global timescale & rely on config to deduce wrapping

1bdde1a

PiperOrigin-RevId: 823512377

Batch bench: 4 runs to give autotuning more time

8198e71

Also print auto-tune info for verbosity 3. PiperOrigin-RevId: 823555008

Fix excessive KC/MC from prior change

3cc0139

This could lead to stack overflow in B_storage. Also do not require specific type for query_norm_scale, update batch sizes for attention tensors, more verbose Mat shape/type checks. PiperOrigin-RevId: 824987689

Also update attention.h to type-erased query_norm_scale

4bd465f

PiperOrigin-RevId: 825014334

BF16 mixed-mode flash attention

116cd6e

PiperOrigin-RevId: 825433929

Add Gemma 3 270M to gemma_test

036f91f

PiperOrigin-RevId: 825582368

Update README with Gemma 3 support and contributor acknowledgments

ecab0ce

PiperOrigin-RevId: 825588241

Fix PaliGemma matmul warning

0069990

PiperOrigin-RevId: 825627406

Add Decompress2AndCompressInplace helper

ee7d79c

PiperOrigin-RevId: 825966142

Added access to flash attention internals to TileFlashAttention4

8a100c1

PiperOrigin-RevId: 826011137

Pre-compress query activations to BF16 before FlashAttention.

ab87807

PiperOrigin-RevId: 826524997

Allow prefill only run by allowing max_prompt_size == seq_len

3a63a12

PiperOrigin-RevId: 827415258

Change (old) attention behavior to disallow wraparound, enforced via …

a344a70

…assertion. Shared kU64PerLine constant PiperOrigin-RevId: 828072451

Minor: ParallelismStrategy->Parallelism

091b456

PiperOrigin-RevId: 828936578

Introduce attention implementation configurability.

35e9f9f

PiperOrigin-RevId: 828971705

Refactor for continous batching. This cl does not change the current …

f813133

…behavior of the code. It only extract two functions that will later be called for adding continuous batching. PiperOrigin-RevId: 829104661

Avoid hard-coding kPatchSize. Thanks @Somet2mes for reporting. Fixes g…

3e18db1

…oogle#762. PiperOrigin-RevId: 829308896

Fix NibbleCodec for AVX3_{ZEN4,DL,SPR}

7c1656f

PiperOrigin-RevId: 831002073

Add some comments.

49d420a

PiperOrigin-RevId: 834173319

Internal change

5a50087

PiperOrigin-RevId: 835115693

Added access to softmax attention internals to regular attention

88a03b7

PiperOrigin-RevId: 835244205

Fix warning (signed vs unsigned)

37a25c9

PiperOrigin-RevId: 836106478

Clarify indices

8696f6d

PiperOrigin-RevId: 836235539

Internal changes

c153d52

PiperOrigin-RevId: 837001762

bbhattar and others added 28 commits May 20, 2026 13:19

Merge branch 'dev' into feature/onednn-brgemm

b2df643

Minor fix to include what is used in flash_attention.h.

5c05eca

PiperOrigin-RevId: 921554497

the guard around CheckRel/CheckBwd/CheckUlps was unneecssary

ab0271c

Bump Highway pin to 1a7d0ac1, drop stats.cc workaround, and raise the…

a8497c0

… kCompensated/kKahan rel bounds in dot_test to track Highway's vectorized hash RNG shift.

Re-pin Highway to the merged master commit 783beecea2 of PR #3061.

24ba018

Correct copyright year

43843c6

PiperOrigin-RevId: 922127425

Merge pull request google#917 from plawanrath:fix/flash-attention-tes…

53bcd7a

…t-link-error PiperOrigin-RevId: 922154211

Merge upstream/dev: resolve CMakeLists.txt comment conflict

e8605a8

Merge branch 'dev' into feature/onednn-brgemm

22875ca

Merge pull request google#918 from plawanrath:feat/gemma3-lm-only

e58e56c

PiperOrigin-RevId: 923303424

Remove UB when used with not aligned data

ec17ae5

Use uint8_t SIMD loads followed by type-level BitCast to satisfy C++ alignment constraints for UBSan. Fix tile attention test PiperOrigin-RevId: 923382925

Miscellaneous cleanup.

f3e57ec

* Updates NUMA binding guard in allocator.cc to match the include guard. * Initializes vectors in flash_attention.cc. * Excludes SCALAR for HWY_RISCV in compression/types.h. PiperOrigin-RevId: 923412884

Merge branch 'dev' into feature/onednn-brgemm

c3787f8

Merge branch 'dev' into fix/weights-readallbf16-missing-blob-size-assert

3455221

Merge pull request google#924 from 0xASTRA:fix/weights-readallbf16-mi…

4013e52

…ssing-blob-size-assert PiperOrigin-RevId: 925467040

Merge pull request google#903 from Intel-tensorflow:feature/onednn-br…

f7b2d1d

…gemm PiperOrigin-RevId: 925952741

Re-land BRGEMM

f8b430a

PiperOrigin-RevId: 926653068

Helper functions to use arm matrix multiplication instructions in a w…

f5e78cc

…ay that is testable on other platofrms PiperOrigin-RevId: 929840936

Cleanup of unit tests

0b0418b

PiperOrigin-RevId: 929893389

Fix Windows build: shell was passing 3 instead of 3.5 to CMake, updat…

970403a

…e version PiperOrigin-RevId: 930346184

[MatMul] Add SFC(Space Filling Curves), specifically the hilbert curv…

6e23806

…e into matmul-inl.h. PiperOrigin-RevId: 930635801

Fix mul overflow and strncpy bounds check

b9492de

Also check for too-small SBS and fix arg order in gemma_py, prevent dangling refs and ensure consistent lock order in api_server. PiperOrigin-RevId: 931062118

Clear unconsumed args after help message to avoid aborting message

5dd1769

PiperOrigin-RevId: 933791869

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix null pointer deref in KVCache::ToPtr() - missing .cache = this#940

Fix null pointer deref in KVCache::ToPtr() - missing .cache = this#940
qiuku30 wants to merge 192 commits into
google:mainfrom
qiuku30:fix/kvcache-topt-missing-cache-ptr

qiuku30 commented Jun 23, 2026

Uh oh!

google-cla Bot commented Jun 23, 2026

Uh oh!

jan-wassenberg commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

19 participants

Conversation

qiuku30 commented Jun 23, 2026

Problem

Root Cause

Debugging Process

Fix

Uh oh!

google-cla Bot commented Jun 23, 2026

Uh oh!

jan-wassenberg commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

19 participants