diffblue · tautschnig · Jun 15, 2026 · Jun 15, 2026 · Jun 15, 2026 · Jun 15, 2026
@@ -59,3 +59,17 @@ jobs:
           rustup toolchain install stable --profile minimal --no-self-update -c clippy -c rustfmt
       - name: Run `cargo fmt` on top of Rust API project
         run: cd src/libcprover-rust; cargo fmt --all -- --check
+
+  # This job should take under a minute (est)
+  check-generated-intrinsic-models:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+      - name: Fetch dependencies
+        env:
+          DEBIAN_FRONTEND: noninteractive
+        run: |
+          sudo apt-get update
+          sudo apt-get install --no-install-recommends -yq clang-format-15
+      - name: Check x86 SIMD intrinsic models are in sync with their generator
+        run: ./scripts/check_intrinsic_models_sync.sh
@@ -0,0 +1,167 @@
+# Generating ARM/AArch64 NEON intrinsic models
+
+This document describes how CBMC models the ARM/AArch64 NEON SIMD builtins, how
+the models are generated, and the design decisions behind the choice of
+semantic source. It is the companion to `scripts/generate_neon_models.py`,
+which emits `src/ansi-c/library/arm_neon.c`.
+
+## Background: how NEON reaches CBMC
+
+Clang's `<arm_neon.h>` implements the public NEON intrinsics (`vabdq_s8`, ...)
+on top of a smaller set of *polymorphic* compiler builtins
+(`__builtin_neon_vabdq_v`, ...). At each call site the header casts every
+operand to a byte-representative lane type (`int8x16_t`, i.e. `__gcc_v16qi`)
+and passes a `NeonTypeFlags` integer "type code" that selects the actual lane
+interpretation. So one builtin such as `__builtin_neon_vabdq_v` backs
+`vabdq_s8`, `vabdq_s16`, `vabdq_u8`, ... and a model must switch on the type
+code to reinterpret the representative bytes.
+
+For CBMC to verify such code it needs three things, each handled by a separate
+piece of the front-end:
+
+1. **Declarations** for the `__builtin_neon_*` builtins
+   (`src/ansi-c/compiler_headers/gcc_builtin_headers_aarch64.h`, generated by
+   `clang_builtins.py` from `clang-tblgen -gen-arm-neon-sema`).
+2. **The `neon_vector_type` attribute** so the `<arm_neon.h>` typedefs are real
+   vectors (handled in the scanner/parser and `ansi_c_convert_type`).
+3. **Library models** giving the builtins a body — the subject of this
+   document.
+
+Intrinsics that Clang *open-codes* (the `*OpInst` records in `arm_neon.td`,
+e.g. `vaddq_s8` → `a + b`) lower to native C operators, which CBMC already
+handles, so they need **no** model. Only the opaque builtins (the `SInst` /
+`IInst` / ... records) need one.
+
+## Where do the model bodies come from?
+
+`arm_neon.td` carries **no semantics** for the opaque builtins: their
+`Operation` field is `OP_NONE`. It tells us *which* builtins exist, the element
+types each supports, and the type codes — i.e. the model's signature and
+`switch` skeleton — but not what each one computes. The per-lane computation
+must come from elsewhere. Two ARM-published machine-readable sources were
+considered.
+
+### ARM intrinsics JSON (Intrinsics Guide) vs ARM ASL (Architecture Spec)
+
+**ARM-JSON** — the database behind the online Neon/ACLE Intrinsics Guide. Keyed
+by *typed intrinsic* (`vabdq_s8`), with an `Operation` field giving high-level
+per-element pseudocode.
+
+- *Advantages.* Keying matches our pipeline (`arm_neon.td` already gives
+  intrinsic → builtin + type code), so an entry maps to exactly one `switch`
+  case. The pseudocode is close to the per-lane C we emit. It is plain JSON, so
+  trivial to consume, and it describes the net per-element effect (handy for
+  intrinsics that lower to instruction *sequences*).
+- *Disadvantages.* The pseudocode is written for humans: notation varies and
+  detail (FP rounding modes, NaN propagation, saturation-flag side effects) is
+  often elided. Coverage/quality is uneven, and it is a derived presentation,
+  not the spec ARM validates against, so corners can be wrong.
+
+**ARM-ASL** — the Architecture Reference Manual's machine-readable spec, keyed
+by *instruction* (`SABD`), with rigorous executable decode+execute pseudocode.
+
+- *Advantages.* Authoritative and exact (saturation, rounding, flags, edge
+  cases). Formal/executable, so mechanically translatable in principle (ASLi,
+  Sail, isla). Parameterised by element size, mapping naturally to "compute for
+  this lane width".
+- *Disadvantages.* Wrong keying for us: it is per *instruction*, so it needs an
+  intrinsic → instruction mnemonic map that is not in `arm_neon.td`. It is
+  heavy to translate — it references a large shared-function library and
+  architectural state (`FPCR`, `FPSR.QC`, `PSTATE`, the register file) — and is
+  often *more* precise than CBMC can use, so faithfully consuming it would bloat
+  models the solver then chokes on. It is also large and under specific Arm
+  license terms.
+
+**Summary.** JSON wins on integration fit and effort; ASL wins on rigor. Since
+CBMC models should be simple and self-contained, JSON-style per-element
+semantics are the better fit for the bulk, with ASL reserved as a targeted
+correctness backstop for the cases where exactness matters and is tractable.
+For the hardest tier (FP estimate/reciprocal, crypto, table lookups) neither
+source yields a clean, solver-friendly C model automatically; those need
+hand-written models or constrained nondeterminism regardless of source.
+
+### What is actually available, and the resulting design
+
+Empirically (June 2026):
+
+- The Intrinsics Guide **`Operation` pseudocode JSON is access-gated** (the
+  `developer.arm.com/.../intrinsics/data/intrinsics.json` endpoint returns 403)
+  and is under Arm license terms, so it cannot be fetched here nor vendored
+  into CBMC.
+- ARM's **ACLE repository is openly licensed and fetchable**:
+  `ARM-software/acle`'s `neon_intrinsics/advsimd.md` is a 2.1 MB structured
+  reference listing **4689 intrinsics**. It does **not** contain per-element
+  pseudocode, but it does contain, for each intrinsic, the **AArch64
+  instruction mnemonic** it maps to (`vabdq_s8` → `SABD`, `vqaddq_s8` →
+  `SQADD`, `vhaddq_s8` → `SHADD`, ...) — **356 distinct mnemonics** in total.
+
+This reshapes the plan favourably. The instruction mnemonic is the *true*
+semantic identity of an intrinsic (it is also the ASL key), and it is a far
+smaller, well-understood set than the ~2500 intrinsics or their pseudocode.
+So instead of translating per-intrinsic pseudocode, we:
+
+1. take **structure** from `arm_neon.td` (builtins, type codes — as before);
+2. take the authoritative **intrinsic → instruction mnemonic** mapping from
+   ACLE `advsimd.md`; and
+3. supply **semantics** via a compact, auditable *mnemonic → per-lane C*
+   table in the generator (`SABD`/`UABD` → absolute difference, `SMAX`/`UMAX`
+   → maximum, `SQADD`/`UQADD` → saturating add, ...).
+
+This is authoritative about *which* operation each builtin is (no guessing from
+names), keeps the hand-written part tiny (one entry per instruction family, not
+per intrinsic), and lines up with the ASL key should we later want ASL-grade
+rigor for specific instructions. The gated `Operation` JSON would only be
+needed to avoid writing the mnemonic→C table at all; given how small that table
+is, it is not worth the licensing and translation cost.
+
+Both inputs are **external and not vendored** (mirroring how the x86 generator
+reads Intel's XML and how the declaration generator reads `arm_neon.td` from an
+LLVM checkout): `arm_neon.td` comes from an LLVM checkout and `advsimd.md` from
+the openly-licensed ACLE repo. Neither is committed to CBMC.
+
+## Running the generator
+
+```sh
+# structure: clang's arm_neon.td (from an llvm-project checkout)
+TD=llvm-project/clang/include/clang/Basic/arm_neon.td
+# semantics key: ARM ACLE advsimd.md (openly licensed; do not vendor)
+curl -sO https://raw.githubusercontent.com/ARM-software/acle/main/neon_intrinsics/advsimd.md
+
+python3 scripts/generate_neon_models.py "$TD" --acle advsimd.md \
+    -o src/ansi-c/library/arm_neon.c
+```
+
+Without `--acle` the generator falls back to a small intrinsic-name-keyed
+operation table (no ARM data required), which is enough to regenerate the ops
+it already knows. With `--acle` it keys semantics on the ARM instruction
+mnemonic, annotates each model with that mnemonic for provenance, and prints a
+coverage audit: how many opaque builtins map to mnemonics the table covers, and
+a histogram of the mnemonics it does not yet cover (the modeling roadmap).
+
+The output is run through `clang-format-15`, so regeneration is idempotent.
+
+## Current coverage and roadmap
+
+The generator models the mechanically-translatable integer families:
+absolute difference (`SABD`/`UABD`), min/max (`SMAX`/`UMAX`/`SMIN`/`UMIN`),
+saturating add/subtract (`SQADD`/`UQADD`/`SQSUB`/`UQSUB`), halving and
+rounding-halving add/subtract (`SHADD`/`UHADD`/`SHSUB`/`UHSUB`/`SRHADD`/`URHADD`),
+the pairwise add/min/max reductions (`ADDP`/`SMAXP`/`UMAXP`/`SMINP`/`UMINP`),
+test-bits (`CMTST`) and bitwise select (`BSL`).
+Saturating and halving arithmetic is widened to avoid signed-overflow undefined
+behaviour; pairwise add is computed unsigned so its modular wrap is well
+defined; bitwise select operates on the raw bits and so is independent of the
+lane type code.
+
+The relational compares (`vceq`/`vcge`/`vcgt`/`vcle`/`vclt`) and the plain
+arithmetic/logical ops (`vadd`/`vsub`/`vmul`, `vand`/`vorr`/`veor`/...) are
+open-coded by `<arm_neon.h>` into native C operators, which CBMC handles
+directly, so they need no model.
+
+The `--acle` audit classifies the remaining opaque builtins by mnemonic. The
+next tractable arithmetic tiers are `EXT` (vector extract by immediate) and the
+saturating-shift group (`SQSHL`/`UQSHL`/...). Loads/stores (`LD1`/`ST1`/...),
+permutes (`ZIP`/`UZP`/`TRN`/`TBL`), `DUP`, `INS` and the `NOP` (reinterpret)
+group are structural rather than arithmetic and are handled separately or
+natively. Floating-point (`FABD`/`FMAX`/...) and crypto/estimate instructions
+need dedicated modeling and are out of scope for the mechanical generator.
diff --git a/regression/ansi-c/gcc_neon_vector_type/main.c b/regression/ansi-c/gcc_neon_vector_type/main.c
@@ -0,0 +1,18 @@
+// The neon_vector_type attribute (used by Clang's <arm_neon.h>) gives the
+// vector size as a lane count rather than in bytes, unlike vector_size.
+typedef __attribute__((neon_vector_type(16))) signed char int8x16_t;
+typedef __attribute__((neon_vector_type(8))) short int16x8_t;
+typedef __attribute__((neon_vector_type(4))) int int32x4_t;
+typedef __attribute__((neon_vector_type(2))) double float64x2_t;
+
+int main()
+{
+  int8x16_t a = {0};
+  a[3] = 7;
+  __CPROVER_assert(a[3] == 7, "lane indexing works");
+  __CPROVER_assert(sizeof(int8x16_t) == 16, "16 lanes of signed char");
+  __CPROVER_assert(sizeof(int16x8_t) == 16, "8 lanes of short");
+  __CPROVER_assert(sizeof(int32x4_t) == 16, "4 lanes of int");
+  __CPROVER_assert(sizeof(float64x2_t) == 16, "2 lanes of double");
+  return 0;
+}
diff --git a/regression/ansi-c/gcc_neon_vector_type/test.desc b/regression/ansi-c/gcc_neon_vector_type/test.desc
@@ -0,0 +1,9 @@
+CORE gcc-only
+main.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
+^CONVERSION ERROR$
diff --git a/...cbmc-library/__builtin_ia32_lfence/main.c → ...sion/cbmc-library/__builtin_ia32/lfence.c b/...cbmc-library/__builtin_ia32_lfence/main.c → ...sion/cbmc-library/__builtin_ia32/lfence.c
diff --git a/...c-library/__builtin_ia32_lfence/test.desc → ...n/cbmc-library/__builtin_ia32/lfence.desc b/...c-library/__builtin_ia32_lfence/test.desc → ...n/cbmc-library/__builtin_ia32/lfence.desc
@@ -1,5 +1,5 @@
 KNOWNBUG
-main.c
+lfence.c
 --pointer-check --bounds-check
 ^EXIT=0$
 ^SIGNAL=0$

diff --git a/...cbmc-library/__builtin_ia32_mfence/main.c → ...sion/cbmc-library/__builtin_ia32/mfence.c b/...cbmc-library/__builtin_ia32_mfence/main.c → ...sion/cbmc-library/__builtin_ia32/mfence.c
diff --git a/...c-library/__builtin_ia32_mfence/test.desc → ...n/cbmc-library/__builtin_ia32/mfence.desc b/...c-library/__builtin_ia32_mfence/test.desc → ...n/cbmc-library/__builtin_ia32/mfence.desc
@@ -1,5 +1,5 @@
 KNOWNBUG
-main.c
+mfence.c
 --pointer-check --bounds-check
 ^EXIT=0$
 ^SIGNAL=0$

diff --git a/regression/cbmc-library/__builtin_ia32/pabsb128.c b/regression/cbmc-library/__builtin_ia32/pabsb128.c
@@ -0,0 +1,15 @@
+#include <limits.h>
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_pabsb128(__gcc_v16qi);
+
+int main()
+{
+  // Lane 0 is the interesting hardware case: pabsb leaves SCHAR_MIN unchanged
+  // (its absolute value is not representable as a signed byte).
+  __gcc_v16qi a = (__gcc_v16qi){
+    SCHAR_MIN, -2, 3, -4, 5, -6, 7, -8, 9, -10, 11, -12, 13, -14, 15, -16};
+  __gcc_v16qi r = __builtin_ia32_pabsb128(a);
+  __CPROVER_assert(r[0] == SCHAR_MIN && r[1] == 2 && r[15] == 16, "abs epi8");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pabsb128.desc b/regression/cbmc-library/__builtin_ia32/pabsb128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pabsb128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pabsd128.c b/regression/cbmc-library/__builtin_ia32/pabsd128.c
@@ -0,0 +1,15 @@
+#include <limits.h>
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+__gcc_v4si __builtin_ia32_pabsd128(__gcc_v4si);
+
+int main()
+{
+  // Lane 0 is the interesting hardware case: pabsd leaves INT_MIN unchanged
+  // (its absolute value is not representable), and it is also the input that
+  // exposed the -INT_MIN signed-overflow UB in the previous model.
+  __gcc_v4si a = (__gcc_v4si){INT_MIN, -2, 3, -4};
+  __gcc_v4si r = __builtin_ia32_pabsd128(a);
+  __CPROVER_assert(r[0] == INT_MIN && r[1] == 2 && r[3] == 4, "abs epi32");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pabsd128.desc b/regression/cbmc-library/__builtin_ia32/pabsd128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pabsd128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pabsd256.c b/regression/cbmc-library/__builtin_ia32/pabsd256.c
@@ -0,0 +1,14 @@
+#include <limits.h>
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+__gcc_v8si __builtin_ia32_pabsd256(__gcc_v8si);
+
+int main()
+{
+  // Lane 0: pabsd leaves INT_MIN unchanged (no UB in the model).
+  __gcc_v8si a = (__gcc_v8si){INT_MIN, -2, 3, -4, 5, -6, 7, -8};
+  __gcc_v8si r = __builtin_ia32_pabsd256(a);
+  __CPROVER_assert(
+    r[0] == INT_MIN && r[1] == 2 && r[7] == 8, "abs epi32 (256)");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pabsd256.desc b/regression/cbmc-library/__builtin_ia32/pabsd256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pabsd256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pabsw128.c b/regression/cbmc-library/__builtin_ia32/pabsw128.c
@@ -0,0 +1,14 @@
+#include <limits.h>
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_pabsw128(__gcc_v8hi);
+
+int main()
+{
+  // Lane 0 is the interesting hardware case: pabsw leaves SHRT_MIN unchanged
+  // (its absolute value is not representable as a signed 16-bit value).
+  __gcc_v8hi a = (__gcc_v8hi){SHRT_MIN, -2, 3, -4, 5, -6, 7, -8};
+  __gcc_v8hi r = __builtin_ia32_pabsw128(a);
+  __CPROVER_assert(r[0] == SHRT_MIN && r[1] == 2 && r[7] == 8, "abs epi16");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pabsw128.desc b/regression/cbmc-library/__builtin_ia32/pabsw128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pabsw128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddb.c b/regression/cbmc-library/__builtin_ia32/paddb.c
@@ -0,0 +1,18 @@
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+__gcc_v8qi __builtin_ia32_paddb(__gcc_v8qi, __gcc_v8qi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native +) for all inputs.
+  __gcc_v8qi a, b;
+  __gcc_v8qi r = __builtin_ia32_paddb(a, b);
+  __gcc_v8qi_u ref = (__gcc_v8qi_u)a + (__gcc_v8qi_u)b;
+  __CPROVER_assert(
+    r[0] == (char)ref[0] && r[1] == (char)ref[1] && r[2] == (char)ref[2] &&
+      r[3] == (char)ref[3] && r[4] == (char)ref[4] && r[5] == (char)ref[5] &&
+      r[6] == (char)ref[6] && r[7] == (char)ref[7],
+    "__builtin_ia32_paddb == native +");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddb.desc b/regression/cbmc-library/__builtin_ia32/paddb.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddb.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddb128.c b/regression/cbmc-library/__builtin_ia32/paddb128.c
@@ -0,0 +1,22 @@
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_paddb128(__gcc_v16qi, __gcc_v16qi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native +) for all inputs.
+  __gcc_v16qi a, b;
+  __gcc_v16qi r = __builtin_ia32_paddb128(a, b);
+  __gcc_v16qi_u ref = (__gcc_v16qi_u)a + (__gcc_v16qi_u)b;
+  __CPROVER_assert(
+    r[0] == (char)ref[0] && r[1] == (char)ref[1] && r[2] == (char)ref[2] &&
+      r[3] == (char)ref[3] && r[4] == (char)ref[4] && r[5] == (char)ref[5] &&
+      r[6] == (char)ref[6] && r[7] == (char)ref[7] && r[8] == (char)ref[8] &&
+      r[9] == (char)ref[9] && r[10] == (char)ref[10] &&
+      r[11] == (char)ref[11] && r[12] == (char)ref[12] &&
+      r[13] == (char)ref[13] && r[14] == (char)ref[14] &&
+      r[15] == (char)ref[15],
+    "__builtin_ia32_paddb128 == native +");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddb128.desc b/regression/cbmc-library/__builtin_ia32/paddb128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddb128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring