Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions .github/workflows/syntax-checks.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -59,3 +59,17 @@ jobs:
rustup toolchain install stable --profile minimal --no-self-update -c clippy -c rustfmt
- name: Run `cargo fmt` on top of Rust API project
run: cd src/libcprover-rust; cargo fmt --all -- --check

# This job should take under a minute (est)
check-generated-intrinsic-models:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
- name: Fetch dependencies
env:
DEBIAN_FRONTEND: noninteractive
run: |
sudo apt-get update
sudo apt-get install --no-install-recommends -yq clang-format-15
- name: Check x86 SIMD intrinsic models are in sync with their generator
run: ./scripts/check_intrinsic_models_sync.sh
167 changes: 167 additions & 0 deletions doc/neon-intrinsic-models.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,167 @@
# Generating ARM/AArch64 NEON intrinsic models

This document describes how CBMC models the ARM/AArch64 NEON SIMD builtins, how
the models are generated, and the design decisions behind the choice of
semantic source. It is the companion to `scripts/generate_neon_models.py`,
which emits `src/ansi-c/library/arm_neon.c`.

## Background: how NEON reaches CBMC

Clang's `<arm_neon.h>` implements the public NEON intrinsics (`vabdq_s8`, ...)
on top of a smaller set of *polymorphic* compiler builtins
(`__builtin_neon_vabdq_v`, ...). At each call site the header casts every
operand to a byte-representative lane type (`int8x16_t`, i.e. `__gcc_v16qi`)
and passes a `NeonTypeFlags` integer "type code" that selects the actual lane
interpretation. So one builtin such as `__builtin_neon_vabdq_v` backs
`vabdq_s8`, `vabdq_s16`, `vabdq_u8`, ... and a model must switch on the type
code to reinterpret the representative bytes.

For CBMC to verify such code it needs three things, each handled by a separate
piece of the front-end:

1. **Declarations** for the `__builtin_neon_*` builtins
(`src/ansi-c/compiler_headers/gcc_builtin_headers_aarch64.h`, generated by
`clang_builtins.py` from `clang-tblgen -gen-arm-neon-sema`).
2. **The `neon_vector_type` attribute** so the `<arm_neon.h>` typedefs are real
vectors (handled in the scanner/parser and `ansi_c_convert_type`).
3. **Library models** giving the builtins a body — the subject of this
document.

Intrinsics that Clang *open-codes* (the `*OpInst` records in `arm_neon.td`,
e.g. `vaddq_s8` → `a + b`) lower to native C operators, which CBMC already
handles, so they need **no** model. Only the opaque builtins (the `SInst` /
`IInst` / ... records) need one.

## Where do the model bodies come from?

`arm_neon.td` carries **no semantics** for the opaque builtins: their
`Operation` field is `OP_NONE`. It tells us *which* builtins exist, the element
types each supports, and the type codes — i.e. the model's signature and
`switch` skeleton — but not what each one computes. The per-lane computation
must come from elsewhere. Two ARM-published machine-readable sources were
considered.

### ARM intrinsics JSON (Intrinsics Guide) vs ARM ASL (Architecture Spec)

**ARM-JSON** — the database behind the online Neon/ACLE Intrinsics Guide. Keyed
by *typed intrinsic* (`vabdq_s8`), with an `Operation` field giving high-level
per-element pseudocode.

- *Advantages.* Keying matches our pipeline (`arm_neon.td` already gives
intrinsic → builtin + type code), so an entry maps to exactly one `switch`
case. The pseudocode is close to the per-lane C we emit. It is plain JSON, so
trivial to consume, and it describes the net per-element effect (handy for
intrinsics that lower to instruction *sequences*).
- *Disadvantages.* The pseudocode is written for humans: notation varies and
detail (FP rounding modes, NaN propagation, saturation-flag side effects) is
often elided. Coverage/quality is uneven, and it is a derived presentation,
not the spec ARM validates against, so corners can be wrong.

**ARM-ASL** — the Architecture Reference Manual's machine-readable spec, keyed
by *instruction* (`SABD`), with rigorous executable decode+execute pseudocode.

- *Advantages.* Authoritative and exact (saturation, rounding, flags, edge
cases). Formal/executable, so mechanically translatable in principle (ASLi,
Sail, isla). Parameterised by element size, mapping naturally to "compute for
this lane width".
- *Disadvantages.* Wrong keying for us: it is per *instruction*, so it needs an
intrinsic → instruction mnemonic map that is not in `arm_neon.td`. It is
heavy to translate — it references a large shared-function library and
architectural state (`FPCR`, `FPSR.QC`, `PSTATE`, the register file) — and is
often *more* precise than CBMC can use, so faithfully consuming it would bloat
models the solver then chokes on. It is also large and under specific Arm
license terms.

**Summary.** JSON wins on integration fit and effort; ASL wins on rigor. Since
CBMC models should be simple and self-contained, JSON-style per-element
semantics are the better fit for the bulk, with ASL reserved as a targeted
correctness backstop for the cases where exactness matters and is tractable.
For the hardest tier (FP estimate/reciprocal, crypto, table lookups) neither
source yields a clean, solver-friendly C model automatically; those need
hand-written models or constrained nondeterminism regardless of source.

### What is actually available, and the resulting design

Empirically (June 2026):

- The Intrinsics Guide **`Operation` pseudocode JSON is access-gated** (the
`developer.arm.com/.../intrinsics/data/intrinsics.json` endpoint returns 403)
and is under Arm license terms, so it cannot be fetched here nor vendored
into CBMC.
- ARM's **ACLE repository is openly licensed and fetchable**:
`ARM-software/acle`'s `neon_intrinsics/advsimd.md` is a 2.1 MB structured
reference listing **4689 intrinsics**. It does **not** contain per-element
pseudocode, but it does contain, for each intrinsic, the **AArch64
instruction mnemonic** it maps to (`vabdq_s8` → `SABD`, `vqaddq_s8` →
`SQADD`, `vhaddq_s8` → `SHADD`, ...) — **356 distinct mnemonics** in total.

This reshapes the plan favourably. The instruction mnemonic is the *true*
semantic identity of an intrinsic (it is also the ASL key), and it is a far
smaller, well-understood set than the ~2500 intrinsics or their pseudocode.
So instead of translating per-intrinsic pseudocode, we:

1. take **structure** from `arm_neon.td` (builtins, type codes — as before);
2. take the authoritative **intrinsic → instruction mnemonic** mapping from
ACLE `advsimd.md`; and
3. supply **semantics** via a compact, auditable *mnemonic → per-lane C*
table in the generator (`SABD`/`UABD` → absolute difference, `SMAX`/`UMAX`
→ maximum, `SQADD`/`UQADD` → saturating add, ...).

This is authoritative about *which* operation each builtin is (no guessing from
names), keeps the hand-written part tiny (one entry per instruction family, not
per intrinsic), and lines up with the ASL key should we later want ASL-grade
rigor for specific instructions. The gated `Operation` JSON would only be
needed to avoid writing the mnemonic→C table at all; given how small that table
is, it is not worth the licensing and translation cost.

Both inputs are **external and not vendored** (mirroring how the x86 generator
reads Intel's XML and how the declaration generator reads `arm_neon.td` from an
LLVM checkout): `arm_neon.td` comes from an LLVM checkout and `advsimd.md` from
the openly-licensed ACLE repo. Neither is committed to CBMC.

## Running the generator

```sh
# structure: clang's arm_neon.td (from an llvm-project checkout)
TD=llvm-project/clang/include/clang/Basic/arm_neon.td
# semantics key: ARM ACLE advsimd.md (openly licensed; do not vendor)
curl -sO https://raw.githubusercontent.com/ARM-software/acle/main/neon_intrinsics/advsimd.md

python3 scripts/generate_neon_models.py "$TD" --acle advsimd.md \
-o src/ansi-c/library/arm_neon.c
```

Without `--acle` the generator falls back to a small intrinsic-name-keyed
operation table (no ARM data required), which is enough to regenerate the ops
it already knows. With `--acle` it keys semantics on the ARM instruction
mnemonic, annotates each model with that mnemonic for provenance, and prints a
coverage audit: how many opaque builtins map to mnemonics the table covers, and
a histogram of the mnemonics it does not yet cover (the modeling roadmap).

The output is run through `clang-format-15`, so regeneration is idempotent.

## Current coverage and roadmap

The generator models the mechanically-translatable integer families:
absolute difference (`SABD`/`UABD`), min/max (`SMAX`/`UMAX`/`SMIN`/`UMIN`),
saturating add/subtract (`SQADD`/`UQADD`/`SQSUB`/`UQSUB`), halving and
rounding-halving add/subtract (`SHADD`/`UHADD`/`SHSUB`/`UHSUB`/`SRHADD`/`URHADD`),
the pairwise add/min/max reductions (`ADDP`/`SMAXP`/`UMAXP`/`SMINP`/`UMINP`),
test-bits (`CMTST`) and bitwise select (`BSL`).
Saturating and halving arithmetic is widened to avoid signed-overflow undefined
behaviour; pairwise add is computed unsigned so its modular wrap is well
defined; bitwise select operates on the raw bits and so is independent of the
lane type code.

The relational compares (`vceq`/`vcge`/`vcgt`/`vcle`/`vclt`) and the plain
arithmetic/logical ops (`vadd`/`vsub`/`vmul`, `vand`/`vorr`/`veor`/...) are
open-coded by `<arm_neon.h>` into native C operators, which CBMC handles
directly, so they need no model.

The `--acle` audit classifies the remaining opaque builtins by mnemonic. The
next tractable arithmetic tiers are `EXT` (vector extract by immediate) and the
saturating-shift group (`SQSHL`/`UQSHL`/...). Loads/stores (`LD1`/`ST1`/...),
permutes (`ZIP`/`UZP`/`TRN`/`TBL`), `DUP`, `INS` and the `NOP` (reinterpret)
group are structural rather than arithmetic and are handled separately or
natively. Floating-point (`FABD`/`FMAX`/...) and crypto/estimate instructions
need dedicated modeling and are out of scope for the mechanical generator.
18 changes: 18 additions & 0 deletions regression/ansi-c/gcc_neon_vector_type/main.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
// The neon_vector_type attribute (used by Clang's <arm_neon.h>) gives the
// vector size as a lane count rather than in bytes, unlike vector_size.
typedef __attribute__((neon_vector_type(16))) signed char int8x16_t;
typedef __attribute__((neon_vector_type(8))) short int16x8_t;
typedef __attribute__((neon_vector_type(4))) int int32x4_t;
typedef __attribute__((neon_vector_type(2))) double float64x2_t;

int main()
{
int8x16_t a = {0};
a[3] = 7;
__CPROVER_assert(a[3] == 7, "lane indexing works");
__CPROVER_assert(sizeof(int8x16_t) == 16, "16 lanes of signed char");
__CPROVER_assert(sizeof(int16x8_t) == 16, "8 lanes of short");
__CPROVER_assert(sizeof(int32x4_t) == 16, "4 lanes of int");
__CPROVER_assert(sizeof(float64x2_t) == 16, "2 lanes of double");
return 0;
}
9 changes: 9 additions & 0 deletions regression/ansi-c/gcc_neon_vector_type/test.desc
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
CORE gcc-only
main.c

^EXIT=0$
^SIGNAL=0$
^VERIFICATION SUCCESSFUL$
--
^warning: ignoring
^CONVERSION ERROR$
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
KNOWNBUG
main.c
lfence.c
--pointer-check --bounds-check
^EXIT=0$
^SIGNAL=0$
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
KNOWNBUG
main.c
mfence.c
--pointer-check --bounds-check
^EXIT=0$
^SIGNAL=0$
Expand Down
15 changes: 15 additions & 0 deletions regression/cbmc-library/__builtin_ia32/pabsb128.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
#include <limits.h>

typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
__gcc_v16qi __builtin_ia32_pabsb128(__gcc_v16qi);

int main()
{
// Lane 0 is the interesting hardware case: pabsb leaves SCHAR_MIN unchanged
// (its absolute value is not representable as a signed byte).
__gcc_v16qi a = (__gcc_v16qi){
SCHAR_MIN, -2, 3, -4, 5, -6, 7, -8, 9, -10, 11, -12, 13, -14, 15, -16};
__gcc_v16qi r = __builtin_ia32_pabsb128(a);
__CPROVER_assert(r[0] == SCHAR_MIN && r[1] == 2 && r[15] == 16, "abs epi8");
return 0;
}
8 changes: 8 additions & 0 deletions regression/cbmc-library/__builtin_ia32/pabsb128.desc
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
CORE gcc-only
pabsb128.c

^EXIT=0$
^SIGNAL=0$
^VERIFICATION SUCCESSFUL$
--
^warning: ignoring
15 changes: 15 additions & 0 deletions regression/cbmc-library/__builtin_ia32/pabsd128.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
#include <limits.h>

typedef int __gcc_v4si __attribute__((__vector_size__(16)));
__gcc_v4si __builtin_ia32_pabsd128(__gcc_v4si);

int main()
{
// Lane 0 is the interesting hardware case: pabsd leaves INT_MIN unchanged
// (its absolute value is not representable), and it is also the input that
// exposed the -INT_MIN signed-overflow UB in the previous model.
__gcc_v4si a = (__gcc_v4si){INT_MIN, -2, 3, -4};
__gcc_v4si r = __builtin_ia32_pabsd128(a);
__CPROVER_assert(r[0] == INT_MIN && r[1] == 2 && r[3] == 4, "abs epi32");
return 0;
}
8 changes: 8 additions & 0 deletions regression/cbmc-library/__builtin_ia32/pabsd128.desc
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
CORE gcc-only
pabsd128.c

^EXIT=0$
^SIGNAL=0$
^VERIFICATION SUCCESSFUL$
--
^warning: ignoring
14 changes: 14 additions & 0 deletions regression/cbmc-library/__builtin_ia32/pabsd256.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
#include <limits.h>

typedef int __gcc_v8si __attribute__((__vector_size__(32)));
__gcc_v8si __builtin_ia32_pabsd256(__gcc_v8si);

int main()
{
// Lane 0: pabsd leaves INT_MIN unchanged (no UB in the model).
__gcc_v8si a = (__gcc_v8si){INT_MIN, -2, 3, -4, 5, -6, 7, -8};
__gcc_v8si r = __builtin_ia32_pabsd256(a);
__CPROVER_assert(
r[0] == INT_MIN && r[1] == 2 && r[7] == 8, "abs epi32 (256)");
return 0;
}
8 changes: 8 additions & 0 deletions regression/cbmc-library/__builtin_ia32/pabsd256.desc
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
CORE gcc-only
pabsd256.c

^EXIT=0$
^SIGNAL=0$
^VERIFICATION SUCCESSFUL$
--
^warning: ignoring
14 changes: 14 additions & 0 deletions regression/cbmc-library/__builtin_ia32/pabsw128.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
#include <limits.h>

typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
__gcc_v8hi __builtin_ia32_pabsw128(__gcc_v8hi);

int main()
{
// Lane 0 is the interesting hardware case: pabsw leaves SHRT_MIN unchanged
// (its absolute value is not representable as a signed 16-bit value).
__gcc_v8hi a = (__gcc_v8hi){SHRT_MIN, -2, 3, -4, 5, -6, 7, -8};
__gcc_v8hi r = __builtin_ia32_pabsw128(a);
__CPROVER_assert(r[0] == SHRT_MIN && r[1] == 2 && r[7] == 8, "abs epi16");
return 0;
}
8 changes: 8 additions & 0 deletions regression/cbmc-library/__builtin_ia32/pabsw128.desc
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
CORE gcc-only
pabsw128.c

^EXIT=0$
^SIGNAL=0$
^VERIFICATION SUCCESSFUL$
--
^warning: ignoring
18 changes: 18 additions & 0 deletions regression/cbmc-library/__builtin_ia32/paddb.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
__gcc_v8qi __builtin_ia32_paddb(__gcc_v8qi, __gcc_v8qi);

int main()
{
// Exhaustive equivalence: the model must agree with CBMC's own
// vector semantics (native +) for all inputs.
__gcc_v8qi a, b;
__gcc_v8qi r = __builtin_ia32_paddb(a, b);
__gcc_v8qi_u ref = (__gcc_v8qi_u)a + (__gcc_v8qi_u)b;
__CPROVER_assert(
r[0] == (char)ref[0] && r[1] == (char)ref[1] && r[2] == (char)ref[2] &&
r[3] == (char)ref[3] && r[4] == (char)ref[4] && r[5] == (char)ref[5] &&
r[6] == (char)ref[6] && r[7] == (char)ref[7],
"__builtin_ia32_paddb == native +");
return 0;
}
8 changes: 8 additions & 0 deletions regression/cbmc-library/__builtin_ia32/paddb.desc
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
CORE gcc-only
paddb.c

^EXIT=0$
^SIGNAL=0$
^VERIFICATION SUCCESSFUL$
--
^warning: ignoring
22 changes: 22 additions & 0 deletions regression/cbmc-library/__builtin_ia32/paddb128.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
__gcc_v16qi __builtin_ia32_paddb128(__gcc_v16qi, __gcc_v16qi);

int main()
{
// Exhaustive equivalence: the model must agree with CBMC's own
// vector semantics (native +) for all inputs.
__gcc_v16qi a, b;
__gcc_v16qi r = __builtin_ia32_paddb128(a, b);
__gcc_v16qi_u ref = (__gcc_v16qi_u)a + (__gcc_v16qi_u)b;
__CPROVER_assert(
r[0] == (char)ref[0] && r[1] == (char)ref[1] && r[2] == (char)ref[2] &&
r[3] == (char)ref[3] && r[4] == (char)ref[4] && r[5] == (char)ref[5] &&
r[6] == (char)ref[6] && r[7] == (char)ref[7] && r[8] == (char)ref[8] &&
r[9] == (char)ref[9] && r[10] == (char)ref[10] &&
r[11] == (char)ref[11] && r[12] == (char)ref[12] &&
r[13] == (char)ref[13] && r[14] == (char)ref[14] &&
r[15] == (char)ref[15],
"__builtin_ia32_paddb128 == native +");
return 0;
}
8 changes: 8 additions & 0 deletions regression/cbmc-library/__builtin_ia32/paddb128.desc
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
CORE gcc-only
paddb128.c

^EXIT=0$
^SIGNAL=0$
^VERIFICATION SUCCESSFUL$
--
^warning: ignoring
Loading
Loading