diff --git a/.github/workflows/syntax-checks.yaml b/.github/workflows/syntax-checks.yaml
index 27badf441d3..eb19393268c 100644
--- a/.github/workflows/syntax-checks.yaml
+++ b/.github/workflows/syntax-checks.yaml
@@ -59,3 +59,17 @@ jobs:
           rustup toolchain install stable --profile minimal --no-self-update -c clippy -c rustfmt
       - name: Run `cargo fmt` on top of Rust API project
         run: cd src/libcprover-rust; cargo fmt --all -- --check
+
+  # This job is expected to take under a minute.
+  check-generated-intrinsic-models:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+      - name: Fetch dependencies
+        env:
+          DEBIAN_FRONTEND: noninteractive
+        run: |
+          sudo apt-get update
+          sudo apt-get install --no-install-recommends -yq clang-format-15
+      - name: Check x86 SIMD intrinsic models are in sync with their generator
+        run: ./scripts/check_intrinsic_models_sync.sh
diff --git a/doc/neon-intrinsic-models.md b/doc/neon-intrinsic-models.md
new file mode 100644
index 00000000000..a4c9ae348b2
--- /dev/null
+++ b/doc/neon-intrinsic-models.md
@@ -0,0 +1,167 @@
+# Generating ARM/AArch64 NEON intrinsic models
+
+This document describes how CBMC models the ARM/AArch64 NEON SIMD builtins, how
+the models are generated, and the design decisions behind the choice of
+semantic source. It is the companion to `scripts/generate_neon_models.py`,
+which emits `src/ansi-c/library/arm_neon.c`.
+
+## Background: how NEON reaches CBMC
+
+Clang's `<arm_neon.h>` implements the public NEON intrinsics (`vabdq_s8`, ...)
+on top of a smaller set of *polymorphic* compiler builtins
+(`__builtin_neon_vabdq_v`, ...). At each call site the header casts every
+operand to a byte-representative lane type (`int8x16_t`, i.e. `__gcc_v16qi`)
+and passes a `NeonTypeFlags` integer "type code" that selects the actual lane
+interpretation. So one builtin such as `__builtin_neon_vabdq_v` backs
+`vabdq_s8`, `vabdq_s16`, `vabdq_u8`, ... and a model must switch on the type
+code to reinterpret the representative bytes.
+
+For CBMC to verify such code it needs three things, each handled by a separate
+piece of the front-end:
+
+1. **Declarations** for the `__builtin_neon_*` builtins
+   (`src/ansi-c/compiler_headers/gcc_builtin_headers_aarch64.h`, generated by
+   `clang_builtins.py` from `clang-tblgen -gen-arm-neon-sema`).
+2. **The `neon_vector_type` attribute** so the `<arm_neon.h>` typedefs are real
+   vectors (handled in the scanner/parser and `ansi_c_convert_type`).
+3. **Library models** giving the builtins a body — the subject of this
+   document.
+
+Intrinsics that Clang *open-codes* (the `*OpInst` records in `arm_neon.td`,
+e.g. `vaddq_s8` → `a + b`) lower to native C operators, which CBMC already
+handles, so they need **no** model. Only the opaque builtins (the `SInst` /
+`IInst` / ... records) need one.
+
+## Where do the model bodies come from?
+
+`arm_neon.td` carries **no semantics** for the opaque builtins: their
+`Operation` field is `OP_NONE`. It tells us *which* builtins exist, the element
+types each supports, and the type codes — i.e. the model's signature and
+`switch` skeleton — but not what each one computes. The per-lane computation
+must come from elsewhere. Two ARM-published machine-readable sources were
+considered.
+
+### ARM intrinsics JSON (Intrinsics Guide) vs ARM ASL (Architecture Spec)
+
+**ARM-JSON** — the database behind the online Neon/ACLE Intrinsics Guide. Keyed
+by *typed intrinsic* (`vabdq_s8`), with an `Operation` field giving high-level
+per-element pseudocode.
+
+- *Advantages.* Keying matches our pipeline (`arm_neon.td` already gives
+  intrinsic → builtin + type code), so an entry maps to exactly one `switch`
+  case. The pseudocode is close to the per-lane C we emit. It is plain JSON, so
+  trivial to consume, and it describes the net per-element effect (handy for
+  intrinsics that lower to instruction *sequences*).
+- *Disadvantages.* The pseudocode is written for humans: notation varies and
+  detail (FP rounding modes, NaN propagation, saturation-flag side effects) is
+  often elided. Coverage/quality is uneven, and it is a derived presentation,
+  not the spec ARM validates against, so corners can be wrong.
+
+**ARM-ASL** — the Architecture Reference Manual's machine-readable spec, keyed
+by *instruction* (`SABD`), with rigorous executable decode+execute pseudocode.
+
+- *Advantages.* Authoritative and exact (saturation, rounding, flags, edge
+  cases). Formal/executable, so mechanically translatable in principle (ASLi,
+  Sail, isla). Parameterised by element size, mapping naturally to "compute for
+  this lane width".
+- *Disadvantages.* Wrong keying for us: it is per *instruction*, so it needs an
+  intrinsic → instruction mnemonic map that is not in `arm_neon.td`. It is
+  heavy to translate — it references a large shared-function library and
+  architectural state (`FPCR`, `FPSR.QC`, `PSTATE`, the register file) — and is
+  often *more* precise than CBMC can use, so consuming it faithfully would
+  bloat the models and overwhelm the solver. It is also large and subject to
+  specific Arm license terms.
+
+**Summary.** JSON wins on integration fit and effort; ASL wins on rigor. Since
+CBMC models should be simple and self-contained, JSON-style per-element
+semantics are the better fit for the bulk, with ASL reserved as a targeted
+correctness backstop for the cases where exactness matters and is tractable.
+For the hardest tier (FP estimate/reciprocal, crypto, table lookups) neither
+source yields a clean, solver-friendly C model automatically; those need
+hand-written models or constrained nondeterminism regardless of source.
+
+### What is actually available, and the resulting design
+
+Empirically (June 2026):
+
+- The Intrinsics Guide **`Operation` pseudocode JSON is access-gated** (the
+  `developer.arm.com/.../intrinsics/data/intrinsics.json` endpoint returns 403)
+  and is under Arm license terms, so it can be neither fetched here nor
+  vendored into CBMC.
+- ARM's **ACLE repository is openly licensed and fetchable**:
+  `ARM-software/acle`'s `neon_intrinsics/advsimd.md` is a 2.1 MB structured
+  reference listing **4689 intrinsics**. It does **not** contain per-element
+  pseudocode, but it does contain, for each intrinsic, the **AArch64
+  instruction mnemonic** it maps to (`vabdq_s8` → `SABD`, `vqaddq_s8` →
+  `SQADD`, `vhaddq_s8` → `SHADD`, ...) — **356 distinct mnemonics** in total.
+
+This reshapes the plan favourably. The instruction mnemonic is the *true*
+semantic identity of an intrinsic (it is also the ASL key), and it is a far
+smaller, well-understood set than the ~2500 intrinsics or their pseudocode.
+So instead of translating per-intrinsic pseudocode, we:
+
+1. take **structure** from `arm_neon.td` (builtins, type codes — as before);
+2. take the authoritative **intrinsic → instruction mnemonic** mapping from
+   ACLE `advsimd.md`; and
+3. supply **semantics** via a compact, auditable *mnemonic → per-lane C*
+   table in the generator (`SABD`/`UABD` → absolute difference, `SMAX`/`UMAX`
+   → maximum, `SQADD`/`UQADD` → saturating add, ...).
+
+This is authoritative about *which* operation each builtin is (no guessing from
+names), keeps the hand-written part tiny (one entry per instruction family, not
+per intrinsic), and lines up with the ASL key should we later want ASL-grade
+rigor for specific instructions. The gated `Operation` JSON would only be
+needed to avoid writing the mnemonic→C table at all; given how small that table
+is, it is not worth the licensing and translation cost.
+
+Both inputs are **external and not vendored** (mirroring how the x86 generator
+reads Intel's XML and how the declaration generator reads `arm_neon.td` from an
+LLVM checkout): `arm_neon.td` comes from an LLVM checkout and `advsimd.md` from
+the openly-licensed ACLE repo. Neither is committed to CBMC.
+
+## Running the generator
+
+```sh
+# structure: clang's arm_neon.td (from an llvm-project checkout)
+TD=llvm-project/clang/include/clang/Basic/arm_neon.td
+# semantics key: ARM ACLE advsimd.md (openly licensed; do not vendor)
+curl -sO https://raw.githubusercontent.com/ARM-software/acle/main/neon_intrinsics/advsimd.md
+
+python3 scripts/generate_neon_models.py "$TD" --acle advsimd.md \
+    -o src/ansi-c/library/arm_neon.c
+```
+
+Without `--acle` the generator falls back to a small intrinsic-name-keyed
+operation table (no ARM data required), which is enough to regenerate the ops
+it already knows. With `--acle` it keys semantics on the ARM instruction
+mnemonic, annotates each model with that mnemonic for provenance, and prints a
+coverage audit: how many opaque builtins map to mnemonics the table covers, and
+a histogram of the mnemonics it does not yet cover (the modeling roadmap).
+
+The output is run through `clang-format-15`, so regeneration is idempotent.
+
+## Current coverage and roadmap
+
+The generator models the mechanically-translatable integer families:
+absolute difference (`SABD`/`UABD`), min/max (`SMAX`/`UMAX`/`SMIN`/`UMIN`),
+saturating add/subtract (`SQADD`/`UQADD`/`SQSUB`/`UQSUB`), halving and
+rounding-halving add/subtract (`SHADD`/`UHADD`/`SHSUB`/`UHSUB`/`SRHADD`/`URHADD`),
+the pairwise add/min/max reductions (`ADDP`/`SMAXP`/`UMAXP`/`SMINP`/`UMINP`),
+test-bits (`CMTST`) and bitwise select (`BSL`).
+Saturating and halving arithmetic is widened to avoid signed-overflow undefined
+behaviour; pairwise add is computed unsigned so its modular wrap is well
+defined; bitwise select operates on the raw bits and so is independent of the
+lane type code.
+
+The relational compares (`vceq`/`vcge`/`vcgt`/`vcle`/`vclt`) and the plain
+arithmetic/logical ops (`vadd`/`vsub`/`vmul`, `vand`/`vorr`/`veor`/...) are
+open-coded by `<arm_neon.h>` into native C operators, which CBMC handles
+directly, so they need no model.
+
+The `--acle` audit classifies the remaining opaque builtins by mnemonic. The
+next tractable arithmetic tiers are `EXT` (vector extract by immediate) and the
+saturating-shift group (`SQSHL`/`UQSHL`/...). Loads/stores (`LD1`/`ST1`/...),
+permutes (`ZIP`/`UZP`/`TRN`/`TBL`), `DUP`, `INS` and the `NOP` (reinterpret)
+group are structural rather than arithmetic and are handled separately or
+natively. Floating-point (`FABD`/`FMAX`/...) and crypto/estimate instructions
+need dedicated modeling and are out of scope for the mechanical generator.
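
Note on `doc/neon-intrinsic-models.md`: the "switch on the type code, then apply a per-lane operation" shape described there can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the generator's actual output: the `lane_code` enum and the `model_vabdq` name are hypothetical (the real type codes are Clang's `NeonTypeFlags` values), and byte arrays stand in for the representative vector type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type codes, for illustration only; the real values are
   Clang's NeonTypeFlags encodings. */
enum lane_code
{
  LANE_S8,
  LANE_U8,
  LANE_S16
};

/* Sketch of one polymorphic model: absolute difference (SABD/UABD) over a
   16-byte representative vector, reinterpreted according to the type code. */
static void model_vabdq(
  const uint8_t a[16],
  const uint8_t b[16],
  uint8_t r[16],
  enum lane_code code)
{
  switch(code)
  {
  case LANE_S8:
    for(int i = 0; i < 16; ++i)
    {
      int8_t x, y;
      memcpy(&x, a + i, 1);
      memcpy(&y, b + i, 1);
      /* Operands promote to int, so the subtraction cannot overflow. */
      int d = x > y ? x - y : y - x;
      r[i] = (uint8_t)d;
    }
    break;
  case LANE_U8:
    for(int i = 0; i < 16; ++i)
      r[i] = (uint8_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    break;
  case LANE_S16:
    for(int i = 0; i < 8; ++i)
    {
      int16_t x, y;
      memcpy(&x, a + 2 * i, sizeof x);
      memcpy(&y, b + 2 * i, sizeof y);
      int d = x > y ? x - y : y - x;
      uint16_t out = (uint16_t)d;
      memcpy(r + 2 * i, &out, sizeof out);
    }
    break;
  }
}

/* Usage: the same bytes give different results under different type codes. */
static void demo_vabdq(void)
{
  uint8_t a[16], b[16], r[16];
  memset(a, 0x80, sizeof a); /* -128 as s8, 128 as u8 */
  memset(b, 0x7f, sizeof b); /* 127 either way */
  model_vabdq(a, b, r, LANE_S8);
  assert(r[0] == 0xff); /* |-128 - 127| = 255, kept as raw bits */
  model_vabdq(a, b, r, LANE_U8);
  assert(r[0] == 1); /* |128 - 127| = 1 */
}
```

The signed cases rely on integer promotion to `int` for widening, in the spirit of the document's remark that arithmetic is widened to avoid signed-overflow undefined behaviour.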
diff --git a/regression/ansi-c/gcc_neon_vector_type/main.c b/regression/ansi-c/gcc_neon_vector_type/main.c
new file mode 100644
index 00000000000..44d26419b66
--- /dev/null
+++ b/regression/ansi-c/gcc_neon_vector_type/main.c
@@ -0,0 +1,18 @@
+// The neon_vector_type attribute (used by Clang's <arm_neon.h>) gives the
+// vector size as a lane count rather than in bytes, unlike vector_size.
+typedef __attribute__((neon_vector_type(16))) signed char int8x16_t;
+typedef __attribute__((neon_vector_type(8))) short int16x8_t;
+typedef __attribute__((neon_vector_type(4))) int int32x4_t;
+typedef __attribute__((neon_vector_type(2))) double float64x2_t;
+
+int main()
+{
+  int8x16_t a = {0};
+  a[3] = 7;
+  __CPROVER_assert(a[3] == 7, "lane indexing works");
+  __CPROVER_assert(sizeof(int8x16_t) == 16, "16 lanes of signed char");
+  __CPROVER_assert(sizeof(int16x8_t) == 16, "8 lanes of short");
+  __CPROVER_assert(sizeof(int32x4_t) == 16, "4 lanes of int");
+  __CPROVER_assert(sizeof(float64x2_t) == 16, "2 lanes of double");
+  return 0;
+}
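
Note on the test above: the difference between a lane count (`neon_vector_type`) and a byte count (`vector_size`) can be illustrated with a small sketch. The helper `neon_vector_bytes` is hypothetical and only restates the arithmetic; it is not how CBMC's type conversion is implemented. The typedef uses the GCC/Clang `vector_size` extension and assumes the usual 2-byte `short` and 8-byte `double`.

```c
#include <assert.h>
#include <stddef.h>

/* GCC-style vector_size counts bytes: 16 bytes of short is 8 lanes. */
typedef short v8hi_bytes __attribute__((vector_size(16)));

/* Hypothetical helper: the byte size implied by a lane-count attribute such
   as neon_vector_type(lanes) applied to an element of elem_size bytes. */
static size_t neon_vector_bytes(size_t lanes, size_t elem_size)
{
  return lanes * elem_size;
}

static void demo_neon_vector_bytes(void)
{
  /* neon_vector_type(8) on short describes the same 16-byte vector. */
  assert(neon_vector_bytes(8, sizeof(short)) == sizeof(v8hi_bytes));
  /* neon_vector_type(2) on double is also 16 bytes. */
  assert(neon_vector_bytes(2, sizeof(double)) == 16);
}
```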
diff --git a/regression/ansi-c/gcc_neon_vector_type/test.desc b/regression/ansi-c/gcc_neon_vector_type/test.desc
new file mode 100644
index 00000000000..75cc69573e8
--- /dev/null
+++ b/regression/ansi-c/gcc_neon_vector_type/test.desc
@@ -0,0 +1,9 @@
+CORE gcc-only
+main.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
+^CONVERSION ERROR$
diff --git a/regression/cbmc-library/__builtin_ia32_lfence/main.c b/regression/cbmc-library/__builtin_ia32/lfence.c
similarity index 100%
rename from regression/cbmc-library/__builtin_ia32_lfence/main.c
rename to regression/cbmc-library/__builtin_ia32/lfence.c
diff --git a/regression/cbmc-library/__builtin_ia32_lfence/test.desc b/regression/cbmc-library/__builtin_ia32/lfence.desc
similarity index 92%
rename from regression/cbmc-library/__builtin_ia32_lfence/test.desc
rename to regression/cbmc-library/__builtin_ia32/lfence.desc
index 9542d988e8d..0f195a73692 100644
--- a/regression/cbmc-library/__builtin_ia32_lfence/test.desc
+++ b/regression/cbmc-library/__builtin_ia32/lfence.desc
@@ -1,5 +1,5 @@
 KNOWNBUG
-main.c
+lfence.c
 --pointer-check --bounds-check
 ^EXIT=0$
 ^SIGNAL=0$
diff --git a/regression/cbmc-library/__builtin_ia32_mfence/main.c b/regression/cbmc-library/__builtin_ia32/mfence.c
similarity index 100%
rename from regression/cbmc-library/__builtin_ia32_mfence/main.c
rename to regression/cbmc-library/__builtin_ia32/mfence.c
diff --git a/regression/cbmc-library/__builtin_ia32_mfence/test.desc b/regression/cbmc-library/__builtin_ia32/mfence.desc
similarity index 92%
rename from regression/cbmc-library/__builtin_ia32_mfence/test.desc
rename to regression/cbmc-library/__builtin_ia32/mfence.desc
index 9542d988e8d..02f1f6d6d1f 100644
--- a/regression/cbmc-library/__builtin_ia32_mfence/test.desc
+++ b/regression/cbmc-library/__builtin_ia32/mfence.desc
@@ -1,5 +1,5 @@
 KNOWNBUG
-main.c
+mfence.c
 --pointer-check --bounds-check
 ^EXIT=0$
 ^SIGNAL=0$
diff --git a/regression/cbmc-library/__builtin_ia32/pabsb128.c b/regression/cbmc-library/__builtin_ia32/pabsb128.c
new file mode 100644
index 00000000000..4ded3178df6
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pabsb128.c
@@ -0,0 +1,15 @@
+#include <limits.h>
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_pabsb128(__gcc_v16qi);
+
+int main()
+{
+  // Lane 0 is the interesting hardware case: pabsb leaves SCHAR_MIN unchanged
+  // (its absolute value is not representable as a signed byte).
+  __gcc_v16qi a = (__gcc_v16qi){
+    SCHAR_MIN, -2, 3, -4, 5, -6, 7, -8, 9, -10, 11, -12, 13, -14, 15, -16};
+  __gcc_v16qi r = __builtin_ia32_pabsb128(a);
+  __CPROVER_assert(r[0] == SCHAR_MIN && r[1] == 2 && r[15] == 16, "abs epi8");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pabsb128.desc b/regression/cbmc-library/__builtin_ia32/pabsb128.desc
new file mode 100644
index 00000000000..0beb8e2bdc4
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pabsb128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pabsb128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pabsd128.c b/regression/cbmc-library/__builtin_ia32/pabsd128.c
new file mode 100644
index 00000000000..deb4ccf7bc6
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pabsd128.c
@@ -0,0 +1,15 @@
+#include <limits.h>
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+__gcc_v4si __builtin_ia32_pabsd128(__gcc_v4si);
+
+int main()
+{
+  // Lane 0 is the interesting hardware case: pabsd leaves INT_MIN unchanged
+  // (its absolute value is not representable), and it is also the input that
+  // exposed the -INT_MIN signed-overflow UB in the previous model.
+  __gcc_v4si a = (__gcc_v4si){INT_MIN, -2, 3, -4};
+  __gcc_v4si r = __builtin_ia32_pabsd128(a);
+  __CPROVER_assert(r[0] == INT_MIN && r[1] == 2 && r[3] == 4, "abs epi32");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pabsd128.desc b/regression/cbmc-library/__builtin_ia32/pabsd128.desc
new file mode 100644
index 00000000000..83873dca7ea
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pabsd128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pabsd128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
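
Note on the `pabs*` tests: one way to compute a per-lane absolute value that leaves `INT_MIN` unchanged without evaluating `-INT_MIN` is to negate in unsigned arithmetic. The sketch below is an illustration of that idea only; `abs_lane_i32` is a hypothetical name, the actual library model may be written differently, and a two's-complement `int` is assumed.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Per-lane absolute value in the style of PABSD: negate in unsigned
   arithmetic, where wrap-around is well defined, so INT_MIN maps to itself
   rather than triggering signed overflow. */
static int abs_lane_i32(int x)
{
  unsigned ux = (unsigned)x;
  unsigned mag = x < 0 ? 0u - ux : ux;
  int r;
  /* Copy the bits back to avoid an out-of-range unsigned-to-int conversion. */
  memcpy(&r, &mag, sizeof r);
  return r;
}

static void demo_abs_lane(void)
{
  assert(abs_lane_i32(INT_MIN) == INT_MIN); /* not representable: unchanged */
  assert(abs_lane_i32(-2) == 2);
  assert(abs_lane_i32(3) == 3);
}
```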
diff --git a/regression/cbmc-library/__builtin_ia32/pabsd256.c b/regression/cbmc-library/__builtin_ia32/pabsd256.c
new file mode 100644
index 00000000000..f3badb87b0b
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pabsd256.c
@@ -0,0 +1,14 @@
+#include <limits.h>
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+__gcc_v8si __builtin_ia32_pabsd256(__gcc_v8si);
+
+int main()
+{
+  // Lane 0: pabsd leaves INT_MIN unchanged (no UB in the model).
+  __gcc_v8si a = (__gcc_v8si){INT_MIN, -2, 3, -4, 5, -6, 7, -8};
+  __gcc_v8si r = __builtin_ia32_pabsd256(a);
+  __CPROVER_assert(
+    r[0] == INT_MIN && r[1] == 2 && r[7] == 8, "abs epi32 (256)");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pabsd256.desc b/regression/cbmc-library/__builtin_ia32/pabsd256.desc
new file mode 100644
index 00000000000..c6770eb23ce
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pabsd256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pabsd256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pabsw128.c b/regression/cbmc-library/__builtin_ia32/pabsw128.c
new file mode 100644
index 00000000000..82537fd245d
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pabsw128.c
@@ -0,0 +1,14 @@
+#include <limits.h>
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_pabsw128(__gcc_v8hi);
+
+int main()
+{
+  // Lane 0 is the interesting hardware case: pabsw leaves SHRT_MIN unchanged
+  // (its absolute value is not representable as a signed 16-bit value).
+  __gcc_v8hi a = (__gcc_v8hi){SHRT_MIN, -2, 3, -4, 5, -6, 7, -8};
+  __gcc_v8hi r = __builtin_ia32_pabsw128(a);
+  __CPROVER_assert(r[0] == SHRT_MIN && r[1] == 2 && r[7] == 8, "abs epi16");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pabsw128.desc b/regression/cbmc-library/__builtin_ia32/pabsw128.desc
new file mode 100644
index 00000000000..03a52862cd5
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pabsw128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pabsw128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddb.c b/regression/cbmc-library/__builtin_ia32/paddb.c
new file mode 100644
index 00000000000..37525198d9b
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddb.c
@@ -0,0 +1,18 @@
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+__gcc_v8qi __builtin_ia32_paddb(__gcc_v8qi, __gcc_v8qi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native +) for all inputs.
+  __gcc_v8qi a, b;
+  __gcc_v8qi r = __builtin_ia32_paddb(a, b);
+  __gcc_v8qi_u ref = (__gcc_v8qi_u)a + (__gcc_v8qi_u)b;
+  __CPROVER_assert(
+    r[0] == (char)ref[0] && r[1] == (char)ref[1] && r[2] == (char)ref[2] &&
+      r[3] == (char)ref[3] && r[4] == (char)ref[4] && r[5] == (char)ref[5] &&
+      r[6] == (char)ref[6] && r[7] == (char)ref[7],
+    "__builtin_ia32_paddb == native +");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddb.desc b/regression/cbmc-library/__builtin_ia32/paddb.desc
new file mode 100644
index 00000000000..70f3ddd3ef9
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddb.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddb.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddb128.c b/regression/cbmc-library/__builtin_ia32/paddb128.c
new file mode 100644
index 00000000000..402a19e35b9
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddb128.c
@@ -0,0 +1,22 @@
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_paddb128(__gcc_v16qi, __gcc_v16qi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native +) for all inputs.
+  __gcc_v16qi a, b;
+  __gcc_v16qi r = __builtin_ia32_paddb128(a, b);
+  __gcc_v16qi_u ref = (__gcc_v16qi_u)a + (__gcc_v16qi_u)b;
+  __CPROVER_assert(
+    r[0] == (char)ref[0] && r[1] == (char)ref[1] && r[2] == (char)ref[2] &&
+      r[3] == (char)ref[3] && r[4] == (char)ref[4] && r[5] == (char)ref[5] &&
+      r[6] == (char)ref[6] && r[7] == (char)ref[7] && r[8] == (char)ref[8] &&
+      r[9] == (char)ref[9] && r[10] == (char)ref[10] &&
+      r[11] == (char)ref[11] && r[12] == (char)ref[12] &&
+      r[13] == (char)ref[13] && r[14] == (char)ref[14] &&
+      r[15] == (char)ref[15],
+    "__builtin_ia32_paddb128 == native +");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddb128.desc b/regression/cbmc-library/__builtin_ia32/paddb128.desc
new file mode 100644
index 00000000000..06c7d429816
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddb128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddb128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddb256.c b/regression/cbmc-library/__builtin_ia32/paddb256.c
new file mode 100644
index 00000000000..ab11bef4413
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddb256.c
@@ -0,0 +1,30 @@
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef unsigned char __gcc_v32qi_u __attribute__((__vector_size__(32)));
+__gcc_v32qi __builtin_ia32_paddb256(__gcc_v32qi, __gcc_v32qi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native +) for all inputs.
+  __gcc_v32qi a, b;
+  __gcc_v32qi r = __builtin_ia32_paddb256(a, b);
+  __gcc_v32qi_u ref = (__gcc_v32qi_u)a + (__gcc_v32qi_u)b;
+  __CPROVER_assert(
+    r[0] == (char)ref[0] && r[1] == (char)ref[1] && r[2] == (char)ref[2] &&
+      r[3] == (char)ref[3] && r[4] == (char)ref[4] && r[5] == (char)ref[5] &&
+      r[6] == (char)ref[6] && r[7] == (char)ref[7] && r[8] == (char)ref[8] &&
+      r[9] == (char)ref[9] && r[10] == (char)ref[10] &&
+      r[11] == (char)ref[11] && r[12] == (char)ref[12] &&
+      r[13] == (char)ref[13] && r[14] == (char)ref[14] &&
+      r[15] == (char)ref[15] && r[16] == (char)ref[16] &&
+      r[17] == (char)ref[17] && r[18] == (char)ref[18] &&
+      r[19] == (char)ref[19] && r[20] == (char)ref[20] &&
+      r[21] == (char)ref[21] && r[22] == (char)ref[22] &&
+      r[23] == (char)ref[23] && r[24] == (char)ref[24] &&
+      r[25] == (char)ref[25] && r[26] == (char)ref[26] &&
+      r[27] == (char)ref[27] && r[28] == (char)ref[28] &&
+      r[29] == (char)ref[29] && r[30] == (char)ref[30] &&
+      r[31] == (char)ref[31],
+    "__builtin_ia32_paddb256 == native +");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddb256.desc b/regression/cbmc-library/__builtin_ia32/paddb256.desc
new file mode 100644
index 00000000000..ab5d46ddab9
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddb256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddb256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddd.c b/regression/cbmc-library/__builtin_ia32/paddd.c
new file mode 100644
index 00000000000..92b8f1afe75
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddd.c
@@ -0,0 +1,16 @@
+typedef int __gcc_v2si __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+__gcc_v2si __builtin_ia32_paddd(__gcc_v2si, __gcc_v2si);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native +) for all inputs.
+  __gcc_v2si a, b;
+  __gcc_v2si r = __builtin_ia32_paddd(a, b);
+  __gcc_v2si_u ref = (__gcc_v2si_u)a + (__gcc_v2si_u)b;
+  __CPROVER_assert(
+    r[0] == (int)ref[0] && r[1] == (int)ref[1],
+    "__builtin_ia32_paddd == native +");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddd.desc b/regression/cbmc-library/__builtin_ia32/paddd.desc
new file mode 100644
index 00000000000..47a5730bb91
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddd.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddd.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddd128.c b/regression/cbmc-library/__builtin_ia32/paddd128.c
new file mode 100644
index 00000000000..ed3ba7ae970
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddd128.c
@@ -0,0 +1,17 @@
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+__gcc_v4si __builtin_ia32_paddd128(__gcc_v4si, __gcc_v4si);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native +) for all inputs.
+  __gcc_v4si a, b;
+  __gcc_v4si r = __builtin_ia32_paddd128(a, b);
+  __gcc_v4si_u ref = (__gcc_v4si_u)a + (__gcc_v4si_u)b;
+  __CPROVER_assert(
+    r[0] == (int)ref[0] && r[1] == (int)ref[1] && r[2] == (int)ref[2] &&
+      r[3] == (int)ref[3],
+    "__builtin_ia32_paddd128 == native +");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddd128.desc b/regression/cbmc-library/__builtin_ia32/paddd128.desc
new file mode 100644
index 00000000000..9ae4f9754bf
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddd128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddd128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddd128_mask.c b/regression/cbmc-library/__builtin_ia32/paddd128_mask.c
new file mode 100644
index 00000000000..0481c42ebf5
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddd128_mask.c
@@ -0,0 +1,15 @@
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+__gcc_v4si
+__builtin_ia32_paddd128_mask(__gcc_v4si, __gcc_v4si, __gcc_v4si, unsigned char);
+
+int main()
+{
+  __gcc_v4si a = {1, 1, 1, 1};
+  __gcc_v4si b = {2, 2, 2, 2};
+  __gcc_v4si src = {9, 9, 9, 9};
+  // Mask 0x5: bits 0 and 2 set -> a+b (3); lanes 1 and 3 keep the source (9).
+  __gcc_v4si r = __builtin_ia32_paddd128_mask(a, b, src, 0x5);
+  __CPROVER_assert(
+    r[0] == 3 && r[1] == 9 && r[2] == 3 && r[3] == 9, "paddd128 merge-masked");
+  return 0;
+}
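
Note on the merge-masked test above: the behaviour it checks (bit `i` of the mask selects `a+b` for lane `i`, otherwise the lane keeps the merge source) can be sketched with plain arrays. `add_mask_i32x4` is a hypothetical stand-in, not the library model, and it assumes a two's-complement `int`.

```c
#include <assert.h>
#include <string.h>

/* Merge-masked 4-lane add: lanes whose mask bit is set receive a+b,
   the remaining lanes are copied from src. */
static void add_mask_i32x4(
  const int a[4],
  const int b[4],
  const int src[4],
  unsigned char mask,
  int r[4])
{
  for(int i = 0; i < 4; ++i)
  {
    /* Add in unsigned arithmetic so wrap-around is well defined. */
    unsigned sum = (unsigned)a[i] + (unsigned)b[i];
    int s;
    memcpy(&s, &sum, sizeof s);
    r[i] = ((mask >> i) & 1u) ? s : src[i];
  }
}

static void demo_add_mask(void)
{
  const int a[4] = {1, 1, 1, 1};
  const int b[4] = {2, 2, 2, 2};
  const int src[4] = {9, 9, 9, 9};
  int r[4];
  add_mask_i32x4(a, b, src, 0x5, r); /* bits 0 and 2 set */
  assert(r[0] == 3 && r[1] == 9 && r[2] == 3 && r[3] == 9);
}
```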
diff --git a/regression/cbmc-library/__builtin_ia32/paddd128_mask.desc b/regression/cbmc-library/__builtin_ia32/paddd128_mask.desc
new file mode 100644
index 00000000000..caba91848eb
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddd128_mask.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddd128_mask.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddd256.c b/regression/cbmc-library/__builtin_ia32/paddd256.c
new file mode 100644
index 00000000000..38bb6aec7bd
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddd256.c
@@ -0,0 +1,18 @@
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+__gcc_v8si __builtin_ia32_paddd256(__gcc_v8si, __gcc_v8si);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native +) for all inputs.
+  __gcc_v8si a, b;
+  __gcc_v8si r = __builtin_ia32_paddd256(a, b);
+  __gcc_v8si_u ref = (__gcc_v8si_u)a + (__gcc_v8si_u)b;
+  __CPROVER_assert(
+    r[0] == (int)ref[0] && r[1] == (int)ref[1] && r[2] == (int)ref[2] &&
+      r[3] == (int)ref[3] && r[4] == (int)ref[4] && r[5] == (int)ref[5] &&
+      r[6] == (int)ref[6] && r[7] == (int)ref[7],
+    "__builtin_ia32_paddd256 == native +");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddd256.desc b/regression/cbmc-library/__builtin_ia32/paddd256.desc
new file mode 100644
index 00000000000..a7794dd351d
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddd256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddd256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddd512_mask.c b/regression/cbmc-library/__builtin_ia32/paddd512_mask.c
new file mode 100644
index 00000000000..99066b2e14f
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddd512_mask.c
@@ -0,0 +1,20 @@
+typedef int __gcc_v16si __attribute__((__vector_size__(64)));
+__gcc_v16si __builtin_ia32_paddd512_mask(
+  __gcc_v16si,
+  __gcc_v16si,
+  __gcc_v16si,
+  unsigned short);
+
+int main()
+{
+  __gcc_v16si a = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+  __gcc_v16si b = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+  __gcc_v16si src = {9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+  // Mask 0x0005: bits 0 and 2 set -> those lanes get a+b (3), the rest keep
+  // the merge source (9).
+  __gcc_v16si r = __builtin_ia32_paddd512_mask(a, b, src, 0x0005);
+  __CPROVER_assert(
+    r[0] == 3 && r[1] == 9 && r[2] == 3 && r[3] == 9 && r[15] == 9,
+    "paddd512 merge-masked add");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddd512_mask.desc b/regression/cbmc-library/__builtin_ia32/paddd512_mask.desc
new file mode 100644
index 00000000000..a33660c2449
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddd512_mask.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddd512_mask.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddq128.c b/regression/cbmc-library/__builtin_ia32/paddq128.c
new file mode 100644
index 00000000000..1146ad24d6f
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddq128.c
@@ -0,0 +1,16 @@
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+__gcc_v2di __builtin_ia32_paddq128(__gcc_v2di, __gcc_v2di);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native +) for all inputs.
+  __gcc_v2di a, b;
+  __gcc_v2di r = __builtin_ia32_paddq128(a, b);
+  __gcc_v2di_u ref = (__gcc_v2di_u)a + (__gcc_v2di_u)b;
+  __CPROVER_assert(
+    r[0] == (long long)ref[0] && r[1] == (long long)ref[1],
+    "__builtin_ia32_paddq128 == native +");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddq128.desc b/regression/cbmc-library/__builtin_ia32/paddq128.desc
new file mode 100644
index 00000000000..de23d1d1065
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddq128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddq128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddq256.c b/regression/cbmc-library/__builtin_ia32/paddq256.c
new file mode 100644
index 00000000000..7105833741a
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddq256.c
@@ -0,0 +1,17 @@
+typedef long long __gcc_v4di __attribute__((__vector_size__(32)));
+typedef unsigned long long __gcc_v4di_u __attribute__((__vector_size__(32)));
+__gcc_v4di __builtin_ia32_paddq256(__gcc_v4di, __gcc_v4di);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native +) for all inputs.
+  __gcc_v4di a, b;
+  __gcc_v4di r = __builtin_ia32_paddq256(a, b);
+  __gcc_v4di_u ref = (__gcc_v4di_u)a + (__gcc_v4di_u)b;
+  __CPROVER_assert(
+    r[0] == (long long)ref[0] && r[1] == (long long)ref[1] &&
+      r[2] == (long long)ref[2] && r[3] == (long long)ref[3],
+    "__builtin_ia32_paddq256 == native +");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddq256.desc b/regression/cbmc-library/__builtin_ia32/paddq256.desc
new file mode 100644
index 00000000000..98a14966bb6
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddq256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddq256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddsb128.c b/regression/cbmc-library/__builtin_ia32/paddsb128.c
new file mode 100644
index 00000000000..3fabd00c815
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddsb128.c
@@ -0,0 +1,14 @@
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_paddsb128(__gcc_v16qi, __gcc_v16qi);
+
+int main()
+{
+  // Signed saturation: 100+50=150 clamps to 127; -100+-50=-150 clamps to -128.
+  __gcc_v16qi a =
+    (__gcc_v16qi){100, -100, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
+  __gcc_v16qi b =
+    (__gcc_v16qi){50, -50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+  __gcc_v16qi r = __builtin_ia32_paddsb128(a, b);
+  __CPROVER_assert(r[0] == 127 && r[1] == -128 && r[2] == 4, "adds epi8");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddsb128.desc b/regression/cbmc-library/__builtin_ia32/paddsb128.desc
new file mode 100644
index 00000000000..43e7c7d7f7f
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddsb128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddsb128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddsw128.c b/regression/cbmc-library/__builtin_ia32/paddsw128.c
new file mode 100644
index 00000000000..34f173e73e3
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddsw128.c
@@ -0,0 +1,13 @@
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_paddsw128(__gcc_v8hi, __gcc_v8hi);
+
+int main()
+{
+  // Signed 16-bit saturation: 30000+10000 clamps to 32767; -30000+-10000 to
+  // -32768.
+  __gcc_v8hi a = (__gcc_v8hi){30000, -30000, 3, 4, 5, 6, 7, 8};
+  __gcc_v8hi b = (__gcc_v8hi){10000, -10000, 1, 1, 1, 1, 1, 1};
+  __gcc_v8hi r = __builtin_ia32_paddsw128(a, b);
+  __CPROVER_assert(r[0] == 32767 && r[1] == -32768 && r[2] == 4, "adds epi16");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddsw128.desc b/regression/cbmc-library/__builtin_ia32/paddsw128.desc
new file mode 100644
index 00000000000..394390b5b55
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddsw128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddsw128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddusb128.c b/regression/cbmc-library/__builtin_ia32/paddusb128.c
new file mode 100644
index 00000000000..782afb36308
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddusb128.c
@@ -0,0 +1,15 @@
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_paddusb128(__gcc_v16qi, __gcc_v16qi);
+
+int main()
+{
+  // Unsigned saturation: the bytes 200 and 100 (200 is stored as signed char
+  // -56) sum to 300, which clamps to 255 == -1 as a signed byte.
+  __gcc_v16qi a =
+    (__gcc_v16qi){200, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
+  __gcc_v16qi b =
+    (__gcc_v16qi){100, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+  __gcc_v16qi r = __builtin_ia32_paddusb128(a, b);
+  __CPROVER_assert(r[0] == -1 && r[1] == 2, "adds epu8");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddusb128.desc b/regression/cbmc-library/__builtin_ia32/paddusb128.desc
new file mode 100644
index 00000000000..8b16926d005
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddusb128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddusb128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddw.c b/regression/cbmc-library/__builtin_ia32/paddw.c
new file mode 100644
index 00000000000..2b33555f7a4
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddw.c
@@ -0,0 +1,17 @@
+typedef short __gcc_v4hi __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+__gcc_v4hi __builtin_ia32_paddw(__gcc_v4hi, __gcc_v4hi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native +) for all inputs.
+  __gcc_v4hi a, b;
+  __gcc_v4hi r = __builtin_ia32_paddw(a, b);
+  __gcc_v4hi_u ref = (__gcc_v4hi_u)a + (__gcc_v4hi_u)b;
+  __CPROVER_assert(
+    r[0] == (short)ref[0] && r[1] == (short)ref[1] && r[2] == (short)ref[2] &&
+      r[3] == (short)ref[3],
+    "__builtin_ia32_paddw == native +");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddw.desc b/regression/cbmc-library/__builtin_ia32/paddw.desc
new file mode 100644
index 00000000000..4245f6710b8
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddw.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddw.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddw128.c b/regression/cbmc-library/__builtin_ia32/paddw128.c
new file mode 100644
index 00000000000..d3a65c4474c
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddw128.c
@@ -0,0 +1,18 @@
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_paddw128(__gcc_v8hi, __gcc_v8hi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native +) for all inputs.
+  __gcc_v8hi a, b;
+  __gcc_v8hi r = __builtin_ia32_paddw128(a, b);
+  __gcc_v8hi_u ref = (__gcc_v8hi_u)a + (__gcc_v8hi_u)b;
+  __CPROVER_assert(
+    r[0] == (short)ref[0] && r[1] == (short)ref[1] && r[2] == (short)ref[2] &&
+      r[3] == (short)ref[3] && r[4] == (short)ref[4] && r[5] == (short)ref[5] &&
+      r[6] == (short)ref[6] && r[7] == (short)ref[7],
+    "__builtin_ia32_paddw128 == native +");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddw128.desc b/regression/cbmc-library/__builtin_ia32/paddw128.desc
new file mode 100644
index 00000000000..7e83acf4cb3
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddw128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddw128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/paddw256.c b/regression/cbmc-library/__builtin_ia32/paddw256.c
new file mode 100644
index 00000000000..e4576b36589
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddw256.c
@@ -0,0 +1,22 @@
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+__gcc_v16hi __builtin_ia32_paddw256(__gcc_v16hi, __gcc_v16hi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native +) for all inputs.
+  __gcc_v16hi a, b;
+  __gcc_v16hi r = __builtin_ia32_paddw256(a, b);
+  __gcc_v16hi_u ref = (__gcc_v16hi_u)a + (__gcc_v16hi_u)b;
+  __CPROVER_assert(
+    r[0] == (short)ref[0] && r[1] == (short)ref[1] && r[2] == (short)ref[2] &&
+      r[3] == (short)ref[3] && r[4] == (short)ref[4] && r[5] == (short)ref[5] &&
+      r[6] == (short)ref[6] && r[7] == (short)ref[7] && r[8] == (short)ref[8] &&
+      r[9] == (short)ref[9] && r[10] == (short)ref[10] &&
+      r[11] == (short)ref[11] && r[12] == (short)ref[12] &&
+      r[13] == (short)ref[13] && r[14] == (short)ref[14] &&
+      r[15] == (short)ref[15],
+    "__builtin_ia32_paddw256 == native +");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/paddw256.desc b/regression/cbmc-library/__builtin_ia32/paddw256.desc
new file mode 100644
index 00000000000..9a454eb7f0e
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/paddw256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+paddw256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pand128.c b/regression/cbmc-library/__builtin_ia32/pand128.c
new file mode 100644
index 00000000000..e4edad524a0
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pand128.c
@@ -0,0 +1,14 @@
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+__gcc_v2di __builtin_ia32_pand128(__gcc_v2di, __gcc_v2di);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native &) for all inputs.
+  __gcc_v2di a, b;
+  __gcc_v2di r = __builtin_ia32_pand128(a, b);
+  __gcc_v2di ref = a & b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1], "__builtin_ia32_pand128 == native &");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pand128.desc b/regression/cbmc-library/__builtin_ia32/pand128.desc
new file mode 100644
index 00000000000..c57840f1078
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pand128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pand128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pandn128.c b/regression/cbmc-library/__builtin_ia32/pandn128.c
new file mode 100644
index 00000000000..15182358781
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pandn128.c
@@ -0,0 +1,15 @@
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+__gcc_v2di __builtin_ia32_pandn128(__gcc_v2di, __gcc_v2di);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native ~a & b) for all inputs.
+  __gcc_v2di a, b;
+  __gcc_v2di r = __builtin_ia32_pandn128(a, b);
+  __gcc_v2di ref = ~a & b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1],
+    "__builtin_ia32_pandn128 == native ~a & b");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pandn128.desc b/regression/cbmc-library/__builtin_ia32/pandn128.desc
new file mode 100644
index 00000000000..3bbccccbef0
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pandn128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pandn128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pavgb128.c b/regression/cbmc-library/__builtin_ia32/pavgb128.c
new file mode 100644
index 00000000000..28bdb854dda
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pavgb128.c
@@ -0,0 +1,15 @@
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_pavgb128(__gcc_v16qi, __gcc_v16qi);
+
+int main()
+{
+  // Lane 0 distinguishes unsigned from signed: -1 is 255 unsigned, so the
+  // rounded unsigned average of {255, 1} is (255 + 1 + 1) >> 1 == 128, which
+  // is -128 as a signed byte (a signed average would give 0).
+  __gcc_v16qi a =
+    (__gcc_v16qi){-1, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32};
+  __gcc_v16qi b = (__gcc_v16qi){1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+  __gcc_v16qi r = __builtin_ia32_pavgb128(a, b);
+  __CPROVER_assert(r[0] == -128 && r[1] == 4, "avg epu8");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pavgb128.desc b/regression/cbmc-library/__builtin_ia32/pavgb128.desc
new file mode 100644
index 00000000000..4031b7619c2
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pavgb128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pavgb128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pavgw128.c b/regression/cbmc-library/__builtin_ia32/pavgw128.c
new file mode 100644
index 00000000000..de49e563f75
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pavgw128.c
@@ -0,0 +1,14 @@
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_pavgw128(__gcc_v8hi, __gcc_v8hi);
+
+int main()
+{
+  // Lane 0 distinguishes unsigned from signed: -1 is 65535 unsigned, so the
+  // rounded unsigned average of {65535, 1} is (65535 + 1 + 1) >> 1 == 32768,
+  // which is -32768 as a signed 16-bit value (a signed average would give 0).
+  __gcc_v8hi a = (__gcc_v8hi){-1, 4, 6, 8, 10, 12, 14, 16};
+  __gcc_v8hi b = (__gcc_v8hi){1, 4, 4, 4, 4, 4, 4, 4};
+  __gcc_v8hi r = __builtin_ia32_pavgw128(a, b);
+  __CPROVER_assert(r[0] == -32768 && r[1] == 4, "avg epu16");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pavgw128.desc b/regression/cbmc-library/__builtin_ia32/pavgw128.desc
new file mode 100644
index 00000000000..445c4ef3170
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pavgw128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pavgw128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpeqb128.c b/regression/cbmc-library/__builtin_ia32/pcmpeqb128.c
new file mode 100644
index 00000000000..f309031f310
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpeqb128.c
@@ -0,0 +1,19 @@
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_pcmpeqb128(__gcc_v16qi, __gcc_v16qi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native ==) for all inputs.
+  __gcc_v16qi a, b;
+  __gcc_v16qi r = __builtin_ia32_pcmpeqb128(a, b);
+  __gcc_v16qi ref = a == b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3] &&
+      r[4] == ref[4] && r[5] == ref[5] && r[6] == ref[6] && r[7] == ref[7] &&
+      r[8] == ref[8] && r[9] == ref[9] && r[10] == ref[10] &&
+      r[11] == ref[11] && r[12] == ref[12] && r[13] == ref[13] &&
+      r[14] == ref[14] && r[15] == ref[15],
+    "__builtin_ia32_pcmpeqb128 == native ==");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpeqb128.desc b/regression/cbmc-library/__builtin_ia32/pcmpeqb128.desc
new file mode 100644
index 00000000000..375804c4553
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpeqb128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pcmpeqb128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpeqb256.c b/regression/cbmc-library/__builtin_ia32/pcmpeqb256.c
new file mode 100644
index 00000000000..a5a3d10a968
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpeqb256.c
@@ -0,0 +1,24 @@
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+__gcc_v32qi __builtin_ia32_pcmpeqb256(__gcc_v32qi, __gcc_v32qi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native ==) for all inputs.
+  __gcc_v32qi a, b;
+  __gcc_v32qi r = __builtin_ia32_pcmpeqb256(a, b);
+  __gcc_v32qi ref = a == b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3] &&
+      r[4] == ref[4] && r[5] == ref[5] && r[6] == ref[6] && r[7] == ref[7] &&
+      r[8] == ref[8] && r[9] == ref[9] && r[10] == ref[10] &&
+      r[11] == ref[11] && r[12] == ref[12] && r[13] == ref[13] &&
+      r[14] == ref[14] && r[15] == ref[15] && r[16] == ref[16] &&
+      r[17] == ref[17] && r[18] == ref[18] && r[19] == ref[19] &&
+      r[20] == ref[20] && r[21] == ref[21] && r[22] == ref[22] &&
+      r[23] == ref[23] && r[24] == ref[24] && r[25] == ref[25] &&
+      r[26] == ref[26] && r[27] == ref[27] && r[28] == ref[28] &&
+      r[29] == ref[29] && r[30] == ref[30] && r[31] == ref[31],
+    "__builtin_ia32_pcmpeqb256 == native ==");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpeqb256.desc b/regression/cbmc-library/__builtin_ia32/pcmpeqb256.desc
new file mode 100644
index 00000000000..424832d97a0
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpeqb256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pcmpeqb256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpeqd128.c b/regression/cbmc-library/__builtin_ia32/pcmpeqd128.c
new file mode 100644
index 00000000000..7c725c15988
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpeqd128.c
@@ -0,0 +1,15 @@
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+__gcc_v4si __builtin_ia32_pcmpeqd128(__gcc_v4si, __gcc_v4si);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native ==) for all inputs.
+  __gcc_v4si a, b;
+  __gcc_v4si r = __builtin_ia32_pcmpeqd128(a, b);
+  __gcc_v4si ref = a == b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3],
+    "__builtin_ia32_pcmpeqd128 == native ==");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpeqd128.desc b/regression/cbmc-library/__builtin_ia32/pcmpeqd128.desc
new file mode 100644
index 00000000000..6380ab8ccc9
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpeqd128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pcmpeqd128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpeqd256.c b/regression/cbmc-library/__builtin_ia32/pcmpeqd256.c
new file mode 100644
index 00000000000..139f9b6c503
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpeqd256.c
@@ -0,0 +1,16 @@
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+__gcc_v8si __builtin_ia32_pcmpeqd256(__gcc_v8si, __gcc_v8si);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native ==) for all inputs.
+  __gcc_v8si a, b;
+  __gcc_v8si r = __builtin_ia32_pcmpeqd256(a, b);
+  __gcc_v8si ref = a == b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3] &&
+      r[4] == ref[4] && r[5] == ref[5] && r[6] == ref[6] && r[7] == ref[7],
+    "__builtin_ia32_pcmpeqd256 == native ==");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpeqd256.desc b/regression/cbmc-library/__builtin_ia32/pcmpeqd256.desc
new file mode 100644
index 00000000000..fdc80820996
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpeqd256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pcmpeqd256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpeqw128.c b/regression/cbmc-library/__builtin_ia32/pcmpeqw128.c
new file mode 100644
index 00000000000..669bb6d0b66
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpeqw128.c
@@ -0,0 +1,16 @@
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_pcmpeqw128(__gcc_v8hi, __gcc_v8hi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native ==) for all inputs.
+  __gcc_v8hi a, b;
+  __gcc_v8hi r = __builtin_ia32_pcmpeqw128(a, b);
+  __gcc_v8hi ref = a == b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3] &&
+      r[4] == ref[4] && r[5] == ref[5] && r[6] == ref[6] && r[7] == ref[7],
+    "__builtin_ia32_pcmpeqw128 == native ==");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpeqw128.desc b/regression/cbmc-library/__builtin_ia32/pcmpeqw128.desc
new file mode 100644
index 00000000000..c61fe25ed97
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpeqw128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pcmpeqw128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpeqw256.c b/regression/cbmc-library/__builtin_ia32/pcmpeqw256.c
new file mode 100644
index 00000000000..39d72b8e0f9
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpeqw256.c
@@ -0,0 +1,19 @@
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+__gcc_v16hi __builtin_ia32_pcmpeqw256(__gcc_v16hi, __gcc_v16hi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native ==) for all inputs.
+  __gcc_v16hi a, b;
+  __gcc_v16hi r = __builtin_ia32_pcmpeqw256(a, b);
+  __gcc_v16hi ref = a == b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3] &&
+      r[4] == ref[4] && r[5] == ref[5] && r[6] == ref[6] && r[7] == ref[7] &&
+      r[8] == ref[8] && r[9] == ref[9] && r[10] == ref[10] &&
+      r[11] == ref[11] && r[12] == ref[12] && r[13] == ref[13] &&
+      r[14] == ref[14] && r[15] == ref[15],
+    "__builtin_ia32_pcmpeqw256 == native ==");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpeqw256.desc b/regression/cbmc-library/__builtin_ia32/pcmpeqw256.desc
new file mode 100644
index 00000000000..15a3db92b38
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpeqw256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pcmpeqw256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpgtb128.c b/regression/cbmc-library/__builtin_ia32/pcmpgtb128.c
new file mode 100644
index 00000000000..d77ce9868f5
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpgtb128.c
@@ -0,0 +1,19 @@
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_pcmpgtb128(__gcc_v16qi, __gcc_v16qi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native >) for all inputs.
+  __gcc_v16qi a, b;
+  __gcc_v16qi r = __builtin_ia32_pcmpgtb128(a, b);
+  __gcc_v16qi ref = a > b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3] &&
+      r[4] == ref[4] && r[5] == ref[5] && r[6] == ref[6] && r[7] == ref[7] &&
+      r[8] == ref[8] && r[9] == ref[9] && r[10] == ref[10] &&
+      r[11] == ref[11] && r[12] == ref[12] && r[13] == ref[13] &&
+      r[14] == ref[14] && r[15] == ref[15],
+    "__builtin_ia32_pcmpgtb128 == native >");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpgtb128.desc b/regression/cbmc-library/__builtin_ia32/pcmpgtb128.desc
new file mode 100644
index 00000000000..68134593e5a
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpgtb128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pcmpgtb128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpgtb256.c b/regression/cbmc-library/__builtin_ia32/pcmpgtb256.c
new file mode 100644
index 00000000000..e0a35f5f8e4
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpgtb256.c
@@ -0,0 +1,24 @@
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+__gcc_v32qi __builtin_ia32_pcmpgtb256(__gcc_v32qi, __gcc_v32qi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native >) for all inputs.
+  __gcc_v32qi a, b;
+  __gcc_v32qi r = __builtin_ia32_pcmpgtb256(a, b);
+  __gcc_v32qi ref = a > b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3] &&
+      r[4] == ref[4] && r[5] == ref[5] && r[6] == ref[6] && r[7] == ref[7] &&
+      r[8] == ref[8] && r[9] == ref[9] && r[10] == ref[10] &&
+      r[11] == ref[11] && r[12] == ref[12] && r[13] == ref[13] &&
+      r[14] == ref[14] && r[15] == ref[15] && r[16] == ref[16] &&
+      r[17] == ref[17] && r[18] == ref[18] && r[19] == ref[19] &&
+      r[20] == ref[20] && r[21] == ref[21] && r[22] == ref[22] &&
+      r[23] == ref[23] && r[24] == ref[24] && r[25] == ref[25] &&
+      r[26] == ref[26] && r[27] == ref[27] && r[28] == ref[28] &&
+      r[29] == ref[29] && r[30] == ref[30] && r[31] == ref[31],
+    "__builtin_ia32_pcmpgtb256 == native >");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpgtb256.desc b/regression/cbmc-library/__builtin_ia32/pcmpgtb256.desc
new file mode 100644
index 00000000000..efe7db4e7a9
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpgtb256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pcmpgtb256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpgtd128.c b/regression/cbmc-library/__builtin_ia32/pcmpgtd128.c
new file mode 100644
index 00000000000..4f292a46afa
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpgtd128.c
@@ -0,0 +1,15 @@
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+__gcc_v4si __builtin_ia32_pcmpgtd128(__gcc_v4si, __gcc_v4si);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native >) for all inputs.
+  __gcc_v4si a, b;
+  __gcc_v4si r = __builtin_ia32_pcmpgtd128(a, b);
+  __gcc_v4si ref = a > b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3],
+    "__builtin_ia32_pcmpgtd128 == native >");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpgtd128.desc b/regression/cbmc-library/__builtin_ia32/pcmpgtd128.desc
new file mode 100644
index 00000000000..c98acf38b12
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpgtd128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pcmpgtd128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpgtd256.c b/regression/cbmc-library/__builtin_ia32/pcmpgtd256.c
new file mode 100644
index 00000000000..fdc03173c57
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpgtd256.c
@@ -0,0 +1,16 @@
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+__gcc_v8si __builtin_ia32_pcmpgtd256(__gcc_v8si, __gcc_v8si);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native >) for all inputs.
+  __gcc_v8si a, b;
+  __gcc_v8si r = __builtin_ia32_pcmpgtd256(a, b);
+  __gcc_v8si ref = a > b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3] &&
+      r[4] == ref[4] && r[5] == ref[5] && r[6] == ref[6] && r[7] == ref[7],
+    "__builtin_ia32_pcmpgtd256 == native >");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpgtd256.desc b/regression/cbmc-library/__builtin_ia32/pcmpgtd256.desc
new file mode 100644
index 00000000000..40838c745af
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpgtd256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pcmpgtd256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpgtw128.c b/regression/cbmc-library/__builtin_ia32/pcmpgtw128.c
new file mode 100644
index 00000000000..a6e1a85a5c1
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpgtw128.c
@@ -0,0 +1,16 @@
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_pcmpgtw128(__gcc_v8hi, __gcc_v8hi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native >) for all inputs.
+  __gcc_v8hi a, b;
+  __gcc_v8hi r = __builtin_ia32_pcmpgtw128(a, b);
+  __gcc_v8hi ref = a > b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3] &&
+      r[4] == ref[4] && r[5] == ref[5] && r[6] == ref[6] && r[7] == ref[7],
+    "__builtin_ia32_pcmpgtw128 == native >");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpgtw128.desc b/regression/cbmc-library/__builtin_ia32/pcmpgtw128.desc
new file mode 100644
index 00000000000..3666d2ea036
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpgtw128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pcmpgtw128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpgtw256.c b/regression/cbmc-library/__builtin_ia32/pcmpgtw256.c
new file mode 100644
index 00000000000..473493039c5
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpgtw256.c
@@ -0,0 +1,19 @@
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+__gcc_v16hi __builtin_ia32_pcmpgtw256(__gcc_v16hi, __gcc_v16hi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native >) for all inputs.
+  __gcc_v16hi a, b;
+  __gcc_v16hi r = __builtin_ia32_pcmpgtw256(a, b);
+  __gcc_v16hi ref = a > b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3] &&
+      r[4] == ref[4] && r[5] == ref[5] && r[6] == ref[6] && r[7] == ref[7] &&
+      r[8] == ref[8] && r[9] == ref[9] && r[10] == ref[10] &&
+      r[11] == ref[11] && r[12] == ref[12] && r[13] == ref[13] &&
+      r[14] == ref[14] && r[15] == ref[15],
+    "__builtin_ia32_pcmpgtw256 == native >");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pcmpgtw256.desc b/regression/cbmc-library/__builtin_ia32/pcmpgtw256.desc
new file mode 100644
index 00000000000..18e5aee5ed5
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pcmpgtw256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pcmpgtw256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxsb128.c b/regression/cbmc-library/__builtin_ia32/pmaxsb128.c
new file mode 100644
index 00000000000..8d9056db084
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxsb128.c
@@ -0,0 +1,13 @@
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_pmaxsb128(__gcc_v16qi, __gcc_v16qi);
+
+int main()
+{
+  __gcc_v16qi a = (__gcc_v16qi){
+    1, -2, 3, -4, 5, -6, 7, -8, 9, -10, 11, -12, 13, -14, 15, -16};
+  __gcc_v16qi b = (__gcc_v16qi){
+    -1, 2, -3, 4, -5, 6, -7, 8, -9, 10, -11, 12, -13, 14, -15, 16};
+  __gcc_v16qi r = __builtin_ia32_pmaxsb128(a, b);
+  __CPROVER_assert(r[0] == 1 && r[1] == 2, "max epi8");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxsb128.desc b/regression/cbmc-library/__builtin_ia32/pmaxsb128.desc
new file mode 100644
index 00000000000..96001578a50
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxsb128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pmaxsb128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxsd128.c b/regression/cbmc-library/__builtin_ia32/pmaxsd128.c
new file mode 100644
index 00000000000..4d9d07fbf93
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxsd128.c
@@ -0,0 +1,11 @@
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+__gcc_v4si __builtin_ia32_pmaxsd128(__gcc_v4si, __gcc_v4si);
+
+int main()
+{
+  __gcc_v4si a = (__gcc_v4si){1, -2, 3, -4};
+  __gcc_v4si b = (__gcc_v4si){-1, 2, -3, 4};
+  __gcc_v4si r = __builtin_ia32_pmaxsd128(a, b);
+  __CPROVER_assert(r[0] == 1 && r[1] == 2, "max epi32");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxsd128.desc b/regression/cbmc-library/__builtin_ia32/pmaxsd128.desc
new file mode 100644
index 00000000000..2a3c7a89a68
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxsd128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pmaxsd128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxsw128.c b/regression/cbmc-library/__builtin_ia32/pmaxsw128.c
new file mode 100644
index 00000000000..285d0c96bfb
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxsw128.c
@@ -0,0 +1,11 @@
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_pmaxsw128(__gcc_v8hi, __gcc_v8hi);
+
+int main()
+{
+  __gcc_v8hi a = (__gcc_v8hi){1, -2, 3, -4, 5, -6, 7, -8};
+  __gcc_v8hi b = (__gcc_v8hi){-1, 2, -3, 4, -5, 6, -7, 8};
+  __gcc_v8hi r = __builtin_ia32_pmaxsw128(a, b);
+  __CPROVER_assert(r[0] == 1 && r[1] == 2, "max epi16");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxsw128.desc b/regression/cbmc-library/__builtin_ia32/pmaxsw128.desc
new file mode 100644
index 00000000000..47602b1aa00
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxsw128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pmaxsw128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxub128.c b/regression/cbmc-library/__builtin_ia32/pmaxub128.c
new file mode 100644
index 00000000000..2307506d037
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxub128.c
@@ -0,0 +1,15 @@
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_pmaxub128(__gcc_v16qi, __gcc_v16qi);
+
+int main()
+{
+  // Lane 0 distinguishes unsigned from signed: -1 is 0xFF, the largest value
+  // under unsigned comparison, so the unsigned max of {-1, 0} is -1.
+  __gcc_v16qi a =
+    (__gcc_v16qi){-1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
+  __gcc_v16qi b =
+    (__gcc_v16qi){0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1};
+  __gcc_v16qi r = __builtin_ia32_pmaxub128(a, b);
+  __CPROVER_assert(r[0] == (char)-1 && r[15] == 16, "max epu8");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxub128.desc b/regression/cbmc-library/__builtin_ia32/pmaxub128.desc
new file mode 100644
index 00000000000..c0a91eb8daf
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxub128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pmaxub128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxud128.c b/regression/cbmc-library/__builtin_ia32/pmaxud128.c
new file mode 100644
index 00000000000..e2a59ca4915
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxud128.c
@@ -0,0 +1,14 @@
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+__gcc_v4si __builtin_ia32_pmaxud128(__gcc_v4si, __gcc_v4si);
+
+int main()
+{
+  // Lane 0 distinguishes unsigned from signed: -1 is 0xFFFFFFFF, the largest
+  // value under unsigned comparison, so the unsigned max of {-1, 0} is -1
+  // (a signed max would pick 0).
+  __gcc_v4si a = (__gcc_v4si){-1, 2, 3, 4};
+  __gcc_v4si b = (__gcc_v4si){0, 3, 2, 1};
+  __gcc_v4si r = __builtin_ia32_pmaxud128(a, b);
+  __CPROVER_assert(r[0] == -1 && r[3] == 4, "max epu32");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxud128.desc b/regression/cbmc-library/__builtin_ia32/pmaxud128.desc
new file mode 100644
index 00000000000..8e734356995
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxud128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pmaxud128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxud256.c b/regression/cbmc-library/__builtin_ia32/pmaxud256.c
new file mode 100644
index 00000000000..c5bec14a94f
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxud256.c
@@ -0,0 +1,12 @@
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+__gcc_v8si __builtin_ia32_pmaxud256(__gcc_v8si, __gcc_v8si);
+
+int main()
+{
+  // Lane 0: -1 is 0xFFFFFFFF, the unsigned max of {-1, 0}.
+  __gcc_v8si a = (__gcc_v8si){-1, 2, 3, 4, 5, 6, 7, 8};
+  __gcc_v8si b = (__gcc_v8si){0, 3, 2, 1, 0, 0, 0, 0};
+  __gcc_v8si r = __builtin_ia32_pmaxud256(a, b);
+  __CPROVER_assert(r[0] == -1 && r[3] == 4, "max epu32 (256)");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxud256.desc b/regression/cbmc-library/__builtin_ia32/pmaxud256.desc
new file mode 100644
index 00000000000..93b79986958
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxud256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pmaxud256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxuw128.c b/regression/cbmc-library/__builtin_ia32/pmaxuw128.c
new file mode 100644
index 00000000000..8b973b207e1
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxuw128.c
@@ -0,0 +1,13 @@
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_pmaxuw128(__gcc_v8hi, __gcc_v8hi);
+
+int main()
+{
+  // Lane 0 distinguishes unsigned from signed: -1 is 0xFFFF, the largest
+  // value under unsigned comparison, so the unsigned max of {-1, 0} is -1.
+  __gcc_v8hi a = (__gcc_v8hi){-1, 2, 3, 4, 5, 6, 7, 8};
+  __gcc_v8hi b = (__gcc_v8hi){0, 7, 6, 5, 4, 3, 2, 1};
+  __gcc_v8hi r = __builtin_ia32_pmaxuw128(a, b);
+  __CPROVER_assert(r[0] == -1 && r[7] == 8, "max epu16");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pmaxuw128.desc b/regression/cbmc-library/__builtin_ia32/pmaxuw128.desc
new file mode 100644
index 00000000000..4dd45fa9dad
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmaxuw128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pmaxuw128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pminsb128.c b/regression/cbmc-library/__builtin_ia32/pminsb128.c
new file mode 100644
index 00000000000..bd9bc28963c
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pminsb128.c
@@ -0,0 +1,15 @@
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_pminsb128(__gcc_v16qi, __gcc_v16qi);
+
+int main()
+{
+  __gcc_v16qi a = (__gcc_v16qi){
+    1, -2, 3, -4, 5, -6, 7, -8, 9, -10, 11, -12, 13, -14, 15, -16};
+  __gcc_v16qi b = (__gcc_v16qi){
+    -1, 2, -3, 4, -5, 6, -7, 8, -9, 10, -11, 12, -13, 14, -15, 16};
+  __gcc_v16qi r = __builtin_ia32_pminsb128(a, b);
+  // Compare as bytes: -1, -2 cast to char yield 0xFF, 0xFE on either
+  // signedness.
+  __CPROVER_assert(r[0] == (char)-1 && r[1] == (char)-2, "min epi8");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pminsb128.desc b/regression/cbmc-library/__builtin_ia32/pminsb128.desc
new file mode 100644
index 00000000000..0643d0b6179
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pminsb128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pminsb128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pminsd128.c b/regression/cbmc-library/__builtin_ia32/pminsd128.c
new file mode 100644
index 00000000000..0fda0a35a62
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pminsd128.c
@@ -0,0 +1,11 @@
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+__gcc_v4si __builtin_ia32_pminsd128(__gcc_v4si, __gcc_v4si);
+
+int main()
+{
+  __gcc_v4si a = (__gcc_v4si){1, -2, 3, -4};
+  __gcc_v4si b = (__gcc_v4si){-1, 2, -3, 4};
+  __gcc_v4si r = __builtin_ia32_pminsd128(a, b);
+  __CPROVER_assert(r[0] == -1 && r[1] == -2, "min epi32");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pminsd128.desc b/regression/cbmc-library/__builtin_ia32/pminsd128.desc
new file mode 100644
index 00000000000..9f5975a043f
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pminsd128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pminsd128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pminsw128.c b/regression/cbmc-library/__builtin_ia32/pminsw128.c
new file mode 100644
index 00000000000..fb474403b6d
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pminsw128.c
@@ -0,0 +1,11 @@
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_pminsw128(__gcc_v8hi, __gcc_v8hi);
+
+int main()
+{
+  __gcc_v8hi a = (__gcc_v8hi){1, -2, 3, -4, 5, -6, 7, -8};
+  __gcc_v8hi b = (__gcc_v8hi){-1, 2, -3, 4, -5, 6, -7, 8};
+  __gcc_v8hi r = __builtin_ia32_pminsw128(a, b);
+  __CPROVER_assert(r[0] == -1 && r[1] == -2, "min epi16");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pminsw128.desc b/regression/cbmc-library/__builtin_ia32/pminsw128.desc
new file mode 100644
index 00000000000..014e50d7944
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pminsw128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pminsw128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pminub128.c b/regression/cbmc-library/__builtin_ia32/pminub128.c
new file mode 100644
index 00000000000..c68ee28c946
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pminub128.c
@@ -0,0 +1,15 @@
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_pminub128(__gcc_v16qi, __gcc_v16qi);
+
+int main()
+{
+  // Lane 0 distinguishes unsigned from signed: -1 is 0xFF, the largest value
+  // under unsigned comparison, so the unsigned min of {-1, 0} is 0.
+  __gcc_v16qi a =
+    (__gcc_v16qi){-1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
+  __gcc_v16qi b =
+    (__gcc_v16qi){0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1};
+  __gcc_v16qi r = __builtin_ia32_pminub128(a, b);
+  __CPROVER_assert(r[0] == 0 && r[15] == 1, "min epu8");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pminub128.desc b/regression/cbmc-library/__builtin_ia32/pminub128.desc
new file mode 100644
index 00000000000..ed4bebc51b7
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pminub128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pminub128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pminud128.c b/regression/cbmc-library/__builtin_ia32/pminud128.c
new file mode 100644
index 00000000000..c97bc1a77f8
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pminud128.c
@@ -0,0 +1,14 @@
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+__gcc_v4si __builtin_ia32_pminud128(__gcc_v4si, __gcc_v4si);
+
+int main()
+{
+  // Lane 0 distinguishes unsigned from signed: -1 is 0xFFFFFFFF, the largest
+  // value under unsigned comparison, so the unsigned min of {-1, 0} is 0
+  // (a signed min would pick -1).
+  __gcc_v4si a = (__gcc_v4si){-1, 2, 3, 4};
+  __gcc_v4si b = (__gcc_v4si){0, 3, 2, 1};
+  __gcc_v4si r = __builtin_ia32_pminud128(a, b);
+  __CPROVER_assert(r[0] == 0 && r[3] == 1, "min epu32");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pminud128.desc b/regression/cbmc-library/__builtin_ia32/pminud128.desc
new file mode 100644
index 00000000000..2d44dc3dff0
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pminud128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pminud128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pminuw128.c b/regression/cbmc-library/__builtin_ia32/pminuw128.c
new file mode 100644
index 00000000000..e8e04258b50
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pminuw128.c
@@ -0,0 +1,13 @@
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_pminuw128(__gcc_v8hi, __gcc_v8hi);
+
+int main()
+{
+  // Lane 0 distinguishes unsigned from signed: -1 is 0xFFFF, the largest
+  // value under unsigned comparison, so the unsigned min of {-1, 0} is 0.
+  __gcc_v8hi a = (__gcc_v8hi){-1, 2, 3, 4, 5, 6, 7, 8};
+  __gcc_v8hi b = (__gcc_v8hi){0, 7, 6, 5, 4, 3, 2, 1};
+  __gcc_v8hi r = __builtin_ia32_pminuw128(a, b);
+  __CPROVER_assert(r[0] == 0 && r[7] == 1, "min epu16");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pminuw128.desc b/regression/cbmc-library/__builtin_ia32/pminuw128.desc
new file mode 100644
index 00000000000..b58b52930e7
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pminuw128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pminuw128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pmulld128.c b/regression/cbmc-library/__builtin_ia32/pmulld128.c
new file mode 100644
index 00000000000..a582362b705
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmulld128.c
@@ -0,0 +1,16 @@
+#include <limits.h>
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+__gcc_v4si __builtin_ia32_pmulld128(__gcc_v4si, __gcc_v4si);
+
+int main()
+{
+  // Lane 0 exercises two's-complement wraparound: INT_MAX * 2 keeps only the
+  // low 32 bits, 0xFFFFFFFE == -2. Run under --signed-overflow-check (see
+  // pmulld128.desc).
+  __gcc_v4si a = (__gcc_v4si){INT_MAX, 2, 3, 4};
+  __gcc_v4si b = (__gcc_v4si){2, 6, 7, 8};
+  __gcc_v4si r = __builtin_ia32_pmulld128(a, b);
+  __CPROVER_assert(r[0] == -2 && r[3] == 32, "mullo epi32");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pmulld128.desc b/regression/cbmc-library/__builtin_ia32/pmulld128.desc
new file mode 100644
index 00000000000..f0e3d66f8b1
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmulld128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pmulld128.c
+--signed-overflow-check
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pmullw128.c b/regression/cbmc-library/__builtin_ia32/pmullw128.c
new file mode 100644
index 00000000000..dab334b8aa6
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmullw128.c
@@ -0,0 +1,11 @@
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_pmullw128(__gcc_v8hi, __gcc_v8hi);
+
+int main()
+{
+  __gcc_v8hi a = (__gcc_v8hi){1, 2, 3, 4, 5, 6, 7, 8};
+  __gcc_v8hi b = (__gcc_v8hi){2, 3, 4, 5, 6, 7, 8, 9};
+  __gcc_v8hi r = __builtin_ia32_pmullw128(a, b);
+  __CPROVER_assert(r[0] == 2 && r[1] == 6 && r[7] == 72, "mullo epi16");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pmullw128.desc b/regression/cbmc-library/__builtin_ia32/pmullw128.desc
new file mode 100644
index 00000000000..a5c22e6c33b
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pmullw128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pmullw128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/por128.c b/regression/cbmc-library/__builtin_ia32/por128.c
new file mode 100644
index 00000000000..443d917d23e
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/por128.c
@@ -0,0 +1,14 @@
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+__gcc_v2di __builtin_ia32_por128(__gcc_v2di, __gcc_v2di);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native |) for all inputs.
+  __gcc_v2di a, b;
+  __gcc_v2di r = __builtin_ia32_por128(a, b);
+  __gcc_v2di ref = a | b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1], "__builtin_ia32_por128 == native |");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/por128.desc b/regression/cbmc-library/__builtin_ia32/por128.desc
new file mode 100644
index 00000000000..49e47ee5685
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/por128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+por128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/por256.c b/regression/cbmc-library/__builtin_ia32/por256.c
new file mode 100644
index 00000000000..de40cdb55d0
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/por256.c
@@ -0,0 +1,15 @@
+typedef long long __gcc_v4di __attribute__((__vector_size__(32)));
+__gcc_v4di __builtin_ia32_por256(__gcc_v4di, __gcc_v4di);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native |) for all inputs.
+  __gcc_v4di a, b;
+  __gcc_v4di r = __builtin_ia32_por256(a, b);
+  __gcc_v4di ref = a | b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3],
+    "__builtin_ia32_por256 == native |");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/por256.desc b/regression/cbmc-library/__builtin_ia32/por256.desc
new file mode 100644
index 00000000000..586bd591757
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/por256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+por256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psllwi128.c b/regression/cbmc-library/__builtin_ia32/psllwi128.c
new file mode 100644
index 00000000000..b6e38827275
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psllwi128.c
@@ -0,0 +1,11 @@
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_psllwi128(__gcc_v8hi, int);
+
+int main()
+{
+  __gcc_v8hi a = (__gcc_v8hi){1, 2, 3, 4, 5, 6, 7, 8};
+  __gcc_v8hi r = __builtin_ia32_psllwi128(a, 4);  // logical left by 4
+  __gcc_v8hi z = __builtin_ia32_psllwi128(a, 20); // count >= 16 -> 0
+  __CPROVER_assert(r[0] == 16 && r[1] == 32 && z[0] == 0, "slli epi16");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psllwi128.desc b/regression/cbmc-library/__builtin_ia32/psllwi128.desc
new file mode 100644
index 00000000000..1db4a839ec1
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psllwi128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psllwi128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psradi128.c b/regression/cbmc-library/__builtin_ia32/psradi128.c
new file mode 100644
index 00000000000..bcac8fcf821
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psradi128.c
@@ -0,0 +1,13 @@
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+__gcc_v4si __builtin_ia32_psradi128(__gcc_v4si, int);
+
+int main()
+{
+  __gcc_v4si a = (__gcc_v4si){-16, 8, -1, 4};
+  __gcc_v4si r = __builtin_ia32_psradi128(a, 2); // arithmetic right by 2
+  // count >= 32 -> sign fill: -16 -> -1, 8 -> 0
+  __gcc_v4si s = __builtin_ia32_psradi128(a, 40);
+  __CPROVER_assert(
+    r[0] == -4 && r[1] == 2 && s[0] == -1 && s[1] == 0, "srai epi32");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psradi128.desc b/regression/cbmc-library/__builtin_ia32/psradi128.desc
new file mode 100644
index 00000000000..55d32b101bd
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psradi128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psradi128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psrlwi128.c b/regression/cbmc-library/__builtin_ia32/psrlwi128.c
new file mode 100644
index 00000000000..12f4af8e8e3
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psrlwi128.c
@@ -0,0 +1,12 @@
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_psrlwi128(__gcc_v8hi, int);
+
+int main()
+{
+  // Logical (zero-fill) right shift: 0xFFFF (-1) >> 4 == 0x0FFF == 4095,
+  // distinguishing it from an arithmetic shift (which would give -1).
+  __gcc_v8hi a = (__gcc_v8hi){-1, 16, 3, 4, 5, 6, 7, 8};
+  __gcc_v8hi r = __builtin_ia32_psrlwi128(a, 4);
+  __CPROVER_assert(r[0] == 4095 && r[1] == 1, "srli epi16");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psrlwi128.desc b/regression/cbmc-library/__builtin_ia32/psrlwi128.desc
new file mode 100644
index 00000000000..3e27476f080
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psrlwi128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psrlwi128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psubb.c b/regression/cbmc-library/__builtin_ia32/psubb.c
new file mode 100644
index 00000000000..b3c09b7806e
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubb.c
@@ -0,0 +1,18 @@
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+__gcc_v8qi __builtin_ia32_psubb(__gcc_v8qi, __gcc_v8qi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native -) for all inputs.
+  __gcc_v8qi a, b;
+  __gcc_v8qi r = __builtin_ia32_psubb(a, b);
+  __gcc_v8qi_u ref = (__gcc_v8qi_u)a - (__gcc_v8qi_u)b;
+  __CPROVER_assert(
+    r[0] == (char)ref[0] && r[1] == (char)ref[1] && r[2] == (char)ref[2] &&
+      r[3] == (char)ref[3] && r[4] == (char)ref[4] && r[5] == (char)ref[5] &&
+      r[6] == (char)ref[6] && r[7] == (char)ref[7],
+    "__builtin_ia32_psubb == native -");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psubb.desc b/regression/cbmc-library/__builtin_ia32/psubb.desc
new file mode 100644
index 00000000000..71728a90042
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubb.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psubb.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psubb128.c b/regression/cbmc-library/__builtin_ia32/psubb128.c
new file mode 100644
index 00000000000..fe94fef8cd1
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubb128.c
@@ -0,0 +1,22 @@
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_psubb128(__gcc_v16qi, __gcc_v16qi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native -) for all inputs.
+  __gcc_v16qi a, b;
+  __gcc_v16qi r = __builtin_ia32_psubb128(a, b);
+  __gcc_v16qi_u ref = (__gcc_v16qi_u)a - (__gcc_v16qi_u)b;
+  __CPROVER_assert(
+    r[0] == (char)ref[0] && r[1] == (char)ref[1] && r[2] == (char)ref[2] &&
+      r[3] == (char)ref[3] && r[4] == (char)ref[4] && r[5] == (char)ref[5] &&
+      r[6] == (char)ref[6] && r[7] == (char)ref[7] && r[8] == (char)ref[8] &&
+      r[9] == (char)ref[9] && r[10] == (char)ref[10] &&
+      r[11] == (char)ref[11] && r[12] == (char)ref[12] &&
+      r[13] == (char)ref[13] && r[14] == (char)ref[14] &&
+      r[15] == (char)ref[15],
+    "__builtin_ia32_psubb128 == native -");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psubb128.desc b/regression/cbmc-library/__builtin_ia32/psubb128.desc
new file mode 100644
index 00000000000..40976fde992
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubb128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psubb128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psubb256.c b/regression/cbmc-library/__builtin_ia32/psubb256.c
new file mode 100644
index 00000000000..bbfc355e285
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubb256.c
@@ -0,0 +1,30 @@
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef unsigned char __gcc_v32qi_u __attribute__((__vector_size__(32)));
+__gcc_v32qi __builtin_ia32_psubb256(__gcc_v32qi, __gcc_v32qi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native -) for all inputs.
+  __gcc_v32qi a, b;
+  __gcc_v32qi r = __builtin_ia32_psubb256(a, b);
+  __gcc_v32qi_u ref = (__gcc_v32qi_u)a - (__gcc_v32qi_u)b;
+  __CPROVER_assert(
+    r[0] == (char)ref[0] && r[1] == (char)ref[1] && r[2] == (char)ref[2] &&
+      r[3] == (char)ref[3] && r[4] == (char)ref[4] && r[5] == (char)ref[5] &&
+      r[6] == (char)ref[6] && r[7] == (char)ref[7] && r[8] == (char)ref[8] &&
+      r[9] == (char)ref[9] && r[10] == (char)ref[10] &&
+      r[11] == (char)ref[11] && r[12] == (char)ref[12] &&
+      r[13] == (char)ref[13] && r[14] == (char)ref[14] &&
+      r[15] == (char)ref[15] && r[16] == (char)ref[16] &&
+      r[17] == (char)ref[17] && r[18] == (char)ref[18] &&
+      r[19] == (char)ref[19] && r[20] == (char)ref[20] &&
+      r[21] == (char)ref[21] && r[22] == (char)ref[22] &&
+      r[23] == (char)ref[23] && r[24] == (char)ref[24] &&
+      r[25] == (char)ref[25] && r[26] == (char)ref[26] &&
+      r[27] == (char)ref[27] && r[28] == (char)ref[28] &&
+      r[29] == (char)ref[29] && r[30] == (char)ref[30] &&
+      r[31] == (char)ref[31],
+    "__builtin_ia32_psubb256 == native -");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psubb256.desc b/regression/cbmc-library/__builtin_ia32/psubb256.desc
new file mode 100644
index 00000000000..a13661f6b8f
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubb256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psubb256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psubd.c b/regression/cbmc-library/__builtin_ia32/psubd.c
new file mode 100644
index 00000000000..3e223f870c9
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubd.c
@@ -0,0 +1,16 @@
+typedef int __gcc_v2si __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+__gcc_v2si __builtin_ia32_psubd(__gcc_v2si, __gcc_v2si);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native -) for all inputs.
+  __gcc_v2si a, b;
+  __gcc_v2si r = __builtin_ia32_psubd(a, b);
+  __gcc_v2si_u ref = (__gcc_v2si_u)a - (__gcc_v2si_u)b;
+  __CPROVER_assert(
+    r[0] == (int)ref[0] && r[1] == (int)ref[1],
+    "__builtin_ia32_psubd == native -");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psubd.desc b/regression/cbmc-library/__builtin_ia32/psubd.desc
new file mode 100644
index 00000000000..3b628b4af52
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubd.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psubd.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psubd128.c b/regression/cbmc-library/__builtin_ia32/psubd128.c
new file mode 100644
index 00000000000..630d8101ac5
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubd128.c
@@ -0,0 +1,17 @@
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+__gcc_v4si __builtin_ia32_psubd128(__gcc_v4si, __gcc_v4si);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native -) for all inputs.
+  __gcc_v4si a, b;
+  __gcc_v4si r = __builtin_ia32_psubd128(a, b);
+  __gcc_v4si_u ref = (__gcc_v4si_u)a - (__gcc_v4si_u)b;
+  __CPROVER_assert(
+    r[0] == (int)ref[0] && r[1] == (int)ref[1] && r[2] == (int)ref[2] &&
+      r[3] == (int)ref[3],
+    "__builtin_ia32_psubd128 == native -");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psubd128.desc b/regression/cbmc-library/__builtin_ia32/psubd128.desc
new file mode 100644
index 00000000000..1ca34f34565
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubd128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psubd128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psubd256.c b/regression/cbmc-library/__builtin_ia32/psubd256.c
new file mode 100644
index 00000000000..001597b2bd2
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubd256.c
@@ -0,0 +1,18 @@
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+__gcc_v8si __builtin_ia32_psubd256(__gcc_v8si, __gcc_v8si);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native -) for all inputs.
+  __gcc_v8si a, b;
+  __gcc_v8si r = __builtin_ia32_psubd256(a, b);
+  __gcc_v8si_u ref = (__gcc_v8si_u)a - (__gcc_v8si_u)b;
+  __CPROVER_assert(
+    r[0] == (int)ref[0] && r[1] == (int)ref[1] && r[2] == (int)ref[2] &&
+      r[3] == (int)ref[3] && r[4] == (int)ref[4] && r[5] == (int)ref[5] &&
+      r[6] == (int)ref[6] && r[7] == (int)ref[7],
+    "__builtin_ia32_psubd256 == native -");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psubd256.desc b/regression/cbmc-library/__builtin_ia32/psubd256.desc
new file mode 100644
index 00000000000..b0a24982bb3
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubd256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psubd256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psubq128.c b/regression/cbmc-library/__builtin_ia32/psubq128.c
new file mode 100644
index 00000000000..aec057f6561
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubq128.c
@@ -0,0 +1,16 @@
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+__gcc_v2di __builtin_ia32_psubq128(__gcc_v2di, __gcc_v2di);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native -) for all inputs.
+  __gcc_v2di a, b;
+  __gcc_v2di r = __builtin_ia32_psubq128(a, b);
+  __gcc_v2di_u ref = (__gcc_v2di_u)a - (__gcc_v2di_u)b;
+  __CPROVER_assert(
+    r[0] == (long long)ref[0] && r[1] == (long long)ref[1],
+    "__builtin_ia32_psubq128 == native -");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psubq128.desc b/regression/cbmc-library/__builtin_ia32/psubq128.desc
new file mode 100644
index 00000000000..aa61b3b8f3c
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubq128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psubq128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psubq256.c b/regression/cbmc-library/__builtin_ia32/psubq256.c
new file mode 100644
index 00000000000..6474480d969
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubq256.c
@@ -0,0 +1,17 @@
+typedef long long __gcc_v4di __attribute__((__vector_size__(32)));
+typedef unsigned long long __gcc_v4di_u __attribute__((__vector_size__(32)));
+__gcc_v4di __builtin_ia32_psubq256(__gcc_v4di, __gcc_v4di);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native -) for all inputs.
+  __gcc_v4di a, b;
+  __gcc_v4di r = __builtin_ia32_psubq256(a, b);
+  __gcc_v4di_u ref = (__gcc_v4di_u)a - (__gcc_v4di_u)b;
+  __CPROVER_assert(
+    r[0] == (long long)ref[0] && r[1] == (long long)ref[1] &&
+      r[2] == (long long)ref[2] && r[3] == (long long)ref[3],
+    "__builtin_ia32_psubq256 == native -");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psubq256.desc b/regression/cbmc-library/__builtin_ia32/psubq256.desc
new file mode 100644
index 00000000000..230f733b7be
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubq256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psubq256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psubusb128.c b/regression/cbmc-library/__builtin_ia32/psubusb128.c
new file mode 100644
index 00000000000..f8458eefb2b
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubusb128.c
@@ -0,0 +1,14 @@
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+__gcc_v16qi __builtin_ia32_psubusb128(__gcc_v16qi, __gcc_v16qi);
+
+int main()
+{
+  // Unsigned saturating subtract on bytes: 10-20 clamps to 0; 200-100 == 100.
+  __gcc_v16qi a =
+    (__gcc_v16qi){10, 200, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
+  __gcc_v16qi b =
+    (__gcc_v16qi){20, 100, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+  __gcc_v16qi r = __builtin_ia32_psubusb128(a, b);
+  __CPROVER_assert(r[0] == 0 && r[1] == 100, "subs epu8");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psubusb128.desc b/regression/cbmc-library/__builtin_ia32/psubusb128.desc
new file mode 100644
index 00000000000..0f8d81fef34
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubusb128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psubusb128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psubw.c b/regression/cbmc-library/__builtin_ia32/psubw.c
new file mode 100644
index 00000000000..a81d07b56ff
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubw.c
@@ -0,0 +1,17 @@
+typedef short __gcc_v4hi __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+__gcc_v4hi __builtin_ia32_psubw(__gcc_v4hi, __gcc_v4hi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native -) for all inputs.
+  __gcc_v4hi a, b;
+  __gcc_v4hi r = __builtin_ia32_psubw(a, b);
+  __gcc_v4hi_u ref = (__gcc_v4hi_u)a - (__gcc_v4hi_u)b;
+  __CPROVER_assert(
+    r[0] == (short)ref[0] && r[1] == (short)ref[1] && r[2] == (short)ref[2] &&
+      r[3] == (short)ref[3],
+    "__builtin_ia32_psubw == native -");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psubw.desc b/regression/cbmc-library/__builtin_ia32/psubw.desc
new file mode 100644
index 00000000000..f6a893ff5e2
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubw.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psubw.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psubw128.c b/regression/cbmc-library/__builtin_ia32/psubw128.c
new file mode 100644
index 00000000000..1a0f193610f
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubw128.c
@@ -0,0 +1,18 @@
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+__gcc_v8hi __builtin_ia32_psubw128(__gcc_v8hi, __gcc_v8hi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native -) for all inputs.
+  __gcc_v8hi a, b;
+  __gcc_v8hi r = __builtin_ia32_psubw128(a, b);
+  __gcc_v8hi_u ref = (__gcc_v8hi_u)a - (__gcc_v8hi_u)b;
+  __CPROVER_assert(
+    r[0] == (short)ref[0] && r[1] == (short)ref[1] && r[2] == (short)ref[2] &&
+      r[3] == (short)ref[3] && r[4] == (short)ref[4] && r[5] == (short)ref[5] &&
+      r[6] == (short)ref[6] && r[7] == (short)ref[7],
+    "__builtin_ia32_psubw128 == native -");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psubw128.desc b/regression/cbmc-library/__builtin_ia32/psubw128.desc
new file mode 100644
index 00000000000..ac9681939ac
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubw128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psubw128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/psubw256.c b/regression/cbmc-library/__builtin_ia32/psubw256.c
new file mode 100644
index 00000000000..d289f217475
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubw256.c
@@ -0,0 +1,22 @@
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+__gcc_v16hi __builtin_ia32_psubw256(__gcc_v16hi, __gcc_v16hi);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native -) for all inputs.
+  __gcc_v16hi a, b;
+  __gcc_v16hi r = __builtin_ia32_psubw256(a, b);
+  __gcc_v16hi_u ref = (__gcc_v16hi_u)a - (__gcc_v16hi_u)b;
+  __CPROVER_assert(
+    r[0] == (short)ref[0] && r[1] == (short)ref[1] && r[2] == (short)ref[2] &&
+      r[3] == (short)ref[3] && r[4] == (short)ref[4] && r[5] == (short)ref[5] &&
+      r[6] == (short)ref[6] && r[7] == (short)ref[7] && r[8] == (short)ref[8] &&
+      r[9] == (short)ref[9] && r[10] == (short)ref[10] &&
+      r[11] == (short)ref[11] && r[12] == (short)ref[12] &&
+      r[13] == (short)ref[13] && r[14] == (short)ref[14] &&
+      r[15] == (short)ref[15],
+    "__builtin_ia32_psubw256 == native -");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/psubw256.desc b/regression/cbmc-library/__builtin_ia32/psubw256.desc
new file mode 100644
index 00000000000..e8600412950
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/psubw256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+psubw256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pxor128.c b/regression/cbmc-library/__builtin_ia32/pxor128.c
new file mode 100644
index 00000000000..43fcce31ddc
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pxor128.c
@@ -0,0 +1,14 @@
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+__gcc_v2di __builtin_ia32_pxor128(__gcc_v2di, __gcc_v2di);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native ^) for all inputs.
+  __gcc_v2di a, b;
+  __gcc_v2di r = __builtin_ia32_pxor128(a, b);
+  __gcc_v2di ref = a ^ b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1], "__builtin_ia32_pxor128 == native ^");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pxor128.desc b/regression/cbmc-library/__builtin_ia32/pxor128.desc
new file mode 100644
index 00000000000..1fb6c77d18a
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pxor128.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pxor128.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32/pxor256.c b/regression/cbmc-library/__builtin_ia32/pxor256.c
new file mode 100644
index 00000000000..d4171685a1f
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pxor256.c
@@ -0,0 +1,15 @@
+typedef long long __gcc_v4di __attribute__((__vector_size__(32)));
+__gcc_v4di __builtin_ia32_pxor256(__gcc_v4di, __gcc_v4di);
+
+int main()
+{
+  // Exhaustive equivalence: the model must agree with CBMC's own
+  // vector semantics (native ^) for all inputs.
+  __gcc_v4di a, b;
+  __gcc_v4di r = __builtin_ia32_pxor256(a, b);
+  __gcc_v4di ref = a ^ b;
+  __CPROVER_assert(
+    r[0] == ref[0] && r[1] == ref[1] && r[2] == ref[2] && r[3] == ref[3],
+    "__builtin_ia32_pxor256 == native ^");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_ia32/pxor256.desc b/regression/cbmc-library/__builtin_ia32/pxor256.desc
new file mode 100644
index 00000000000..4144dd8d1da
--- /dev/null
+++ b/regression/cbmc-library/__builtin_ia32/pxor256.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+pxor256.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_ia32_sfence/main.c b/regression/cbmc-library/__builtin_ia32/sfence.c
similarity index 100%
rename from regression/cbmc-library/__builtin_ia32_sfence/main.c
rename to regression/cbmc-library/__builtin_ia32/sfence.c
diff --git a/regression/cbmc-library/__builtin_ia32_sfence/test.desc b/regression/cbmc-library/__builtin_ia32/sfence.desc
similarity index 92%
rename from regression/cbmc-library/__builtin_ia32_sfence/test.desc
rename to regression/cbmc-library/__builtin_ia32/sfence.desc
index 9542d988e8d..325b8348c21 100644
--- a/regression/cbmc-library/__builtin_ia32_sfence/test.desc
+++ b/regression/cbmc-library/__builtin_ia32/sfence.desc
@@ -1,5 +1,5 @@
 KNOWNBUG
-main.c
+sfence.c
 --pointer-check --bounds-check
 ^EXIT=0$
 ^SIGNAL=0$
diff --git a/regression/cbmc-library/__builtin_neon/vabdq_v.c b/regression/cbmc-library/__builtin_neon/vabdq_v.c
new file mode 100644
index 00000000000..0daa95aaff0
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vabdq_v.c
@@ -0,0 +1,14 @@
+// The NEON builtin is declared by the front-end (gcc_builtin_headers_aarch64.h)
+// under an AArch64 target, and its body comes from the cprover library model in
+// src/ansi-c/library/arm_neon.c.  The absolute difference of any vector with
+// itself is zero, regardless of the lane interpretation (type code 32 = s8).
+typedef char v16qi __attribute__((vector_size(16)));
+
+int main()
+{
+  v16qi a;
+  v16qi r = __builtin_neon_vabdq_v(a, a, 32);
+  for(int i = 0; i < 16; i++)
+    __CPROVER_assert(r[i] == 0, "vabdq of equal vectors is zero");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_neon/vabdq_v.desc b/regression/cbmc-library/__builtin_neon/vabdq_v.desc
new file mode 100644
index 00000000000..a5952cb5d38
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vabdq_v.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+vabdq_v.c
+--arch arm64
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_neon/vbslq_v.c b/regression/cbmc-library/__builtin_neon/vbslq_v.c
new file mode 100644
index 00000000000..4e61e7e18b4
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vbslq_v.c
@@ -0,0 +1,16 @@
+// Generated model for the bitwise-select builtin (mnemonic BSL). The operation
+// is bit-level, so it is independent of the lane type code: each result bit
+// comes from a where the mask bit is set, otherwise from b.
+typedef signed char v16 __attribute__((vector_size(16)));
+typedef char v16qi __attribute__((vector_size(16)));
+
+int main()
+{
+  v16 mask, a, b;
+  v16 r = (v16)__builtin_neon_vbslq_v((v16qi)mask, (v16qi)a, (v16qi)b, 32);
+  for(int i = 0; i < 16; i++)
+    __CPROVER_assert(
+      r[i] == (signed char)((mask[i] & a[i]) | (~mask[i] & b[i])),
+      "vbslq selects bits by mask");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_neon/vbslq_v.desc b/regression/cbmc-library/__builtin_neon/vbslq_v.desc
new file mode 100644
index 00000000000..e371f4d9e7e
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vbslq_v.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+vbslq_v.c
+--arch arm64
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_neon/vhaddq_v.c b/regression/cbmc-library/__builtin_neon/vhaddq_v.c
new file mode 100644
index 00000000000..7e3a4760864
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vhaddq_v.c
@@ -0,0 +1,15 @@
+// Generated model for the halving-add builtin (mnemonic UHADD); type code 48
+// selects the uint8x16 interpretation. Check it against floor((a+b)/2).
+typedef unsigned char v16 __attribute__((vector_size(16)));
+typedef char v16qi __attribute__((vector_size(16)));
+
+int main()
+{
+  v16 a, b;
+  v16 r = (v16)__builtin_neon_vhaddq_v((v16qi)a, (v16qi)b, 48);
+  for(int i = 0; i < 16; i++)
+    __CPROVER_assert(
+      r[i] == (unsigned char)(((int)a[i] + (int)b[i]) >> 1),
+      "vhaddq_u8 == floor((a+b)/2)");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_neon/vhaddq_v.desc b/regression/cbmc-library/__builtin_neon/vhaddq_v.desc
new file mode 100644
index 00000000000..3da17f4008f
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vhaddq_v.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+vhaddq_v.c
+--arch arm64
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_neon/vmaxq_v.c b/regression/cbmc-library/__builtin_neon/vmaxq_v.c
new file mode 100644
index 00000000000..466efc00b81
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vmaxq_v.c
@@ -0,0 +1,14 @@
+// The model for __builtin_neon_vmaxq_v is generated by
+// scripts/generate_neon_models.py from arm_neon.td. Type code 32 selects the
+// int8x16 lane interpretation; verify the model agrees with a per-lane max.
+typedef signed char v16 __attribute__((vector_size(16)));
+typedef char v16qi __attribute__((vector_size(16)));
+
+int main()
+{
+  v16 a, b;
+  v16 r = (v16)__builtin_neon_vmaxq_v((v16qi)a, (v16qi)b, 32);
+  for(int i = 0; i < 16; i++)
+    __CPROVER_assert(r[i] == (a[i] > b[i] ? a[i] : b[i]), "vmaxq_s8 == max");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_neon/vmaxq_v.desc b/regression/cbmc-library/__builtin_neon/vmaxq_v.desc
new file mode 100644
index 00000000000..dcfd9328239
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vmaxq_v.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+vmaxq_v.c
+--arch arm64
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_neon/vpmaxq_v.c b/regression/cbmc-library/__builtin_neon/vpmaxq_v.c
new file mode 100644
index 00000000000..0f54fa7b3f4
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vpmaxq_v.c
@@ -0,0 +1,19 @@
+// Generated model for the pairwise-maximum builtin (mnemonic SMAXP); type code
+// 32 selects int8x16. The result is the pairwise maxima of a followed by those
+// of b -- exercises the reshaping code path.
+typedef signed char v16 __attribute__((vector_size(16)));
+typedef char v16qi __attribute__((vector_size(16)));
+
+int main()
+{
+  v16 a, b;
+  v16 r = (v16)__builtin_neon_vpmaxq_v((v16qi)a, (v16qi)b, 32);
+  for(int i = 0; i < 8; i++)
+  {
+    signed char ea = a[2 * i] > a[2 * i + 1] ? a[2 * i] : a[2 * i + 1];
+    signed char eb = b[2 * i] > b[2 * i + 1] ? b[2 * i] : b[2 * i + 1];
+    __CPROVER_assert(r[i] == ea, "vpmaxq_s8 lower half from a");
+    __CPROVER_assert(r[8 + i] == eb, "vpmaxq_s8 upper half from b");
+  }
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_neon/vpmaxq_v.desc b/regression/cbmc-library/__builtin_neon/vpmaxq_v.desc
new file mode 100644
index 00000000000..9d00f8df904
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vpmaxq_v.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+vpmaxq_v.c
+--arch arm64
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_neon/vqaddq_v.c b/regression/cbmc-library/__builtin_neon/vqaddq_v.c
new file mode 100644
index 00000000000..6898e2ac9b7
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vqaddq_v.c
@@ -0,0 +1,17 @@
+// Generated model for the saturating-add builtin (mnemonic SQADD); type code
+// 32 selects the int8x16 interpretation. Check it against a clamped reference.
+typedef signed char v16 __attribute__((vector_size(16)));
+typedef char v16qi __attribute__((vector_size(16)));
+
+int main()
+{
+  v16 a, b;
+  v16 r = (v16)__builtin_neon_vqaddq_v((v16qi)a, (v16qi)b, 32);
+  for(int i = 0; i < 16; i++)
+  {
+    int s = (int)a[i] + (int)b[i];
+    int ref = s < -128 ? -128 : (s > 127 ? 127 : s);
+    __CPROVER_assert(r[i] == ref, "vqaddq_s8 saturates");
+  }
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_neon/vqaddq_v.desc b/regression/cbmc-library/__builtin_neon/vqaddq_v.desc
new file mode 100644
index 00000000000..61d389d75fb
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vqaddq_v.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+vqaddq_v.c
+--arch arm64
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc-library/__builtin_neon/vtstq_v.c b/regression/cbmc-library/__builtin_neon/vtstq_v.c
new file mode 100644
index 00000000000..4a20457837b
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vtstq_v.c
@@ -0,0 +1,14 @@
+// Generated model for the test-bits builtin (mnemonic CMTST); type code 32
+// selects int8x16. Each lane is all-ones where (a & b) is non-zero.
+typedef signed char v16 __attribute__((vector_size(16)));
+typedef char v16qi __attribute__((vector_size(16)));
+
+int main()
+{
+  v16 a, b;
+  v16 r = (v16)__builtin_neon_vtstq_v((v16qi)a, (v16qi)b, 32);
+  for(int i = 0; i < 16; i++)
+    __CPROVER_assert(
+      r[i] == ((a[i] & b[i]) != 0 ? -1 : 0), "vtstq_s8 sets lanes on bit test");
+  return 0;
+}
diff --git a/regression/cbmc-library/__builtin_neon/vtstq_v.desc b/regression/cbmc-library/__builtin_neon/vtstq_v.desc
new file mode 100644
index 00000000000..9085a3026e3
--- /dev/null
+++ b/regression/cbmc-library/__builtin_neon/vtstq_v.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+vtstq_v.c
+--arch arm64
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc/SIMD_ia32_models/main.c b/regression/cbmc/SIMD_ia32_models/main.c
new file mode 100644
index 00000000000..e6f304c54f7
--- /dev/null
+++ b/regression/cbmc/SIMD_ia32_models/main.c
@@ -0,0 +1,1014 @@
+// Auto-generated by scripts/generate_simd_smoke_test.py
+// Exercises every modelled SIMD builtin once so the library models are
+// type-checked, linked and symex'd. See doc/neon-intrinsic-models.md.
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef int __gcc_v16si __attribute__((__vector_size__(64)));
+typedef unsigned int __gcc_v16si_u __attribute__((__vector_size__(64)));
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+typedef short __gcc_v32hi __attribute__((__vector_size__(64)));
+typedef unsigned short __gcc_v32hi_u __attribute__((__vector_size__(64)));
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef signed char __gcc_v32qi_s __attribute__((__vector_size__(32)));
+typedef unsigned char __gcc_v32qi_u __attribute__((__vector_size__(32)));
+typedef long long __gcc_v4di __attribute__((__vector_size__(32)));
+typedef unsigned long long __gcc_v4di_u __attribute__((__vector_size__(32)));
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+typedef char __gcc_v64qi __attribute__((__vector_size__(64)));
+typedef signed char __gcc_v64qi_s __attribute__((__vector_size__(64)));
+typedef unsigned char __gcc_v64qi_u __attribute__((__vector_size__(64)));
+typedef long long __gcc_v8di __attribute__((__vector_size__(64)));
+typedef unsigned long long __gcc_v8di_u __attribute__((__vector_size__(64)));
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+
+int main(void)
+{
+  {
+    __gcc_v32qi a0 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_pabsb256(a0);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_pabsw256(a0);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    __gcc_v16qi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16qi r = __builtin_ia32_paddb128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    __gcc_v32qi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_paddb256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v64qi a0 = {0};
+    __gcc_v64qi a1 = {0};
+    __gcc_v64qi a2 = {0};
+    unsigned long long a3 = {0};
+    volatile __gcc_v64qi r = __builtin_ia32_paddb512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    __gcc_v8si a1 = {0};
+    __gcc_v8si a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_paddd256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v2di a0 = {0};
+    __gcc_v2di a1 = {0};
+    __gcc_v2di a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v2di r = __builtin_ia32_paddq128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v4di a0 = {0};
+    __gcc_v4di a1 = {0};
+    __gcc_v4di a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v4di r = __builtin_ia32_paddq256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8di a0 = {0};
+    __gcc_v8di a1 = {0};
+    __gcc_v8di a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8di r = __builtin_ia32_paddq512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    __gcc_v16qi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16qi r = __builtin_ia32_paddsb128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_paddsb256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    __gcc_v32qi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_paddsb256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v64qi a0 = {0};
+    __gcc_v64qi a1 = {0};
+    __gcc_v64qi a2 = {0};
+    unsigned long long a3 = {0};
+    volatile __gcc_v64qi r = __builtin_ia32_paddsb512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    __gcc_v8hi a1 = {0};
+    __gcc_v8hi a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_paddsw128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_paddsw256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    __gcc_v16hi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_paddsw256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32hi a0 = {0};
+    __gcc_v32hi a1 = {0};
+    __gcc_v32hi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32hi r = __builtin_ia32_paddsw512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    __gcc_v16qi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16qi r = __builtin_ia32_paddusb128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_paddusb256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    __gcc_v32qi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_paddusb256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v64qi a0 = {0};
+    __gcc_v64qi a1 = {0};
+    __gcc_v64qi a2 = {0};
+    unsigned long long a3 = {0};
+    volatile __gcc_v64qi r = __builtin_ia32_paddusb512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    __gcc_v8hi a1 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_paddusw128(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    __gcc_v8hi a1 = {0};
+    __gcc_v8hi a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_paddusw128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_paddusw256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    __gcc_v16hi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_paddusw256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32hi a0 = {0};
+    __gcc_v32hi a1 = {0};
+    __gcc_v32hi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32hi r = __builtin_ia32_paddusw512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    __gcc_v8hi a1 = {0};
+    __gcc_v8hi a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_paddw128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    __gcc_v16hi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_paddw256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32hi a0 = {0};
+    __gcc_v32hi a1 = {0};
+    __gcc_v32hi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32hi r = __builtin_ia32_paddw512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    __gcc_v16qi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16qi r = __builtin_ia32_pavgb128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_pavgb256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    __gcc_v32qi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_pavgb256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v64qi a0 = {0};
+    __gcc_v64qi a1 = {0};
+    __gcc_v64qi a2 = {0};
+    unsigned long long a3 = {0};
+    volatile __gcc_v64qi r = __builtin_ia32_pavgb512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    __gcc_v8hi a1 = {0};
+    __gcc_v8hi a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_pavgw128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_pavgw256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    __gcc_v16hi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_pavgw256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32hi a0 = {0};
+    __gcc_v32hi a1 = {0};
+    __gcc_v32hi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32hi r = __builtin_ia32_pavgw512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    __gcc_v16qi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16qi r = __builtin_ia32_pmaxsb128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_pmaxsb256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    __gcc_v32qi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_pmaxsb256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v64qi a0 = {0};
+    __gcc_v64qi a1 = {0};
+    __gcc_v64qi a2 = {0};
+    unsigned long long a3 = {0};
+    volatile __gcc_v64qi r = __builtin_ia32_pmaxsb512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v4si a0 = {0};
+    __gcc_v4si a1 = {0};
+    __gcc_v4si a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v4si r = __builtin_ia32_pmaxsd128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    __gcc_v8si a1 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_pmaxsd256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    __gcc_v8si a1 = {0};
+    __gcc_v8si a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_pmaxsd256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16si a0 = {0};
+    __gcc_v16si a1 = {0};
+    __gcc_v16si a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16si r = __builtin_ia32_pmaxsd512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    __gcc_v8hi a1 = {0};
+    __gcc_v8hi a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_pmaxsw128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_pmaxsw256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    __gcc_v16hi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_pmaxsw256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32hi a0 = {0};
+    __gcc_v32hi a1 = {0};
+    __gcc_v32hi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32hi r = __builtin_ia32_pmaxsw512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    __gcc_v16qi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16qi r = __builtin_ia32_pmaxub128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_pmaxub256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    __gcc_v32qi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_pmaxub256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v64qi a0 = {0};
+    __gcc_v64qi a1 = {0};
+    __gcc_v64qi a2 = {0};
+    unsigned long long a3 = {0};
+    volatile __gcc_v64qi r = __builtin_ia32_pmaxub512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v4si a0 = {0};
+    __gcc_v4si a1 = {0};
+    __gcc_v4si a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v4si r = __builtin_ia32_pmaxud128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    __gcc_v8si a1 = {0};
+    __gcc_v8si a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_pmaxud256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16si a0 = {0};
+    __gcc_v16si a1 = {0};
+    __gcc_v16si a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16si r = __builtin_ia32_pmaxud512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    __gcc_v8hi a1 = {0};
+    __gcc_v8hi a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_pmaxuw128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_pmaxuw256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    __gcc_v16hi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_pmaxuw256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32hi a0 = {0};
+    __gcc_v32hi a1 = {0};
+    __gcc_v32hi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32hi r = __builtin_ia32_pmaxuw512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    __gcc_v16qi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16qi r = __builtin_ia32_pminsb128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_pminsb256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    __gcc_v32qi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_pminsb256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v64qi a0 = {0};
+    __gcc_v64qi a1 = {0};
+    __gcc_v64qi a2 = {0};
+    unsigned long long a3 = {0};
+    volatile __gcc_v64qi r = __builtin_ia32_pminsb512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v4si a0 = {0};
+    __gcc_v4si a1 = {0};
+    __gcc_v4si a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v4si r = __builtin_ia32_pminsd128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    __gcc_v8si a1 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_pminsd256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    __gcc_v8si a1 = {0};
+    __gcc_v8si a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_pminsd256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16si a0 = {0};
+    __gcc_v16si a1 = {0};
+    __gcc_v16si a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16si r = __builtin_ia32_pminsd512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    __gcc_v8hi a1 = {0};
+    __gcc_v8hi a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_pminsw128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_pminsw256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    __gcc_v16hi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_pminsw256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32hi a0 = {0};
+    __gcc_v32hi a1 = {0};
+    __gcc_v32hi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32hi r = __builtin_ia32_pminsw512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    __gcc_v16qi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16qi r = __builtin_ia32_pminub128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_pminub256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    __gcc_v32qi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_pminub256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v64qi a0 = {0};
+    __gcc_v64qi a1 = {0};
+    __gcc_v64qi a2 = {0};
+    unsigned long long a3 = {0};
+    volatile __gcc_v64qi r = __builtin_ia32_pminub512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v4si a0 = {0};
+    __gcc_v4si a1 = {0};
+    __gcc_v4si a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v4si r = __builtin_ia32_pminud128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    __gcc_v8si a1 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_pminud256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    __gcc_v8si a1 = {0};
+    __gcc_v8si a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_pminud256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16si a0 = {0};
+    __gcc_v16si a1 = {0};
+    __gcc_v16si a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16si r = __builtin_ia32_pminud512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    __gcc_v8hi a1 = {0};
+    __gcc_v8hi a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_pminuw128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_pminuw256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    __gcc_v16hi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_pminuw256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32hi a0 = {0};
+    __gcc_v32hi a1 = {0};
+    __gcc_v32hi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32hi r = __builtin_ia32_pminuw512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v4si a0 = {0};
+    __gcc_v4si a1 = {0};
+    __gcc_v4si a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v4si r = __builtin_ia32_pmulld128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    __gcc_v8si a1 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_pmulld256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    __gcc_v8si a1 = {0};
+    __gcc_v8si a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_pmulld256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16si a0 = {0};
+    __gcc_v16si a1 = {0};
+    __gcc_v16si a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16si r = __builtin_ia32_pmulld512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    __gcc_v8hi a1 = {0};
+    __gcc_v8hi a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_pmullw128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_pmullw256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    __gcc_v16hi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_pmullw256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32hi a0 = {0};
+    __gcc_v32hi a1 = {0};
+    __gcc_v32hi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32hi r = __builtin_ia32_pmullw512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v4si a0 = {0};
+    volatile __gcc_v4si r = __builtin_ia32_pslldi128(a0, 1);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_pslldi256(a0, 1);
+    (void)r;
+  }
+  {
+    __gcc_v2di a0 = {0};
+    volatile __gcc_v2di r = __builtin_ia32_psllqi128(a0, 1);
+    (void)r;
+  }
+  {
+    __gcc_v4di a0 = {0};
+    volatile __gcc_v4di r = __builtin_ia32_psllqi256(a0, 1);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_psllwi256(a0, 1);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_psradi256(a0, 1);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_psrawi128(a0, 1);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_psrawi256(a0, 1);
+    (void)r;
+  }
+  {
+    __gcc_v4si a0 = {0};
+    volatile __gcc_v4si r = __builtin_ia32_psrldi128(a0, 1);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_psrldi256(a0, 1);
+    (void)r;
+  }
+  {
+    __gcc_v2di a0 = {0};
+    volatile __gcc_v2di r = __builtin_ia32_psrlqi128(a0, 1);
+    (void)r;
+  }
+  {
+    __gcc_v4di a0 = {0};
+    volatile __gcc_v4di r = __builtin_ia32_psrlqi256(a0, 1);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_psrlwi256(a0, 1);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    __gcc_v16qi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16qi r = __builtin_ia32_psubb128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    __gcc_v32qi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_psubb256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v64qi a0 = {0};
+    __gcc_v64qi a1 = {0};
+    __gcc_v64qi a2 = {0};
+    unsigned long long a3 = {0};
+    volatile __gcc_v64qi r = __builtin_ia32_psubb512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v4si a0 = {0};
+    __gcc_v4si a1 = {0};
+    __gcc_v4si a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v4si r = __builtin_ia32_psubd128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8si a0 = {0};
+    __gcc_v8si a1 = {0};
+    __gcc_v8si a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8si r = __builtin_ia32_psubd256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16si a0 = {0};
+    __gcc_v16si a1 = {0};
+    __gcc_v16si a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16si r = __builtin_ia32_psubd512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v2di a0 = {0};
+    __gcc_v2di a1 = {0};
+    __gcc_v2di a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v2di r = __builtin_ia32_psubq128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v4di a0 = {0};
+    __gcc_v4di a1 = {0};
+    __gcc_v4di a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v4di r = __builtin_ia32_psubq256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8di a0 = {0};
+    __gcc_v8di a1 = {0};
+    __gcc_v8di a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8di r = __builtin_ia32_psubq512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    volatile __gcc_v16qi r = __builtin_ia32_psubsb128(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    __gcc_v16qi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16qi r = __builtin_ia32_psubsb128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_psubsb256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    __gcc_v32qi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_psubsb256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v64qi a0 = {0};
+    __gcc_v64qi a1 = {0};
+    __gcc_v64qi a2 = {0};
+    unsigned long long a3 = {0};
+    volatile __gcc_v64qi r = __builtin_ia32_psubsb512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    __gcc_v8hi a1 = {0};
+    __gcc_v8hi a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_psubsw128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_psubsw256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    __gcc_v16hi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_psubsw256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32hi a0 = {0};
+    __gcc_v32hi a1 = {0};
+    __gcc_v32hi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32hi r = __builtin_ia32_psubsw512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    __gcc_v16qi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16qi r = __builtin_ia32_psubusb128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_psubusb256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v32qi a0 = {0};
+    __gcc_v32qi a1 = {0};
+    __gcc_v32qi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32qi r = __builtin_ia32_psubusb256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v64qi a0 = {0};
+    __gcc_v64qi a1 = {0};
+    __gcc_v64qi a2 = {0};
+    unsigned long long a3 = {0};
+    volatile __gcc_v64qi r = __builtin_ia32_psubusb512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    __gcc_v8hi a1 = {0};
+    __gcc_v8hi a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_psubusw128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_psubusw256(a0, a1);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    __gcc_v16hi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_psubusw256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32hi a0 = {0};
+    __gcc_v32hi a1 = {0};
+    __gcc_v32hi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32hi r = __builtin_ia32_psubusw512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v8hi a0 = {0};
+    __gcc_v8hi a1 = {0};
+    __gcc_v8hi a2 = {0};
+    unsigned char a3 = {0};
+    volatile __gcc_v8hi r = __builtin_ia32_psubw128_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v16hi a0 = {0};
+    __gcc_v16hi a1 = {0};
+    __gcc_v16hi a2 = {0};
+    unsigned short a3 = {0};
+    volatile __gcc_v16hi r = __builtin_ia32_psubw256_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  {
+    __gcc_v32hi a0 = {0};
+    __gcc_v32hi a1 = {0};
+    __gcc_v32hi a2 = {0};
+    unsigned int a3 = {0};
+    volatile __gcc_v32hi r = __builtin_ia32_psubw512_mask(a0, a1, a2, a3);
+    (void)r;
+  }
+  __CPROVER_assert(1, "SIMD model smoke test");
+  return 0;
+}
diff --git a/regression/cbmc/SIMD_ia32_models/test.desc b/regression/cbmc/SIMD_ia32_models/test.desc
new file mode 100644
index 00000000000..83b8819429a
--- /dev/null
+++ b/regression/cbmc/SIMD_ia32_models/test.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+main.c
+
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/regression/cbmc/SIMD_neon_models/main.c b/regression/cbmc/SIMD_neon_models/main.c
new file mode 100644
index 00000000000..a6de2a7d2a2
--- /dev/null
+++ b/regression/cbmc/SIMD_neon_models/main.c
@@ -0,0 +1,143 @@
+// Auto-generated by scripts/generate_simd_smoke_test.py
+// Exercises every modelled SIMD builtin once so the library models are
+// type-checked, linked and symex'd. See doc/neon-intrinsic-models.md.
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef long long __gcc_v1di_s __attribute__((__vector_size__(8)));
+typedef unsigned long long __gcc_v1di_u __attribute__((__vector_size__(8)));
+typedef long long __gcc_v2di_s __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+typedef int __gcc_v2si_s __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+typedef short __gcc_v4hi_s __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+typedef int __gcc_v4si_s __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+typedef short __gcc_v8hi_s __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef signed char __gcc_v8qi_s __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+
+int main(void)
+{
+  {
+    __gcc_v8qi a0 = {0};
+    __gcc_v8qi a1 = {0};
+    volatile __gcc_v8qi r = __builtin_neon_vabd_v(a0, a1, 0);
+    (void)r;
+  }
+  {
+    __gcc_v8qi a0 = {0};
+    __gcc_v8qi a1 = {0};
+    __gcc_v8qi a2 = {0};
+    volatile __gcc_v8qi r = __builtin_neon_vbsl_v(a0, a1, a2, 1);
+    (void)r;
+  }
+  {
+    __gcc_v8qi a0 = {0};
+    __gcc_v8qi a1 = {0};
+    volatile __gcc_v8qi r = __builtin_neon_vhadd_v(a0, a1, 0);
+    (void)r;
+  }
+  {
+    __gcc_v8qi a0 = {0};
+    __gcc_v8qi a1 = {0};
+    volatile __gcc_v8qi r = __builtin_neon_vhsub_v(a0, a1, 0);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    volatile __gcc_v16qi r = __builtin_neon_vhsubq_v(a0, a1, 32);
+    (void)r;
+  }
+  {
+    __gcc_v8qi a0 = {0};
+    __gcc_v8qi a1 = {0};
+    volatile __gcc_v8qi r = __builtin_neon_vmax_v(a0, a1, 0);
+    (void)r;
+  }
+  {
+    __gcc_v8qi a0 = {0};
+    __gcc_v8qi a1 = {0};
+    volatile __gcc_v8qi r = __builtin_neon_vmin_v(a0, a1, 0);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    volatile __gcc_v16qi r = __builtin_neon_vminq_v(a0, a1, 32);
+    (void)r;
+  }
+  {
+    __gcc_v8qi a0 = {0};
+    __gcc_v8qi a1 = {0};
+    volatile __gcc_v8qi r = __builtin_neon_vpadd_v(a0, a1, 0);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    volatile __gcc_v16qi r = __builtin_neon_vpaddq_v(a0, a1, 32);
+    (void)r;
+  }
+  {
+    __gcc_v8qi a0 = {0};
+    __gcc_v8qi a1 = {0};
+    volatile __gcc_v8qi r = __builtin_neon_vpmax_v(a0, a1, 0);
+    (void)r;
+  }
+  {
+    __gcc_v8qi a0 = {0};
+    __gcc_v8qi a1 = {0};
+    volatile __gcc_v8qi r = __builtin_neon_vpmin_v(a0, a1, 0);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    volatile __gcc_v16qi r = __builtin_neon_vpminq_v(a0, a1, 32);
+    (void)r;
+  }
+  {
+    __gcc_v8qi a0 = {0};
+    __gcc_v8qi a1 = {0};
+    volatile __gcc_v8qi r = __builtin_neon_vqadd_v(a0, a1, 0);
+    (void)r;
+  }
+  {
+    __gcc_v8qi a0 = {0};
+    __gcc_v8qi a1 = {0};
+    volatile __gcc_v8qi r = __builtin_neon_vqsub_v(a0, a1, 0);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    volatile __gcc_v16qi r = __builtin_neon_vqsubq_v(a0, a1, 32);
+    (void)r;
+  }
+  {
+    __gcc_v8qi a0 = {0};
+    __gcc_v8qi a1 = {0};
+    volatile __gcc_v8qi r = __builtin_neon_vrhadd_v(a0, a1, 0);
+    (void)r;
+  }
+  {
+    __gcc_v16qi a0 = {0};
+    __gcc_v16qi a1 = {0};
+    volatile __gcc_v16qi r = __builtin_neon_vrhaddq_v(a0, a1, 32);
+    (void)r;
+  }
+  {
+    __gcc_v8qi a0 = {0};
+    __gcc_v8qi a1 = {0};
+    volatile __gcc_v8qi r = __builtin_neon_vtst_v(a0, a1, 0);
+    (void)r;
+  }
+  __CPROVER_assert(1, "SIMD model smoke test");
+  return 0;
+}
diff --git a/regression/cbmc/SIMD_neon_models/test.desc b/regression/cbmc/SIMD_neon_models/test.desc
new file mode 100644
index 00000000000..95d3979cc1f
--- /dev/null
+++ b/regression/cbmc/SIMD_neon_models/test.desc
@@ -0,0 +1,8 @@
+CORE gcc-only
+main.c
+--arch arm64
+^EXIT=0$
+^SIGNAL=0$
+^VERIFICATION SUCCESSFUL$
+--
+^warning: ignoring
diff --git a/scripts/check_intrinsic_models_sync.sh b/scripts/check_intrinsic_models_sync.sh
new file mode 100755
index 00000000000..c2c2717ebfd
--- /dev/null
+++ b/scripts/check_intrinsic_models_sync.sh
@@ -0,0 +1,29 @@
+#!/usr/bin/env bash
+#
+# Verify that src/ansi-c/library/x86_intrinsics.c is in sync with its
+# generator (scripts/generate_intrinsic_models.py). The generated library
+# must never be hand-edited; this check fails if regenerating it would
+# produce a different file, so a stale committed copy (or a MODELS change
+# without regeneration) is caught in CI.
+
+set -e
+
+script_dir=$(cd "$(dirname "$0")" && pwd)
+root=$(cd "$script_dir/.." && pwd)
+committed="$root/src/ansi-c/library/x86_intrinsics.c"
+
+tmp=$(mktemp)
+trap 'rm -f "$tmp"' EXIT
+
+python3 "$script_dir/generate_intrinsic_models.py" --cbmc-root "$root" -o "$tmp"
+
+if ! diff -u "$committed" "$tmp"; then
+  echo
+  echo "ERROR: src/ansi-c/library/x86_intrinsics.c is out of sync with"
+  echo "scripts/generate_intrinsic_models.py. Regenerate it with:"
+  echo "  python3 scripts/generate_intrinsic_models.py \\"
+  echo "    -o src/ansi-c/library/x86_intrinsics.c"
+  exit 1
+fi
+
+echo "x86_intrinsics.c is in sync with the generator."
diff --git a/scripts/generate_intrinsic_models.py b/scripts/generate_intrinsic_models.py
new file mode 100755
index 00000000000..a2d3c5f8c15
--- /dev/null
+++ b/scripts/generate_intrinsic_models.py
@@ -0,0 +1,809 @@
+#!/usr/bin/env python3
+"""
+Generate CBMC library models for x86 SIMD intrinsics.
+
+Models are described by the curated MODELS table below -- one entry per Intel
+_mm_* intrinsic, giving the element type, lane count, per-lane body and (where
+applicable) signedness, shift parameter, equivalence oracle and AVX-512 mask
+type. Wider-vector (AVX2 256-bit, AVX-512 512-bit) and merge-masked variants
+are derived automatically from the 128-bit base entries. Each model is emitted
+as a CBMC library function keyed by its GCC __builtin_ia32_* name, and the tool
+cross-checks against the __builtin_ia32_* declarations shipped in CBMC's
+compiler headers (src/ansi-c/compiler_headers/gcc_builtin_headers_ia32*.h) so
+that it only emits models for builtins CBMC actually knows about.
+
+The MODELS table is the authoritative, human-reviewed source of truth. The XML
+modes below are *maintainer aids* for extending it; they never feed the
+committed library directly.
+
+Modes
+-----
+  -o FILE
+      (Re)generate the library models into FILE (normally
+      src/ansi-c/library/x86_intrinsics.c). Output is piped through
+      clang-format-15 so regeneration is idempotent. CI re-runs this and
+      diffs the result via scripts/check_intrinsic_models_sync.sh, so the
+      committed file must always match the generator.
+
+  --status
+      Print a coverage report: which declared __builtin_ia32_* builtins are
+      modeled, grouped by CPUID feature. Use this to see what is left to do.
+
+  --status --xml data-latest.xml
+      As --status, plus a survey of which not-yet-modeled builtins have an
+      <operation> in Intel's Intrinsics Guide XML that is simple enough for the
+      --emit-drafts translator to handle (see below). Helps pick the next
+      tractable batch to model.
+
+  --emit-drafts data-latest.xml
+      Maintainer aid for growing the MODELS table. Translates the simple
+      element-wise pseudocode of not-yet-modeled intrinsics (see
+      parse_operation() for the exact accepted shape) into *draft* Model()
+      entries printed to stdout for review, and self-checks the translator by
+      re-deriving the geometry of the hand-written models and reporting any
+      mismatch. The drafts are intentionally incomplete: the translator does
+      NOT infer signedness or apply the UB-hardening (unsigned wrapping
+      arithmetic, modular negation) that correct models need, so a human must
+      finish and move each draft into MODELS. Nothing is written to the
+      library by this mode.
+
+  --emit-tests DIR
+      Write exhaustive-equivalence regression tests (model == CBMC's native
+      vector operator for all inputs) under DIR for every model with an
+      oracle. Used to (re)generate the per-function cbmc-library tests.
+
+Typical workflow for adding intrinsics
+--------------------------------------
+  1. scripts/generate_intrinsic_models.py --status --xml data-latest.xml
+     to find declared-but-unmodeled builtins with tractable pseudocode;
+  2. --emit-drafts data-latest.xml to get draft Model() entries;
+  3. review/finish each draft (signedness, UB-hardening) and add it to MODELS;
+  4. -o src/ansi-c/library/x86_intrinsics.c to regenerate the library;
+  5. --emit-tests regression/cbmc-library/__builtin_ia32 to refresh tests.
+
+The Intel Intrinsics Guide XML used by --xml/--emit-drafts can be downloaded
+from:
+  https://www.intel.com/content/dam/develop/public/us/en/include/intrinsics-guide/data-latest.xml
+"""
+
+import argparse
+import glob
+import os
+import re
+import shutil
+import subprocess
+import sys
+import xml.etree.ElementTree as ET
+from dataclasses import dataclass
+
+# GCC vector types used in CBMC headers, keyed by (element_c_type, count)
+VEC_TYPES = {
+    ("char", 16):      "__gcc_v16qi",
+    ("short", 8):      "__gcc_v8hi",
+    ("int", 4):        "__gcc_v4si",
+    ("long long", 2):  "__gcc_v2di",
+    ("float", 4):      "__gcc_v4sf",
+    ("double", 2):     "__gcc_v2df",
+    ("char", 8):       "__gcc_v8qi",
+    ("short", 4):      "__gcc_v4hi",
+    ("int", 2):        "__gcc_v2si",
+    # 256-bit (AVX2)
+    ("char", 32):      "__gcc_v32qi",
+    ("short", 16):     "__gcc_v16hi",
+    ("int", 8):        "__gcc_v8si",
+    ("long long", 4):  "__gcc_v4di",
+    # 512-bit (AVX-512)
+    ("char", 64):      "__gcc_v64qi",
+    ("short", 32):     "__gcc_v32hi",
+    ("int", 16):       "__gcc_v16si",
+    ("long long", 8):  "__gcc_v8di",
+}
+
+# AVX-512 write-mask C type (__mmask8/16/32/64) for a given lane count: the
+# smallest mask type with at least one bit per lane.
+def mask_type_for(count):
+    if count <= 8:
+        return "unsigned char"
+    if count <= 16:
+        return "unsigned short"
+    if count <= 32:
+        return "unsigned int"
+    if count <= 64:
+        return "unsigned long long"
+    return None
+
+# Bytes per element C type.
+ELEM_SIZE = {"char": 1, "short": 2, "int": 4, "long long": 8}
+
+# The library file this tool owns and (re)generates. Its models are this
+# tool's own output, so they are excluded from the "already modeled elsewhere"
+# check that decides what to emit (keeping regeneration idempotent).
+GENERATED_LIBRARY = os.path.join("src", "ansi-c", "library", "x86_intrinsics.c")
+
+
+@dataclass
+class Model:
+    """A single per-element SIMD intrinsic model.
+
+    builtin: the GCC __builtin_ia32_* name the model implements.
+    elem:    base element C type ("char", "short", "int", "long long").
+    count:   number of lanes.
+    body:    per-element body template using {a}, {b} (operands) and {j}
+             (lane index), assigned to dst[j].
+    sign:    how the per-element operation is carried out, by aliasing the
+             operands to a vector of the chosen signedness before the loop and
+             casting the result back:
+               ""         - use the (signed-by-default) vector type as-is;
+               "signed"   - force signed semantics. Needed where 'char' may be
+                            unsigned (e.g. ARM): without this '< 0' is always
+                            false (-Werror=type-limits in library_check.sh) and
+                            'a > b' would silently become an unsigned compare,
+                            which is wrong for signed intrinsics like
+                            _mm_max_epi8;
+               "unsigned" - perform the operation in the matching unsigned type.
+                            Used both for genuinely unsigned intrinsics (min/max
+                            epu*, avg) and for the wrapping signed-arithmetic
+                            intrinsics (add/sub/mullo on 32/64-bit lanes), where
+                            'int + int' etc. would be signed-overflow UB:
+                            unsigned arithmetic is well-defined modular and the
+                            cast back reproduces the two's-complement result.
+    scalar2: C type of a scalar second parameter (e.g. "int" for a shift
+             count) instead of a second vector operand. When set, the body
+             refers to it as {b} (a scalar, not {b}[{j}]).
+    oracle:  a native C vector operator (e.g. "+", "-", "==", ">", "&") or the
+             tag "andnot" (~a & b), for which CBMC's own vector semantics
+             provide an independent reference; --emit-tests then generates an
+             exhaustive equivalence proof against it for all inputs.
+    mask_type: when set (to an __mmask C type), this is an AVX-512 merge-masked
+             variant: the function takes (a, b, merge-source, mask) and each
+             lane is the base body if its mask bit is set, else the merge
+             source. body/sign describe the underlying (unmasked) operation.
+    """
+    builtin: str
+    elem: str
+    count: int
+    body: str
+    sign: str = ""
+    scalar2: str = None
+    oracle: str = None
+    mask_type: str = None
+
+
+# Intel _mm_* name -> Model
+MODELS = {
+    # --- add (32/64-bit done unsigned to avoid signed-overflow UB) ---
+    "_mm_add_epi8":  Model("__builtin_ia32_paddb128", "char", 16, "{a}[{j}] + {b}[{j}]", oracle="+"),
+    "_mm_add_epi16": Model("__builtin_ia32_paddw128", "short", 8, "{a}[{j}] + {b}[{j}]", oracle="+"),
+    "_mm_add_epi32": Model("__builtin_ia32_paddd128", "int", 4, "{a}[{j}] + {b}[{j}]", sign="unsigned", oracle="+"),
+    "_mm_add_epi64": Model("__builtin_ia32_paddq128", "long long", 2, "{a}[{j}] + {b}[{j}]", sign="unsigned", oracle="+"),
+    # --- sub (ditto) ---
+    "_mm_sub_epi8":  Model("__builtin_ia32_psubb128", "char", 16, "{a}[{j}] - {b}[{j}]", oracle="-"),
+    "_mm_sub_epi16": Model("__builtin_ia32_psubw128", "short", 8, "{a}[{j}] - {b}[{j}]", oracle="-"),
+    "_mm_sub_epi32": Model("__builtin_ia32_psubd128", "int", 4, "{a}[{j}] - {b}[{j}]", sign="unsigned", oracle="-"),
+    "_mm_sub_epi64": Model("__builtin_ia32_psubq128", "long long", 2, "{a}[{j}] - {b}[{j}]", sign="unsigned", oracle="-"),
+    # --- min/max signed ---
+    "_mm_min_epi8":  Model("__builtin_ia32_pminsb128", "char", 16, "{a}[{j}] < {b}[{j}] ? {a}[{j}] : {b}[{j}]", sign="signed"),
+    "_mm_min_epi16": Model("__builtin_ia32_pminsw128", "short", 8, "{a}[{j}] < {b}[{j}] ? {a}[{j}] : {b}[{j}]"),
+    "_mm_min_epi32": Model("__builtin_ia32_pminsd128", "int", 4, "{a}[{j}] < {b}[{j}] ? {a}[{j}] : {b}[{j}]"),
+    "_mm_max_epi8":  Model("__builtin_ia32_pmaxsb128", "char", 16, "{a}[{j}] > {b}[{j}] ? {a}[{j}] : {b}[{j}]", sign="signed"),
+    "_mm_max_epi16": Model("__builtin_ia32_pmaxsw128", "short", 8, "{a}[{j}] > {b}[{j}] ? {a}[{j}] : {b}[{j}]"),
+    "_mm_max_epi32": Model("__builtin_ia32_pmaxsd128", "int", 4, "{a}[{j}] > {b}[{j}] ? {a}[{j}] : {b}[{j}]"),
+    # --- min/max unsigned ---
+    "_mm_min_epu8":  Model("__builtin_ia32_pminub128", "char", 16, "{a}[{j}] < {b}[{j}] ? {a}[{j}] : {b}[{j}]", sign="unsigned"),
+    "_mm_max_epu8":  Model("__builtin_ia32_pmaxub128", "char", 16, "{a}[{j}] > {b}[{j}] ? {a}[{j}] : {b}[{j}]", sign="unsigned"),
+    "_mm_min_epu16": Model("__builtin_ia32_pminuw128", "short", 8, "{a}[{j}] < {b}[{j}] ? {a}[{j}] : {b}[{j}]", sign="unsigned"),
+    "_mm_max_epu16": Model("__builtin_ia32_pmaxuw128", "short", 8, "{a}[{j}] > {b}[{j}] ? {a}[{j}] : {b}[{j}]", sign="unsigned"),
+    "_mm_min_epu32": Model("__builtin_ia32_pminud128", "int", 4, "{a}[{j}] < {b}[{j}] ? {a}[{j}] : {b}[{j}]", sign="unsigned"),
+    "_mm_max_epu32": Model("__builtin_ia32_pmaxud128", "int", 4, "{a}[{j}] > {b}[{j}] ? {a}[{j}] : {b}[{j}]", sign="unsigned"),
+    # --- abs (32-bit uses unsigned modular negation to avoid -INT_MIN UB) ---
+    "_mm_abs_epi8":  Model("__builtin_ia32_pabsb128", "char", 16, "{a}[{j}] < 0 ? -{a}[{j}] : {a}[{j}]", sign="signed"),
+    "_mm_abs_epi16": Model("__builtin_ia32_pabsw128", "short", 8, "{a}[{j}] < 0 ? -{a}[{j}] : {a}[{j}]"),
+    "_mm_abs_epi32": Model("__builtin_ia32_pabsd128", "int", 4, "{a}[{j}] < 0 ? (int)(0u - (unsigned){a}[{j}]) : {a}[{j}]"),
+    # --- compare (result is all-1s or all-0s per element) ---
+    "_mm_cmpeq_epi8":  Model("__builtin_ia32_pcmpeqb128", "char", 16, "{a}[{j}] == {b}[{j}] ? -1 : 0", oracle="=="),
+    "_mm_cmpeq_epi16": Model("__builtin_ia32_pcmpeqw128", "short", 8, "{a}[{j}] == {b}[{j}] ? -1 : 0", oracle="=="),
+    "_mm_cmpeq_epi32": Model("__builtin_ia32_pcmpeqd128", "int", 4, "{a}[{j}] == {b}[{j}] ? -1 : 0", oracle="=="),
+    "_mm_cmpgt_epi8":  Model("__builtin_ia32_pcmpgtb128", "char", 16, "{a}[{j}] > {b}[{j}] ? -1 : 0", sign="signed", oracle=">"),
+    "_mm_cmpgt_epi16": Model("__builtin_ia32_pcmpgtw128", "short", 8, "{a}[{j}] > {b}[{j}] ? -1 : 0", oracle=">"),
+    "_mm_cmpgt_epi32": Model("__builtin_ia32_pcmpgtd128", "int", 4, "{a}[{j}] > {b}[{j}] ? -1 : 0", oracle=">"),
+    # --- average unsigned ---
+    "_mm_avg_epu8":  Model("__builtin_ia32_pavgb128", "char", 16, "({a}[{j}] + {b}[{j}] + 1) >> 1", sign="unsigned"),
+    "_mm_avg_epu16": Model("__builtin_ia32_pavgw128", "short", 8, "({a}[{j}] + {b}[{j}] + 1) >> 1", sign="unsigned"),
+    # --- mullo (low half of multiply; 32-bit done unsigned to avoid UB) ---
+    "_mm_mullo_epi16": Model("__builtin_ia32_pmullw128", "short", 8, "{a}[{j}] * {b}[{j}]"),
+    "_mm_mullo_epi32": Model("__builtin_ia32_pmulld128", "int", 4, "{a}[{j}] * {b}[{j}]", sign="unsigned"),
+    # --- bitwise (whole-register; modeled on 64-bit lanes) ---
+    "_mm_and_si128":    Model("__builtin_ia32_pand128", "long long", 2, "{a}[{j}] & {b}[{j}]", oracle="&"),
+    "_mm_or_si128":     Model("__builtin_ia32_por128", "long long", 2, "{a}[{j}] | {b}[{j}]", oracle="|"),
+    "_mm_xor_si128":    Model("__builtin_ia32_pxor128", "long long", 2, "{a}[{j}] ^ {b}[{j}]", oracle="^"),
+    "_mm_andnot_si128": Model("__builtin_ia32_pandn128", "long long", 2, "~{a}[{j}] & {b}[{j}]", oracle="andnot"),
+    # --- MMX 64-bit add/sub (32-bit lanes done unsigned to avoid UB) ---
+    "_mm_add_pi8":  Model("__builtin_ia32_paddb", "char", 8, "{a}[{j}] + {b}[{j}]", oracle="+"),
+    "_mm_add_pi16": Model("__builtin_ia32_paddw", "short", 4, "{a}[{j}] + {b}[{j}]", oracle="+"),
+    "_mm_add_pi32": Model("__builtin_ia32_paddd", "int", 2, "{a}[{j}] + {b}[{j}]", sign="unsigned", oracle="+"),
+    "_mm_sub_pi8":  Model("__builtin_ia32_psubb", "char", 8, "{a}[{j}] - {b}[{j}]", oracle="-"),
+    "_mm_sub_pi16": Model("__builtin_ia32_psubw", "short", 4, "{a}[{j}] - {b}[{j}]", oracle="-"),
+    "_mm_sub_pi32": Model("__builtin_ia32_psubd", "int", 2, "{a}[{j}] - {b}[{j}]", sign="unsigned", oracle="-"),
+    # --- saturating add: clamp to the element type's range ---
+    "_mm_adds_epi8":  Model("__builtin_ia32_paddsb128", "char", 16, "({a}[{j}] + {b}[{j}]) < -128 ? -128 : ({a}[{j}] + {b}[{j}]) > 127 ? 127 : {a}[{j}] + {b}[{j}]", sign="signed"),
+    "_mm_adds_epi16": Model("__builtin_ia32_paddsw128", "short", 8, "({a}[{j}] + {b}[{j}]) < -32768 ? -32768 : ({a}[{j}] + {b}[{j}]) > 32767 ? 32767 : {a}[{j}] + {b}[{j}]"),
+    "_mm_adds_epu8":  Model("__builtin_ia32_paddusb128", "char", 16, "({a}[{j}] + {b}[{j}]) > 255 ? 255 : {a}[{j}] + {b}[{j}]", sign="unsigned"),
+    "_mm_adds_epu16": Model("__builtin_ia32_paddusw128", "short", 8, "({a}[{j}] + {b}[{j}]) > 65535 ? 65535 : {a}[{j}] + {b}[{j}]", sign="unsigned"),
+    # --- saturating sub ---
+    "_mm_subs_epi8":  Model("__builtin_ia32_psubsb128", "char", 16, "({a}[{j}] - {b}[{j}]) < -128 ? -128 : ({a}[{j}] - {b}[{j}]) > 127 ? 127 : {a}[{j}] - {b}[{j}]", sign="signed"),
+    "_mm_subs_epi16": Model("__builtin_ia32_psubsw128", "short", 8, "({a}[{j}] - {b}[{j}]) < -32768 ? -32768 : ({a}[{j}] - {b}[{j}]) > 32767 ? 32767 : {a}[{j}] - {b}[{j}]"),
+    "_mm_subs_epu8":  Model("__builtin_ia32_psubusb128", "char", 16, "({a}[{j}] - {b}[{j}]) < 0 ? 0 : {a}[{j}] - {b}[{j}]", sign="unsigned"),
+    "_mm_subs_epu16": Model("__builtin_ia32_psubusw128", "short", 8, "({a}[{j}] - {b}[{j}]) < 0 ? 0 : {a}[{j}] - {b}[{j}]", sign="unsigned"),
+    # --- shift by immediate (count in a scalar int) ---
+    # Logical shifts use unsigned lanes (well-defined modular shift); a count
+    # >= the element width yields 0. Casting the count to unsigned also makes a
+    # negative immediate compare as too large (yielding 0) instead of being UB.
+    "_mm_slli_epi16": Model("__builtin_ia32_psllwi128", "short", 8, "(unsigned){b} >= 16 ? 0 : {a}[{j}] << {b}", sign="unsigned", scalar2="int"),
+    "_mm_slli_epi32": Model("__builtin_ia32_pslldi128", "int", 4, "(unsigned){b} >= 32 ? 0 : {a}[{j}] << {b}", sign="unsigned", scalar2="int"),
+    "_mm_slli_epi64": Model("__builtin_ia32_psllqi128", "long long", 2, "(unsigned){b} >= 64 ? 0 : {a}[{j}] << {b}", sign="unsigned", scalar2="int"),
+    "_mm_srli_epi16": Model("__builtin_ia32_psrlwi128", "short", 8, "(unsigned){b} >= 16 ? 0 : {a}[{j}] >> {b}", sign="unsigned", scalar2="int"),
+    "_mm_srli_epi32": Model("__builtin_ia32_psrldi128", "int", 4, "(unsigned){b} >= 32 ? 0 : {a}[{j}] >> {b}", sign="unsigned", scalar2="int"),
+    "_mm_srli_epi64": Model("__builtin_ia32_psrlqi128", "long long", 2, "(unsigned){b} >= 64 ? 0 : {a}[{j}] >> {b}", sign="unsigned", scalar2="int"),
+    # Arithmetic right shift uses signed lanes; a count of >= width yields the
+    # sign fill (-1 for negative inputs, 0 otherwise).
+    "_mm_srai_epi16": Model("__builtin_ia32_psrawi128", "short", 8, "(unsigned){b} >= 16 ? ({a}[{j}] < 0 ? -1 : 0) : {a}[{j}] >> {b}", scalar2="int"),
+    "_mm_srai_epi32": Model("__builtin_ia32_psradi128", "int", 4, "(unsigned){b} >= 32 ? ({a}[{j}] < 0 ? -1 : 0) : {a}[{j}] >> {b}", scalar2="int"),
+}
+
+
+def width_variants(declared):
+    """Derive wider-vector variants of the 128-bit base MODELS entries.
+
+    The per-element body is width-independent, so a 256-bit (AVX2) variant
+    differs only in the builtin name (...128 -> ...256), the Intel name
+    (_mm_ -> _mm256_) and the lane count (doubled). A variant is produced only
+    when its builtin is actually declared in CBMC's headers. (512-bit AVX-512
+    forms are mask-only -- e.g. ...512_mask -- and are handled separately.)"""
+    variants = {}
+    for intel_name, m in MODELS.items():
+        if not m.builtin.endswith("128"):
+            continue
+        builtin256 = m.builtin[:-len("128")] + "256"
+        if builtin256 not in declared:
+            continue
+        name256 = intel_name.replace("_mm_", "_mm256_", 1)
+        variants[name256] = Model(
+            builtin256, m.elem, m.count * 2, m.body, m.sign, m.scalar2,
+            m.oracle)
+    return variants
+
+
+def mask_variants(declared):
+    """Derive AVX-512 merge-masked variants (128-, 256- and 512-bit) of the
+    binary pointwise base entries (those with a second vector operand and no
+    scalar parameter), gated on the ...<width>_mask builtin being declared.
+    The masking is a uniform wrapper over the base per-element body. (There is
+    no separate _maskz builtin for these ops: zero-masking is the _mask form
+    with a zero merge source.)"""
+    variants = {}
+    for intel_name, m in MODELS.items():
+        if m.scalar2 or "{b}" not in m.body or not m.builtin.endswith("128"):
+            continue
+        # Masked compares (pcmp*_mask) are not merge-masked vector ops: they
+        # return an __mmask and take (a, b, k), so the merge-mask wrapper below
+        # would give them the wrong signature. Skip the compare base ops.
+        if m.oracle in ("==", ">"):
+            continue
+        mnemonic = m.builtin[len("__builtin_ia32_"):-len("128")]
+        for width, factor in (("128", 1), ("256", 2), ("512", 4)):
+            builtin = f"__builtin_ia32_{mnemonic}{width}_mask"
+            count = m.count * factor
+            mask_type = mask_type_for(count)
+            if (builtin not in declared or mask_type is None
+                    or (m.elem, count) not in VEC_TYPES):
+                continue
+            prefix = "_mm_mask_" if width == "128" else f"_mm{width}_mask_"
+            name = intel_name.replace("_mm_", prefix, 1)
+            variants[name] = Model(
+                builtin, m.elem, count, m.body, m.sign, mask_type=mask_type)
+    return variants
+
+
+def get_existing_models(cbmc_root, exclude=None):
+    """Collect __builtin_ia32_* models already present in the library. When
+    regenerating a file, that file is passed as *exclude* so its own models do
+    not count as "already present" (keeping regeneration idempotent)."""
+    models = set()
+    lib_dir = os.path.join(cbmc_root, "src", "ansi-c", "library")
+    exclude = os.path.abspath(exclude) if exclude else None
+    for fname in os.listdir(lib_dir):
+        if not fname.endswith(".c"):
+            continue
+        path = os.path.join(lib_dir, fname)
+        if exclude and os.path.abspath(path) == exclude:
+            continue
+        with open(path) as f:
+            for m in re.finditer(r'/\* FUNCTION: (__builtin_ia32_\w+)', f.read()):
+                models.add(m.group(1))
+    return models
+
+
+def get_declared_builtins(cbmc_root):
+    builtins = set()
+    pattern = os.path.join(cbmc_root, "src", "ansi-c", "compiler_headers",
+                           "gcc_builtin_headers_ia32*.h")
+    for hdr in glob.glob(pattern):
+        with open(hdr) as f:
+            for m in re.finditer(r'(__builtin_ia32_\w+)', f.read()):
+                builtins.add(m.group(1))
+    return builtins
+
+
+def emit_model(model):
+    """Emit a CBMC library model function for a Model, or None if the
+    (element type, count) combination has no known GCC vector type."""
+    vec_type = VEC_TYPES.get((model.elem, model.count))
+    if vec_type is None:
+        return None
+
+    total_bytes = model.count * ELEM_SIZE[model.elem]
+    vec_typedef = (f"typedef {model.elem} {vec_type} "
+                   f"__attribute__((__vector_size__({total_bytes})));")
+
+    lines = [f"/* FUNCTION: {model.builtin} */", "", vec_typedef, ""]
+
+    # Determine the type the per-element operation runs in (work_type) and,
+    # if it differs from the public vector type, the alias typedef for it.
+    work_type = vec_type
+    if model.sign in ("signed", "unsigned"):
+        work_type = f"{vec_type}_{'u' if model.sign == 'unsigned' else 's'}"
+        work_typedef = (f"typedef {model.sign} {model.elem} {work_type} "
+                        f"__attribute__((__vector_size__({total_bytes})));")
+        lines.insert(3, work_typedef)
+
+    scalar = model.scalar2 is not None
+    n_params = 2 if (scalar or "{b}" in model.body) else 1
+    # A scalar second operand is referred to directly as {b} (not {b}[{j}]).
+    body = model.body.format(a="a_", b=("b" if scalar else "b_"), j="j")
+    cast = f"({work_type})" if work_type != vec_type else ""
+
+    if n_params == 1:
+        lines.append(f"{vec_type} {model.builtin}({vec_type} a)")
+    elif scalar:
+        lines.append(
+            f"{vec_type} {model.builtin}({vec_type} a, {model.scalar2} b)")
+    else:
+        lines.append(f"{vec_type} {model.builtin}({vec_type} a, {vec_type} b)")
+
+    lines.append("{")
+    lines.append(f"  {work_type} a_ = {cast}a;")
+    if n_params > 1 and not scalar:
+        lines.append(f"  {work_type} b_ = {cast}b;")
+    lines.append(f"  {work_type} dst;")
+    lines.append(f"  for(int j = 0; j < {model.count}; j++)")
+    lines.append(f"    dst[j] = {body};")
+    if work_type != vec_type:
+        lines.append(f"  return ({vec_type})dst;")
+    else:
+        lines.append("  return dst;")
+    lines.append("}")
+    lines.append("")
+    return "\n".join(lines)
+
+
+def emit_masked_model(model):
+    """Emit an AVX-512 merge-masked model: per lane, the base body if the
+    mask bit is set, otherwise the corresponding lane of the merge source."""
+    vec_type = VEC_TYPES.get((model.elem, model.count))
+    if vec_type is None:
+        return None
+    total_bytes = model.count * ELEM_SIZE[model.elem]
+    lines = [f"/* FUNCTION: {model.builtin} */", "",
+             f"typedef {model.elem} {vec_type} "
+             f"__attribute__((__vector_size__({total_bytes})));"]
+    work_type = vec_type
+    if model.sign in ("signed", "unsigned"):
+        work_type = f"{vec_type}_{'u' if model.sign == 'unsigned' else 's'}"
+        lines.append(f"typedef {model.sign} {model.elem} {work_type} "
+                     f"__attribute__((__vector_size__({total_bytes})));")
+    lines.append("")
+    body = model.body.format(a="a_", b="b_", j="j")
+    cast = f"({work_type})" if work_type != vec_type else ""
+    lines.append(f"{vec_type} {model.builtin}({vec_type} a, {vec_type} b, "
+                 f"{vec_type} src, {model.mask_type} k)")
+    lines.append("{")
+    lines.append(f"  {work_type} a_ = {cast}a;")
+    lines.append(f"  {work_type} b_ = {cast}b;")
+    lines.append(f"  {vec_type} dst;")
+    lines.append(f"  for(int j = 0; j < {model.count}; j++)")
+    lines.append(f"    dst[j] = (k >> j) & 1 ? ({model.elem})({body}) : src[j];")
+    lines.append("  return dst;")
+    lines.append("}")
+    lines.append("")
+    return "\n".join(lines)
+
+
+# --- Intel Intrinsics Guide XML survey (--xml) -----------------------------
+
+# Base C element type for an element bit width.
+_BITS_TO_ELEM = {8: "char", 16: "short", 32: "int", 64: "long long"}
+
+
+def parse_operation(op_text):
+    """Translate a simple element-wise Intel <operation> into
+    (elem, count, body) for the generator, or None if it is not the supported
+    shape: a single 'FOR j := 0 to N' loop with one
+    'dst[i+W-1:i] := <expr>' assignment whose expression uses only the per-lane
+    operands a/b, the operators + - *, and parentheses.
+
+    This deliberately does NOT infer signedness or apply the UB-hardening
+    (unsigned wrapping arithmetic etc.) that the hand-written MODELS use, so
+    its output is a draft for human review rather than a finished model."""
+    if not op_text:
+        return None
+    # Drop trailing upper-bits-zero lines such as 'dst[MAX:256] := 0'.
+    lines = [ln for ln in op_text.strip().splitlines()
+             if not re.match(r'\s*dst\[(?:MAX|\d+):\d+\]\s*:=\s*0\s*$', ln)]
+    text = "\n".join(lines)
+    m_for = re.search(r'FOR\s+j\s*:=\s*0\s+to\s+(\d+)', text)
+    if not m_for or len(re.findall(r'\bFOR\b', text)) != 1:
+        return None
+    count = int(m_for.group(1)) + 1
+    assignments = re.findall(r'dst\[i\+(\d+):i\]\s*:=\s*(.+)', text)
+    if len(assignments) != 1:
+        return None
+    width = int(assignments[0][0]) + 1
+    elem = _BITS_TO_ELEM.get(width)
+    if elem is None:
+        return None
+    expr = assignments[0][1].strip()
+    # Reject widening/narrowing ops: every operand lane slice must have the
+    # same width as the destination lane (e.g. _mm_mul_epu32 reads 32-bit
+    # halves into a 64-bit dst and must not be translated element-wise).
+    operand_widths = {int(w) + 1
+                      for w in re.findall(r'\b[ab]\[i\+(\d+):i\]', expr)}
+    if operand_widths and operand_widths != {width}:
+        return None
+    # Per-lane slices a[i+W-1:i] / b[i+W-1:i] become {a}[{j}] / {b}[{j}].
+    expr = re.sub(r'\ba\[i\+\d+:i\]', '{a}[{j}]', expr)
+    expr = re.sub(r'\bb\[i\+\d+:i\]', '{b}[{j}]', expr)
+    # Anything other than the lane placeholders, + - *, parentheses and
+    # whitespace means we do not fully understand the expression.
+    residue = re.sub(r'\{a\}\[\{j\}\]|\{b\}\[\{j\}\]|[-+*()\s]', '', expr)
+    if residue:
+        return None
+    return elem, count, expr
+
+
+def xml_emit_drafts(xml_path, declared, existing, all_models):
+    """Return (drafts, geometry_mismatches). drafts maps a not-yet-modeled
+    declared builtin to (intel_name, elem, count, body) derived from its
+    pseudocode. geometry_mismatches lists already-modeled builtins where the
+    translator's (elem, count) disagrees with the hand-written Model -- a
+    self-check that the translator reads the pseudocode geometry correctly."""
+    root = ET.parse(xml_path).getroot()
+    builtin_to_model = {m.builtin: m for m in all_models.values()}
+    modeled = set(builtin_to_model) | existing
+    drafts = {}
+    mismatches = []
+    for intrinsic in root.iter("intrinsic"):
+        operation = intrinsic.find("operation")
+        parsed = parse_operation(
+            operation.text if operation is not None else None)
+        if not parsed:
+            continue
+        elem, count, body = parsed
+        for builtin in _builtin_candidates(intrinsic):
+            if builtin not in declared:
+                continue
+            model = builtin_to_model.get(builtin)
+            if model is not None:
+                if (model.elem, model.count) != (elem, count):
+                    mismatches.append(
+                        (builtin, (elem, count), (model.elem, model.count)))
+            elif builtin not in modeled:
+                drafts.setdefault(
+                    builtin, (intrinsic.get("name"), elem, count, body))
+    return drafts, mismatches
+
+
+# Width (and hence GCC builtin suffix) implied by an <instruction> form.
+def _instruction_width(form):
+    f = (form or "").lower()
+    if "zmm" in f:
+        return "512"
+    if "ymm" in f:
+        return "256"
+    if "xmm" in f:
+        return "128"
+    if "mm" in f:
+        return ""  # 64-bit MMX builtins typically carry no width suffix
+    return None
+
+
+def _builtin_candidates(intrinsic):
+    """Best-effort set of GCC builtin names an <intrinsic> might correspond to,
+    derived from its <instruction> mnemonic(s) and register width. Heuristic:
+    AVX-512 masked variants and a few irregular names will not map."""
+    out = set()
+    for instr in intrinsic.findall("instruction"):
+        mnemonic = (instr.get("name") or "").lower()
+        width = _instruction_width(instr.get("form"))
+        if mnemonic and width is not None:
+            out.add(f"__builtin_ia32_{mnemonic}{width}")
+    return out
+
+
+def _is_auto_generatable(operation):
+    """Heuristic: does this <operation> pseudocode have the simple per-element
+    shape the generator can already emit (a single FOR loop assigning dst[...]
+    from a/b, with no control flow or helper-function calls)?"""
+    if not operation:
+        return False
+    s = operation.strip()
+    # exactly one FOR ... ENDFOR (\bFOR\b does not match inside "ENDFOR")
+    if len(re.findall(r'\bFOR\b', s)) != 1 or "ENDFOR" not in s:
+        return False
+    if re.search(r'\b(CASE|IF|ELSE|RETURN|DEFINE|WHILE)\b', s):
+        return False
+    body = "\n".join(line for line in s.splitlines()
+                     if not re.search(r'\b(FOR|ENDFOR)\b', line))
+    if "dst[" not in body:
+        return False
+    # reject helper-function calls such as ABS(), SignExtend(), Saturate*()
+    if re.search(r'[A-Za-z_]\w*\s*\(', body):
+        return False
+    return True
+
+
+def xml_autogen_candidates(xml_path, declared, existing):
+    """Return a sorted list of (intel_name, builtin) for not-yet-modeled
+    builtins whose Intel pseudocode looks auto-generatable, plus the total
+    number of auto-generatable intrinsics seen (regardless of mapping)."""
+    root = ET.parse(xml_path).getroot()
+    missing = declared - existing
+    candidates = {}
+    total_parseable = 0
+    for intrinsic in root.iter("intrinsic"):
+        operation = intrinsic.find("operation")
+        op_text = operation.text if operation is not None else None
+        if not _is_auto_generatable(op_text):
+            continue
+        total_parseable += 1
+        name = intrinsic.get("name")
+        for builtin in _builtin_candidates(intrinsic):
+            if builtin in missing:
+                candidates.setdefault(builtin, name)
+    return (sorted((name, b) for b, name in candidates.items()),
+            total_parseable)
+
+
+def xml_cpuid_coverage(xml_path, declared, existing):
+    """Per CPUID feature, how many mappable-to-declared builtins are modeled.
+    Returns rows (feature, modeled_count, declared_count) sorted by declared
+    count descending. Grouped by the intrinsic's first <CPUID> element."""
+    from collections import defaultdict
+    root = ET.parse(xml_path).getroot()
+    decl = defaultdict(set)
+    modeled = defaultdict(set)
+    for intrinsic in root.iter("intrinsic"):
+        feature = intrinsic.findtext("CPUID") or "(none)"
+        for builtin in _builtin_candidates(intrinsic):
+            if builtin in declared:
+                decl[feature].add(builtin)
+                if builtin in existing:
+                    modeled[feature].add(builtin)
+    return [(feat, len(modeled[feat]), len(decl[feat]))
+            for feat in sorted(decl, key=lambda f: len(decl[f]), reverse=True)]
+
+
+def format_output(text, assume_filename):
+    """Run generated C through clang-format so bodies of any length come out
+    matching the project style (and the CI clang-format check). A no-op on
+    already-clean output; if clang-format is unavailable the text is returned
+    unchanged (CI's clang-format check would then catch any divergence).
+    *assume_filename* tells clang-format which .clang-format / language to use."""
+    for clang_format in ("clang-format-15", "clang-format"):
+        if shutil.which(clang_format):
+            result = subprocess.run(
+                [clang_format, "--assume-filename", assume_filename],
+                input=text, capture_output=True, text=True)
+            if result.returncode == 0:
+                return result.stdout
+            break
+    sys.stderr.write("warning: clang-format missing or failed; output not reformatted\n")
+    return text
+
+
+def equivalence_test(model):
+    """C source for an exhaustive equivalence test: model(a, b) must equal a
+    reference built from CBMC's native vector operators for all inputs. The
+    reference is independent of the library model (CBMC implements vector
+    operators directly). Returns None if the model has no oracle / no vector
+    type.
+
+    Arithmetic references (+, -, *) are computed on unsigned lanes so they are
+    overflow-clean and wrap like the hardware; bitwise and comparison
+    references use the signed vector type directly (signedness is irrelevant
+    to & | ^ and ==, and pcmpgt is a signed compare)."""
+    if not model.oracle:
+        return None
+    vec = VEC_TYPES.get((model.elem, model.count))
+    if vec is None:
+        return None
+    nbytes = model.count * ELEM_SIZE[model.elem]
+    decls = [f"typedef {model.elem} {vec} "
+             f"__attribute__((__vector_size__({nbytes})));"]
+    if model.oracle in ("+", "-", "*"):
+        uvec = vec + "_u"
+        decls.append(f"typedef unsigned {model.elem} {uvec} "
+                     f"__attribute__((__vector_size__({nbytes})));")
+        ref_type = uvec
+        ref_expr = f"({uvec})a {model.oracle} ({uvec})b"
+        lane = lambda k: f"r[{k}] == ({model.elem})ref[{k}]"
+        desc = f"native {model.oracle}"
+    elif model.oracle == "andnot":
+        ref_type = vec
+        ref_expr = "~a & b"
+        lane = lambda k: f"r[{k}] == ref[{k}]"
+        desc = "native ~a & b"
+    else:  # & | ^ == >
+        ref_type = vec
+        ref_expr = f"a {model.oracle} b"
+        lane = lambda k: f"r[{k}] == ref[{k}]"
+        desc = f"native {model.oracle}"
+    decls.append(f"{vec} {model.builtin}({vec}, {vec});")
+    lanes = " && ".join(lane(k) for k in range(model.count))
+    return (
+        "\n".join(decls) + "\n\n"
+        "int main()\n"
+        "{\n"
+        "  // Exhaustive equivalence: the model must agree with CBMC's own\n"
+        f"  // vector semantics ({desc}) for all inputs.\n"
+        f"  {vec} a, b;\n"
+        f"  {vec} r = {model.builtin}(a, b);\n"
+        f"  {ref_type} ref = {ref_expr};\n"
+        f"  __CPROVER_assert(\n    {lanes},\n"
+        f'    "{model.builtin} == {desc}");\n'
+        "  return 0;\n"
+        "}\n")
+
+
+TEST_DESC = ("CORE gcc-only\nmain.c\n\n"
+             "^EXIT=0$\n^SIGNAL=0$\n^VERIFICATION SUCCESSFUL$\n--\n"
+             "^warning: ignoring\n")
+
+
+def emit_tests(out_dir, all_models):
+    """Write an exhaustive-equivalence regression test (main.c + test.desc)
+    under out_dir/<builtin>/ for every model that has a native-operator
+    oracle. Returns the number of tests written."""
+    written = 0
+    for model in all_models.values():
+        source = equivalence_test(model)
+        if source is None:
+            continue
+        test_dir = os.path.join(out_dir, model.builtin)
+        os.makedirs(test_dir, exist_ok=True)
+        main_c = os.path.join(test_dir, "main.c")
+        with open(main_c, "w") as f:
+            f.write(format_output(source, main_c))
+        with open(os.path.join(test_dir, "test.desc"), "w") as f:
+            f.write(TEST_DESC)
+        written += 1
+    return written
+
+
+def main():
+    p = argparse.ArgumentParser(description=__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--cbmc-root", default=".")
+    p.add_argument("-o", "--output")
+    p.add_argument("--status", action="store_true",
+                   help="Show declared vs modeled intrinsics")
+    p.add_argument("--xml",
+                   help="Intel Intrinsics Guide data-latest.xml; with --status, "
+                        "survey which not-yet-modeled builtins have "
+                        "auto-generatable pseudocode")
+    p.add_argument("--emit-drafts", metavar="XML",
+                   help="Translate the simple element-wise pseudocode of "
+                        "not-yet-modeled intrinsics into draft Model() entries "
+                        "for review (signedness and UB-hardening still need a "
+                        "human), and self-check the translator against the "
+                        "hand-written models")
+    p.add_argument("--emit-tests", metavar="DIR",
+                   help="Write exhaustive-equivalence regression tests "
+                        "(model == CBMC's native vector operator for all "
+                        "inputs) under DIR for every model with an oracle")
+    args = p.parse_args()
+
+    existing = get_existing_models(args.cbmc_root)
+    declared = get_declared_builtins(args.cbmc_root)
+    # The base MODELS plus the derived wider-vector and merge-masked variants
+    # (each gated on its builtin being declared) form the full emittable set.
+    all_models = {**MODELS, **width_variants(declared),
+                  **mask_variants(declared)}
+
+    if args.emit_tests:
+        n = emit_tests(args.emit_tests, all_models)
+        sys.stderr.write(f"Wrote {n} equivalence test(s) under "
+                         f"{args.emit_tests}\n")
+        return
+
+    if args.emit_drafts:
+        drafts, mismatches = xml_emit_drafts(
+            args.emit_drafts, declared, existing, all_models)
+        sys.stderr.write(
+            f"Translator self-check: {len(mismatches)} geometry mismatch(es) "
+            f"against hand-written models.\n")
+        for builtin, got, want in mismatches:
+            sys.stderr.write(f"  MISMATCH {builtin}: derived {got} vs {want}\n")
+        print(f"# {len(drafts)} draft model(s) from element-wise pseudocode.")
+        print("# Review each: infer signedness, and harden against signed UB")
+        print("# (unsigned wrapping arithmetic, modular negation) before use.")
+        for builtin in sorted(drafts):
+            iname, elem, count, body = drafts[builtin]
+            print(f'    "{iname}": Model("{builtin}", "{elem}", {count}, '
+                  f'"{body}"),')
+        return
+
+    if args.status:
+        print(f"Declared __builtin_ia32_* in CBMC headers: {len(declared)}")
+        print(f"Already modeled in library: {len(existing)}")
+        print(f"Missing models: {len(declared - existing)}")
+        can = [(iname, m.builtin) for iname, m in all_models.items()
+               if m.builtin in declared and m.builtin not in existing]
+        print(f"\nCan auto-generate from MODELS ({len(can)}):")
+        for iname, bname in sorted(can, key=lambda x: x[1]):
+            print(f"  {bname}  ({iname})")
+        not_yet = declared - existing - {m.builtin for m in all_models.values()}
+        print(f"\nNot yet covered by this tool: {len(not_yet)}")
+        if args.xml:
+            candidates, total = xml_autogen_candidates(
+                args.xml, declared, existing)
+            print(f"\nIntel intrinsics with auto-generatable pseudocode: "
+                  f"{total}")
+            print(f"... mapping to a not-yet-modeled CBMC builtin "
+                  f"({len(candidates)}):")
+            for iname, bname in candidates:
+                print(f"  {bname}  ({iname})")
+            rows = xml_cpuid_coverage(args.xml, declared, existing)
+            print("\nCoverage by CPUID feature (modeled / mappable-declared):")
+            for feat, n_modeled, n_declared in rows:
+                print(f"  {feat:20s} {n_modeled:5d} / {n_declared}")
+        return
+
+    # Emit a model unless that builtin is already modeled in another library
+    # file (the owned GENERATED_LIBRARY is excluded so regeneration is
+    # idempotent rather than emitting nothing).
+    external = get_existing_models(
+        args.cbmc_root, exclude=os.path.join(args.cbmc_root, GENERATED_LIBRARY))
+    models = []
+    for intel_name, model in sorted(all_models.items(),
+                                    key=lambda x: x[1].builtin):
+        if model.builtin in external:
+            continue
+        if model.builtin not in declared:
+            print(f"Skip {model.builtin}: not declared in CBMC headers",
+                  file=sys.stderr)
+            continue
+        emitted = (emit_masked_model(model) if model.mask_type
+                   else emit_model(model))
+        if emitted:
+            models.append(emitted)
+
+    header = (
+        "// x86 SIMD intrinsic models for CBMC\n"
+        "// Generated by scripts/generate_intrinsic_models.py\n"
+        f"// Models: {len(models)}\n\n"
+    )
+    output = header + "\n".join(models)
+    output = format_output(
+        output, os.path.join(args.cbmc_root, GENERATED_LIBRARY))
+
+    if args.output:
+        with open(args.output, "w") as f:
+            f.write(output)
+        print(f"Generated {len(models)} models -> {args.output}",
+              file=sys.stderr)
+    else:
+        print(output)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/generate_neon_models.py b/scripts/generate_neon_models.py
new file mode 100644
index 00000000000..9071d95029e
--- /dev/null
+++ b/scripts/generate_neon_models.py
@@ -0,0 +1,442 @@
+#!/usr/bin/env python3
+#
+# Draft generator for ARM/AArch64 NEON CBMC library models.
+#
+# Two-source design:
+#
+#  * Structure comes from Clang's arm_neon.td: which builtins exist, the
+#    element types each supports, and -- since Clang's NEON builtins are
+#    polymorphic -- the NeonTypeFlags type code that selects the lane type at
+#    each call site.  An intrinsic defined with an *OpInst class is open-coded
+#    by <arm_neon.h> into native C operators, so it needs no model and is
+#    skipped here; only the opaque SInst/IInst/... builtins are modelled.
+#
+#  * Semantics come from OP_TABLE below.  arm_neon.td carries no semantics
+#    for the opaque builtins (the Operation field is OP_NONE), so the per-lane
+#    computation is supplied here.  For the mechanically-translatable ops
+#    (min/max/absolute-difference, halving and saturating add/subtract, ...)
+#    the body is obvious from the operation and is encoded directly.  Other
+#    non-trivial ops (narrowing, floating-point estimate, table, crypto, ...)
+#    need real pseudocode -- ultimately from ARM's machine-readable spec --
+#    and are reported as unmodelled rather than guessed at.
+#
+# The emitted models match the declarations in gcc_builtin_headers_aarch64.h:
+# every operand is the byte-representative lane type (__gcc_v16qi for 128-bit,
+# __gcc_v8qi for 64-bit) plus an int type code, exactly as <arm_neon.h> calls
+# them.
+
+import argparse
+import re
+import shutil
+import subprocess
+import sys
+
+# NeonTypeFlags element-type enum (clang/Basic/TargetBuiltins.h) for the
+# base types we model, plus the lane bit width.  The full integer type code is
+#   EltType | (unsigned ? 0x10 : 0) | (quad ? 0x20 : 0)
+INT_BASE = {
+        'c': ('Int8', 0, 8),
+        's': ('Int16', 1, 16),
+        'i': ('Int32', 2, 32),
+        'l': ('Int64', 3, 64),
+        }
+UNSIGNED_FLAG = 0x10
+QUAD_FLAG = 0x20
+
+# gcc vector typedef stem for a lane width (see gcc_builtin_headers_types).
+STEM = {8: 'qi', 16: 'hi', 32: 'si', 64: 'di'}
+# scalar C type for a lane.
+SCALAR = {
+        (8, False): 'signed char', (8, True): 'unsigned char',
+        (16, False): 'short', (16, True): 'unsigned short',
+        (32, False): 'int', (32, True): 'unsigned int',
+        (64, False): 'long long', (64, True): 'unsigned long long',
+        }
+# next wider *signed* type, used to compute a signed difference without
+# overflow.
+WIDER = {8: 'int', 16: 'int', 32: 'long long', 64: '__int128'}
+
+
+def sat_bounds(signed, width):
+    """Return (lo, hi) C integer literals for a lane's saturation range."""
+    if signed:
+        hi = 2 ** (width - 1) - 1
+        if width == 64:
+            return '(-{}LL - 1)'.format(hi), '{}LL'.format(hi)
+        return str(-2 ** (width - 1)), str(hi)
+    hi = 2 ** width - 1
+    if width == 64:
+        return '0', '{}ULL'.format(hi)
+    return '0', str(hi)
+
+
+def lane_body(op, signed, width):
+    """Return the loop body computing r[i] from x[i], y[i] for one lane, for an
+    element-wise (non-reshaping) operation.  Signed arithmetic is widened to
+    avoid signed-overflow undefined behaviour."""
+    wide = WIDER[width]
+    if op == 'vmax':
+        return 'r[i] = x[i] > y[i] ? x[i] : y[i];'
+    if op == 'vmin':
+        return 'r[i] = x[i] < y[i] ? x[i] : y[i];'
+    if op == 'vabd':
+        if signed:
+            return ('{{ {w} d = ({w})x[i] - ({w})y[i]; '
+                    'r[i] = d < 0 ? -d : d; }}').format(w=wide)
+        return 'r[i] = x[i] > y[i] ? x[i] - y[i] : y[i] - x[i];'
+    if op == 'vhadd':  # halving add: floor((a + b) / 2)
+        return 'r[i] = (({w})x[i] + ({w})y[i]) >> 1;'.format(w=wide)
+    if op == 'vhsub':  # halving subtract
+        return 'r[i] = (({w})x[i] - ({w})y[i]) >> 1;'.format(w=wide)
+    if op == 'vrhadd':  # rounding halving add: floor((a + b + 1) / 2)
+        return 'r[i] = (({w})x[i] + ({w})y[i] + 1) >> 1;'.format(w=wide)
+    if op == 'vqadd':  # saturating add
+        lo, hi = sat_bounds(signed, width)
+        if width == 64 and signed:
+            # avoid __int128 (rejected by -pedantic): detect overflow on the
+            # wrapped sum instead of widening.
+            return (
+                '{{ long long s = (long long)('
+                '(unsigned long long)x[i] + (unsigned long long)y[i]); '
+                'r[i] = ((x[i] ^ s) & (y[i] ^ s)) < 0 '
+                '? (x[i] < 0 ? {lo} : {hi}) : s; }}').format(lo=lo, hi=hi)
+        if width == 64:
+            return ('{{ unsigned long long s = x[i] + y[i]; '
+                    'r[i] = s < x[i] ? {hi} : s; }}').format(hi=hi)
+        if signed:
+            return ('{{ {w} s = ({w})x[i] + ({w})y[i]; '
+                    'r[i] = s < {lo} ? {lo} : (s > {hi} ? {hi} : s); }}'
+                    ).format(w=wide, lo=lo, hi=hi)
+        return ('{{ {w} s = ({w})x[i] + ({w})y[i]; '
+                'r[i] = s > {hi} ? {hi} : s; }}').format(w=wide, hi=hi)
+    if op == 'vqsub':  # saturating subtract
+        lo, hi = sat_bounds(signed, width)
+        if not signed:
+            return 'r[i] = x[i] > y[i] ? x[i] - y[i] : 0;'
+        if width == 64:
+            return ('{{ long long d = (long long)((unsigned long long)x[i] '
+                    '- (unsigned long long)y[i]); '
+                    'r[i] = ((x[i] ^ y[i]) & (x[i] ^ d)) < 0 '
+                    '? (x[i] < 0 ? {lo} : {hi}) : d; }}').format(lo=lo, hi=hi)
+        return ('{{ {w} s = ({w})x[i] - ({w})y[i]; '
+                'r[i] = s < {lo} ? {lo} : (s > {hi} ? {hi} : s); }}'
+                ).format(w=wide, lo=lo, hi=hi)
+    if op == 'vtst':  # test bits: all-ones per lane where (a & b) != 0
+        return 'r[i] = (x[i] & y[i]) != 0 ? -1 : 0;'
+    raise KeyError(op)
+
+
+def pair_reduce(op, signed, width, p, q):
+    """Return an expression combining two adjacent lanes p, q for a pairwise
+    (reshaping) operation."""
+    if op == 'vpmax':
+        return '{p} > {q} ? {p} : {q}'.format(p=p, q=q)
+    if op == 'vpmin':
+        return '{p} < {q} ? {p} : {q}'.format(p=p, q=q)
+    if op == 'vpadd':  # modular add; compute unsigned to avoid overflow UB
+        u = SCALAR[(width, True)]
+        return '({u}){p} + ({u}){q}'.format(u=u, p=p, q=q)
+    raise KeyError(op)
+
+
+# Element-wise opaque builtins modelled directly (r[i] from x[i] and y[i]).
+OP_TABLE = {'vabd', 'vmax', 'vmin', 'vqadd', 'vqsub', 'vhadd', 'vhsub',
+            'vrhadd', 'vtst'}
+# Pairwise opaque builtins (reduce adjacent lane pairs, concatenating a, b).
+PAIRWISE = {'vpadd', 'vpmax', 'vpmin'}
+# Bitwise-select: r = (mask & a) | (~mask & b); bit-level, so type-independent.
+BITSELECT = {'vbsl'}
+MODELLED = OP_TABLE | PAIRWISE | BITSELECT
+
+# AArch64 instruction mnemonic (from ACLE advsimd.md) -> operation kind.  The
+# instruction mnemonic is the authoritative semantic identity of an intrinsic;
+# this compact table is the hand-written "semantics" source (see
+# doc/neon-intrinsic-models.md).  Extend it (and MODELLED / lane_body /
+# pair_reduce) to cover more instruction families.
+INSTR_TABLE = {
+        'SABD': 'vabd', 'UABD': 'vabd',
+        'SMAX': 'vmax', 'UMAX': 'vmax',
+        'SMIN': 'vmin', 'UMIN': 'vmin',
+        'SQADD': 'vqadd', 'UQADD': 'vqadd',
+        'SQSUB': 'vqsub', 'UQSUB': 'vqsub',
+        'SHADD': 'vhadd', 'UHADD': 'vhadd',
+        'SHSUB': 'vhsub', 'UHSUB': 'vhsub',
+        'SRHADD': 'vrhadd', 'URHADD': 'vrhadd',
+        'ADDP': 'vpadd',
+        'SMAXP': 'vpmax', 'UMAXP': 'vpmax',
+        'SMINP': 'vpmin', 'UMINP': 'vpmin',
+        'CMTST': 'vtst',
+        'BSL': 'vbsl',
+        }
+
+
+def typed_intrinsic(base, width, unsigned, quad):
+    """Reconstruct the ACLE typed-intrinsic name, e.g. ('vabd', 8, False, True)
+    -> 'vabdq_s8'."""
+    suffix = ('u' if unsigned else 's') + str(width)
+    return base + ('q' if quad else '') + '_' + suffix
+
+
+ACLE_NAME_RE = re.compile(r'intrinsics/(\w+)"')
+ACLE_MNEM_RE = re.compile(r'`([A-Z][A-Z0-9]+)\b')
+
+
+def parse_acle(md_text):
+    """Parse ARM's ACLE neon_intrinsics/advsimd.md into {intrinsic: mnemonic}.
+    Each intrinsic is a markdown table row carrying a link to its guide page
+    and the AArch64 instruction in backticks."""
+    mapping = {}
+    for line in md_text.splitlines():
+        if '<code>' not in line:
+            continue
+        nm = ACLE_NAME_RE.search(line)
+        if not nm:
+            continue
+        mn = ACLE_MNEM_RE.search(line)
+        mapping[nm.group(1)] = mn.group(1) if mn else None
+    return mapping
+
+
+def parse_typespec(typespec):
+    """Yield (base_char, unsigned, quad, other) for each type in a typespec,
+    e.g. 'csiUcQUs' -> Int8, Int16, Int32, uInt8, quad-uInt16.  'other' is set
+    when a modifier we do not model is present (S scalar, P poly, ...), so the
+    caller can skip those variants -- they belong to different builtins."""
+    i = 0
+    while i < len(typespec):
+        unsigned = quad = other = False
+        while typespec[i].isupper():
+            if typespec[i] == 'U':
+                unsigned = True
+            elif typespec[i] == 'Q':
+                quad = True
+            else:
+                other = True
+            i += 1
+        yield typespec[i], unsigned, quad, other
+        i += 1
+
+
+INST_RE = re.compile(
+        r'def\s+\w+\s*:\s*([A-Za-z]*Inst)<\s*"([^"]+)"\s*,\s*"[^"]*"\s*,'
+        r'\s*"([^"]+)"')
+
+
+def collect(td_text):
+    """Return {builtin_name: [(code, width, unsigned), ...]} for the modelled
+    ops, plus a sorted list of intrinsic names skipped for want of
+    semantics."""
+    builtins = {}
+    skipped = set()
+    for m in INST_RE.finditer(td_text):
+        cls, name, typespec = m.group(1), m.group(2), m.group(3)
+        if cls.endswith('OpInst'):
+            continue  # open-coded -> native operators, no model needed
+        if name not in MODELLED:
+            skipped.add(name)
+            continue
+        for base, unsigned, quad, other in parse_typespec(typespec):
+            if other or base not in INT_BASE:
+                continue  # scalar/poly/float: not a plain integer vector
+            _, elt_enum, width = INT_BASE[base]
+            code = elt_enum | (UNSIGNED_FLAG if unsigned else 0) | \
+                (QUAD_FLAG if quad else 0)
+            builtin = '__builtin_neon_' + name + ('q' if quad else '') + '_v'
+            # de-duplicate by type code: several .td records may map to the
+            # same polymorphic builtin (e.g. scalar variants), and a switch
+            # cannot repeat a case label.
+            builtins.setdefault(builtin, {})[code] = (width, unsigned)
+    models = {b: [(c, w, u) for c, (w, u) in sorted(d.items())]
+              for b, d in builtins.items()}
+    return models, sorted(skipped)
+
+
+def emit_model(builtin, cases, acle=None):
+    """Emit one /* FUNCTION */ block.  cases is a list of (code, width,
+    unsigned); all share the same total width (64- or 128-bit).  If an ACLE
+    {intrinsic: mnemonic} map is given, annotate the model with the
+    authoritative instruction mnemonic(s) for provenance."""
+    quad = builtin.endswith('q_v')
+    op = builtin[len('__builtin_neon_'):-len('q_v' if quad else '_v')]
+    total_bytes = 16 if quad else 8
+    rep = '__gcc_v{}qi'.format(total_bytes)
+
+    mnemonics = []
+    if acle is not None:
+        for _, width, unsigned in sorted(cases):
+            mn = acle.get(typed_intrinsic(op, width, unsigned, quad))
+            if mn and mn not in mnemonics:
+                mnemonics.append(mn)
+
+    if op in BITSELECT:
+        # Bitwise select operates on the raw bits, so it is independent of the
+        # lane type code: r = (mask & a) | (~mask & b).
+        out = ['/* FUNCTION: {} */'.format(builtin), '']
+        if mnemonics:
+            out.append(
+                '// Arm instruction(s): {} (per ACLE advsimd.md)'.format(
+                    ', '.join(mnemonics)))
+            out.append('')
+        out.append(
+            'typedef char {} __attribute__((__vector_size__({})));'.format(
+                rep, total_bytes))
+        out.append('')
+        out.append('{rep} {b}({rep} mask, {rep} a, {rep} b, int type)'.format(
+            rep=rep, b=builtin))
+        out.append('{')
+        out.append('  (void)type;')
+        out.append('  return (mask & a) | (~mask & b);')
+        out.append('}')
+        return '\n'.join(out)
+
+    # Collect the lane typedefs we need.
+    typedefs = ['typedef char {} __attribute__((__vector_size__({})));'.format(
+        rep, total_bytes)]
+    seen = {rep}
+    body_cases = []
+    for code, width, unsigned in sorted(cases):
+        lanes = total_bytes * 8 // width
+        suffix = 'u' if unsigned else 's'
+        lane_t = '__gcc_v{}{}_{}'.format(lanes, STEM[width], suffix)
+        if lane_t not in seen:
+            typedefs.append(
+                'typedef {} {} __attribute__((__vector_size__({})));'.format(
+                    SCALAR[(width, unsigned)], lane_t, total_bytes))
+            seen.add(lane_t)
+        if op in PAIRWISE:
+            rx = pair_reduce(op, not unsigned, width,
+                             'x[2 * i]', 'x[2 * i + 1]')
+            ry = pair_reduce(op, not unsigned, width,
+                             'y[2 * i]', 'y[2 * i + 1]')
+            body_cases.append(
+                '  case {code}:\n'
+                '  {{\n'
+                '    {t} x = ({t})a, y = ({t})b, r;\n'
+                '    int h = {n} / 2;\n'
+                '    for(int i = 0; i < h; i++)\n'
+                '      r[i] = {rx};\n'
+                '    for(int i = 0; i < h; i++)\n'
+                '      r[h + i] = {ry};\n'
+                '    return ({rep})r;\n'
+                '  }}'.format(code=code, t=lane_t, n=lanes, rx=rx, ry=ry,
+                             rep=rep))
+        else:
+            body = lane_body(op, not unsigned, width)
+            body_cases.append(
+                '  case {}:\n'
+                '  {{\n'
+                '    {t} x = ({t})a, y = ({t})b, r;\n'
+                '    for(int i = 0; i < {n}; i++)\n'
+                '      {body}\n'
+                '    return ({rep})r;\n'
+                '  }}'.format(code, t=lane_t, n=lanes, body=body, rep=rep))
+
+    out = ['/* FUNCTION: {} */'.format(builtin), '']
+    if mnemonics:
+        out.append(
+            '// Arm instruction(s): {} (per ACLE advsimd.md)'.format(
+                ', '.join(mnemonics)))
+        out.append('')
+    out += typedefs
+    out.append('')
+    out.append(
+        '{rep} {b}({rep} a, {rep} b, int type)'.format(rep=rep, b=builtin))
+    out.append('{')
+    out.append('  switch(type)')
+    out.append('  {')
+    out += body_cases
+    out.append('  }')
+    out.append('')
+    out.append('  {} r = {{0}};'.format(rep))
+    out.append('  return r;')
+    out.append('}')
+    return '\n'.join(out)
+
+
+def audit(td_text, acle):
+    """Report, over the opaque (model-needing) builtins, how the ACLE
+    instruction mnemonics distribute and how far INSTR_TABLE covers them -- the
+    modeling roadmap."""
+    import collections
+    covered = collections.Counter()
+    todo = collections.Counter()
+    for m in INST_RE.finditer(td_text):
+        cls, name, typespec = m.group(1), m.group(2), m.group(3)
+        if cls.endswith('OpInst'):
+            continue
+        for base, unsigned, quad, other in parse_typespec(typespec):
+            if other or base not in INT_BASE:
+                continue
+            _, _, width = INT_BASE[base]
+            mn = acle.get(typed_intrinsic(name, width, unsigned, quad))
+            if mn is None:
+                continue
+            (covered if mn in INSTR_TABLE else todo)[mn] += 1
+    sys.stderr.write(
+        'ACLE audit: {} integer opaque-builtin lane-variants map to mnemonics '
+        'INSTR_TABLE covers; {} do not yet.\n'.format(
+            sum(covered.values()), sum(todo.values())))
+    sys.stderr.write('  covered mnemonics: {}\n'.format(
+        ', '.join('{}={}'.format(k, v) for k, v in covered.most_common())))
+    sys.stderr.write('  top uncovered (modeling roadmap): {}\n'.format(
+        ', '.join('{}={}'.format(k, v) for k, v in todo.most_common(15))))
+
+
+def format_output(text):
+    """Run the generated C through clang-format so it matches the project style
+    (and the CI clang-format check), keeping regeneration idempotent. A no-op
+    on already-clean output; if clang-format is unavailable or exits with an
+    error, the text is returned unchanged."""
+    for clang_format in ('clang-format-15', 'clang-format'):
+        if shutil.which(clang_format):
+            result = subprocess.run(
+                [clang_format, '--assume-filename', 'arm_neon.c'],
+                input=text, capture_output=True, text=True)
+            if result.returncode == 0:
+                return result.stdout
+            break
+    sys.stderr.write(
+        'warning: clang-format missing or failed; output left unformatted\n')
+    return text
+
+
+def main():
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument('arm_neon_td', help='path to clang arm_neon.td')
+    parser.add_argument(
+        '--acle', metavar='ADVSIMD_MD',
+        help='path to ARM ACLE neon_intrinsics/advsimd.md; annotates models '
+             'with their instruction mnemonic and prints a coverage audit')
+    parser.add_argument(
+        '-o', '--output', help='output .c file (default: stdout)')
+    args = parser.parse_args()
+
+    with open(args.arm_neon_td) as f:
+        td_text = f.read()
+    builtins, skipped = collect(td_text)
+
+    acle = None
+    if args.acle:
+        with open(args.acle) as f:
+            acle = parse_acle(f.read())
+
+    blocks = [emit_model(b, cases, acle)
+              for b, cases in sorted(builtins.items())]
+    text = format_output('\n\n'.join(blocks) + '\n')
+
+    if args.output:
+        with open(args.output, 'w') as f:
+            f.write(text)
+    else:
+        sys.stdout.write(text)
+
+    sys.stderr.write(
+        'generated {} model(s) for {} op(s); {} other opaque intrinsic(s) '
+        'need ARM-sourced semantics\n'.format(
+            len(builtins), len(MODELLED), len(skipped)))
+    if acle is not None:
+        audit(td_text, acle)
+
+
+if __name__ == '__main__':
+    main()
diff --git a/scripts/generate_simd_smoke_test.py b/scripts/generate_simd_smoke_test.py
new file mode 100644
index 00000000000..6417ecb0b04
--- /dev/null
+++ b/scripts/generate_simd_smoke_test.py
@@ -0,0 +1,136 @@
+#!/usr/bin/env python3
+#
+# Generate an aggregate "smoke test" for the SIMD intrinsic library models.
+#
+# Many generated x86 (__builtin_ia32_*) and ARM NEON (__builtin_neon_*) models
+# are not worth an individual cbmc-library equivalence test, but every model
+# should still be exercised so that it type-checks, links and symexes without
+# error.  This script parses a generated library file (x86_intrinsics.c or
+# arm_neon.c) and emits a single C file that calls every modelled builtin once
+# with zero-initialised arguments.  The result is placed under
+# regression/cbmc/SIMD*; library_check.sh treats the builtins it references as
+# covered.
+#
+# The builtins are declared by the front-end (for the matching --arch), so the
+# test only needs to reproduce the vector typedefs and call each function with
+# arguments of the right type.  A constant is passed for the trailing NeonType
+# code (the first switch case) or an x86 shift immediate.
+
+import argparse
+import re
+import sys
+
+# Split the library into /* FUNCTION: name */ blocks.
+BLOCK_RE = re.compile(r'/\* FUNCTION: (\S+) \*/(.*?)(?=/\* FUNCTION:|\Z)',
+                      re.DOTALL)
+TYPEDEF_RE = re.compile(r'^typedef .*?;$', re.MULTILINE)
+TYPEDEF_NAME_RE = re.compile(r'\b(__gcc_\w+)\b\s*__attribute__')
+CASE_RE = re.compile(r'\bcase (\d+):')
+
+
+def parse_signature(name, block):
+    """Return (return_type, [param_types]) for the function definition of
+    `name` in `block`, or None if not found."""
+    m = re.search(
+        r'([A-Za-z_][\w ]*?\**)\s*\b' + re.escape(name) +
+        r'\s*\(([^;{]*?)\)\s*\{', block, re.DOTALL)
+    if not m:
+        return None
+    ret = ' '.join(m.group(1).split())
+    params = []
+    for p in m.group(2).split(','):
+        p = ' '.join(p.split())
+        if not p or p == 'void':
+            continue
+        # the parameter type is everything but the trailing identifier
+        params.append(p.rsplit(' ', 1)[0])
+    return ret, params
+
+
+def main():
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument('library', help='generated library .c (x86/neon)')
+    parser.add_argument(
+        '--exclude', action='append', default=[], metavar='DIR',
+        help='skip builtins already referenced by a .c under DIR (e.g. the '
+             'consolidated cbmc-library directory of individual tests)')
+    parser.add_argument('-o', '--output', help='output .c (default: stdout)')
+    args = parser.parse_args()
+
+    import glob
+    import os
+    excluded = set()
+    for d in args.exclude:
+        for cfile in glob.glob(os.path.join(d, '*.c')):
+            excluded.update(
+                re.findall(r'__builtin_(?:ia32|neon)_\w+', open(cfile).read()))
+
+    text = open(args.library).read()
+    typedefs = {}     # name -> full typedef line (de-duplicated)
+    calls = []
+    skipped = 0
+    for name, block in BLOCK_RE.findall(text):
+        if name in excluded:
+            continue
+        for td in TYPEDEF_RE.findall(block):
+            nm = TYPEDEF_NAME_RE.search(td)
+            if nm:
+                typedefs[nm.group(1)] = td
+        sig = parse_signature(name, block)
+        if sig is None:
+            skipped += 1
+            continue
+        ret, params = sig
+        case = CASE_RE.search(block)
+        type_code = case.group(1) if case else '1'
+        args_src = []
+        decls = []
+        for i, ptype in enumerate(params):
+            if ptype == 'int':
+                # NeonType code (first switch case) or x86 shift immediate
+                args_src.append(type_code)
+            else:
+                # zero-initialise the argument: this is a smoke test (every
+                # model must type-check, link and symex), so constant inputs
+                # that CBMC can fold keep it fast; the per-function equivalence
+                # tests cover behaviour with nondeterministic inputs.
+                decls.append('    {} a{} = {{0}};'.format(ptype, i))
+                args_src.append('a{}'.format(i))
+        calls.append(
+            '  {{\n'
+            '{decls}\n'
+            '    volatile {ret} r = {name}({args});\n'
+            '    (void)r;\n'
+            '  }}'.format(
+                decls='\n'.join(decls), ret=ret, name=name,
+                args=', '.join(args_src)))
+
+    out = []
+    out.append('// Auto-generated by scripts/generate_simd_smoke_test.py')
+    out.append('// Exercises every modelled SIMD builtin once so the library '
+               'models are')
+    out.append('// type-checked, linked and symex\'d. See '
+               'doc/neon-intrinsic-models.md.')
+    out.append('')
+    for nm in sorted(typedefs):
+        out.append(typedefs[nm])
+    out.append('')
+    out.append('int main(void)')
+    out.append('{')
+    out.extend(calls)
+    out.append('  __CPROVER_assert(1, "SIMD model smoke test");')
+    out.append('  return 0;')
+    out.append('}')
+    result = '\n'.join(out) + '\n'
+
+    if args.output:
+        with open(args.output, 'w') as f:
+            f.write(result)
+    else:
+        sys.stdout.write(result)
+    sys.stderr.write('emitted {} call(s); skipped {} (no parseable '
+                     'signature)\n'.format(len(calls), skipped))
+
+
+if __name__ == '__main__':
+    main()
diff --git a/src/ansi-c/CMakeLists.txt b/src/ansi-c/CMakeLists.txt
index 9934bb9d455..1c58b64757e 100644
--- a/src/ansi-c/CMakeLists.txt
+++ b/src/ansi-c/CMakeLists.txt
@@ -65,6 +65,7 @@ make_inc(compiler_headers/clang_builtin_headers)
 make_inc(compiler_headers/cw_builtin_headers)
 make_inc(compiler_headers/gcc_builtin_headers_alpha)
 make_inc(compiler_headers/gcc_builtin_headers_arm)
+make_inc(compiler_headers/gcc_builtin_headers_aarch64)
 make_inc(compiler_headers/gcc_builtin_headers_generic)
 make_inc(compiler_headers/gcc_builtin_headers_ia32)
 make_inc(compiler_headers/gcc_builtin_headers_ia32-2)
@@ -92,6 +93,7 @@ set(extra_dependencies
     ${CMAKE_CURRENT_BINARY_DIR}/compiler_headers/cw_builtin_headers.inc
     ${CMAKE_CURRENT_BINARY_DIR}/compiler_headers/gcc_builtin_headers_alpha.inc
     ${CMAKE_CURRENT_BINARY_DIR}/compiler_headers/gcc_builtin_headers_arm.inc
+    ${CMAKE_CURRENT_BINARY_DIR}/compiler_headers/gcc_builtin_headers_aarch64.inc
     ${CMAKE_CURRENT_BINARY_DIR}/compiler_headers/gcc_builtin_headers_generic.inc
     ${CMAKE_CURRENT_BINARY_DIR}/compiler_headers/gcc_builtin_headers_ia32-2.inc
     ${CMAKE_CURRENT_BINARY_DIR}/compiler_headers/gcc_builtin_headers_ia32-3.inc
diff --git a/src/ansi-c/Makefile b/src/ansi-c/Makefile
index cdf202904df..5bd86c56027 100644
--- a/src/ansi-c/Makefile
+++ b/src/ansi-c/Makefile
@@ -71,6 +71,7 @@ BUILTIN_FILES = \
   compiler_headers/cw_builtin_headers.inc \
   compiler_headers/gcc_builtin_headers_alpha.inc \
   compiler_headers/gcc_builtin_headers_arm.inc \
+  compiler_headers/gcc_builtin_headers_aarch64.inc \
   compiler_headers/gcc_builtin_headers_generic.inc \
   compiler_headers/gcc_builtin_headers_ia32-2.inc \
   compiler_headers/gcc_builtin_headers_ia32-3.inc \
diff --git a/src/ansi-c/ansi_c_convert_type.cpp b/src/ansi-c/ansi_c_convert_type.cpp
index c7ed6453c90..321c3724e3e 100644
--- a/src/ansi-c/ansi_c_convert_type.cpp
+++ b/src/ansi-c/ansi_c_convert_type.cpp
@@ -196,6 +196,8 @@ void ansi_c_convert_typet::read_rec(const typet &type)
   {
     // note that this is not yet a vector_typet -- this is a size only
     vector_size = static_cast<const constant_exprt &>(type.find(ID_size));
+    // neon_vector_type gives the size as a lane count rather than in bytes
+    vector_lanes = type.get_bool(ID_C_vector_lanes);
   }
   else if(type.id()==ID_void)
   {
@@ -659,6 +661,8 @@ void ansi_c_convert_typet::build_type_with_subtype(typet &type) const
   {
     type_with_subtypet new_type(ID_frontend_vector, type);
     new_type.set(ID_size, vector_size);
+    if(vector_lanes)
+      new_type.set(ID_C_vector_lanes, true);
     new_type.add_source_location()=vector_size.source_location();
     type=new_type;
   }
diff --git a/src/ansi-c/ansi_c_convert_type.h b/src/ansi-c/ansi_c_convert_type.h
index 043198d8fb5..a79d494c653 100644
--- a/src/ansi-c/ansi_c_convert_type.h
+++ b/src/ansi-c/ansi_c_convert_type.h
@@ -40,7 +40,7 @@ class ansi_c_convert_typet
 
   typet gcc_attribute_mode;
 
-  bool packed, aligned;
+  bool packed, aligned, vector_lanes;
   exprt vector_size, alignment, bv_width, fraction_width;
   exprt msc_based; // this is Visual Studio
   bool constructor, destructor;
@@ -106,6 +106,7 @@ class ansi_c_convert_typet
       gcc_attribute_mode(static_cast<const typet &>(get_nil_irep())),
       packed(false),
       aligned(false),
+      vector_lanes(false),
       vector_size(nil_exprt{}),
       alignment(nil_exprt{}),
       bv_width(nil_exprt{}),
diff --git a/src/ansi-c/ansi_c_internal_additions.cpp b/src/ansi-c/ansi_c_internal_additions.cpp
index 14d7cf52a9c..c32a362636e 100644
--- a/src/ansi-c/ansi_c_internal_additions.cpp
+++ b/src/ansi-c/ansi_c_internal_additions.cpp
@@ -89,6 +89,11 @@ const char gcc_builtin_headers_arm[] = "#line 1 \"gcc_builtin_headers_arm.h\"\n"
 #include "compiler_headers/gcc_builtin_headers_arm.inc" // IWYU pragma: keep
   ; // NOLINT(whitespace/semicolon)
 
+const char gcc_builtin_headers_aarch64[] =
+  "#line 1 \"gcc_builtin_headers_aarch64.h\"\n"
+#include "compiler_headers/gcc_builtin_headers_aarch64.inc" // IWYU pragma: keep
+  ; // NOLINT(whitespace/semicolon)
+
 const char gcc_builtin_headers_mips[] =
   "#line 1 \"gcc_builtin_headers_mips.h\"\n"
 #include "compiler_headers/gcc_builtin_headers_mips.inc" // IWYU pragma: keep
diff --git a/src/ansi-c/ansi_c_internal_additions.h b/src/ansi-c/ansi_c_internal_additions.h
index 4f55798a36c..c21fcf80c19 100644
--- a/src/ansi-c/ansi_c_internal_additions.h
+++ b/src/ansi-c/ansi_c_internal_additions.h
@@ -35,6 +35,7 @@ extern const char gcc_builtin_headers_ia32_8[];
 extern const char gcc_builtin_headers_ia32_9[];
 extern const char gcc_builtin_headers_alpha[];
 extern const char gcc_builtin_headers_arm[];
+extern const char gcc_builtin_headers_aarch64[];
 extern const char gcc_builtin_headers_mips[];
 extern const char gcc_builtin_headers_power[];
 extern const char arm_builtin_headers[];
diff --git a/src/ansi-c/builtin_factory.cpp b/src/ansi-c/builtin_factory.cpp
index 172cb96b05a..f7a6d6b76f4 100644
--- a/src/ansi-c/builtin_factory.cpp
+++ b/src/ansi-c/builtin_factory.cpp
@@ -206,6 +206,9 @@ bool builtin_factory(
     {
       if(find_pattern(pattern, gcc_builtin_headers_arm, s))
         return convert(identifier, s, symbol_table, mh);
+
+      if(find_pattern(pattern, gcc_builtin_headers_aarch64, s))
+        return convert(identifier, s, symbol_table, mh);
     }
     else if(config.ansi_c.arch=="mips64el" ||
             config.ansi_c.arch=="mipsn32el" ||
diff --git a/src/ansi-c/c_typecheck_type.cpp b/src/ansi-c/c_typecheck_type.cpp
index 2362e3966f6..7b26e7d3390 100644
--- a/src/ansi-c/c_typecheck_type.cpp
+++ b/src/ansi-c/c_typecheck_type.cpp
@@ -708,6 +708,10 @@ void c_typecheck_baset::typecheck_vector_type(typet &type)
   exprt size = static_cast<const exprt &>(type.find(ID_size));
   const source_locationt source_location = size.find_source_location();
 
+  // neon_vector_type gives the size as a lane count, whereas vector_size (and
+  // hence the default below) gives it in bytes.
+  const bool size_is_lane_count = type.get_bool(ID_C_vector_lanes);
+
   typecheck_expr(size);
 
   typet subtype = to_type_with_subtype(type).subtype();
@@ -770,14 +774,17 @@ void c_typecheck_baset::typecheck_vector_type(typet &type)
   }
 
   // adjust by width of base type
-  if(s % *sub_size != 0)
+  if(!size_is_lane_count)
   {
-    throw errort().with_location(source_location)
-      << "vector size (" << s << ") expected to be multiple of base type size ("
-      << *sub_size << ")";
-  }
+    if(s % *sub_size != 0)
+    {
+      throw errort().with_location(source_location)
+        << "vector size (" << s
+        << ") expected to be multiple of base type size (" << *sub_size << ")";
+    }
 
-  s /= *sub_size;
+    s /= *sub_size;
+  }
 
   // produce the type with ID_vector
   vector_typet new_type(
diff --git a/src/ansi-c/compiler_headers/clang_builtins.py b/src/ansi-c/compiler_headers/clang_builtins.py
index 357f60778a5..397f98a0c29 100755
--- a/src/ansi-c/compiler_headers/clang_builtins.py
+++ b/src/ansi-c/compiler_headers/clang_builtins.py
@@ -2,56 +2,45 @@
 #
 # Download Clang builtin declarations from the llvm-project git repository and
 # parse them to generate declarations to use from within our C front-end.
+#
+# Two input formats are supported:
+#
+# 1. The TableGen ".td" builtin databases (the default).  As of LLVM 20 the
+#    per-target databases were migrated from the X-macro ".def" files to
+#    TableGen, where a builtin is a record such as
+#
+#        def paddd128 : X86Builtin<"_Vector<4, int>(_Vector<4, int>, "
+#                                  "_Vector<4, int>)">;
+#
+#    The record class determines the name prefix (X86Builtin adds
+#    "__builtin_ia32_") and the prototype is an almost-C signature whose only
+#    special constructs are "_Vector<N, ElementType>" and "_Constant Type".
+#    These files are fetched straight from the llvm-project repository, so no
+#    LLVM build is required.
+#
+# 2. ".inc" files produced by clang-tblgen (--inc PREFIX:PATH).  Targets such
+#    as ARM NEON do not spell their builtins in directly-parseable TableGen;
+#    their builtin lists are generated by clang-tblgen, e.g.
+#
+#        clang-tblgen -gen-arm-neon-sema -I clang/include/clang/Basic \
+#            -I clang/include clang/include/clang/Basic/arm_neon.td \
+#            -o neon_sema.inc
+#
+#    The resulting "..._BUILTIN_INFOS" section lists every builtin with its
+#    name and the classic compact type encoding (e.g. "V8ScV8ScV8Sci").  This
+#    mode parses that section, prepending the supplied PREFIX (for NEON that is
+#    "__builtin_neon_") to each spelling.
+#
+# In both cases the resulting declarations are diffed against the declarations
+# already present in the gcc_builtin_headers_*.h files passed as arguments.
 
+import argparse
 import re
 import requests
 import sys
 
-
-prefix_map = {
-        'I': '',
-        'N': '',
-        'O': 'long long',
-        'S': 'signed',
-        'U': 'unsigned',
-        'W': 'int64_t',
-        'Z': 'int32_t'
-        }
-
-# we don't support:
-# G -> id (Objective-C)
-# H -> SEL (Objective-C)
-# M -> struct objc_super (Objective-C)
-# q -> Scalable vector, followed by the number of elements and base type
-# E -> ext_vector, followed by the number of elements and base type
-# A -> "reference" to __builtin_va_list
-typespec_map = {
-        'F': 'const CFString',
-        'J': 'jmp_buf',
-        'K': 'ucontext_t',
-        'P': 'FILE',
-        'Y': 'ptrdiff_t',
-        'a': '__builtin_va_list',
-        'b': '_Bool',
-        'c': 'char',
-        'd': 'double',
-        'f': 'float',
-        'h': '__fp16',
-        'i': 'int',
-        'p': 'pid_t',
-        's': 'short',
-        'v': 'void',
-        'w': 'wchar_t',
-        'x': '_Float16',
-        'y': '__bf16',
-        'z': '__CPROVER_size_t'
-        }
-
-# we don't support:
-# & -> reference (optionally followed by an address space number)
-modifier_map = {'C': 'const', 'D': 'volatile', 'R': 'restrict'}
-
-# declarations as found in ansi-c/gcc_builtin_headers_types.h
+# Map a (element type, lane count) of a vector type to the corresponding
+# typedef from ansi-c/gcc_builtin_headers_types.h.
 vector_map = {
         'char': {
             8: '__gcc_v8qi',
@@ -68,7 +57,6 @@
             16: '__gcc_v16hi',
             32: '__gcc_v32hi'
             },
-        # new
         'unsigned short': {
             8: '__gcc_v8uhi',
             16: '__gcc_v16uhi',
@@ -81,7 +69,6 @@
             16: '__gcc_v16si',
             256: '__gcc_v256si'
             },
-        # new
         'unsigned int': {
             4: '__gcc_v4usi',
             8: '__gcc_v8usi',
@@ -93,13 +80,11 @@
             4: '__gcc_v4di',
             8: '__gcc_v8di'
             },
-        # new
         'unsigned long long int': {
             2: '__gcc_v2udi',
             4: '__gcc_v4udi',
             8: '__gcc_v8udi',
             },
-        # new
         '_Float16': {
             8: '__gcc_v8hf',
             16: '__gcc_v16hf',
@@ -123,132 +108,466 @@
             }
         }
 
+# Element type spellings that name the same scalar as a vector_map key.
+element_aliases = {
+        'signed char': 'char',
+        'int32_t': 'int',
+        'unsigned int32_t': 'unsigned int',
+        'int64_t': 'long long int',
+        'unsigned int64_t': 'unsigned long long int',
+        '__fp16': '_Float16',
+        }
+
+# Map a TableGen builtin record class to the name prefix it implies (see the
+# "RequiredNamePrefix" fields in BuiltinsX86Base.td).  X86Builtin uses
+# "__builtin_ia32_", the *NoPrefix* and library variants spell names verbatim.
+class_prefix_map = {
+        'X86Builtin': '__builtin_ia32_',
+        'X86NoPrefixBuiltin': '',
+        'X86LibBuiltin': '',
+        }
+
+
+class UnmappableType(Exception):
+    """Raised for types we cannot express, e.g. vectors of pointers
+    (gather/scatter) or vector widths/elements absent from vector_map."""
 
-def parse_prefix(types, i):
+
+def vector_typedef(element, count):
+    element = element.strip()
+    element = element_aliases.get(element, element)
+    widths = vector_map.get(element)
+    if not widths or count not in widths:
+        raise UnmappableType(
+                'no typedef for vector of {} x {}'.format(count, element))
+    return widths[count]
+
+
+# --- TableGen ".td" parser (format 1) --------------------------------------
+
+def strip_comments(text):
+    """Remove // line comments while leaving string literals intact."""
+    out = []
+    i = 0
+    in_string = False
+    while i < len(text):
+        c = text[i]
+        if in_string:
+            out.append(c)
+            if c == '"':
+                in_string = False
+            i += 1
+        elif c == '"':
+            in_string = True
+            out.append(c)
+            i += 1
+        elif c == '/' and i + 1 < len(text) and text[i + 1] == '/':
+            while i < len(text) and text[i] != '\n':
+                i += 1
+        else:
+            out.append(c)
+            i += 1
+    return ''.join(out)
+
+
+def skip_ws(s, pos, end):
+    while pos < end and s[pos].isspace():
+        pos += 1
+    return pos
+
+
+def is_keyword(s, pos, word):
+    if not s.startswith(word, pos):
+        return False
+    after = pos + len(word)
+    return after >= len(s) or not (s[after].isalnum() or s[after] == '_')
+
+
+def match_delim(s, pos, open_ch, close_ch):
+    """Given s[pos] == open_ch, return the index just past the matching
+    close_ch, honouring nesting and string literals."""
+    assert s[pos] == open_ch
+    depth = 0
+    in_string = False
+    while pos < len(s):
+        c = s[pos]
+        if in_string:
+            if c == '"':
+                in_string = False
+        elif c == '"':
+            in_string = True
+        elif c == open_ch:
+            depth += 1
+        elif c == close_ch:
+            depth -= 1
+            if depth == 0:
+                return pos + 1
+        pos += 1
+    raise ValueError('unbalanced ' + open_ch)
+
+
+def find_in_keyword(s, pos, end):
+    """Find the standalone 'in' keyword that terminates a let/foreach head,
+    skipping over [], (), <> groups and string literals."""
+    depth = 0
+    in_string = False
+    while pos < end:
+        c = s[pos]
+        if in_string:
+            if c == '"':
+                in_string = False
+            pos += 1
+        elif c == '"':
+            in_string = True
+            pos += 1
+        elif c in '[(<':
+            depth += 1
+            pos += 1
+        elif c in '])>':
+            depth -= 1
+            pos += 1
+        elif depth == 0 and is_keyword(s, pos, 'in'):
+            return pos
+        else:
+            pos += 1
+    raise ValueError("missing 'in' keyword")
+
+
+def split_top_level(s, sep):
+    """Split s on sep, but only at <>/() nesting depth 0."""
+    parts = []
+    depth = 0
+    last = 0
+    for i, c in enumerate(s):
+        if c in '<(':
+            depth += 1
+        elif c in '>)':
+            depth -= 1
+        elif c == sep and depth == 0:
+            parts.append(s[last:i])
+            last = i + 1
+    parts.append(s[last:])
+    return parts
+
+
+def normalize_type(t):
+    """Translate a single prototype type into C, expanding _Vector<> and
+    dropping the _Constant marker (which only constrains the argument to be a
+    compile-time constant)."""
+    t = t.strip()
+    t = re.sub(r'\b_Constant\b\s*', '', t).strip()
+
+    def repl(m):
+        count = int(m.group(1))
+        element = m.group(2).strip()
+        if '*' in element:
+            raise UnmappableType('vector of pointers: ' + m.group(0))
+        return vector_typedef(element, count)
+
+    t = re.sub(r'_Vector<\s*(\d+)\s*,\s*([^>]+)>', repl, t)
+    return re.sub(r'\s+', ' ', t).strip()
+
+
+def build_declaration(name, prototype):
+    """Build a C declaration string from a builtin name and its TableGen
+    prototype 'ReturnType(ArgType, ...)'."""
+    depth = 0
+    split = -1
+    for i, c in enumerate(prototype):
+        if c == '<':
+            depth += 1
+        elif c == '>':
+            depth -= 1
+        elif c == '(' and depth == 0:
+            split = i
+            break
+    assert split >= 0, 'no argument list in prototype: ' + prototype
+
+    ret = normalize_type(prototype[:split])
+    args_str = prototype[split + 1:prototype.rfind(')')].strip()
+    if args_str == '' or args_str == 'void':
+        args = ['void']
+    else:
+        args = [normalize_type(a) for a in split_top_level(args_str, ',')]
+
+    return ret + ' ' + name + '(' + ', '.join(args) + ');'
+
+
+def resolve_name(raw, bindings):
+    """Resolve a (possibly '#'-pasted) TableGen record name using the current
+    foreach variable bindings."""
+    pieces = []
+    for piece in raw.split('#'):
+        piece = piece.strip()
+        if piece.startswith('"') and piece.endswith('"'):
+            pieces.append(piece[1:-1])
+        elif piece in bindings:
+            pieces.append(bindings[piece])
+        else:
+            pieces.append(piece)
+    return ''.join(pieces)
+
+
+DEF_HEAD_RE = re.compile(r'def\s+(.+?)\s*:\s*(\w+)\s*<', re.DOTALL)
+FEATURES_RE = re.compile(r'\bFeatures\s*=\s*"([^"]*)"')
+FOREACH_RE = re.compile(r'foreach\s+(\w+)\s*=\s*')
+
+
+def parse_def(s, pos, end, bindings, group, out, stats):
+    m = DEF_HEAD_RE.match(s, pos)
+    assert m, 'unparseable def at: ' + s[pos:pos + 60]
+    raw_name, cls = m.group(1), m.group(2)
+    # The prototype is the template argument of the record class.  Its '<'/'>'
+    # delimiters nest, and the prototype string itself contains '<'/'>' (inside
+    # _Vector<>) -- match_delim ignores those as they sit inside string
+    # literals.  TableGen concatenates adjacent string literals, so join them.
+    lt = m.end() - 1
+    gt = match_delim(s, lt, '<', '>')
+    prototype = ''.join(re.findall(r'"([^"]*)"', s[lt + 1:gt - 1]))
+    pos = gt
+    # Consume the trailing ';' or the '{ ... }' body of the record.
+    pos = skip_ws(s, pos, end)
+    if pos < end and s[pos] == '{':
+        pos = match_delim(s, pos, '{', '}')
+    elif pos < end and s[pos] == ';':
+        pos += 1
+
+    prefix = class_prefix_map.get(cls)
+    if prefix is None:
+        stats['unknown_class'] += 1
+        return pos
+
+    name = prefix + resolve_name(raw_name, bindings)
+    try:
+        out.setdefault(group, {})[name] = build_declaration(name, prototype)
+    except UnmappableType:
+        stats['skipped'] += 1
+    return pos
+
+
+def parse_let(s, pos, end, bindings, group, out, stats):
+    in_pos = find_in_keyword(s, pos + len('let'), end)
+    assigns = s[pos + len('let'):in_pos]
+    fm = FEATURES_RE.search(assigns)
+    new_group = fm.group(1) if fm else group
+    body = skip_ws(s, in_pos + len('in'), end)
+    return parse_body(s, body, end, bindings, new_group, out, stats)
+
+
+def parse_foreach(s, pos, end, bindings, group, out, stats):
+    m = FOREACH_RE.match(s, pos)
+    assert m, 'unparseable foreach at: ' + s[pos:pos + 60]
+    var = m.group(1)
+    lst_start = skip_ws(s, m.end(), end)
+    assert s[lst_start] == '[', 'expected list in foreach'
+    lst_end = match_delim(s, lst_start, '[', ']')
+    values = re.findall(r'"([^"]*)"', s[lst_start:lst_end])
+    in_pos = find_in_keyword(s, lst_end, end)
+    body = skip_ws(s, in_pos + len('in'), end)
+    last = body
+    for value in values:
+        new_bindings = dict(bindings)
+        new_bindings[var] = value
+        last = parse_body(s, body, end, new_bindings, group, out, stats)
+    return last
+
+
+def parse_body(s, pos, end, bindings, group, out, stats):
+    """Parse the body of a let/foreach: either a braced block or a single
+    nested construct."""
+    pos = skip_ws(s, pos, end)
+    if pos < end and s[pos] == '{':
+        block_end = match_delim(s, pos, '{', '}')
+        walk(s, pos + 1, block_end - 1, bindings, group, out, stats)
+        return block_end
+    return parse_construct(s, pos, end, bindings, group, out, stats)
+
+
+def parse_construct(s, pos, end, bindings, group, out, stats):
+    if is_keyword(s, pos, 'def'):
+        return parse_def(s, pos, end, bindings, group, out, stats)
+    if is_keyword(s, pos, 'let'):
+        return parse_let(s, pos, end, bindings, group, out, stats)
+    if is_keyword(s, pos, 'foreach'):
+        return parse_foreach(s, pos, end, bindings, group, out, stats)
+    return pos + 1
+
+
+def walk(s, pos, end, bindings, group, out, stats):
+    while pos < end:
+        pos = skip_ws(s, pos, end)
+        if pos >= end:
+            break
+        if is_keyword(s, pos, 'def') or is_keyword(s, pos, 'let') or \
+                is_keyword(s, pos, 'foreach'):
+            pos = parse_construct(s, pos, end, bindings, group, out, stats)
+        elif is_keyword(s, pos, 'include'):
+            semi = s.find(';', pos)
+            pos = semi + 1 if semi != -1 else end
+        else:
+            pos += 1
+    return pos
+
+
+def process_td(text, default_group):
+    """Parse one TableGen builtin database, returning {group: {name: decl}}."""
+    text = strip_comments(text)
+    out = {}
+    stats = {'skipped': 0, 'unknown_class': 0}
+    walk(text, 0, len(text), {}, default_group, out, stats)
+    if stats['skipped'] or stats['unknown_class']:
+        sys.stderr.write(
+            'note: skipped {} builtin(s) with unmappable types, {} with '
+            'unknown record class\n'.format(
+                stats['skipped'], stats['unknown_class']))
+    return out
+
+
+# --- clang-tblgen ".inc" parser (format 2) ---------------------------------
+#
+# The compact type encoding used in the "..._BUILTIN_INFOS" sections, e.g.
+# "V8ScV8ScV8Sci": a 'V' followed by a lane count introduces a vector, type
+# prefixes ('S' signed, 'U' unsigned, ...) precede a base type spec ('c' char,
+# 'i' int, 'f' float, ...), 'I' marks a compile-time constant argument, and
+# '*'/'C'/'D'/'R' apply pointer and qualifiers.
+
+encoding_prefix_map = {
+        'I': '',            # _Constant: only constrains the argument
+        'N': '',
+        'O': 'long long',
+        'S': 'signed',
+        'U': 'unsigned',
+        'W': 'int64_t',
+        'Z': 'int32_t',
+        }
+
+encoding_typespec_map = {
+        'b': '_Bool',
+        'c': 'char',
+        'd': 'double',
+        'f': 'float',
+        'h': '__fp16',
+        'i': 'int',
+        's': 'short',
+        'v': 'void',
+        'x': '_Float16',
+        'y': '__bf16',
+        'z': '__CPROVER_size_t',
+        }
+
+encoding_modifier_map = {'C': 'const', 'D': 'volatile', 'R': 'restrict'}
+
+
+def parse_encoding_prefix(types, i):
     prefix = []
     while i < len(types):
         p = types[i]
-        if i + 3 < len(types) and types[i:i+4] == 'LLLi':
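+        if types[i:i + 4] == 'LLLi':
             prefix.append('__int128_t')
             i += 4
+            # 'LLLi' already names a complete type: stop here so that the
+            # next type in the encoding is not absorbed into this one.
+            break
-        elif i + 1 < len(types) and types[i:i+2] == 'LL':
+        elif types[i:i + 2] == 'LL':
             prefix.extend(['long', 'long'])
             i += 2
         elif p == 'L':
             prefix.append('long')
             i += 1
-        elif i + 1 < len(types) and types[i:i+2] == 'SJ':
-            break
-        elif i + 1 < len(types) and (
-                types[i:i+2] == 'Wi' or types[i:i+2] == 'Zi'):
-            prefix.append(prefix_map[p])
+        elif types[i:i + 2] in ('Wi', 'Zi'):
+            prefix.append(encoding_prefix_map[p])
             i += 2
+            # 'Wi'/'Zi' are complete types (int64_t/int32_t), so stop:
+            # otherwise "WiWi" decodes as one type "int64_t int64_t".
+            break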
-        elif prefix_map.get(p) is not None:
-            mapped = prefix_map[p]
-            if len(mapped):
-                prefix.append(prefix_map[p])
+        elif encoding_prefix_map.get(p) is not None:
+            mapped = encoding_prefix_map[p]
+            if mapped:
+                prefix.append(mapped)
             i += 1
         else:
             break
-
     return prefix, i
 
 
-def build_type_inner(types, i):
-    (typespec, i) = parse_prefix(types, i)
+def build_encoding_type_inner(types, i):
+    (typespec, i) = parse_encoding_prefix(types, i)
 
     if i < len(types):
         t = types[i]
-        if i + 2 < len(types) and t == 'V':
-            m = re.match(r'(\d+).*', types[i+1:])
-            if m and i + 1 + len(m[1]) < len(types):
-                (elem_type_list, next_i) = build_type_inner(
-                        types, i + 1 + len(m[1]))
-                elem_type = ' '.join(elem_type_list)
-                if vector_map.get(elem_type):
-                    typespec.append(vector_map[elem_type][int(m[1])])
-                    i = next_i
-        elif i + 1 < len(types) and t == 'X' and (
-                typespec_map.get(types[i + 1])):
-            typespec.append(typespec_map[types[i + 1]])
-            typespec_map.append('_Complex')
-            i += 2
-        elif i + 1 < len(types) and types[i:i+2] == 'SJ':
-            typespec.append('sigjmp_buf')
-            i += 2
+        if t == 'V':
+            m = re.match(r'(\d+)', types[i + 1:])
+            if not m:
+                raise UnmappableType(
+                        'vector without lane count: ' + types[i:])
+            count = int(m.group(1))
+            (elem_list, next_i) = build_encoding_type_inner(
+                    types, i + 1 + len(m.group(1)))
+            typespec.append(vector_typedef(' '.join(elem_list), count))
+            i = next_i
         elif t == '.' and i + 1 == len(types):
             typespec.append('...')
             i += 1
-        elif typespec_map.get(t):
-            typespec.append(typespec_map[t])
+        elif encoding_typespec_map.get(t):
+            typespec.append(encoding_typespec_map[t])
             i += 1
 
     return typespec, i
 
 
-def build_type(types, i):
-    (typespec, i) = build_type_inner(types, i)
-
+def build_encoding_type(types, i):
+    (typespec, i) = build_encoding_type_inner(types, i)
     while i < len(types):
         s = types[i]
         if s == '*':
             typespec.append('*')
             i += 1
-        elif modifier_map.get(s):
-            typespec.insert(0, modifier_map[s])
+        elif encoding_modifier_map.get(s):
+            typespec.insert(0, encoding_modifier_map[s])
             i += 1
         else:
             break
-
     return ' '.join(typespec), i
 
 
-def process_line(name, types, attributes):
-    """
-    Process the macro declaring "name" as specified at the top of
-    https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Basic/Builtins.def
-    We don't yet parse attributes.
-    """
-
+def decode_encoding(name, encoding):
+    """Decode a compact type encoding into a C declaration, raising
+    UnmappableType if it uses a construct we cannot represent."""
     type_specs = []
     i = 0
-    while i < len(types):
-        (t, i_updated) = build_type(types, i)
-        assert i_updated > i, ('failed to parse type spec of ' + name + ': ' +
-                               types[i:])
+    while i < len(encoding):
+        (t, i_updated) = build_encoding_type(encoding, i)
+        if i_updated <= i:
+            raise UnmappableType(
+                    'unparseable encoding for ' + name + ': ' + encoding[i:])
         i = i_updated
         type_specs.append(t)
 
-    assert len(type_specs), 'missing return type in ' + types
+    if not type_specs:
+        raise UnmappableType('empty encoding for ' + name)
     if len(type_specs) == 1:
         type_specs.append('void')
     return type_specs[0] + ' ' + name + '(' + ', '.join(type_specs[1:]) + ');'
 
 
-def process(input_lines):
-    declarations = {}
-    for l in input_lines:
-        m = re.match(r'BUILTIN\((\w+),\s*"(.+)",\s*"(.*)"\)', l)
-        if m:
-            declaration = process_line(m[1], m[2], m[3])
-            if not declarations.get('clang'):
-                declarations['clang'] = {}
-            declarations['clang'][m[1]] = declaration
-            continue
-        m = re.match(
-                r'TARGET_BUILTIN\((\w+),\s*"(.+)",\s*"(.*)",\s*"(.*)"\)', l)
-        if m:
-            declaration = process_line(m[1], m[2], m[3])
-            group = m[4]
-            if len(group) == 0:
-                group = 'clang'
-            if not declarations.get(group):
-                declarations[group] = {}
-            declarations[group][m[1]] = declaration
-
-    return declarations
+INFO_RE = re.compile(
+        r'StrOffsets\{\s*\d+ /\* (\S+) \*/,\s*\d+ /\* (.+?) \*/,'
+        r'\s*\d+ /\* .*? \*/,\s*(?:\d+ /\* (.+?) \*/|0)')
+
+
+def process_inc(text, prefix):
+    """Parse the ..._BUILTIN_INFOS section of a clang-tblgen .inc file,
+    returning {group: {name: decl}}."""
+    out = {}
+    stats = {'skipped': 0}
+    for m in INFO_RE.finditer(text):
+        name = prefix + m.group(1)
+        encoding = m.group(2)
+        group = m.group(3) if m.group(3) else 'builtins'
+        try:
+            out.setdefault(group, {})[name] = decode_encoding(name, encoding)
+        except UnmappableType:
+            stats['skipped'] += 1
+    if stats['skipped']:
+        sys.stderr.write(
+            'note: skipped {} builtin(s) with unmappable types\n'.format(
+                stats['skipped']))
+    return out
 
 
+# --- output ----------------------------------------------------------------
+
 def print_declarations(declaration_map, known_declarations):
     for k, v in sorted(declaration_map.items()):
         new_decls = []
@@ -267,33 +586,50 @@ def print_declarations(declaration_map, known_declarations):
                 print(decl)
 
 
-def read_declarations():
+def read_declarations(headers):
     known_declarations = {}
-    for fname in sys.argv[1:]:
+    for fname in headers:
         with open(fname) as f:
-            lines = f.readlines()
-            for l in lines:
+            for l in f.readlines():
                 m = re.match(r'.* (\w+)\(.*\);', l)
                 if m:
                     known_declarations[m[1]] = m[0]
-
     return known_declarations
 
 
+def merge(declaration_map, additions):
+    for k, v in additions.items():
+        declaration_map.setdefault(k, {}).update(v)
+
+
 def main():
-    known_declarations = read_declarations()
-    base_url = ('https://raw.githubusercontent.com/llvm/llvm-project/' +
-                'main/clang/include/clang/Basic/')
-    files = ['BuiltinsX86.def', 'BuiltinsX86_64.def']
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+            'headers', nargs='*',
+            help='gcc_builtin_headers_*.h files to diff the result against')
+    parser.add_argument(
+            '--inc', action='append', default=[], metavar='PREFIX:PATH',
+            help='parse a clang-tblgen-generated .inc file instead of the '
+                 'TableGen .td databases, prepending PREFIX to each builtin '
+                 'name (e.g. __builtin_neon_:neon_sema.inc)')
+    args = parser.parse_args()
+
+    known_declarations = read_declarations(args.headers)
     declaration_map = {}
-    for f in files:
-        url = base_url + f
-        lines = requests.get(base_url + f).text.split('\n')
-        for k, v in process(lines).items():
-            if not declaration_map.get(k):
-                declaration_map[k] = v
-            else:
-                declaration_map[k].update(v)
+
+    if args.inc:
+        for spec in args.inc:
+            prefix, _, path = spec.partition(':')
+            if not path:
+                parser.error('--inc expects PREFIX:PATH, got: ' + spec)
+            with open(path) as f:
+                merge(declaration_map, process_inc(f.read(), prefix))
+    else:
+        base_url = ('https://raw.githubusercontent.com/llvm/llvm-project/' +
+                    'main/clang/include/clang/Basic/')
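+        for f in ['BuiltinsX86.td', 'BuiltinsX86_64.td']:
+            response = requests.get(base_url + f)
+            # Fail loudly rather than parse an HTTP error page as TableGen.
+            response.raise_for_status()
+            merge(declaration_map, process_td(response.text, 'x86'))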
 
     print_declarations(declaration_map, known_declarations)
 
diff --git a/src/ansi-c/compiler_headers/gcc_builtin_headers_aarch64.h b/src/ansi-c/compiler_headers/gcc_builtin_headers_aarch64.h
new file mode 100644
index 00000000000..4823334625c
--- /dev/null
+++ b/src/ansi-c/compiler_headers/gcc_builtin_headers_aarch64.h
@@ -0,0 +1,1055 @@
+// clang-format off
+__gcc_v8qi __builtin_neon___a32_vcvt_bf16_f32(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_splat_lane_bf16(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_splat_lane_v(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_splat_laneq_bf16(__gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_splat_laneq_v(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_splatq_lane_bf16(__gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_splatq_lane_v(__gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_splatq_laneq_bf16(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_splatq_laneq_v(__gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_vabd_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vabd_v(__gcc_v8qi, __gcc_v8qi, int);
+double __builtin_neon_vabdd_f64(double, double);
+__gcc_v16qi __builtin_neon_vabdq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vabdq_v(__gcc_v16qi, __gcc_v16qi, int);
+float __builtin_neon_vabds_f32(float, float);
+__gcc_v8qi __builtin_neon_vabs_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vabs_v(__gcc_v8qi, int);
+int64_t __builtin_neon_vabsd_s64(int64_t);
+__gcc_v16qi __builtin_neon_vabsq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vabsq_v(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vadd_v(__gcc_v8qi, __gcc_v8qi, int);
+int64_t __builtin_neon_vaddd_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vaddd_u64(unsigned int64_t, unsigned int64_t);
+__gcc_v8qi __builtin_neon_vaddhn_v(__gcc_v16qi, __gcc_v16qi, int);
+int __builtin_neon_vaddlv_s16(__gcc_v4hi);
+int64_t __builtin_neon_vaddlv_s32(__gcc_v2si);
+short __builtin_neon_vaddlv_s8(__gcc_v8qi);
+int __builtin_neon_vaddlvq_s16(__gcc_v8hi);
+int64_t __builtin_neon_vaddlvq_s32(__gcc_v4si);
+short __builtin_neon_vaddlvq_s8(__gcc_v16qi);
+unsigned int __builtin_neon_vaddlvq_u16(__gcc_v8uhi);
+unsigned int64_t __builtin_neon_vaddlvq_u32(__gcc_v4usi);
+unsigned __int128_t __builtin_neon_vaddq_p128(unsigned __int128_t, unsigned __int128_t);
+__gcc_v16qi __builtin_neon_vaddq_v(__gcc_v16qi, __gcc_v16qi, int);
+float __builtin_neon_vaddv_f32(__gcc_v2sf);
+short __builtin_neon_vaddv_s16(__gcc_v4hi);
+int __builtin_neon_vaddv_s32(__gcc_v2si);
+signed char __builtin_neon_vaddv_s8(__gcc_v8qi);
+float __builtin_neon_vaddvq_f32(__gcc_v4sf);
+double __builtin_neon_vaddvq_f64(__gcc_v2df);
+short __builtin_neon_vaddvq_s16(__gcc_v8hi);
+int __builtin_neon_vaddvq_s32(__gcc_v4si);
+int64_t __builtin_neon_vaddvq_s64(__gcc_v2di);
+signed char __builtin_neon_vaddvq_s8(__gcc_v16qi);
+unsigned short __builtin_neon_vaddvq_u16(__gcc_v8uhi);
+unsigned int __builtin_neon_vaddvq_u32(__gcc_v4usi);
+unsigned int64_t __builtin_neon_vaddvq_u64(__gcc_v2udi);
+__gcc_v16qi __builtin_neon_vaesdq_u8(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vaeseq_u8(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vaesimcq_u8(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vaesmcq_u8(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vamax_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vamax_f32(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vamaxq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vamaxq_f32(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vamaxq_f64(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vamin_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vamin_f32(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vaminq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vaminq_f32(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vaminq_f64(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vbcaxq_s16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vbcaxq_s32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vbcaxq_s64(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vbcaxq_s8(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vbcaxq_u16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vbcaxq_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vbcaxq_u64(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vbcaxq_u8(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vbfdot_f32(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vbfdotq_f32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vbfmlalbq_f32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vbfmlaltq_f32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vbfmmlaq_f32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vbsl_v(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vbslq_v(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vcadd_rot270_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcadd_rot270_f32(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcadd_rot90_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcadd_rot90_f32(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vcaddq_rot270_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcaddq_rot270_f32(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcaddq_rot270_f64(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcaddq_rot90_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcaddq_rot90_f32(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcaddq_rot90_f64(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vcage_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcage_v(__gcc_v8qi, __gcc_v8qi, int);
+unsigned int64_t __builtin_neon_vcaged_f64(double, double);
+__gcc_v16qi __builtin_neon_vcageq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcageq_v(__gcc_v16qi, __gcc_v16qi, int);
+unsigned int __builtin_neon_vcages_f32(float, float);
+__gcc_v8qi __builtin_neon_vcagt_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcagt_v(__gcc_v8qi, __gcc_v8qi, int);
+unsigned int64_t __builtin_neon_vcagtd_f64(double, double);
+__gcc_v16qi __builtin_neon_vcagtq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcagtq_v(__gcc_v16qi, __gcc_v16qi, int);
+unsigned int __builtin_neon_vcagts_f32(float, float);
+__gcc_v8qi __builtin_neon_vcale_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcale_v(__gcc_v8qi, __gcc_v8qi, int);
+unsigned int64_t __builtin_neon_vcaled_f64(double, double);
+__gcc_v16qi __builtin_neon_vcaleq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcaleq_v(__gcc_v16qi, __gcc_v16qi, int);
+unsigned int __builtin_neon_vcales_f32(float, float);
+__gcc_v8qi __builtin_neon_vcalt_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcalt_v(__gcc_v8qi, __gcc_v8qi, int);
+unsigned int64_t __builtin_neon_vcaltd_f64(double, double);
+__gcc_v16qi __builtin_neon_vcaltq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcaltq_v(__gcc_v16qi, __gcc_v16qi, int);
+unsigned int __builtin_neon_vcalts_f32(float, float);
+unsigned int64_t __builtin_neon_vceqd_f64(double, double);
+unsigned int64_t __builtin_neon_vceqd_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vceqd_u64(unsigned int64_t, unsigned int64_t);
+unsigned int __builtin_neon_vceqs_f32(float, float);
+__gcc_v8qi __builtin_neon_vceqz_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vceqz_v(__gcc_v8qi, int);
+unsigned int64_t __builtin_neon_vceqzd_f64(double);
+unsigned int64_t __builtin_neon_vceqzd_s64(int64_t);
+unsigned int64_t __builtin_neon_vceqzd_u64(unsigned int64_t);
+__gcc_v16qi __builtin_neon_vceqzq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vceqzq_v(__gcc_v16qi, int);
+unsigned int __builtin_neon_vceqzs_f32(float);
+unsigned int64_t __builtin_neon_vcged_f64(double, double);
+unsigned int64_t __builtin_neon_vcged_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vcged_u64(unsigned int64_t, unsigned int64_t);
+unsigned int __builtin_neon_vcges_f32(float, float);
+__gcc_v8qi __builtin_neon_vcgez_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcgez_v(__gcc_v8qi, int);
+unsigned int64_t __builtin_neon_vcgezd_f64(double);
+unsigned int64_t __builtin_neon_vcgezd_s64(int64_t);
+__gcc_v16qi __builtin_neon_vcgezq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcgezq_v(__gcc_v16qi, int);
+unsigned int __builtin_neon_vcgezs_f32(float);
+unsigned int64_t __builtin_neon_vcgtd_f64(double, double);
+unsigned int64_t __builtin_neon_vcgtd_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vcgtd_u64(unsigned int64_t, unsigned int64_t);
+unsigned int __builtin_neon_vcgts_f32(float, float);
+__gcc_v8qi __builtin_neon_vcgtz_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcgtz_v(__gcc_v8qi, int);
+unsigned int64_t __builtin_neon_vcgtzd_f64(double);
+unsigned int64_t __builtin_neon_vcgtzd_s64(int64_t);
+__gcc_v16qi __builtin_neon_vcgtzq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcgtzq_v(__gcc_v16qi, int);
+unsigned int __builtin_neon_vcgtzs_f32(float);
+unsigned int64_t __builtin_neon_vcled_f64(double, double);
+unsigned int64_t __builtin_neon_vcled_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vcled_u64(unsigned int64_t, unsigned int64_t);
+unsigned int __builtin_neon_vcles_f32(float, float);
+__gcc_v8qi __builtin_neon_vclez_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vclez_v(__gcc_v8qi, int);
+unsigned int64_t __builtin_neon_vclezd_f64(double);
+unsigned int64_t __builtin_neon_vclezd_s64(int64_t);
+__gcc_v16qi __builtin_neon_vclezq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vclezq_v(__gcc_v16qi, int);
+unsigned int __builtin_neon_vclezs_f32(float);
+__gcc_v8qi __builtin_neon_vcls_v(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vclsq_v(__gcc_v16qi, int);
+unsigned int64_t __builtin_neon_vcltd_f64(double, double);
+unsigned int64_t __builtin_neon_vcltd_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vcltd_u64(unsigned int64_t, unsigned int64_t);
+unsigned int __builtin_neon_vclts_f32(float, float);
+__gcc_v8qi __builtin_neon_vcltz_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcltz_v(__gcc_v8qi, int);
+unsigned int64_t __builtin_neon_vcltzd_f64(double);
+unsigned int64_t __builtin_neon_vcltzd_s64(int64_t);
+__gcc_v16qi __builtin_neon_vcltzq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcltzq_v(__gcc_v16qi, int);
+unsigned int __builtin_neon_vcltzs_f32(float);
+__gcc_v8qi __builtin_neon_vclz_v(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vclzq_v(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vcmla_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcmla_f32(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcmla_rot180_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcmla_rot180_f32(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcmla_rot270_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcmla_rot270_f32(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcmla_rot90_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcmla_rot90_f32(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vcmlaq_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcmlaq_f32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcmlaq_f64(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcmlaq_rot180_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcmlaq_rot180_f32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcmlaq_rot180_f64(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcmlaq_rot270_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcmlaq_rot270_f32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcmlaq_rot270_f64(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcmlaq_rot90_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcmlaq_rot90_f32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcmlaq_rot90_f64(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vcnt_v(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vcntq_v(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vcvt_bf16_f32(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vcvt_f16_f32(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vcvt_f16_s16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvt_f16_u16(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vcvt_f32_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvt_f32_f64(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vcvt_f32_v(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vcvt_f64_f32(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvt_f64_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvt_n_f16_s16(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vcvt_n_f16_u16(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vcvt_n_f32_v(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vcvt_n_f64_v(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vcvt_n_s16_f16(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vcvt_n_s32_v(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vcvt_n_s64_v(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vcvt_n_u16_f16(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vcvt_n_u32_v(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vcvt_n_u64_v(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vcvt_s16_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvt_s32_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvt_s64_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvt_u16_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvt_u32_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvt_u64_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvta_s16_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvta_s32_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvta_s64_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvta_u16_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvta_u32_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvta_u64_v(__gcc_v8qi, int);
+int __builtin_neon_vcvtad_s32_f64(double);
+int64_t __builtin_neon_vcvtad_s64_f64(double);
+unsigned int __builtin_neon_vcvtad_u32_f64(double);
+unsigned int64_t __builtin_neon_vcvtad_u64_f64(double);
+__gcc_v16qi __builtin_neon_vcvtaq_s16_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtaq_s32_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtaq_s64_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtaq_u16_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtaq_u32_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtaq_u64_v(__gcc_v16qi, int);
+int __builtin_neon_vcvtas_s32_f32(float);
+int64_t __builtin_neon_vcvtas_s64_f32(float);
+unsigned int __builtin_neon_vcvtas_u32_f32(float);
+unsigned int64_t __builtin_neon_vcvtas_u64_f32(float);
+double __builtin_neon_vcvtd_f64_s64(int64_t);
+double __builtin_neon_vcvtd_f64_u64(unsigned int64_t);
+double __builtin_neon_vcvtd_n_f64_s64(int64_t, int);
+double __builtin_neon_vcvtd_n_f64_u64(unsigned int64_t, int);
+int64_t __builtin_neon_vcvtd_n_s64_f64(double, int);
+unsigned int64_t __builtin_neon_vcvtd_n_u64_f64(double, int);
+int __builtin_neon_vcvtd_s32_f64(double);
+int64_t __builtin_neon_vcvtd_s64_f64(double);
+unsigned int __builtin_neon_vcvtd_u32_f64(double);
+unsigned int64_t __builtin_neon_vcvtd_u64_f64(double);
+__bf16 __builtin_neon_vcvth_bf16_f32(float);
+__gcc_v8qi __builtin_neon_vcvtm_s16_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtm_s32_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtm_s64_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtm_u16_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtm_u32_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtm_u64_v(__gcc_v8qi, int);
+int __builtin_neon_vcvtmd_s32_f64(double);
+int64_t __builtin_neon_vcvtmd_s64_f64(double);
+unsigned int __builtin_neon_vcvtmd_u32_f64(double);
+unsigned int64_t __builtin_neon_vcvtmd_u64_f64(double);
+__gcc_v16qi __builtin_neon_vcvtmq_s16_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtmq_s32_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtmq_s64_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtmq_u16_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtmq_u32_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtmq_u64_v(__gcc_v16qi, int);
+int __builtin_neon_vcvtms_s32_f32(float);
+int64_t __builtin_neon_vcvtms_s64_f32(float);
+unsigned int __builtin_neon_vcvtms_u32_f32(float);
+unsigned int64_t __builtin_neon_vcvtms_u64_f32(float);
+__gcc_v8qi __builtin_neon_vcvtn_s16_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtn_s32_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtn_s64_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtn_u16_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtn_u32_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtn_u64_v(__gcc_v8qi, int);
+int __builtin_neon_vcvtnd_s32_f64(double);
+int64_t __builtin_neon_vcvtnd_s64_f64(double);
+unsigned int __builtin_neon_vcvtnd_u32_f64(double);
+unsigned int64_t __builtin_neon_vcvtnd_u64_f64(double);
+__gcc_v16qi __builtin_neon_vcvtnq_s16_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtnq_s32_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtnq_s64_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtnq_u16_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtnq_u32_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtnq_u64_v(__gcc_v16qi, int);
+int __builtin_neon_vcvtns_s32_f32(float);
+int64_t __builtin_neon_vcvtns_s64_f32(float);
+unsigned int __builtin_neon_vcvtns_u32_f32(float);
+unsigned int64_t __builtin_neon_vcvtns_u64_f32(float);
+__gcc_v8qi __builtin_neon_vcvtp_s16_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtp_s32_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtp_s64_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtp_u16_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtp_u32_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vcvtp_u64_v(__gcc_v8qi, int);
+int __builtin_neon_vcvtpd_s32_f64(double);
+int64_t __builtin_neon_vcvtpd_s64_f64(double);
+unsigned int __builtin_neon_vcvtpd_u32_f64(double);
+unsigned int64_t __builtin_neon_vcvtpd_u64_f64(double);
+__gcc_v16qi __builtin_neon_vcvtpq_s16_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtpq_s32_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtpq_s64_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtpq_u16_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtpq_u32_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtpq_u64_v(__gcc_v16qi, int);
+int __builtin_neon_vcvtps_s32_f32(float);
+int64_t __builtin_neon_vcvtps_s64_f32(float);
+unsigned int __builtin_neon_vcvtps_u32_f32(float);
+unsigned int64_t __builtin_neon_vcvtps_u64_f32(float);
+__gcc_v16qi __builtin_neon_vcvtq_f16_s16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtq_f16_u16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtq_f32_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtq_f64_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtq_high_bf16_f32(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtq_low_bf16_f32(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtq_n_f16_s16(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vcvtq_n_f16_u16(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vcvtq_n_f32_v(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vcvtq_n_f64_v(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vcvtq_n_s16_f16(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vcvtq_n_s32_v(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vcvtq_n_s64_v(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vcvtq_n_u16_f16(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vcvtq_n_u32_v(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vcvtq_n_u64_v(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vcvtq_s16_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtq_s32_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtq_s64_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtq_u16_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtq_u32_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vcvtq_u64_v(__gcc_v16qi, int);
+float __builtin_neon_vcvts_f32_s32(int);
+float __builtin_neon_vcvts_f32_u32(unsigned int);
+float __builtin_neon_vcvts_n_f32_s32(int, int);
+float __builtin_neon_vcvts_n_f32_u32(unsigned int, int);
+int __builtin_neon_vcvts_n_s32_f32(float, int);
+unsigned int __builtin_neon_vcvts_n_u32_f32(float, int);
+int __builtin_neon_vcvts_s32_f32(float);
+int64_t __builtin_neon_vcvts_s64_f32(float);
+unsigned int __builtin_neon_vcvts_u32_f32(float);
+unsigned int64_t __builtin_neon_vcvts_u64_f32(float);
+__gcc_v8qi __builtin_neon_vcvtx_f32_v(__gcc_v16qi, int);
+float __builtin_neon_vcvtxd_f32_f64(double);
+__gcc_v8qi __builtin_neon_vdot_f32_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vdot_lane_f32_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vdot_laneq_f32_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_vdot_s32(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vdot_u32(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vdotq_f32_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vdotq_lane_f32_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vdotq_laneq_f32_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vdotq_s32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vdotq_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+unsigned char __builtin_neon_vdupb_lane_i8(__gcc_v8qi, int);
+unsigned char __builtin_neon_vdupb_laneq_i8(__gcc_v16qi, int);
+double __builtin_neon_vdupd_laneq_f64(__gcc_v2df, int);
+unsigned short __builtin_neon_vduph_lane_i16(__gcc_v4hi, int);
+__bf16 __builtin_neon_vduph_laneq_bf16(__gcc_v8hf, int);
+__fp16 __builtin_neon_vduph_laneq_f16(__gcc_v8hf, int);
+unsigned short __builtin_neon_vduph_laneq_i16(__gcc_v8hi, int);
+float __builtin_neon_vdups_lane_f32(__gcc_v2sf, int);
+unsigned int __builtin_neon_vdups_lane_i32(__gcc_v2si, int);
+float __builtin_neon_vdups_laneq_f32(__gcc_v4sf, int);
+unsigned int __builtin_neon_vdups_laneq_i32(__gcc_v4si, int);
+__gcc_v16qi __builtin_neon_veor3q_s16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_veor3q_s32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_veor3q_s64(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_veor3q_s8(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_veor3q_u16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_veor3q_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_veor3q_u64(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_veor3q_u8(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vext_v(__gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vextq_v(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_vfma_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vfma_lane_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vfma_lane_v(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vfma_laneq_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_vfma_laneq_v(__gcc_v8qi, __gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_vfma_v(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+double __builtin_neon_vfmad_laneq_f64(double, double, __gcc_v2df, int);
+__fp16 __builtin_neon_vfmah_laneq_f16(__fp16, __fp16, __gcc_v8hf, int);
+__gcc_v16qi __builtin_neon_vfmaq_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vfmaq_lane_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vfmaq_lane_v(__gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vfmaq_laneq_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vfmaq_laneq_v(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vfmaq_v(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+float __builtin_neon_vfmas_lane_f32(float, float, __gcc_v2sf, int);
+float __builtin_neon_vfmas_laneq_f32(float, float, __gcc_v4sf, int);
+__gcc_v8qi __builtin_neon_vfmlal_high_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vfmlal_low_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vfmlalq_high_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vfmlalq_low_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vfmlsl_high_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vfmlsl_low_f16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vfmlslq_high_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vfmlslq_low_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+float __builtin_neon_vget_lane_f32(__gcc_v2sf, int);
+unsigned short __builtin_neon_vget_lane_i16(__gcc_v4hi, int);
+unsigned int __builtin_neon_vget_lane_i32(__gcc_v2si, int);
+unsigned char __builtin_neon_vget_lane_i8(__gcc_v8qi, int);
+__bf16 __builtin_neon_vgetq_lane_bf16(__gcc_v8hf, int);
+float __builtin_neon_vgetq_lane_f32(__gcc_v4sf, int);
+double __builtin_neon_vgetq_lane_f64(__gcc_v2df, int);
+unsigned short __builtin_neon_vgetq_lane_i16(__gcc_v8hi, int);
+unsigned int __builtin_neon_vgetq_lane_i32(__gcc_v4si, int);
+unsigned char __builtin_neon_vgetq_lane_i8(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vhadd_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vhaddq_v(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vhsub_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vhsubq_v(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vld1_bf16(const void *, int);
+void __builtin_neon_vld1_bf16_x2(void *, const void *, int);
+void __builtin_neon_vld1_bf16_x3(void *, const void *, int);
+void __builtin_neon_vld1_bf16_x4(void *, const void *, int);
+__gcc_v8qi __builtin_neon_vld1_dup_bf16(const void *, int);
+__gcc_v8qi __builtin_neon_vld1_dup_v(const void *, int);
+__gcc_v8qi __builtin_neon_vld1_lane_bf16(const void *, __gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vld1_lane_v(const void *, __gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vld1_v(const void *, int);
+void __builtin_neon_vld1_x2_v(void *, const void *, int);
+void __builtin_neon_vld1_x3_v(void *, const void *, int);
+void __builtin_neon_vld1_x4_v(void *, const void *, int);
+__gcc_v16qi __builtin_neon_vld1q_bf16(const void *, int);
+void __builtin_neon_vld1q_bf16_x2(void *, const void *, int);
+void __builtin_neon_vld1q_bf16_x3(void *, const void *, int);
+void __builtin_neon_vld1q_bf16_x4(void *, const void *, int);
+__gcc_v16qi __builtin_neon_vld1q_dup_bf16(const void *, int);
+__gcc_v16qi __builtin_neon_vld1q_dup_v(const void *, int);
+__gcc_v16qi __builtin_neon_vld1q_lane_bf16(const void *, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vld1q_lane_v(const void *, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vld1q_v(const void *, int);
+void __builtin_neon_vld1q_x2_v(void *, const void *, int);
+void __builtin_neon_vld1q_x3_v(void *, const void *, int);
+void __builtin_neon_vld1q_x4_v(void *, const void *, int);
+void __builtin_neon_vld2_bf16(void *, const void *, int);
+void __builtin_neon_vld2_dup_bf16(void *, const void *, int);
+void __builtin_neon_vld2_dup_v(void *, const void *, int);
+void __builtin_neon_vld2_lane_bf16(void *, const void *, __gcc_v8qi, __gcc_v8qi, int, int);
+void __builtin_neon_vld2_lane_v(void *, const void *, __gcc_v8qi, __gcc_v8qi, int, int);
+void __builtin_neon_vld2_v(void *, const void *, int);
+void __builtin_neon_vld2q_bf16(void *, const void *, int);
+void __builtin_neon_vld2q_dup_bf16(void *, const void *, int);
+void __builtin_neon_vld2q_dup_v(void *, const void *, int);
+void __builtin_neon_vld2q_lane_bf16(void *, const void *, __gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vld2q_lane_v(void *, const void *, __gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vld2q_v(void *, const void *, int);
+void __builtin_neon_vld3_bf16(void *, const void *, int);
+void __builtin_neon_vld3_dup_bf16(void *, const void *, int);
+void __builtin_neon_vld3_dup_v(void *, const void *, int);
+void __builtin_neon_vld3_lane_bf16(void *, const void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int, int);
+void __builtin_neon_vld3_lane_v(void *, const void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int, int);
+void __builtin_neon_vld3_v(void *, const void *, int);
+void __builtin_neon_vld3q_bf16(void *, const void *, int);
+void __builtin_neon_vld3q_dup_bf16(void *, const void *, int);
+void __builtin_neon_vld3q_dup_v(void *, const void *, int);
+void __builtin_neon_vld3q_lane_bf16(void *, const void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vld3q_lane_v(void *, const void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vld3q_v(void *, const void *, int);
+void __builtin_neon_vld4_bf16(void *, const void *, int);
+void __builtin_neon_vld4_dup_bf16(void *, const void *, int);
+void __builtin_neon_vld4_dup_v(void *, const void *, int);
+void __builtin_neon_vld4_lane_bf16(void *, const void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int, int);
+void __builtin_neon_vld4_lane_v(void *, const void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int, int);
+void __builtin_neon_vld4_v(void *, const void *, int);
+void __builtin_neon_vld4q_bf16(void *, const void *, int);
+void __builtin_neon_vld4q_dup_bf16(void *, const void *, int);
+void __builtin_neon_vld4q_dup_v(void *, const void *, int);
+void __builtin_neon_vld4q_lane_bf16(void *, const void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vld4q_lane_v(void *, const void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vld4q_v(void *, const void *, int);
+__gcc_v8qi __builtin_neon_vldap1_lane_f64(const void *, __gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vldap1_lane_p64(const void *, __gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vldap1_lane_s64(const void *, __gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vldap1_lane_u64(const void *, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vldap1q_lane_f64(const void *, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vldap1q_lane_p64(const void *, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vldap1q_lane_s64(const void *, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vldap1q_lane_u64(const void *, __gcc_v16qi, int, int);
+unsigned __int128_t __builtin_neon_vldrq_p128(const void *);
+__gcc_v16qi __builtin_neon_vluti2_lane_bf16(__gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_lane_f16(__gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_lane_mf8(__gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_lane_p16(__gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_lane_p8(__gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_lane_s16(__gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_lane_s8(__gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_lane_u16(__gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_lane_u8(__gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_laneq_bf16(__gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_laneq_f16(__gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_laneq_mf8(__gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_laneq_p16(__gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_laneq_p8(__gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_laneq_s16(__gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_laneq_s8(__gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_laneq_u16(__gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2_laneq_u8(__gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_lane_bf16(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_lane_f16(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_lane_mf8(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_lane_p16(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_lane_p8(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_lane_s16(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_lane_s8(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_lane_u16(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_lane_u8(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_laneq_bf16(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_laneq_f16(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_laneq_mf8(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_laneq_p16(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_laneq_p8(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_laneq_s16(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_laneq_s8(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_laneq_u16(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti2q_laneq_u8(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_lane_bf16_x2(__gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_lane_f16_x2(__gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_lane_mf8(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_lane_p16_x2(__gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_lane_p8(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_lane_s16_x2(__gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_lane_s8(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_lane_u16_x2(__gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_lane_u8(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_laneq_bf16_x2(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_laneq_f16_x2(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_laneq_mf8(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_laneq_p16_x2(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_laneq_p8(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_laneq_s16_x2(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_laneq_s8(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_laneq_u16_x2(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vluti4q_laneq_u8(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_vmax_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vmax_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vmaxnm_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vmaxnm_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vmaxnmq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vmaxnmq_v(__gcc_v16qi, __gcc_v16qi, int);
+__fp16 __builtin_neon_vmaxnmv_f16(__gcc_v8qi);
+float __builtin_neon_vmaxnmv_f32(__gcc_v2sf);
+__fp16 __builtin_neon_vmaxnmvq_f16(__gcc_v16qi);
+float __builtin_neon_vmaxnmvq_f32(__gcc_v4sf);
+double __builtin_neon_vmaxnmvq_f64(__gcc_v2df);
+__gcc_v16qi __builtin_neon_vmaxq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vmaxq_v(__gcc_v16qi, __gcc_v16qi, int);
+__fp16 __builtin_neon_vmaxv_f16(__gcc_v8qi);
+float __builtin_neon_vmaxv_f32(__gcc_v2sf);
+short __builtin_neon_vmaxv_s16(__gcc_v4hi);
+int __builtin_neon_vmaxv_s32(__gcc_v2si);
+signed char __builtin_neon_vmaxv_s8(__gcc_v8qi);
+__fp16 __builtin_neon_vmaxvq_f16(__gcc_v16qi);
+float __builtin_neon_vmaxvq_f32(__gcc_v4sf);
+double __builtin_neon_vmaxvq_f64(__gcc_v2df);
+short __builtin_neon_vmaxvq_s16(__gcc_v8hi);
+int __builtin_neon_vmaxvq_s32(__gcc_v4si);
+signed char __builtin_neon_vmaxvq_s8(__gcc_v16qi);
+unsigned short __builtin_neon_vmaxvq_u16(__gcc_v8uhi);
+unsigned int __builtin_neon_vmaxvq_u32(__gcc_v4usi);
+__gcc_v8qi __builtin_neon_vmin_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vmin_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vminnm_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vminnm_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vminnmq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vminnmq_v(__gcc_v16qi, __gcc_v16qi, int);
+__fp16 __builtin_neon_vminnmv_f16(__gcc_v8qi);
+float __builtin_neon_vminnmv_f32(__gcc_v2sf);
+__fp16 __builtin_neon_vminnmvq_f16(__gcc_v16qi);
+float __builtin_neon_vminnmvq_f32(__gcc_v4sf);
+double __builtin_neon_vminnmvq_f64(__gcc_v2df);
+__gcc_v16qi __builtin_neon_vminq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vminq_v(__gcc_v16qi, __gcc_v16qi, int);
+__fp16 __builtin_neon_vminv_f16(__gcc_v8qi);
+float __builtin_neon_vminv_f32(__gcc_v2sf);
+short __builtin_neon_vminv_s16(__gcc_v4hi);
+int __builtin_neon_vminv_s32(__gcc_v2si);
+signed char __builtin_neon_vminv_s8(__gcc_v8qi);
+__fp16 __builtin_neon_vminvq_f16(__gcc_v16qi);
+float __builtin_neon_vminvq_f32(__gcc_v4sf);
+double __builtin_neon_vminvq_f64(__gcc_v2df);
+short __builtin_neon_vminvq_s16(__gcc_v8hi);
+int __builtin_neon_vminvq_s32(__gcc_v4si);
+signed char __builtin_neon_vminvq_s8(__gcc_v16qi);
+unsigned short __builtin_neon_vminvq_u16(__gcc_v8uhi);
+unsigned int __builtin_neon_vminvq_u32(__gcc_v4usi);
+__gcc_v16qi __builtin_neon_vmmlaq_f16_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vmmlaq_f32_f16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vmmlaq_s32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vmmlaq_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vmovl_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vmovn_v(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vmul_lane_v(__gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vmul_laneq_v(__gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_vmul_v(__gcc_v8qi, __gcc_v8qi, int);
+unsigned __int128_t __builtin_neon_vmull_p64(unsigned int64_t, unsigned int64_t);
+__gcc_v16qi __builtin_neon_vmull_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vmulq_v(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vmulx_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vmulx_v(__gcc_v8qi, __gcc_v8qi, int);
+double __builtin_neon_vmulxd_f64(double, double);
+__fp16 __builtin_neon_vmulxh_laneq_f16(__fp16, __gcc_v8hf, int);
+__gcc_v16qi __builtin_neon_vmulxq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vmulxq_v(__gcc_v16qi, __gcc_v16qi, int);
+float __builtin_neon_vmulxs_f32(float, float);
+int64_t __builtin_neon_vnegd_s64(int64_t);
+__gcc_v8qi __builtin_neon_vpadal_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vpadalq_v(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vpadd_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vpadd_v(__gcc_v8qi, __gcc_v8qi, int);
+double __builtin_neon_vpaddd_f64(__gcc_v2df);
+int64_t __builtin_neon_vpaddd_s64(__gcc_v2di);
+unsigned int64_t __builtin_neon_vpaddd_u64(__gcc_v2udi);
+__gcc_v8qi __builtin_neon_vpaddl_v(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vpaddlq_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vpaddq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vpaddq_v(__gcc_v16qi, __gcc_v16qi, int);
+float __builtin_neon_vpadds_f32(__gcc_v2sf);
+__gcc_v8qi __builtin_neon_vpmax_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vpmax_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vpmaxnm_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vpmaxnm_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vpmaxnmq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vpmaxnmq_v(__gcc_v16qi, __gcc_v16qi, int);
+double __builtin_neon_vpmaxnmqd_f64(__gcc_v2df);
+float __builtin_neon_vpmaxnms_f32(__gcc_v2sf);
+__gcc_v16qi __builtin_neon_vpmaxq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vpmaxq_v(__gcc_v16qi, __gcc_v16qi, int);
+double __builtin_neon_vpmaxqd_f64(__gcc_v2df);
+float __builtin_neon_vpmaxs_f32(__gcc_v2sf);
+__gcc_v8qi __builtin_neon_vpmin_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vpmin_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vpminnm_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vpminnm_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vpminnmq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vpminnmq_v(__gcc_v16qi, __gcc_v16qi, int);
+double __builtin_neon_vpminnmqd_f64(__gcc_v2df);
+float __builtin_neon_vpminnms_f32(__gcc_v2sf);
+__gcc_v16qi __builtin_neon_vpminq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vpminq_v(__gcc_v16qi, __gcc_v16qi, int);
+double __builtin_neon_vpminqd_f64(__gcc_v2df);
+float __builtin_neon_vpmins_f32(__gcc_v2sf);
+__gcc_v8qi __builtin_neon_vqabs_v(__gcc_v8qi, int);
+signed char __builtin_neon_vqabsb_s8(signed char);
+int64_t __builtin_neon_vqabsd_s64(int64_t);
+short __builtin_neon_vqabsh_s16(short);
+__gcc_v16qi __builtin_neon_vqabsq_v(__gcc_v16qi, int);
+int __builtin_neon_vqabss_s32(int);
+__gcc_v8qi __builtin_neon_vqadd_v(__gcc_v8qi, __gcc_v8qi, int);
+signed char __builtin_neon_vqaddb_s8(signed char, signed char);
+unsigned char __builtin_neon_vqaddb_u8(unsigned char, unsigned char);
+int64_t __builtin_neon_vqaddd_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vqaddd_u64(unsigned int64_t, unsigned int64_t);
+short __builtin_neon_vqaddh_s16(short, short);
+unsigned short __builtin_neon_vqaddh_u16(unsigned short, unsigned short);
+__gcc_v16qi __builtin_neon_vqaddq_v(__gcc_v16qi, __gcc_v16qi, int);
+int __builtin_neon_vqadds_s32(int, int);
+unsigned int __builtin_neon_vqadds_u32(unsigned int, unsigned int);
+__gcc_v16qi __builtin_neon_vqdmlal_v(__gcc_v16qi, __gcc_v8qi, __gcc_v8qi, int);
+int __builtin_neon_vqdmlalh_lane_s16(int, short, __gcc_v4hi, int);
+int __builtin_neon_vqdmlalh_laneq_s16(int, short, __gcc_v8hi, int);
+int __builtin_neon_vqdmlalh_s16(int, short, short);
+int64_t __builtin_neon_vqdmlals_lane_s32(int64_t, int, __gcc_v2si, int);
+int64_t __builtin_neon_vqdmlals_laneq_s32(int64_t, int, __gcc_v4si, int);
+int64_t __builtin_neon_vqdmlals_s32(int64_t, int, int);
+__gcc_v16qi __builtin_neon_vqdmlsl_v(__gcc_v16qi, __gcc_v8qi, __gcc_v8qi, int);
+int __builtin_neon_vqdmlslh_lane_s16(int, short, __gcc_v4hi, int);
+int __builtin_neon_vqdmlslh_laneq_s16(int, short, __gcc_v8hi, int);
+int __builtin_neon_vqdmlslh_s16(int, short, short);
+int64_t __builtin_neon_vqdmlsls_lane_s32(int64_t, int, __gcc_v2si, int);
+int64_t __builtin_neon_vqdmlsls_laneq_s32(int64_t, int, __gcc_v4si, int);
+int64_t __builtin_neon_vqdmlsls_s32(int64_t, int, int);
+__gcc_v8qi __builtin_neon_vqdmulh_lane_v(__gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vqdmulh_laneq_v(__gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_vqdmulh_v(__gcc_v8qi, __gcc_v8qi, int);
+short __builtin_neon_vqdmulhh_s16(short, short);
+__gcc_v16qi __builtin_neon_vqdmulhq_lane_v(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vqdmulhq_laneq_v(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vqdmulhq_v(__gcc_v16qi, __gcc_v16qi, int);
+int __builtin_neon_vqdmulhs_s32(int, int);
+__gcc_v16qi __builtin_neon_vqdmull_v(__gcc_v8qi, __gcc_v8qi, int);
+int __builtin_neon_vqdmullh_s16(short, short);
+int64_t __builtin_neon_vqdmulls_s32(int, int);
+__gcc_v8qi __builtin_neon_vqmovn_v(__gcc_v16qi, int);
+int __builtin_neon_vqmovnd_s64(int64_t);
+unsigned int __builtin_neon_vqmovnd_u64(unsigned int64_t);
+signed char __builtin_neon_vqmovnh_s16(short);
+unsigned char __builtin_neon_vqmovnh_u16(unsigned short);
+short __builtin_neon_vqmovns_s32(int);
+unsigned short __builtin_neon_vqmovns_u32(unsigned int);
+__gcc_v8qi __builtin_neon_vqmovun_v(__gcc_v16qi, int);
+unsigned int __builtin_neon_vqmovund_s64(int64_t);
+unsigned char __builtin_neon_vqmovunh_s16(short);
+unsigned short __builtin_neon_vqmovuns_s32(int);
+__gcc_v8qi __builtin_neon_vqneg_v(__gcc_v8qi, int);
+signed char __builtin_neon_vqnegb_s8(signed char);
+int64_t __builtin_neon_vqnegd_s64(int64_t);
+short __builtin_neon_vqnegh_s16(short);
+__gcc_v16qi __builtin_neon_vqnegq_v(__gcc_v16qi, int);
+int __builtin_neon_vqnegs_s32(int);
+__gcc_v8qi __builtin_neon_vqrdmlah_s16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vqrdmlah_s32(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+short __builtin_neon_vqrdmlahh_s16(short, short, short);
+__gcc_v16qi __builtin_neon_vqrdmlahq_s16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vqrdmlahq_s32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+int __builtin_neon_vqrdmlahs_s32(int, int, int);
+__gcc_v8qi __builtin_neon_vqrdmlsh_s16(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vqrdmlsh_s32(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+short __builtin_neon_vqrdmlshh_s16(short, short, short);
+__gcc_v16qi __builtin_neon_vqrdmlshq_s16(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vqrdmlshq_s32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+int __builtin_neon_vqrdmlshs_s32(int, int, int);
+__gcc_v8qi __builtin_neon_vqrdmulh_lane_v(__gcc_v8qi, __gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vqrdmulh_laneq_v(__gcc_v8qi, __gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_vqrdmulh_v(__gcc_v8qi, __gcc_v8qi, int);
+short __builtin_neon_vqrdmulhh_s16(short, short);
+__gcc_v16qi __builtin_neon_vqrdmulhq_lane_v(__gcc_v16qi, __gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vqrdmulhq_laneq_v(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vqrdmulhq_v(__gcc_v16qi, __gcc_v16qi, int);
+int __builtin_neon_vqrdmulhs_s32(int, int);
+__gcc_v8qi __builtin_neon_vqrshl_v(__gcc_v8qi, __gcc_v8qi, int);
+signed char __builtin_neon_vqrshlb_s8(signed char, signed char);
+unsigned char __builtin_neon_vqrshlb_u8(unsigned char, signed char);
+int64_t __builtin_neon_vqrshld_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vqrshld_u64(unsigned int64_t, int64_t);
+short __builtin_neon_vqrshlh_s16(short, short);
+unsigned short __builtin_neon_vqrshlh_u16(unsigned short, short);
+__gcc_v16qi __builtin_neon_vqrshlq_v(__gcc_v16qi, __gcc_v16qi, int);
+int __builtin_neon_vqrshls_s32(int, int);
+unsigned int __builtin_neon_vqrshls_u32(unsigned int, int);
+__gcc_v8qi __builtin_neon_vqrshrn_n_v(__gcc_v16qi, int, int);
+int __builtin_neon_vqrshrnd_n_s64(int64_t, int);
+unsigned int __builtin_neon_vqrshrnd_n_u64(unsigned int64_t, int);
+signed char __builtin_neon_vqrshrnh_n_s16(short, int);
+unsigned char __builtin_neon_vqrshrnh_n_u16(unsigned short, int);
+short __builtin_neon_vqrshrns_n_s32(int, int);
+unsigned short __builtin_neon_vqrshrns_n_u32(unsigned int, int);
+__gcc_v8qi __builtin_neon_vqrshrun_n_v(__gcc_v16qi, int, int);
+unsigned int __builtin_neon_vqrshrund_n_s64(int64_t, int);
+unsigned char __builtin_neon_vqrshrunh_n_s16(short, int);
+unsigned short __builtin_neon_vqrshruns_n_s32(int, int);
+__gcc_v8qi __builtin_neon_vqshl_n_v(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vqshl_v(__gcc_v8qi, __gcc_v8qi, int);
+signed char __builtin_neon_vqshlb_n_s8(signed char, int);
+unsigned char __builtin_neon_vqshlb_n_u8(unsigned char, int);
+signed char __builtin_neon_vqshlb_s8(signed char, signed char);
+unsigned char __builtin_neon_vqshlb_u8(unsigned char, signed char);
+int64_t __builtin_neon_vqshld_n_s64(int64_t, int);
+unsigned int64_t __builtin_neon_vqshld_n_u64(unsigned int64_t, int);
+int64_t __builtin_neon_vqshld_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vqshld_u64(unsigned int64_t, int64_t);
+short __builtin_neon_vqshlh_n_s16(short, int);
+unsigned short __builtin_neon_vqshlh_n_u16(unsigned short, int);
+short __builtin_neon_vqshlh_s16(short, short);
+unsigned short __builtin_neon_vqshlh_u16(unsigned short, short);
+__gcc_v16qi __builtin_neon_vqshlq_n_v(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vqshlq_v(__gcc_v16qi, __gcc_v16qi, int);
+int __builtin_neon_vqshls_n_s32(int, int);
+unsigned int __builtin_neon_vqshls_n_u32(unsigned int, int);
+int __builtin_neon_vqshls_s32(int, int);
+unsigned int __builtin_neon_vqshls_u32(unsigned int, int);
+__gcc_v8qi __builtin_neon_vqshlu_n_v(__gcc_v8qi, int, int);
+signed char __builtin_neon_vqshlub_n_s8(signed char, int);
+int64_t __builtin_neon_vqshlud_n_s64(int64_t, int);
+short __builtin_neon_vqshluh_n_s16(short, int);
+__gcc_v16qi __builtin_neon_vqshluq_n_v(__gcc_v16qi, int, int);
+int __builtin_neon_vqshlus_n_s32(int, int);
+__gcc_v8qi __builtin_neon_vqshrn_n_v(__gcc_v16qi, int, int);
+int __builtin_neon_vqshrnd_n_s64(int64_t, int);
+unsigned int __builtin_neon_vqshrnd_n_u64(unsigned int64_t, int);
+signed char __builtin_neon_vqshrnh_n_s16(short, int);
+unsigned char __builtin_neon_vqshrnh_n_u16(unsigned short, int);
+short __builtin_neon_vqshrns_n_s32(int, int);
+unsigned short __builtin_neon_vqshrns_n_u32(unsigned int, int);
+__gcc_v8qi __builtin_neon_vqshrun_n_v(__gcc_v16qi, int, int);
+unsigned int __builtin_neon_vqshrund_n_s64(int64_t, int);
+unsigned char __builtin_neon_vqshrunh_n_s16(short, int);
+unsigned short __builtin_neon_vqshruns_n_s32(int, int);
+__gcc_v8qi __builtin_neon_vqsub_v(__gcc_v8qi, __gcc_v8qi, int);
+signed char __builtin_neon_vqsubb_s8(signed char, signed char);
+unsigned char __builtin_neon_vqsubb_u8(unsigned char, unsigned char);
+int64_t __builtin_neon_vqsubd_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vqsubd_u64(unsigned int64_t, unsigned int64_t);
+short __builtin_neon_vqsubh_s16(short, short);
+unsigned short __builtin_neon_vqsubh_u16(unsigned short, unsigned short);
+__gcc_v16qi __builtin_neon_vqsubq_v(__gcc_v16qi, __gcc_v16qi, int);
+int __builtin_neon_vqsubs_s32(int, int);
+unsigned int __builtin_neon_vqsubs_u32(unsigned int, unsigned int);
+__gcc_v8qi __builtin_neon_vqtbl1_v(__gcc_v16qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vqtbl1q_v(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vqtbl2_v(__gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vqtbl2q_v(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vqtbl3_v(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vqtbl3q_v(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vqtbl4_v(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vqtbl4q_v(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vqtbx1_v(__gcc_v8qi, __gcc_v16qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vqtbx1q_v(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vqtbx2_v(__gcc_v8qi, __gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vqtbx2q_v(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vqtbx3_v(__gcc_v8qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vqtbx3q_v(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vqtbx4_v(__gcc_v8qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vqtbx4q_v(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vraddhn_v(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrax1q_u64(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vrbit_v(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vrbitq_v(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vrecpe_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrecpe_v(__gcc_v8qi, int);
+double __builtin_neon_vrecped_f64(double);
+__gcc_v16qi __builtin_neon_vrecpeq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrecpeq_v(__gcc_v16qi, int);
+float __builtin_neon_vrecpes_f32(float);
+__gcc_v8qi __builtin_neon_vrecps_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrecps_v(__gcc_v8qi, __gcc_v8qi, int);
+double __builtin_neon_vrecpsd_f64(double, double);
+__gcc_v16qi __builtin_neon_vrecpsq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrecpsq_v(__gcc_v16qi, __gcc_v16qi, int);
+float __builtin_neon_vrecpss_f32(float, float);
+double __builtin_neon_vrecpxd_f64(double);
+float __builtin_neon_vrecpxs_f32(float);
+__gcc_v8qi __builtin_neon_vrhadd_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vrhaddq_v(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vrnd32x_f32(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrnd32x_f64(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vrnd32xq_f32(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrnd32xq_f64(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vrnd32z_f32(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrnd32z_f64(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vrnd32zq_f32(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrnd32zq_f64(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vrnd64x_f32(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrnd64x_f64(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vrnd64xq_f32(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrnd64xq_f64(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vrnd64z_f32(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrnd64z_f64(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vrnd64zq_f32(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrnd64zq_f64(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vrnd_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrnd_v(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrnda_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrnda_v(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vrndaq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrndaq_v(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vrndi_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrndi_v(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vrndiq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrndiq_v(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vrndm_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrndm_v(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vrndmq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrndmq_v(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vrndn_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrndn_v(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vrndnq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrndnq_v(__gcc_v16qi, int);
+float __builtin_neon_vrndns_f32(float);
+__gcc_v8qi __builtin_neon_vrndp_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrndp_v(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vrndpq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrndpq_v(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrndq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrndq_v(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vrndx_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrndx_v(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vrndxq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrndxq_v(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vrshl_v(__gcc_v8qi, __gcc_v8qi, int);
+int64_t __builtin_neon_vrshld_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vrshld_u64(unsigned int64_t, int64_t);
+__gcc_v16qi __builtin_neon_vrshlq_v(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vrshr_n_v(__gcc_v8qi, int, int);
+int64_t __builtin_neon_vrshrd_n_s64(int64_t, int);
+unsigned int64_t __builtin_neon_vrshrd_n_u64(unsigned int64_t, int);
+__gcc_v8qi __builtin_neon_vrshrn_n_v(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vrshrq_n_v(__gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_vrsqrte_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrsqrte_v(__gcc_v8qi, int);
+double __builtin_neon_vrsqrted_f64(double);
+__gcc_v16qi __builtin_neon_vrsqrteq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrsqrteq_v(__gcc_v16qi, int);
+float __builtin_neon_vrsqrtes_f32(float);
+__gcc_v8qi __builtin_neon_vrsqrts_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vrsqrts_v(__gcc_v8qi, __gcc_v8qi, int);
+double __builtin_neon_vrsqrtsd_f64(double, double);
+__gcc_v16qi __builtin_neon_vrsqrtsq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vrsqrtsq_v(__gcc_v16qi, __gcc_v16qi, int);
+float __builtin_neon_vrsqrtss_f32(float, float);
+__gcc_v8qi __builtin_neon_vrsra_n_v(__gcc_v8qi, __gcc_v8qi, int, int);
+int64_t __builtin_neon_vrsrad_n_s64(int64_t, int64_t, int);
+unsigned int64_t __builtin_neon_vrsrad_n_u64(unsigned int64_t, unsigned int64_t, int);
+__gcc_v16qi __builtin_neon_vrsraq_n_v(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_vrsubhn_v(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vscale_f16(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vscale_f32(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vscaleq_f16(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vscaleq_f32(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vscaleq_f64(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v2sf __builtin_neon_vset_lane_f32(float, __gcc_v2sf, int);
+__gcc_v4hi __builtin_neon_vset_lane_i16(short, __gcc_v4hi, int);
+__gcc_v2si __builtin_neon_vset_lane_i32(int, __gcc_v2si, int);
+__gcc_v8qi __builtin_neon_vset_lane_i8(signed char, __gcc_v8qi, int);
+__gcc_v8hf __builtin_neon_vsetq_lane_bf16(__bf16, __gcc_v8hf, int);
+__gcc_v4sf __builtin_neon_vsetq_lane_f32(float, __gcc_v4sf, int);
+__gcc_v2df __builtin_neon_vsetq_lane_f64(double, __gcc_v2df, int);
+__gcc_v8hi __builtin_neon_vsetq_lane_i16(short, __gcc_v8hi, int);
+__gcc_v4si __builtin_neon_vsetq_lane_i32(int, __gcc_v4si, int);
+__gcc_v16qi __builtin_neon_vsetq_lane_i8(signed char, __gcc_v16qi, int);
+__gcc_v4si __builtin_neon_vsha1cq_u32(__gcc_v4usi, unsigned int, __gcc_v4usi);
+unsigned int __builtin_neon_vsha1h_u32(unsigned int);
+__gcc_v4si __builtin_neon_vsha1mq_u32(__gcc_v4usi, unsigned int, __gcc_v4usi);
+__gcc_v4si __builtin_neon_vsha1pq_u32(__gcc_v4usi, unsigned int, __gcc_v4usi);
+__gcc_v16qi __builtin_neon_vsha1su0q_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsha1su1q_u32(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsha256h2q_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsha256hq_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsha256su0q_u32(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsha256su1q_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsha512h2q_u64(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsha512hq_u64(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsha512su0q_u64(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsha512su1q_u64(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vshl_n_v(__gcc_v8qi, int, int);
+__gcc_v8qi __builtin_neon_vshl_v(__gcc_v8qi, __gcc_v8qi, int);
+int64_t __builtin_neon_vshld_n_s64(int64_t, int);
+unsigned int64_t __builtin_neon_vshld_n_u64(unsigned int64_t, int);
+int64_t __builtin_neon_vshld_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vshld_u64(unsigned int64_t, int64_t);
+__gcc_v16qi __builtin_neon_vshll_n_v(__gcc_v8qi, int, int);
+__gcc_v16qi __builtin_neon_vshlq_n_v(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vshlq_v(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vshr_n_v(__gcc_v8qi, int, int);
+int64_t __builtin_neon_vshrd_n_s64(int64_t, int);
+unsigned int64_t __builtin_neon_vshrd_n_u64(unsigned int64_t, int);
+__gcc_v8qi __builtin_neon_vshrn_n_v(__gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vshrq_n_v(__gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_vsli_n_v(__gcc_v8qi, __gcc_v8qi, int, int);
+int64_t __builtin_neon_vslid_n_s64(int64_t, int64_t, int);
+unsigned int64_t __builtin_neon_vslid_n_u64(unsigned int64_t, unsigned int64_t, int);
+__gcc_v16qi __builtin_neon_vsliq_n_v(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vsm3partw1q_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsm3partw2q_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsm3ss1q_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsm3tt1aq_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vsm3tt1bq_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vsm3tt2aq_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vsm3tt2bq_u32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v16qi __builtin_neon_vsm4ekeyq_u32(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsm4eq_u32(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vsqadd_v(__gcc_v8qi, __gcc_v8qi, int);
+unsigned char __builtin_neon_vsqaddb_u8(unsigned char, signed char);
+unsigned int64_t __builtin_neon_vsqaddd_u64(unsigned int64_t, int64_t);
+unsigned short __builtin_neon_vsqaddh_u16(unsigned short, short);
+__gcc_v16qi __builtin_neon_vsqaddq_v(__gcc_v16qi, __gcc_v16qi, int);
+unsigned int __builtin_neon_vsqadds_u32(unsigned int, int);
+__gcc_v8qi __builtin_neon_vsqrt_f16(__gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vsqrt_v(__gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vsqrtq_f16(__gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vsqrtq_v(__gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vsra_n_v(__gcc_v8qi, __gcc_v8qi, int, int);
+int64_t __builtin_neon_vsrad_n_s64(int64_t, int64_t, int);
+unsigned int64_t __builtin_neon_vsrad_n_u64(unsigned int64_t, unsigned int64_t, int);
+__gcc_v16qi __builtin_neon_vsraq_n_v(__gcc_v16qi, __gcc_v16qi, int, int);
+__gcc_v8qi __builtin_neon_vsri_n_v(__gcc_v8qi, __gcc_v8qi, int, int);
+int64_t __builtin_neon_vsrid_n_s64(int64_t, int64_t, int);
+unsigned int64_t __builtin_neon_vsrid_n_u64(unsigned int64_t, unsigned int64_t, int);
+__gcc_v16qi __builtin_neon_vsriq_n_v(__gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vst1_bf16(void *, __gcc_v8qi, int);
+void __builtin_neon_vst1_bf16_x2(void *, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vst1_bf16_x3(void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vst1_bf16_x4(void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vst1_lane_bf16(void *, __gcc_v8qi, int, int);
+void __builtin_neon_vst1_lane_v(void *, __gcc_v8qi, int, int);
+void __builtin_neon_vst1_v(void *, __gcc_v8qi, int);
+void __builtin_neon_vst1_x2_v(void *, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vst1_x3_v(void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vst1_x4_v(void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vst1q_bf16(void *, __gcc_v16qi, int);
+void __builtin_neon_vst1q_bf16_x2(void *, __gcc_v16qi, __gcc_v16qi, int);
+void __builtin_neon_vst1q_bf16_x3(void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+void __builtin_neon_vst1q_bf16_x4(void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+void __builtin_neon_vst1q_lane_bf16(void *, __gcc_v16qi, int, int);
+void __builtin_neon_vst1q_lane_v(void *, __gcc_v16qi, int, int);
+void __builtin_neon_vst1q_v(void *, __gcc_v16qi, int);
+void __builtin_neon_vst1q_x2_v(void *, __gcc_v16qi, __gcc_v16qi, int);
+void __builtin_neon_vst1q_x3_v(void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+void __builtin_neon_vst1q_x4_v(void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+void __builtin_neon_vst2_bf16(void *, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vst2_lane_bf16(void *, __gcc_v8qi, __gcc_v8qi, int, int);
+void __builtin_neon_vst2_lane_v(void *, __gcc_v8qi, __gcc_v8qi, int, int);
+void __builtin_neon_vst2_v(void *, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vst2q_bf16(void *, __gcc_v16qi, __gcc_v16qi, int);
+void __builtin_neon_vst2q_lane_bf16(void *, __gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vst2q_lane_v(void *, __gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vst2q_v(void *, __gcc_v16qi, __gcc_v16qi, int);
+void __builtin_neon_vst3_bf16(void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vst3_lane_bf16(void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int, int);
+void __builtin_neon_vst3_lane_v(void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int, int);
+void __builtin_neon_vst3_v(void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vst3q_bf16(void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+void __builtin_neon_vst3q_lane_bf16(void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vst3q_lane_v(void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vst3q_v(void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+void __builtin_neon_vst4_bf16(void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vst4_lane_bf16(void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int, int);
+void __builtin_neon_vst4_lane_v(void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int, int);
+void __builtin_neon_vst4_v(void *, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vst4q_bf16(void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+void __builtin_neon_vst4q_lane_bf16(void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vst4q_lane_v(void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vst4q_v(void *, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+void __builtin_neon_vstl1_lane_f64(void *, __gcc_v8qi, int, int);
+void __builtin_neon_vstl1_lane_p64(void *, __gcc_v8qi, int, int);
+void __builtin_neon_vstl1_lane_s64(void *, __gcc_v8qi, int, int);
+void __builtin_neon_vstl1_lane_u64(void *, __gcc_v8qi, int, int);
+void __builtin_neon_vstl1q_lane_f64(void *, __gcc_v16qi, int, int);
+void __builtin_neon_vstl1q_lane_p64(void *, __gcc_v16qi, int, int);
+void __builtin_neon_vstl1q_lane_s64(void *, __gcc_v16qi, int, int);
+void __builtin_neon_vstl1q_lane_u64(void *, __gcc_v16qi, int, int);
+void __builtin_neon_vstrq_p128(void *, unsigned __int128_t);
+int64_t __builtin_neon_vsubd_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vsubd_u64(unsigned int64_t, unsigned int64_t);
+__gcc_v8qi __builtin_neon_vsubhn_v(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vtbl1_v(__gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vtbl2_v(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vtbl3_v(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vtbl4_v(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vtbx1_v(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vtbx2_v(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vtbx3_v(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v8qi __builtin_neon_vtbx4_v(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vtrn_v(void *, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vtrnq_v(void *, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vtst_v(__gcc_v8qi, __gcc_v8qi, int);
+unsigned int64_t __builtin_neon_vtstd_s64(int64_t, int64_t);
+unsigned int64_t __builtin_neon_vtstd_u64(unsigned int64_t, unsigned int64_t);
+__gcc_v16qi __builtin_neon_vtstq_v(__gcc_v16qi, __gcc_v16qi, int);
+__gcc_v8qi __builtin_neon_vuqadd_v(__gcc_v8qi, __gcc_v8qi, int);
+signed char __builtin_neon_vuqaddb_s8(signed char, unsigned char);
+int64_t __builtin_neon_vuqaddd_s64(int64_t, unsigned int64_t);
+short __builtin_neon_vuqaddh_s16(short, unsigned short);
+__gcc_v16qi __builtin_neon_vuqaddq_v(__gcc_v16qi, __gcc_v16qi, int);
+int __builtin_neon_vuqadds_s32(int, unsigned int);
+__gcc_v8qi __builtin_neon_vusdot_s32(__gcc_v8qi, __gcc_v8qi, __gcc_v8qi, int);
+__gcc_v16qi __builtin_neon_vusdotq_s32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vusmmlaq_s32(__gcc_v16qi, __gcc_v16qi, __gcc_v16qi, int);
+void __builtin_neon_vuzp_v(void *, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vuzpq_v(void *, __gcc_v16qi, __gcc_v16qi, int);
+__gcc_v16qi __builtin_neon_vxarq_u64(__gcc_v16qi, __gcc_v16qi, int, int);
+void __builtin_neon_vzip_v(void *, __gcc_v8qi, __gcc_v8qi, int);
+void __builtin_neon_vzipq_v(void *, __gcc_v16qi, __gcc_v16qi, int);
+// clang-format on
diff --git a/src/ansi-c/library/arm_neon.c b/src/ansi-c/library/arm_neon.c
new file mode 100644
index 00000000000..2cab4c2ad10
--- /dev/null
+++ b/src/ansi-c/library/arm_neon.c
@@ -0,0 +1,1895 @@
+/* FUNCTION: __builtin_neon_vabd_v */
+
+// Arm instruction(s): SABD, UABD (per ACLE advsimd.md)
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef signed char __gcc_v8qi_s __attribute__((__vector_size__(8)));
+typedef short __gcc_v4hi_s __attribute__((__vector_size__(8)));
+typedef int __gcc_v2si_s __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_neon_vabd_v(__gcc_v8qi a, __gcc_v8qi b, int type)
+{
+  switch(type)
+  {
+  case 0: // NeonTypeFlags Int8: signed 8-bit lanes
+  {
+    __gcc_v8qi_s x = (__gcc_v8qi_s)a, y = (__gcc_v8qi_s)b, r;
+    for(int i = 0; i < 8; i++)
+    {
+      int d = (int)x[i] - (int)y[i];
+      r[i] = d < 0 ? -d : d;
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 1: // NeonTypeFlags Int16: signed 16-bit lanes
+  {
+    __gcc_v4hi_s x = (__gcc_v4hi_s)a, y = (__gcc_v4hi_s)b, r;
+    for(int i = 0; i < 4; i++)
+    {
+      int d = (int)x[i] - (int)y[i];
+      r[i] = d < 0 ? -d : d;
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 2: // NeonTypeFlags Int32: signed 32-bit lanes
+  {
+    __gcc_v2si_s x = (__gcc_v2si_s)a, y = (__gcc_v2si_s)b, r;
+    for(int i = 0; i < 2; i++)
+    {
+      long long d = (long long)x[i] - (long long)y[i];
+      r[i] = d < 0 ? -d : d;
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 16: // Int8 | unsigned (0x10)
+  {
+    __gcc_v8qi_u x = (__gcc_v8qi_u)a, y = (__gcc_v8qi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : y[i] - x[i];
+    return (__gcc_v8qi)r;
+  }
+  case 17: // Int16 | unsigned (0x10)
+  {
+    __gcc_v4hi_u x = (__gcc_v4hi_u)a, y = (__gcc_v4hi_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : y[i] - x[i];
+    return (__gcc_v8qi)r;
+  }
+  case 18: // Int32 | unsigned (0x10)
+  {
+    __gcc_v2si_u x = (__gcc_v2si_u)a, y = (__gcc_v2si_u)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : y[i] - x[i];
+    return (__gcc_v8qi)r;
+  }
+  }
+
+  __gcc_v8qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vabdq_v */
+
+// Arm instruction(s): SABD, UABD (per ACLE advsimd.md)
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef short __gcc_v8hi_s __attribute__((__vector_size__(16)));
+typedef int __gcc_v4si_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_neon_vabdq_v(__gcc_v16qi a, __gcc_v16qi b, int type)
+{
+  switch(type)
+  {
+  case 32: // Int8 | quad (0x20): signed 8-bit lanes
+  {
+    __gcc_v16qi_s x = (__gcc_v16qi_s)a, y = (__gcc_v16qi_s)b, r;
+    for(int i = 0; i < 16; i++)
+    {
+      int d = (int)x[i] - (int)y[i];
+      r[i] = d < 0 ? -d : d;
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 33: // Int16 | quad (0x20): signed 16-bit lanes
+  {
+    __gcc_v8hi_s x = (__gcc_v8hi_s)a, y = (__gcc_v8hi_s)b, r;
+    for(int i = 0; i < 8; i++)
+    {
+      int d = (int)x[i] - (int)y[i];
+      r[i] = d < 0 ? -d : d;
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 34: // Int32 | quad (0x20): signed 32-bit lanes
+  {
+    __gcc_v4si_s x = (__gcc_v4si_s)a, y = (__gcc_v4si_s)b, r;
+    for(int i = 0; i < 4; i++)
+    {
+      long long d = (long long)x[i] - (long long)y[i];
+      r[i] = d < 0 ? -d : d;
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 48: // Int8 | unsigned (0x10) | quad (0x20)
+  {
+    __gcc_v16qi_u x = (__gcc_v16qi_u)a, y = (__gcc_v16qi_u)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : y[i] - x[i];
+    return (__gcc_v16qi)r;
+  }
+  case 49: // Int16 | unsigned (0x10) | quad (0x20)
+  {
+    __gcc_v8hi_u x = (__gcc_v8hi_u)a, y = (__gcc_v8hi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : y[i] - x[i];
+    return (__gcc_v16qi)r;
+  }
+  case 50: // Int32 | unsigned (0x10) | quad (0x20)
+  {
+    __gcc_v4si_u x = (__gcc_v4si_u)a, y = (__gcc_v4si_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : y[i] - x[i];
+    return (__gcc_v16qi)r;
+  }
+  }
+
+  __gcc_v16qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vbsl_v */
+
+// Arm instruction(s): BSL (per ACLE advsimd.md)
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+
+__gcc_v8qi
+__builtin_neon_vbsl_v(__gcc_v8qi mask, __gcc_v8qi a, __gcc_v8qi b, int type)
+{
+  (void)type;
+  return (mask & a) | (~mask & b);
+}
+
+/* FUNCTION: __builtin_neon_vbslq_v */
+
+// Arm instruction(s): BSL (per ACLE advsimd.md)
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+
+__gcc_v16qi
+__builtin_neon_vbslq_v(__gcc_v16qi mask, __gcc_v16qi a, __gcc_v16qi b, int type)
+{
+  (void)type;
+  return (mask & a) | (~mask & b);
+}
+
+/* FUNCTION: __builtin_neon_vhadd_v */
+
+// Arm instruction(s): SHADD, UHADD (per ACLE advsimd.md)
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef signed char __gcc_v8qi_s __attribute__((__vector_size__(8)));
+typedef short __gcc_v4hi_s __attribute__((__vector_size__(8)));
+typedef int __gcc_v2si_s __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_neon_vhadd_v(__gcc_v8qi a, __gcc_v8qi b, int type)
+{
+  switch(type)
+  {
+  case 0: // NeonTypeFlags Int8: signed 8-bit lanes
+  {
+    __gcc_v8qi_s x = (__gcc_v8qi_s)a, y = (__gcc_v8qi_s)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = ((int)x[i] + (int)y[i]) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 1: // NeonTypeFlags Int16: signed 16-bit lanes
+  {
+    __gcc_v4hi_s x = (__gcc_v4hi_s)a, y = (__gcc_v4hi_s)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = ((int)x[i] + (int)y[i]) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 2: // NeonTypeFlags Int32: signed 32-bit lanes
+  {
+    __gcc_v2si_s x = (__gcc_v2si_s)a, y = (__gcc_v2si_s)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = ((long long)x[i] + (long long)y[i]) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 16: // Int8 | unsigned (0x10)
+  {
+    __gcc_v8qi_u x = (__gcc_v8qi_u)a, y = (__gcc_v8qi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = ((int)x[i] + (int)y[i]) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 17: // Int16 | unsigned (0x10)
+  {
+    __gcc_v4hi_u x = (__gcc_v4hi_u)a, y = (__gcc_v4hi_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = ((int)x[i] + (int)y[i]) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 18: // Int32 | unsigned (0x10)
+  {
+    __gcc_v2si_u x = (__gcc_v2si_u)a, y = (__gcc_v2si_u)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = ((long long)x[i] + (long long)y[i]) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  }
+
+  __gcc_v8qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vhaddq_v */
+
+// Arm instruction(s): SHADD, UHADD (per ACLE advsimd.md)
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef short __gcc_v8hi_s __attribute__((__vector_size__(16)));
+typedef int __gcc_v4si_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_neon_vhaddq_v(__gcc_v16qi a, __gcc_v16qi b, int type)
+{
+  switch(type)
+  {
+  case 32:
+  {
+    __gcc_v16qi_s x = (__gcc_v16qi_s)a, y = (__gcc_v16qi_s)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = ((int)x[i] + (int)y[i]) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 33:
+  {
+    __gcc_v8hi_s x = (__gcc_v8hi_s)a, y = (__gcc_v8hi_s)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = ((int)x[i] + (int)y[i]) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 34:
+  {
+    __gcc_v4si_s x = (__gcc_v4si_s)a, y = (__gcc_v4si_s)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = ((long long)x[i] + (long long)y[i]) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 48:
+  {
+    __gcc_v16qi_u x = (__gcc_v16qi_u)a, y = (__gcc_v16qi_u)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = ((int)x[i] + (int)y[i]) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 49:
+  {
+    __gcc_v8hi_u x = (__gcc_v8hi_u)a, y = (__gcc_v8hi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = ((int)x[i] + (int)y[i]) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 50:
+  {
+    __gcc_v4si_u x = (__gcc_v4si_u)a, y = (__gcc_v4si_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = ((long long)x[i] + (long long)y[i]) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  }
+
+  __gcc_v16qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vhsub_v */
+
+// Arm instruction(s): SHSUB, UHSUB (per ACLE advsimd.md)
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef signed char __gcc_v8qi_s __attribute__((__vector_size__(8)));
+typedef short __gcc_v4hi_s __attribute__((__vector_size__(8)));
+typedef int __gcc_v2si_s __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_neon_vhsub_v(__gcc_v8qi a, __gcc_v8qi b, int type)
+{
+  switch(type)
+  {
+  case 0:
+  {
+    __gcc_v8qi_s x = (__gcc_v8qi_s)a, y = (__gcc_v8qi_s)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = ((int)x[i] - (int)y[i]) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 1:
+  {
+    __gcc_v4hi_s x = (__gcc_v4hi_s)a, y = (__gcc_v4hi_s)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = ((int)x[i] - (int)y[i]) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 2:
+  {
+    __gcc_v2si_s x = (__gcc_v2si_s)a, y = (__gcc_v2si_s)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = ((long long)x[i] - (long long)y[i]) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 16:
+  {
+    __gcc_v8qi_u x = (__gcc_v8qi_u)a, y = (__gcc_v8qi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = ((int)x[i] - (int)y[i]) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 17:
+  {
+    __gcc_v4hi_u x = (__gcc_v4hi_u)a, y = (__gcc_v4hi_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = ((int)x[i] - (int)y[i]) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 18:
+  {
+    __gcc_v2si_u x = (__gcc_v2si_u)a, y = (__gcc_v2si_u)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = ((long long)x[i] - (long long)y[i]) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  }
+
+  __gcc_v8qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vhsubq_v */
+
+// Arm instruction(s): SHSUB, UHSUB (per ACLE advsimd.md)
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef short __gcc_v8hi_s __attribute__((__vector_size__(16)));
+typedef int __gcc_v4si_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_neon_vhsubq_v(__gcc_v16qi a, __gcc_v16qi b, int type)
+{
+  switch(type)
+  {
+  case 32:
+  {
+    __gcc_v16qi_s x = (__gcc_v16qi_s)a, y = (__gcc_v16qi_s)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = ((int)x[i] - (int)y[i]) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 33:
+  {
+    __gcc_v8hi_s x = (__gcc_v8hi_s)a, y = (__gcc_v8hi_s)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = ((int)x[i] - (int)y[i]) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 34:
+  {
+    __gcc_v4si_s x = (__gcc_v4si_s)a, y = (__gcc_v4si_s)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = ((long long)x[i] - (long long)y[i]) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 48:
+  {
+    __gcc_v16qi_u x = (__gcc_v16qi_u)a, y = (__gcc_v16qi_u)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = ((int)x[i] - (int)y[i]) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 49:
+  {
+    __gcc_v8hi_u x = (__gcc_v8hi_u)a, y = (__gcc_v8hi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = ((int)x[i] - (int)y[i]) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 50:
+  {
+    __gcc_v4si_u x = (__gcc_v4si_u)a, y = (__gcc_v4si_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = ((long long)x[i] - (long long)y[i]) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  }
+
+  __gcc_v16qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vmax_v */
+
+// Arm instruction(s): SMAX, UMAX (per ACLE advsimd.md)
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef signed char __gcc_v8qi_s __attribute__((__vector_size__(8)));
+typedef short __gcc_v4hi_s __attribute__((__vector_size__(8)));
+typedef int __gcc_v2si_s __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_neon_vmax_v(__gcc_v8qi a, __gcc_v8qi b, int type)
+{
+  switch(type)
+  {
+  case 0:
+  {
+    __gcc_v8qi_s x = (__gcc_v8qi_s)a, y = (__gcc_v8qi_s)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = x[i] > y[i] ? x[i] : y[i];
+    return (__gcc_v8qi)r;
+  }
+  case 1:
+  {
+    __gcc_v4hi_s x = (__gcc_v4hi_s)a, y = (__gcc_v4hi_s)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = x[i] > y[i] ? x[i] : y[i];
+    return (__gcc_v8qi)r;
+  }
+  case 2:
+  {
+    __gcc_v2si_s x = (__gcc_v2si_s)a, y = (__gcc_v2si_s)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = x[i] > y[i] ? x[i] : y[i];
+    return (__gcc_v8qi)r;
+  }
+  case 16:
+  {
+    __gcc_v8qi_u x = (__gcc_v8qi_u)a, y = (__gcc_v8qi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = x[i] > y[i] ? x[i] : y[i];
+    return (__gcc_v8qi)r;
+  }
+  case 17:
+  {
+    __gcc_v4hi_u x = (__gcc_v4hi_u)a, y = (__gcc_v4hi_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = x[i] > y[i] ? x[i] : y[i];
+    return (__gcc_v8qi)r;
+  }
+  case 18:
+  {
+    __gcc_v2si_u x = (__gcc_v2si_u)a, y = (__gcc_v2si_u)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = x[i] > y[i] ? x[i] : y[i];
+    return (__gcc_v8qi)r;
+  }
+  }
+
+  __gcc_v8qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vmaxq_v */
+
+// Arm instruction(s): SMAX, UMAX (per ACLE advsimd.md)
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef short __gcc_v8hi_s __attribute__((__vector_size__(16)));
+typedef int __gcc_v4si_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_neon_vmaxq_v(__gcc_v16qi a, __gcc_v16qi b, int type)
+{
+  switch(type)
+  {
+  case 32:
+  {
+    __gcc_v16qi_s x = (__gcc_v16qi_s)a, y = (__gcc_v16qi_s)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = x[i] > y[i] ? x[i] : y[i];
+    return (__gcc_v16qi)r;
+  }
+  case 33:
+  {
+    __gcc_v8hi_s x = (__gcc_v8hi_s)a, y = (__gcc_v8hi_s)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = x[i] > y[i] ? x[i] : y[i];
+    return (__gcc_v16qi)r;
+  }
+  case 34:
+  {
+    __gcc_v4si_s x = (__gcc_v4si_s)a, y = (__gcc_v4si_s)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = x[i] > y[i] ? x[i] : y[i];
+    return (__gcc_v16qi)r;
+  }
+  case 48:
+  {
+    __gcc_v16qi_u x = (__gcc_v16qi_u)a, y = (__gcc_v16qi_u)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = x[i] > y[i] ? x[i] : y[i];
+    return (__gcc_v16qi)r;
+  }
+  case 49:
+  {
+    __gcc_v8hi_u x = (__gcc_v8hi_u)a, y = (__gcc_v8hi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = x[i] > y[i] ? x[i] : y[i];
+    return (__gcc_v16qi)r;
+  }
+  case 50:
+  {
+    __gcc_v4si_u x = (__gcc_v4si_u)a, y = (__gcc_v4si_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = x[i] > y[i] ? x[i] : y[i];
+    return (__gcc_v16qi)r;
+  }
+  }
+
+  __gcc_v16qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vmin_v */
+
+// Arm instruction(s): SMIN, UMIN (per ACLE advsimd.md)
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef signed char __gcc_v8qi_s __attribute__((__vector_size__(8)));
+typedef short __gcc_v4hi_s __attribute__((__vector_size__(8)));
+typedef int __gcc_v2si_s __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_neon_vmin_v(__gcc_v8qi a, __gcc_v8qi b, int type)
+{
+  switch(type)
+  {
+  case 0:
+  {
+    __gcc_v8qi_s x = (__gcc_v8qi_s)a, y = (__gcc_v8qi_s)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = x[i] < y[i] ? x[i] : y[i];
+    return (__gcc_v8qi)r;
+  }
+  case 1:
+  {
+    __gcc_v4hi_s x = (__gcc_v4hi_s)a, y = (__gcc_v4hi_s)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = x[i] < y[i] ? x[i] : y[i];
+    return (__gcc_v8qi)r;
+  }
+  case 2:
+  {
+    __gcc_v2si_s x = (__gcc_v2si_s)a, y = (__gcc_v2si_s)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = x[i] < y[i] ? x[i] : y[i];
+    return (__gcc_v8qi)r;
+  }
+  case 16:
+  {
+    __gcc_v8qi_u x = (__gcc_v8qi_u)a, y = (__gcc_v8qi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = x[i] < y[i] ? x[i] : y[i];
+    return (__gcc_v8qi)r;
+  }
+  case 17:
+  {
+    __gcc_v4hi_u x = (__gcc_v4hi_u)a, y = (__gcc_v4hi_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = x[i] < y[i] ? x[i] : y[i];
+    return (__gcc_v8qi)r;
+  }
+  case 18:
+  {
+    __gcc_v2si_u x = (__gcc_v2si_u)a, y = (__gcc_v2si_u)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = x[i] < y[i] ? x[i] : y[i];
+    return (__gcc_v8qi)r;
+  }
+  }
+
+  __gcc_v8qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vminq_v */
+
+// Arm instruction(s): SMIN, UMIN (per ACLE advsimd.md)
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef short __gcc_v8hi_s __attribute__((__vector_size__(16)));
+typedef int __gcc_v4si_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_neon_vminq_v(__gcc_v16qi a, __gcc_v16qi b, int type)
+{
+  switch(type)
+  {
+  case 32:
+  {
+    __gcc_v16qi_s x = (__gcc_v16qi_s)a, y = (__gcc_v16qi_s)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = x[i] < y[i] ? x[i] : y[i];
+    return (__gcc_v16qi)r;
+  }
+  case 33:
+  {
+    __gcc_v8hi_s x = (__gcc_v8hi_s)a, y = (__gcc_v8hi_s)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = x[i] < y[i] ? x[i] : y[i];
+    return (__gcc_v16qi)r;
+  }
+  case 34:
+  {
+    __gcc_v4si_s x = (__gcc_v4si_s)a, y = (__gcc_v4si_s)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = x[i] < y[i] ? x[i] : y[i];
+    return (__gcc_v16qi)r;
+  }
+  case 48:
+  {
+    __gcc_v16qi_u x = (__gcc_v16qi_u)a, y = (__gcc_v16qi_u)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = x[i] < y[i] ? x[i] : y[i];
+    return (__gcc_v16qi)r;
+  }
+  case 49:
+  {
+    __gcc_v8hi_u x = (__gcc_v8hi_u)a, y = (__gcc_v8hi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = x[i] < y[i] ? x[i] : y[i];
+    return (__gcc_v16qi)r;
+  }
+  case 50:
+  {
+    __gcc_v4si_u x = (__gcc_v4si_u)a, y = (__gcc_v4si_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = x[i] < y[i] ? x[i] : y[i];
+    return (__gcc_v16qi)r;
+  }
+  }
+
+  __gcc_v16qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vpadd_v */
+
+// Arm instruction(s): ADDP (per ACLE advsimd.md)
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef signed char __gcc_v8qi_s __attribute__((__vector_size__(8)));
+typedef short __gcc_v4hi_s __attribute__((__vector_size__(8)));
+typedef int __gcc_v2si_s __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_neon_vpadd_v(__gcc_v8qi a, __gcc_v8qi b, int type)
+{
+  switch(type)
+  {
+  case 0:
+  {
+    __gcc_v8qi_s x = (__gcc_v8qi_s)a, y = (__gcc_v8qi_s)b, r;
+    int h = 8 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned char)x[2 * i] + (unsigned char)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = (unsigned char)y[2 * i] + (unsigned char)y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 1:
+  {
+    __gcc_v4hi_s x = (__gcc_v4hi_s)a, y = (__gcc_v4hi_s)b, r;
+    int h = 4 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned short)x[2 * i] + (unsigned short)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = (unsigned short)y[2 * i] + (unsigned short)y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 2:
+  {
+    __gcc_v2si_s x = (__gcc_v2si_s)a, y = (__gcc_v2si_s)b, r;
+    int h = 2 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned int)x[2 * i] + (unsigned int)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = (unsigned int)y[2 * i] + (unsigned int)y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 16:
+  {
+    __gcc_v8qi_u x = (__gcc_v8qi_u)a, y = (__gcc_v8qi_u)b, r;
+    int h = 8 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned char)x[2 * i] + (unsigned char)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = (unsigned char)y[2 * i] + (unsigned char)y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 17:
+  {
+    __gcc_v4hi_u x = (__gcc_v4hi_u)a, y = (__gcc_v4hi_u)b, r;
+    int h = 4 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned short)x[2 * i] + (unsigned short)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = (unsigned short)y[2 * i] + (unsigned short)y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 18:
+  {
+    __gcc_v2si_u x = (__gcc_v2si_u)a, y = (__gcc_v2si_u)b, r;
+    int h = 2 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned int)x[2 * i] + (unsigned int)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = (unsigned int)y[2 * i] + (unsigned int)y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  }
+
+  __gcc_v8qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vpaddq_v */
+
+// Arm instruction(s): ADDP (per ACLE advsimd.md)
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef short __gcc_v8hi_s __attribute__((__vector_size__(16)));
+typedef int __gcc_v4si_s __attribute__((__vector_size__(16)));
+typedef long long __gcc_v2di_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_neon_vpaddq_v(__gcc_v16qi a, __gcc_v16qi b, int type)
+{
+  switch(type)
+  {
+  case 32:
+  {
+    __gcc_v16qi_s x = (__gcc_v16qi_s)a, y = (__gcc_v16qi_s)b, r;
+    int h = 16 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned char)x[2 * i] + (unsigned char)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = (unsigned char)y[2 * i] + (unsigned char)y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 33:
+  {
+    __gcc_v8hi_s x = (__gcc_v8hi_s)a, y = (__gcc_v8hi_s)b, r;
+    int h = 8 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned short)x[2 * i] + (unsigned short)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = (unsigned short)y[2 * i] + (unsigned short)y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 34:
+  {
+    __gcc_v4si_s x = (__gcc_v4si_s)a, y = (__gcc_v4si_s)b, r;
+    int h = 4 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned int)x[2 * i] + (unsigned int)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = (unsigned int)y[2 * i] + (unsigned int)y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 35:
+  {
+    __gcc_v2di_s x = (__gcc_v2di_s)a, y = (__gcc_v2di_s)b, r;
+    int h = 2 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned long long)x[2 * i] + (unsigned long long)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] =
+        (unsigned long long)y[2 * i] + (unsigned long long)y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 48:
+  {
+    __gcc_v16qi_u x = (__gcc_v16qi_u)a, y = (__gcc_v16qi_u)b, r;
+    int h = 16 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned char)x[2 * i] + (unsigned char)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = (unsigned char)y[2 * i] + (unsigned char)y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 49:
+  {
+    __gcc_v8hi_u x = (__gcc_v8hi_u)a, y = (__gcc_v8hi_u)b, r;
+    int h = 8 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned short)x[2 * i] + (unsigned short)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = (unsigned short)y[2 * i] + (unsigned short)y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 50:
+  {
+    __gcc_v4si_u x = (__gcc_v4si_u)a, y = (__gcc_v4si_u)b, r;
+    int h = 4 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned int)x[2 * i] + (unsigned int)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = (unsigned int)y[2 * i] + (unsigned int)y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 51:
+  {
+    __gcc_v2di_u x = (__gcc_v2di_u)a, y = (__gcc_v2di_u)b, r;
+    int h = 2 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = (unsigned long long)x[2 * i] + (unsigned long long)x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] =
+        (unsigned long long)y[2 * i] + (unsigned long long)y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  }
+
+  __gcc_v16qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vpmax_v */
+
+// Arm instruction(s): SMAXP, UMAXP (per ACLE advsimd.md)
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef signed char __gcc_v8qi_s __attribute__((__vector_size__(8)));
+typedef short __gcc_v4hi_s __attribute__((__vector_size__(8)));
+typedef int __gcc_v2si_s __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_neon_vpmax_v(__gcc_v8qi a, __gcc_v8qi b, int type)
+{
+  switch(type)
+  {
+  case 0:
+  {
+    __gcc_v8qi_s x = (__gcc_v8qi_s)a, y = (__gcc_v8qi_s)b, r;
+    int h = 8 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] > x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] > y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 1:
+  {
+    __gcc_v4hi_s x = (__gcc_v4hi_s)a, y = (__gcc_v4hi_s)b, r;
+    int h = 4 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] > x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] > y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 2:
+  {
+    __gcc_v2si_s x = (__gcc_v2si_s)a, y = (__gcc_v2si_s)b, r;
+    int h = 2 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] > x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] > y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 16:
+  {
+    __gcc_v8qi_u x = (__gcc_v8qi_u)a, y = (__gcc_v8qi_u)b, r;
+    int h = 8 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] > x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] > y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 17:
+  {
+    __gcc_v4hi_u x = (__gcc_v4hi_u)a, y = (__gcc_v4hi_u)b, r;
+    int h = 4 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] > x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] > y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 18:
+  {
+    __gcc_v2si_u x = (__gcc_v2si_u)a, y = (__gcc_v2si_u)b, r;
+    int h = 2 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] > x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] > y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  }
+
+  __gcc_v8qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vpmaxq_v */
+
+// Arm instruction(s): SMAXP, UMAXP (per ACLE advsimd.md)
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef short __gcc_v8hi_s __attribute__((__vector_size__(16)));
+typedef int __gcc_v4si_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_neon_vpmaxq_v(__gcc_v16qi a, __gcc_v16qi b, int type)
+{
+  switch(type)
+  {
+  case 32:
+  {
+    __gcc_v16qi_s x = (__gcc_v16qi_s)a, y = (__gcc_v16qi_s)b, r;
+    int h = 16 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] > x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] > y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 33:
+  {
+    __gcc_v8hi_s x = (__gcc_v8hi_s)a, y = (__gcc_v8hi_s)b, r;
+    int h = 8 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] > x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] > y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 34:
+  {
+    __gcc_v4si_s x = (__gcc_v4si_s)a, y = (__gcc_v4si_s)b, r;
+    int h = 4 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] > x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] > y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 48:
+  {
+    __gcc_v16qi_u x = (__gcc_v16qi_u)a, y = (__gcc_v16qi_u)b, r;
+    int h = 16 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] > x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] > y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 49:
+  {
+    __gcc_v8hi_u x = (__gcc_v8hi_u)a, y = (__gcc_v8hi_u)b, r;
+    int h = 8 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] > x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] > y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 50:
+  {
+    __gcc_v4si_u x = (__gcc_v4si_u)a, y = (__gcc_v4si_u)b, r;
+    int h = 4 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] > x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] > y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  }
+
+  __gcc_v16qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vpmin_v */
+
+// Arm instruction(s): SMINP, UMINP (per ACLE advsimd.md)
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef signed char __gcc_v8qi_s __attribute__((__vector_size__(8)));
+typedef short __gcc_v4hi_s __attribute__((__vector_size__(8)));
+typedef int __gcc_v2si_s __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_neon_vpmin_v(__gcc_v8qi a, __gcc_v8qi b, int type)
+{
+  switch(type)
+  {
+  case 0:
+  {
+    __gcc_v8qi_s x = (__gcc_v8qi_s)a, y = (__gcc_v8qi_s)b, r;
+    int h = 8 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] < x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] < y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 1:
+  {
+    __gcc_v4hi_s x = (__gcc_v4hi_s)a, y = (__gcc_v4hi_s)b, r;
+    int h = 4 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] < x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] < y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 2:
+  {
+    __gcc_v2si_s x = (__gcc_v2si_s)a, y = (__gcc_v2si_s)b, r;
+    int h = 2 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] < x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] < y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 16:
+  {
+    __gcc_v8qi_u x = (__gcc_v8qi_u)a, y = (__gcc_v8qi_u)b, r;
+    int h = 8 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] < x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] < y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 17:
+  {
+    __gcc_v4hi_u x = (__gcc_v4hi_u)a, y = (__gcc_v4hi_u)b, r;
+    int h = 4 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] < x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] < y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  case 18:
+  {
+    __gcc_v2si_u x = (__gcc_v2si_u)a, y = (__gcc_v2si_u)b, r;
+    int h = 2 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] < x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] < y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v8qi)r;
+  }
+  }
+
+  __gcc_v8qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vpminq_v */
+
+// Arm instruction(s): SMINP, UMINP (per ACLE advsimd.md)
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef short __gcc_v8hi_s __attribute__((__vector_size__(16)));
+typedef int __gcc_v4si_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_neon_vpminq_v(__gcc_v16qi a, __gcc_v16qi b, int type)
+{
+  switch(type)
+  {
+  case 32:
+  {
+    __gcc_v16qi_s x = (__gcc_v16qi_s)a, y = (__gcc_v16qi_s)b, r;
+    int h = 16 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] < x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] < y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 33:
+  {
+    __gcc_v8hi_s x = (__gcc_v8hi_s)a, y = (__gcc_v8hi_s)b, r;
+    int h = 8 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] < x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] < y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 34:
+  {
+    __gcc_v4si_s x = (__gcc_v4si_s)a, y = (__gcc_v4si_s)b, r;
+    int h = 4 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] < x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] < y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 48:
+  {
+    __gcc_v16qi_u x = (__gcc_v16qi_u)a, y = (__gcc_v16qi_u)b, r;
+    int h = 16 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] < x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] < y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 49:
+  {
+    __gcc_v8hi_u x = (__gcc_v8hi_u)a, y = (__gcc_v8hi_u)b, r;
+    int h = 8 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] < x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] < y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  case 50:
+  {
+    __gcc_v4si_u x = (__gcc_v4si_u)a, y = (__gcc_v4si_u)b, r;
+    int h = 4 / 2;
+    for(int i = 0; i < h; i++)
+      r[i] = x[2 * i] < x[2 * i + 1] ? x[2 * i] : x[2 * i + 1];
+    for(int i = 0; i < h; i++)
+      r[h + i] = y[2 * i] < y[2 * i + 1] ? y[2 * i] : y[2 * i + 1];
+    return (__gcc_v16qi)r;
+  }
+  }
+
+  __gcc_v16qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vqadd_v */
+
+// Arm instruction(s): SQADD, UQADD (per ACLE advsimd.md)
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef signed char __gcc_v8qi_s __attribute__((__vector_size__(8)));
+typedef short __gcc_v4hi_s __attribute__((__vector_size__(8)));
+typedef int __gcc_v2si_s __attribute__((__vector_size__(8)));
+typedef long long __gcc_v1di_s __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+typedef unsigned long long __gcc_v1di_u __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_neon_vqadd_v(__gcc_v8qi a, __gcc_v8qi b, int type)
+{
+  switch(type)
+  {
+  case 0:
+  {
+    __gcc_v8qi_s x = (__gcc_v8qi_s)a, y = (__gcc_v8qi_s)b, r;
+    for(int i = 0; i < 8; i++)
+    {
+      int s = (int)x[i] + (int)y[i];
+      r[i] = s < -128 ? -128 : (s > 127 ? 127 : s);
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 1:
+  {
+    __gcc_v4hi_s x = (__gcc_v4hi_s)a, y = (__gcc_v4hi_s)b, r;
+    for(int i = 0; i < 4; i++)
+    {
+      int s = (int)x[i] + (int)y[i];
+      r[i] = s < -32768 ? -32768 : (s > 32767 ? 32767 : s);
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 2:
+  {
+    __gcc_v2si_s x = (__gcc_v2si_s)a, y = (__gcc_v2si_s)b, r;
+    for(int i = 0; i < 2; i++)
+    {
+      long long s = (long long)x[i] + (long long)y[i];
+      r[i] = s < -2147483648 ? -2147483648 : (s > 2147483647 ? 2147483647 : s);
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 3:
+  {
+    __gcc_v1di_s x = (__gcc_v1di_s)a, y = (__gcc_v1di_s)b, r;
+    for(int i = 0; i < 1; i++)
+    {
+      long long s =
+        (long long)((unsigned long long)x[i] + (unsigned long long)y[i]);
+      r[i] =
+        ((x[i] ^ s) & (y[i] ^ s)) < 0
+          ? (x[i] < 0 ? (-9223372036854775807LL - 1) : 9223372036854775807LL)
+          : s;
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 16:
+  {
+    __gcc_v8qi_u x = (__gcc_v8qi_u)a, y = (__gcc_v8qi_u)b, r;
+    for(int i = 0; i < 8; i++)
+    {
+      int s = (int)x[i] + (int)y[i];
+      r[i] = s > 255 ? 255 : s;
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 17:
+  {
+    __gcc_v4hi_u x = (__gcc_v4hi_u)a, y = (__gcc_v4hi_u)b, r;
+    for(int i = 0; i < 4; i++)
+    {
+      int s = (int)x[i] + (int)y[i];
+      r[i] = s > 65535 ? 65535 : s;
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 18:
+  {
+    __gcc_v2si_u x = (__gcc_v2si_u)a, y = (__gcc_v2si_u)b, r;
+    for(int i = 0; i < 2; i++)
+    {
+      long long s = (long long)x[i] + (long long)y[i];
+      r[i] = s > 4294967295 ? 4294967295 : s;
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 19:
+  {
+    __gcc_v1di_u x = (__gcc_v1di_u)a, y = (__gcc_v1di_u)b, r;
+    for(int i = 0; i < 1; i++)
+    {
+      unsigned long long s = x[i] + y[i];
+      r[i] = s < x[i] ? 18446744073709551615ULL : s;
+    }
+    return (__gcc_v8qi)r;
+  }
+  }
+
+  __gcc_v8qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vqaddq_v */
+
+// Arm instruction(s): SQADD, UQADD (per ACLE advsimd.md)
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef short __gcc_v8hi_s __attribute__((__vector_size__(16)));
+typedef int __gcc_v4si_s __attribute__((__vector_size__(16)));
+typedef long long __gcc_v2di_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_neon_vqaddq_v(__gcc_v16qi a, __gcc_v16qi b, int type)
+{
+  switch(type)
+  {
+  case 32:
+  {
+    __gcc_v16qi_s x = (__gcc_v16qi_s)a, y = (__gcc_v16qi_s)b, r;
+    for(int i = 0; i < 16; i++)
+    {
+      int s = (int)x[i] + (int)y[i];
+      r[i] = s < -128 ? -128 : (s > 127 ? 127 : s);
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 33:
+  {
+    __gcc_v8hi_s x = (__gcc_v8hi_s)a, y = (__gcc_v8hi_s)b, r;
+    for(int i = 0; i < 8; i++)
+    {
+      int s = (int)x[i] + (int)y[i];
+      r[i] = s < -32768 ? -32768 : (s > 32767 ? 32767 : s);
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 34:
+  {
+    __gcc_v4si_s x = (__gcc_v4si_s)a, y = (__gcc_v4si_s)b, r;
+    for(int i = 0; i < 4; i++)
+    {
+      long long s = (long long)x[i] + (long long)y[i];
+      r[i] = s < -2147483648 ? -2147483648 : (s > 2147483647 ? 2147483647 : s);
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 35:
+  {
+    __gcc_v2di_s x = (__gcc_v2di_s)a, y = (__gcc_v2di_s)b, r;
+    for(int i = 0; i < 2; i++)
+    {
+      long long s =
+        (long long)((unsigned long long)x[i] + (unsigned long long)y[i]);
+      r[i] =
+        ((x[i] ^ s) & (y[i] ^ s)) < 0
+          ? (x[i] < 0 ? (-9223372036854775807LL - 1) : 9223372036854775807LL)
+          : s;
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 48:
+  {
+    __gcc_v16qi_u x = (__gcc_v16qi_u)a, y = (__gcc_v16qi_u)b, r;
+    for(int i = 0; i < 16; i++)
+    {
+      int s = (int)x[i] + (int)y[i];
+      r[i] = s > 255 ? 255 : s;
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 49:
+  {
+    __gcc_v8hi_u x = (__gcc_v8hi_u)a, y = (__gcc_v8hi_u)b, r;
+    for(int i = 0; i < 8; i++)
+    {
+      int s = (int)x[i] + (int)y[i];
+      r[i] = s > 65535 ? 65535 : s;
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 50:
+  {
+    __gcc_v4si_u x = (__gcc_v4si_u)a, y = (__gcc_v4si_u)b, r;
+    for(int i = 0; i < 4; i++)
+    {
+      long long s = (long long)x[i] + (long long)y[i];
+      r[i] = s > 4294967295 ? 4294967295 : s;
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 51:
+  {
+    __gcc_v2di_u x = (__gcc_v2di_u)a, y = (__gcc_v2di_u)b, r;
+    for(int i = 0; i < 2; i++)
+    {
+      unsigned long long s = x[i] + y[i];
+      r[i] = s < x[i] ? 18446744073709551615ULL : s;
+    }
+    return (__gcc_v16qi)r;
+  }
+  }
+
+  __gcc_v16qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vqsub_v */
+
+// Arm instruction(s): SQSUB, UQSUB (per ACLE advsimd.md)
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef signed char __gcc_v8qi_s __attribute__((__vector_size__(8)));
+typedef short __gcc_v4hi_s __attribute__((__vector_size__(8)));
+typedef int __gcc_v2si_s __attribute__((__vector_size__(8)));
+typedef long long __gcc_v1di_s __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+typedef unsigned long long __gcc_v1di_u __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_neon_vqsub_v(__gcc_v8qi a, __gcc_v8qi b, int type)
+{
+  switch(type)
+  {
+  case 0:
+  {
+    __gcc_v8qi_s x = (__gcc_v8qi_s)a, y = (__gcc_v8qi_s)b, r;
+    for(int i = 0; i < 8; i++)
+    {
+      int s = (int)x[i] - (int)y[i];
+      r[i] = s < -128 ? -128 : (s > 127 ? 127 : s);
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 1:
+  {
+    __gcc_v4hi_s x = (__gcc_v4hi_s)a, y = (__gcc_v4hi_s)b, r;
+    for(int i = 0; i < 4; i++)
+    {
+      int s = (int)x[i] - (int)y[i];
+      r[i] = s < -32768 ? -32768 : (s > 32767 ? 32767 : s);
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 2:
+  {
+    __gcc_v2si_s x = (__gcc_v2si_s)a, y = (__gcc_v2si_s)b, r;
+    for(int i = 0; i < 2; i++)
+    {
+      long long s = (long long)x[i] - (long long)y[i];
+      r[i] = s < -2147483648 ? -2147483648 : (s > 2147483647 ? 2147483647 : s);
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 3:
+  {
+    __gcc_v1di_s x = (__gcc_v1di_s)a, y = (__gcc_v1di_s)b, r;
+    for(int i = 0; i < 1; i++)
+    {
+      long long d =
+        (long long)((unsigned long long)x[i] - (unsigned long long)y[i]);
+      r[i] =
+        ((x[i] ^ y[i]) & (x[i] ^ d)) < 0
+          ? (x[i] < 0 ? (-9223372036854775807LL - 1) : 9223372036854775807LL)
+          : d;
+    }
+    return (__gcc_v8qi)r;
+  }
+  case 16:
+  {
+    __gcc_v8qi_u x = (__gcc_v8qi_u)a, y = (__gcc_v8qi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : 0;
+    return (__gcc_v8qi)r;
+  }
+  case 17:
+  {
+    __gcc_v4hi_u x = (__gcc_v4hi_u)a, y = (__gcc_v4hi_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : 0;
+    return (__gcc_v8qi)r;
+  }
+  case 18:
+  {
+    __gcc_v2si_u x = (__gcc_v2si_u)a, y = (__gcc_v2si_u)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : 0;
+    return (__gcc_v8qi)r;
+  }
+  case 19:
+  {
+    __gcc_v1di_u x = (__gcc_v1di_u)a, y = (__gcc_v1di_u)b, r;
+    for(int i = 0; i < 1; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : 0;
+    return (__gcc_v8qi)r;
+  }
+  }
+
+  __gcc_v8qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vqsubq_v */
+
+// Arm instruction(s): SQSUB, UQSUB (per ACLE advsimd.md)
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef short __gcc_v8hi_s __attribute__((__vector_size__(16)));
+typedef int __gcc_v4si_s __attribute__((__vector_size__(16)));
+typedef long long __gcc_v2di_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_neon_vqsubq_v(__gcc_v16qi a, __gcc_v16qi b, int type)
+{
+  switch(type)
+  {
+  case 32:
+  {
+    __gcc_v16qi_s x = (__gcc_v16qi_s)a, y = (__gcc_v16qi_s)b, r;
+    for(int i = 0; i < 16; i++)
+    {
+      int s = (int)x[i] - (int)y[i];
+      r[i] = s < -128 ? -128 : (s > 127 ? 127 : s);
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 33:
+  {
+    __gcc_v8hi_s x = (__gcc_v8hi_s)a, y = (__gcc_v8hi_s)b, r;
+    for(int i = 0; i < 8; i++)
+    {
+      int s = (int)x[i] - (int)y[i];
+      r[i] = s < -32768 ? -32768 : (s > 32767 ? 32767 : s);
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 34:
+  {
+    __gcc_v4si_s x = (__gcc_v4si_s)a, y = (__gcc_v4si_s)b, r;
+    for(int i = 0; i < 4; i++)
+    {
+      long long s = (long long)x[i] - (long long)y[i];
+      r[i] = s < -2147483648 ? -2147483648 : (s > 2147483647 ? 2147483647 : s);
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 35:
+  {
+    __gcc_v2di_s x = (__gcc_v2di_s)a, y = (__gcc_v2di_s)b, r;
+    for(int i = 0; i < 2; i++)
+    {
+      long long d =
+        (long long)((unsigned long long)x[i] - (unsigned long long)y[i]);
+      r[i] =
+        ((x[i] ^ y[i]) & (x[i] ^ d)) < 0
+          ? (x[i] < 0 ? (-9223372036854775807LL - 1) : 9223372036854775807LL)
+          : d;
+    }
+    return (__gcc_v16qi)r;
+  }
+  case 48:
+  {
+    __gcc_v16qi_u x = (__gcc_v16qi_u)a, y = (__gcc_v16qi_u)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : 0;
+    return (__gcc_v16qi)r;
+  }
+  case 49:
+  {
+    __gcc_v8hi_u x = (__gcc_v8hi_u)a, y = (__gcc_v8hi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : 0;
+    return (__gcc_v16qi)r;
+  }
+  case 50:
+  {
+    __gcc_v4si_u x = (__gcc_v4si_u)a, y = (__gcc_v4si_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : 0;
+    return (__gcc_v16qi)r;
+  }
+  case 51:
+  {
+    __gcc_v2di_u x = (__gcc_v2di_u)a, y = (__gcc_v2di_u)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = x[i] > y[i] ? x[i] - y[i] : 0;
+    return (__gcc_v16qi)r;
+  }
+  }
+
+  __gcc_v16qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vrhadd_v */
+
+// Arm instruction(s): SRHADD, URHADD (per ACLE advsimd.md)
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef signed char __gcc_v8qi_s __attribute__((__vector_size__(8)));
+typedef short __gcc_v4hi_s __attribute__((__vector_size__(8)));
+typedef int __gcc_v2si_s __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_neon_vrhadd_v(__gcc_v8qi a, __gcc_v8qi b, int type)
+{
+  switch(type)
+  {
+  case 0:
+  {
+    __gcc_v8qi_s x = (__gcc_v8qi_s)a, y = (__gcc_v8qi_s)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = ((int)x[i] + (int)y[i] + 1) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 1:
+  {
+    __gcc_v4hi_s x = (__gcc_v4hi_s)a, y = (__gcc_v4hi_s)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = ((int)x[i] + (int)y[i] + 1) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 2:
+  {
+    __gcc_v2si_s x = (__gcc_v2si_s)a, y = (__gcc_v2si_s)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = ((long long)x[i] + (long long)y[i] + 1) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 16:
+  {
+    __gcc_v8qi_u x = (__gcc_v8qi_u)a, y = (__gcc_v8qi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = ((int)x[i] + (int)y[i] + 1) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 17:
+  {
+    __gcc_v4hi_u x = (__gcc_v4hi_u)a, y = (__gcc_v4hi_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = ((int)x[i] + (int)y[i] + 1) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  case 18:
+  {
+    __gcc_v2si_u x = (__gcc_v2si_u)a, y = (__gcc_v2si_u)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = ((long long)x[i] + (long long)y[i] + 1) >> 1;
+    return (__gcc_v8qi)r;
+  }
+  }
+
+  __gcc_v8qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vrhaddq_v */
+
+// Arm instruction(s): SRHADD, URHADD (per ACLE advsimd.md)
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef short __gcc_v8hi_s __attribute__((__vector_size__(16)));
+typedef int __gcc_v4si_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_neon_vrhaddq_v(__gcc_v16qi a, __gcc_v16qi b, int type)
+{
+  switch(type)
+  {
+  case 32:
+  {
+    __gcc_v16qi_s x = (__gcc_v16qi_s)a, y = (__gcc_v16qi_s)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = ((int)x[i] + (int)y[i] + 1) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 33:
+  {
+    __gcc_v8hi_s x = (__gcc_v8hi_s)a, y = (__gcc_v8hi_s)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = ((int)x[i] + (int)y[i] + 1) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 34:
+  {
+    __gcc_v4si_s x = (__gcc_v4si_s)a, y = (__gcc_v4si_s)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = ((long long)x[i] + (long long)y[i] + 1) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 48:
+  {
+    __gcc_v16qi_u x = (__gcc_v16qi_u)a, y = (__gcc_v16qi_u)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = ((int)x[i] + (int)y[i] + 1) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 49:
+  {
+    __gcc_v8hi_u x = (__gcc_v8hi_u)a, y = (__gcc_v8hi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = ((int)x[i] + (int)y[i] + 1) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  case 50:
+  {
+    __gcc_v4si_u x = (__gcc_v4si_u)a, y = (__gcc_v4si_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = ((long long)x[i] + (long long)y[i] + 1) >> 1;
+    return (__gcc_v16qi)r;
+  }
+  }
+
+  __gcc_v16qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vtst_v */
+
+// Arm instruction(s): CMTST (per ACLE advsimd.md)
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+typedef signed char __gcc_v8qi_s __attribute__((__vector_size__(8)));
+typedef short __gcc_v4hi_s __attribute__((__vector_size__(8)));
+typedef int __gcc_v2si_s __attribute__((__vector_size__(8)));
+typedef long long __gcc_v1di_s __attribute__((__vector_size__(8)));
+typedef unsigned char __gcc_v8qi_u __attribute__((__vector_size__(8)));
+typedef unsigned short __gcc_v4hi_u __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+typedef unsigned long long __gcc_v1di_u __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_neon_vtst_v(__gcc_v8qi a, __gcc_v8qi b, int type)
+{
+  switch(type)
+  {
+  case 0:
+  {
+    __gcc_v8qi_s x = (__gcc_v8qi_s)a, y = (__gcc_v8qi_s)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v8qi)r;
+  }
+  case 1:
+  {
+    __gcc_v4hi_s x = (__gcc_v4hi_s)a, y = (__gcc_v4hi_s)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v8qi)r;
+  }
+  case 2:
+  {
+    __gcc_v2si_s x = (__gcc_v2si_s)a, y = (__gcc_v2si_s)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v8qi)r;
+  }
+  case 3:
+  {
+    __gcc_v1di_s x = (__gcc_v1di_s)a, y = (__gcc_v1di_s)b, r;
+    for(int i = 0; i < 1; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v8qi)r;
+  }
+  case 16:
+  {
+    __gcc_v8qi_u x = (__gcc_v8qi_u)a, y = (__gcc_v8qi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v8qi)r;
+  }
+  case 17:
+  {
+    __gcc_v4hi_u x = (__gcc_v4hi_u)a, y = (__gcc_v4hi_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v8qi)r;
+  }
+  case 18:
+  {
+    __gcc_v2si_u x = (__gcc_v2si_u)a, y = (__gcc_v2si_u)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v8qi)r;
+  }
+  case 19:
+  {
+    __gcc_v1di_u x = (__gcc_v1di_u)a, y = (__gcc_v1di_u)b, r;
+    for(int i = 0; i < 1; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v8qi)r;
+  }
+  }
+
+  __gcc_v8qi r = {0};
+  return r;
+}
+
+/* FUNCTION: __builtin_neon_vtstq_v */
+
+// Arm instruction(s): CMTST (per ACLE advsimd.md)
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+typedef short __gcc_v8hi_s __attribute__((__vector_size__(16)));
+typedef int __gcc_v4si_s __attribute__((__vector_size__(16)));
+typedef long long __gcc_v2di_s __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_neon_vtstq_v(__gcc_v16qi a, __gcc_v16qi b, int type)
+{
+  switch(type)
+  {
+  case 32:
+  {
+    __gcc_v16qi_s x = (__gcc_v16qi_s)a, y = (__gcc_v16qi_s)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v16qi)r;
+  }
+  case 33:
+  {
+    __gcc_v8hi_s x = (__gcc_v8hi_s)a, y = (__gcc_v8hi_s)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v16qi)r;
+  }
+  case 34:
+  {
+    __gcc_v4si_s x = (__gcc_v4si_s)a, y = (__gcc_v4si_s)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v16qi)r;
+  }
+  case 35:
+  {
+    __gcc_v2di_s x = (__gcc_v2di_s)a, y = (__gcc_v2di_s)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v16qi)r;
+  }
+  case 48:
+  {
+    __gcc_v16qi_u x = (__gcc_v16qi_u)a, y = (__gcc_v16qi_u)b, r;
+    for(int i = 0; i < 16; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v16qi)r;
+  }
+  case 49:
+  {
+    __gcc_v8hi_u x = (__gcc_v8hi_u)a, y = (__gcc_v8hi_u)b, r;
+    for(int i = 0; i < 8; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v16qi)r;
+  }
+  case 50:
+  {
+    __gcc_v4si_u x = (__gcc_v4si_u)a, y = (__gcc_v4si_u)b, r;
+    for(int i = 0; i < 4; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v16qi)r;
+  }
+  case 51:
+  {
+    __gcc_v2di_u x = (__gcc_v2di_u)a, y = (__gcc_v2di_u)b, r;
+    for(int i = 0; i < 2; i++)
+      r[i] = (x[i] & y[i]) != 0 ? -1 : 0;
+    return (__gcc_v16qi)r;
+  }
+  }
+
+  __gcc_v16qi r = {0};
+  return r;
+}
diff --git a/src/ansi-c/library/x86_intrinsics.c b/src/ansi-c/library/x86_intrinsics.c
new file mode 100644
index 00000000000..bbffaa598b5
--- /dev/null
+++ b/src/ansi-c/library/x86_intrinsics.c
@@ -0,0 +1,3372 @@
+// x86 SIMD intrinsic models for CBMC
+// Generated by scripts/generate_intrinsic_models.py
+// Models: 204
+
+/* FUNCTION: __builtin_ia32_pabsb128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_pabsb128(__gcc_v16qi a)
+{
+  __gcc_v16qi_s a_ = (__gcc_v16qi_s)a;
+  __gcc_v16qi_s dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] < 0 ? -a_[j] : a_[j];
+  return (__gcc_v16qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pabsb256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef signed char __gcc_v32qi_s __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_pabsb256(__gcc_v32qi a)
+{
+  __gcc_v32qi_s a_ = (__gcc_v32qi_s)a;
+  __gcc_v32qi_s dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = a_[j] < 0 ? -a_[j] : a_[j];
+  return (__gcc_v32qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pabsd128 */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pabsd128(__gcc_v4si a)
+{
+  __gcc_v4si a_ = a;
+  __gcc_v4si dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] < 0 ? (int)(0u - (unsigned)a_[j]) : a_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pabsd256 */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pabsd256(__gcc_v8si a)
+{
+  __gcc_v8si a_ = a;
+  __gcc_v8si dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] < 0 ? (int)(0u - (unsigned)a_[j]) : a_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pabsw128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pabsw128(__gcc_v8hi a)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] < 0 ? -a_[j] : a_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pabsw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pabsw256(__gcc_v16hi a)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] < 0 ? -a_[j] : a_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddb */
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_ia32_paddb(__gcc_v8qi a, __gcc_v8qi b)
+{
+  __gcc_v8qi a_ = a;
+  __gcc_v8qi b_ = b;
+  __gcc_v8qi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] + b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddb128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_paddb128(__gcc_v16qi a, __gcc_v16qi b)
+{
+  __gcc_v16qi a_ = a;
+  __gcc_v16qi b_ = b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] + b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddb128_mask */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_paddb128_mask(
+  __gcc_v16qi a,
+  __gcc_v16qi b,
+  __gcc_v16qi src,
+  unsigned short k)
+{
+  __gcc_v16qi a_ = a;
+  __gcc_v16qi b_ = b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddb256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_paddb256(__gcc_v32qi a, __gcc_v32qi b)
+{
+  __gcc_v32qi a_ = a;
+  __gcc_v32qi b_ = b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = a_[j] + b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddb256_mask */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_paddb256_mask(
+  __gcc_v32qi a,
+  __gcc_v32qi b,
+  __gcc_v32qi src,
+  unsigned int k)
+{
+  __gcc_v32qi a_ = a;
+  __gcc_v32qi b_ = b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddb512_mask */
+
+typedef char __gcc_v64qi __attribute__((__vector_size__(64)));
+
+__gcc_v64qi __builtin_ia32_paddb512_mask(
+  __gcc_v64qi a,
+  __gcc_v64qi b,
+  __gcc_v64qi src,
+  unsigned long long k)
+{
+  __gcc_v64qi a_ = a;
+  __gcc_v64qi b_ = b;
+  __gcc_v64qi dst;
+  for(int j = 0; j < 64; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddd */
+
+typedef int __gcc_v2si __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+
+__gcc_v2si __builtin_ia32_paddd(__gcc_v2si a, __gcc_v2si b)
+{
+  __gcc_v2si_u a_ = (__gcc_v2si_u)a;
+  __gcc_v2si_u b_ = (__gcc_v2si_u)b;
+  __gcc_v2si_u dst;
+  for(int j = 0; j < 2; j++)
+    dst[j] = a_[j] + b_[j];
+  return (__gcc_v2si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddd128 */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_paddd128(__gcc_v4si a, __gcc_v4si b)
+{
+  __gcc_v4si_u a_ = (__gcc_v4si_u)a;
+  __gcc_v4si_u b_ = (__gcc_v4si_u)b;
+  __gcc_v4si_u dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] + b_[j];
+  return (__gcc_v4si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddd128_mask */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_paddd128_mask(
+  __gcc_v4si a,
+  __gcc_v4si b,
+  __gcc_v4si src,
+  unsigned char k)
+{
+  __gcc_v4si_u a_ = (__gcc_v4si_u)a;
+  __gcc_v4si_u b_ = (__gcc_v4si_u)b;
+  __gcc_v4si dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddd256 */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_paddd256(__gcc_v8si a, __gcc_v8si b)
+{
+  __gcc_v8si_u a_ = (__gcc_v8si_u)a;
+  __gcc_v8si_u b_ = (__gcc_v8si_u)b;
+  __gcc_v8si_u dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] + b_[j];
+  return (__gcc_v8si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddd256_mask */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_paddd256_mask(
+  __gcc_v8si a,
+  __gcc_v8si b,
+  __gcc_v8si src,
+  unsigned char k)
+{
+  __gcc_v8si_u a_ = (__gcc_v8si_u)a;
+  __gcc_v8si_u b_ = (__gcc_v8si_u)b;
+  __gcc_v8si dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddd512_mask */
+
+typedef int __gcc_v16si __attribute__((__vector_size__(64)));
+typedef unsigned int __gcc_v16si_u __attribute__((__vector_size__(64)));
+
+__gcc_v16si __builtin_ia32_paddd512_mask(
+  __gcc_v16si a,
+  __gcc_v16si b,
+  __gcc_v16si src,
+  unsigned short k)
+{
+  __gcc_v16si_u a_ = (__gcc_v16si_u)a;
+  __gcc_v16si_u b_ = (__gcc_v16si_u)b;
+  __gcc_v16si dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddq128 */
+
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+
+__gcc_v2di __builtin_ia32_paddq128(__gcc_v2di a, __gcc_v2di b)
+{
+  __gcc_v2di_u a_ = (__gcc_v2di_u)a;
+  __gcc_v2di_u b_ = (__gcc_v2di_u)b;
+  __gcc_v2di_u dst;
+  for(int j = 0; j < 2; j++)
+    dst[j] = a_[j] + b_[j];
+  return (__gcc_v2di)dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddq128_mask */
+
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+
+__gcc_v2di __builtin_ia32_paddq128_mask(
+  __gcc_v2di a,
+  __gcc_v2di b,
+  __gcc_v2di src,
+  unsigned char k)
+{
+  __gcc_v2di_u a_ = (__gcc_v2di_u)a;
+  __gcc_v2di_u b_ = (__gcc_v2di_u)b;
+  __gcc_v2di dst;
+  for(int j = 0; j < 2; j++)
+    dst[j] = (k >> j) & 1 ? (long long)(a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddq256 */
+
+typedef long long __gcc_v4di __attribute__((__vector_size__(32)));
+typedef unsigned long long __gcc_v4di_u __attribute__((__vector_size__(32)));
+
+__gcc_v4di __builtin_ia32_paddq256(__gcc_v4di a, __gcc_v4di b)
+{
+  __gcc_v4di_u a_ = (__gcc_v4di_u)a;
+  __gcc_v4di_u b_ = (__gcc_v4di_u)b;
+  __gcc_v4di_u dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] + b_[j];
+  return (__gcc_v4di)dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddq256_mask */
+
+typedef long long __gcc_v4di __attribute__((__vector_size__(32)));
+typedef unsigned long long __gcc_v4di_u __attribute__((__vector_size__(32)));
+
+__gcc_v4di __builtin_ia32_paddq256_mask(
+  __gcc_v4di a,
+  __gcc_v4di b,
+  __gcc_v4di src,
+  unsigned char k)
+{
+  __gcc_v4di_u a_ = (__gcc_v4di_u)a;
+  __gcc_v4di_u b_ = (__gcc_v4di_u)b;
+  __gcc_v4di dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (k >> j) & 1 ? (long long)(a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddq512_mask */
+
+typedef long long __gcc_v8di __attribute__((__vector_size__(64)));
+typedef unsigned long long __gcc_v8di_u __attribute__((__vector_size__(64)));
+
+__gcc_v8di __builtin_ia32_paddq512_mask(
+  __gcc_v8di a,
+  __gcc_v8di b,
+  __gcc_v8di src,
+  unsigned char k)
+{
+  __gcc_v8di_u a_ = (__gcc_v8di_u)a;
+  __gcc_v8di_u b_ = (__gcc_v8di_u)b;
+  __gcc_v8di dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (long long)(a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddsb128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_paddsb128(__gcc_v16qi a, __gcc_v16qi b)
+{
+  __gcc_v16qi_s a_ = (__gcc_v16qi_s)a;
+  __gcc_v16qi_s b_ = (__gcc_v16qi_s)b;
+  __gcc_v16qi_s dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (a_[j] + b_[j]) < -128  ? -128
+             : (a_[j] + b_[j]) > 127 ? 127
+                                     : a_[j] + b_[j];
+  return (__gcc_v16qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddsb128_mask */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_paddsb128_mask(
+  __gcc_v16qi a,
+  __gcc_v16qi b,
+  __gcc_v16qi src,
+  unsigned short k)
+{
+  __gcc_v16qi_s a_ = (__gcc_v16qi_s)a;
+  __gcc_v16qi_s b_ = (__gcc_v16qi_s)b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (char)((a_[j] + b_[j]) < -128 ? -128 : (a_[j] + b_[j]) > 127 ? 127 : a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddsb256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef signed char __gcc_v32qi_s __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_paddsb256(__gcc_v32qi a, __gcc_v32qi b)
+{
+  __gcc_v32qi_s a_ = (__gcc_v32qi_s)a;
+  __gcc_v32qi_s b_ = (__gcc_v32qi_s)b;
+  __gcc_v32qi_s dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (a_[j] + b_[j]) < -128  ? -128
+             : (a_[j] + b_[j]) > 127 ? 127
+                                     : a_[j] + b_[j];
+  return (__gcc_v32qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddsb256_mask */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef signed char __gcc_v32qi_s __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_paddsb256_mask(
+  __gcc_v32qi a,
+  __gcc_v32qi b,
+  __gcc_v32qi src,
+  unsigned int k)
+{
+  __gcc_v32qi_s a_ = (__gcc_v32qi_s)a;
+  __gcc_v32qi_s b_ = (__gcc_v32qi_s)b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (char)((a_[j] + b_[j]) < -128 ? -128 : (a_[j] + b_[j]) > 127 ? 127 : a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddsb512_mask */
+
+typedef char __gcc_v64qi __attribute__((__vector_size__(64)));
+typedef signed char __gcc_v64qi_s __attribute__((__vector_size__(64)));
+
+__gcc_v64qi __builtin_ia32_paddsb512_mask(
+  __gcc_v64qi a,
+  __gcc_v64qi b,
+  __gcc_v64qi src,
+  unsigned long long k)
+{
+  __gcc_v64qi_s a_ = (__gcc_v64qi_s)a;
+  __gcc_v64qi_s b_ = (__gcc_v64qi_s)b;
+  __gcc_v64qi dst;
+  for(int j = 0; j < 64; j++)
+    dst[j] = (k >> j) & 1 ? (char)((a_[j] + b_[j]) < -128 ? -128 : (a_[j] + b_[j]) > 127 ? 127 : a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddsw128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_paddsw128(__gcc_v8hi a, __gcc_v8hi b)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (a_[j] + b_[j]) < -32768  ? -32768
+             : (a_[j] + b_[j]) > 32767 ? 32767
+                                       : a_[j] + b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddsw128_mask */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_paddsw128_mask(
+  __gcc_v8hi a,
+  __gcc_v8hi b,
+  __gcc_v8hi src,
+  unsigned char k)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (short)((a_[j] + b_[j]) < -32768 ? -32768 : (a_[j] + b_[j]) > 32767 ? 32767 : a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddsw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_paddsw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (a_[j] + b_[j]) < -32768  ? -32768
+             : (a_[j] + b_[j]) > 32767 ? 32767
+                                       : a_[j] + b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddsw256_mask */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_paddsw256_mask(
+  __gcc_v16hi a,
+  __gcc_v16hi b,
+  __gcc_v16hi src,
+  unsigned short k)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1
+               ? (short)((a_[j] + b_[j]) < -32768  ? -32768
+                         : (a_[j] + b_[j]) > 32767 ? 32767
+                                                   : a_[j] + b_[j])
+               : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddsw512_mask */
+
+typedef short __gcc_v32hi __attribute__((__vector_size__(64)));
+
+__gcc_v32hi __builtin_ia32_paddsw512_mask(
+  __gcc_v32hi a,
+  __gcc_v32hi b,
+  __gcc_v32hi src,
+  unsigned int k)
+{
+  __gcc_v32hi a_ = a;
+  __gcc_v32hi b_ = b;
+  __gcc_v32hi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1
+               ? (short)((a_[j] + b_[j]) < -32768  ? -32768
+                         : (a_[j] + b_[j]) > 32767 ? 32767
+                                                   : a_[j] + b_[j])
+               : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddusb128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_paddusb128(__gcc_v16qi a, __gcc_v16qi b)
+{
+  __gcc_v16qi_u a_ = (__gcc_v16qi_u)a;
+  __gcc_v16qi_u b_ = (__gcc_v16qi_u)b;
+  __gcc_v16qi_u dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (a_[j] + b_[j]) > 255 ? 255 : a_[j] + b_[j];
+  return (__gcc_v16qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddusb128_mask */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_paddusb128_mask(
+  __gcc_v16qi a,
+  __gcc_v16qi b,
+  __gcc_v16qi src,
+  unsigned short k)
+{
+  __gcc_v16qi_u a_ = (__gcc_v16qi_u)a;
+  __gcc_v16qi_u b_ = (__gcc_v16qi_u)b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (char)((a_[j] + b_[j]) > 255 ? 255 : a_[j] + b_[j])
+                          : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddusb256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef unsigned char __gcc_v32qi_u __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_paddusb256(__gcc_v32qi a, __gcc_v32qi b)
+{
+  __gcc_v32qi_u a_ = (__gcc_v32qi_u)a;
+  __gcc_v32qi_u b_ = (__gcc_v32qi_u)b;
+  __gcc_v32qi_u dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (a_[j] + b_[j]) > 255 ? 255 : a_[j] + b_[j];
+  return (__gcc_v32qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddusb256_mask */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef unsigned char __gcc_v32qi_u __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_paddusb256_mask(
+  __gcc_v32qi a,
+  __gcc_v32qi b,
+  __gcc_v32qi src,
+  unsigned int k)
+{
+  __gcc_v32qi_u a_ = (__gcc_v32qi_u)a;
+  __gcc_v32qi_u b_ = (__gcc_v32qi_u)b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (char)((a_[j] + b_[j]) > 255 ? 255 : a_[j] + b_[j])
+                          : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddusb512_mask */
+
+typedef char __gcc_v64qi __attribute__((__vector_size__(64)));
+typedef unsigned char __gcc_v64qi_u __attribute__((__vector_size__(64)));
+
+__gcc_v64qi __builtin_ia32_paddusb512_mask(
+  __gcc_v64qi a,
+  __gcc_v64qi b,
+  __gcc_v64qi src,
+  unsigned long long k)
+{
+  __gcc_v64qi_u a_ = (__gcc_v64qi_u)a;
+  __gcc_v64qi_u b_ = (__gcc_v64qi_u)b;
+  __gcc_v64qi dst;
+  for(int j = 0; j < 64; j++)
+    dst[j] = (k >> j) & 1 ? (char)((a_[j] + b_[j]) > 255 ? 255 : a_[j] + b_[j])
+                          : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddusw128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_paddusw128(__gcc_v8hi a, __gcc_v8hi b)
+{
+  __gcc_v8hi_u a_ = (__gcc_v8hi_u)a;
+  __gcc_v8hi_u b_ = (__gcc_v8hi_u)b;
+  __gcc_v8hi_u dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (a_[j] + b_[j]) > 65535 ? 65535 : a_[j] + b_[j];
+  return (__gcc_v8hi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddusw128_mask */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_paddusw128_mask(
+  __gcc_v8hi a,
+  __gcc_v8hi b,
+  __gcc_v8hi src,
+  unsigned char k)
+{
+  __gcc_v8hi_u a_ = (__gcc_v8hi_u)a;
+  __gcc_v8hi_u b_ = (__gcc_v8hi_u)b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1
+               ? (short)((a_[j] + b_[j]) > 65535 ? 65535 : a_[j] + b_[j])
+               : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddusw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_paddusw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi_u a_ = (__gcc_v16hi_u)a;
+  __gcc_v16hi_u b_ = (__gcc_v16hi_u)b;
+  __gcc_v16hi_u dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (a_[j] + b_[j]) > 65535 ? 65535 : a_[j] + b_[j];
+  return (__gcc_v16hi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddusw256_mask */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_paddusw256_mask(
+  __gcc_v16hi a,
+  __gcc_v16hi b,
+  __gcc_v16hi src,
+  unsigned short k)
+{
+  __gcc_v16hi_u a_ = (__gcc_v16hi_u)a;
+  __gcc_v16hi_u b_ = (__gcc_v16hi_u)b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1
+               ? (short)((a_[j] + b_[j]) > 65535 ? 65535 : a_[j] + b_[j])
+               : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddusw512_mask */
+
+typedef short __gcc_v32hi __attribute__((__vector_size__(64)));
+typedef unsigned short __gcc_v32hi_u __attribute__((__vector_size__(64)));
+
+__gcc_v32hi __builtin_ia32_paddusw512_mask(
+  __gcc_v32hi a,
+  __gcc_v32hi b,
+  __gcc_v32hi src,
+  unsigned int k)
+{
+  __gcc_v32hi_u a_ = (__gcc_v32hi_u)a;
+  __gcc_v32hi_u b_ = (__gcc_v32hi_u)b;
+  __gcc_v32hi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1
+               ? (short)((a_[j] + b_[j]) > 65535 ? 65535 : a_[j] + b_[j])
+               : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddw */
+
+typedef short __gcc_v4hi __attribute__((__vector_size__(8)));
+
+__gcc_v4hi __builtin_ia32_paddw(__gcc_v4hi a, __gcc_v4hi b)
+{
+  __gcc_v4hi a_ = a;
+  __gcc_v4hi b_ = b;
+  __gcc_v4hi dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] + b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddw128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_paddw128(__gcc_v8hi a, __gcc_v8hi b)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] + b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddw128_mask */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_paddw128_mask(
+  __gcc_v8hi a,
+  __gcc_v8hi b,
+  __gcc_v8hi src,
+  unsigned char k)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_paddw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] + b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddw256_mask */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_paddw256_mask(
+  __gcc_v16hi a,
+  __gcc_v16hi b,
+  __gcc_v16hi src,
+  unsigned short k)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_paddw512_mask */
+
+typedef short __gcc_v32hi __attribute__((__vector_size__(64)));
+
+__gcc_v32hi __builtin_ia32_paddw512_mask(
+  __gcc_v32hi a,
+  __gcc_v32hi b,
+  __gcc_v32hi src,
+  unsigned int k)
+{
+  __gcc_v32hi a_ = a;
+  __gcc_v32hi b_ = b;
+  __gcc_v32hi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] + b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pand128 */
+
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+
+__gcc_v2di __builtin_ia32_pand128(__gcc_v2di a, __gcc_v2di b)
+{
+  __gcc_v2di a_ = a;
+  __gcc_v2di b_ = b;
+  __gcc_v2di dst;
+  for(int j = 0; j < 2; j++)
+    dst[j] = a_[j] & b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pandn128 */
+
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+
+__gcc_v2di __builtin_ia32_pandn128(__gcc_v2di a, __gcc_v2di b)
+{
+  __gcc_v2di a_ = a;
+  __gcc_v2di b_ = b;
+  __gcc_v2di dst;
+  for(int j = 0; j < 2; j++)
+    dst[j] = ~a_[j] & b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pavgb128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_pavgb128(__gcc_v16qi a, __gcc_v16qi b)
+{
+  __gcc_v16qi_u a_ = (__gcc_v16qi_u)a;
+  __gcc_v16qi_u b_ = (__gcc_v16qi_u)b;
+  __gcc_v16qi_u dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (a_[j] + b_[j] + 1) >> 1;
+  return (__gcc_v16qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pavgb128_mask */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_pavgb128_mask(
+  __gcc_v16qi a,
+  __gcc_v16qi b,
+  __gcc_v16qi src,
+  unsigned short k)
+{
+  __gcc_v16qi_u a_ = (__gcc_v16qi_u)a;
+  __gcc_v16qi_u b_ = (__gcc_v16qi_u)b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (char)((a_[j] + b_[j] + 1) >> 1) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pavgb256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef unsigned char __gcc_v32qi_u __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_pavgb256(__gcc_v32qi a, __gcc_v32qi b)
+{
+  __gcc_v32qi_u a_ = (__gcc_v32qi_u)a;
+  __gcc_v32qi_u b_ = (__gcc_v32qi_u)b;
+  __gcc_v32qi_u dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (a_[j] + b_[j] + 1) >> 1;
+  return (__gcc_v32qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pavgb256_mask */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef unsigned char __gcc_v32qi_u __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_pavgb256_mask(
+  __gcc_v32qi a,
+  __gcc_v32qi b,
+  __gcc_v32qi src,
+  unsigned int k)
+{
+  __gcc_v32qi_u a_ = (__gcc_v32qi_u)a;
+  __gcc_v32qi_u b_ = (__gcc_v32qi_u)b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (char)((a_[j] + b_[j] + 1) >> 1) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pavgb512_mask */
+
+typedef char __gcc_v64qi __attribute__((__vector_size__(64)));
+typedef unsigned char __gcc_v64qi_u __attribute__((__vector_size__(64)));
+
+__gcc_v64qi __builtin_ia32_pavgb512_mask(
+  __gcc_v64qi a,
+  __gcc_v64qi b,
+  __gcc_v64qi src,
+  unsigned long long k)
+{
+  __gcc_v64qi_u a_ = (__gcc_v64qi_u)a;
+  __gcc_v64qi_u b_ = (__gcc_v64qi_u)b;
+  __gcc_v64qi dst;
+  for(int j = 0; j < 64; j++)
+    dst[j] = (k >> j) & 1 ? (char)((a_[j] + b_[j] + 1) >> 1) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pavgw128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pavgw128(__gcc_v8hi a, __gcc_v8hi b)
+{
+  __gcc_v8hi_u a_ = (__gcc_v8hi_u)a;
+  __gcc_v8hi_u b_ = (__gcc_v8hi_u)b;
+  __gcc_v8hi_u dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (a_[j] + b_[j] + 1) >> 1;
+  return (__gcc_v8hi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pavgw128_mask */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pavgw128_mask(
+  __gcc_v8hi a,
+  __gcc_v8hi b,
+  __gcc_v8hi src,
+  unsigned char k)
+{
+  __gcc_v8hi_u a_ = (__gcc_v8hi_u)a;
+  __gcc_v8hi_u b_ = (__gcc_v8hi_u)b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (short)((a_[j] + b_[j] + 1) >> 1) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pavgw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pavgw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi_u a_ = (__gcc_v16hi_u)a;
+  __gcc_v16hi_u b_ = (__gcc_v16hi_u)b;
+  __gcc_v16hi_u dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (a_[j] + b_[j] + 1) >> 1;
+  return (__gcc_v16hi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pavgw256_mask */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pavgw256_mask(
+  __gcc_v16hi a,
+  __gcc_v16hi b,
+  __gcc_v16hi src,
+  unsigned short k)
+{
+  __gcc_v16hi_u a_ = (__gcc_v16hi_u)a;
+  __gcc_v16hi_u b_ = (__gcc_v16hi_u)b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (short)((a_[j] + b_[j] + 1) >> 1) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pavgw512_mask */
+
+typedef short __gcc_v32hi __attribute__((__vector_size__(64)));
+typedef unsigned short __gcc_v32hi_u __attribute__((__vector_size__(64)));
+
+__gcc_v32hi __builtin_ia32_pavgw512_mask(
+  __gcc_v32hi a,
+  __gcc_v32hi b,
+  __gcc_v32hi src,
+  unsigned int k)
+{
+  __gcc_v32hi_u a_ = (__gcc_v32hi_u)a;
+  __gcc_v32hi_u b_ = (__gcc_v32hi_u)b;
+  __gcc_v32hi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (short)((a_[j] + b_[j] + 1) >> 1) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pcmpeqb128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_pcmpeqb128(__gcc_v16qi a, __gcc_v16qi b)
+{
+  __gcc_v16qi a_ = a;
+  __gcc_v16qi b_ = b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] == b_[j] ? -1 : 0;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pcmpeqb256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_pcmpeqb256(__gcc_v32qi a, __gcc_v32qi b)
+{
+  __gcc_v32qi a_ = a;
+  __gcc_v32qi b_ = b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = a_[j] == b_[j] ? -1 : 0;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pcmpeqd128 */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pcmpeqd128(__gcc_v4si a, __gcc_v4si b)
+{
+  __gcc_v4si a_ = a;
+  __gcc_v4si b_ = b;
+  __gcc_v4si dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] == b_[j] ? -1 : 0;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pcmpeqd256 */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pcmpeqd256(__gcc_v8si a, __gcc_v8si b)
+{
+  __gcc_v8si a_ = a;
+  __gcc_v8si b_ = b;
+  __gcc_v8si dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] == b_[j] ? -1 : 0;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pcmpeqw128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pcmpeqw128(__gcc_v8hi a, __gcc_v8hi b)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] == b_[j] ? -1 : 0;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pcmpeqw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pcmpeqw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] == b_[j] ? -1 : 0;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pcmpgtb128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_pcmpgtb128(__gcc_v16qi a, __gcc_v16qi b)
+{
+  __gcc_v16qi_s a_ = (__gcc_v16qi_s)a;
+  __gcc_v16qi_s b_ = (__gcc_v16qi_s)b;
+  __gcc_v16qi_s dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] > b_[j] ? -1 : 0;
+  return (__gcc_v16qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pcmpgtb256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef signed char __gcc_v32qi_s __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_pcmpgtb256(__gcc_v32qi a, __gcc_v32qi b)
+{
+  __gcc_v32qi_s a_ = (__gcc_v32qi_s)a;
+  __gcc_v32qi_s b_ = (__gcc_v32qi_s)b;
+  __gcc_v32qi_s dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = a_[j] > b_[j] ? -1 : 0;
+  return (__gcc_v32qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pcmpgtd128 */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pcmpgtd128(__gcc_v4si a, __gcc_v4si b)
+{
+  __gcc_v4si a_ = a;
+  __gcc_v4si b_ = b;
+  __gcc_v4si dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] > b_[j] ? -1 : 0;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pcmpgtd256 */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pcmpgtd256(__gcc_v8si a, __gcc_v8si b)
+{
+  __gcc_v8si a_ = a;
+  __gcc_v8si b_ = b;
+  __gcc_v8si dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] > b_[j] ? -1 : 0;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pcmpgtw128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pcmpgtw128(__gcc_v8hi a, __gcc_v8hi b)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] > b_[j] ? -1 : 0;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pcmpgtw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pcmpgtw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] > b_[j] ? -1 : 0;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsb128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_pmaxsb128(__gcc_v16qi a, __gcc_v16qi b)
+{
+  __gcc_v16qi_s a_ = (__gcc_v16qi_s)a;
+  __gcc_v16qi_s b_ = (__gcc_v16qi_s)b;
+  __gcc_v16qi_s dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] > b_[j] ? a_[j] : b_[j];
+  return (__gcc_v16qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsb128_mask */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_pmaxsb128_mask(
+  __gcc_v16qi a,
+  __gcc_v16qi b,
+  __gcc_v16qi src,
+  unsigned short k)
+{
+  __gcc_v16qi_s a_ = (__gcc_v16qi_s)a;
+  __gcc_v16qi_s b_ = (__gcc_v16qi_s)b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsb256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef signed char __gcc_v32qi_s __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_pmaxsb256(__gcc_v32qi a, __gcc_v32qi b)
+{
+  __gcc_v32qi_s a_ = (__gcc_v32qi_s)a;
+  __gcc_v32qi_s b_ = (__gcc_v32qi_s)b;
+  __gcc_v32qi_s dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = a_[j] > b_[j] ? a_[j] : b_[j];
+  return (__gcc_v32qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsb256_mask */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef signed char __gcc_v32qi_s __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_pmaxsb256_mask(
+  __gcc_v32qi a,
+  __gcc_v32qi b,
+  __gcc_v32qi src,
+  unsigned int k)
+{
+  __gcc_v32qi_s a_ = (__gcc_v32qi_s)a;
+  __gcc_v32qi_s b_ = (__gcc_v32qi_s)b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsb512_mask */
+
+typedef char __gcc_v64qi __attribute__((__vector_size__(64)));
+typedef signed char __gcc_v64qi_s __attribute__((__vector_size__(64)));
+
+__gcc_v64qi __builtin_ia32_pmaxsb512_mask(
+  __gcc_v64qi a,
+  __gcc_v64qi b,
+  __gcc_v64qi src,
+  unsigned long long k)
+{
+  __gcc_v64qi_s a_ = (__gcc_v64qi_s)a;
+  __gcc_v64qi_s b_ = (__gcc_v64qi_s)b;
+  __gcc_v64qi dst;
+  for(int j = 0; j < 64; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsd128 */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pmaxsd128(__gcc_v4si a, __gcc_v4si b)
+{
+  __gcc_v4si a_ = a;
+  __gcc_v4si b_ = b;
+  __gcc_v4si dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] > b_[j] ? a_[j] : b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsd128_mask */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pmaxsd128_mask(
+  __gcc_v4si a,
+  __gcc_v4si b,
+  __gcc_v4si src,
+  unsigned char k)
+{
+  __gcc_v4si a_ = a;
+  __gcc_v4si b_ = b;
+  __gcc_v4si dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsd256 */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pmaxsd256(__gcc_v8si a, __gcc_v8si b)
+{
+  __gcc_v8si a_ = a;
+  __gcc_v8si b_ = b;
+  __gcc_v8si dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] > b_[j] ? a_[j] : b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsd256_mask */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pmaxsd256_mask(
+  __gcc_v8si a,
+  __gcc_v8si b,
+  __gcc_v8si src,
+  unsigned char k)
+{
+  __gcc_v8si a_ = a;
+  __gcc_v8si b_ = b;
+  __gcc_v8si dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsd512_mask */
+
+typedef int __gcc_v16si __attribute__((__vector_size__(64)));
+
+__gcc_v16si __builtin_ia32_pmaxsd512_mask(
+  __gcc_v16si a,
+  __gcc_v16si b,
+  __gcc_v16si src,
+  unsigned short k)
+{
+  __gcc_v16si a_ = a;
+  __gcc_v16si b_ = b;
+  __gcc_v16si dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsw128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pmaxsw128(__gcc_v8hi a, __gcc_v8hi b)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] > b_[j] ? a_[j] : b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsw128_mask */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pmaxsw128_mask(
+  __gcc_v8hi a,
+  __gcc_v8hi b,
+  __gcc_v8hi src,
+  unsigned char k)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pmaxsw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] > b_[j] ? a_[j] : b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsw256_mask */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pmaxsw256_mask(
+  __gcc_v16hi a,
+  __gcc_v16hi b,
+  __gcc_v16hi src,
+  unsigned short k)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxsw512_mask */
+
+typedef short __gcc_v32hi __attribute__((__vector_size__(64)));
+
+__gcc_v32hi __builtin_ia32_pmaxsw512_mask(
+  __gcc_v32hi a,
+  __gcc_v32hi b,
+  __gcc_v32hi src,
+  unsigned int k)
+{
+  __gcc_v32hi a_ = a;
+  __gcc_v32hi b_ = b;
+  __gcc_v32hi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxub128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_pmaxub128(__gcc_v16qi a, __gcc_v16qi b)
+{
+  __gcc_v16qi_u a_ = (__gcc_v16qi_u)a;
+  __gcc_v16qi_u b_ = (__gcc_v16qi_u)b;
+  __gcc_v16qi_u dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] > b_[j] ? a_[j] : b_[j];
+  return (__gcc_v16qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxub128_mask */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_pmaxub128_mask(
+  __gcc_v16qi a,
+  __gcc_v16qi b,
+  __gcc_v16qi src,
+  unsigned short k)
+{
+  __gcc_v16qi_u a_ = (__gcc_v16qi_u)a;
+  __gcc_v16qi_u b_ = (__gcc_v16qi_u)b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxub256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef unsigned char __gcc_v32qi_u __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_pmaxub256(__gcc_v32qi a, __gcc_v32qi b)
+{
+  __gcc_v32qi_u a_ = (__gcc_v32qi_u)a;
+  __gcc_v32qi_u b_ = (__gcc_v32qi_u)b;
+  __gcc_v32qi_u dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = a_[j] > b_[j] ? a_[j] : b_[j];
+  return (__gcc_v32qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxub256_mask */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef unsigned char __gcc_v32qi_u __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_pmaxub256_mask(
+  __gcc_v32qi a,
+  __gcc_v32qi b,
+  __gcc_v32qi src,
+  unsigned int k)
+{
+  __gcc_v32qi_u a_ = (__gcc_v32qi_u)a;
+  __gcc_v32qi_u b_ = (__gcc_v32qi_u)b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxub512_mask */
+
+typedef char __gcc_v64qi __attribute__((__vector_size__(64)));
+typedef unsigned char __gcc_v64qi_u __attribute__((__vector_size__(64)));
+
+__gcc_v64qi __builtin_ia32_pmaxub512_mask(
+  __gcc_v64qi a,
+  __gcc_v64qi b,
+  __gcc_v64qi src,
+  unsigned long long k)
+{
+  __gcc_v64qi_u a_ = (__gcc_v64qi_u)a;
+  __gcc_v64qi_u b_ = (__gcc_v64qi_u)b;
+  __gcc_v64qi dst;
+  for(int j = 0; j < 64; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxud128 */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pmaxud128(__gcc_v4si a, __gcc_v4si b)
+{
+  __gcc_v4si_u a_ = (__gcc_v4si_u)a;
+  __gcc_v4si_u b_ = (__gcc_v4si_u)b;
+  __gcc_v4si_u dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] > b_[j] ? a_[j] : b_[j];
+  return (__gcc_v4si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxud128_mask */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pmaxud128_mask(
+  __gcc_v4si a,
+  __gcc_v4si b,
+  __gcc_v4si src,
+  unsigned char k)
+{
+  __gcc_v4si_u a_ = (__gcc_v4si_u)a;
+  __gcc_v4si_u b_ = (__gcc_v4si_u)b;
+  __gcc_v4si dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxud256 */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pmaxud256(__gcc_v8si a, __gcc_v8si b)
+{
+  __gcc_v8si_u a_ = (__gcc_v8si_u)a;
+  __gcc_v8si_u b_ = (__gcc_v8si_u)b;
+  __gcc_v8si_u dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] > b_[j] ? a_[j] : b_[j];
+  return (__gcc_v8si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxud256_mask */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pmaxud256_mask(
+  __gcc_v8si a,
+  __gcc_v8si b,
+  __gcc_v8si src,
+  unsigned char k)
+{
+  __gcc_v8si_u a_ = (__gcc_v8si_u)a;
+  __gcc_v8si_u b_ = (__gcc_v8si_u)b;
+  __gcc_v8si dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxud512_mask */
+
+typedef int __gcc_v16si __attribute__((__vector_size__(64)));
+typedef unsigned int __gcc_v16si_u __attribute__((__vector_size__(64)));
+
+__gcc_v16si __builtin_ia32_pmaxud512_mask(
+  __gcc_v16si a,
+  __gcc_v16si b,
+  __gcc_v16si src,
+  unsigned short k)
+{
+  __gcc_v16si_u a_ = (__gcc_v16si_u)a;
+  __gcc_v16si_u b_ = (__gcc_v16si_u)b;
+  __gcc_v16si dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxuw128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pmaxuw128(__gcc_v8hi a, __gcc_v8hi b)
+{
+  __gcc_v8hi_u a_ = (__gcc_v8hi_u)a;
+  __gcc_v8hi_u b_ = (__gcc_v8hi_u)b;
+  __gcc_v8hi_u dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] > b_[j] ? a_[j] : b_[j];
+  return (__gcc_v8hi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxuw128_mask */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pmaxuw128_mask(
+  __gcc_v8hi a,
+  __gcc_v8hi b,
+  __gcc_v8hi src,
+  unsigned char k)
+{
+  __gcc_v8hi_u a_ = (__gcc_v8hi_u)a;
+  __gcc_v8hi_u b_ = (__gcc_v8hi_u)b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxuw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pmaxuw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi_u a_ = (__gcc_v16hi_u)a;
+  __gcc_v16hi_u b_ = (__gcc_v16hi_u)b;
+  __gcc_v16hi_u dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] > b_[j] ? a_[j] : b_[j];
+  return (__gcc_v16hi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxuw256_mask */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pmaxuw256_mask(
+  __gcc_v16hi a,
+  __gcc_v16hi b,
+  __gcc_v16hi src,
+  unsigned short k)
+{
+  __gcc_v16hi_u a_ = (__gcc_v16hi_u)a;
+  __gcc_v16hi_u b_ = (__gcc_v16hi_u)b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmaxuw512_mask */
+
+typedef short __gcc_v32hi __attribute__((__vector_size__(64)));
+typedef unsigned short __gcc_v32hi_u __attribute__((__vector_size__(64)));
+
+__gcc_v32hi __builtin_ia32_pmaxuw512_mask(
+  __gcc_v32hi a,
+  __gcc_v32hi b,
+  __gcc_v32hi src,
+  unsigned int k)
+{
+  __gcc_v32hi_u a_ = (__gcc_v32hi_u)a;
+  __gcc_v32hi_u b_ = (__gcc_v32hi_u)b;
+  __gcc_v32hi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] > b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsb128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_pminsb128(__gcc_v16qi a, __gcc_v16qi b)
+{
+  __gcc_v16qi_s a_ = (__gcc_v16qi_s)a;
+  __gcc_v16qi_s b_ = (__gcc_v16qi_s)b;
+  __gcc_v16qi_s dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] < b_[j] ? a_[j] : b_[j];
+  return (__gcc_v16qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsb128_mask */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_pminsb128_mask(
+  __gcc_v16qi a,
+  __gcc_v16qi b,
+  __gcc_v16qi src,
+  unsigned short k)
+{
+  __gcc_v16qi_s a_ = (__gcc_v16qi_s)a;
+  __gcc_v16qi_s b_ = (__gcc_v16qi_s)b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsb256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef signed char __gcc_v32qi_s __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_pminsb256(__gcc_v32qi a, __gcc_v32qi b)
+{
+  __gcc_v32qi_s a_ = (__gcc_v32qi_s)a;
+  __gcc_v32qi_s b_ = (__gcc_v32qi_s)b;
+  __gcc_v32qi_s dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = a_[j] < b_[j] ? a_[j] : b_[j];
+  return (__gcc_v32qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsb256_mask */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef signed char __gcc_v32qi_s __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_pminsb256_mask(
+  __gcc_v32qi a,
+  __gcc_v32qi b,
+  __gcc_v32qi src,
+  unsigned int k)
+{
+  __gcc_v32qi_s a_ = (__gcc_v32qi_s)a;
+  __gcc_v32qi_s b_ = (__gcc_v32qi_s)b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsb512_mask */
+
+typedef char __gcc_v64qi __attribute__((__vector_size__(64)));
+typedef signed char __gcc_v64qi_s __attribute__((__vector_size__(64)));
+
+__gcc_v64qi __builtin_ia32_pminsb512_mask(
+  __gcc_v64qi a,
+  __gcc_v64qi b,
+  __gcc_v64qi src,
+  unsigned long long k)
+{
+  __gcc_v64qi_s a_ = (__gcc_v64qi_s)a;
+  __gcc_v64qi_s b_ = (__gcc_v64qi_s)b;
+  __gcc_v64qi dst;
+  for(int j = 0; j < 64; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsd128 */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pminsd128(__gcc_v4si a, __gcc_v4si b)
+{
+  __gcc_v4si a_ = a;
+  __gcc_v4si b_ = b;
+  __gcc_v4si dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] < b_[j] ? a_[j] : b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsd128_mask */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pminsd128_mask(
+  __gcc_v4si a,
+  __gcc_v4si b,
+  __gcc_v4si src,
+  unsigned char k)
+{
+  __gcc_v4si a_ = a;
+  __gcc_v4si b_ = b;
+  __gcc_v4si dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsd256 */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pminsd256(__gcc_v8si a, __gcc_v8si b)
+{
+  __gcc_v8si a_ = a;
+  __gcc_v8si b_ = b;
+  __gcc_v8si dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] < b_[j] ? a_[j] : b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsd256_mask */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pminsd256_mask(
+  __gcc_v8si a,
+  __gcc_v8si b,
+  __gcc_v8si src,
+  unsigned char k)
+{
+  __gcc_v8si a_ = a;
+  __gcc_v8si b_ = b;
+  __gcc_v8si dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsd512_mask */
+
+typedef int __gcc_v16si __attribute__((__vector_size__(64)));
+
+__gcc_v16si __builtin_ia32_pminsd512_mask(
+  __gcc_v16si a,
+  __gcc_v16si b,
+  __gcc_v16si src,
+  unsigned short k)
+{
+  __gcc_v16si a_ = a;
+  __gcc_v16si b_ = b;
+  __gcc_v16si dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsw128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pminsw128(__gcc_v8hi a, __gcc_v8hi b)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] < b_[j] ? a_[j] : b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsw128_mask */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pminsw128_mask(
+  __gcc_v8hi a,
+  __gcc_v8hi b,
+  __gcc_v8hi src,
+  unsigned char k)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pminsw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] < b_[j] ? a_[j] : b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsw256_mask */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pminsw256_mask(
+  __gcc_v16hi a,
+  __gcc_v16hi b,
+  __gcc_v16hi src,
+  unsigned short k)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminsw512_mask */
+
+typedef short __gcc_v32hi __attribute__((__vector_size__(64)));
+
+__gcc_v32hi __builtin_ia32_pminsw512_mask(
+  __gcc_v32hi a,
+  __gcc_v32hi b,
+  __gcc_v32hi src,
+  unsigned int k)
+{
+  __gcc_v32hi a_ = a;
+  __gcc_v32hi b_ = b;
+  __gcc_v32hi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminub128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_pminub128(__gcc_v16qi a, __gcc_v16qi b)
+{
+  __gcc_v16qi_u a_ = (__gcc_v16qi_u)a;
+  __gcc_v16qi_u b_ = (__gcc_v16qi_u)b;
+  __gcc_v16qi_u dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] < b_[j] ? a_[j] : b_[j];
+  return (__gcc_v16qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminub128_mask */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_pminub128_mask(
+  __gcc_v16qi a,
+  __gcc_v16qi b,
+  __gcc_v16qi src,
+  unsigned short k)
+{
+  __gcc_v16qi_u a_ = (__gcc_v16qi_u)a;
+  __gcc_v16qi_u b_ = (__gcc_v16qi_u)b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminub256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef unsigned char __gcc_v32qi_u __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_pminub256(__gcc_v32qi a, __gcc_v32qi b)
+{
+  __gcc_v32qi_u a_ = (__gcc_v32qi_u)a;
+  __gcc_v32qi_u b_ = (__gcc_v32qi_u)b;
+  __gcc_v32qi_u dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = a_[j] < b_[j] ? a_[j] : b_[j];
+  return (__gcc_v32qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminub256_mask */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef unsigned char __gcc_v32qi_u __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_pminub256_mask(
+  __gcc_v32qi a,
+  __gcc_v32qi b,
+  __gcc_v32qi src,
+  unsigned int k)
+{
+  __gcc_v32qi_u a_ = (__gcc_v32qi_u)a;
+  __gcc_v32qi_u b_ = (__gcc_v32qi_u)b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminub512_mask */
+
+typedef char __gcc_v64qi __attribute__((__vector_size__(64)));
+typedef unsigned char __gcc_v64qi_u __attribute__((__vector_size__(64)));
+
+__gcc_v64qi __builtin_ia32_pminub512_mask(
+  __gcc_v64qi a,
+  __gcc_v64qi b,
+  __gcc_v64qi src,
+  unsigned long long k)
+{
+  __gcc_v64qi_u a_ = (__gcc_v64qi_u)a;
+  __gcc_v64qi_u b_ = (__gcc_v64qi_u)b;
+  __gcc_v64qi dst;
+  for(int j = 0; j < 64; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminud128 */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pminud128(__gcc_v4si a, __gcc_v4si b)
+{
+  __gcc_v4si_u a_ = (__gcc_v4si_u)a;
+  __gcc_v4si_u b_ = (__gcc_v4si_u)b;
+  __gcc_v4si_u dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] < b_[j] ? a_[j] : b_[j];
+  return (__gcc_v4si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminud128_mask */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pminud128_mask(
+  __gcc_v4si a,
+  __gcc_v4si b,
+  __gcc_v4si src,
+  unsigned char k)
+{
+  __gcc_v4si_u a_ = (__gcc_v4si_u)a;
+  __gcc_v4si_u b_ = (__gcc_v4si_u)b;
+  __gcc_v4si dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminud256 */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pminud256(__gcc_v8si a, __gcc_v8si b)
+{
+  __gcc_v8si_u a_ = (__gcc_v8si_u)a;
+  __gcc_v8si_u b_ = (__gcc_v8si_u)b;
+  __gcc_v8si_u dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] < b_[j] ? a_[j] : b_[j];
+  return (__gcc_v8si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminud256_mask */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pminud256_mask(
+  __gcc_v8si a,
+  __gcc_v8si b,
+  __gcc_v8si src,
+  unsigned char k)
+{
+  __gcc_v8si_u a_ = (__gcc_v8si_u)a;
+  __gcc_v8si_u b_ = (__gcc_v8si_u)b;
+  __gcc_v8si dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminud512_mask */
+
+typedef int __gcc_v16si __attribute__((__vector_size__(64)));
+typedef unsigned int __gcc_v16si_u __attribute__((__vector_size__(64)));
+
+__gcc_v16si __builtin_ia32_pminud512_mask(
+  __gcc_v16si a,
+  __gcc_v16si b,
+  __gcc_v16si src,
+  unsigned short k)
+{
+  __gcc_v16si_u a_ = (__gcc_v16si_u)a;
+  __gcc_v16si_u b_ = (__gcc_v16si_u)b;
+  __gcc_v16si dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminuw128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pminuw128(__gcc_v8hi a, __gcc_v8hi b)
+{
+  __gcc_v8hi_u a_ = (__gcc_v8hi_u)a;
+  __gcc_v8hi_u b_ = (__gcc_v8hi_u)b;
+  __gcc_v8hi_u dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] < b_[j] ? a_[j] : b_[j];
+  return (__gcc_v8hi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminuw128_mask */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pminuw128_mask(
+  __gcc_v8hi a,
+  __gcc_v8hi b,
+  __gcc_v8hi src,
+  unsigned char k)
+{
+  __gcc_v8hi_u a_ = (__gcc_v8hi_u)a;
+  __gcc_v8hi_u b_ = (__gcc_v8hi_u)b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminuw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pminuw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi_u a_ = (__gcc_v16hi_u)a;
+  __gcc_v16hi_u b_ = (__gcc_v16hi_u)b;
+  __gcc_v16hi_u dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] < b_[j] ? a_[j] : b_[j];
+  return (__gcc_v16hi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminuw256_mask */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pminuw256_mask(
+  __gcc_v16hi a,
+  __gcc_v16hi b,
+  __gcc_v16hi src,
+  unsigned short k)
+{
+  __gcc_v16hi_u a_ = (__gcc_v16hi_u)a;
+  __gcc_v16hi_u b_ = (__gcc_v16hi_u)b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pminuw512_mask */
+
+typedef short __gcc_v32hi __attribute__((__vector_size__(64)));
+typedef unsigned short __gcc_v32hi_u __attribute__((__vector_size__(64)));
+
+__gcc_v32hi __builtin_ia32_pminuw512_mask(
+  __gcc_v32hi a,
+  __gcc_v32hi b,
+  __gcc_v32hi src,
+  unsigned int k)
+{
+  __gcc_v32hi_u a_ = (__gcc_v32hi_u)a;
+  __gcc_v32hi_u b_ = (__gcc_v32hi_u)b;
+  __gcc_v32hi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] < b_[j] ? a_[j] : b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmulld128 */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pmulld128(__gcc_v4si a, __gcc_v4si b)
+{
+  __gcc_v4si_u a_ = (__gcc_v4si_u)a;
+  __gcc_v4si_u b_ = (__gcc_v4si_u)b;
+  __gcc_v4si_u dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] * b_[j];
+  return (__gcc_v4si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmulld128_mask */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pmulld128_mask(
+  __gcc_v4si a,
+  __gcc_v4si b,
+  __gcc_v4si src,
+  unsigned char k)
+{
+  __gcc_v4si_u a_ = (__gcc_v4si_u)a;
+  __gcc_v4si_u b_ = (__gcc_v4si_u)b;
+  __gcc_v4si dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] * b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmulld256 */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pmulld256(__gcc_v8si a, __gcc_v8si b)
+{
+  __gcc_v8si_u a_ = (__gcc_v8si_u)a;
+  __gcc_v8si_u b_ = (__gcc_v8si_u)b;
+  __gcc_v8si_u dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] * b_[j];
+  return (__gcc_v8si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmulld256_mask */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pmulld256_mask(
+  __gcc_v8si a,
+  __gcc_v8si b,
+  __gcc_v8si src,
+  unsigned char k)
+{
+  __gcc_v8si_u a_ = (__gcc_v8si_u)a;
+  __gcc_v8si_u b_ = (__gcc_v8si_u)b;
+  __gcc_v8si dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] * b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmulld512_mask */
+
+typedef int __gcc_v16si __attribute__((__vector_size__(64)));
+typedef unsigned int __gcc_v16si_u __attribute__((__vector_size__(64)));
+
+__gcc_v16si __builtin_ia32_pmulld512_mask(
+  __gcc_v16si a,
+  __gcc_v16si b,
+  __gcc_v16si src,
+  unsigned short k)
+{
+  __gcc_v16si_u a_ = (__gcc_v16si_u)a;
+  __gcc_v16si_u b_ = (__gcc_v16si_u)b;
+  __gcc_v16si dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] * b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmullw128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pmullw128(__gcc_v8hi a, __gcc_v8hi b)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] * b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmullw128_mask */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_pmullw128_mask(
+  __gcc_v8hi a,
+  __gcc_v8hi b,
+  __gcc_v8hi src,
+  unsigned char k)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] * b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmullw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pmullw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] * b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmullw256_mask */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_pmullw256_mask(
+  __gcc_v16hi a,
+  __gcc_v16hi b,
+  __gcc_v16hi src,
+  unsigned short k)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] * b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pmullw512_mask */
+
+typedef short __gcc_v32hi __attribute__((__vector_size__(64)));
+
+__gcc_v32hi __builtin_ia32_pmullw512_mask(
+  __gcc_v32hi a,
+  __gcc_v32hi b,
+  __gcc_v32hi src,
+  unsigned int k)
+{
+  __gcc_v32hi a_ = a;
+  __gcc_v32hi b_ = b;
+  __gcc_v32hi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] * b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_por128 */
+
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+
+__gcc_v2di __builtin_ia32_por128(__gcc_v2di a, __gcc_v2di b)
+{
+  __gcc_v2di a_ = a;
+  __gcc_v2di b_ = b;
+  __gcc_v2di dst;
+  for(int j = 0; j < 2; j++)
+    dst[j] = a_[j] | b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_por256 */
+
+typedef long long __gcc_v4di __attribute__((__vector_size__(32)));
+
+__gcc_v4di __builtin_ia32_por256(__gcc_v4di a, __gcc_v4di b)
+{
+  __gcc_v4di a_ = a;
+  __gcc_v4di b_ = b;
+  __gcc_v4di dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] | b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pslldi128 */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_pslldi128(__gcc_v4si a, int b)
+{
+  __gcc_v4si_u a_ = (__gcc_v4si_u)a;
+  __gcc_v4si_u dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (unsigned)b >= 32 ? 0 : a_[j] << b;
+  return (__gcc_v4si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_pslldi256 */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_pslldi256(__gcc_v8si a, int b)
+{
+  __gcc_v8si_u a_ = (__gcc_v8si_u)a;
+  __gcc_v8si_u dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (unsigned)b >= 32 ? 0 : a_[j] << b;
+  return (__gcc_v8si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psllqi128 */
+
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+
+__gcc_v2di __builtin_ia32_psllqi128(__gcc_v2di a, int b)
+{
+  __gcc_v2di_u a_ = (__gcc_v2di_u)a;
+  __gcc_v2di_u dst;
+  for(int j = 0; j < 2; j++)
+    dst[j] = (unsigned)b >= 64 ? 0 : a_[j] << b;
+  return (__gcc_v2di)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psllqi256 */
+
+typedef long long __gcc_v4di __attribute__((__vector_size__(32)));
+typedef unsigned long long __gcc_v4di_u __attribute__((__vector_size__(32)));
+
+__gcc_v4di __builtin_ia32_psllqi256(__gcc_v4di a, int b)
+{
+  __gcc_v4di_u a_ = (__gcc_v4di_u)a;
+  __gcc_v4di_u dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (unsigned)b >= 64 ? 0 : a_[j] << b;
+  return (__gcc_v4di)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psllwi128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_psllwi128(__gcc_v8hi a, int b)
+{
+  __gcc_v8hi_u a_ = (__gcc_v8hi_u)a;
+  __gcc_v8hi_u dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (unsigned)b >= 16 ? 0 : a_[j] << b;
+  return (__gcc_v8hi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psllwi256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_psllwi256(__gcc_v16hi a, int b)
+{
+  __gcc_v16hi_u a_ = (__gcc_v16hi_u)a;
+  __gcc_v16hi_u dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (unsigned)b >= 16 ? 0 : a_[j] << b;
+  return (__gcc_v16hi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psradi128 */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_psradi128(__gcc_v4si a, int b)
+{
+  __gcc_v4si a_ = a;
+  __gcc_v4si dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (unsigned)b >= 32 ? (a_[j] < 0 ? -1 : 0) : a_[j] >> b;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psradi256 */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_psradi256(__gcc_v8si a, int b)
+{
+  __gcc_v8si a_ = a;
+  __gcc_v8si dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (unsigned)b >= 32 ? (a_[j] < 0 ? -1 : 0) : a_[j] >> b;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psrawi128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_psrawi128(__gcc_v8hi a, int b)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (unsigned)b >= 16 ? (a_[j] < 0 ? -1 : 0) : a_[j] >> b;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psrawi256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_psrawi256(__gcc_v16hi a, int b)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (unsigned)b >= 16 ? (a_[j] < 0 ? -1 : 0) : a_[j] >> b;
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psrldi128 */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_psrldi128(__gcc_v4si a, int b)
+{
+  __gcc_v4si_u a_ = (__gcc_v4si_u)a;
+  __gcc_v4si_u dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (unsigned)b >= 32 ? 0 : a_[j] >> b;
+  return (__gcc_v4si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psrldi256 */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_psrldi256(__gcc_v8si a, int b)
+{
+  __gcc_v8si_u a_ = (__gcc_v8si_u)a;
+  __gcc_v8si_u dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (unsigned)b >= 32 ? 0 : a_[j] >> b;
+  return (__gcc_v8si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psrlqi128 */
+
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+
+__gcc_v2di __builtin_ia32_psrlqi128(__gcc_v2di a, int b)
+{
+  __gcc_v2di_u a_ = (__gcc_v2di_u)a;
+  __gcc_v2di_u dst;
+  for(int j = 0; j < 2; j++)
+    dst[j] = (unsigned)b >= 64 ? 0 : a_[j] >> b;
+  return (__gcc_v2di)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psrlqi256 */
+
+typedef long long __gcc_v4di __attribute__((__vector_size__(32)));
+typedef unsigned long long __gcc_v4di_u __attribute__((__vector_size__(32)));
+
+__gcc_v4di __builtin_ia32_psrlqi256(__gcc_v4di a, int b)
+{
+  __gcc_v4di_u a_ = (__gcc_v4di_u)a;
+  __gcc_v4di_u dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (unsigned)b >= 64 ? 0 : a_[j] >> b;
+  return (__gcc_v4di)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psrlwi128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_psrlwi128(__gcc_v8hi a, int b)
+{
+  __gcc_v8hi_u a_ = (__gcc_v8hi_u)a;
+  __gcc_v8hi_u dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (unsigned)b >= 16 ? 0 : a_[j] >> b;
+  return (__gcc_v8hi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psrlwi256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_psrlwi256(__gcc_v16hi a, int b)
+{
+  __gcc_v16hi_u a_ = (__gcc_v16hi_u)a;
+  __gcc_v16hi_u dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (unsigned)b >= 16 ? 0 : a_[j] >> b;
+  return (__gcc_v16hi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubb */
+
+typedef char __gcc_v8qi __attribute__((__vector_size__(8)));
+
+__gcc_v8qi __builtin_ia32_psubb(__gcc_v8qi a, __gcc_v8qi b)
+{
+  __gcc_v8qi a_ = a;
+  __gcc_v8qi b_ = b;
+  __gcc_v8qi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] - b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubb128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_psubb128(__gcc_v16qi a, __gcc_v16qi b)
+{
+  __gcc_v16qi a_ = a;
+  __gcc_v16qi b_ = b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] - b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubb128_mask */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_psubb128_mask(
+  __gcc_v16qi a,
+  __gcc_v16qi b,
+  __gcc_v16qi src,
+  unsigned short k)
+{
+  __gcc_v16qi a_ = a;
+  __gcc_v16qi b_ = b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubb256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_psubb256(__gcc_v32qi a, __gcc_v32qi b)
+{
+  __gcc_v32qi a_ = a;
+  __gcc_v32qi b_ = b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = a_[j] - b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubb256_mask */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_psubb256_mask(
+  __gcc_v32qi a,
+  __gcc_v32qi b,
+  __gcc_v32qi src,
+  unsigned int k)
+{
+  __gcc_v32qi a_ = a;
+  __gcc_v32qi b_ = b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubb512_mask */
+
+typedef char __gcc_v64qi __attribute__((__vector_size__(64)));
+
+__gcc_v64qi __builtin_ia32_psubb512_mask(
+  __gcc_v64qi a,
+  __gcc_v64qi b,
+  __gcc_v64qi src,
+  unsigned long long k)
+{
+  __gcc_v64qi a_ = a;
+  __gcc_v64qi b_ = b;
+  __gcc_v64qi dst;
+  for(int j = 0; j < 64; j++)
+    dst[j] = (k >> j) & 1 ? (char)(a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubd */
+
+typedef int __gcc_v2si __attribute__((__vector_size__(8)));
+typedef unsigned int __gcc_v2si_u __attribute__((__vector_size__(8)));
+
+__gcc_v2si __builtin_ia32_psubd(__gcc_v2si a, __gcc_v2si b)
+{
+  __gcc_v2si_u a_ = (__gcc_v2si_u)a;
+  __gcc_v2si_u b_ = (__gcc_v2si_u)b;
+  __gcc_v2si_u dst;
+  for(int j = 0; j < 2; j++)
+    dst[j] = a_[j] - b_[j];
+  return (__gcc_v2si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubd128 */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_psubd128(__gcc_v4si a, __gcc_v4si b)
+{
+  __gcc_v4si_u a_ = (__gcc_v4si_u)a;
+  __gcc_v4si_u b_ = (__gcc_v4si_u)b;
+  __gcc_v4si_u dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] - b_[j];
+  return (__gcc_v4si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubd128_mask */
+
+typedef int __gcc_v4si __attribute__((__vector_size__(16)));
+typedef unsigned int __gcc_v4si_u __attribute__((__vector_size__(16)));
+
+__gcc_v4si __builtin_ia32_psubd128_mask(
+  __gcc_v4si a,
+  __gcc_v4si b,
+  __gcc_v4si src,
+  unsigned char k)
+{
+  __gcc_v4si_u a_ = (__gcc_v4si_u)a;
+  __gcc_v4si_u b_ = (__gcc_v4si_u)b;
+  __gcc_v4si dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubd256 */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_psubd256(__gcc_v8si a, __gcc_v8si b)
+{
+  __gcc_v8si_u a_ = (__gcc_v8si_u)a;
+  __gcc_v8si_u b_ = (__gcc_v8si_u)b;
+  __gcc_v8si_u dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] - b_[j];
+  return (__gcc_v8si)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubd256_mask */
+
+typedef int __gcc_v8si __attribute__((__vector_size__(32)));
+typedef unsigned int __gcc_v8si_u __attribute__((__vector_size__(32)));
+
+__gcc_v8si __builtin_ia32_psubd256_mask(
+  __gcc_v8si a,
+  __gcc_v8si b,
+  __gcc_v8si src,
+  unsigned char k)
+{
+  __gcc_v8si_u a_ = (__gcc_v8si_u)a;
+  __gcc_v8si_u b_ = (__gcc_v8si_u)b;
+  __gcc_v8si dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubd512_mask */
+
+typedef int __gcc_v16si __attribute__((__vector_size__(64)));
+typedef unsigned int __gcc_v16si_u __attribute__((__vector_size__(64)));
+
+__gcc_v16si __builtin_ia32_psubd512_mask(
+  __gcc_v16si a,
+  __gcc_v16si b,
+  __gcc_v16si src,
+  unsigned short k)
+{
+  __gcc_v16si_u a_ = (__gcc_v16si_u)a;
+  __gcc_v16si_u b_ = (__gcc_v16si_u)b;
+  __gcc_v16si dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (int)(a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubq128 */
+
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+
+__gcc_v2di __builtin_ia32_psubq128(__gcc_v2di a, __gcc_v2di b)
+{
+  __gcc_v2di_u a_ = (__gcc_v2di_u)a;
+  __gcc_v2di_u b_ = (__gcc_v2di_u)b;
+  __gcc_v2di_u dst;
+  for(int j = 0; j < 2; j++)
+    dst[j] = a_[j] - b_[j];
+  return (__gcc_v2di)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubq128_mask */
+
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+typedef unsigned long long __gcc_v2di_u __attribute__((__vector_size__(16)));
+
+__gcc_v2di __builtin_ia32_psubq128_mask(
+  __gcc_v2di a,
+  __gcc_v2di b,
+  __gcc_v2di src,
+  unsigned char k)
+{
+  __gcc_v2di_u a_ = (__gcc_v2di_u)a;
+  __gcc_v2di_u b_ = (__gcc_v2di_u)b;
+  __gcc_v2di dst;
+  for(int j = 0; j < 2; j++)
+    dst[j] = (k >> j) & 1 ? (long long)(a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubq256 */
+
+typedef long long __gcc_v4di __attribute__((__vector_size__(32)));
+typedef unsigned long long __gcc_v4di_u __attribute__((__vector_size__(32)));
+
+__gcc_v4di __builtin_ia32_psubq256(__gcc_v4di a, __gcc_v4di b)
+{
+  __gcc_v4di_u a_ = (__gcc_v4di_u)a;
+  __gcc_v4di_u b_ = (__gcc_v4di_u)b;
+  __gcc_v4di_u dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] - b_[j];
+  return (__gcc_v4di)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubq256_mask */
+
+typedef long long __gcc_v4di __attribute__((__vector_size__(32)));
+typedef unsigned long long __gcc_v4di_u __attribute__((__vector_size__(32)));
+
+__gcc_v4di __builtin_ia32_psubq256_mask(
+  __gcc_v4di a,
+  __gcc_v4di b,
+  __gcc_v4di src,
+  unsigned char k)
+{
+  __gcc_v4di_u a_ = (__gcc_v4di_u)a;
+  __gcc_v4di_u b_ = (__gcc_v4di_u)b;
+  __gcc_v4di dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = (k >> j) & 1 ? (long long)(a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubq512_mask */
+
+typedef long long __gcc_v8di __attribute__((__vector_size__(64)));
+typedef unsigned long long __gcc_v8di_u __attribute__((__vector_size__(64)));
+
+__gcc_v8di __builtin_ia32_psubq512_mask(
+  __gcc_v8di a,
+  __gcc_v8di b,
+  __gcc_v8di src,
+  unsigned char k)
+{
+  __gcc_v8di_u a_ = (__gcc_v8di_u)a;
+  __gcc_v8di_u b_ = (__gcc_v8di_u)b;
+  __gcc_v8di dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (long long)(a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubsb128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_psubsb128(__gcc_v16qi a, __gcc_v16qi b)
+{
+  __gcc_v16qi_s a_ = (__gcc_v16qi_s)a;
+  __gcc_v16qi_s b_ = (__gcc_v16qi_s)b;
+  __gcc_v16qi_s dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (a_[j] - b_[j]) < -128  ? -128
+             : (a_[j] - b_[j]) > 127 ? 127
+                                     : a_[j] - b_[j];
+  return (__gcc_v16qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubsb128_mask */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef signed char __gcc_v16qi_s __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_psubsb128_mask(
+  __gcc_v16qi a,
+  __gcc_v16qi b,
+  __gcc_v16qi src,
+  unsigned short k)
+{
+  __gcc_v16qi_s a_ = (__gcc_v16qi_s)a;
+  __gcc_v16qi_s b_ = (__gcc_v16qi_s)b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (char)((a_[j] - b_[j]) < -128 ? -128 : (a_[j] - b_[j]) > 127 ? 127 : a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubsb256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef signed char __gcc_v32qi_s __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_psubsb256(__gcc_v32qi a, __gcc_v32qi b)
+{
+  __gcc_v32qi_s a_ = (__gcc_v32qi_s)a;
+  __gcc_v32qi_s b_ = (__gcc_v32qi_s)b;
+  __gcc_v32qi_s dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (a_[j] - b_[j]) < -128  ? -128
+             : (a_[j] - b_[j]) > 127 ? 127
+                                     : a_[j] - b_[j];
+  return (__gcc_v32qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubsb256_mask */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef signed char __gcc_v32qi_s __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_psubsb256_mask(
+  __gcc_v32qi a,
+  __gcc_v32qi b,
+  __gcc_v32qi src,
+  unsigned int k)
+{
+  __gcc_v32qi_s a_ = (__gcc_v32qi_s)a;
+  __gcc_v32qi_s b_ = (__gcc_v32qi_s)b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (char)((a_[j] - b_[j]) < -128 ? -128 : (a_[j] - b_[j]) > 127 ? 127 : a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubsb512_mask */
+
+typedef char __gcc_v64qi __attribute__((__vector_size__(64)));
+typedef signed char __gcc_v64qi_s __attribute__((__vector_size__(64)));
+
+__gcc_v64qi __builtin_ia32_psubsb512_mask(
+  __gcc_v64qi a,
+  __gcc_v64qi b,
+  __gcc_v64qi src,
+  unsigned long long k)
+{
+  __gcc_v64qi_s a_ = (__gcc_v64qi_s)a;
+  __gcc_v64qi_s b_ = (__gcc_v64qi_s)b;
+  __gcc_v64qi dst;
+  for(int j = 0; j < 64; j++)
+    dst[j] = (k >> j) & 1 ? (char)((a_[j] - b_[j]) < -128 ? -128 : (a_[j] - b_[j]) > 127 ? 127 : a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubsw128_mask */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_psubsw128_mask(
+  __gcc_v8hi a,
+  __gcc_v8hi b,
+  __gcc_v8hi src,
+  unsigned char k)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (short)((a_[j] - b_[j]) < -32768 ? -32768 : (a_[j] - b_[j]) > 32767 ? 32767 : a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubsw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_psubsw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (a_[j] - b_[j]) < -32768  ? -32768
+             : (a_[j] - b_[j]) > 32767 ? 32767
+                                       : a_[j] - b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubsw256_mask */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_psubsw256_mask(
+  __gcc_v16hi a,
+  __gcc_v16hi b,
+  __gcc_v16hi src,
+  unsigned short k)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (short)((a_[j] - b_[j]) < -32768 ? -32768 : (a_[j] - b_[j]) > 32767 ? 32767 : a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubsw512_mask */
+
+typedef short __gcc_v32hi __attribute__((__vector_size__(64)));
+
+__gcc_v32hi __builtin_ia32_psubsw512_mask(
+  __gcc_v32hi a,
+  __gcc_v32hi b,
+  __gcc_v32hi src,
+  unsigned int k)
+{
+  __gcc_v32hi a_ = a;
+  __gcc_v32hi b_ = b;
+  __gcc_v32hi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (short)((a_[j] - b_[j]) < -32768 ? -32768 : (a_[j] - b_[j]) > 32767 ? 32767 : a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubusb128 */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_psubusb128(__gcc_v16qi a, __gcc_v16qi b)
+{
+  __gcc_v16qi_u a_ = (__gcc_v16qi_u)a;
+  __gcc_v16qi_u b_ = (__gcc_v16qi_u)b;
+  __gcc_v16qi_u dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (a_[j] - b_[j]) < 0 ? 0 : a_[j] - b_[j];
+  return (__gcc_v16qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubusb128_mask */
+
+typedef char __gcc_v16qi __attribute__((__vector_size__(16)));
+typedef unsigned char __gcc_v16qi_u __attribute__((__vector_size__(16)));
+
+__gcc_v16qi __builtin_ia32_psubusb128_mask(
+  __gcc_v16qi a,
+  __gcc_v16qi b,
+  __gcc_v16qi src,
+  unsigned short k)
+{
+  __gcc_v16qi_u a_ = (__gcc_v16qi_u)a;
+  __gcc_v16qi_u b_ = (__gcc_v16qi_u)b;
+  __gcc_v16qi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] =
+      (k >> j) & 1 ? (char)((a_[j] - b_[j]) < 0 ? 0 : a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubusb256 */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef unsigned char __gcc_v32qi_u __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_psubusb256(__gcc_v32qi a, __gcc_v32qi b)
+{
+  __gcc_v32qi_u a_ = (__gcc_v32qi_u)a;
+  __gcc_v32qi_u b_ = (__gcc_v32qi_u)b;
+  __gcc_v32qi_u dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (a_[j] - b_[j]) < 0 ? 0 : a_[j] - b_[j];
+  return (__gcc_v32qi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubusb256_mask */
+
+typedef char __gcc_v32qi __attribute__((__vector_size__(32)));
+typedef unsigned char __gcc_v32qi_u __attribute__((__vector_size__(32)));
+
+__gcc_v32qi __builtin_ia32_psubusb256_mask(
+  __gcc_v32qi a,
+  __gcc_v32qi b,
+  __gcc_v32qi src,
+  unsigned int k)
+{
+  __gcc_v32qi_u a_ = (__gcc_v32qi_u)a;
+  __gcc_v32qi_u b_ = (__gcc_v32qi_u)b;
+  __gcc_v32qi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] =
+      (k >> j) & 1 ? (char)((a_[j] - b_[j]) < 0 ? 0 : a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubusb512_mask */
+
+typedef char __gcc_v64qi __attribute__((__vector_size__(64)));
+typedef unsigned char __gcc_v64qi_u __attribute__((__vector_size__(64)));
+
+__gcc_v64qi __builtin_ia32_psubusb512_mask(
+  __gcc_v64qi a,
+  __gcc_v64qi b,
+  __gcc_v64qi src,
+  unsigned long long k)
+{
+  __gcc_v64qi_u a_ = (__gcc_v64qi_u)a;
+  __gcc_v64qi_u b_ = (__gcc_v64qi_u)b;
+  __gcc_v64qi dst;
+  for(int j = 0; j < 64; j++)
+    dst[j] =
+      (k >> j) & 1 ? (char)((a_[j] - b_[j]) < 0 ? 0 : a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubusw128_mask */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+typedef unsigned short __gcc_v8hi_u __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_psubusw128_mask(
+  __gcc_v8hi a,
+  __gcc_v8hi b,
+  __gcc_v8hi src,
+  unsigned char k)
+{
+  __gcc_v8hi_u a_ = (__gcc_v8hi_u)a;
+  __gcc_v8hi_u b_ = (__gcc_v8hi_u)b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] =
+      (k >> j) & 1 ? (short)((a_[j] - b_[j]) < 0 ? 0 : a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubusw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_psubusw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi_u a_ = (__gcc_v16hi_u)a;
+  __gcc_v16hi_u b_ = (__gcc_v16hi_u)b;
+  __gcc_v16hi_u dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (a_[j] - b_[j]) < 0 ? 0 : a_[j] - b_[j];
+  return (__gcc_v16hi)dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubusw256_mask */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+typedef unsigned short __gcc_v16hi_u __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_psubusw256_mask(
+  __gcc_v16hi a,
+  __gcc_v16hi b,
+  __gcc_v16hi src,
+  unsigned short k)
+{
+  __gcc_v16hi_u a_ = (__gcc_v16hi_u)a;
+  __gcc_v16hi_u b_ = (__gcc_v16hi_u)b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] =
+      (k >> j) & 1 ? (short)((a_[j] - b_[j]) < 0 ? 0 : a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubusw512_mask */
+
+typedef short __gcc_v32hi __attribute__((__vector_size__(64)));
+typedef unsigned short __gcc_v32hi_u __attribute__((__vector_size__(64)));
+
+__gcc_v32hi __builtin_ia32_psubusw512_mask(
+  __gcc_v32hi a,
+  __gcc_v32hi b,
+  __gcc_v32hi src,
+  unsigned int k)
+{
+  __gcc_v32hi_u a_ = (__gcc_v32hi_u)a;
+  __gcc_v32hi_u b_ = (__gcc_v32hi_u)b;
+  __gcc_v32hi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] =
+      (k >> j) & 1 ? (short)((a_[j] - b_[j]) < 0 ? 0 : a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubw */
+
+typedef short __gcc_v4hi __attribute__((__vector_size__(8)));
+
+__gcc_v4hi __builtin_ia32_psubw(__gcc_v4hi a, __gcc_v4hi b)
+{
+  __gcc_v4hi a_ = a;
+  __gcc_v4hi b_ = b;
+  __gcc_v4hi dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] - b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubw128 */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_psubw128(__gcc_v8hi a, __gcc_v8hi b)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = a_[j] - b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubw128_mask */
+
+typedef short __gcc_v8hi __attribute__((__vector_size__(16)));
+
+__gcc_v8hi __builtin_ia32_psubw128_mask(
+  __gcc_v8hi a,
+  __gcc_v8hi b,
+  __gcc_v8hi src,
+  unsigned char k)
+{
+  __gcc_v8hi a_ = a;
+  __gcc_v8hi b_ = b;
+  __gcc_v8hi dst;
+  for(int j = 0; j < 8; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubw256 */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_psubw256(__gcc_v16hi a, __gcc_v16hi b)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = a_[j] - b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubw256_mask */
+
+typedef short __gcc_v16hi __attribute__((__vector_size__(32)));
+
+__gcc_v16hi __builtin_ia32_psubw256_mask(
+  __gcc_v16hi a,
+  __gcc_v16hi b,
+  __gcc_v16hi src,
+  unsigned short k)
+{
+  __gcc_v16hi a_ = a;
+  __gcc_v16hi b_ = b;
+  __gcc_v16hi dst;
+  for(int j = 0; j < 16; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_psubw512_mask */
+
+typedef short __gcc_v32hi __attribute__((__vector_size__(64)));
+
+__gcc_v32hi __builtin_ia32_psubw512_mask(
+  __gcc_v32hi a,
+  __gcc_v32hi b,
+  __gcc_v32hi src,
+  unsigned int k)
+{
+  __gcc_v32hi a_ = a;
+  __gcc_v32hi b_ = b;
+  __gcc_v32hi dst;
+  for(int j = 0; j < 32; j++)
+    dst[j] = (k >> j) & 1 ? (short)(a_[j] - b_[j]) : src[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pxor128 */
+
+typedef long long __gcc_v2di __attribute__((__vector_size__(16)));
+
+__gcc_v2di __builtin_ia32_pxor128(__gcc_v2di a, __gcc_v2di b)
+{
+  __gcc_v2di a_ = a;
+  __gcc_v2di b_ = b;
+  __gcc_v2di dst;
+  for(int j = 0; j < 2; j++)
+    dst[j] = a_[j] ^ b_[j];
+  return dst;
+}
+
+/* FUNCTION: __builtin_ia32_pxor256 */
+
+typedef long long __gcc_v4di __attribute__((__vector_size__(32)));
+
+__gcc_v4di __builtin_ia32_pxor256(__gcc_v4di a, __gcc_v4di b)
+{
+  __gcc_v4di a_ = a;
+  __gcc_v4di b_ = b;
+  __gcc_v4di dst;
+  for(int j = 0; j < 4; j++)
+    dst[j] = a_[j] ^ b_[j];
+  return dst;
+}
diff --git a/src/ansi-c/library_check.sh b/src/ansi-c/library_check.sh
index 6883984ae26..0c9202b2194 100755
--- a/src/ansi-c/library_check.sh
+++ b/src/ansi-c/library_check.sh
@@ -101,7 +101,25 @@ perl -p -i -e 's/^_mm_setr_epi(16|32)\n//' __functions # cbmc/SIMD1
 perl -p -i -e 's/^_mm_setr_pi16\n//' __functions # cbmc/SIMD1
 perl -p -i -e 's/^_mm_subs_ep[iu]16\n//' __functions # cbmc/SIMD1
 
-ls ../../regression/cbmc-library/ | egrep -v '(Makefile|CMakeLists.txt)' | sort -u > __tests
+# Builtins referenced by the aggregate regression/cbmc/SIMD* smoke tests are
+# covered there rather than by an individual cbmc-library test, so drop them
+# from the list of functions that still require a test.
+grep -rhoE '__builtin_(ia32|neon)_[A-Za-z0-9_]+' ../../regression/cbmc/SIMD* \
+  2>/dev/null | sort -u > __simd_covered
+comm -23 __functions __simd_covered > __functions.new
+mv __functions.new __functions
+rm __simd_covered
+
+# The __builtin_ia32_* and __builtin_neon_* tests are consolidated into a single
+# directory per family; a builtin counts as covered when a .c file under that
+# directory references it, rather than when it has a directory of its own.
+{
+  ls ../../regression/cbmc-library/ | \
+    grep -Ev '(Makefile|CMakeLists\.txt|tests\.log|^__builtin_ia32$|^__builtin_neon$)'
+  grep -rhoE '__builtin_(ia32|neon)_[A-Za-z0-9_]+' --include='*.c' \
+    ../../regression/cbmc-library/__builtin_ia32 \
+    ../../regression/cbmc-library/__builtin_neon 2>/dev/null
+} | sort -u > __tests
 diff -u __tests __functions
 ec="${?}"
 rm __functions __tests
diff --git a/src/ansi-c/parser.y b/src/ansi-c/parser.y
index 91526b2b8e6..07c0c8bc55a 100644
--- a/src/ansi-c/parser.y
+++ b/src/ansi-c/parser.y
@@ -168,6 +168,7 @@ int yyansi_cerror(const std::string &error);
 %token TOK_GCC_ATTRIBUTE_TRANSPARENT_UNION "transparent_union"
 %token TOK_GCC_ATTRIBUTE_PACKED "packed"
 %token TOK_GCC_ATTRIBUTE_VECTOR_SIZE "vector_size"
+%token TOK_GCC_ATTRIBUTE_NEON_VECTOR_TYPE "neon_vector_type"
 %token TOK_GCC_ATTRIBUTE_MODE "mode"
 %token TOK_GCC_ATTRIBUTE_GNU_INLINE "__gnu_inline__"
 %token TOK_GCC_ATTRIBUTE_WEAK "weak"
@@ -1681,6 +1682,8 @@ gcc_type_attribute:
         { $$=$1; set($$, ID_transparent_union); }
         | TOK_GCC_ATTRIBUTE_VECTOR_SIZE '(' comma_expression ')'
         { $$=$1; set($$, ID_frontend_vector); parser_stack($$).add(ID_size)=parser_stack($3); }
+        | TOK_GCC_ATTRIBUTE_NEON_VECTOR_TYPE '(' comma_expression ')'
+        { $$=$1; set($$, ID_frontend_vector); parser_stack($$).add(ID_size)=parser_stack($3); parser_stack($$).set(ID_C_vector_lanes, true); }
         | TOK_GCC_ATTRIBUTE_ALIGNED
         { $$=$1; set($$, ID_aligned); }
         | TOK_GCC_ATTRIBUTE_ALIGNED '(' comma_expression ')'
diff --git a/src/ansi-c/scanner.l b/src/ansi-c/scanner.l
index f9c7b8674ce..aa8d4f95e45 100644
--- a/src/ansi-c/scanner.l
+++ b/src/ansi-c/scanner.l
@@ -1672,6 +1672,9 @@ enable_or_disable ("enable"|"disable")
 "vector_size" |
 "__vector_size__"   { BEGIN(GCC_ATTRIBUTE3); loc(); return TOK_GCC_ATTRIBUTE_VECTOR_SIZE; }
 
+"neon_vector_type" |
+"__neon_vector_type__" { BEGIN(GCC_ATTRIBUTE3); loc(); return TOK_GCC_ATTRIBUTE_NEON_VECTOR_TYPE; }
+
 "mode" |
 "__mode__"          { BEGIN(GCC_ATTRIBUTE3); loc(); return TOK_GCC_ATTRIBUTE_MODE; }
 
diff --git a/src/util/irep_ids.def b/src/util/irep_ids.def
index 8f1cfe001ab..cecfd010763 100644
--- a/src/util/irep_ids.def
+++ b/src/util/irep_ids.def
@@ -370,6 +370,7 @@ IREP_ID_ONE(designator)
 IREP_ID_ONE(member_designator)
 IREP_ID_ONE(index_designator)
 IREP_ID_TWO(C_constant, #constant)
+IREP_ID_TWO(C_vector_lanes, #vector_lanes)
 IREP_ID_TWO(C_volatile, #volatile)
 IREP_ID_TWO(C_restricted, #restricted)
 IREP_ID_TWO(C_identifier, #identifier)