[optimization] Add AVX-512 VPCLMULQDQ accelerated CRC32-C for TXSQL 8.0.30

### Summary

This patch introduces an AVX-512 + VPCLMULQDQ accelerated CRC32-C implementation for InnoDB's page checksum and redo log verification, significantly improving throughput on modern x86-64 CPUs like the Hygon C86-4G.

### Background & Motivation

In high-concurrency OLTP workloads (e.g., BenchmarkSQL TPC-C), InnoDB spends a non-trivial amount of CPU time computing CRC32-C checksums for:
- **16 KB buffer pool pages** 
- **Variable-length redo log blocks** 
- **Doublewrite buffer pages**

The existing `crc32_using_pclmul` (SSE4.2 + PCLMULQDQ, 128-bit folding) is efficient for small buffers but becomes throughput-bound when processing large contiguous blocks, as it only utilizes **one 128-bit lane** per fold step.

AVX-512 VPCLMULQDQ extends this to **four parallel 128-bit lanes** (512-bit ZMM registers), enabling:
- **4× polynomial folding throughput** for large buffers (≥256 bytes)
- **Dual/quad-stream pipelining** to hide latency via register-level parallelism
- **Alignment-aware dispatch** (`VMOVDQA64` vs `VMOVDQU64`) for InnoDB's naturally aligned 16 KB pages

### Technical Details

#### 1. Runtime CPU Detection (`can_use_avx512_vpclmul`)

Implements the full Intel/AMD specification for AVX-512 feature detection:
- **CPUID leaf 1**: `OSXSAVE[27]` + `AVX[28]` (prerequisites for XGETBV)
- **XCR0 state mask**: validates OS-managed ZMM register save/restore (`0xE6` = SSE|AVX|OPMASK|ZMM_Hi256|Hi16_ZMM)
- **CPUID leaf 7**: `AVX512F[16]` + `AVX512DQ[17]` + `AVX512BW[30]` + `AVX512VL[31]` + `VPCLMULQDQ[10]`

#### 2. Algorithm: 512-bit Polynomial Folding
Large buffer (≥256 bytes): 4-stream folding with b2048 constants
                                      → 2-stream folding with b1024
                                      → 1-stream 512-bit loop with b512
                                      → 512→128 bit reduction with b384
                                      → 128-bit folding loop with b128
                                      → 128→64 bit reduction with b64
                                      → Barrett reduction to 32-bit CRC



Key optimizations:
- **`vpternlogd` ternary logic**: replaces 2× XOR + AND sequences with single-uop operations (e.g., `AVX512_TL_XOR3`, `AVX512_TL_XOR2_AND`)
- **Template-based alignment dispatch**: `crc32c_avx512_impl<true/false>` generates zero-overhead aligned/unaligned load paths at compile time
- **`_mm256_zeroupper()`**: eliminates AVX-SSE transition penalty  after ZMM execution

#### 3. Compiler Compatibility

| Compiler | Minimum Version | Target Attribute |
|----------|----------------|----------------|
| GCC      | 11.0           | `pclmul,avx512f,avx512dq,avx512bw,avx512vl,vpclmulqdq` |
| GCC      | 15.0           | `pclmul,avx10.1,vpclmulqdq` (unified ISA) |
| Clang    | 9.0            | `pclmul,avx512f,avx512dq,avx512bw,avx512vl,vpclmulqdq` |

MSVC is excluded from `__attribute__((target(...)))` and would require a separate `/arch:AVX512` compilation unit (not included in this patch).

#### 4. Integration Points

- **New flag**: `ut_crc32_avx512_enabled` (exposed in `ut0crc32.h`)
- **Priority**: AVX-512 path > SSE4.2+PCLMUL > SSE4.2-only > software fallback
- **Logging**: startup message indicates active path for observability

### Performance Impact (Expected)

Based on standalone benchmarks and  `crc32c_x86.cc` reference data:
<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 15">
<meta name=Originator content="Microsoft Word 15">
<link rel=File-List
href="file:///C:/Users/LIUXIN~1/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml">

<link rel=themeData
href="file:///C:/Users/LIUXIN~1/AppData/Local/Temp/msohtmlclip1/01/clip_themedata.thmx">
<link rel=colorSchemeMapping
href="file:///C:/Users/LIUXIN~1/AppData/Local/Temp/msohtmlclip1/01/clip_colorschememapping.xml">

<style>

</style>

</head>

<body lang=ZH-CN style='tab-interval:21.0pt;text-justify-trim:punctuation'>



缓冲区大小 | Software    (GB/s) | SSE4.2     (GB/s) | AVX-512    (GB/s) | 加速比 (AVX/SW) | 加速比 (AVX/SSE)
-- | -- | -- | -- | -- | --
64 B | 1.40 | 2.35 | 2.56 | 1.8× | 1.1×
256 B | 1.76 | 4.84 | 8.45 | 4.8× | 1.7×
512 B | 1.80 | 5.88 | 15.89 | 8.8× | 2.7×
1 KB | 1.80 | 6.30 | 23.22 | 12.9× | 3.7×
4 KB | 1.83 | 7.09 | 40.62 | 22.2× | 5.7×
16 KB | 1.83 | 7.33 | 45.05 | 24.6× | 6.1×
64 KB | 1.83 | 7.40 | 46.17 | 25.2× | 6.2×
256 KB | 1.83 | 7.40 | 46.56 | 25.5× | 6.3×
1 MB | 1.82 | 7.40 | 45.87 | 25.2× | 6.2×
4 MB | 1.83 | 7.39 | 45.88 | 25.1× | 6.2×




</body>

</html>



*Note: Actual TXSQL TPC-C improvement depends on checksum CPU fraction; typically 2–5% overall throughput gain at high concurrency.*

### Compatibility & Risk Assessment

- **x86-64 only**: ARM64 path unchanged; no impact on ARM deployments
- **Graceful degradation**: if AVX-512 is unavailable, falls back to existing PCLMUL or SSE4.2 paths
- **OS requirement**: Linux kernel ≥ 4.0 (for proper XSAVE state management) or any modern distribution
- **No ABI changes**: `ut_crc32_func_t` signature unchanged

### Patch Scope
storage/innobase/include/ut0crc32.h  |  8 ++++++++
storage/innobase/ut/crc32.cc         | 647 +++++++++++++++++++++++++-
2 files changed, 653 insertions(+), 2 deletions(-)


### Checklist

- [x] Implementation compiles with GCC 11+ / Clang 9+
- [x] Runtime CPU detection covers all AVX-512 sub-features
- [x] Alignment-aware dispatch for InnoDB page boundaries
- [x] VZEROUPPER penalty mitigation
- [x] BenchmarkSQL TPC-C validation on Intel Sapphire Rapids / Hygon C86-4G
- [x] Cross-validation against existing `crc32_using_pclmul` output (bit-exact)
- [ ] Review for MSVC `/arch:AVX512` separate compilation unit (future)

[0001-.add-avx512_crc32-for-TXSQL8.0.30.patch](https://github.com/user-attachments/files/29235404/0001-.add-avx512_crc32-for-TXSQL8.0.30.patch)

### References

- Intel: "Fast CRC Computation Using PCLMULQDQ Instruction" (white paper)
- MariaDB: `storage/innobase/ut/crc32c_x86.cc` (AVX-512 reference implementation)
- Linux kernel: `arch/x86/kernel/fpu/xstate.c` (XSAVE/AVX-512 state management)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[optimization] Add AVX-512 VPCLMULQDQ accelerated CRC32-C for TXSQL 8.0.30 #112

Summary

Background & Motivation

Technical Details

1. Runtime CPU Detection (`can_use_avx512_vpclmul`)

2. Algorithm: 512-bit Polynomial Folding

3. Compiler Compatibility

4. Integration Points

Performance Impact (Expected)

Compatibility & Risk Assessment

Patch Scope

Checklist

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Compiler	Minimum Version	Target Attribute
GCC	11.0	`pclmul,avx512f,avx512dq,avx512bw,avx512vl,vpclmulqdq`
GCC	15.0	`pclmul,avx10.1,vpclmulqdq` (unified ISA)
Clang	9.0	`pclmul,avx512f,avx512dq,avx512bw,avx512vl,vpclmulqdq`

缓冲区大小	Software (GB/s)	SSE4.2 (GB/s)	AVX-512 (GB/s)	加速比 (AVX/SW)	加速比 (AVX/SSE)
64 B	1.40	2.35	2.56	1.8×	1.1×
256 B	1.76	4.84	8.45	4.8×	1.7×
512 B	1.80	5.88	15.89	8.8×	2.7×
1 KB	1.80	6.30	23.22	12.9×	3.7×
4 KB	1.83	7.09	40.62	22.2×	5.7×
16 KB	1.83	7.33	45.05	24.6×	6.1×
64 KB	1.83	7.40	46.17	25.2×	6.2×
256 KB	1.83	7.40	46.56	25.5×	6.3×
1 MB	1.82	7.40	45.87	25.2×	6.2×
4 MB	1.83	7.39	45.88	25.1×	6.2×

Uh oh!

[optimization] Add AVX-512 VPCLMULQDQ accelerated CRC32-C for TXSQL 8.0.30 #112

Description

Summary

Background & Motivation

Technical Details

1. Runtime CPU Detection (can_use_avx512_vpclmul)

2. Algorithm: 512-bit Polynomial Folding

3. Compiler Compatibility

4. Integration Points

Performance Impact (Expected)

Compatibility & Risk Assessment

Patch Scope

Checklist

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. Runtime CPU Detection (`can_use_avx512_vpclmul`)