Summary
This patch introduces an AVX-512 + VPCLMULQDQ accelerated CRC32-C implementation for InnoDB's page checksum and redo log verification. On modern x86-64 CPUs such as the Hygon C86-4G, expected CRC32-C throughput on 16 KB buffers is roughly 6× that of the SSE4.2 path.
Background & Motivation
In high-concurrency OLTP workloads (e.g., BenchmarkSQL TPC-C), InnoDB spends a non-trivial amount of CPU time computing CRC32-C checksums for:
- 16 KB buffer pool pages
- Variable-length redo log blocks
- Doublewrite buffer pages
The existing crc32_using_pclmul (SSE4.2 + PCLMULQDQ, 128-bit folding) is efficient for small buffers but becomes throughput-bound when processing large contiguous blocks, as it only utilizes one 128-bit lane per fold step.
AVX-512 VPCLMULQDQ extends this to four parallel 128-bit lanes (512-bit ZMM registers), enabling:
- 4× polynomial folding throughput for large buffers (≥256 bytes)
- Dual/quad-stream pipelining to hide latency via register-level parallelism
- Alignment-aware dispatch (VMOVDQA64 vs VMOVDQU64) for InnoDB's naturally aligned 16 KB pages
Technical Details
1. Runtime CPU Detection (can_use_avx512_vpclmul)
Follows the detection sequence documented by Intel and AMD for AVX-512:
- CPUID leaf 1, ECX: OSXSAVE[27] + AVX[28] (prerequisites for XGETBV)
- XCR0 state mask: validates that the OS saves and restores the ZMM registers (0xE6 = SSE|AVX|OPMASK|ZMM_Hi256|Hi16_ZMM)
- CPUID leaf 7, subleaf 0: EBX AVX512F[16] + AVX512DQ[17] + AVX512BW[30] + AVX512VL[31], and ECX VPCLMULQDQ[10]
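The three checks above can be sketched as follows. This is a minimal illustration for GCC/Clang on x86-64, not the patch's actual can_use_avx512_vpclmul. All function names here are hypothetical, and the bit tests are split into pure predicates only so that they can be exercised without AVX-512 hardware.

```cpp
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>  // GCC/Clang wrappers around the CPUID instruction
#define SKETCH_HAVE_CPUID 1
#else
#define SKETCH_HAVE_CPUID 0
#endif

// CPUID.1:ECX must report OSXSAVE (bit 27) and AVX (bit 28) before XGETBV is used.
inline bool leaf1_permits_xgetbv(uint32_t ecx) {
  const uint32_t need = (1u << 27) | (1u << 28);
  return (ecx & need) == need;
}

// XCR0 must have SSE, AVX, OPMASK, ZMM_Hi256 and Hi16_ZMM enabled (mask 0xE6).
inline bool xcr0_covers_zmm(uint64_t xcr0) {
  return (xcr0 & 0xE6u) == 0xE6u;
}

// CPUID.(7,0): EBX carries AVX512F/DQ/BW/VL, ECX carries VPCLMULQDQ.
inline bool leaf7_has_avx512_vpclmul(uint32_t ebx, uint32_t ecx) {
  const uint32_t ebx_need = (1u << 16) | (1u << 17) | (1u << 30) | (1u << 31);
  return (ebx & ebx_need) == ebx_need && (ecx & (1u << 10)) != 0;
}

// Hypothetical stand-in for the patch's detection routine.
inline bool can_use_avx512_vpclmul_sketch() {
#if SKETCH_HAVE_CPUID
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
  if (!leaf1_permits_xgetbv(ecx)) return false;

  // XGETBV with ECX = 0 reads XCR0; it is only legal once OSXSAVE is confirmed.
  uint32_t lo = 0, hi = 0;
  __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
  if (!xcr0_covers_zmm((static_cast<uint64_t>(hi) << 32) | lo)) return false;

  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
  return leaf7_has_avx512_vpclmul(ebx, ecx);
#else
  return false;  // non-x86 or unsupported compiler: never select the AVX-512 path
#endif
}
```

Checking OSXSAVE before executing XGETBV matters because XGETBV raises an invalid-opcode fault when the OS has not enabled XSAVE.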
2. Algorithm: 512-bit Polynomial Folding
Large buffer (≥256 bytes): 4-stream folding with b2048 constants
→ 2-stream folding with b1024
→ 1-stream 512-bit loop with b512
→ 512→128 bit reduction with b384
→ 128-bit folding loop with b128
→ 128→64 bit reduction with b64
→ Barrett reduction to 32-bit CRC
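The multi-stream stages work because the CRC register update is linear over GF(2): independent streams can be reduced separately and merged by multiplying the earlier stream by a power of x modulo the polynomial. The scalar sketch below illustrates only that principle. It is not the patch's code, the function names are hypothetical, and it performs the "shift" by feeding zero bytes where a VPCLMULQDQ implementation would use carry-less multiplication by precomputed constants.

```cpp
#include <cstddef>
#include <cstdint>

// One byte-wise CRC32-C register update (reflected polynomial 0x82F63B78),
// without the initial and final inversions.
inline uint32_t crc32c_raw(uint32_t state, const uint8_t* data, size_t len) {
  for (size_t i = 0; i < len; ++i) {
    state ^= data[i];
    for (int bit = 0; bit < 8; ++bit)
      state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
  }
  return state;
}

// Advance a register over n zero bytes, i.e. multiply it by x^(8n) mod P.
inline uint32_t crc32c_shift(uint32_t state, size_t n) {
  for (size_t i = 0; i < n; ++i)
    for (int bit = 0; bit < 8; ++bit)
      state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
  return state;
}

// Straightforward single-stream CRC32-C, used as the bit-exact reference.
inline uint32_t crc32c_reference(const uint8_t* data, size_t len) {
  return ~crc32c_raw(~0u, data, len);
}

// Two independent streams merged by linearity:
// raw(init, A || B) == shift(raw(init, A), |B|) XOR raw(0, B).
inline uint32_t crc32c_two_streams(const uint8_t* data, size_t len) {
  const size_t first = len / 2;
  const size_t second = len - first;
  const uint32_t a = crc32c_raw(~0u, data, first);         // carries the initial value
  const uint32_t b = crc32c_raw(0u, data + first, second); // starts from zero
  return ~(crc32c_shift(a, second) ^ b);
}
```

Comparing an accelerated path against a slow reference like crc32c_reference over many lengths and alignments is a simple way to check that a new implementation is bit-exact.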
Key optimizations:
- vpternlogd ternary logic: replaces 2× XOR + AND sequences with single-uop operations (e.g., AVX512_TL_XOR3, AVX512_TL_XOR2_AND)
- Template-based alignment dispatch: crc32c_avx512_impl<true/false> generates zero-overhead aligned/unaligned load paths at compile time
- _mm256_zeroupper(): avoids the AVX-SSE transition penalty after ZMM execution
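To make the vpternlogd point concrete, the sketch below derives the 8-bit truth-table immediate for a three-input boolean function and models the instruction on one 32-bit lane in scalar code. The helper names are hypothetical. The XOR3 immediate (0x96) follows directly from the instruction's definition, while reading a name like AVX512_TL_XOR2_AND as (a XOR b) AND c is only an assumption about the patch.

```cpp
#include <cstdint>

// Build the imm8 for VPTERNLOG: bit number ((a << 2) | (b << 1) | c) of the
// immediate holds the function's output for that combination of input bits.
template <typename F>
inline uint8_t ternlog_imm(F f) {
  uint8_t imm = 0;
  for (unsigned idx = 0; idx < 8; ++idx) {
    const bool a = (idx & 4u) != 0, b = (idx & 2u) != 0, c = (idx & 1u) != 0;
    if (f(a, b, c)) imm = static_cast<uint8_t>(imm | (1u << idx));
  }
  return imm;
}

// a XOR b XOR c: merges two carry-less products and a data word in one operation.
inline uint8_t imm_xor3() {
  return ternlog_imm([](bool a, bool b, bool c) { return (a != b) != c; });
}

// (a XOR b) AND c: assumed meaning of an XOR2_AND style constant.
inline uint8_t imm_xor2_and() {
  return ternlog_imm([](bool a, bool b, bool c) { return (a != b) && c; });
}

// Scalar model of VPTERNLOGD on a single 32-bit lane.
inline uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8) {
  uint32_t out = 0;
  for (int i = 0; i < 32; ++i) {
    const uint32_t idx =
        (((a >> i) & 1u) << 2) | (((b >> i) & 1u) << 1) | ((c >> i) & 1u);
    out |= ((static_cast<uint32_t>(imm8) >> idx) & 1u) << i;
  }
  return out;
}
```

With real intrinsics the same immediate would be passed to _mm512_ternarylogic_epi32 or _mm512_ternarylogic_epi64, which apply the table to every bit position of the 512-bit operands.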
3. Compiler Compatibility
| Compiler | Minimum Version | Target Attribute |
|---|---|---|
| GCC | 11.0 | pclmul,avx512f,avx512dq,avx512bw,avx512vl,vpclmulqdq |
| GCC | 15.0 | pclmul,avx10.1,vpclmulqdq (unified ISA) |
| Clang | 9.0 | pclmul,avx512f,avx512dq,avx512bw,avx512vl,vpclmulqdq |
MSVC does not support __attribute__((target(...))), so it is excluded from the AVX-512 path; supporting it would require a separate compilation unit built with /arch:AVX512 (not included in this patch).
4. Integration Points
- New flag: ut_crc32_avx512_enabled (exposed in ut0crc32.h)
- Priority: AVX-512 path > SSE4.2+PCLMUL > SSE4.2-only > software fallback
- Logging: startup message indicates active path for observability
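The priority order above can be sketched as a small selection function. The enum, the function names, and the three boolean inputs are hypothetical. The patch itself exposes ut_crc32_avx512_enabled and keeps the existing ut_crc32_func_t signature, and its exact startup message is not reproduced here.

```cpp
// Candidate implementations, listed from most to least preferred.
enum class Crc32Path { kAvx512Vpclmul, kSse42Pclmul, kSse42, kSoftware };

// Pick the fastest path the CPU and OS support; each step falls through to the
// next slower implementation, so missing features degrade gracefully.
inline Crc32Path select_crc32_path(bool avx512_vpclmul, bool pclmul, bool sse42) {
  if (avx512_vpclmul) return Crc32Path::kAvx512Vpclmul;
  if (sse42 && pclmul) return Crc32Path::kSse42Pclmul;
  if (sse42) return Crc32Path::kSse42;
  return Crc32Path::kSoftware;
}

// Human-readable label, e.g. for a startup log line (wording is illustrative).
inline const char* crc32_path_name(Crc32Path path) {
  switch (path) {
    case Crc32Path::kAvx512Vpclmul: return "AVX-512 + VPCLMULQDQ";
    case Crc32Path::kSse42Pclmul:   return "SSE4.2 + PCLMULQDQ";
    case Crc32Path::kSse42:         return "SSE4.2";
    case Crc32Path::kSoftware:      return "software";
  }
  return "unknown";
}
```

Resolving the path once at startup and storing a function pointer keeps the per-call cost of dispatch to a single indirect call.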
Performance Impact (Expected)
Based on standalone benchmarks and crc32c_x86.cc reference data:
| Buffer size | Software (GB/s) | SSE4.2 (GB/s) | AVX-512 (GB/s) | Speedup (AVX-512/SW) | Speedup (AVX-512/SSE4.2) |
|---|---|---|---|---|---|
| 64 B | 1.40 | 2.35 | 2.56 | 1.8× | 1.1× |
| 256 B | 1.76 | 4.84 | 8.45 | 4.8× | 1.7× |
| 512 B | 1.80 | 5.88 | 15.89 | 8.8× | 2.7× |
| 1 KB | 1.80 | 6.30 | 23.22 | 12.9× | 3.7× |
| 4 KB | 1.83 | 7.09 | 40.62 | 22.2× | 5.7× |
| 16 KB | 1.83 | 7.33 | 45.05 | 24.6× | 6.1× |
| 64 KB | 1.83 | 7.40 | 46.17 | 25.2× | 6.2× |
| 256 KB | 1.83 | 7.40 | 46.56 | 25.5× | 6.3× |
| 1 MB | 1.82 | 7.40 | 45.87 | 25.2× | 6.2× |
| 4 MB | 1.83 | 7.39 | 45.88 | 25.1× | 6.2× |
Note: The actual TXSQL TPC-C improvement depends on the fraction of CPU time spent computing checksums; the overall throughput gain at high concurrency is typically 2–5%.
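The relationship between the checksum speedup and the end-to-end gain follows Amdahl's law. The sketch below is illustrative only: the function name is hypothetical, and the checksum CPU fractions used in the usage note (3% and 5%) are assumptions for illustration, not measurements from this patch.

```cpp
// Amdahl's law: if a fraction f of CPU time is spent in checksumming and that
// part becomes s times faster, the whole workload speeds up by
// 1 / ((1 - f) + f / s).
inline double overall_speedup(double checksum_fraction, double checksum_speedup) {
  return 1.0 / ((1.0 - checksum_fraction) + checksum_fraction / checksum_speedup);
}
```

With the 6.1× speedup from the 16 KB row and an assumed checksum share of 3% to 5% of CPU time, overall_speedup gives roughly 1.026 to 1.044, i.e. a gain of about 2.6% to 4.4%, which falls inside the 2–5% range quoted in the note.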
Compatibility & Risk Assessment
- x86-64 only: ARM64 path unchanged; no impact on ARM deployments
- Graceful degradation: if AVX-512 is unavailable, falls back to existing PCLMUL or SSE4.2 paths
- OS requirement: Linux kernel ≥ 4.0 (for proper XSAVE state management), which any modern distribution satisfies
- No ABI changes: ut_crc32_func_t signature unchanged
Patch Scope
storage/innobase/include/ut0crc32.h | 8 ++++++++
storage/innobase/ut/crc32.cc | 647 +++++++++++++++++++++++++-
2 files changed, 653 insertions(+), 2 deletions(-)
Checklist
- […] crc32_using_pclmul output (bit-exact)
- […] /arch:AVX512 separate compilation unit (future)
0001-.add-avx512_crc32-for-TXSQL8.0.30.patch
References
- Intel: "Fast CRC Computation Using PCLMULQDQ Instruction" (white paper)
- MariaDB: storage/innobase/ut/crc32c_x86.cc (AVX-512 reference implementation)
- Linux kernel: arch/x86/kernel/fpu/xstate.c (XSAVE/AVX-512 state management)