Skip to content

[optimization] Add AVX-512 VPCLMULQDQ accelerated CRC32-C for TXSQL 8.0.30 #112

Description

@liuxingang-hygon

Summary

This patch introduces an AVX-512 + VPCLMULQDQ accelerated CRC32-C implementation for InnoDB's page checksum and redo log verification, significantly improving throughput on modern x86-64 CPUs like the Hygon C86-4G.

Background & Motivation

In high-concurrency OLTP workloads (e.g., BenchmarkSQL TPC-C), InnoDB spends a non-trivial amount of CPU time computing CRC32-C checksums for:

  • 16 KB buffer pool pages
  • Variable-length redo log blocks
  • Doublewrite buffer pages

The existing crc32_using_pclmul (SSE4.2 + PCLMULQDQ, 128-bit folding) is efficient for small buffers but becomes throughput-bound when processing large contiguous blocks, as it only utilizes one 128-bit lane per fold step.

AVX-512 VPCLMULQDQ extends this to four parallel 128-bit lanes (512-bit ZMM registers), enabling:

  • 4× polynomial folding throughput for large buffers (≥256 bytes)
  • Dual/quad-stream pipelining to hide latency via register-level parallelism
  • Alignment-aware dispatch (VMOVDQA64 vs VMOVDQU64) for InnoDB's naturally aligned 16 KB pages

Technical Details

1. Runtime CPU Detection (can_use_avx512_vpclmul)

Implements the full Intel/AMD specification for AVX-512 feature detection:

  • CPUID leaf 1: OSXSAVE[27] + AVX[28] (prerequisites for XGETBV)
  • XCR0 state mask: validates OS-managed ZMM register save/restore (0xE6 = SSE|AVX|OPMASK|ZMM_Hi256|Hi16_ZMM)
  • CPUID leaf 7: AVX512F[16] + AVX512DQ[17] + AVX512BW[30] + AVX512VL[31] + VPCLMULQDQ[10]

2. Algorithm: 512-bit Polynomial Folding

Large buffer (≥256 bytes): 4-stream folding with b2048 constants
→ 2-stream folding with b1024
→ 1-stream 512-bit loop with b512
→ 512→128 bit reduction with b384
→ 128-bit folding loop with b128
→ 128→64 bit reduction with b64
→ Barrett reduction to 32-bit CRC

Key optimizations:

  • vpternlogd ternary logic: replaces 2× XOR + AND sequences with single-uop operations (e.g., AVX512_TL_XOR3, AVX512_TL_XOR2_AND)
  • Template-based alignment dispatch: crc32c_avx512_impl<true/false> generates zero-overhead aligned/unaligned load paths at compile time
  • _mm256_zeroupper(): eliminates AVX-SSE transition penalty after ZMM execution

3. Compiler Compatibility

Compiler Minimum Version Target Attribute
GCC 11.0 pclmul,avx512f,avx512dq,avx512bw,avx512vl,vpclmulqdq
GCC 15.0 pclmul,avx10.1,vpclmulqdq (unified ISA)
Clang 9.0 pclmul,avx512f,avx512dq,avx512bw,avx512vl,vpclmulqdq

MSVC is excluded from __attribute__((target(...))) and would require a separate /arch:AVX512 compilation unit (not included in this patch).

4. Integration Points

  • New flag: ut_crc32_avx512_enabled (exposed in ut0crc32.h)
  • Priority: AVX-512 path > SSE4.2+PCLMUL > SSE4.2-only > software fallback
  • Logging: startup message indicates active path for observability

Performance Impact (Expected)

Based on standalone benchmarks and crc32c_x86.cc reference data:

<style> </style>
缓冲区大小 Software (GB/s) SSE4.2  (GB/s) AVX-512 (GB/s) 加速比 (AVX/SW) 加速比 (AVX/SSE)
64 B 1.40 2.35 2.56 1.8× 1.1×
256 B 1.76 4.84 8.45 4.8× 1.7×
512 B 1.80 5.88 15.89 8.8× 2.7×
1 KB 1.80 6.30 23.22 12.9× 3.7×
4 KB 1.83 7.09 40.62 22.2× 5.7×
16 KB 1.83 7.33 45.05 24.6× 6.1×
64 KB 1.83 7.40 46.17 25.2× 6.2×
256 KB 1.83 7.40 46.56 25.5× 6.3×
1 MB 1.82 7.40 45.87 25.2× 6.2×
4 MB 1.83 7.39 45.88 25.1× 6.2×

Note: Actual TXSQL TPC-C improvement depends on checksum CPU fraction; typically 2–5% overall throughput gain at high concurrency.

Compatibility & Risk Assessment

  • x86-64 only: ARM64 path unchanged; no impact on ARM deployments
  • Graceful degradation: if AVX-512 is unavailable, falls back to existing PCLMUL or SSE4.2 paths
  • OS requirement: Linux kernel ≥ 4.0 (for proper XSAVE state management) or any modern distribution
  • No ABI changes: ut_crc32_func_t signature unchanged

Patch Scope

storage/innobase/include/ut0crc32.h | 8 ++++++++
storage/innobase/ut/crc32.cc | 647 +++++++++++++++++++++++++-
2 files changed, 653 insertions(+), 2 deletions(-)

Checklist

  • Implementation compiles with GCC 11+ / Clang 9+
  • Runtime CPU detection covers all AVX-512 sub-features
  • Alignment-aware dispatch for InnoDB page boundaries
  • VZEROUPPER penalty mitigation
  • BenchmarkSQL TPC-C validation on Intel Sapphire Rapids / Hygon C86-4G
  • Cross-validation against existing crc32_using_pclmul output (bit-exact)
  • Review for MSVC /arch:AVX512 separate compilation unit (future)

0001-.add-avx512_crc32-for-TXSQL8.0.30.patch

References

  • Intel: "Fast CRC Computation Using PCLMULQDQ Instruction" (white paper)
  • MariaDB: storage/innobase/ut/crc32c_x86.cc (AVX-512 reference implementation)
  • Linux kernel: arch/x86/kernel/fpu/xstate.c (XSAVE/AVX-512 state management)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions