GH-3530: Optimize BYTE_STREAM_SPLIT encoding/decoding by iemejia · Pull Request #3569 · apache/parquet-java

iemejia · 2026-05-17T22:38:55Z

Part of #3530 — Apache Parquet Java Performance Improvements

Summary

Optimize scalar encode/decode for the BYTE_STREAM_SPLIT encoding.

Reader: Specialized transpose loops for element sizes 2/4/8/12/16 bytes plus generic fallback. Bulk array access when backing array is available.

Writer: Batched scatter buffers (int[]/long[] batches of 64) replacing per-value scatterBytes() which allocated temp byte[] and issued N single-byte writes.

Includes unit tests for transpose specializations, batch-boundary crossing, getBufferedSize with partial batches, direct ByteBuffer decode paths, and close/reset with pending unflushed batches.

JMH benchmarks: BssEncodingBenchmark, BssDecodingBenchmark covering FLOAT, DOUBLE, INT32, INT64, and FIXED_LEN_BYTE_ARRAY.

Benchmark results

Environment: JDK 25.0.3 (Temurin), OpenJDK 64-Bit Server VM, JMH 1.37, Linux x86_64, -wi 3 -i 5 -f 2.

Decoding:

Benchmark	Baseline (M ops/s)	Optimized (M ops/s)	Speedup
decodeInt	203	956	4.7x
decodeFloat	263	965	3.7x
decodeFloatDirect	263	693	2.6x
decodeDouble	132	341	2.6x
decodeLong	133	323	2.4x
decodeLongDirect	149	282	1.9x
decodeFlba(2)	286	449	1.6x
decodeFlba(7)	125	169	1.4x
decodeFlba(12)	95	176	1.9x
decodeFlba(16)	78	132	1.7x

Encoding:

Benchmark	Baseline (M ops/s)	Optimized (M ops/s)	Speedup
encodeLong	52	342	6.6x
encodeDouble	53	305	5.8x
encodeFloat	101	452	4.5x
encodeInt	99	439	4.4x
encodeFlba(12)	41	97	2.4x
encodeFlba(16)	32	70	2.2x
encodeFlba(7)	69	146	2.1x
encodeFlba(2)	192	276	1.4x

Every benchmark shows clear improvement with no regressions. 8-byte types benefit most from the batched scatter (6.6x) since the baseline scattered 8 bytes per value into 8 separate streams. The new Direct benchmarks validate the direct-buffer decode path.

Reader: replace generic ByteBuffer.get() transpose loop in decodeData() with specialized single-pass loops for element sizes 2/4/8/12/16 bytes plus a stream-oriented generic fallback. Bulk-access the backing array directly when available, falling back to a single bulk copy for direct buffers. Writer: replace per-value scatterBytes() (which allocates a temp byte[] and issues N single-byte stream writes) with batched scatter buffers. Int/Long values accumulate in int[]/long[] batches of 64 and flush as bulk write(byte[], off, len) calls -- one per stream. FLBA uses per-stream byte[][] scratch buffers with the same batching strategy. getBufferedSize() now accounts for unflushed batch values. Add JMH benchmarks for scalar encode/decode of all 5 BSS types (FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY). Add TestDataFactory for deterministic FLBA benchmark data generation. Add unit tests for transpose specializations, batch-boundary crossing, getBufferedSize with partial batches, direct ByteBuffer decode paths, and close/reset with pending unflushed batches.

Fokko · 2026-05-22T21:51:05Z

-      for (int stream = 0; stream < elementSizeInBytes; ++stream, ++destByteIndex) {
-        decoded[destByteIndex] = encoded.get(srcValueIndex + stream * valuesCount);
+    int totalBytes = valuesCount * elementSizeInBytes;
+    assert encoded.remaining() >= totalBytes;


Suggested change

assert encoded.remaining() >= totalBytes;

assert encoded.remaining() == totalBytes;

Done in 689208f.

Fokko · 2026-05-22T21:51:54Z

  protected final int numStreams;
  protected final int elementSizeInBytes;
-  private final CapacityByteArrayOutputStream[] byteStreams;
+  protected final CapacityByteArrayOutputStream[] byteStreams;


Why not keep this one private?

Suggested change

protected final CapacityByteArrayOutputStream[] byteStreams;

private final CapacityByteArrayOutputStream[] byteStreams;

Can't make it private: FixedLenByteArrayByteStreamSplitValuesWriter is a static nested class that extends the parent, and Java's name resolution can't reach private parent members from that context (private methods/fields aren't inherited, and static nested classes have no enclosing-instance reference to use for cross-class access). Additionally, japicmp flags the visibility reduction as an API-breaking change since these fields are protected in the released version. Left as-is.

Fokko · 2026-05-22T21:53:28Z

+      if (flbaBatchCount == 0) return;
+      final int count = flbaBatchCount;
+      for (int stream = 0; stream < length; stream++) {
+        byteStreams[stream].write(batchBufs[stream], 0, count);


I see, probaly we want to make a method for this one:

writeToStream(stream, batchBufs[stream], 0, count);

Done in 689208f. Extracted a package-private writeToStream(int stream, byte[] buf, int off, int len) helper. It's package-private (not private) because the FixedLenByteArrayByteStreamSplitValuesWriter static nested subclass also needs to call it — private methods aren't inherited and can't be resolved from a static nested class that extends the enclosing class. All three flush methods (flushIntBatch, flushLongBatch, flushFlbaBatch) now use it consistently.

Fokko · 2026-05-22T21:54:04Z

+      flushIntBatch();
+    } else if (longBatch != null) {
+      flushLongBatch();
+    }


Should we add an else case that throws?

Done in 689208f. Added an else-throw with ParquetEncodingException for fail-fast if batchCount > 0 but no batch buffer is initialized.

- Use exact assert (== instead of >=) for encoded.remaining() check - Extract package-private writeToStream() helper used consistently by all three flush methods (flushIntBatch, flushLongBatch, flushFlbaBatch) - Add else-throw in flushBatch() for defensive fail-fast when batchCount > 0 but no batch buffer is initialized Assisted-by: GitHub Copilot:claude-opus-4.6

iemejia · 2026-06-12T10:18:27Z

@Fokko All review comments addressed in 689208f. Benchmark results updated in the PR description (re-run on JDK 25.0.3, same machine). Ready for another round of review or merge.

iemejia force-pushed the parquet-perf-v2-par5-bss branch from f7fdee5 to 84de9c6 Compare May 17, 2026 23:04

Fokko reviewed May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-3530: Optimize BYTE_STREAM_SPLIT encoding/decoding#3569

GH-3530: Optimize BYTE_STREAM_SPLIT encoding/decoding#3569
iemejia wants to merge 2 commits into
apache:masterfrom
iemejia:parquet-perf-v2-par5-bss

iemejia commented May 17, 2026 •

edited

Loading

Uh oh!

Fokko May 22, 2026

Uh oh!

iemejia Jun 12, 2026

Uh oh!

Fokko May 22, 2026

Uh oh!

iemejia Jun 12, 2026

Uh oh!

Fokko May 22, 2026

Uh oh!

iemejia Jun 12, 2026

Uh oh!

Fokko May 22, 2026

Uh oh!

iemejia Jun 12, 2026

Uh oh!

iemejia commented Jun 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	assert encoded.remaining() >= totalBytes;
	assert encoded.remaining() == totalBytes;

	protected final CapacityByteArrayOutputStream[] byteStreams;
	private final CapacityByteArrayOutputStream[] byteStreams;

Conversation

iemejia commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Benchmark results

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

iemejia commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

iemejia commented May 17, 2026 •

edited

Loading

iemejia commented Jun 12, 2026 •

edited

Loading