GH-3530: Optimize BYTE_STREAM_SPLIT encoding/decoding #3569

Open
iemejia wants to merge 2 commits into apache:master from iemejia:parquet-perf-v2-par5-bss

@iemejia iemejia commented May 17, 2026

Member

Part of #3530 — Apache Parquet Java Performance Improvements

Summary

Optimize scalar encode/decode for the BYTE_STREAM_SPLIT encoding.

Reader: Specialized transpose loops for element sizes 2/4/8/12/16 bytes plus generic fallback. Bulk array access when backing array is available.
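
For illustration, the 4-byte case of such a specialized transpose might look like the sketch below. It is not the PR's code: the names `BssTransposeSketch` and `decode4` are made up here, and it works on a plain `byte[]` rather than Parquet's reader classes. It only shows the idea of reading one byte from each of the four streams per value in a single pass.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not the PR's implementation.
final class BssTransposeSketch {

    /**
     * Transposes BYTE_STREAM_SPLIT data for 4-byte values.
     * Input layout: stream 0 (byte 0 of every value), then streams 1, 2, 3.
     * Output layout: the 4 bytes of value 0, then of value 1, and so on.
     */
    static byte[] decode4(byte[] encoded, int valuesCount) {
        int totalBytes = valuesCount * 4;
        if (encoded.length != totalBytes) {
            throw new IllegalArgumentException(
                "expected " + totalBytes + " bytes, got " + encoded.length);
        }
        byte[] decoded = new byte[totalBytes];
        // Start offsets of streams 1..3; stream 0 starts at offset 0.
        int s1 = valuesCount;
        int s2 = 2 * valuesCount;
        int s3 = 3 * valuesCount;
        for (int i = 0, dest = 0; i < valuesCount; i++, dest += 4) {
            decoded[dest] = encoded[i];
            decoded[dest + 1] = encoded[s1 + i];
            decoded[dest + 2] = encoded[s2 + i];
            decoded[dest + 3] = encoded[s3 + i];
        }
        return decoded;
    }

    public static void main(String[] args) {
        // Two little-endian ints, 0x04030201 and 0x08070605, split into 4 streams.
        byte[] encoded = {1, 5, 2, 6, 3, 7, 4, 8};
        ByteBuffer values = ByteBuffer.wrap(decode4(encoded, 2)).order(ByteOrder.LITTLE_ENDIAN);
        System.out.println(Integer.toHexString(values.getInt(0))); // 4030201
        System.out.println(Integer.toHexString(values.getInt(4))); // 8070605
    }
}
```

Hoisting the stream offsets out of the loop and unrolling the per-stream copy is what distinguishes this from a generic loop over `elementSizeInBytes`.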

Writer: Batched scatter buffers (int[]/long[] batches of 64) replacing per-value scatterBytes() which allocated temp byte[] and issued N single-byte writes.
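
A minimal sketch of the batching idea for INT32, under stated assumptions: `BatchedIntScatterSketch` and its members are hypothetical names, `java.io.ByteArrayOutputStream` stands in for Parquet's `CapacityByteArrayOutputStream`, and only the batch size of 64 is taken from the description above.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not the PR's implementation.
final class BatchedIntScatterSketch {
    static final int BATCH_SIZE = 64; // batch size named in the PR summary
    static final int NUM_STREAMS = 4; // one stream per byte of an int

    private final ByteArrayOutputStream[] byteStreams = new ByteArrayOutputStream[NUM_STREAMS];
    private final int[] intBatch = new int[BATCH_SIZE];
    private final byte[] scratch = new byte[BATCH_SIZE];
    private int batchCount;

    BatchedIntScatterSketch() {
        for (int s = 0; s < NUM_STREAMS; s++) {
            byteStreams[s] = new ByteArrayOutputStream();
        }
    }

    /** Buffers the value; no per-value allocation and no stream write. */
    void writeInteger(int value) {
        intBatch[batchCount++] = value;
        if (batchCount == BATCH_SIZE) {
            flushIntBatch();
        }
    }

    /** Scatters the pending values with one bulk write per stream. */
    void flushIntBatch() {
        if (batchCount == 0) {
            return;
        }
        for (int s = 0; s < NUM_STREAMS; s++) {
            int shift = 8 * s; // stream s holds byte s (little-endian) of each value
            for (int i = 0; i < batchCount; i++) {
                scratch[i] = (byte) (intBatch[i] >>> shift);
            }
            byteStreams[s].write(scratch, 0, batchCount);
        }
        batchCount = 0;
    }

    /** Flushes any pending batch and concatenates the streams in order. */
    byte[] toByteArray() {
        flushIntBatch();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (ByteArrayOutputStream stream : byteStreams) {
            byte[] bytes = stream.toByteArray();
            out.write(bytes, 0, bytes.length);
        }
        return out.toByteArray();
    }
}
```

With this shape, 64 values cost 4 stream writes instead of 256 single-byte writes.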

Includes unit tests for transpose specializations, batch-boundary crossing, getBufferedSize with partial batches, direct ByteBuffer decode paths, and close/reset with pending unflushed batches.

JMH benchmarks: BssEncodingBenchmark, BssDecodingBenchmark covering FLOAT, DOUBLE, INT32, INT64, and FIXED_LEN_BYTE_ARRAY.

Benchmark results

Environment: JDK 25.0.3 (Temurin), OpenJDK 64-Bit Server VM, JMH 1.37, Linux x86_64, -wi 3 -i 5 -f 2.

Decoding:

| Benchmark | Baseline (M ops/s) | Optimized (M ops/s) | Speedup |
|---|---|---|---|
| decodeInt | 203 | 956 | 4.7x |
| decodeFloat | 263 | 965 | 3.7x |
| decodeFloatDirect | 263 | 693 | 2.6x |
| decodeDouble | 132 | 341 | 2.6x |
| decodeLong | 133 | 323 | 2.4x |
| decodeLongDirect | 149 | 282 | 1.9x |
| decodeFlba(2) | 286 | 449 | 1.6x |
| decodeFlba(7) | 125 | 169 | 1.4x |
| decodeFlba(12) | 95 | 176 | 1.9x |
| decodeFlba(16) | 78 | 132 | 1.7x |

Encoding:

| Benchmark | Baseline (M ops/s) | Optimized (M ops/s) | Speedup |
|---|---|---|---|
| encodeLong | 52 | 342 | 6.6x |
| encodeDouble | 53 | 305 | 5.8x |
| encodeFloat | 101 | 452 | 4.5x |
| encodeInt | 99 | 439 | 4.4x |
| encodeFlba(12) | 41 | 97 | 2.4x |
| encodeFlba(16) | 32 | 70 | 2.2x |
| encodeFlba(7) | 69 | 146 | 2.1x |
| encodeFlba(2) | 192 | 276 | 1.4x |

Every benchmark improves, with no regressions. The 8-byte types gain the most from the batched scatter (up to 6.6x for encodeLong), since the baseline scattered each value's 8 bytes into 8 separate streams one byte at a time. The new Direct benchmarks (decodeFloatDirect, decodeLongDirect) exercise the direct-buffer decode path.

Reader: replace generic ByteBuffer.get() transpose loop in decodeData()
with specialized single-pass loops for element sizes 2/4/8/12/16 bytes
plus a stream-oriented generic fallback. Bulk-access the backing array
directly when available, falling back to a single bulk copy for direct
buffers.
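
The heap-versus-direct distinction described above could be sketched as follows. This is an illustration only: `BulkAccessSketch` and `remainingBytes` are hypothetical names, and a real reader might index into the backing array with an offset instead of copying a sub-range as this sketch does.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch, not the PR's implementation.
final class BulkAccessSketch {

    /** Returns the buffer's remaining bytes as an array without moving its position. */
    static byte[] remainingBytes(ByteBuffer encoded) {
        int length = encoded.remaining();
        if (encoded.hasArray()) {
            // Heap buffer with an accessible backing array: read it directly.
            byte[] backing = encoded.array();
            int offset = encoded.arrayOffset() + encoded.position();
            if (offset == 0 && backing.length == length) {
                return backing; // exact fit, no copy at all
            }
            return Arrays.copyOfRange(backing, offset, offset + length);
        }
        // Direct (or read-only) buffer: one bulk copy instead of per-byte get() calls.
        byte[] copy = new byte[length];
        encoded.duplicate().get(copy);
        return copy;
    }
}
```

Note that `hasArray()` returns false for read-only heap buffers, so those take the bulk-copy branch as well.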

Writer: replace per-value scatterBytes() (which allocates a temp byte[]
and issues N single-byte stream writes) with batched scatter buffers.
Int/Long values accumulate in int[]/long[] batches of 64 and flush as
bulk write(byte[], off, len) calls -- one per stream. FLBA uses
per-stream byte[][] scratch buffers with the same batching strategy.
getBufferedSize() now accounts for unflushed batch values.
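
As a rough sketch of the FLBA strategy and the size accounting, assuming hypothetical names throughout (`FlbaBatchSketch`, `write`, `flush`) and `java.io.ByteArrayOutputStream` in place of Parquet's stream class:

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not the PR's implementation.
final class FlbaBatchSketch {
    static final int BATCH_SIZE = 64;

    private final int length; // fixed value length in bytes = number of streams
    private final ByteArrayOutputStream[] byteStreams;
    private final byte[][] batchBufs; // batchBufs[stream][slot in batch]
    private int batchCount;

    FlbaBatchSketch(int length) {
        this.length = length;
        this.byteStreams = new ByteArrayOutputStream[length];
        this.batchBufs = new byte[length][BATCH_SIZE];
        for (int s = 0; s < length; s++) {
            byteStreams[s] = new ByteArrayOutputStream();
        }
    }

    void write(byte[] value) {
        if (value.length != length) {
            throw new IllegalArgumentException("expected " + length + " bytes");
        }
        // Scatter into the per-stream scratch buffers; no stream write yet.
        for (int s = 0; s < length; s++) {
            batchBufs[s][batchCount] = value[s];
        }
        if (++batchCount == BATCH_SIZE) {
            flush();
        }
    }

    void flush() {
        if (batchCount == 0) {
            return;
        }
        for (int s = 0; s < length; s++) {
            byteStreams[s].write(batchBufs[s], 0, batchCount);
        }
        batchCount = 0;
    }

    /** Bytes already in the streams plus bytes still waiting in the batch. */
    long getBufferedSize() {
        long flushed = 0;
        for (ByteArrayOutputStream stream : byteStreams) {
            flushed += stream.size();
        }
        return flushed + (long) batchCount * length;
    }
}
```

Without the `batchCount * length` term, the reported size would lag behind by up to 63 values whenever a batch is partially filled.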

Add JMH benchmarks for scalar encode/decode of all 5 BSS types (FLOAT,
DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY). Add TestDataFactory for
deterministic FLBA benchmark data generation. Add unit tests for
transpose specializations, batch-boundary crossing, getBufferedSize
with partial batches, direct ByteBuffer decode paths, and close/reset
with pending unflushed batches.
@iemejia iemejia force-pushed the parquet-perf-v2-par5-bss branch from f7fdee5 to 84de9c6 Compare May 17, 2026 23:04
- for (int stream = 0; stream < elementSizeInBytes; ++stream, ++destByteIndex) {
- decoded[destByteIndex] = encoded.get(srcValueIndex + stream * valuesCount);
+ int totalBytes = valuesCount * elementSizeInBytes;
+ assert encoded.remaining() >= totalBytes;

Contributor

Suggested change
- assert encoded.remaining() >= totalBytes;
+ assert encoded.remaining() == totalBytes;

Member Author

Done in 689208f.

  protected final int numStreams;
  protected final int elementSizeInBytes;
- private final CapacityByteArrayOutputStream[] byteStreams;
+ protected final CapacityByteArrayOutputStream[] byteStreams;

Contributor

Why not keep this one private?

Suggested change
- protected final CapacityByteArrayOutputStream[] byteStreams;
+ private final CapacityByteArrayOutputStream[] byteStreams;

Member Author

Can't make it private: FixedLenByteArrayByteStreamSplitValuesWriter is a static nested class that extends the parent, and Java's name resolution can't reach private parent members from that context (private methods/fields aren't inherited, and static nested classes have no enclosing-instance reference to use for cross-class access). Additionally, japicmp flags the visibility reduction as an API-breaking change since these fields are protected in the released version. Left as-is.
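
The simple-name lookup failure described in this reply can be reproduced in isolation. The sketch below uses hypothetical stand-in classes (`ParentWriterSketch`, `NestedWriterSketch`) rather than the real writer. It also shows that access qualified with `super` does compile, since the nested class sits inside the class that declares the private field. Whether that would suit this code is not something the sketch settles, and it does not touch the japicmp concern.

```java
// Hypothetical stand-ins for the writer class and its static nested subclass.
class ParentWriterSketch {
    private final int[] byteStreams = {1, 2, 3};

    static class NestedWriterSketch extends ParentWriterSketch {
        int countStreams() {
            // return byteStreams.length;
            //   is rejected by javac: the private field is not inherited, so the
            //   simple name resolves to the enclosing class's instance field, and
            //   a static nested class has no enclosing instance to read it from.
            return super.byteStreams.length; // qualified access compiles
        }
    }

    public static void main(String[] args) {
        System.out.println(new NestedWriterSketch().countStreams()); // 3
    }
}
```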

if (flbaBatchCount == 0) return;
final int count = flbaBatchCount;
for (int stream = 0; stream < length; stream++) {
byteStreams[stream].write(batchBufs[stream], 0, count);

Contributor

I see, probably we want to make a method for this one:

writeToStream(stream, batchBufs[stream], 0, count);

Member Author

Done in 689208f. Extracted a package-private writeToStream(int stream, byte[] buf, int off, int len) helper. It's package-private (not private) because the FixedLenByteArrayByteStreamSplitValuesWriter static nested subclass also needs to call it — private methods aren't inherited and can't be resolved from a static nested class that extends the enclosing class. All three flush methods (flushIntBatch, flushLongBatch, flushFlbaBatch) now use it consistently.

flushIntBatch();
} else if (longBatch != null) {
flushLongBatch();
}

Contributor

Should we add an else case that throws?

Member Author

Done in 689208f. Added an else-throw with ParquetEncodingException for fail-fast if batchCount > 0 but no batch buffer is initialized.
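
A hedged sketch of the resulting dispatch shape, for readers who have not opened the commit. `FlushDispatchSketch` is a made-up class, the flush methods are placeholders that only count values, and `IllegalStateException` stands in for `ParquetEncodingException` so the example has no Parquet dependency.

```java
// Hypothetical sketch, not the PR's implementation.
final class FlushDispatchSketch {
    int[] intBatch;    // non-null only for 4-byte writers
    long[] longBatch;  // non-null only for 8-byte writers
    int batchCount;    // values buffered but not yet flushed
    int flushedValues; // placeholder for "bytes written to the streams"

    void flushBatch() {
        if (batchCount == 0) {
            return; // nothing pending
        }
        if (intBatch != null) {
            flushIntBatch();
        } else if (longBatch != null) {
            flushLongBatch();
        } else {
            // Pending values with no buffer means corrupted state: fail fast
            // rather than silently dropping them.
            throw new IllegalStateException(
                "batchCount=" + batchCount + " but no batch buffer is initialized");
        }
    }

    private void flushIntBatch() {
        flushedValues += batchCount;
        batchCount = 0;
    }

    private void flushLongBatch() {
        flushedValues += batchCount;
        batchCount = 0;
    }
}
```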

- Use exact assert (== instead of >=) for encoded.remaining() check
- Extract package-private writeToStream() helper used consistently by
  all three flush methods (flushIntBatch, flushLongBatch, flushFlbaBatch)
- Add else-throw in flushBatch() for defensive fail-fast when batchCount
  > 0 but no batch buffer is initialized

Assisted-by: GitHub Copilot:claude-opus-4.6

iemejia commented Jun 12, 2026

Member Author

@Fokko All review comments addressed in 689208f. Benchmark results updated in the PR description (re-run on JDK 25.0.3, same machine). Ready for another round of review or merge.
