HIVE-29647: Parallelize Parquet split generation directory listing on blob storage by deniskuzZ · Pull Request #6526 · apache/hive

deniskuzZ · 2026-06-04T15:42:16Z

What changes were proposed in this pull request?

Override MapredParquetInputFormat.listStatus to list the input directories in parallel when there is more than one recursive input dir on a blob filesystem; all other cases defer to the default listing.

Why are the changes needed?

Parquet split generation lists each input directory (typically one per partition) serially, which dominates planning time on object stores where every listing is a network round trip.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Cluster (10 workers), TPC-DS at 1 TB scale, external Parquet tables
query2: 150 sec (before) / 70 sec (after) # without the semijoins (see #6525)

Copilot

Pull request overview

This PR (HIVE-29647) speeds up Parquet split generation on object/blob stores by overriding MapredParquetInputFormat.listStatus to list multiple recursive input directories concurrently, falling back to the default FileInputFormat listing for other scenarios.

Changes:

Added a listStatus(JobConf) override that parallelizes recursive directory listing when multiple input dirs are present on blob storage.
Introduced a dedicated worker thread pool and completion-based result collection for per-directory listings.
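
A minimal, self-contained sketch of the two changes listed above: one listing task per input directory on a worker pool, with results collected through a completion service. It is an illustration only, not the PR's code. `ParallelListingSketch`, `listAll`, and `listRecursively` are hypothetical names, and `java.nio.file` on the local filesystem stands in for Hadoop's `FileSystem`, so it shows the concurrency pattern rather than blob-store behavior.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration: local-filesystem stand-in for per-directory parallel listing. */
final class ParallelListingSketch {

  /** Recursively lists regular files under each directory, one task per directory. */
  static List<Path> listAll(List<Path> dirs, int numThreads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    CompletionService<List<Path>> completionService = new ExecutorCompletionService<>(pool);
    try {
      for (Path dir : dirs) {
        // The single-argument submit takes a Callable, so the task may throw IOException.
        completionService.submit(() -> listRecursively(dir));
      }
      List<Path> result = new ArrayList<>();
      // take() hands back futures in completion order, not submission order.
      for (int i = 0; i < dirs.size(); i++) {
        result.addAll(completionService.take().get());
      }
      return result;
    } finally {
      // Stop the workers whether listing succeeded or failed.
      pool.shutdownNow();
    }
  }

  private static List<Path> listRecursively(Path dir) throws IOException {
    try (Stream<Path> walk = Files.walk(dir)) {
      return walk.filter(Files::isRegularFile).collect(Collectors.toList());
    }
  }
}
```

Because results arrive in completion order, the combined list is not guaranteed to match the order a serial listing would produce; a caller that depends on ordering would need to sort or index the results.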

+    Path[] dirs = getInputPaths(job);
+    // Only the recursive case (the Tez default) takes the parallel path; non-recursive listing has
+    // subtler sub-directory semantics, so defer to the default.
+    if (dirs.length <= 1
+        || !job.getBoolean(FileInputFormat.INPUT_DIR_RECURSIVE, false)
+        || !BlobStorageUtils.isBlobStorageFileSystem(job, dirs[0].getFileSystem(job))) {
+      return super.listStatus(job);
+    }
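
Read in the positive, the guard in this hunk takes the parallel path only when all three conditions hold at once. A small sketch of that predicate follows; `ListingPathChooser` and `useParallelListing` are hypothetical names, and plain values stand in for the `JobConf` and filesystem checks.

```java
/** Hypothetical illustration of the guard, rewritten as a positive condition. */
final class ListingPathChooser {

  /**
   * Returns true only when there is more than one input directory, recursive
   * listing is enabled, and the filesystem is a blob store. In every other
   * case the caller would fall back to the default listing.
   */
  static boolean useParallelListing(int numInputDirs, boolean recursive, boolean blobStorage) {
    return numInputDirs > 1 && recursive && blobStorage;
  }
}
```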


+    int numThreads = Math.max(2, HiveConf.getIntVar(job, HiveConf.ConfVars.HIVE_COMPUTE_SPLITS_NUM_THREADS));
+    ExecutorService pool = newWorkerPool(numThreads);
+    CompletionService<List<FileStatus>> completionService = new ExecutorCompletionService<>(pool);
+


deniskuzZ · 2026-06-04T16:20:19Z

+              FileSystem dirFs = dir.getFileSystem(job);
+              List<FileStatus> dirFiles = new ArrayList<>();
+              FileUtils.listStatusRecursively(dirFs, new FileStatus(0, true, 0, 0, 0, dir), dirFiles);


We are trying to save a HEAD request per partition, and we expect every input to be a directory.

+    } catch (ExecutionException e) {
+      Throwable cause = e.getCause();
+      if (cause instanceof IOException) {
+        throw (IOException) cause;
+      }
+      throw new IOException("Failed to list input directories", cause);
+    } finally {
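
The unwrapping in this hunk matters because `Future.get()` wraps whatever a worker throws in an `ExecutionException`. A standalone sketch of the same idea, under the assumption that callers only want to see `IOException`; `UnwrapSketch`, `toIOException`, and `runAndUnwrap` are hypothetical names, not part of the patch.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical illustration of surfacing a worker's failure as an IOException. */
final class UnwrapSketch {

  /** Returns the original IOException if there is one, otherwise wraps the cause. */
  static IOException toIOException(ExecutionException e) {
    Throwable cause = e.getCause();
    if (cause instanceof IOException) {
      return (IOException) cause;
    }
    return new IOException("Failed to list input directories", cause);
  }

  /** Runs the task on a worker thread and rethrows its failure as an IOException. */
  static <T> T runAndUnwrap(Callable<T> task) throws IOException, InterruptedException {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      return pool.submit(task).get();
    } catch (ExecutionException e) {
      throw toIOException(e);
    } finally {
      pool.shutdownNow();
    }
  }
}
```

With this shape, an `IOException` thrown inside a worker reaches the caller as the same exception object, while any other failure is attached as the cause of a new `IOException`.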


+  @Override
+  protected FileStatus[] listStatus(JobConf job) throws IOException {
+    Path[] dirs = getInputPaths(job);


Aggarwal-Raghav · 2026-06-04T16:03:50Z

+    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
+
+    int numThreads = Math.max(2, HiveConf.getIntVar(job, HiveConf.ConfVars.HIVE_COMPUTE_SPLITS_NUM_THREADS));
+    ExecutorService pool = newWorkerPool(numThreads);


An ExecutorService is created on every listStatus call. Can't we make it static, as OrcInputFormat does?

IcebergInputFormat does the same. It allows changing the pool size at runtime. ORC adds a cap on the number of listing threads. So it's not uniform even now. ...
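
The trade-off discussed in this thread can be sketched as follows. This is a generic illustration with hypothetical names (`PoolLifetimeSketch`, `sharedPool`, `perCallPool`) and an arbitrary shared-pool size of 4; it does not reproduce what OrcInputFormat or IcebergInputFormat actually do.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of a static pool versus a pool created per call. */
final class PoolLifetimeSketch {

  // Daemon threads so that idle workers never keep the JVM alive.
  private static final ThreadFactory DAEMON = r -> {
    Thread t = new Thread(r, "listing-worker");
    t.setDaemon(true);
    return t;
  };

  // Option A: one pool for the whole JVM. Threads are reused across calls,
  // but the size is fixed when this class is initialized.
  private static final ExecutorService SHARED = Executors.newFixedThreadPool(4, DAEMON);

  static ExecutorService sharedPool() {
    return SHARED;
  }

  // Option B: a pool per call. Each call pays the thread start-up cost, but the
  // size follows the configured value at call time (with a floor of 2, as in
  // the hunk above). The caller is responsible for shutting the pool down.
  static ThreadPoolExecutor perCallPool(int configuredThreads) {
    int n = Math.max(2, configuredThreads);
    return new ThreadPoolExecutor(n, n, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<>(), DAEMON);
  }
}
```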

sonarqubecloud · 2026-06-04T21:28:17Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

HIVE-29647: Parallelize Parquet split generation directory listing on…

5feb893

… blob storage

asf-ci-hive added the tests pending label Jun 4, 2026

deniskuzZ requested review from abstractdog and Copilot June 4, 2026 15:47

Copilot started reviewing on behalf of deniskuzZ June 4, 2026 15:48

Copilot AI reviewed Jun 4, 2026

Aggarwal-Raghav reviewed Jun 4, 2026

review comments #1

ab9115b

asf-ci-hive added the tests passed and tests pending labels and removed the tests pending and tests passed labels Jun 4, 2026

asf-ci-hive added tests passed and removed tests pending labels Jun 4, 2026

deniskuzZ wants to merge 2 commits into apache:master from deniskuzZ:HIVE-29647

4 participants
