Expand Android audio API for decoder-only audio LLMs by kirklandsign · Pull Request #19920 · pytorch/executorch

kirklandsign · 2026-06-01T17:33:42Z

Summary:
ExecuTorch's Android LLM API has foundational audio support (prefillAudio, prefillRawAudio), but several gaps remain for running decoder-only audio LLMs like Voxtral smoothly from Kotlin/Java. This diff closes those gaps by adding parity with the image API and introducing new audio-specific features.

Changes:

MODEL_TYPE_TEXT_AUDIO constant (value 5) added to LlmModule and LlmModuleConfig for audio-specific model type discoverability. Routes to the multimodal runner in JNI, same as MODEL_TYPE_MULTIMODAL.

ByteBuffer zero-copy variants for audio prefill, mirroring the image API:

prefillAudio(ByteBuffer, batchSize, nBins, nFrames) for uint8 mel spectrograms
prefillNormalizedAudio(ByteBuffer, batchSize, nBins, nFrames) for float32 mel spectrograms (with native byte order and alignment validation)
prefillRawAudio(ByteBuffer, batchSize, nChannels, nSamples) for raw PCM bytes
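
The summary says the float32 variant validates native byte order and alignment, but the thread does not show how. As a rough sketch of what the alignment half could look like on the native side, the helper below checks that a raw pointer can be viewed as float32 without copying. `is_valid_float_view` is a hypothetical name, not a function from the PR. Byte order cannot be seen from a raw pointer, so that check would presumably stay on the Kotlin side.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the PR): returns true when `data` can be
// viewed as `n_floats` float32 values in place, without a copy.
inline bool is_valid_float_view(const void* data, std::size_t size_bytes,
                                std::size_t n_floats) {
  if (data == nullptr) {
    return false;
  }
  // The address must be a multiple of alignof(float) for an in-place view.
  const auto addr = reinterpret_cast<std::uintptr_t>(data);
  if (addr % alignof(float) != 0) {
    return false;
  }
  // The buffer must hold at least n_floats values. Dividing instead of
  // multiplying keeps the comparison free of overflow.
  return n_floats <= size_bytes / sizeof(float);
}
```

A direct ByteBuffer sliced at an odd offset is the typical way to end up with a misaligned address, which is why the pointer itself is checked rather than only the capacity.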

Typed raw audio variants for common PCM formats:

prefillRawAudio(ShortArray, ...) for PCM-16
prefillRawAudio(FloatArray, ...) for float32

WAV file path API: prefillAudioFromFile(path) loads and decodes a WAV file natively (16-bit PCM, 32-bit PCM, 32-bit IEEE float) via load_wav_audio_data(), then prefills the audio encoder.

Audio config fields in LlmModuleConfig: sampleRate (default 16000) and preprocessorPath for future audio preprocessing integration.

Audio fields in LlmGenerationConfig: audioFilePath and audioData allow audio+text generation in a single generate(prompt, config, callback) call. Audio is prefilled before text generation begins.

Differential Revision: D106580523

pytorch-bot · 2026-06-01T17:33:47Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19920

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 7601528 with merge base eeb0646:

NEW FAILURES - The following jobs have failed:

Lint / lintrunner (gh)
>>> Lint for extension/android/jni/jni_layer_llama.cpp:
pull / android / build-android (gh)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-06-01T17:33:52Z

@kirklandsign has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106580523.

kirklandsign · 2026-06-01T17:34:02Z

@claude review

github-actions · 2026-06-01T17:34:42Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

claude · 2026-06-01T17:35:10Z

Claude finished @kirklandsign's task in 2m 25s

Code Review: Expand Android audio API for decoder-only audio LLMs

Read changed files and gather context
Review JNI layer (jni_layer_llama.cpp)
Review Kotlin API (LlmModule.kt)
Review config changes (LlmModuleConfig.kt, LlmGenerationConfig.kt)
Post review feedback

Overall this is a well-structured extension that mirrors the existing image API pattern. The overflow checks, direct ByteBuffer validation, and native registration are all solid. A few issues worth addressing:

Issue 1: Atomicity gap in `generate(prompt, config, callback)` (Medium)

File: LlmModule.kt:378-397

The config-based generate calls prefillAudioFromFile or prefillAudio — each of which acquires and releases the lock independently — then calls the image-based generate overload (which acquires the lock again). This means another thread could interleave between the audio prefill and text generation.

Compare with the image path at lines 490-520, where prefill + generate happen atomically under a single lock acquisition. The audio path should have the same guarantee.

fun generate(prompt: String, config: LlmGenerationConfig, llmCallback: LlmCallback) {
    mLock.lock()
    try {
        checkNotReentrant()
        checkNotDestroyed()
        if (config.audioFilePath != null) {
            val nativeResult = prefillAudioFromFileNative(config.audioFilePath)
            if (nativeResult != 0) {
                throw ExecutorchRuntimeException.makeExecutorchException(nativeResult, "Prefill failed")
            }
        } else if (config.audioData != null) {
            val nativeResult = prefillAudioInputFloat(config.audioData, ...)
            if (nativeResult != 0) { ... }
        }
        val err = generateNative(prompt, config.seqLen, llmCallback, ...)
        if (err != 0) { ... }
    } finally {
        mLock.unlock()
    }
}

Issue 2: `prefill_audio_from_file` can abort the process (Medium-High)

File: jni_layer_llama.cpp:735-763

load_wav_audio_data() uses ET_CHECK_MSG internally, which calls runtime_abort() on failure (invalid WAV, missing file, etc.). In a JNI context this kills the entire Android process rather than returning an error code.

Other JNI methods in this file are safe because they only call runner_->prefill() which returns Result<>. But load_wav_audio_data is different — it aborts on bad input.

Consider either:

Wrapping the call in a try/catch if the runtime is configured to throw instead of abort, or
Using the lower-level WAV parsing APIs that return error codes, or
Validating the file path/header before calling load_wav_audio_data

At minimum, document this behavior in the Kotlin-side @throws so callers know a malformed WAV can crash the app.
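
To illustrate the third option, here is a minimal sketch of a header pre-check that the JNI method could run before calling `load_wav_audio_data()`, returning an error code on failure instead of reaching the aborting path. Both function names are hypothetical and not part of the PR. The check relies only on the standard RIFF layout: "RIFF" in bytes 0-3, a chunk size in bytes 4-7, and "WAVE" in bytes 8-11.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <ios>
#include <string>

// Hypothetical pre-check (not part of the PR): true when the first 12 bytes
// look like a RIFF/WAVE container header.
inline bool looks_like_wav_header(const std::uint8_t* bytes, std::size_t len) {
  if (bytes == nullptr || len < 12) {
    return false;
  }
  // Bytes 0-3 hold "RIFF", bytes 4-7 the chunk size, bytes 8-11 "WAVE".
  return std::memcmp(bytes, "RIFF", 4) == 0 &&
         std::memcmp(bytes + 8, "WAVE", 4) == 0;
}

// Reads the first 12 bytes of `path` and applies the check above. Returns
// false for missing, unreadable, or too-short files instead of aborting.
inline bool file_looks_like_wav(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  if (!in) {
    return false;
  }
  std::uint8_t header[12] = {};
  in.read(reinterpret_cast<char*>(header), sizeof(header));
  if (in.gcount() != static_cast<std::streamsize>(sizeof(header))) {
    return false;
  }
  return looks_like_wav_header(header, sizeof(header));
}
```

This only narrows the problem. A file with a valid header but a corrupt `fmt` or `data` chunk could still reach an `ET_CHECK_MSG` inside the loader, so the pre-check complements an error-returning parser rather than replacing it.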

Issue 3: `prefillRawAudio(ByteBuffer)` expected size calculation may be incorrect (Low-Medium)

File: LlmModule.kt:1007-1014

For raw PCM audio in a ByteBuffer, the expected size is computed as batchSize * nChannels * nSamples. But for PCM-16 audio (2 bytes/sample) or float32 (4 bytes/sample), the byte count should be batchSize * nChannels * nSamples * bytesPerSample. As written, it only works correctly for 8-bit PCM.

The same issue exists on the JNI side at jni_layer_llama.cpp:717:

size_t expected = static_cast<size_t>(batch_size) * n_channels * n_samples;

The caller would need to encode nSamples as the number of bytes rather than the number of samples to make this work with 16-bit or float audio. If that's the intended semantics, the parameter name and KDoc are misleading (they say "number of samples per channel"). If samples means actual sample count, the buffer size check is too small for >8-bit formats.
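
If `nSamples` is meant as a true sample count, the size check needs a sample-width factor. The sketch below shows one way to compute the expected byte count with overflow checks at each multiplication. `checked_mul`, `expected_raw_audio_bytes`, and the `bytes_per_sample` parameter are hypothetical and do not exist in the PR, which has no way to pass the sample width for the ByteBuffer variant.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Multiplies two sizes into *out. Returns false if the product would overflow.
inline bool checked_mul(std::size_t a, std::size_t b, std::size_t* out) {
  if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
    return false;
  }
  *out = a * b;
  return true;
}

// Hypothetical helper: expected byte count for a raw PCM buffer.
// bytes_per_sample is 1 for 8-bit PCM, 2 for PCM-16, 4 for float32.
// Returns false for non-positive dimensions or on overflow.
inline bool expected_raw_audio_bytes(std::int32_t batch_size,
                                     std::int32_t n_channels,
                                     std::int32_t n_samples,
                                     std::size_t bytes_per_sample,
                                     std::size_t* out) {
  if (batch_size <= 0 || n_channels <= 0 || n_samples <= 0 ||
      bytes_per_sample == 0) {
    return false;
  }
  std::size_t total = static_cast<std::size_t>(batch_size);
  return checked_mul(total, static_cast<std::size_t>(n_channels), &total) &&
         checked_mul(total, static_cast<std::size_t>(n_samples), &total) &&
         checked_mul(total, bytes_per_sample, out);
}
```

For one second of mono PCM-16 at 16 kHz this yields 32000 bytes, whereas the current `batch_size * n_channels * n_samples` expression yields 16000.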

Issue 4: `sampleRate` and `preprocessorPath` are config-only with no JNI plumbing (Nit)

File: LlmModuleConfig.kt:26-27

These new fields (sampleRate, preprocessorPath) are stored in the config but never passed to native code via initHybrid. The constructor at LlmModule.kt:198-209 doesn't use them either. This is fine if they're placeholders for future work, but consider adding a brief note in the class KDoc or builder methods indicating they're reserved/unused pending native support.

Issue 5: Minor — redundant copy in JNI `prefill_audio_input_float` (Nit)

File: jni_layer_llama.cpp:506-537

std::vector<jfloat> data_jfloat(data_size);
std::vector<float> data_f(data_size);
data->getRegion(0, data_size, data_jfloat.data());
for (int i = 0; i < data_size; i++) {
  data_f[i] = data_jfloat[i];
}

Since jfloat is float on all Android ABIs, the element-by-element copy is unnecessary — you could read directly into data_f or use memcpy. Same pattern exists for prefill_normalized_images_input with jfloat. This is pre-existing code, so not a blocker.
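
A minimal sketch of the single-copy version follows. It cannot include `<jni.h>` and stay self-contained, so `FakeFloatArray` is a hypothetical stand-in for the JNI float array wrapper, and the `jfloat` alias mirrors the `typedef float jfloat;` from `<jni.h>`. Only the shape of the read is meant to carry over.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Stand-in for the JNI typedef; <jni.h> declares `typedef float jfloat;`.
using jfloat = float;
static_assert(std::is_same<jfloat, float>::value,
              "reading directly into std::vector<float> relies on this");

// Hypothetical stand-in for a JNI float array with a getRegion-style call
// that copies `length` elements starting at `start` into `dst`.
struct FakeFloatArray {
  std::vector<jfloat> storage;

  void getRegion(std::size_t start, std::size_t length, jfloat* dst) const {
    std::memcpy(dst, storage.data() + start, length * sizeof(jfloat));
  }
};

// Reads straight into the destination vector: one copy, no temporary
// jfloat vector and no element-by-element loop.
inline std::vector<float> read_floats(const FakeFloatArray& array) {
  std::vector<float> out(array.storage.size());
  if (!out.empty()) {
    array.getRegion(0, out.size(), out.data());
  }
  return out;
}
```

The `static_assert` documents the assumption and would fail the build on a platform where the two types ever diverged.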

Summary

The API surface is clean and the parity with the image API is well done. The main concerns are:

1. Atomicity of audio prefill + generate in the config-based overload
2. Process abort risk from load_wav_audio_data on malformed files
3. Buffer size validation semantics for raw audio ByteBuffer with >8-bit PCM

None are blockers if callers are careful, but (1) and (2) could cause hard-to-debug issues in production Android apps.

Copilot

Pull request overview

This PR expands ExecuTorch’s Android LLM API to better support decoder-only “text + audio” workflows by introducing a new audio-oriented model type, adding additional audio prefill entrypoints (including ByteBuffer variants), and allowing generation configs to carry audio inputs that are prefilled before text generation.

Changes:

Add MODEL_TYPE_TEXT_AUDIO = 5 and route it to the multimodal runner on the JNI side (same path as MODEL_TYPE_MULTIMODAL).
Add new audio-prefill APIs in the Android wrapper, including direct ByteBuffer variants and WAV file path prefill.
Extend LlmModuleConfig/LlmGenerationConfig with audio-related fields to support combined audio+text generation calls.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
extension/android/jni/jni_layer_llama.cpp	Adds model-type routing for text-audio and new JNI entrypoints for audio prefill (buffers, typed arrays) plus WAV file prefill.
extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModuleConfig.kt	Adds `MODEL_TYPE_TEXT_AUDIO` and new audio config fields (`sampleRate`, `preprocessorPath`).
extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModule.kt	Adds prefill-before-generate behavior via `LlmGenerationConfig` audio fields, plus new audio prefill APIs (ByteBuffer, typed raw PCM, WAV path).
extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmGenerationConfig.kt	Adds audio fields and builder APIs for passing audio to `generate(prompt, config, callback)`.

Comments suppressed due to low confidence (1)

extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModule.kt:397

generate(prompt, config, ...) performs audio prefill and generation as separate calls that each acquire/release the module lock, allowing other threads to interleave stateful operations (resetContext/stop/generate) between the prefill and the actual generate. Also, when using audioData it forwards dimensions without verifying audioData.size == batchSize * nBins * nFrames, which can trigger a native ET_CHECK abort in llm::Audio's constructor if mismatched. Consider holding the lock across both operations and validating the audioData shape before calling native code.

  fun generate(prompt: String, config: LlmGenerationConfig, llmCallback: LlmCallback) {
    if (config.audioFilePath != null) {
      prefillAudioFromFile(config.audioFilePath)
    } else if (config.audioData != null) {
      prefillAudio(config.audioData, config.audioBatchSize, config.audioNBins, config.audioNFrames)
    }
    generate(
        null,
        0,
        0,
        0,
        prompt,
        config.seqLen,
        llmCallback,
        config.echo,
        config.temperature,
        config.numBos,
        config.numEos,
    )
  }
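
The shape validation mentioned in this comment can be sketched as follows. It is written in C++ to match the JNI layer, but the same arithmetic applies to a Kotlin-side check before the native call. `audio_shape_matches` is a hypothetical helper, not a function from the PR. It divides instead of multiplying, so no intermediate value can overflow even with extreme dimensions.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical pre-check: true when data_len == batch_size * n_bins * n_frames.
// Rejects non-positive dimensions so a bad shape yields an error code
// instead of reaching a native check that aborts.
inline bool audio_shape_matches(std::size_t data_len, std::int32_t batch_size,
                                std::int32_t n_bins, std::int32_t n_frames) {
  if (batch_size <= 0 || n_bins <= 0 || n_frames <= 0) {
    return false;
  }
  const std::size_t batch = static_cast<std::size_t>(batch_size);
  const std::size_t bins = static_cast<std::size_t>(n_bins);
  // Peel off one dimension at a time; any remainder means a mismatch.
  if (data_len % batch != 0) {
    return false;
  }
  const std::size_t per_batch = data_len / batch;
  if (per_batch % bins != 0) {
    return false;
  }
  return per_batch / bins == static_cast<std::size_t>(n_frames);
}
```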

+    std::string file_path = path->toStdString();
+    std::vector<float> audio_data =
+        llm::load_wav_audio_data(file_path);
+    if (audio_data.empty()) {
+      return static_cast<jint>(Error::InvalidArgument);


+    std::vector<jshort> data_jshort(data_size);
+    data->getRegion(0, data_size, data_jshort.data());
+    size_t byte_size = data_size * sizeof(jshort);
+    std::vector<uint8_t> data_u8(byte_size);
+    std::memcpy(data_u8.data(), data_jshort.data(), byte_size);
+    llm::RawAudio audio{std::move(data_u8), batch_size, n_channels, n_samples};
+    std::vector<llm::MultimodalInput> inputs;
+    inputs.emplace_back(llm::MultimodalInput{std::move(audio)});
+    int32_t bos = needs_bos_ ? num_bos_ : 0;


+    /** Sets a WAV file path to prefill as audio before text generation. */
+    fun audioFilePath(path: String): Builder = apply { this.audioFilePath = path }
+
+    /** Sets pre-processed audio data (mel spectrogram) to prefill before text generation. */
+    fun audioData(
+        data: FloatArray,
+        batchSize: Int,
+        nBins: Int,
+        nFrames: Int,
+    ): Builder = apply {
+      this.audioData = data
+      this.audioBatchSize = batchSize
+      this.audioNBins = nBins
+      this.audioNFrames = nFrames
+    }
+
    /** Constructs the LlmGenerationConfig instance with the configured parameters. */
    fun build(): LlmGenerationConfig =
-        LlmGenerationConfig(echo, maxNewTokens, warming, seqLen, temperature, numBos, numEos)
+        LlmGenerationConfig(
+            echo,
+            maxNewTokens,
+            warming,
+            seqLen,
+            temperature,
+            numBos,
+            numEos,
+            audioFilePath,
+            audioData,
+            audioBatchSize,
+            audioNBins,
+            audioNFrames,
+        )


+   * This is the zero-copy counterpart to [prefillAudio] with [ByteArray]. For large audio inputs
+   * (e.g., multi-minute clips), using a direct ByteBuffer avoids the JNI array copy overhead.
+   *


Copilot AI review requested due to automatic review settings June 1, 2026 17:33

kirklandsign requested a review from psiddh as a code owner June 1, 2026 17:33

meta-cla Bot added the CLA Signed label Jun 1, 2026

meta-codesync Bot added the fb-exported and meta-exported labels Jun 1, 2026

Copilot started reviewing on behalf of kirklandsign June 1, 2026 17:33

Copilot AI reviewed Jun 1, 2026

kirklandsign wants to merge 1 commit into main from export-D106580523
