Expand Android audio API for decoder-only audio LLMs#19920

Open
kirklandsign wants to merge 1 commit into main from export-D106580523
@kirklandsign
Contributor

Summary:
ExecuTorch's Android LLM API has foundational audio support (prefillAudio, prefillRawAudio), but several gaps remain for running decoder-only audio LLMs like Voxtral smoothly from Kotlin/Java. This diff closes those gaps by adding parity with the image API and introducing new audio-specific features.

Changes:

MODEL_TYPE_TEXT_AUDIO constant (value 5) added to LlmModule and LlmModuleConfig for audio-specific model type discoverability. Routes to the multimodal runner in JNI, same as MODEL_TYPE_MULTIMODAL.

ByteBuffer zero-copy variants for audio prefill, mirroring the image API:

  • prefillAudio(ByteBuffer, batchSize, nBins, nFrames) for uint8 mel spectrograms
  • prefillNormalizedAudio(ByteBuffer, batchSize, nBins, nFrames) for float32 mel spectrograms (with native byte order and alignment validation)
  • prefillRawAudio(ByteBuffer, batchSize, nChannels, nSamples) for raw PCM bytes
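
The alignment and size validation mentioned for the float32 variant could look roughly like the sketch below on the native side. This is an illustration under assumptions, not the PR's code: `validate_float_buffer` and its parameters are hypothetical names, and the check accepts any buffer that is at least as large as the expected byte count.

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>

// Hypothetical helper (not from the PR): returns true when a raw buffer can
// safely be viewed as batch_size * n_bins * n_frames float32 values.
bool validate_float_buffer(
    const void* data,
    std::size_t capacity_bytes,
    std::size_t batch_size,
    std::size_t n_bins,
    std::size_t n_frames) {
  if (data == nullptr) {
    return false;
  }
  // The address must be suitably aligned for float access.
  if (reinterpret_cast<std::uintptr_t>(data) % alignof(float) != 0) {
    return false;
  }
  // Multiply step by step, rejecting any product that would overflow size_t.
  std::size_t bytes = 1;
  for (std::size_t dim : {batch_size, n_bins, n_frames, sizeof(float)}) {
    if (dim != 0 && bytes > SIZE_MAX / dim) {
      return false;
    }
    bytes *= dim;
  }
  // Reject empty shapes and buffers that are too small.
  return bytes != 0 && capacity_bytes >= bytes;
}
```

Checking alignment before reinterpreting the bytes matters because a ByteBuffer slice can start at an arbitrary offset.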

Typed raw audio variants for common PCM formats:

  • prefillRawAudio(ShortArray, ...) for PCM-16
  • prefillRawAudio(FloatArray, ...) for float32

WAV file path API: prefillAudioFromFile(path) loads and decodes a WAV file natively (16-bit PCM, 32-bit PCM, 32-bit IEEE float) via load_wav_audio_data(), then prefills the audio encoder.
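
For intuition, decoding 16-bit PCM into the float samples that the audio encoder consumes can be sketched as below. This is not the implementation of load_wav_audio_data(); the function name `pcm16_to_float` is hypothetical, and scaling by 1/32768 is a common convention assumed here rather than something the PR states.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: converts signed 16-bit PCM samples to floats in
// [-1.0, 1.0). The 1/32768 scale factor is an assumption for illustration.
std::vector<float> pcm16_to_float(const std::vector<std::int16_t>& pcm) {
  std::vector<float> out;
  out.reserve(pcm.size());
  for (std::int16_t sample : pcm) {
    out.push_back(static_cast<float>(sample) / 32768.0f);
  }
  return out;
}
```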

Audio config fields in LlmModuleConfig: sampleRate (default 16000) and preprocessorPath for future audio preprocessing integration.

Audio fields in LlmGenerationConfig: audioFilePath and audioData allow audio+text generation in a single generate(prompt, config, callback) call. Audio is prefilled before text generation begins.

Differential Revision: D106580523

Copilot AI review requested due to automatic review settings June 1, 2026 17:33
@kirklandsign kirklandsign requested a review from psiddh as a code owner June 1, 2026 17:33
@pytorch-bot
pytorch-bot Bot commented Jun 1, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19920

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 7601528 with merge base eeb0646:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jun 1, 2026
@meta-codesync
Contributor

meta-codesync Bot commented Jun 1, 2026

@kirklandsign has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106580523.

@kirklandsign
Contributor Author

@claude review

@github-actions

github-actions Bot commented Jun 1, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@claude

claude Bot commented Jun 1, 2026

Claude finished @kirklandsign's task in 2m 25s


Code Review: Expand Android audio API for decoder-only audio LLMs

  • Read changed files and gather context
  • Review JNI layer (jni_layer_llama.cpp)
  • Review Kotlin API (LlmModule.kt)
  • Review config changes (LlmModuleConfig.kt, LlmGenerationConfig.kt)
  • Post review feedback

Overall this is a well-structured extension that mirrors the existing image API pattern. The overflow checks, direct ByteBuffer validation, and native registration are all solid. A few issues worth addressing:


Issue 1: Atomicity gap in generate(prompt, config, callback) (Medium)

File: LlmModule.kt:378-397

The config-based generate calls prefillAudioFromFile or prefillAudio — each of which acquires and releases the lock independently — then calls the image-based generate overload (which acquires the lock again). This means another thread could interleave between the audio prefill and text generation.

Compare with the image path at lines 490-520, where prefill + generate happen atomically under a single lock acquisition. The audio path should have the same guarantee.

fun generate(prompt: String, config: LlmGenerationConfig, llmCallback: LlmCallback) {
    mLock.lock()
    try {
        checkNotReentrant()
        checkNotDestroyed()
        if (config.audioFilePath != null) {
            val nativeResult = prefillAudioFromFileNative(config.audioFilePath)
            if (nativeResult != 0) {
                throw ExecutorchRuntimeException.makeExecutorchException(nativeResult, "Prefill failed")
            }
        } else if (config.audioData != null) {
            val nativeResult = prefillAudioInputFloat(config.audioData, ...)
            if (nativeResult != 0) { ... }
        }
        val err = generateNative(prompt, config.seqLen, llmCallback, ...)
        if (err != 0) { ... }
    } finally {
        mLock.unlock()
    }
}

Issue 2: prefill_audio_from_file can abort the process (Medium-High)

File: jni_layer_llama.cpp:735-763

load_wav_audio_data() uses ET_CHECK_MSG internally, which calls runtime_abort() on failure (invalid WAV, missing file, etc.). In a JNI context this kills the entire Android process rather than returning an error code.

Other JNI methods in this file are safe because they only call runner_->prefill() which returns Result<>. But load_wav_audio_data is different — it aborts on bad input.

Consider either:

  1. Wrapping the call in a try/catch if the runtime is configured to throw instead of abort, or
  2. Using the lower-level WAV parsing APIs that return error codes, or
  3. Validating the file path/header before calling load_wav_audio_data

At minimum, document this behavior in the Kotlin-side @throws so callers know a malformed WAV can crash the app.
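
As an illustration of option 3, a pre-check on the file's first bytes can return an error code before the aborting loader is ever called. The sketch below is an assumption-laden example, not ExecuTorch code: `WavCheck` and `check_wav_header` are hypothetical names, and the check only covers the RIFF/WAVE magic bytes, so passing it does not guarantee that load_wav_audio_data() will accept the file.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical result codes for illustration; not ExecuTorch's Error enum.
enum class WavCheck { Ok, TooShort, NotRiff, NotWave };

// Checks only the outer RIFF container: "RIFF", a 4-byte size, then "WAVE".
WavCheck check_wav_header(const std::vector<std::uint8_t>& bytes) {
  constexpr std::size_t kMinHeaderBytes = 12;
  if (bytes.size() < kMinHeaderBytes) {
    return WavCheck::TooShort;
  }
  if (std::memcmp(bytes.data(), "RIFF", 4) != 0) {
    return WavCheck::NotRiff;
  }
  if (std::memcmp(bytes.data() + 8, "WAVE", 4) != 0) {
    return WavCheck::NotWave;
  }
  return WavCheck::Ok;
}
```

A JNI entrypoint could map any non-Ok result to an error return instead of letting the process abort on obviously invalid input.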


Issue 3: prefillRawAudio(ByteBuffer) expected size calculation may be incorrect (Low-Medium)

File: LlmModule.kt:1007-1014

For raw PCM audio in a ByteBuffer, the expected size is computed as batchSize * nChannels * nSamples. But for PCM-16 audio (2 bytes/sample) or float32 (4 bytes/sample), the byte count should be batchSize * nChannels * nSamples * bytesPerSample. As written, it only works correctly for 8-bit PCM.

The same issue exists on the JNI side at jni_layer_llama.cpp:717:

size_t expected = static_cast<size_t>(batch_size) * n_channels * n_samples;

The caller would need to encode nSamples as the number of bytes rather than the number of samples to make this work with 16-bit or float audio. If that's the intended semantics, the parameter name and KDoc are misleading (they say "number of samples per channel"). If samples means actual sample count, the buffer size check is too small for >8-bit formats.
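
If nSamples is meant as a true sample count, the size check needs a bytes-per-sample factor. The sketch below shows one way to compute it with overflow protection; `expected_pcm_bytes` and its `bytes_per_sample` parameter are hypothetical and do not appear in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: computes the byte size of a raw PCM buffer.
// Returns false for non-positive dimensions or if the product overflows.
bool expected_pcm_bytes(
    std::int32_t batch_size,
    std::int32_t n_channels,
    std::int32_t n_samples,
    std::size_t bytes_per_sample,
    std::size_t* out) {
  assert(out != nullptr);
  if (batch_size <= 0 || n_channels <= 0 || n_samples <= 0 ||
      bytes_per_sample == 0) {
    return false;
  }
  const std::size_t dims[] = {
      static_cast<std::size_t>(batch_size),
      static_cast<std::size_t>(n_channels),
      static_cast<std::size_t>(n_samples)};
  std::size_t total = bytes_per_sample;
  for (std::size_t dim : dims) {
    if (total > SIZE_MAX / dim) {
      return false; // product would overflow size_t
    }
    total *= dim;
  }
  *out = total;
  return true;
}
```

For example, one batch of stereo PCM-16 with 16000 samples per channel needs 1 * 2 * 16000 * 2 = 64000 bytes, whereas the formula quoted above would expect only 32000.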


Issue 4: sampleRate and preprocessorPath are config-only with no JNI plumbing (Nit)

File: LlmModuleConfig.kt:26-27

These new fields (sampleRate, preprocessorPath) are stored in the config but never passed to native code via initHybrid. The constructor at LlmModule.kt:198-209 doesn't use them either. This is fine if they're placeholders for future work, but consider adding a brief note in the class KDoc or builder methods indicating they're reserved/unused pending native support.


Issue 5: Minor — redundant copy in JNI prefill_audio_input_float (Nit)

File: jni_layer_llama.cpp:506-537

std::vector<jfloat> data_jfloat(data_size);
std::vector<float> data_f(data_size);
data->getRegion(0, data_size, data_jfloat.data());
for (int i = 0; i < data_size; i++) {
  data_f[i] = data_jfloat[i];
}

Since jfloat is float on all Android ABIs, the element-by-element copy is unnecessary — you could read directly into data_f or use memcpy. Same pattern exists for prefill_normalized_images_input with jfloat. This is pre-existing code, so not a blocker.
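
The single-copy alternative can be sketched without JNI headers as follows. `jfloat_t` is a stand-in for the `jfloat` typedef from jni.h and `copy_region` is a hypothetical name; in the real JNI method the same effect would come from passing the float vector's data pointer straight to getRegion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Stand-in for JNI's jfloat typedef, so the sketch compiles without jni.h.
using jfloat_t = float;
static_assert(
    std::is_same<jfloat_t, float>::value,
    "bulk copy is only valid when jfloat is float");

// Copies n elements in one memcpy instead of an element-by-element loop.
std::vector<float> copy_region(const jfloat_t* src, std::size_t n) {
  assert(src != nullptr || n == 0);
  std::vector<float> dst(n);
  if (n != 0) {
    std::memcpy(dst.data(), src, n * sizeof(float));
  }
  return dst;
}
```

The static_assert documents the assumption the shortcut relies on and fails the build if it ever stops holding.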


Summary

The API surface is clean and the parity with the image API is well done. The main concerns are:

  1. Atomicity of audio prefill + generate in the config-based overload
  2. Process abort risk from load_wav_audio_data on malformed files
  3. Buffer size validation semantics for raw audio ByteBuffer with >8-bit PCM

None are blockers if callers are careful, but (1) and (2) could cause hard-to-debug issues in production Android apps.


Contributor

Copilot AI left a comment

Pull request overview

This PR expands ExecuTorch’s Android LLM API to better support decoder-only “text + audio” workflows by introducing a new audio-oriented model type, adding additional audio prefill entrypoints (including ByteBuffer variants), and allowing generation configs to carry audio inputs that are prefilled before text generation.

Changes:

  • Add MODEL_TYPE_TEXT_AUDIO = 5 and route it to the multimodal runner on the JNI side (same path as MODEL_TYPE_MULTIMODAL).
  • Add new audio-prefill APIs in the Android wrapper, including direct ByteBuffer variants and WAV file path prefill.
  • Extend LlmModuleConfig/LlmGenerationConfig with audio-related fields to support combined audio+text generation calls.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| extension/android/jni/jni_layer_llama.cpp | Adds model-type routing for text-audio and new JNI entrypoints for audio prefill (buffers, typed arrays) plus WAV file prefill. |
| extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModuleConfig.kt | Adds MODEL_TYPE_TEXT_AUDIO and new audio config fields (sampleRate, preprocessorPath). |
| extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModule.kt | Adds prefill-before-generate behavior via LlmGenerationConfig audio fields, plus new audio prefill APIs (ByteBuffer, typed raw PCM, WAV path). |
| extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmGenerationConfig.kt | Adds audio fields and builder APIs for passing audio to generate(prompt, config, callback). |
Comments suppressed due to low confidence (1)

extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModule.kt:397

  • generate(prompt, config, ...) performs audio prefill and generation as separate calls that each acquire/release the module lock, allowing other threads to interleave stateful operations (resetContext/stop/generate) between the prefill and the actual generate. Also, when using audioData it forwards dimensions without verifying audioData.size == batchSize * nBins * nFrames, which can trigger a native ET_CHECK abort in llm::Audio’s constructor if mismatched. Consider holding the lock across both operations and validating the audioData shape before calling native code.
  fun generate(prompt: String, config: LlmGenerationConfig, llmCallback: LlmCallback) {
    if (config.audioFilePath != null) {
      prefillAudioFromFile(config.audioFilePath)
    } else if (config.audioData != null) {
      prefillAudio(config.audioData, config.audioBatchSize, config.audioNBins, config.audioNFrames)
    }
    generate(
        null,
        0,
        0,
        0,
        prompt,
        config.seqLen,
        llmCallback,
        config.echo,
        config.temperature,
        config.numBos,
        config.numEos,
    )
  }
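
The shape validation suggested in this comment can be done without any multiplication, which sidesteps overflow entirely. The sketch below is illustrative only; `audio_shape_matches` is a hypothetical name, and the real check would more likely live on the Kotlin side before the native call.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when data_size == batch_size * n_bins * n_frames.
// Uses division and remainder so no intermediate product can overflow.
bool audio_shape_matches(
    std::size_t data_size,
    std::int64_t batch_size,
    std::int64_t n_bins,
    std::int64_t n_frames) {
  if (batch_size <= 0 || n_bins <= 0 || n_frames <= 0) {
    return false;
  }
  const std::size_t batch = static_cast<std::size_t>(batch_size);
  const std::size_t bins = static_cast<std::size_t>(n_bins);
  if (data_size % batch != 0) {
    return false;
  }
  const std::size_t per_batch = data_size / batch;
  if (per_batch % bins != 0) {
    return false;
  }
  return per_batch / bins == static_cast<std::size_t>(n_frames);
}
```

Rejecting a mismatch here lets the caller return an error instead of reaching the native ET_CHECK.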

Comment on lines +742 to +746
std::string file_path = path->toStdString();
std::vector<float> audio_data =
    llm::load_wav_audio_data(file_path);
if (audio_data.empty()) {
  return static_cast<jint>(Error::InvalidArgument);
Comment on lines +654 to +662
std::vector<jshort> data_jshort(data_size);
data->getRegion(0, data_size, data_jshort.data());
size_t byte_size = data_size * sizeof(jshort);
std::vector<uint8_t> data_u8(byte_size);
std::memcpy(data_u8.data(), data_jshort.data(), byte_size);
llm::RawAudio audio{std::move(data_u8), batch_size, n_channels, n_samples};
std::vector<llm::MultimodalInput> inputs;
inputs.emplace_back(llm::MultimodalInput{std::move(audio)});
int32_t bos = needs_bos_ ? num_bos_ : 0;
Comment on lines +84 to +115
/** Sets a WAV file path to prefill as audio before text generation. */
fun audioFilePath(path: String): Builder = apply { this.audioFilePath = path }

/** Sets pre-processed audio data (mel spectrogram) to prefill before text generation. */
fun audioData(
    data: FloatArray,
    batchSize: Int,
    nBins: Int,
    nFrames: Int,
): Builder = apply {
  this.audioData = data
  this.audioBatchSize = batchSize
  this.audioNBins = nBins
  this.audioNFrames = nFrames
}

/** Constructs the LlmGenerationConfig instance with the configured parameters. */
fun build(): LlmGenerationConfig =
    // Removed by this diff:
    //   LlmGenerationConfig(echo, maxNewTokens, warming, seqLen, temperature, numBos, numEos)
    // Added by this diff:
    LlmGenerationConfig(
        echo,
        maxNewTokens,
        warming,
        seqLen,
        temperature,
        numBos,
        numEos,
        audioFilePath,
        audioData,
        audioBatchSize,
        audioNBins,
        audioNFrames,
    )
Comment on lines +814 to +816
* This is the zero-copy counterpart to [prefillAudio] with [ByteArray]. For large audio inputs
* (e.g., multi-minute clips), using a direct ByteBuffer avoids the JNI array copy overhead.
*
Labels

CLA Signed, fb-exported, meta-exported

2 participants