Expand Android audio API for decoder-only audio LLMs #19920
Summary: ExecuTorch's Android LLM API has foundational audio support (`prefillAudio`, `prefillRawAudio`), but several gaps remain for running decoder-only audio LLMs like Voxtral smoothly from Kotlin/Java. This diff closes those gaps by adding parity with the image API and introducing new audio-specific features.

Changes:

- `MODEL_TYPE_TEXT_AUDIO` constant (value 5) added to `LlmModule` and `LlmModuleConfig` for audio-specific model type discoverability. Routes to the multimodal runner in JNI, same as `MODEL_TYPE_MULTIMODAL`.
- ByteBuffer zero-copy variants for audio prefill, mirroring the image API:
  - `prefillAudio(ByteBuffer, batchSize, nBins, nFrames)` for uint8 mel spectrograms
  - `prefillNormalizedAudio(ByteBuffer, batchSize, nBins, nFrames)` for float32 mel spectrograms (with native byte order and alignment validation)
  - `prefillRawAudio(ByteBuffer, batchSize, nChannels, nSamples)` for raw PCM bytes
- Typed raw audio variants for common PCM formats:
  - `prefillRawAudio(ShortArray, ...)` for PCM-16
  - `prefillRawAudio(FloatArray, ...)` for float32
- WAV file path API: `prefillAudioFromFile(path)` loads and decodes a WAV file natively (16-bit PCM, 32-bit PCM, 32-bit IEEE float) via `load_wav_audio_data()`, then prefills the audio encoder.
- Audio config fields in `LlmModuleConfig`: `sampleRate` (default 16000) and `preprocessorPath` for future audio preprocessing integration.
- Audio fields in `LlmGenerationConfig`: `audioFilePath` and `audioData` allow audio+text generation in a single `generate(prompt, config, callback)` call. Audio is prefilled before text generation begins.

Differential Revision: D106580523
🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19920

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures as of commit 7601528 with merge base eeb0646.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@kirklandsign has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106580523.
@claude review
Claude finished @kirklandsign's task in 2m 25s.

Code Review: Expand Android audio API for decoder-only audio LLMs

Overall this is a well-structured extension that mirrors the existing image API pattern. The overflow checks, direct ByteBuffer validation, and native registration are all solid. A few issues worth addressing:

Issue 1: Atomicity gap in
Pull request overview
This PR expands ExecuTorch’s Android LLM API to better support decoder-only “text + audio” workflows by introducing a new audio-oriented model type, adding additional audio prefill entrypoints (including ByteBuffer variants), and allowing generation configs to carry audio inputs that are prefilled before text generation.
Changes:
- Add `MODEL_TYPE_TEXT_AUDIO = 5` and route it to the multimodal runner on the JNI side (same path as `MODEL_TYPE_MULTIMODAL`).
- Add new audio-prefill APIs in the Android wrapper, including direct `ByteBuffer` variants and WAV file path prefill.
- Extend `LlmModuleConfig`/`LlmGenerationConfig` with audio-related fields to support combined audio+text generation calls.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| extension/android/jni/jni_layer_llama.cpp | Adds model-type routing for text-audio and new JNI entrypoints for audio prefill (buffers, typed arrays) plus WAV file prefill. |
| extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModuleConfig.kt | Adds MODEL_TYPE_TEXT_AUDIO and new audio config fields (sampleRate, preprocessorPath). |
| extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModule.kt | Adds prefill-before-generate behavior via LlmGenerationConfig audio fields, plus new audio prefill APIs (ByteBuffer, typed raw PCM, WAV path). |
| extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmGenerationConfig.kt | Adds audio fields and builder APIs for passing audio to generate(prompt, config, callback). |
Comments suppressed due to low confidence (1)
extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModule.kt:397
- `generate(prompt, config, ...)` performs audio prefill and generation as separate calls that each acquire/release the module lock, allowing other threads to interleave stateful operations (`resetContext`/`stop`/`generate`) between the prefill and the actual generate. Also, when using `audioData` it forwards dimensions without verifying `audioData.size == batchSize * nBins * nFrames`, which can trigger a native `ET_CHECK` abort in `llm::Audio`'s constructor if mismatched. Consider holding the lock across both operations and validating the `audioData` shape before calling native code.

    fun generate(prompt: String, config: LlmGenerationConfig, llmCallback: LlmCallback) {
      if (config.audioFilePath != null) {
        prefillAudioFromFile(config.audioFilePath)
      } else if (config.audioData != null) {
        prefillAudio(config.audioData, config.audioBatchSize, config.audioNBins, config.audioNFrames)
      }
      generate(
          null,
          0,
          0,
          0,
          prompt,
          config.seqLen,
          llmCallback,
          config.echo,
          config.temperature,
          config.numBos,
          config.numEos,
      )
    }

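
The shape mismatch described in the suppressed comment could also be guarded on the native side, so that a bad call returns an error instead of reaching the `ET_CHECK` in `llm::Audio`'s constructor. The following is a minimal sketch, not code from this PR: `audio_shape_matches` is a hypothetical helper, and it assumes the three dimensions arrive from JNI as `int32_t` values. It rejects non-positive dimensions and checks the last multiplication against the `uint64_t` limit so the product cannot wrap before the comparison.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of ExecuTorch): returns true only when
// data_size equals batch_size * n_bins * n_frames, without overflowing.
bool audio_shape_matches(
    std::size_t data_size,
    std::int32_t batch_size,
    std::int32_t n_bins,
    std::int32_t n_frames) {
  // Non-positive dimensions can never describe a valid tensor.
  if (batch_size <= 0 || n_bins <= 0 || n_frames <= 0) {
    return false;
  }
  // Two positive int32 values multiply to less than 2^62, which fits in uint64_t.
  const std::uint64_t plane = static_cast<std::uint64_t>(batch_size) *
      static_cast<std::uint64_t>(n_bins);
  const std::uint64_t frames = static_cast<std::uint64_t>(n_frames);
  // The third factor could overflow uint64_t, so check against the limit first.
  if (plane > std::numeric_limits<std::uint64_t>::max() / frames) {
    return false;
  }
  return static_cast<std::uint64_t>(data_size) == plane * frames;
}
```

A JNI entrypoint could call such a helper first and return an error code such as `Error::InvalidArgument` when it fails. This does not address the locking concern, which has to be solved where the module lock lives.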

    std::string file_path = path->toStdString();
    std::vector<float> audio_data =
        llm::load_wav_audio_data(file_path);
    if (audio_data.empty()) {
      return static_cast<jint>(Error::InvalidArgument);

    std::vector<jshort> data_jshort(data_size);
    data->getRegion(0, data_size, data_jshort.data());
    size_t byte_size = data_size * sizeof(jshort);
    std::vector<uint8_t> data_u8(byte_size);
    std::memcpy(data_u8.data(), data_jshort.data(), byte_size);
    llm::RawAudio audio{std::move(data_u8), batch_size, n_channels, n_samples};
    std::vector<llm::MultimodalInput> inputs;
    inputs.emplace_back(llm::MultimodalInput{std::move(audio)});
    int32_t bos = needs_bos_ ? num_bos_ : 0;

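
The PCM-16 hunk above repacks samples with a plain byte copy, so the resulting bytes keep the host's byte order and whoever consumes them has to reinterpret them as 16-bit samples in that same order. A self-contained sketch of that round trip follows. It uses `std::int16_t` in place of `jshort` (JNI defines `jshort` as a signed 16-bit integer), and both helper names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copy PCM-16 samples into a byte vector, as the hunk does.
std::vector<std::uint8_t> pcm16_to_bytes(
    const std::vector<std::int16_t>& samples) {
  std::vector<std::uint8_t> bytes(samples.size() * sizeof(std::int16_t));
  if (!bytes.empty()) {
    std::memcpy(bytes.data(), samples.data(), bytes.size());
  }
  return bytes;
}

// Hypothetical inverse: reinterpret host-order bytes as PCM-16 samples.
// A trailing odd byte, if any, is ignored.
std::vector<std::int16_t> bytes_to_pcm16(
    const std::vector<std::uint8_t>& bytes) {
  std::vector<std::int16_t> samples(bytes.size() / sizeof(std::int16_t));
  if (!samples.empty()) {
    std::memcpy(
        samples.data(), bytes.data(), samples.size() * sizeof(std::int16_t));
  }
  return samples;
}
```

Going through `std::memcpy` rather than casting the byte pointer to `std::int16_t*` avoids alignment and aliasing problems when reading the samples back.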

    /** Sets a WAV file path to prefill as audio before text generation. */
    fun audioFilePath(path: String): Builder = apply { this.audioFilePath = path }

    /** Sets pre-processed audio data (mel spectrogram) to prefill before text generation. */
    fun audioData(
        data: FloatArray,
        batchSize: Int,
        nBins: Int,
        nFrames: Int,
    ): Builder = apply {
      this.audioData = data
      this.audioBatchSize = batchSize
      this.audioNBins = nBins
      this.audioNFrames = nFrames
    }

    /** Constructs the LlmGenerationConfig instance with the configured parameters. */
    fun build(): LlmGenerationConfig =
        // Previously: LlmGenerationConfig(echo, maxNewTokens, warming, seqLen, temperature, numBos, numEos)
        LlmGenerationConfig(
            echo,
            maxNewTokens,
            warming,
            seqLen,
            temperature,
            numBos,
            numEos,
            audioFilePath,
            audioData,
            audioBatchSize,
            audioNBins,
            audioNFrames,
        )

     * This is the zero-copy counterpart to [prefillAudio] with [ByteArray]. For large audio inputs
     * (e.g., multi-minute clips), using a direct ByteBuffer avoids the JNI array copy overhead.
     *

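
For the float32 `ByteBuffer` variant, the PR description mentions native byte order and alignment validation. The native half of such a check could look roughly like the sketch below. It assumes the address and capacity have already been obtained through the JNI functions `GetDirectBufferAddress` and `GetDirectBufferCapacity`. The name `float_buffer_is_usable` is hypothetical, and byte order is left to the Kotlin/Java caller because it cannot be observed from a raw pointer.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decide whether a direct buffer can be read in place
// as `expected_floats` float32 values.
bool float_buffer_is_usable(
    const void* addr,
    std::int64_t capacity_bytes,
    std::size_t expected_floats) {
  // GetDirectBufferAddress returns NULL and GetDirectBufferCapacity returns -1
  // when the object is not a direct buffer.
  if (addr == nullptr || capacity_bytes < 0) {
    return false;
  }
  // Reading floats through a misaligned pointer is undefined behavior in C++.
  if (reinterpret_cast<std::uintptr_t>(addr) % alignof(float) != 0) {
    return false;
  }
  // Compare in units of floats so no multiplication can overflow.
  return static_cast<std::size_t>(capacity_bytes) / sizeof(float) >=
      expected_floats;
}
```

Dividing the capacity by `sizeof(float)` instead of multiplying the expected element count keeps the comparison safe for any input.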