feat(audio): add AudioModule for issue #1932 by GuoZhuoRan · Pull Request #2507 · dimensionalOS/dimos

GuoZhuoRan · 2026-06-16T12:47:50Z

Adds mic audio capture and chunked publishing as AudioStamped on an Out stream, mirroring CameraModule. Validated on macOS Apple Silicon at 50 Hz / 20 ms frames with both synthetic (sine tone) and real mic sources.

dimos/msgs/audio_msgs/AudioStamped.py: Python overlay wrapping foxglove_msgs.RawAudio for LCM encode/decode, with from_pcm() and to_numpy() helpers. Flags that builtin_interfaces.Time (not std_msgs.Header) is the wire type, so frame_id is not preserved.
dimos/hardware/sensors/audio/module.py: AudioModule(Module) with AudioConfig(ModuleConfig), async def main() lifecycle, @rpc start/stop, @skill record_clip.
examples/audio/validate_audio_module.py: LCM round-trip assert + live stream rate/timestamp validation.
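
To make the 50 Hz / 20 ms framing concrete, here is a minimal standard-library sketch of one synthetic sine-tone frame encoded as S16LE PCM and decoded back. `sine_frame` and `decode_s16le` are hypothetical helper names rather than the PR's `from_pcm()` / `to_numpy()`, and the 16 kHz mono format is an assumption made for illustration.

```python
import math
import struct


def sine_frame(freq_hz=440.0, sample_rate=16000, frame_ms=20,
               start_sample=0, amplitude=0.5):
    """Return one frame of a sine tone as little-endian signed 16-bit PCM."""
    n = sample_rate * frame_ms // 1000  # samples in one frame
    samples = [
        int(amplitude * 32767
            * math.sin(2 * math.pi * freq_hz * (start_sample + i) / sample_rate))
        for i in range(n)
    ]
    return struct.pack(f"<{n}h", *samples)


def decode_s16le(pcm):
    """Decode little-endian signed 16-bit PCM bytes into a list of ints."""
    n = len(pcm) // 2
    return list(struct.unpack(f"<{n}h", pcm[: n * 2]))


frame = sine_frame()
samples = decode_s16le(frame)
frames_per_second = 1000 // 20  # 20 ms frames give a 50 Hz publish rate
```

At these assumed settings each frame carries 320 samples in 640 bytes, which is the unit a round-trip check would compare before and after encoding.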

Problem

Closes DIM-XXX

Solution

How to Test

Contributor License Agreement

I have read and approved the CLA.

Adds mic audio capture and chunked publishing as AudioStamped on an Out stream, mirroring CameraModule. Validated on macOS Apple Silicon at 50 Hz / 20 ms frames with both synthetic (sine tone) and real mic sources.

- dimos/msgs/audio_msgs/AudioStamped.py: Python overlay wrapping foxglove_msgs.RawAudio for LCM encode/decode, with from_pcm() and to_numpy() helpers. Flags that builtin_interfaces.Time (not std_msgs.Header) is the wire type, so frame_id is not preserved.
- dimos/hardware/sensors/audio/module.py: AudioModule(Module) with AudioConfig(ModuleConfig), async def main() lifecycle, @rpc start/stop, @skill record_clip.
- examples/audio/validate_audio_module.py: LCM round-trip assert + live stream rate/timestamp validation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

greptile-apps · 2026-06-16T12:53:02Z

Greptile Summary

This PR introduces a complete voice I/O subsystem for dimos: AudioModule (mic capture → AudioStamped), SpeakerModule (PCM playback), SpeechToTextModule (VAD + AEC + Whisper), AgentTextModule (LLM routing), TextToSpeechModule (three provider backends), and FunVoiceEffectsModule (pitch/echo effects), wired together in the audio-speech-loopback and demo-audio blueprints.

AudioStamped wraps foxglove_msgs.RawAudio for LCM transport, with from_pcm() / to_numpy() / lcm_encode() / lcm_decode() helpers.
The pyttsx3 TTS path plays audio directly through the OS but never publishes tts_reference_audio, leaving SpeechToTextModule's AEC without reference data for echo suppression.
_stream_lock in SpeakerModule (line 255) and _state_lock in SpeechToTextModule (line 466) are declared as plain class attributes rather than PrivateAttr, making them shared across all instances of their respective classes.

Confidence Score: 4/5

Mergeable with caution — the pyttsx3 AEC gap is a functional regression for that provider, and the class-level Lock declarations could cause cross-instance interference in tests or concurrent pipelines.

The pyttsx3 TTS path never publishes tts_reference_audio, so SpeechToTextModule's acoustic echo canceller runs without any reference signal when that provider is active, causing echo to pass through to Whisper transcription. Separately, _stream_lock (SpeakerModule) and _state_lock (SpeechToTextModule) are class-level attributes in Pydantic models, meaning all instances of each class share the same Lock — in any multi-instance scenario the locks serialise unrelated instances against each other and can block teardown.

dimos/hardware/sensors/audio/module.py — the pyttsx3 provider block (around line 1240) and the Lock declarations for SpeakerModule (line 255) and SpeechToTextModule (line 466).
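
The class-level Lock concern can be illustrated with plain Python classes. This is a sketch only: `SharedLockModule` and `PerInstanceLockModule` are hypothetical stand-ins, and whether the dimos `Module` base class behaves exactly like a plain class here is an assumption this thread does not confirm.

```python
import threading


class SharedLockModule:
    # Class attribute: a single Lock object shared by every instance.
    _lock = threading.Lock()


class PerInstanceLockModule:
    def __init__(self):
        # Instance attribute: each object gets its own Lock.
        self._lock = threading.Lock()


a, b = SharedLockModule(), SharedLockModule()
c, d = PerInstanceLockModule(), PerInstanceLockModule()

shared = a._lock is b._lock
separate = c._lock is not d._lock

# Holding a's lock also blocks b, because it is the same object.
with a._lock:
    b_blocked = not b._lock.acquire(blocking=False)

# Holding c's lock leaves d free to proceed.
with c._lock:
    d_free = d._lock.acquire(blocking=False)
    if d_free:
        d._lock.release()
```

For Pydantic models, the review's suggestion of `PrivateAttr` would typically be written with `default_factory=threading.Lock` so that each instance receives its own lock.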

Important Files Changed

Filename	Overview
dimos/hardware/sensors/audio/module.py	Adds six audio modules (Capture, Speaker, STT, AgentText, TTS, FunVoiceEffects) and two blueprints; pyttsx3 TTS path omits tts_reference_audio publish, and class-level Lock declarations are shared across instances
dimos/msgs/audio_msgs/AudioStamped.py	New AudioStamped message type wrapping foxglove_msgs.RawAudio; fallback timestamp in from_pcm still uses time.monotonic() but this was already flagged in prior review threads
examples/audio/validate_audio_module.py	LCM round-trip + live stream validation script; explicitly passes time.monotonic() as ts in the round-trip test (line 60), inconsistent with module's time.time() timestamps
dimos/robot/all_blueprints.py	Auto-generated registry updated to include audio-speech-loopback and demo-audio blueprints; straightforward addition
dimos/hardware/sensors/audio/__init__.py	Empty __init__ file for the new audio package; no issues
dimos/msgs/audio_msgs/__init__.py	Empty __init__ file for audio_msgs package; no issues

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Mic as AudioModule (Mic)
    participant STT as SpeechToTextModule
    participant Agent as AgentTextModule
    participant TTS as TextToSpeechModule
    participant FX as FunVoiceEffectsModule
    participant Speaker as SpeakerModule

    Mic->>STT: AudioStamped (mic_audio)
    STT->>Agent: str (speech_text)
    Agent->>TTS: str (agent_response)
    TTS->>STT: bool (tts_active_signal)
    TTS->>STT: str (recent_tts_text)
    TTS->>STT: AudioStamped (tts_reference_audio) — missing for pyttsx3
    TTS->>FX: AudioStamped (tts_audio_raw)
    Speaker->>STT: bool (speaker_playing_signal)
    FX->>Speaker: AudioStamped (tts_audio)

Reviews (13): Last reviewed commit: "docs: add audio startup environment note..."

greptile-apps · 2026-06-16T12:53:09Z

+    @skill
+    def record_clip(self, seconds: float = 1.0) -> bytes:
+        """Record and return a clip of raw PCM audio.
+
+        Collects frames from the live audio stream for `seconds` seconds and
+        returns them concatenated as raw S16LE PCM bytes.
+        """
+        import threading
+
+        buf: list[bytes] = []
+        done = threading.Event()
+        collected = [0.0]
+
+        def on_frame(msg: AudioStamped) -> None:
+            buf.append(msg.data)
+            collected[0] += self.config.frame_ms / 1000.0
+            if collected[0] >= seconds:
+                done.set()
+
+        unsub = self.audio.subscribe(on_frame)
+        done.wait(timeout=seconds + 2.0)
+        unsub()
+        return b"".join(buf)


record_clip silently returns empty bytes if the module is not running

If record_clip is called before start() or after stop(), no frames will ever arrive, done.wait will time out after seconds + 2.0 seconds, and the method returns b"" with no error or log message. Callers have no way to distinguish a successful empty recording from a misconfigured call. At minimum, a log warning on timeout (or a raised exception) would surface the problem.
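
One way to surface the timeout is sketched below, assuming a stream whose `subscribe()` returns an unsubscribe callable as in the excerpt above. `FakeStream` is a hypothetical stand-in for the module's Out stream, and raising `TimeoutError` is just one of the two options mentioned (a logged warning is the other).

```python
import math
import threading
import time


class FakeStream:
    """Hypothetical stand-in for an Out stream with subscribe/publish."""

    def __init__(self):
        self._subs = []

    def subscribe(self, cb):
        self._subs.append(cb)
        return lambda: self._subs.remove(cb)

    def publish(self, data):
        for cb in list(self._subs):
            cb(data)


def record_clip(stream, seconds, frame_ms=20, grace_s=2.0):
    """Collect frames for `seconds`; raise if the stream never delivers them."""
    buf = []
    done = threading.Event()
    needed = max(1, math.ceil(seconds * 1000 / frame_ms))

    def on_frame(data):
        buf.append(data)
        if len(buf) >= needed:
            done.set()

    unsub = stream.subscribe(on_frame)
    try:
        finished = done.wait(timeout=seconds + grace_s)
    finally:
        unsub()  # always detach, even on timeout
    if not finished:
        raise TimeoutError(
            f"record_clip: got {len(buf)}/{needed} frames; is the module running?"
        )
    return b"".join(buf)


def _feed(stream, n=5):
    for _ in range(n):
        stream.publish(bytes(640))  # one 20 ms frame of silence
        time.sleep(0.005)


live = FakeStream()
threading.Timer(0.05, _feed, args=(live,)).start()
clip = record_clip(live, seconds=0.04, grace_s=0.5)

try:
    record_clip(FakeStream(), seconds=0.02, grace_s=0.05)
    timed_out = False
except TimeoutError:
    timed_out = True
```

With this shape a caller can tell a stopped module (exception) apart from a completed recording (non-empty bytes).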

greptile-apps · 2026-06-16T12:53:10Z

+    def __repr__(self) -> str:
+        n_samples = len(self.data) // (2 if "16" in self.sample_format else 4)
+        return (
+            f"AudioStamped(rate={self.sample_rate}, ch={self.channels}, "
+            f"fmt={self.sample_format}, samples={n_samples}, ts={self.ts:.6f})"
+        )


The n_samples heuristic does not divide by self.channels, so for multi-channel audio the repr reports total interleaved samples (e.g. 640 for 20 ms of stereo 16 kHz) rather than samples per channel (320). The existing byte-width check ("16" in self.sample_format) also silently falls back to 4 bytes/sample for any unknown format string, which could produce a nonsensical count.

Suggested change

-    def __repr__(self) -> str:
-        n_samples = len(self.data) // (2 if "16" in self.sample_format else 4)
-        return (
-            f"AudioStamped(rate={self.sample_rate}, ch={self.channels}, "
-            f"fmt={self.sample_format}, samples={n_samples}, ts={self.ts:.6f})"
-        )
+    def __repr__(self) -> str:
+        bytes_per_sample = 2 if "16" in self.sample_format else 4
+        n_frames = len(self.data) // (bytes_per_sample * self.channels)
+        return (
+            f"AudioStamped(rate={self.sample_rate}, ch={self.channels}, "
+            f"fmt={self.sample_format}, frames={n_frames}, ts={self.ts:.6f})"
+        )
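
The frame-count arithmetic can be sketched with an explicit byte-width table instead of a substring check, so unknown formats fail loudly. The format names in `BYTES_PER_SAMPLE` are assumptions for illustration and may not match the strings AudioStamped actually uses.

```python
# Hypothetical format names; the real sample_format strings may differ.
BYTES_PER_SAMPLE = {"S16LE": 2, "S32LE": 4, "F32LE": 4}


def frames_per_channel(data, sample_format, channels):
    """Number of sample frames (samples per channel) in interleaved PCM."""
    try:
        width = BYTES_PER_SAMPLE[sample_format]
    except KeyError:
        raise ValueError(f"unknown sample format: {sample_format!r}") from None
    if channels < 1:
        raise ValueError("channels must be >= 1")
    return len(data) // (width * channels)


# 20 ms of stereo 16 kHz S16LE: 320 frames x 2 channels x 2 bytes.
stereo_20ms = bytes(16000 * 20 // 1000 * 2 * 2)

per_channel = frames_per_channel(stereo_20ms, "S16LE", channels=2)
interleaved_total = len(stereo_20ms) // 2  # what the original heuristic reports
```

Dividing by the channel count is what separates the per-channel figure from the interleaved total.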

leshy · 2026-06-16T14:58:02Z

+    audio metadata.  Serialises to/from foxglove_msgs.RawAudio on the wire.
+    """
+
+    msg_name = "foxglove_msgs.RawAudio"  # wire type used for LCM


we don't use foxglove, where does this come from?

Foxglove is not a new dependency; it's already mirrored into dimos_lcm, and RawAudio is the only audio type in there, so I reused it. Left a note that it's a stand-in pending a native Header-bearing type.

leshy · 2026-06-16T14:59:01Z

+    def lcm_encode(self) -> bytes:
+        """Encode to foxglove_msgs.RawAudio wire bytes.
+
+        NOTE: frame_id and seq from self.header are NOT preserved (the wire


ros2 header has no seq

why not preserve frame_id?

My comment is wrong; I will fix it. frame_id does exist in the header, but RawAudio carries only a timestamp, so there is no frame_id field on the wire type to put it in. Preserving frame_id means adding a header-bearing audio type to dimos-lcm; we can discuss it today.

- Remove all mentions of `seq` (ROS2 std_msgs/Header has no seq field)
- Reword frame_id note: dropped because RawAudio has no frame_id field on the wire, not by design choice
- Mark foxglove_msgs.RawAudio as a temporary stand-in pending team decision on a native Header-bearing LCM type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

greptile-apps · 2026-06-17T02:38:27Z

+        ts: float | None = None,
+    ) -> AudioStamped:
+        """Construct from raw PCM bytes."""
+        t = ts if ts is not None else time.monotonic()


The from_pcm factory's fallback timestamp uses time.monotonic(), which returns an opaque system-relative counter (seconds since boot) rather than a Unix wall-clock time. Any caller that omits the ts argument — including external consumers of this public API — will create an AudioStamped whose ts field is near 0–86400 rather than near the Unix epoch (~1.7 × 10⁹). This makes Timestamped.dt() return a date in 1970 and breaks cross-stream alignment with any module that uses time.time().

Suggested change

-        t = ts if ts is not None else time.monotonic()
+        t = ts if ts is not None else time.time()
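
The difference between the two clocks can be sketched as follows. `stamp` is a hypothetical helper mirroring the fallback in `from_pcm`, not the PR's code: `time.time()` yields seconds since the Unix epoch and converts to a sensible date, while `time.monotonic()` has an undefined reference point and is only meaningful for measuring elapsed time.

```python
import time
from datetime import datetime, timezone


def stamp(ts=None):
    """Return the caller's timestamp, or wall-clock time if none is given."""
    return ts if ts is not None else time.time()


# Wall-clock stamps convert to a real calendar date.
wall = datetime.fromtimestamp(stamp(), tz=timezone.utc)

# An explicit timestamp is passed through unchanged.
explicit = stamp(12.5)

# monotonic() is the right tool for durations, not for message timestamps.
t0 = time.monotonic()
elapsed = time.monotonic() - t0
```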

Add demo_audio blueprint to module.py and regenerate all_blueprints.py so AudioModule is accessible via:

  dimos run demo-audio (blueprint)
  dimos run audio-module (standalone module)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

greptile-apps · 2026-06-17T08:33:39Z

+        (TextToSpeechModule, "text", "speech_text"),
+        (TextToSpeechModule, "audio", "tts_audio"),
+        (SpeakerModule, "audio", "tts_audio"),
+    ]
+)
+
+
+demo_audio = autoconnect(
+    AudioModule.blueprint(),


TextToSpeechModule publishes frames with time.monotonic() timestamps

Every chunk published from _worker_loop uses time.monotonic() as its timestamp. time.monotonic() returns a system-relative counter (seconds since boot), not a Unix wall-clock time. Downstream consumers calling Timestamped.dt() will get dates in 1970, and cross-stream alignment with any module that uses time.time() (e.g., CameraModule) will fail. Replace with time.time() to match the rest of the stack.

Inserts AgentTextModule into the audio_speech_loopback pipeline so spoken input is routed through a LangChain chat LLM (default gpt-4o-mini) before being handed to TTS, turning the parrot loop into an actual spoken-command interface.

- AgentTextConfig: model, system_prompt, api_key (env fallback to OPENAI_API_KEY with startup warning), queue_max_texts, history_max_turns
- AgentTextModule: daemon thread + bounded queue (same pattern as TTS/STT), rolling conversation history trimmed to history_max_turns pairs
- Blueprint: STT.text → speech_text → AgentTextModule.text_in; AgentTextModule.text_out → agent_response → TTS.text
- README: module table, data flow diagram, Known Gaps, Next Steps updated

All changed lines marked # [AGENT-WIRE] for easy grep.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

greptile-apps · 2026-06-22T05:42:39Z

+    async def main(self) -> None:  # type: ignore[override]
+        self._queue = queue.Queue(maxsize=256)
+        self._running = threading.Event()
+        self._running.set()
+        self._thread = threading.Thread(target=self._worker_loop, daemon=True)
+        self._thread.start()
+        yield


_processor stale state survives stop/start cycle

FunVoiceEffectsModule._processor is a class-level (or persistent instance-level) _FunVoiceProcessor that is never reset or re-created in main(). On the first run, reconfigure() is triggered because self._processor.sample_rate is None. After a stop() + start() cycle at the same sample rate, self._processor.sample_rate already equals msg.sample_rate, so the if self._processor.sample_rate != msg.sample_rate guard on line 1785 never fires and reconfigure() is skipped. The processor then begins the new session with stale inbuf/outbuf residuals, the previous session's echo_buf contents, the ring modulator at a non-zero ring_phase, and gate_gain at whatever state it was left in — causing audible echo artifacts and corrupted audio from the moment the module restarts. The fix is to call self._processor = _FunVoiceProcessor() (or invoke self._processor.reconfigure(...)) at the top of main() before yield.
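
The stale-state problem and the proposed fix can be sketched with simplified stand-ins. `FunVoiceProcessorSketch` and `EffectsModuleSketch` are hypothetical classes that model only the sample-rate guard and two pieces of carried state; they are not the PR's `_FunVoiceProcessor`.

```python
class FunVoiceProcessorSketch:
    """Simplified stand-in: reconfigures only when the sample rate changes."""

    def __init__(self):
        self.sample_rate = None
        self.echo_buf = []
        self.ring_phase = 0.0

    def reconfigure(self, sample_rate):
        self.sample_rate = sample_rate
        self.echo_buf = []
        self.ring_phase = 0.0

    def process(self, sample_rate, samples):
        if self.sample_rate != sample_rate:  # same guard as described above
            self.reconfigure(sample_rate)
        self.echo_buf.extend(samples)
        self.ring_phase += len(samples)
        return samples


class EffectsModuleSketch:
    def __init__(self):
        self._processor = FunVoiceProcessorSketch()

    def start_session(self):
        # Fresh processor on every start, so no state leaks across restarts.
        self._processor = FunVoiceProcessorSketch()


# Without a reset: a second session at the same rate inherits old state.
stale = FunVoiceProcessorSketch()
stale.process(16000, [1, 2, 3])  # first session
stale.process(16000, [4])        # "restart" at the same rate, guard never fires
stale_len = len(stale.echo_buf)

# With a reset at session start: the second session begins clean.
module = EffectsModuleSketch()
module.start_session()
module._processor.process(16000, [1, 2, 3])
module.start_session()
module._processor.process(16000, [4])
fresh_buf = list(module._processor.echo_buf)
```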

greptile-apps · 2026-06-22T07:44:15Z

+            if provider == "pyttsx3":
+                self.tts_active.publish(True)
+                try:
+                    import pyttsx3  # type: ignore[import-not-found]
+                except Exception as e:
+                    logger.warning(f"TextToSpeechModule: missing pyttsx3 ({e}); dropping text")
+                    self.tts_active.publish(False)
+                    continue
+                try:
+                    engine = pyttsx3.init()
+                    engine.setProperty("rate", self.config.pyttsx3_rate_wpm)
+                    engine.setProperty("volume", self.config.pyttsx3_volume)
+                    engine.say(text)
+                    engine.runAndWait()
+                    self.spoken_text.publish(text)
+                except Exception:
+                    logger.exception("TextToSpeechModule pyttsx3 synthesis failed")
+                self.tts_active.publish(False)
+                continue


pyttsx3 path never publishes tts_reference_audio, leaving AEC blind

The pyttsx3 provider plays audio directly through the OS audio subsystem and then returns, but it never calls self.tts_reference_audio.publish(...). SpeechToTextModule._on_tts_reference_audio is the only way its AEC ring-buffer gets reference audio. Without it, _aec_ref_buf stays empty for the entire pyttsx3 utterance, so _apply_aec returns the microphone signal unmodified — any echo the mic picks up from the speakers will pass straight into the Whisper pipeline and be transcribed as a user utterance.

The tts_active guard will suppress STT during the synchronous engine.runAndWait() call, but it offers only a coarse time window (tts_guard_seconds, default 0.8 s) and does not account for room reverb that persists after playback ends. The macos-say and openai paths both populate tts_reference_audio correctly; the pyttsx3 path should do the same, using the raw PCM that pyttsx3 synthesises (or a zero-padded synthetic fill if that audio isn't capturable).

GuoZhuoRan requested review from leshy, mustafab0, paul-nechifor and spomichter as code owners June 16, 2026 12:47

greptile-apps Bot reviewed Jun 16, 2026

leshy reviewed Jun 16, 2026

greptile-apps Bot reviewed Jun 17, 2026

GuoZhuoRan and others added 2 commits June 17, 2026 14:01

feat(audio): register AudioModule in dimos CLI

459ab1f

feat(audio): add audio speech loopback pipeline

cbe8411

greptile-apps Bot reviewed Jun 17, 2026

GuoZhuoRan and others added 7 commits June 17, 2026 16:42

feat(audio): add fun voice effects chain

732e60c

fix(audio): add OpenAI TTS timeout + fail-fast

68858a4

fix: reduce audio loopback echo and STT backlog

49391ee

fix: use wall-clock audio timestamps and clean subscriptions

f36bb77

docs: add audio subsystem handoff

63b81ee

docs: add audio handoff contact email

9baa8df

greptile-apps Bot reviewed Jun 22, 2026

GuoZhuoRan added 2 commits June 22, 2026 14:00

docs: update audio handoff status

91ef6bd

docs: add audio startup environment notes

164991f

greptile-apps Bot reviewed Jun 22, 2026
