
feat(audio): add AudioModule for issue #1932 #2507

Open
GuoZhuoRan wants to merge 13 commits into dimensionalOS:main from GuoZhuoRan:feat/audio-module-1932

@GuoZhuoRan

Adds mic audio capture and chunked publishing as AudioStamped on an Out stream, mirroring CameraModule. Validated on macOS Apple Silicon at 50 Hz / 20 ms frames with both synthetic (sine tone) and real mic sources.

  • dimos/msgs/audio_msgs/AudioStamped.py: Python overlay wrapping foxglove_msgs.RawAudio for LCM encode/decode, with from_pcm() and to_numpy() helpers. Flags that builtin_interfaces.Time (not std_msgs.Header) is the wire type, so frame_id is not preserved.
  • dimos/hardware/sensors/audio/module.py: AudioModule(Module) with AudioConfig(ModuleConfig), async def main() lifecycle, @rpc start/stop, @skill record_clip.
  • examples/audio/validate_audio_module.py: LCM round-trip assert + live stream rate/timestamp validation.
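
For reference, the 50 Hz / 20 ms figures above can be checked with a little arithmetic. This is a minimal sketch, not the validation script itself: the helper names are hypothetical, and the 16 kHz mono S16LE values in the usage note are assumptions chosen for illustration.

```python
from typing import List


def frame_bytes(sample_rate: int, frame_ms: int, channels: int = 1,
                bytes_per_sample: int = 2) -> int:
    """Bytes in one interleaved PCM frame lasting frame_ms milliseconds."""
    samples_per_channel = sample_rate * frame_ms // 1000
    return samples_per_channel * channels * bytes_per_sample


def mean_rate_hz(timestamps: List[float]) -> float:
    """Average publish rate implied by increasing timestamps (in seconds)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    span = timestamps[-1] - timestamps[0]
    if span <= 0:
        raise ValueError("timestamps must be increasing")
    # N timestamps bound N - 1 intervals.
    return (len(timestamps) - 1) / span
```

With an assumed 16 kHz mono S16LE stream, `frame_bytes(16000, 20)` gives 640 bytes per frame, and four timestamps spaced 20 ms apart give a mean rate of about 50 Hz.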

Problem

Closes DIM-XXX

Solution

How to Test

Contributor License Agreement

  • I have read and approved the CLA.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Contributor

Greptile Summary

This PR introduces a complete voice I/O subsystem for dimos: AudioModule (mic capture → AudioStamped), SpeakerModule (PCM playback), SpeechToTextModule (VAD + AEC + Whisper), AgentTextModule (LLM routing), TextToSpeechModule (three provider backends), and FunVoiceEffectsModule (pitch/echo effects), wired together in the audio-speech-loopback and demo-audio blueprints.

  • AudioStamped wraps foxglove_msgs.RawAudio for LCM transport, with from_pcm() / to_numpy() / lcm_encode() / lcm_decode() helpers.
  • The pyttsx3 TTS path plays audio directly through the OS but never publishes tts_reference_audio, leaving SpeechToTextModule's AEC without reference data for echo suppression.
  • _stream_lock in SpeakerModule (line 255) and _state_lock in SpeechToTextModule (line 466) are declared as plain class attributes rather than PrivateAttr, making them shared across all instances of their respective classes.

Confidence Score: 4/5

Mergeable with caution — the pyttsx3 AEC gap is a functional regression for that provider, and the class-level Lock declarations could cause cross-instance interference in tests or concurrent pipelines.

The pyttsx3 TTS path never publishes tts_reference_audio, so SpeechToTextModule's acoustic echo canceller runs without any reference signal when that provider is active, causing echo to pass through to Whisper transcription. Separately, _stream_lock (SpeakerModule) and _state_lock (SpeechToTextModule) are class-level attributes in Pydantic models, meaning all instances of each class share the same Lock — in any multi-instance scenario the locks serialise unrelated instances against each other and can block teardown.

dimos/hardware/sensors/audio/module.py — the pyttsx3 provider block (around line 1240) and the Lock declarations for SpeakerModule (line 255) and SpeechToTextModule (line 466).
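
The shared-lock concern can be illustrated with plain Python classes. This is a sketch under assumptions: the class names are hypothetical, and how the project's Module base class actually treats underscore attributes is not shown in this thread.

```python
import threading


class SharedLockModule:
    # Evaluated once at class-definition time, so every instance
    # resolves this attribute to the same Lock object.
    _state_lock = threading.Lock()


class PerInstanceLockModule:
    def __init__(self) -> None:
        # Created per instance, so instances never contend with each other.
        self._state_lock = threading.Lock()


a, b = SharedLockModule(), SharedLockModule()
c, d = PerInstanceLockModule(), PerInstanceLockModule()

shared_is_same = a._state_lock is b._state_lock
per_instance_is_same = c._state_lock is d._state_lock

with a._state_lock:
    # While `a` holds the shared lock, `b` cannot take it.
    b_acquired = b._state_lock.acquire(blocking=False)

with c._state_lock:
    # `d` has its own lock and is unaffected by `c`.
    d_acquired = d._state_lock.acquire(blocking=False)
    if d_acquired:
        d._state_lock.release()
```

In a Pydantic model, the per-instance form would typically be written as a `PrivateAttr` with a `default_factory`, which is what the review suggests.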

Important Files Changed

| Filename | Overview |
|---|---|
| dimos/hardware/sensors/audio/module.py | Adds six audio modules (Capture, Speaker, STT, AgentText, TTS, FunVoiceEffects) and two blueprints; pyttsx3 TTS path omits tts_reference_audio publish, and class-level Lock declarations are shared across instances |
| dimos/msgs/audio_msgs/AudioStamped.py | New AudioStamped message type wrapping foxglove_msgs.RawAudio; fallback timestamp in from_pcm still uses time.monotonic() but this was already flagged in prior review threads |
| examples/audio/validate_audio_module.py | LCM round-trip + live stream validation script; explicitly passes time.monotonic() as ts in the round-trip test (line 60), inconsistent with module's time.time() timestamps |
| dimos/robot/all_blueprints.py | Auto-generated registry updated to include audio-speech-loopback and demo-audio blueprints; straightforward addition |
| dimos/hardware/sensors/audio/__init__.py | Empty init file for the new audio package; no issues |
| dimos/msgs/audio_msgs/__init__.py | Empty init file for audio_msgs package; no issues |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Mic as AudioModule (Mic)
    participant STT as SpeechToTextModule
    participant Agent as AgentTextModule
    participant TTS as TextToSpeechModule
    participant FX as FunVoiceEffectsModule
    participant Speaker as SpeakerModule

    Mic->>STT: AudioStamped (mic_audio)
    STT->>Agent: str (speech_text)
    Agent->>TTS: str (agent_response)
    TTS->>STT: bool (tts_active_signal)
    TTS->>STT: str (recent_tts_text)
    TTS->>STT: AudioStamped (tts_reference_audio) — missing for pyttsx3
    TTS->>FX: AudioStamped (tts_audio_raw)
    Speaker->>STT: bool (speaker_playing_signal)
    FX->>Speaker: AudioStamped (tts_audio)

Reviews (13): Last reviewed commit: "docs: add audio startup environment note..."

Comment thread dimos/hardware/sensors/audio/module.py
Comment thread dimos/hardware/sensors/audio/module.py
Comment thread dimos/hardware/sensors/audio/module.py
Comment on lines +183 to +205
@skill
def record_clip(self, seconds: float = 1.0) -> bytes:
    """Record and return a clip of raw PCM audio.

    Collects frames from the live audio stream for `seconds` seconds and
    returns them concatenated as raw S16LE PCM bytes.
    """
    import threading

    buf: list[bytes] = []
    done = threading.Event()
    collected = [0.0]

    def on_frame(msg: AudioStamped) -> None:
        buf.append(msg.data)
        collected[0] += self.config.frame_ms / 1000.0
        if collected[0] >= seconds:
            done.set()

    unsub = self.audio.subscribe(on_frame)
    done.wait(timeout=seconds + 2.0)
    unsub()
    return b"".join(buf)

Contributor

P2 record_clip silently returns empty bytes if the module is not running

If record_clip is called before start() or after stop(), no frames will ever arrive, done.wait will time out after seconds + 2.0 seconds, and the method returns b"" with no error or log message. Callers have no way to distinguish a successful empty recording from a misconfigured call. At minimum, a log warning on timeout (or a raised exception) would surface the problem.
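
One way to surface the timeout is to check the return value of `Event.wait` and raise. The sketch below is an illustration only: `FakeStream` is a hypothetical stand-in for the module's Out stream, the callback receives raw bytes rather than an AudioStamped, and raising `TimeoutError` (instead of logging a warning) is an assumed policy.

```python
import threading
from typing import Callable, List


class FakeStream:
    """Hypothetical stand-in for the module's audio Out stream."""

    def __init__(self) -> None:
        self._subs: List[Callable[[bytes], None]] = []

    def subscribe(self, cb: Callable[[bytes], None]) -> Callable[[], None]:
        self._subs.append(cb)
        return lambda: self._subs.remove(cb)

    def publish(self, data: bytes) -> None:
        for cb in list(self._subs):
            cb(data)


def record_clip(stream: FakeStream, seconds: float, frame_ms: int = 20,
                grace: float = 2.0) -> bytes:
    buf: List[bytes] = []
    done = threading.Event()
    collected = 0.0

    def on_frame(data: bytes) -> None:
        nonlocal collected
        buf.append(data)
        collected += frame_ms / 1000.0
        if collected >= seconds:
            done.set()

    unsub = stream.subscribe(on_frame)
    try:
        # wait() returns False when the timeout expires before done is set.
        finished = done.wait(timeout=seconds + grace)
    finally:
        unsub()
    if not finished:
        raise TimeoutError(
            f"recorded {collected:.2f}s of {seconds:.2f}s; is the module running?"
        )
    return b"".join(buf)
```

A caller can then distinguish "no frames arrived" from a genuine recording, and the `finally` block guarantees the subscription is released on either path.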

Comment on lines +163 to +168
def __repr__(self) -> str:
    n_samples = len(self.data) // (2 if "16" in self.sample_format else 4)
    return (
        f"AudioStamped(rate={self.sample_rate}, ch={self.channels}, "
        f"fmt={self.sample_format}, samples={n_samples}, ts={self.ts:.6f})"
    )

Contributor

P2 The n_samples heuristic does not divide by self.channels, so for multi-channel audio the repr reports total interleaved samples (e.g. 640 for 20 ms of stereo 16 kHz) rather than samples per channel (320). The existing byte-width check ("16" in self.sample_format) also silently falls back to 4 bytes/sample for any unknown format string, which could produce a nonsensical count.

Suggested change
- def __repr__(self) -> str:
-     n_samples = len(self.data) // (2 if "16" in self.sample_format else 4)
-     return (
-         f"AudioStamped(rate={self.sample_rate}, ch={self.channels}, "
-         f"fmt={self.sample_format}, samples={n_samples}, ts={self.ts:.6f})"
-     )
+ def __repr__(self) -> str:
+     bytes_per_sample = 2 if "16" in self.sample_format else 4
+     n_frames = len(self.data) // (bytes_per_sample * self.channels)
+     return (
+         f"AudioStamped(rate={self.sample_rate}, ch={self.channels}, "
+         f"fmt={self.sample_format}, frames={n_frames}, ts={self.ts:.6f})"
+     )
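
The suggestion above still falls back to 4 bytes per sample for unknown formats. A stricter variant could look like the sketch below, assuming an explicit width table; apart from S16LE, the format names in the table are hypothetical, and the helper is not part of the PR.

```python
# Assumed mapping from format string to bytes per sample.
# Only "S16LE" appears in this PR; the other entries are illustrative.
_BYTES_PER_SAMPLE = {"S16LE": 2, "S32LE": 4, "F32LE": 4}


def frames_per_channel(data: bytes, sample_format: str, channels: int) -> int:
    """Number of sample frames (samples per channel) in interleaved PCM."""
    if channels < 1:
        raise ValueError(f"channels must be >= 1, got {channels}")
    try:
        width = _BYTES_PER_SAMPLE[sample_format]
    except KeyError:
        # Fail loudly instead of guessing a byte width.
        raise ValueError(f"unknown sample format: {sample_format!r}") from None
    return len(data) // (width * channels)
```

For 20 ms of stereo 16 kHz S16LE audio (1280 bytes), this returns 320 frames, whereas dividing by the byte width alone would report 640 interleaved samples.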

audio metadata. Serialises to/from foxglove_msgs.RawAudio on the wire.
"""

msg_name = "foxglove_msgs.RawAudio" # wire type used for LCM

Member

we don't use foxglove, where does this come from?

@GuoZhuoRan GuoZhuoRan Jun 17, 2026

Author

Foxglove is not a new dependency; it's already mirrored into dimos_lcm, and RawAudio is the only audio type in there, so I reused it. Left a note that it's a stand-in pending a native Header-bearing type.

Comment thread dimos/msgs/audio_msgs/AudioStamped.py Outdated
def lcm_encode(self) -> bytes:
"""Encode to foxglove_msgs.RawAudio wire bytes.

NOTE: frame_id and seq from self.header are NOT preserved (the wire

Member

ros2 header has no seq

why not preserve frame_id?

@GuoZhuoRan GuoZhuoRan Jun 17, 2026

Author

My comment was wrong; I will fix it. frame_id does exist on the header, but RawAudio carries only a timestamp, so there is no frame_id field on the wire type to hold it. Preserving frame_id would mean adding a header-bearing audio type to dimos-lcm; we can discuss that today.

- Remove all mentions of `seq` (ROS2 std_msgs/Header has no seq field)
- Reword frame_id note: dropped because RawAudio has no frame_id field
  on the wire, not by design choice
- Mark foxglove_msgs.RawAudio as a temporary stand-in pending team
  decision on a native Header-bearing LCM type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    ts: float | None = None,
) -> AudioStamped:
    """Construct from raw PCM bytes."""
    t = ts if ts is not None else time.monotonic()

Contributor

P1 The from_pcm factory's fallback timestamp uses time.monotonic(), which returns a counter with an undefined reference point (often seconds since boot) rather than Unix wall-clock time. Any caller that omits the ts argument, including external consumers of this public API, will create an AudioStamped whose ts is a small uptime-like value rather than a current Unix timestamp (~1.7 × 10⁹). This makes Timestamped.dt() return a date close to 1970 and breaks cross-stream alignment with any module that uses time.time().

Suggested change
- t = ts if ts is not None else time.monotonic()
+ t = ts if ts is not None else time.time()
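
The difference is easy to see in isolation. A minimal sketch, assuming a hypothetical `resolve_timestamp` helper that mirrors the fallback logic in from_pcm:

```python
import time
from datetime import datetime, timezone
from typing import Optional


def resolve_timestamp(ts: Optional[float] = None) -> float:
    """Return ts when given, otherwise the current Unix wall-clock time."""
    # `is not None` keeps an explicit ts of 0.0 instead of replacing it.
    return ts if ts is not None else time.time()


wall = resolve_timestamp()
# Wall-clock seconds convert to a sensible calendar date.
wall_year = datetime.fromtimestamp(wall, tz=timezone.utc).year

# time.monotonic() is only meaningful as a difference between two calls.
t0 = time.monotonic()
elapsed = time.monotonic() - t0
```

`time.monotonic()` remains the right tool for measuring durations such as `elapsed`; it is only unsuitable as an absolute timestamp shared across modules.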

GuoZhuoRan and others added 2 commits June 17, 2026 14:01
Add demo_audio blueprint to module.py and regenerate all_blueprints.py
so AudioModule is accessible via:
  dimos run demo-audio          (blueprint)
  dimos run audio-module        (standalone module)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread dimos/hardware/sensors/audio/module.py Outdated
Comment on lines +706 to +714
        (TextToSpeechModule, "text", "speech_text"),
        (TextToSpeechModule, "audio", "tts_audio"),
        (SpeakerModule, "audio", "tts_audio"),
    ]
)


demo_audio = autoconnect(
    AudioModule.blueprint(),

Contributor

P1 TextToSpeechModule publishes frames with time.monotonic() timestamps

Every chunk published from _worker_loop uses time.monotonic() as its timestamp. time.monotonic() returns a system-relative counter (seconds since boot), not a Unix wall-clock time. Downstream consumers calling Timestamped.dt() will get dates in 1970, and cross-stream alignment with any module that uses time.time() (e.g., CameraModule) will fail. Replace with time.time() to match the rest of the stack.
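
Beyond swapping the clock, one possible approach for a synthesised utterance is to read the wall clock once and derive each chunk's timestamp from the number of sample frames already emitted. This goes slightly further than the review comment asks and is only a sketch; `chunk_timestamps` is a hypothetical helper, not code from the PR.

```python
from typing import List


def chunk_timestamps(start_wall: float, chunk_frames: List[int],
                     sample_rate: int) -> List[float]:
    """Wall-clock timestamp of the first sample of each chunk.

    start_wall is a time.time() reading taken when the utterance begins;
    chunk_frames lists the per-channel frame count of each chunk in order.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    stamps: List[float] = []
    emitted = 0
    for n in chunk_frames:
        stamps.append(start_wall + emitted / sample_rate)
        emitted += n
    return stamps
```

Deriving timestamps from the sample count keeps chunk spacing exact even when the worker thread is scheduled unevenly, while the anchor keeps them in the Unix epoch.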

GuoZhuoRan and others added 7 commits June 17, 2026 16:42
Inserts AgentTextModule into the audio_speech_loopback pipeline so spoken
input is routed through a LangChain chat LLM (default gpt-4o-mini) before
being handed to TTS, turning the parrot loop into an actual spoken-command
interface.

- AgentTextConfig: model, system_prompt, api_key (env fallback to
  OPENAI_API_KEY with startup warning), queue_max_texts, history_max_turns
- AgentTextModule: daemon thread + bounded queue (same pattern as TTS/STT),
  rolling conversation history trimmed to history_max_turns pairs
- Blueprint: STT.text → speech_text → AgentTextModule.text_in;
  AgentTextModule.text_out → agent_response → TTS.text
- README: module table, data flow diagram, Known Gaps, Next Steps updated

All changed lines marked # [AGENT-WIRE] for easy grep.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
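
The bounded queue and rolling history described in this commit can be sketched as follows. This is an illustration under assumptions, not the AgentTextModule code: the class and function names are hypothetical, the LLM call and worker thread are omitted, and dropping the newest text when the queue is full is an assumed policy.

```python
import queue
from collections import deque
from typing import Deque, List, Tuple


class RollingHistory:
    """Keeps only the most recent max_turns user/assistant pairs."""

    def __init__(self, max_turns: int) -> None:
        # Each turn is two messages, so cap the deque at 2 * max_turns.
        self._msgs: Deque[Tuple[str, str]] = deque(maxlen=2 * max_turns)

    def add_turn(self, user: str, assistant: str) -> None:
        self._msgs.append(("user", user))
        self._msgs.append(("assistant", assistant))

    def messages(self) -> List[Tuple[str, str]]:
        return list(self._msgs)


def offer(q: "queue.Queue[str]", text: str) -> bool:
    """Enqueue without blocking; report False when the queue is full."""
    try:
        q.put_nowait(text)
        return True
    except queue.Full:
        return False
```

Because the deque has a fixed `maxlen`, the oldest pair falls off automatically when a new turn is added, so no explicit trimming step is needed.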
@greptile-apps

greptile-apps Bot commented Jun 22, 2026


Comment on lines +1706 to +1712
async def main(self) -> None:  # type: ignore[override]
    self._queue = queue.Queue(maxsize=256)
    self._running = threading.Event()
    self._running.set()
    self._thread = threading.Thread(target=self._worker_loop, daemon=True)
    self._thread.start()
    yield

Contributor

P1 _processor stale state survives stop/start cycle

FunVoiceEffectsModule._processor is a class-level (or persistent instance-level) _FunVoiceProcessor that is never reset or re-created in main(). On the first run, reconfigure() is triggered because self._processor.sample_rate is None. After a stop() + start() cycle at the same sample rate, self._processor.sample_rate already equals msg.sample_rate, so the if self._processor.sample_rate != msg.sample_rate guard on line 1785 never fires and reconfigure() is skipped. The processor then begins the new session with stale inbuf/outbuf residuals, the previous session's echo_buf contents, the ring modulator at a non-zero ring_phase, and gate_gain at whatever state it was left in — causing audible echo artifacts and corrupted audio from the moment the module restarts. The fix is to call self._processor = _FunVoiceProcessor() (or invoke self._processor.reconfigure(...)) at the top of main() before yield.
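
The stale-state path and the proposed fix can be reproduced with a toy model. In this sketch `Processor` and `EffectsModule` are hypothetical stand-ins for _FunVoiceProcessor and FunVoiceEffectsModule, and a plain list stands in for the echo buffer.

```python
from typing import List, Optional


class Processor:
    """Stand-in for per-session DSP state (hypothetical)."""

    def __init__(self) -> None:
        self.sample_rate: Optional[int] = None
        self.echo_buf: List[int] = []

    def reconfigure(self, sample_rate: int) -> None:
        self.sample_rate = sample_rate
        self.echo_buf = []

    def process(self, sample_rate: int, samples: List[int]) -> int:
        # Same guard as in the review: only reconfigure on a rate change.
        if self.sample_rate != sample_rate:
            self.reconfigure(sample_rate)
        self.echo_buf.extend(samples)
        return len(self.echo_buf)


class EffectsModule:
    def __init__(self) -> None:
        self._processor = Processor()

    def start(self) -> None:
        # Fresh processor per session, so no residual buffers carry over.
        self._processor = Processor()

    def on_audio(self, sample_rate: int, samples: List[int]) -> int:
        return self._processor.process(sample_rate, samples)
```

Without the reset in `start()`, a second session at the same sample rate skips `reconfigure()` and keeps appending to the previous session's buffer.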

Comment on lines +1240 to +1258
if provider == "pyttsx3":
    self.tts_active.publish(True)
    try:
        import pyttsx3  # type: ignore[import-not-found]
    except Exception as e:
        logger.warning(f"TextToSpeechModule: missing pyttsx3 ({e}); dropping text")
        self.tts_active.publish(False)
        continue
    try:
        engine = pyttsx3.init()
        engine.setProperty("rate", self.config.pyttsx3_rate_wpm)
        engine.setProperty("volume", self.config.pyttsx3_volume)
        engine.say(text)
        engine.runAndWait()
        self.spoken_text.publish(text)
    except Exception:
        logger.exception("TextToSpeechModule pyttsx3 synthesis failed")
    self.tts_active.publish(False)
    continue

Contributor

P1 pyttsx3 path never publishes tts_reference_audio, leaving AEC blind

The pyttsx3 provider plays audio directly through the OS audio subsystem and then returns, but it never calls self.tts_reference_audio.publish(...). SpeechToTextModule._on_tts_reference_audio is the only way its AEC ring-buffer gets reference audio. Without it, _aec_ref_buf stays empty for the entire pyttsx3 utterance, so _apply_aec returns the microphone signal unmodified — any echo the mic picks up from the speakers will pass straight into the Whisper pipeline and be transcribed as a user utterance.

The tts_active guard will suppress STT during the synchronous engine.runAndWait() call, but it offers only a coarse time window (tts_guard_seconds, default 0.8 s) and does not account for room reverb that persists after playback ends. The macos-say and openai paths both populate tts_reference_audio correctly; the pyttsx3 path should do the same, using the raw PCM that pyttsx3 synthesises (or a zero-padded synthetic fill if that audio isn't capturable).
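
The "zero-padded synthetic fill" fallback could be sketched as below. The helper name, the 16 kHz mono S16LE defaults, and the idea of sizing the fill from a measured playback duration are all assumptions for illustration, not code from the PR.

```python
import math
from typing import List


def silent_reference_frames(duration_s: float, sample_rate: int = 16000,
                            channels: int = 1,
                            frame_ms: int = 20) -> List[bytes]:
    """Zero-filled S16LE frames covering at least duration_s seconds."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    samples_per_channel = sample_rate * frame_ms // 1000
    frame_len = samples_per_channel * channels * 2  # 2 bytes per S16 sample
    # Round up so a partial trailing frame is still covered.
    n_frames = math.ceil(duration_s * 1000 / frame_ms)
    return [bytes(frame_len) for _ in range(n_frames)]
```

Note that a silent reference gives an echo canceller no signal to subtract; it only keeps the reference timeline populated. Capturing the real synthesised PCM, as the other two providers do, would be needed for actual cancellation.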


2 participants