State shape#
```typescript
interface VoiceSessionState {
  status: "idle" | "connecting" | "connected" | "error";
  error: string | null;
  isMuted: boolean;
  phase: "idle" | "listening" | "speaking" | "pause" | "thinking";
  userSpeaking: boolean;
  agentSpeaking: boolean;
  duration: number; // seconds since connected
  messages: TranscriptMessage[];
  idleWarning: number | null; // seconds until idle timeout (null = no warning)
}
```

| Field | Meaning |
|---|---|
| status | The connection lifecycle: idle → connecting → connected → (idle on disconnect, or error). |
| error | Populated when status === "error". Always check this when handling errors. |
| isMuted | Mic state. Mirrors setMuted() / toggleMute(). |
| phase | What the conversation is doing right now (see below). |
| userSpeaking | true between speech.started and speech.ended events. Use for live waveform UIs. |
| agentSpeaking | true while TTS is playing. |
| duration | Seconds since status became connected. Updates every second. |
| messages | Full transcript of user and bot turns. See Transcript messages below. |
| idleWarning | When the server emits session.idle_warning, this holds the seconds remaining until timeout. null when no warning is active. |
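As a rough illustration of how these fields combine, the sketch below turns a state snapshot into a one-line status label. `describeSession` and `SessionSummary` are hypothetical names introduced for this example and are not part of the library; only the field names come from the interface above.

```typescript
// Hypothetical subset of VoiceSessionState, repeated so the sketch is self-contained.
type SessionSummary = {
  status: "idle" | "connecting" | "connected" | "error";
  error: string | null;
  isMuted: boolean;
  duration: number; // seconds since connected
  idleWarning: number | null;
};

// Illustrative helper (not a library API): map a snapshot to a status label.
function describeSession(s: SessionSummary): string {
  if (s.status === "error") {
    // error is expected to be populated here, but guard against null anyway.
    return "Error: " + (s.error !== null ? s.error : "unknown");
  }
  if (s.status === "connecting") return "Connecting...";
  if (s.status === "idle") return "Not connected";

  // status === "connected": format duration as m:ss.
  const total = Math.floor(s.duration);
  const minutes = Math.floor(total / 60);
  const seconds = total % 60;
  const parts: string[] = [
    "Connected " + minutes + ":" + (seconds < 10 ? "0" + seconds : String(seconds)),
  ];
  if (s.isMuted) parts.push("muted");
  if (s.idleWarning !== null) parts.push("idle timeout in " + s.idleWarning + "s");
  return parts.join(", ");
}
```

Checking `status` first keeps the `error` branch separate from the connected-only fields (`duration`, `idleWarning`), which are only meaningful during a call.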
Call phases#
phase tells you what the conversation is doing right now. It's the field you'll bind to UI state most often (orb color, animation, status label).
| Phase | Meaning | Triggered by |
|---|---|---|
| idle | Not in a call | Initial state, after disconnect |
| listening | Mic is hot, waiting for speech | Connection established; after bot finishes; after turn.resumed |
| speaking | Agent is speaking (TTS playing) | First bot.word event |
| thinking | Processing user input, waiting for LLM | user.message (STT final), turn.end |
| pause | Turn detection pause, user may still be talking | turn.pause (brief silence detected) |
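One way to read the table above is as a small transition function. The sketch below is only an interpretation of the "Triggered by" column, not the library's actual implementation. The event names `connected` and `disconnected` are placeholders assumed here for the rows that do not name a concrete event.

```typescript
type Phase = "idle" | "listening" | "speaking" | "pause" | "thinking";

// Event names come from the table and the bot lifecycle, except
// "connected" and "disconnected", which are assumptions of this sketch.
type PhaseEvent =
  | "connected"
  | "disconnected"
  | "bot.word"
  | "bot.finished"
  | "user.message"
  | "turn.end"
  | "turn.pause"
  | "turn.resumed";

// Illustrative reducer: returns the phase that follows a given event.
function nextPhase(current: Phase, event: PhaseEvent): Phase {
  switch (event) {
    case "connected":
    case "bot.finished":
    case "turn.resumed":
      return "listening";
    case "bot.word":
      // Only the first bot.word changes the phase; later ones are no-ops.
      return "speaking";
    case "user.message":
    case "turn.end":
      return "thinking";
    case "turn.pause":
      return "pause";
    case "disconnected":
      return "idle";
    default:
      return current;
  }
}
```

Replaying `connected`, `turn.pause`, `turn.resumed`, `user.message`, `bot.word`, `bot.finished` through this function walks through listening, pause, listening, thinking, speaking, and back to listening.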
Typical flow during one exchange:
```
listening ──► (user speaks) ──► thinking ──► speaking ──► (bot finishes) ──► listening
    ▲
    │  (turn.pause / turn.resumed cycles)
    ▼
  pause
```

Transcript messages#
The messages array contains the full conversation history. Each message is structured:
```typescript
interface TranscriptMessage {
  id: number;
  role: "user" | "bot";
  text: string;
  isInterim?: boolean; // user only: STT is still processing
  speaking?: boolean; // bot only: TTS is playing this message
  interrupted?: boolean; // bot only: user barged in
  messageId?: string; // bot only: server-assigned ID
}
```

Messages mutate in place as STT refines, words stream in, and the bot finishes speaking; they are not replaced. If you bind to messages reactively, the existing entry updates rather than a new one appearing.
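Because entries are updated rather than replaced, a renderer can key its rows by `id` and redraw only the rows whose content changed. The sketch below assumes that `id` stays stable for the lifetime of a message. `RenderedRows`, `rowSignature`, and `syncRows` are hypothetical helpers introduced for illustration, not library APIs.

```typescript
interface TranscriptMessage {
  id: number;
  role: "user" | "bot";
  text: string;
  isInterim?: boolean;
  speaking?: boolean;
  interrupted?: boolean;
  messageId?: string;
}

// Hypothetical render cache: message id -> signature of what was last drawn.
type RenderedRows = { [id: number]: string };

// Summarize everything that affects how a row looks.
function rowSignature(m: TranscriptMessage): string {
  return [m.text, !!m.isInterim, !!m.speaking, !!m.interrupted].join("|");
}

// Returns the ids whose rows need redrawing and records their new signature.
function syncRows(rows: RenderedRows, messages: TranscriptMessage[]): number[] {
  const changed: number[] = [];
  for (let i = 0; i < messages.length; i++) {
    const m = messages[i];
    const sig = rowSignature(m);
    if (rows[m.id] !== sig) {
      rows[m.id] = sig;
      changed.push(m.id);
    }
  }
  return changed;
}
```

Calling `syncRows` from a state subscriber would then touch one row per STT refinement or streamed word instead of rebuilding the whole transcript.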
User message lifecycle#
```
user.speaking → { role: "user", text: "Hola", isInterim: true }
                  text updates as STT refines...
user.speaking → { role: "user", text: "Hola que", isInterim: true }
user.message  → { role: "user", text: "Hola, ¿qué tal?", isInterim: false }
```

If you're rendering a transcript, render isInterim: true messages with reduced opacity or a "typing" indicator so the user can see that STT is still processing.
Bot message lifecycle (word-by-word)#
```
bot.speaking → { role: "bot", text: "", speaking: true, messageId: "abc" }
bot.word     → text: "Hello"
bot.word     → text: "Hello there"
bot.word     → text: "Hello there how"
bot.word     → text: "Hello there how are"
bot.word     → text: "Hello there how are you"
bot.finished → { speaking: false, text: "Hello there, how are you?" }
```

bot.speaking arrives with the full intended text, but the widget intentionally starts with text: "" and builds it word-by-word so the on-screen captions stay in sync with the audio.

bot.finished may include polished final text, with punctuation that the per-word stream lacks.
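The lifecycle above can be replayed with a small fold over events. The event payload shapes in `BotEvent` are assumptions made for this sketch (in particular, that each `bot.word` carries a single word); the real payloads are defined by the DataChannel protocol. `applyBotEvent` is a hypothetical helper, not a library function.

```typescript
// Assumed event shapes, for illustration only.
type BotEvent =
  | { type: "bot.speaking"; messageId: string; text: string }
  | { type: "bot.word"; word: string }
  | { type: "bot.finished"; text?: string };

interface BotMessage {
  role: "bot";
  text: string;
  speaking: boolean;
  messageId: string;
}

// Applies one event to the current bot message, mutating it in place
// once it exists, and returns the message being tracked.
function applyBotEvent(msg: BotMessage | null, ev: BotEvent): BotMessage | null {
  if (ev.type === "bot.speaking") {
    // Start empty even though the event carries the full intended text,
    // so captions can be built up in step with the audio.
    return { role: "bot", text: "", speaking: true, messageId: ev.messageId };
  }
  if (msg === null) return null; // nothing to update yet
  if (ev.type === "bot.word") {
    msg.text = msg.text === "" ? ev.word : msg.text + " " + ev.word;
  } else {
    // bot.finished: stop speaking and adopt the polished text if provided.
    msg.speaking = false;
    if (ev.text !== undefined) msg.text = ev.text;
  }
  return msg;
}
```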
Interrupted bot#
When the user barges in mid-utterance:
```
bot.word        → text: "Hello there how"
bot.interrupted → { speaking: false, interrupted: true }
```

Render interrupted messages with a visual marker (e.g. a ⚡ icon, an ellipsis, or a gray border) so users see that the bot was cut off rather than just suddenly stopping.
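To pull the rendering advice for interim and interrupted messages together, here is a minimal plain-text formatter. It is only a sketch: `formatTranscriptLine`, the speaker labels, and the marker strings are arbitrary choices for this example, and a real UI would more likely set CSS classes than append text.

```typescript
interface TranscriptMessage {
  id: number;
  role: "user" | "bot";
  text: string;
  isInterim?: boolean;
  speaking?: boolean;
  interrupted?: boolean;
  messageId?: string;
}

// Illustrative formatter: one transcript entry -> one line of text.
function formatTranscriptLine(m: TranscriptMessage): string {
  const speaker = m.role === "user" ? "You" : "Agent";
  let text = m.text;
  if (m.role === "user" && m.isInterim) {
    text += " ..."; // STT is still refining this message
  }
  if (m.role === "bot" && m.interrupted) {
    text += " [interrupted]"; // user barged in mid-utterance
  }
  return speaker + ": " + text;
}
```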
Subscribing to changes#
The state object is stable by identity — getState() returns the same reference until something changes. This is what makes it safe for React's useSyncExternalStore:
```typescript
session.subscribe(() => {
  const next = session.getState(); // new reference only if state changed
  // ...
});
```

For more targeted updates, subscribe to specific events:
```typescript
session.addEventListener("phase", (e) => {
  // only fires when phase actually changes
  document.body.dataset.phase = e.detail.phase;
});
```

Driving UI from phase and agentSpeaking#
A common pattern: bind your "orb" or status visual to phase for the overall mode, and use agentSpeaking for a faster-reacting animation layer.
```typescript
const orb = document.getElementById("orb");

session.subscribe(() => {
  const { phase, agentSpeaking, idleWarning } = session.getState();
  orb.dataset.phase = phase; // CSS handles per-phase styling
  orb.classList.toggle("speaking", agentSpeaking);
  orb.classList.toggle("idle-warning", idleWarning !== null);
});
```

The voice-widget package follows exactly this pattern; see its theming guide for the full set of CSS classes.
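If you want to unit-test this mapping without a DOM, one option is to compute the visual attributes in a pure function and apply them in the subscriber. The sketch below assumes nothing about the library beyond the three state fields used above. `OrbInputs`, `OrbView`, and `orbView` are hypothetical names.

```typescript
type Phase = "idle" | "listening" | "speaking" | "pause" | "thinking";

// The slice of session state that the orb depends on.
interface OrbInputs {
  phase: Phase;
  agentSpeaking: boolean;
  idleWarning: number | null;
}

// What the DOM layer should apply: a data-phase value and a class list.
interface OrbView {
  phase: Phase;
  classes: string[];
}

// Pure mapping from state to view, with no DOM access.
function orbView(s: OrbInputs): OrbView {
  const classes: string[] = [];
  if (s.agentSpeaking) classes.push("speaking");
  if (s.idleWarning !== null) classes.push("idle-warning");
  return { phase: s.phase, classes: classes };
}
```

A subscriber would then call `orbView(session.getState())` and copy `phase` into `dataset.phase` and `classes` onto the element, keeping the DOM writes in one place.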
What's next#
- DataChannel protocol: the raw events that drive state changes
- VoiceSession class: methods and constructor
