feat: implement WebSocket server with VoiceAgent for real-time voice interaction

- Added a new WebSocket server implementation in `ws-server-2.ts` that utilizes the `VoiceAgent` for handling voice interactions.
- Integrated weather and time tools using the `ai` library for enhanced responses.
- Refactored existing `ws-server.ts` to streamline the connection handling and event logging.
- Enhanced `VoiceAgent` to support streaming speech generation with improved chunk handling and interruption capabilities.
- Introduced new event listeners for better logging and handling of speech-related events.
- Added graceful shutdown handling for the WebSocket server.
2026-03-02 18:36:39 +00:00 · 2026-02-13 17:16:12 +05:30
parent c1cd705d49
commit 6510232655
6 changed files with 1749 additions and 174 deletions
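
Reviewer note: the commit message mentions chunk handling for streaming speech, and the README diff below lists `streamingSpeech.minChunkSize` and `streamingSpeech.maxChunkSize`. The diff does not show how these bounds are applied, so the following is only a sketch of one plausible approach: buffer incoming text deltas, cut at a sentence end once at least `minChunkSize` characters are available, and force a split at whitespace when the buffer reaches `maxChunkSize`. The `SpeechChunker` class, its methods, and the exact splitting rules are hypothetical and are not taken from the repository.

```typescript
// Hypothetical helper, not the repository's implementation.
interface ChunkOptions {
  minChunkSize: number; // hold back sentence ends that come earlier than this
  maxChunkSize: number; // force a split once the buffer reaches this length (assumed > 0)
}

class SpeechChunker {
  private buffer = "";
  private readonly opts: ChunkOptions;

  constructor(opts: ChunkOptions) {
    this.opts = opts;
  }

  // Feed one text delta; returns zero or more chunks ready to send to TTS.
  push(delta: string): string[] {
    this.buffer += delta;
    const out: string[] = [];
    for (;;) {
      const cut = this.findCut();
      if (cut < 0) break;
      const chunk = this.buffer.slice(0, cut).trim();
      this.buffer = this.buffer.slice(cut);
      if (chunk.length > 0) out.push(chunk);
    }
    return out;
  }

  // Call when the text stream ends to release whatever is left.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }

  // Returns the index to cut at, or -1 if more text is needed.
  private findCut(): number {
    const { minChunkSize, maxChunkSize } = this.opts;
    // A sentence end is punctuation followed by whitespace, so a trailing
    // "." at the end of the buffer is not treated as final yet.
    const sentenceEnd = /[.!?]\s/g;
    let m: RegExpExecArray | null;
    while ((m = sentenceEnd.exec(this.buffer)) !== null) {
      const end = m.index + 1; // keep the punctuation in the chunk
      if (end > maxChunkSize) break;
      if (end >= minChunkSize) return end;
    }
    if (this.buffer.length >= maxChunkSize) {
      // No usable sentence end: split at the last space, else hard-split.
      const space = this.buffer.lastIndexOf(" ", maxChunkSize);
      return space > 0 ? space : maxChunkSize;
    }
    return -1;
  }
}

const chunker = new SpeechChunker({ minChunkSize: 10, maxChunkSize: 40 });
console.log(chunker.push("Hello. "));                  // value: [] ("Hello." is shorter than minChunkSize)
console.log(chunker.push("How are you today? I am")); // value: ["Hello. How are you today?"]
console.log(chunker.flush());                          // value: ["I am"]
```

Under these assumptions, a larger `minChunkSize` trades latency for fewer TTS requests, while `maxChunkSize` bounds how long a run of unpunctuated text can delay the first audio.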
--- a/README.md
+++ b/README.md
@@ -1,12 +1,14 @@
 # voice-agent-ai-sdk

-Minimal voice/text agent SDK built on AI SDK with optional WebSocket transport.
+Streaming voice/text agent SDK built on AI SDK with optional WebSocket transport.

 ## Current status

-- Text flow works via `sendText()` (no WebSocket required).
-- WebSocket flow works when `connect()` is used with a running WS endpoint.
-- Voice streaming is not implemented yet.
+- Streaming text generation is implemented via `streamText`.
+- Tool calling is supported in-stream.
+- Speech synthesis is implemented with chunked streaming TTS.
+- Audio transcription is supported (when `transcriptionModel` is configured).
+- WebSocket protocol events are emitted for stream, tool, and speech lifecycle.

 ## Prerequisites

@@ -25,13 +27,35 @@ Minimal voice/text agent SDK built on AI SDK with optional WebSocket transport.
   OPENAI_API_KEY=your_openai_api_key
   VOICE_WS_ENDPOINT=ws://localhost:8080

+`VOICE_WS_ENDPOINT` is optional for text-only usage.
+
+## VoiceAgent configuration
+
+The agent accepts:
+
+- `model` (required): chat model
+- `transcriptionModel` (optional): STT model
+- `speechModel` (optional): TTS model
+- `instructions` (optional): system prompt
+- `stopWhen` (optional): stopping condition
+- `tools` (optional): AI SDK tools map
+- `endpoint` (optional): WebSocket endpoint
+- `voice` (optional): TTS voice, default `alloy`
+- `speechInstructions` (optional): style instructions for TTS
+- `outputFormat` (optional): audio format, default `mp3`
+- `streamingSpeech` (optional):
+   - `minChunkSize`
+   - `maxChunkSize`
+   - `parallelGeneration`
+   - `maxParallelRequests`
+
 ## Run (text-only check)

-This validates model + tool calls without requiring WebSocket:
+This validates LLM + tool + streaming speech without requiring WebSocket:

 pnpm demo

-Expected logs include `text` events and optional `tool_start`.
+Expected logs include `text`, `chunk:text_delta`, tool events, and speech chunk events.

 ## Run (WebSocket check)

@@ -45,7 +69,22 @@ Expected logs include `text` events and optional `tool_start`.

 The demo will:
 - run `sendText()` first (text-only sanity check), then
-- connect to `VOICE_WS_ENDPOINT` if provided.
+- connect to `VOICE_WS_ENDPOINT` if provided,
+- emit streaming protocol messages (`text_delta`, `tool_call`, `audio_chunk`, `response_complete`, etc.).
+
+## Browser voice client (HTML)
+
+A simple browser client is available at [example/voice-client.html](example/voice-client.html).
+
+What it does:
+- captures microphone speech using Web Speech API (speech-to-text)
+- sends transcript to the agent via WebSocket (`type: "transcript"`)
+- receives streaming `audio_chunk` messages and plays them in order
+
+How to use:
+1. Start your agent server/WebSocket endpoint.
+2. Open [example/voice-client.html](example/voice-client.html) in a browser (Chrome/Edge recommended).
+3. Connect to `ws://localhost:8080` (or your endpoint), then click **Start Mic**.

 ## Scripts

@@ -58,3 +97,4 @@ The demo will:

 - If `VOICE_WS_ENDPOINT` is empty, WebSocket connect is skipped.
 - The sample WS server sends a mock `transcript` message for end-to-end testing.
+- Streaming TTS uses chunk queueing and supports interruption (`interrupt`).
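
Reviewer note: the last added line says streaming TTS uses chunk queueing and supports interruption, and the configuration list includes `parallelGeneration`. When several TTS requests run in parallel they can finish out of order, so something has to release audio in the original order and discard results that arrive after an interrupt. The sketch below shows one way this could work, using a reorder buffer plus a generation counter. `OrderedChunkBuffer` and its methods are hypothetical names for illustration; the diff does not show how `VoiceAgent` actually implements this.

```typescript
// Hypothetical sketch, not the repository's implementation.
class OrderedChunkBuffer<T> {
  private ready = new Map<number, T>();
  private nextToEmit = 0;
  private generation = 0;

  // Tag each TTS request with the generation that was current when it started.
  currentGeneration(): number {
    return this.generation;
  }

  // Record a finished chunk. Returns the chunks that can now be released,
  // in index order. Results from before the last interrupt are dropped.
  complete(generation: number, index: number, value: T): T[] {
    if (generation !== this.generation) return [];
    this.ready.set(index, value);
    const out: T[] = [];
    while (this.ready.has(this.nextToEmit)) {
      out.push(this.ready.get(this.nextToEmit) as T);
      this.ready.delete(this.nextToEmit);
      this.nextToEmit++;
    }
    return out;
  }

  // Drop everything queued and invalidate in-flight requests.
  interrupt(): void {
    this.generation++;
    this.ready.clear();
    this.nextToEmit = 0;
  }
}

const audioBuffer = new OrderedChunkBuffer<string>();
const gen = audioBuffer.currentGeneration();
console.log(audioBuffer.complete(gen, 1, "chunk-1")); // value: [] (chunk 0 not ready yet)
console.log(audioBuffer.complete(gen, 0, "chunk-0")); // value: ["chunk-0", "chunk-1"]
audioBuffer.interrupt();
console.log(audioBuffer.complete(gen, 2, "chunk-2")); // value: [] (stale after interrupt)
```

The generation counter avoids having to cancel in-flight requests explicitly: a late result simply fails the generation check and is ignored. A real implementation would likely also abort the underlying requests to save cost.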