feat: implement WebSocket server with VoiceAgent for real-time voice interaction

- Added a new WebSocket server in `ws-server-2.ts` that uses `VoiceAgent` to handle voice interactions.
- Integrated weather and time tools built with the `ai` library.
- Refactored the existing `ws-server.ts` to streamline connection handling and event logging.
- Enhanced `VoiceAgent` to support streaming speech generation, with improved chunk handling and support for interruption.
- Introduced new event listeners for logging and handling speech-related events.
- Added graceful shutdown handling for the WebSocket server.
Bijit Mondal
2026-02-13 17:16:12 +05:30
parent c1cd705d49
commit 6510232655
6 changed files with 1749 additions and 174 deletions

@@ -1,12 +1,14 @@
# voice-agent-ai-sdk
-Minimal voice/text agent SDK built on AI SDK with optional WebSocket transport.
+Streaming voice/text agent SDK built on AI SDK with optional WebSocket transport.
## Current status
-- Text flow works via `sendText()` (no WebSocket required).
-- WebSocket flow works when `connect()` is used with a running WS endpoint.
-- Voice streaming is not implemented yet.
+- Streaming text generation is implemented via `streamText`.
+- Tool calling is supported in-stream.
+- Speech synthesis is implemented with chunked streaming TTS.
+- Audio transcription is supported (when `transcriptionModel` is configured).
+- WebSocket protocol events are emitted for stream, tool, and speech lifecycle.
## Prerequisites
@@ -25,13 +27,35 @@ Minimal voice/text agent SDK built on AI SDK with optional WebSocket transport.
OPENAI_API_KEY=your_openai_api_key
VOICE_WS_ENDPOINT=ws://localhost:8080
`VOICE_WS_ENDPOINT` is optional for text-only usage.
## VoiceAgent configuration
The agent accepts:
- `model` (required): chat model
- `transcriptionModel` (optional): STT model
- `speechModel` (optional): TTS model
- `instructions` (optional): system prompt
- `stopWhen` (optional): stopping condition
- `tools` (optional): AI SDK tools map
- `endpoint` (optional): WebSocket endpoint
- `voice` (optional): TTS voice, default `alloy`
- `speechInstructions` (optional): style instructions for TTS
- `outputFormat` (optional): audio format, default `mp3`
- `streamingSpeech` (optional):
- `minChunkSize`
- `maxChunkSize`
- `parallelGeneration`
- `maxParallelRequests`
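
As a rough illustration of the options listed above, the following sketch models them as a TypeScript interface and applies the two documented defaults (`alloy` and `mp3`). The option names come from the list; the field types, the `withDefaults` helper, the placeholder model value, and the numeric chunk sizes are assumptions made for this example and may not match the SDK's actual definitions.

```typescript
// Hypothetical shapes: option names follow the README list, types are assumed.
interface StreamingSpeechOptions {
  minChunkSize?: number;
  maxChunkSize?: number;
  parallelGeneration?: boolean;
  maxParallelRequests?: number;
}

interface VoiceAgentOptions {
  model: unknown; // required: chat model (an AI SDK model instance in practice)
  transcriptionModel?: unknown; // STT model
  speechModel?: unknown; // TTS model
  instructions?: string; // system prompt
  stopWhen?: unknown; // stopping condition
  tools?: Record<string, unknown>; // AI SDK tools map
  endpoint?: string; // WebSocket endpoint
  voice?: string; // TTS voice
  speechInstructions?: string; // style instructions for TTS
  outputFormat?: string; // audio format
  streamingSpeech?: StreamingSpeechOptions;
}

// Hypothetical helper: fills in the two defaults the README documents.
function withDefaults(
  options: VoiceAgentOptions
): VoiceAgentOptions & { voice: string; outputFormat: string } {
  return {
    ...options,
    voice: options.voice ?? "alloy",
    outputFormat: options.outputFormat ?? "mp3",
  };
}

// Illustrative values only; the chunk sizes are not recommendations.
const resolved = withDefaults({
  model: "placeholder-model",
  instructions: "You are a helpful voice assistant.",
  endpoint: "ws://localhost:8080",
  streamingSpeech: {
    minChunkSize: 20,
    maxChunkSize: 200,
    parallelGeneration: true,
    maxParallelRequests: 2,
  },
});

console.log(resolved.voice, resolved.outputFormat);
```

Only `model` is required, so a text-only setup could omit everything else, including `endpoint`.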
## Run (text-only check)
-This validates model + tool calls without requiring WebSocket:
+This validates LLM + tool + streaming speech without requiring WebSocket:
pnpm demo
-Expected logs include `text` events and optional `tool_start`.
+Expected logs include `text`, `chunk:text_delta`, tool events, and speech chunk events.
## Run (WebSocket check)
@@ -45,7 +69,22 @@ Expected logs include `text` events and optional `tool_start`.
The demo will:
- run `sendText()` first (text-only sanity check), then
-- connect to `VOICE_WS_ENDPOINT` if provided.
+- connect to `VOICE_WS_ENDPOINT` if provided,
+- emit streaming protocol messages (`text_delta`, `tool_call`, `audio_chunk`, `response_complete`, etc.).
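
A client consuming the streaming protocol messages named above could model them as a discriminated union. This is a minimal sketch: the four `type` values come from the list, but every payload field (`text`, `toolName`, `data`) and both helper functions are hypothetical, since the wire format is not documented here.

```typescript
// Message type names come from the README; every payload field is hypothetical.
type ServerMessage =
  | { type: "text_delta"; text: string }
  | { type: "tool_call"; toolName: string }
  | { type: "audio_chunk"; data: string }
  | { type: "response_complete" };

const KNOWN_TYPES = ["text_delta", "tool_call", "audio_chunk", "response_complete"];

// Parses a raw frame. Only the `type` tag is checked; payload fields are not validated.
function parseServerMessage(raw: string): ServerMessage | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const type = (parsed as { type?: unknown }).type;
  if (typeof type !== "string" || KNOWN_TYPES.indexOf(type) === -1) return null;
  return parsed as ServerMessage;
}

// Produces a one-line log summary for each message kind.
function summarize(message: ServerMessage): string {
  switch (message.type) {
    case "text_delta":
      return `text: ${message.text}`;
    case "tool_call":
      return `tool: ${message.toolName}`;
    case "audio_chunk":
      return `audio: ${message.data.length} encoded chars`;
    case "response_complete":
      return "done";
  }
}

const frame = parseServerMessage('{"type":"text_delta","text":"Hello"}');
if (frame !== null) console.log(summarize(frame));
```

Unknown or malformed frames return `null` rather than throwing, so a message type hidden behind the "etc." above would be skipped instead of breaking the client.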
## Browser voice client (HTML)
A simple browser client is available at [example/voice-client.html](example/voice-client.html).
What it does:
- captures microphone speech using Web Speech API (speech-to-text)
- sends transcript to the agent via WebSocket (`type: "transcript"`)
- receives streaming `audio_chunk` messages and plays them in order
How to use:
1. Start your agent server/WebSocket endpoint.
2. Open [example/voice-client.html](example/voice-client.html) in a browser (Chrome/Edge recommended).
3. Connect to `ws://localhost:8080` (or your endpoint), then click **Start Mic**.
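
The client behaviour described above (send a `transcript` message, play `audio_chunk` messages in order) can be sketched without any browser APIs. The example below assumes each chunk carries a sequence number so that out-of-order arrivals can be buffered. The `index` and `data` fields, the `text` field of the transcript message, and both helper names are hypothetical, so `example/voice-client.html` may work differently.

```typescript
// Hypothetical chunk shape: `index` is an assumed sequence number, `data` the encoded audio.
interface AudioChunk {
  index: number;
  data: string;
}

// Buffers chunks and releases them strictly in index order, starting at 0.
class OrderedChunkBuffer {
  private pending = new Map<number, AudioChunk>();
  private nextIndex = 0;

  // Stores a chunk and returns every chunk that is now playable, in order.
  push(chunk: AudioChunk): AudioChunk[] {
    this.pending.set(chunk.index, chunk);
    const ready: AudioChunk[] = [];
    let next = this.pending.get(this.nextIndex);
    while (next !== undefined) {
      ready.push(next);
      this.pending.delete(this.nextIndex);
      this.nextIndex += 1;
      next = this.pending.get(this.nextIndex);
    }
    return ready;
  }
}

// Builds the outgoing message; `type: "transcript"` is from the README, `text` is assumed.
function transcriptMessage(text: string): string {
  return JSON.stringify({ type: "transcript", text });
}

const buffer = new OrderedChunkBuffer();
console.log(buffer.push({ index: 1, data: "second" }).length); // chunk 0 has not arrived yet
console.log(buffer.push({ index: 0, data: "first" }).map((c) => c.data));
console.log(transcriptMessage("what time is it?"));
```

In a real page, the chunks returned by `push` would be decoded and handed to an audio element or the Web Audio API one after another.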
## Scripts
@@ -58,3 +97,4 @@ The demo will:
- If `VOICE_WS_ENDPOINT` is empty, WebSocket connect is skipped.
- The sample WS server sends a mock `transcript` message for end-to-end testing.
- Streaming TTS uses chunk queueing and supports interruption (`interrupt`).
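
To make the chunk queueing and interruption note concrete, here is a minimal sketch of one way such a queue could work. It is not the SDK's implementation: the sentence-based splitting rule, the `splitIntoChunks` function, and the `SpeechChunkQueue` class are all assumptions, and only the `minChunkSize` / `maxChunkSize` names and the idea of interruption come from this README.

```typescript
// Assumed rule: accumulate sentences until the chunk reaches minChunkSize,
// and hard-split anything longer than maxChunkSize.
function splitIntoChunks(text: string, minChunkSize: number, maxChunkSize: number): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    current += sentence;
    while (current.length > maxChunkSize) {
      chunks.push(current.slice(0, maxChunkSize));
      current = current.slice(maxChunkSize);
    }
    if (current.trim().length >= minChunkSize) {
      chunks.push(current.trim());
      current = "";
    }
  }
  // The trailing remainder may be shorter than minChunkSize.
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}

// Hypothetical FIFO queue of text chunks waiting to be synthesized.
class SpeechChunkQueue {
  private pending: string[] = [];

  enqueue(chunk: string): void {
    this.pending.push(chunk);
  }

  // Returns the next chunk to synthesize, or undefined when the queue is empty.
  next(): string | undefined {
    return this.pending.shift();
  }

  // Drops everything still waiting and reports how many chunks were discarded.
  interrupt(): number {
    const dropped = this.pending.length;
    this.pending = [];
    return dropped;
  }
}

const queue = new SpeechChunkQueue();
for (const chunk of splitIntoChunks("Hello there. How are you today? Fine.", 10, 40)) {
  queue.enqueue(chunk);
}
console.log(queue.next()); // first chunk is spoken
console.log(queue.interrupt()); // the user interrupts; remaining chunks are dropped
```

Discarding only the pending chunks means whatever is already playing still has to be stopped separately by the audio layer.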