Mirror of https://github.com/Bijit-Mondal/VoiceAgent.git (synced 2026-03-02 18:36:39 +00:00)
feat: implement WebSocket server with VoiceAgent for real-time voice interaction

- Added a new WebSocket server implementation in `ws-server-2.ts` that utilizes the `VoiceAgent` for handling voice interactions.
- Integrated weather and time tools using the `ai` library for enhanced responses.
- Refactored existing `ws-server.ts` to streamline the connection handling and event logging.
- Enhanced `VoiceAgent` to support streaming speech generation with improved chunk handling and interruption capabilities.
- Introduced new event listeners for better logging and handling of speech-related events.
- Added graceful shutdown handling for the WebSocket server.

README.md: 54 changed lines
@@ -1,12 +1,14 @@
 # voice-agent-ai-sdk

-Minimal voice/text agent SDK built on AI SDK with optional WebSocket transport.
+Streaming voice/text agent SDK built on AI SDK with optional WebSocket transport.

 ## Current status

-- Text flow works via `sendText()` (no WebSocket required).
-- WebSocket flow works when `connect()` is used with a running WS endpoint.
-- Voice streaming is not implemented yet.
+- Streaming text generation is implemented via `streamText`.
+- Tool calling is supported in-stream.
+- Speech synthesis is implemented with chunked streaming TTS.
+- Audio transcription is supported (when `transcriptionModel` is configured).
+- WebSocket protocol events are emitted for stream, tool, and speech lifecycle.

 ## Prerequisites

@@ -25,13 +27,35 @@ Minimal voice/text agent SDK built on AI SDK with optional WebSocket transport.
 OPENAI_API_KEY=your_openai_api_key
 VOICE_WS_ENDPOINT=ws://localhost:8080

+`VOICE_WS_ENDPOINT` is optional for text-only usage.
+
+## VoiceAgent configuration
+
+The agent accepts:
+
+- `model` (required): chat model
+- `transcriptionModel` (optional): STT model
+- `speechModel` (optional): TTS model
+- `instructions` (optional): system prompt
+- `stopWhen` (optional): stopping condition
+- `tools` (optional): AI SDK tools map
+- `endpoint` (optional): WebSocket endpoint
+- `voice` (optional): TTS voice, default `alloy`
+- `speechInstructions` (optional): style instructions for TTS
+- `outputFormat` (optional): audio format, default `mp3`
+- `streamingSpeech` (optional):
+  - `minChunkSize`
+  - `maxChunkSize`
+  - `parallelGeneration`
+  - `maxParallelRequests`
+
 ## Run (text-only check)

-This validates model + tool calls without requiring WebSocket:
+This validates LLM + tool + streaming speech without requiring WebSocket:

 pnpm demo

-Expected logs include `text` events and optional `tool_start`.
+Expected logs include `text`, `chunk:text_delta`, tool events, and speech chunk events.

 ## Run (WebSocket check)

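The README lists `minChunkSize` and `maxChunkSize` under `streamingSpeech` without defining them. The sketch below shows one plausible reading, assuming both values are character counts, that a chunk is cut at a sentence boundary once it reaches `minChunkSize`, and that a cut is forced at `maxChunkSize`. The `StreamingSpeechOptions` interface and `splitForSpeech` function are hypothetical names for illustration and are not taken from the SDK.

```typescript
// Hypothetical subset of the `streamingSpeech` options named in the README.
// Treating the sizes as character counts is an assumption.
interface StreamingSpeechOptions {
  minChunkSize: number;
  maxChunkSize: number;
}

// Split text into TTS-sized chunks: cut at sentence punctuation once the
// chunk holds at least minChunkSize characters, and force a cut when the
// chunk reaches maxChunkSize characters.
function splitForSpeech(text: string, opts: StreamingSpeechOptions): string[] {
  const chunks: string[] = [];
  let current = "";

  // Push the trimmed buffer if it is non-empty, then reset it.
  const flush = (): void => {
    const trimmed = current.trim();
    if (trimmed.length > 0) {
      chunks.push(trimmed);
    }
    current = "";
  };

  for (const ch of text) {
    current += ch;
    const atSentenceEnd = ch === "." || ch === "!" || ch === "?";
    const longEnough = current.trim().length >= opts.minChunkSize;
    if ((atSentenceEnd && longEnough) || current.length >= opts.maxChunkSize) {
      flush();
    }
  }
  flush(); // emit whatever remains after the last boundary
  return chunks;
}
```

With `minChunkSize: 10` and `maxChunkSize: 100`, a short leading sentence such as `"Hi."` is merged into the following sentence instead of being synthesized on its own, which keeps very small TTS requests from being issued.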
@@ -45,7 +69,22 @@ Expected logs include `text` events and optional `tool_start`.

 The demo will:
 - run `sendText()` first (text-only sanity check), then
-- connect to `VOICE_WS_ENDPOINT` if provided.
+- connect to `VOICE_WS_ENDPOINT` if provided,
+- emit streaming protocol messages (`text_delta`, `tool_call`, `audio_chunk`, `response_complete`, etc.).
+
+## Browser voice client (HTML)
+
+A simple browser client is available at [example/voice-client.html](example/voice-client.html).
+
+What it does:
+- captures microphone speech using Web Speech API (speech-to-text)
+- sends transcript to the agent via WebSocket (`type: "transcript"`)
+- receives streaming `audio_chunk` messages and plays them in order
+
+How to use:
+1. Start your agent server/WebSocket endpoint.
+2. Open [example/voice-client.html](example/voice-client.html) in a browser (Chrome/Edge recommended).
+3. Connect to `ws://localhost:8080` (or your endpoint), then click **Start Mic**.

 ## Scripts

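The browser client is described as playing `audio_chunk` messages in order. If chunks can arrive out of order (for example when speech is generated in parallel), the client needs a small reorder buffer. The sketch below is one way to do that, assuming each message carries a zero-based sequence number. The `chunkId` and `data` fields and the `OrderedChunkBuffer` class are hypothetical; the actual message schema is not shown in this diff.

```typescript
// Hypothetical message shape: only `type: "audio_chunk"` appears in the README;
// `chunkId` (zero-based sequence number) and `data` are assumptions.
interface AudioChunkMessage {
  type: "audio_chunk";
  chunkId: number;
  data: string; // e.g. base64-encoded audio
}

// Holds chunks that arrived early and releases them in sequence order.
class OrderedChunkBuffer {
  private pending = new Map<number, AudioChunkMessage>();
  private nextId = 0;

  // Store the incoming chunk and return every chunk that is now playable,
  // in order. Returns an empty array while a gap remains.
  push(msg: AudioChunkMessage): AudioChunkMessage[] {
    this.pending.set(msg.chunkId, msg);
    const ready: AudioChunkMessage[] = [];
    let candidate = this.pending.get(this.nextId);
    while (candidate !== undefined) {
      ready.push(candidate);
      this.pending.delete(this.nextId);
      this.nextId += 1;
      candidate = this.pending.get(this.nextId);
    }
    return ready;
  }

  // Drop buffered chunks and restart numbering, e.g. for a new response.
  reset(): void {
    this.pending.clear();
    this.nextId = 0;
  }
}
```

A client would call `push()` from its WebSocket `message` handler and hand each returned chunk to its audio player, so playback order does not depend on arrival order.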
@@ -58,3 +97,4 @@ The demo will:

 - If `VOICE_WS_ENDPOINT` is empty, WebSocket connect is skipped.
 - The sample WS server sends a mock `transcript` message for end-to-end testing.
+- Streaming TTS uses chunk queueing and supports interruption (`interrupt`).
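The last added line mentions chunk queueing with interruption but gives no detail. A minimal sketch of that idea follows, assuming interruption means dropping all queued text and discarding any synthesis result that was requested before the interrupt. `SpeechChunkQueue` and its methods are hypothetical and do not reflect the actual `VoiceAgent` internals.

```typescript
// A queued text chunk stamped with the generation it was taken in.
interface QueuedChunk {
  text: string;
  generation: number;
}

// Hypothetical queue: interrupt() clears pending chunks and bumps a
// generation counter so in-flight results can be recognized as stale.
class SpeechChunkQueue {
  private queue: string[] = [];
  private generation = 0;

  // Add a text chunk awaiting synthesis.
  enqueue(text: string): void {
    this.queue.push(text);
  }

  // Take the next chunk to synthesize, or undefined if the queue is empty.
  take(): QueuedChunk | undefined {
    const text = this.queue.shift();
    if (text === undefined) {
      return undefined;
    }
    return { text, generation: this.generation };
  }

  // Drop everything still queued; returns how many chunks were dropped.
  interrupt(): number {
    const dropped = this.queue.length;
    this.queue = [];
    this.generation += 1;
    return dropped;
  }

  // True if the chunk was taken before the most recent interrupt.
  isStale(chunk: QueuedChunk): boolean {
    return chunk.generation !== this.generation;
  }
}
```

The generation counter covers the case where a TTS request is already in flight when the user interrupts: the caller checks `isStale()` when the audio comes back and skips sending it.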