Mirror of https://github.com/Bijit-Mondal/VoiceAgent.git (synced 2026-03-02).
# voice-agent-ai-sdk
Streaming voice/text agent SDK built on AI SDK with optional WebSocket transport.
## Current status

- Streaming text generation is implemented via `streamText`.
- Tool calling is supported in-stream.
- Speech synthesis is implemented with chunked streaming TTS.
- Audio transcription is supported (when `transcriptionModel` is configured).
- WebSocket protocol events are emitted for stream, tool, and speech lifecycle.
## Prerequisites

- Node.js 20+
- pnpm
- OpenAI API key
## Setup

1. Install dependencies:

   ```sh
   pnpm install
   ```

2. Configure environment variables in `.env`:

   ```
   OPENAI_API_KEY=your_openai_api_key
   VOICE_WS_ENDPOINT=ws://localhost:8080
   ```

`VOICE_WS_ENDPOINT` is optional for text-only usage.
## VoiceAgent configuration

The agent accepts:

- `model` (required): chat model
- `transcriptionModel` (optional): STT model
- `speechModel` (optional): TTS model
- `instructions` (optional): system prompt
- `stopWhen` (optional): stopping condition
- `tools` (optional): AI SDK tools map
- `endpoint` (optional): WebSocket endpoint
- `voice` (optional): TTS voice, default `alloy`
- `speechInstructions` (optional): style instructions for TTS
- `outputFormat` (optional): audio format, default `mp3`
- `streamingSpeech` (optional): `minChunkSize`, `maxChunkSize`, `parallelGeneration`, `maxParallelRequests`
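As a rough sketch, a configuration using these options might look like the following. The option names match the list above; the import path, model IDs, and exact constructor shape are assumptions for illustration, not verified against the repo:

```ts
// Hypothetical setup sketch — option names from this README; the
// VoiceAgent import path and model choices are assumptions.
import { openai } from "@ai-sdk/openai";
import { VoiceAgent } from "voice-agent-ai-sdk";

const agent = new VoiceAgent({
  model: openai("gpt-4o-mini"),                          // required: chat model
  transcriptionModel: openai.transcription("whisper-1"), // optional: STT
  speechModel: openai.speech("tts-1"),                   // optional: TTS
  instructions: "You are a helpful voice assistant.",    // system prompt
  endpoint: process.env.VOICE_WS_ENDPOINT,               // optional: WebSocket endpoint
  voice: "alloy",
  outputFormat: "mp3",
  streamingSpeech: {
    minChunkSize: 40,
    maxChunkSize: 200,
    parallelGeneration: true,
    maxParallelRequests: 3,
  },
});
```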
## Run (text-only check)

This validates LLM + tool calling + streaming speech without requiring a WebSocket:

```sh
pnpm demo
```

Expected logs include `text`, `chunk:text_delta`, tool events, and speech chunk events.
## Run (WebSocket check)

1. Start the local WS server:

   ```sh
   pnpm ws:server
   ```

2. In another terminal, run the demo:

   ```sh
   pnpm demo
   ```

The demo will:

- run `sendText()` first (text-only sanity check), then
- connect to `VOICE_WS_ENDPOINT` if provided,
- emit streaming protocol messages (`text_delta`, `tool_call`, `audio_chunk`, `response_complete`, etc.).
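The protocol message names above can be modeled as a discriminated union on the client. Only the `type` values come from this README; the payload fields shown here are assumptions for illustration:

```typescript
// Hypothetical message shapes — `type` values are from the docs above,
// payload fields are assumed for illustration.
type ServerMessage =
  | { type: "text_delta"; delta: string }
  | { type: "tool_call"; toolName: string; args: unknown }
  | { type: "audio_chunk"; data: string; index: number } // e.g. base64 audio
  | { type: "response_complete" };

// A minimal dispatcher a client might run on each parsed socket payload.
function describe(msg: ServerMessage): string {
  switch (msg.type) {
    case "text_delta":
      return `text: ${msg.delta}`;
    case "tool_call":
      return `tool: ${msg.toolName}`;
    case "audio_chunk":
      return `audio chunk #${msg.index}`;
    case "response_complete":
      return "done";
  }
}

console.log(describe({ type: "text_delta", delta: "Hi" })); // text: Hi
```

Exhaustive `switch` on the `type` field makes the compiler flag any protocol message a client forgets to handle.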
## Browser voice client (HTML)

A simple browser client is available at `example/voice-client.html`.

What it does:

- captures microphone speech using the Web Speech API (speech-to-text)
- sends the transcript to the agent via WebSocket (`type: "transcript"`)
- receives streaming `audio_chunk` messages and plays them in order
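Playing chunks "in order" requires handling out-of-order arrival. A minimal sketch of the idea (assumed behavior, not the repo's actual code): buffer chunks by index and release each contiguous run as it becomes complete.

```typescript
// Reorder buffer sketch: holds out-of-order chunks and releases a
// contiguous run, in index order, once the next expected index arrives.
class ChunkOrderer<T> {
  private next = 0;
  private pending = new Map<number, T>();

  // Returns the chunks that are now ready to play, in index order.
  push(index: number, chunk: T): T[] {
    this.pending.set(index, chunk);
    const ready: T[] = [];
    while (this.pending.has(this.next)) {
      ready.push(this.pending.get(this.next)!);
      this.pending.delete(this.next);
      this.next += 1;
    }
    return ready;
  }
}

const q = new ChunkOrderer<string>();
console.log(q.push(1, "b")); // [] — still waiting for chunk 0
console.log(q.push(0, "a")); // [ 'a', 'b' ]
```

Each released chunk can then be decoded and queued onto the page's audio pipeline.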
How to use:

1. Start your agent server/WebSocket endpoint.
2. Open `example/voice-client.html` in a browser (Chrome/Edge recommended).
3. Connect to `ws://localhost:8080` (or your endpoint), then click Start Mic.
## Scripts

- `pnpm build` – build TypeScript
- `pnpm dev` – watch TypeScript
- `pnpm demo` – run the demo client
- `pnpm ws:server` – run the local test WebSocket server
## Notes

- If `VOICE_WS_ENDPOINT` is empty, the WebSocket connection is skipped.
- The sample WS server sends a mock `transcript` message for end-to-end testing.
- Streaming TTS uses chunk queueing and supports interruption (`interrupt`).
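To illustrate what chunk queueing with `minChunkSize`/`maxChunkSize` might mean in practice, here is a rough sketch (assumed behavior, not the library's actual implementation) of splitting streamed text into TTS-sized chunks, preferring sentence boundaries:

```typescript
// Split text into TTS chunks: emit once a chunk reaches minChunkSize
// and ends a sentence, and never let a chunk exceed maxChunkSize.
function chunkForTTS(text: string, minChunkSize: number, maxChunkSize: number): string[] {
  const chunks: string[] = [];
  let buf = "";
  for (const ch of text) {
    buf += ch;
    const atSentenceEnd = ".!?".includes(ch);
    if ((buf.length >= minChunkSize && atSentenceEnd) || buf.length >= maxChunkSize) {
      chunks.push(buf);
      buf = "";
    }
  }
  if (buf) chunks.push(buf); // flush the remainder
  return chunks;
}

console.log(chunkForTTS("Hello there. How are you today? I am fine.", 10, 40));
```

Small chunks cut time-to-first-audio; an upper bound keeps each TTS request short enough that an `interrupt` can take effect quickly.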