
voice-agent-ai-sdk

Streaming voice/text agent SDK built on AI SDK with optional WebSocket transport.

Current status

  • Streaming text generation is implemented via streamText.
  • Tool calling is supported in-stream.
  • Speech synthesis is implemented with chunked streaming TTS.
  • Audio transcription is supported (when transcriptionModel is configured).
  • WebSocket protocol events are emitted for stream, tool, and speech lifecycle.

Prerequisites

  • Node.js 20+
  • pnpm
  • OpenAI API key

Setup

  1. Install dependencies:

    pnpm install

  2. Configure environment variables in .env:

    OPENAI_API_KEY=your_openai_api_key
    VOICE_WS_ENDPOINT=ws://localhost:8080

VOICE_WS_ENDPOINT is optional for text-only usage.

VoiceAgent configuration

The agent accepts:

  • model (required): chat model
  • transcriptionModel (optional): STT model
  • speechModel (optional): TTS model
  • instructions (optional): system prompt
  • stopWhen (optional): stopping condition
  • tools (optional): AI SDK tools map
  • endpoint (optional): WebSocket endpoint
  • voice (optional): TTS voice, default alloy
  • speechInstructions (optional): style instructions for TTS
  • outputFormat (optional): audio format, default mp3
  • streamingSpeech (optional):
    • minChunkSize
    • maxChunkSize
    • parallelGeneration
    • maxParallelRequests
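
As a rough sketch of that option shape, the interface below mirrors the fields listed above and applies the documented defaults (voice `alloy`, output format `mp3`). The interface name and the `withDefaults` helper are illustrative, not the SDK's actual types:

```typescript
// Illustrative mirror of the VoiceAgent options described above.
// Field names and defaults come from this README; everything else
// (types, the helper itself) is a hypothetical sketch.
interface VoiceAgentOptions {
  model: string;                 // required: chat model
  transcriptionModel?: string;   // optional: STT model
  speechModel?: string;          // optional: TTS model
  instructions?: string;         // system prompt
  endpoint?: string;             // WebSocket endpoint
  voice?: string;                // TTS voice, defaults to "alloy"
  speechInstructions?: string;   // style instructions for TTS
  outputFormat?: string;         // audio format, defaults to "mp3"
  streamingSpeech?: {
    minChunkSize?: number;
    maxChunkSize?: number;
    parallelGeneration?: boolean;
    maxParallelRequests?: number;
  };
}

// Fill in the documented defaults when the caller omits them.
function withDefaults(opts: VoiceAgentOptions): VoiceAgentOptions {
  return { voice: "alloy", outputFormat: "mp3", ...opts };
}
```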

Run (text-only check)

This validates the LLM, tool calling, and streaming speech without requiring a WebSocket connection:

pnpm demo

Expected logs include text, chunk:text_delta, tool events, and speech chunk events.

Run (WebSocket check)

  1. Start local WS server:

    pnpm ws:server

  2. In another terminal, run demo:

    pnpm demo

The demo will:

  • run sendText() first (text-only sanity check), then
  • connect to VOICE_WS_ENDPOINT if provided,
  • emit streaming protocol messages (text_delta, tool_call, audio_chunk, response_complete, etc.).
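
A client can consume those protocol messages with a small dispatcher like the one below. The payload field names (`delta`, `data`, etc.) are assumptions inferred from the event names above, not a documented schema; check ws-server.ts for the actual shapes:

```typescript
// Minimal dispatcher for the streaming protocol messages named above.
// Message shapes are assumptions, not the server's documented schema.
type ProtocolMessage =
  | { type: "text_delta"; delta: string }
  | { type: "tool_call"; toolName: string }
  | { type: "audio_chunk"; data: string }      // e.g. base64-encoded audio
  | { type: "response_complete" };

function handleMessage(raw: string, out: { text: string; done: boolean }) {
  const msg = JSON.parse(raw) as ProtocolMessage;
  switch (msg.type) {
    case "text_delta":
      out.text += msg.delta;   // accumulate streamed text
      break;
    case "response_complete":
      out.done = true;         // stream finished
      break;
    // tool_call / audio_chunk handling omitted for brevity
  }
}
```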

Browser voice client (HTML)

A simple browser client is available at example/voice-client.html.

What it does:

  • captures microphone speech using the Web Speech API (speech-to-text)
  • sends transcript to the agent via WebSocket (type: "transcript")
  • receives streaming audio_chunk messages and plays them in order
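
One way to "play chunks in order" is to buffer out-of-order chunks by index until the next expected one arrives. The sketch below assumes a sequence number on each chunk; the shipped voice-client.html may instead simply rely on WebSocket message ordering:

```typescript
// Reorders audio chunks by sequence number so playback stays in order
// even if chunks are handled out of order. Hypothetical helper, not
// necessarily what example/voice-client.html does internally.
class ChunkQueue {
  private next = 0;
  private pending = new Map<number, string>();

  // Buffer one chunk; return every chunk that is now ready, in order.
  push(seq: number, data: string): string[] {
    this.pending.set(seq, data);
    const ready: string[] = [];
    while (this.pending.has(this.next)) {
      ready.push(this.pending.get(this.next)!);
      this.pending.delete(this.next);
      this.next++;
    }
    return ready;
  }
}
```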

How to use:

  1. Start your agent server/WebSocket endpoint.
  2. Open example/voice-client.html in a browser (Chrome/Edge recommended).
  3. Connect to ws://localhost:8080 (or your endpoint), then click Start Mic.

Scripts

  • pnpm build: build TypeScript
  • pnpm dev: watch TypeScript
  • pnpm demo: run demo client
  • pnpm ws:server: run local test WebSocket server

Notes

  • If VOICE_WS_ENDPOINT is empty, WebSocket connect is skipped.
  • The sample WS server sends a mock transcript message for end-to-end testing.
  • Streaming TTS uses chunk queueing and supports interruption via interrupt().
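
The minChunkSize/maxChunkSize options suggest the agent splits streamed text into bounded chunks before synthesis. A hedged sketch of that splitting, preferring sentence boundaries once the minimum size is reached (this is not the SDK's actual algorithm, just an illustration of the idea):

```typescript
// Splits text into TTS-sized chunks: cut at a sentence boundary once
// minChunkSize is reached, or hard-split at maxChunkSize. Illustrative
// only; the real chunker lives inside VoiceAgent.
function chunkForTTS(text: string, minChunkSize = 20, maxChunkSize = 100): string[] {
  const chunks: string[] = [];
  let buf = "";
  for (const ch of text) {
    buf += ch;
    const atSentenceEnd = /[.!?]/.test(ch);
    if ((buf.length >= minChunkSize && atSentenceEnd) || buf.length >= maxChunkSize) {
      chunks.push(buf.trim());
      buf = "";
    }
  }
  if (buf.trim()) chunks.push(buf.trim());
  return chunks;
}
```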