# voice-agent-ai-sdk Streaming voice/text agent SDK built on AI SDK with optional WebSocket transport. ## Current status - Streaming text generation is implemented via `streamText`. - Tool calling is supported in-stream. - Speech synthesis is implemented with chunked streaming TTS. - Audio transcription is supported (when `transcriptionModel` is configured). - WebSocket protocol events are emitted for stream, tool, and speech lifecycle. ## Prerequisites - Node.js 20+ - pnpm - OpenAI API key ## Setup 1. Install dependencies: pnpm install 2. Configure environment variables in `.env`: OPENAI_API_KEY=your_openai_api_key VOICE_WS_ENDPOINT=ws://localhost:8080 `VOICE_WS_ENDPOINT` is optional for text-only usage. ## VoiceAgent usage (as in the demo) Minimal end-to-end example using AI SDK tools, streaming text, and streaming TTS: ```ts import "dotenv/config"; import { VoiceAgent } from "./src"; import { tool } from "ai"; import { z } from "zod"; import { openai } from "@ai-sdk/openai"; const weatherTool = tool({ description: "Get the weather in a location", inputSchema: z.object({ location: z.string() }), execute: async ({ location }) => ({ location, temperature: 72, conditions: "sunny" }), }); const agent = new VoiceAgent({ model: openai("gpt-4o"), transcriptionModel: openai.transcription("whisper-1"), speechModel: openai.speech("gpt-4o-mini-tts"), instructions: "You are a helpful voice assistant.", voice: "alloy", speechInstructions: "Speak in a friendly, natural conversational tone.", outputFormat: "mp3", streamingSpeech: { minChunkSize: 40, maxChunkSize: 180, parallelGeneration: true, maxParallelRequests: 2, }, endpoint: process.env.VOICE_WS_ENDPOINT, tools: { getWeather: weatherTool }, }); agent.on("text", ({ role, text }) => { const prefix = role === "user" ? "πŸ‘€" : "πŸ€–"; console.log(prefix, text); }); agent.on("chunk:text_delta", ({ text }) => process.stdout.write(text)); agent.on("speech_start", ({ streaming }) => console.log("speech_start", streaming)); agent.on("audio_chunk", ({ chunkId, format, uint8Array }) => { console.log("audio_chunk", chunkId, format, uint8Array.length); }); await agent.sendText("What's the weather in San Francisco?"); if (process.env.VOICE_WS_ENDPOINT) { await agent.connect(process.env.VOICE_WS_ENDPOINT); } ``` ### Configuration options The agent accepts: - `model` (required): chat model - `transcriptionModel` (optional): STT model - `speechModel` (optional): TTS model - `instructions` (optional): system prompt - `stopWhen` (optional): stopping condition - `tools` (optional): AI SDK tools map - `endpoint` (optional): WebSocket endpoint - `voice` (optional): TTS voice, default `alloy` - `speechInstructions` (optional): style instructions for TTS - `outputFormat` (optional): audio format, default `mp3` - `streamingSpeech` (optional): - `minChunkSize` - `maxChunkSize` - `parallelGeneration` - `maxParallelRequests` ### Common methods - `sendText(text)` – process text input (streamed response) - `sendAudio(base64Audio)` – process base64 audio input - `sendAudioBuffer(buffer)` – process raw audio buffer input - `transcribeAudio(buffer)` – transcribe audio directly - `generateAndSendSpeechFull(text)` – non-streaming TTS fallback - `interruptSpeech(reason)` – interrupt streaming speech (barge‑in) - `connect(url?)` / `handleSocket(ws)` – WebSocket usage ### Key events (from demo) - `text` – user/assistant messages - `chunk:text_delta` – streaming text deltas - `chunk:tool_call` / `tool_result` – tool lifecycle - `speech_start` / `speech_complete` / `speech_interrupted` - `speech_chunk_queued` / `audio_chunk` / `audio` - `connected` / `disconnected` ## Run (text-only check) This validates LLM + tool + streaming speech without requiring WebSocket: pnpm demo Expected logs include `text`, `chunk:text_delta`, tool events, and speech chunk events. ## Run (WebSocket check) 1. Start local WS server: pnpm ws:server 2. In another terminal, run demo: pnpm demo The demo will: - run `sendText()` first (text-only sanity check), then - connect to `VOICE_WS_ENDPOINT` if provided, - emit streaming protocol messages (`text_delta`, `tool_call`, `audio_chunk`, `response_complete`, etc.). ## Browser voice client (HTML) A simple browser client is available at [example/voice-client.html](example/voice-client.html). What it does: - captures microphone speech using Web Speech API (speech-to-text) - sends transcript to the agent via WebSocket (`type: "transcript"`) - receives streaming `audio_chunk` messages and plays them in order How to use: 1. Start your agent server/WebSocket endpoint. 2. Open [example/voice-client.html](example/voice-client.html) in a browser (Chrome/Edge recommended). 3. Connect to `ws://localhost:8080` (or your endpoint), then click **Start Mic**. ## Scripts - `pnpm build` – build TypeScript - `pnpm dev` – watch TypeScript - `pnpm demo` – run demo client - `pnpm ws:server` – run local test WebSocket server ## Notes - If `VOICE_WS_ENDPOINT` is empty, WebSocket connect is skipped. - The sample WS server sends a mock `transcript` message for end-to-end testing. - Streaming TTS uses chunk queueing and supports interruption (`interrupt`).