
voice-agent-ai-sdk

Streaming voice/text agent SDK built on AI SDK with optional WebSocket transport.

Features

  • Streaming text generation via AI SDK streamText with multi-step tool calling.
  • Chunked streaming TTS — text is split at sentence boundaries and converted to speech in parallel as the LLM streams, giving low time-to-first-audio.
  • Audio transcription via AI SDK experimental_transcribe (e.g. Whisper).
  • Barge-in / interruption — user speech cancels both the in-flight LLM stream and pending TTS, saving tokens and latency.
  • Memory management — configurable sliding-window on conversation history (maxMessages, maxTotalChars) and audio input size limits.
  • Serial request queue — concurrent sendText / audio inputs are queued and processed one at a time, preventing race conditions.
  • Graceful lifecycle — disconnect() aborts all in-flight work; destroy() permanently releases every resource.
  • WebSocket transport with a full protocol of stream, tool, and speech lifecycle events.
  • Works without WebSocket — call sendText() directly for text-only or server-side use.

Prerequisites

  • Node.js 20+
  • pnpm
  • OpenAI API key

Setup

  1. Install dependencies:

    pnpm install

  2. Configure environment variables in .env:

    OPENAI_API_KEY=your_openai_api_key
    VOICE_WS_ENDPOINT=ws://localhost:8080

VOICE_WS_ENDPOINT is optional for text-only usage.

VoiceAgent usage (as in the demo)

Minimal end-to-end example using AI SDK tools, streaming text, and streaming TTS:

import "dotenv/config";
import { VoiceAgent } from "./src";
import { tool } from "ai";
import { z } from "zod";
import { openai } from "@ai-sdk/openai";

const weatherTool = tool({
   description: "Get the weather in a location",
   inputSchema: z.object({ location: z.string() }),
   execute: async ({ location }) => ({ location, temperature: 72, conditions: "sunny" }),
});

const agent = new VoiceAgent({
   model: openai("gpt-4o"),
   transcriptionModel: openai.transcription("whisper-1"),
   speechModel: openai.speech("gpt-4o-mini-tts"),
   instructions: "You are a helpful voice assistant.",
   voice: "alloy",
   speechInstructions: "Speak in a friendly, natural conversational tone.",
   outputFormat: "mp3",
   streamingSpeech: {
      minChunkSize: 40,
      maxChunkSize: 180,
      parallelGeneration: true,
      maxParallelRequests: 2,
   },
   // Memory management (new in 0.1.0)
   history: {
      maxMessages: 50,       // keep last 50 messages
      maxTotalChars: 100_000, // or trim when total chars exceed 100k
   },
   maxAudioInputSize: 5 * 1024 * 1024, // 5 MB limit
   endpoint: process.env.VOICE_WS_ENDPOINT,
   tools: { getWeather: weatherTool },
});

agent.on("text", ({ role, text }) => {
   const prefix = role === "user" ? "👤" : "🤖";
   console.log(prefix, text);
});

agent.on("chunk:text_delta", ({ text }) => process.stdout.write(text));
agent.on("speech_start", ({ streaming }) => console.log("speech_start", streaming));
agent.on("audio_chunk", ({ chunkId, format, uint8Array }) => {
   console.log("audio_chunk", chunkId, format, uint8Array.length);
});

await agent.sendText("What's the weather in San Francisco?");

if (process.env.VOICE_WS_ENDPOINT) {
   await agent.connect(process.env.VOICE_WS_ENDPOINT);
}

Configuration options

The agent accepts:

| Option | Required | Default | Description |
|---|---|---|---|
| model | yes | | AI SDK chat model (e.g. openai("gpt-4o")) |
| transcriptionModel | no | | AI SDK transcription model (e.g. openai.transcription("whisper-1")) |
| speechModel | no | | AI SDK speech model (e.g. openai.speech("gpt-4o-mini-tts")) |
| instructions | no | "You are a helpful voice assistant." | System prompt |
| stopWhen | no | stepCountIs(5) | Stopping condition for multi-step tool loops |
| tools | no | {} | AI SDK tools map |
| endpoint | no | | Default WebSocket URL for connect() |
| voice | no | "alloy" | TTS voice |
| speechInstructions | no | | Style instructions passed to the speech model |
| outputFormat | no | "mp3" | Audio output format (mp3, opus, wav, …) |
| streamingSpeech | no | see below | Streaming TTS chunk tuning |
| history | no | see below | Conversation memory limits |
| maxAudioInputSize | no | 10485760 (10 MB) | Maximum accepted audio input in bytes |

streamingSpeech

| Key | Default | Description |
|---|---|---|
| minChunkSize | 50 | Min characters before a sentence is sent to TTS |
| maxChunkSize | 200 | Max characters per chunk (force-split at clause boundary) |
| parallelGeneration | true | Start TTS for upcoming chunks while the current one plays |
| maxParallelRequests | 3 | Cap on concurrent TTS requests |
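
The splitting strategy these knobs control can be sketched roughly as follows. This is an illustration of the idea, not the SDK's actual implementation; in particular, the SDK force-splits at a clause boundary, while this sketch simply cuts at the hard character limit.

```typescript
// Rough sketch of sentence-boundary chunking for streaming TTS
// (illustrative, not the SDK's internal code). Sentences are buffered
// until the buffer reaches minChunkSize; anything longer than
// maxChunkSize is force-split.
function chunkForTTS(text: string, minChunkSize = 50, maxChunkSize = 200): string[] {
  // Split into sentences, keeping terminators and trailing whitespace.
  const sentences = text.match(/[^.!?]+[.!?]+\s*|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let buffer = "";
  for (const sentence of sentences) {
    buffer += sentence;
    if (buffer.length < minChunkSize) continue; // keep accumulating
    while (buffer.length > maxChunkSize) {
      // Hard cut; the real implementation prefers a clause boundary here.
      chunks.push(buffer.slice(0, maxChunkSize));
      buffer = buffer.slice(maxChunkSize);
    }
    if (buffer.length >= minChunkSize) {
      chunks.push(buffer);
      buffer = "";
    }
  }
  if (buffer) chunks.push(buffer); // flush any tail below minChunkSize
  return chunks;
}
```

Because the chunks reassemble to the original text, nothing is lost downstream; a smaller minChunkSize lowers time-to-first-audio at the cost of more TTS requests.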

history

| Key | Default | Description |
|---|---|---|
| maxMessages | 100 | Max messages kept in history (0 = unlimited). Oldest are trimmed in pairs. |
| maxTotalChars | 0 (unlimited) | Max total characters across all messages. Oldest are trimmed when exceeded. |
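
One plausible reading of these limits ("trimmed in pairs" taken as dropping the oldest user/assistant pair together) is sketched below; the SDK's internal logic may differ.

```typescript
// Sketch of sliding-window history trimming (illustrative only).
// 0 means "unlimited" for either limit, and the oldest user/assistant
// pair is evicted together.
interface Message {
  role: "user" | "assistant";
  content: string;
}

function trimHistory(
  history: Message[],
  maxMessages = 100,
  maxTotalChars = 0,
): Message[] {
  const trimmed = [...history];
  const totalChars = () => trimmed.reduce((n, m) => n + m.content.length, 0);
  const overLimit = () =>
    (maxMessages > 0 && trimmed.length > maxMessages) ||
    (maxTotalChars > 0 && totalChars() > maxTotalChars);
  while (overLimit() && trimmed.length > 1) {
    trimmed.splice(0, 2); // drop the oldest user/assistant pair
  }
  return trimmed;
}
```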

Methods

| Method | Description |
|---|---|
| sendText(text) | Process text input. Returns a promise with the full assistant response. Requests are queued serially. |
| sendAudio(base64Audio) | Transcribe base64 audio and process the result. |
| sendAudioBuffer(buffer) | Same as above, but accepts a raw Buffer / Uint8Array. |
| transcribeAudio(buffer) | Transcribe audio to text without generating a response. |
| generateAndSendSpeechFull(text) | Non-streaming TTS fallback (entire text at once). |
| interruptSpeech(reason?) | Cancel in-flight TTS only (LLM stream keeps running). |
| interruptCurrentResponse(reason?) | Cancel both the LLM stream and TTS. Used for barge-in. |
| connect(url?) / handleSocket(ws) | Establish or attach a WebSocket. Safe to call multiple times. |
| disconnect() | Close the socket and abort all in-flight work. |
| destroy() | Permanently release all resources. The agent cannot be reused. |
| clearHistory() | Clear conversation history. |
| getHistory() / setHistory(msgs) | Read or restore conversation history. |
| registerTools(tools) | Merge additional tools into the agent. |
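
The serial queueing behind sendText and the audio methods can be approximated with a simple promise chain. A minimal sketch (the class name and shape are illustrative, not the SDK's internals):

```typescript
// Minimal sketch of a serial request queue. Each task is chained onto
// the previous one, so concurrent calls run strictly one at a time.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  // Enqueue a task; it starts only after every earlier task has settled.
  enqueue<T>(task: () => Promise<T>): Promise<T> {
    const run = this.tail.then(task, task); // run even if a prior task failed
    this.tail = run.catch(() => undefined); // keep the chain alive on errors
    return run;
  }
}
```

This is why a second sendText() issued while the first is still streaming waits its turn instead of interleaving output.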

Read-only properties

| Property | Type | Description |
|---|---|---|
| connected | boolean | Whether a WebSocket is connected |
| processing | boolean | Whether a request is currently being processed |
| speaking | boolean | Whether audio is currently being generated / sent |
| pendingSpeechChunks | number | Number of queued TTS chunks |
| destroyed | boolean | Whether destroy() has been called |

Events

| Event | Payload | When |
|---|---|---|
| text | { role, text } | User input received or full assistant response ready |
| chunk:text_delta | { id, text } | Each streaming text token from the LLM |
| chunk:reasoning_delta | { id, text } | Each reasoning token (models that support it) |
| chunk:tool_call | { toolName, toolCallId, input } | Tool invocation detected |
| tool_result | { name, toolCallId, result } | Tool execution finished |
| speech_start | { streaming } | TTS generation begins |
| speech_complete | { streaming } | All TTS chunks sent |
| speech_interrupted | { reason } | Speech was cancelled (barge-in, disconnect, error) |
| speech_chunk_queued | { id, text } | A text chunk entered the TTS queue |
| audio_chunk | { chunkId, data, format, text, uint8Array } | One TTS chunk is ready |
| audio | { data, format, uint8Array } | Full non-streaming TTS audio |
| transcription | { text, language } | Audio transcription result |
| audio_received | { size } | Raw audio input received (before transcription) |
| history_trimmed | { removedCount, reason } | Oldest messages evicted from history |
| connected / disconnected | | WebSocket lifecycle |
| warning | string | Non-fatal issues (empty input, etc.) |
| error | Error | Errors from LLM, TTS, transcription, or WebSocket |
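
A consumer of audio_chunk events typically plays chunks strictly in order even though parallel TTS may produce them quickly. A minimal ordered-playback sketch (the play callback is a stand-in for your actual audio sink, not part of the SDK):

```typescript
// Sketch of an ordered playback queue for audio_chunk events: chunks
// are queued as they arrive and handed to the player one at a time.
class PlaybackQueue {
  private queue: Uint8Array[] = [];
  private playing = false;

  constructor(private play: (chunk: Uint8Array) => Promise<void>) {}

  enqueue(chunk: Uint8Array): void {
    this.queue.push(chunk);
    void this.drain();
  }

  private async drain(): Promise<void> {
    if (this.playing) return; // a drain loop is already running
    this.playing = true;
    while (this.queue.length > 0) {
      await this.play(this.queue.shift()!);
    }
    this.playing = false;
  }
}
```

Wired up, this would be `agent.on("audio_chunk", ({ uint8Array }) => playback.enqueue(uint8Array))`.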

Run (text-only check)

This validates LLM + tool + streaming speech without requiring WebSocket:

pnpm demo

Expected logs include text, chunk:text_delta, tool events, and speech chunk events.

Run (WebSocket check)

  1. Start the local WS server:

    pnpm ws:server
    
  2. In another terminal, run the demo:

    pnpm demo
    

The demo will:

  • run sendText() first (text-only sanity check), then
  • connect to VOICE_WS_ENDPOINT if provided,
  • emit streaming protocol messages (text_delta, tool_call, audio_chunk, response_complete, etc.).

Browser voice client (HTML)

A simple browser client is available at example/voice-client.html.

What it does:

  • captures microphone speech using Web Speech API (speech-to-text)
  • sends transcript to the agent via WebSocket (type: "transcript")
  • receives streaming audio_chunk messages and plays them in order
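
A client-side dispatcher for these messages might look like the sketch below. The payload shapes here are assumptions for illustration; only the message type names come from this README.

```typescript
// Sketch of dispatching the streaming protocol messages. Payload
// shapes are assumed, not the SDK's exact wire format.
type ServerMessage =
  | { type: "text_delta"; text: string }
  | { type: "tool_call"; toolName: string }
  | { type: "audio_chunk"; chunkId: string; data: string; format: string }
  | { type: "response_complete" };

function handleMessage(
  raw: string,
  handlers: {
    onText?: (text: string) => void;
    onAudio?: (chunkId: string, data: string) => void;
    onDone?: () => void;
  },
): void {
  const msg = JSON.parse(raw) as ServerMessage;
  switch (msg.type) {
    case "text_delta":
      handlers.onText?.(msg.text);
      break;
    case "audio_chunk":
      handlers.onAudio?.(msg.chunkId, msg.data); // base64 audio payload
      break;
    case "response_complete":
      handlers.onDone?.();
      break;
  }
}
```

In the browser this would sit inside `ws.onmessage = (e) => handleMessage(e.data, …)`.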

How to use:

  1. Start your agent server/WebSocket endpoint.
  2. Open example/voice-client.html in a browser (Chrome/Edge recommended).
  3. Connect to ws://localhost:8080 (or your endpoint), then click Start Mic.

Scripts

  • pnpm build — build TypeScript
  • pnpm dev — watch TypeScript
  • pnpm demo — run the demo client
  • pnpm ws:server — run the local test WebSocket server

Notes

  • If VOICE_WS_ENDPOINT is empty, WebSocket connect is skipped.
  • The sample WS server sends a mock transcript message for end-to-end testing.
  • Streaming TTS uses chunk queueing and supports interruption via interruptSpeech() / interruptCurrentResponse().