# voice-agent-ai-sdk
[![npm version](https://badge.fury.io/js/voice-agent-ai-sdk.svg)](https://www.npmjs.com/package/voice-agent-ai-sdk)
Streaming voice/text agent SDK built on [AI SDK](https://sdk.vercel.ai/) with optional WebSocket transport.
## Features
- **Streaming text generation** via AI SDK `streamText` with multi-step tool calling.
- **Chunked streaming TTS** — text is split at sentence boundaries and converted to speech in parallel as the LLM streams, giving low time-to-first-audio.
- **Audio transcription** via AI SDK `experimental_transcribe` (e.g. Whisper).
- **Barge-in / interruption** — user speech cancels both the in-flight LLM stream and pending TTS, saving tokens and latency.
- **Memory management** — configurable sliding-window on conversation history (`maxMessages`, `maxTotalChars`) and audio input size limits.
- **Serial request queue** — concurrent `sendText` / audio inputs are queued and processed one at a time, preventing race conditions.
- **Graceful lifecycle** — `disconnect()` aborts all in-flight work; `destroy()` permanently releases every resource.
- **WebSocket transport** with a full protocol of stream, tool, and speech lifecycle events.
- **Works without WebSocket** — call `sendText()` directly for text-only or server-side use.
## Prerequisites
- Node.js 20+
- pnpm
- OpenAI API key
## Setup
1. Install dependencies:

   ```sh
   pnpm install
   ```

2. Configure environment variables in `.env`:

   ```
   OPENAI_API_KEY=your_openai_api_key
   VOICE_WS_ENDPOINT=ws://localhost:8080
   ```

   `VOICE_WS_ENDPOINT` is optional for text-only usage.
## VoiceAgent usage (as in the demo)
Minimal end-to-end example using AI SDK tools, streaming text, and streaming TTS:
```ts
import "dotenv/config";
import { VoiceAgent } from "./src";
import { tool } from "ai";
import { z } from "zod";
import { openai } from "@ai-sdk/openai";

const weatherTool = tool({
  description: "Get the weather in a location",
  inputSchema: z.object({ location: z.string() }),
  execute: async ({ location }) => ({ location, temperature: 72, conditions: "sunny" }),
});

const agent = new VoiceAgent({
  model: openai("gpt-4o"),
  transcriptionModel: openai.transcription("whisper-1"),
  speechModel: openai.speech("gpt-4o-mini-tts"),
  instructions: "You are a helpful voice assistant.",
  voice: "alloy",
  speechInstructions: "Speak in a friendly, natural conversational tone.",
  outputFormat: "mp3",
  streamingSpeech: {
    minChunkSize: 40,
    maxChunkSize: 180,
    parallelGeneration: true,
    maxParallelRequests: 2,
  },
  // Memory management (new in 0.1.0)
  history: {
    maxMessages: 50, // keep last 50 messages
    maxTotalChars: 100_000, // or trim when total chars exceed 100k
  },
  maxAudioInputSize: 5 * 1024 * 1024, // 5 MB limit
  endpoint: process.env.VOICE_WS_ENDPOINT,
  tools: { getWeather: weatherTool },
});

agent.on("text", ({ role, text }) => {
  const prefix = role === "user" ? "👤" : "🤖";
  console.log(prefix, text);
});
agent.on("chunk:text_delta", ({ text }) => process.stdout.write(text));
agent.on("speech_start", ({ streaming }) => console.log("speech_start", streaming));
agent.on("audio_chunk", ({ chunkId, format, uint8Array }) => {
  console.log("audio_chunk", chunkId, format, uint8Array.length);
});

await agent.sendText("What's the weather in San Francisco?");

if (process.env.VOICE_WS_ENDPOINT) {
  await agent.connect(process.env.VOICE_WS_ENDPOINT);
}
```
### Configuration options
The agent accepts:

| Option | Required | Default | Description |
|---|---|---|---|
| `model` | **yes** | — | AI SDK chat model (e.g. `openai("gpt-4o")`) |
| `transcriptionModel` | no | — | AI SDK transcription model (e.g. `openai.transcription("whisper-1")`) |
| `speechModel` | no | — | AI SDK speech model (e.g. `openai.speech("gpt-4o-mini-tts")`) |
| `instructions` | no | `"You are a helpful voice assistant."` | System prompt |
| `stopWhen` | no | `stepCountIs(5)` | Stopping condition for multi-step tool loops |
| `tools` | no | `{}` | AI SDK tools map |
| `endpoint` | no | — | Default WebSocket URL for `connect()` |
| `voice` | no | `"alloy"` | TTS voice |
| `speechInstructions` | no | — | Style instructions passed to the speech model |
| `outputFormat` | no | `"mp3"` | Audio output format (`mp3`, `opus`, `wav`, …) |
| `streamingSpeech` | no | see below | Streaming TTS chunk tuning |
| `history` | no | see below | Conversation memory limits |
| `maxAudioInputSize` | no | `10485760` (10 MB) | Maximum accepted audio input in bytes |
#### `streamingSpeech`
| Key | Default | Description |
|---|---|---|
| `minChunkSize` | `50` | Min characters before a sentence is sent to TTS |
| `maxChunkSize` | `200` | Max characters per chunk (force-split at clause boundary) |
| `parallelGeneration` | `true` | Start TTS for upcoming chunks while the current one plays |
| `maxParallelRequests` | `3` | Cap on concurrent TTS requests |
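
The chunking behavior these options tune can be sketched as a standalone function (a hypothetical illustration, not the SDK's actual implementation; `splitForTts` is an invented name):

```ts
// Hypothetical sketch of sentence-boundary chunking for streaming TTS.
// Text is buffered until a chunk reaches minChunkSize at a sentence boundary;
// oversized buffers are force-split near maxChunkSize at a comma if possible.
function splitForTts(text: string, minChunkSize = 50, maxChunkSize = 200): string[] {
  const chunks: string[] = [];
  let buffer = "";
  for (const sentence of text.split(/(?<=[.!?])\s+/)) {
    buffer = buffer ? `${buffer} ${sentence}` : sentence;
    if (buffer.length >= minChunkSize) {
      while (buffer.length > maxChunkSize) {
        const clause = buffer.lastIndexOf(",", maxChunkSize);
        const cut = clause > 0 ? clause + 1 : maxChunkSize;
        chunks.push(buffer.slice(0, cut).trim());
        buffer = buffer.slice(cut).trim();
      }
      if (buffer.length >= minChunkSize) {
        chunks.push(buffer);
        buffer = "";
      }
    }
  }
  if (buffer) chunks.push(buffer); // flush the tail when the stream ends
  return chunks;
}
```

Each returned chunk can be sent to the speech model immediately, which is what keeps time-to-first-audio low.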
#### `history`
| Key | Default | Description |
|---|---|---|
| `maxMessages` | `100` | Max messages kept in history (0 = unlimited). Oldest are trimmed in pairs. |
| `maxTotalChars` | `0` (unlimited) | Max total characters across all messages. Oldest are trimmed when exceeded. |
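
The sliding-window behavior can be illustrated with a small sketch (hypothetical helper, not the SDK's code; it assumes history alternates user/assistant messages):

```ts
type Msg = { role: "user" | "assistant"; content: string };

// Hypothetical sketch of the sliding-window trim: the oldest messages are
// dropped in pairs until both maxMessages and maxTotalChars are satisfied.
function trimHistory(history: Msg[], maxMessages: number, maxTotalChars: number): Msg[] {
  const trimmed = [...history];
  const totalChars = () => trimmed.reduce((n, m) => n + m.content.length, 0);
  while (
    (maxMessages > 0 && trimmed.length > maxMessages) ||
    (maxTotalChars > 0 && totalChars() > maxTotalChars)
  ) {
    trimmed.splice(0, 2); // drop the oldest user/assistant pair
  }
  return trimmed;
}
```

Trimming in pairs keeps the remaining history starting on a user turn, which most chat models expect.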
### Methods
| Method | Description |
|---|---|
| `sendText(text)` | Process text input. Returns a promise with the full assistant response. Requests are queued serially. |
| `sendAudio(base64Audio)` | Transcribe base64 audio and process the result. |
| `sendAudioBuffer(buffer)` | Same as above, accepts a raw `Buffer` / `Uint8Array`. |
| `transcribeAudio(buffer)` | Transcribe audio to text without generating a response. |
| `generateAndSendSpeechFull(text)` | Non-streaming TTS fallback (entire text at once). |
| `interruptSpeech(reason?)` | Cancel in-flight TTS only (LLM stream keeps running). |
| `interruptCurrentResponse(reason?)` | Cancel **both** the LLM stream and TTS. Used for barge-in. |
| `connect(url?)` / `handleSocket(ws)` | Establish or attach a WebSocket. Safe to call multiple times. |
| `disconnect()` | Close the socket and abort all in-flight work. |
| `destroy()` | Permanently release all resources. The agent cannot be reused. |
| `clearHistory()` | Clear conversation history. |
| `getHistory()` / `setHistory(msgs)` | Read or restore conversation history. |
| `registerTools(tools)` | Merge additional tools into the agent. |
### Read-only properties
| Property | Type | Description |
|---|---|---|
| `connected` | `boolean` | Whether a WebSocket is connected |
| `processing` | `boolean` | Whether a request is currently being processed |
| `speaking` | `boolean` | Whether audio is currently being generated / sent |
| `pendingSpeechChunks` | `number` | Number of queued TTS chunks |
| `destroyed` | `boolean` | Whether `destroy()` has been called |
### Events
| Event | Payload | When |
|---|---|---|
| `text` | `{ role, text }` | User input received or full assistant response ready |
| `chunk:text_delta` | `{ id, text }` | Each streaming text token from the LLM |
| `chunk:reasoning_delta` | `{ id, text }` | Each reasoning token (models that support it) |
| `chunk:tool_call` | `{ toolName, toolCallId, input }` | Tool invocation detected |
| `tool_result` | `{ name, toolCallId, result }` | Tool execution finished |
| `speech_start` | `{ streaming }` | TTS generation begins |
| `speech_complete` | `{ streaming }` | All TTS chunks sent |
| `speech_interrupted` | `{ reason }` | Speech was cancelled (barge-in, disconnect, error) |
| `speech_chunk_queued` | `{ id, text }` | A text chunk entered the TTS queue |
| `audio_chunk` | `{ chunkId, data, format, text, uint8Array }` | One TTS chunk is ready |
| `audio` | `{ data, format, uint8Array }` | Full non-streaming TTS audio |
| `transcription` | `{ text, language }` | Audio transcription result |
| `audio_received` | `{ size }` | Raw audio input received (before transcription) |
| `history_trimmed` | `{ removedCount, reason }` | Oldest messages evicted from history |
| `connected` / `disconnected` | — | WebSocket lifecycle |
| `warning` | `string` | Non-fatal issues (empty input, etc.) |
| `error` | `Error` | Errors from LLM, TTS, transcription, or WebSocket |
## Run (text-only check)
This validates LLM + tool + streaming speech without requiring a WebSocket:

```sh
pnpm demo
```

Expected logs include `text`, `chunk:text_delta`, tool events, and speech chunk events.
## Run (WebSocket check)
1. Start the local WS server:

   ```sh
   pnpm ws:server
   ```

2. In another terminal, run the demo:

   ```sh
   pnpm demo
   ```
The demo will:
- run `sendText()` first (text-only sanity check), then
- connect to `VOICE_WS_ENDPOINT` if provided,
- emit streaming protocol messages (`text_delta`, `tool_call`, `audio_chunk`, `response_complete`, etc.).
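
A client receiving these protocol messages might route them like this (a sketch only; the payload shapes are assumptions based on the events table above, not a documented wire format):

```ts
// Hypothetical client-side router for the streaming protocol messages.
// The `type` values mirror the demo's message names; exact payload fields
// are assumed from the events table and may differ from the real protocol.
type ProtocolMessage =
  | { type: "text_delta"; text: string }
  | { type: "tool_call"; toolName: string; input: unknown }
  | { type: "audio_chunk"; chunkId: number; data: string; format: string }
  | { type: "response_complete"; text: string };

function routeMessage(raw: string, log: (line: string) => void): void {
  const msg = JSON.parse(raw) as ProtocolMessage;
  switch (msg.type) {
    case "text_delta":
      log(msg.text); // append the token to the live transcript
      break;
    case "tool_call":
      log(`[tool] ${msg.toolName}`);
      break;
    case "audio_chunk":
      log(`[audio ${msg.chunkId} ${msg.format}]`); // decode base64 `data` and enqueue for playback
      break;
    case "response_complete":
      log(`[done] ${msg.text}`);
      break;
  }
}
```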
## Browser voice client (HTML)
A simple browser client is available at [example/voice-client.html](example/voice-client.html).
What it does:
- captures microphone speech using Web Speech API (speech-to-text)
- sends transcript to the agent via WebSocket (`type: "transcript"`)
- receives streaming `audio_chunk` messages and plays them in order
How to use:
1. Start your agent server/WebSocket endpoint.
2. Open [example/voice-client.html](example/voice-client.html) in a browser (Chrome/Edge recommended).
3. Connect to `ws://localhost:8080` (or your endpoint), then click **Start Mic**.
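
Because parallel TTS can finish chunks out of order, the client must buffer them and play strictly by `chunkId`. A minimal sketch of that ordering logic (assuming `chunkId` is 0-based and sequential; `OrderedChunkQueue` is an invented name, not part of the SDK):

```ts
// Hypothetical in-order release queue for streamed audio chunks: chunks are
// buffered by chunkId and released only when all earlier chunks have arrived.
class OrderedChunkQueue<T> {
  private pending = new Map<number, T>();
  private next = 0;

  // Returns every chunk that is now ready to play, in playback order.
  push(chunkId: number, chunk: T): T[] {
    this.pending.set(chunkId, chunk);
    const ready: T[] = [];
    while (this.pending.has(this.next)) {
      ready.push(this.pending.get(this.next)!);
      this.pending.delete(this.next);
      this.next++;
    }
    return ready;
  }
}
```

In the browser, each released chunk would be decoded (e.g. via `AudioContext.decodeAudioData`) and appended to the playback schedule.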
## Scripts
- `pnpm build`: build TypeScript
- `pnpm dev`: watch TypeScript
- `pnpm demo`: run the demo client
- `pnpm ws:server`: run the local test WebSocket server
## Notes
- If `VOICE_WS_ENDPOINT` is empty, WebSocket connect is skipped.
- The sample WS server sends a mock `transcript` message for end-to-end testing.
- Streaming TTS uses chunk queueing and supports interruption (`interrupt`).