0.2.1-beta.0

0.2.0
Refactor VoiceAgent: Extract types and default configurations into separate types.ts file; remove unused StreamBuffer file
2026-03-02 18:36:39 +00:00 · 2026-02-19 16:06:17 +05:30 · 2026-02-19 16:03:56 +05:30 · 2026-02-19 16:01:25 +05:30 · 2026-02-14 14:39:23 +05:30 · 2026-02-14 14:20:07 +05:30
36 changed files with 5820 additions and 158 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -4,4 +4,6 @@ node_modules

 .marscode

-dist
+# dist
+
+HOW_*.md
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -0,0 +1,81 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/),
+and this project adheres to [Semantic Versioning](https://semver.org/).
+
+## [0.1.0] - 2025-07-15
+
+### Added
+
+- **Conversation history limits** — new `history` option with `maxMessages` (default 100)
+  and `maxTotalChars` (default unlimited) to prevent unbounded memory growth.
+  Oldest messages are trimmed in pairs to preserve user/assistant turn structure.
+  Emits `history_trimmed` event when messages are evicted.
+- **Audio input size validation** — new `maxAudioInputSize` option (default 10 MB).
+  Oversized or empty audio payloads are rejected early with an `error` / `warning` event
+  instead of being forwarded to the transcription model.
+- **Serial input queue** — `sendText()`, WebSocket `transcript` messages, and
+  transcribed audio are now queued and processed one at a time. This prevents
+  race conditions where concurrent calls could corrupt `conversationHistory` or
+  interleave streaming output.
+- **LLM stream cancellation** — an `AbortController` is now threaded into
+  `streamText()` via `abortSignal`. Barge-in, disconnect, and explicit
+  interrupts abort the LLM stream immediately (saving tokens) instead of only
+  cancelling TTS.
+- **`interruptCurrentResponse(reason)`** — new public method that aborts both
+  the LLM stream *and* ongoing speech in a single call. WebSocket barge-in
+  (`transcript` / `audio` / `interrupt` messages) now uses this instead of
+  `interruptSpeech()` alone.
+- **`destroy()`** — permanently tears down the agent, releasing the socket,
+  clearing history and tools, and removing all event listeners.
+  A `destroyed` getter is also exposed. Any subsequent method call throws.
+- **`history_trimmed` event** — emitted with `{ removedCount, reason }` when
+  the sliding-window trims old messages.
+- **Input validation** — `sendText("")` now throws, and incoming WebSocket
+  `transcript` / `audio` messages are validated before processing.
+
+### Changed
+
+- **`disconnect()` is now a full cleanup** — aborts in-flight LLM and TTS
+  streams, clears the speech queue, rejects pending queued inputs, and removes
+  socket listeners before closing. Previously it only called `socket.close()`.
+- **`connect()` and `handleSocket()` are idempotent** — calling either when a
+  socket is already attached will cleanly tear down the old connection first
+  instead of leaking it.
+- **`sendWebSocketMessage()` is resilient** — checks `socket.readyState` and
+  wraps `send()` in a try/catch so a socket that closes mid-send does not throw
+  an unhandled exception.
+- **Speech queue completion uses a promise** — `processUserInput` now awaits a
+  `speechQueueDonePromise` instead of busy-wait polling
+  (`while (queue.length) { await sleep(100) }`), reducing CPU waste and
+  eliminating a race window.
+- **`interruptSpeech()` resolves the speech-done promise** — so
+  `processUserInput` can proceed immediately after a barge-in instead of
+  potentially hanging.
+- **WebSocket message handler uses `if/else if`** — prevents a single message
+  from accidentally matching multiple type branches.
+- **Chunk ID wraps at `Number.MAX_SAFE_INTEGER`** — avoids unbounded counter
+  growth in very long-running sessions.
+- **`processUserInput` catch block cleans up speech state** — on stream error
+  the pending text buffer is cleared and any in-progress speech is interrupted,
+  so the agent does not get stuck in a broken state.
+- **WebSocket close handler calls `cleanupOnDisconnect()`** — aborts LLM + TTS,
+  clears queues, and rejects pending input promises.
+
+### Fixed
+
+- Typo in JSDoc: `"Process text deltra"` → `"Process text delta"`.
+
+## [0.0.1] - 2025-07-14
+
+### Added
+
+- Initial release.
+- Streaming text generation via AI SDK `streamText`.
+- Multi-step tool calling with `stopWhen`.
+- Chunked streaming TTS with parallel generation and barge-in support.
+- Audio transcription via AI SDK `experimental_transcribe`.
+- WebSocket transport with full stream/tool/speech lifecycle events.
+- Browser voice client example (`example/voice-client.html`).
--- a/README.md
+++ b/README.md
@@ -1,14 +1,20 @@
 # voice-agent-ai-sdk

-Streaming voice/text agent SDK built on AI SDK with optional WebSocket transport.
+[![npm version](https://badge.fury.io/js/voice-agent-ai-sdk.svg)](https://www.npmjs.com/package/voice-agent-ai-sdk)

-## Current status
+Streaming voice/text agent SDK built on [AI SDK](https://sdk.vercel.ai/) with optional WebSocket transport.

- Streaming text generation is implemented via `streamText`.
- Tool calling is supported in-stream.
- Speech synthesis is implemented with chunked streaming TTS.
- Audio transcription is supported (when `transcriptionModel` is configured).
- WebSocket protocol events are emitted for stream, tool, and speech lifecycle.
+## Features
+
+- **Streaming text generation** via AI SDK `streamText` with multi-step tool calling.
+- **Chunked streaming TTS** — text is split at sentence boundaries and converted to speech in parallel as the LLM streams, giving low time-to-first-audio.
+- **Audio transcription** via AI SDK `experimental_transcribe` (e.g. Whisper).
+- **Barge-in / interruption** — user speech cancels both the in-flight LLM stream and pending TTS, saving tokens and latency.
+- **Memory management** — configurable sliding-window on conversation history (`maxMessages`, `maxTotalChars`) and audio input size limits.
+- **Serial request queue** — concurrent `sendText` / audio inputs are queued and processed one at a time, preventing race conditions.
+- **Graceful lifecycle** — `disconnect()` aborts all in-flight work; `destroy()` permanently releases every resource.
+- **WebSocket transport** with a full protocol of stream, tool, and speech lifecycle events.
+- **Works without WebSocket** — call `sendText()` directly for text-only or server-side use.

 ## Prerequisites

@@ -29,25 +35,150 @@ Streaming voice/text agent SDK built on AI SDK with optional WebSocket transport

 `VOICE_WS_ENDPOINT` is optional for text-only usage.

-## VoiceAgent configuration
+## VoiceAgent usage (as in the demo)
+
+Minimal end-to-end example using AI SDK tools, streaming text, and streaming TTS:
+
+```ts
+import "dotenv/config";
+import { VoiceAgent } from "./src";
+import { tool } from "ai";
+import { z } from "zod";
+import { openai } from "@ai-sdk/openai";
+
+const weatherTool = tool({
+   description: "Get the weather in a location",
+   inputSchema: z.object({ location: z.string() }),
+   execute: async ({ location }) => ({ location, temperature: 72, conditions: "sunny" }),
+});
+
+const agent = new VoiceAgent({
+   model: openai("gpt-4o"),
+   transcriptionModel: openai.transcription("whisper-1"),
+   speechModel: openai.speech("gpt-4o-mini-tts"),
+   instructions: "You are a helpful voice assistant.",
+   voice: "alloy",
+   speechInstructions: "Speak in a friendly, natural conversational tone.",
+   outputFormat: "mp3",
+   streamingSpeech: {
+      minChunkSize: 40,
+      maxChunkSize: 180,
+      parallelGeneration: true,
+      maxParallelRequests: 2,
+   },
+   // Memory management (new in 0.1.0)
+   history: {
+      maxMessages: 50,       // keep last 50 messages
+      maxTotalChars: 100_000, // or trim when total chars exceed 100k
+   },
+   maxAudioInputSize: 5 * 1024 * 1024, // 5 MB limit
+   endpoint: process.env.VOICE_WS_ENDPOINT,
+   tools: { getWeather: weatherTool },
+});
+
+agent.on("text", ({ role, text }) => {
+   const prefix = role === "user" ? "👤" : "🤖";
+   console.log(prefix, text);
+});
+
+agent.on("chunk:text_delta", ({ text }) => process.stdout.write(text));
+agent.on("speech_start", ({ streaming }) => console.log("speech_start", streaming));
+agent.on("audio_chunk", ({ chunkId, format, uint8Array }) => {
+   console.log("audio_chunk", chunkId, format, uint8Array.length);
+});
+
+await agent.sendText("What's the weather in San Francisco?");
+
+if (process.env.VOICE_WS_ENDPOINT) {
+   await agent.connect(process.env.VOICE_WS_ENDPOINT);
+}
+```
+
+### Configuration options

 The agent accepts:

- `model` (required): chat model
- `transcriptionModel` (optional): STT model
- `speechModel` (optional): TTS model
- `instructions` (optional): system prompt
- `stopWhen` (optional): stopping condition
- `tools` (optional): AI SDK tools map
- `endpoint` (optional): WebSocket endpoint
- `voice` (optional): TTS voice, default `alloy`
- `speechInstructions` (optional): style instructions for TTS
- `outputFormat` (optional): audio format, default `mp3`
- `streamingSpeech` (optional):
-   - `minChunkSize`
-   - `maxChunkSize`
-   - `parallelGeneration`
-   - `maxParallelRequests`
+| Option | Required | Default | Description |
+|---|---|---|---|
+| `model` | **yes** | — | AI SDK chat model (e.g. `openai("gpt-4o")`) |
+| `transcriptionModel` | no | — | AI SDK transcription model (e.g. `openai.transcription("whisper-1")`) |
+| `speechModel` | no | — | AI SDK speech model (e.g. `openai.speech("gpt-4o-mini-tts")`) |
+| `instructions` | no | `"You are a helpful voice assistant."` | System prompt |
+| `stopWhen` | no | `stepCountIs(5)` | Stopping condition for multi-step tool loops |
+| `tools` | no | `{}` | AI SDK tools map |
+| `endpoint` | no | — | Default WebSocket URL for `connect()` |
+| `voice` | no | `"alloy"` | TTS voice |
+| `speechInstructions` | no | — | Style instructions passed to the speech model |
+| `outputFormat` | no | `"mp3"` | Audio output format (`mp3`, `opus`, `wav`, …) |
+| `streamingSpeech` | no | see below | Streaming TTS chunk tuning |
+| `history` | no | see below | Conversation memory limits |
+| `maxAudioInputSize` | no | `10485760` (10 MB) | Maximum accepted audio input in bytes |
+
+#### `streamingSpeech`
+
+| Key | Default | Description |
+|---|---|---|
+| `minChunkSize` | `50` | Min characters before a sentence is sent to TTS |
+| `maxChunkSize` | `200` | Max characters per chunk (force-split at clause boundary) |
+| `parallelGeneration` | `true` | Start TTS for upcoming chunks while the current one plays |
+| `maxParallelRequests` | `3` | Cap on concurrent TTS requests |
+
+#### `history`
+
+| Key | Default | Description |
+|---|---|---|
+| `maxMessages` | `100` | Max messages kept in history (0 = unlimited). Oldest are trimmed in pairs. |
+| `maxTotalChars` | `0` (unlimited) | Max total characters across all messages. Oldest are trimmed when exceeded. |
+
+### Methods
+
+| Method | Description |
+|---|---|
+| `sendText(text)` | Process text input. Returns a promise with the full assistant response. Requests are queued serially. |
+| `sendAudio(base64Audio)` | Transcribe base64 audio and process the result. |
+| `sendAudioBuffer(buffer)` | Same as above, accepts a raw `Buffer` / `Uint8Array`. |
+| `transcribeAudio(buffer)` | Transcribe audio to text without generating a response. |
+| `generateAndSendSpeechFull(text)` | Non-streaming TTS fallback (entire text at once). |
+| `interruptSpeech(reason?)` | Cancel in-flight TTS only (LLM stream keeps running). |
+| `interruptCurrentResponse(reason?)` | Cancel **both** the LLM stream and TTS. Used for barge-in. |
+| `connect(url?)` / `handleSocket(ws)` | Establish or attach a WebSocket. Safe to call multiple times. |
+| `disconnect()` | Close the socket and abort all in-flight work. |
+| `destroy()` | Permanently release all resources. The agent cannot be reused. |
+| `clearHistory()` | Clear conversation history. |
+| `getHistory()` / `setHistory(msgs)` | Read or restore conversation history. |
+| `registerTools(tools)` | Merge additional tools into the agent. |
+
+### Read-only properties
+
+| Property | Type | Description |
+|---|---|---|
+| `connected` | `boolean` | Whether a WebSocket is connected |
+| `processing` | `boolean` | Whether a request is currently being processed |
+| `speaking` | `boolean` | Whether audio is currently being generated / sent |
+| `pendingSpeechChunks` | `number` | Number of queued TTS chunks |
+| `destroyed` | `boolean` | Whether `destroy()` has been called |
+
+### Events
+
+| Event | Payload | When |
+|---|---|---|
+| `text` | `{ role, text }` | User input received or full assistant response ready |
+| `chunk:text_delta` | `{ id, text }` | Each streaming text token from the LLM |
+| `chunk:reasoning_delta` | `{ id, text }` | Each reasoning token (models that support it) |
+| `chunk:tool_call` | `{ toolName, toolCallId, input }` | Tool invocation detected |
+| `tool_result` | `{ name, toolCallId, result }` | Tool execution finished |
+| `speech_start` | `{ streaming }` | TTS generation begins |
+| `speech_complete` | `{ streaming }` | All TTS chunks sent |
+| `speech_interrupted` | `{ reason }` | Speech was cancelled (barge-in, disconnect, error) |
+| `speech_chunk_queued` | `{ id, text }` | A text chunk entered the TTS queue |
+| `audio_chunk` | `{ chunkId, data, format, text, uint8Array }` | One TTS chunk is ready |
+| `audio` | `{ data, format, uint8Array }` | Full non-streaming TTS audio |
+| `transcription` | `{ text, language }` | Audio transcription result |
+| `audio_received` | `{ size }` | Raw audio input received (before transcription) |
+| `history_trimmed` | `{ removedCount, reason }` | Oldest messages evicted from history |
+| `connected` / `disconnected` | — | WebSocket lifecycle |
+| `warning` | `string` | Non-fatal issues (empty input, etc.) |
+| `error` | `Error` | Errors from LLM, TTS, transcription, or WebSocket |

 ## Run (text-only check)

@@ -59,13 +190,13 @@ Expected logs include `text`, `chunk:text_delta`, tool events, and speech chunk

 ## Run (WebSocket check)

-1. Start local WS server:
+1. Start the local WS server:

-   pnpm ws:server
+       pnpm ws:server

-2. In another terminal, run demo:
+2. In another terminal, run the demo:

-   pnpm demo
+       pnpm demo

 The demo will:
 - run `sendText()` first (text-only sanity check), then
--- a/dist/VideoAgent.d.ts
+++ b/dist/VideoAgent.d.ts
@@ -0,0 +1,339 @@
+import { WebSocket } from "ws";
+import { EventEmitter } from "events";
+import { streamText, LanguageModel, type Tool, type ModelMessage, type TranscriptionModel, type SpeechModel } from "ai";
+import { type StreamingSpeechConfig, type HistoryConfig } from "./types";
+/**
+ * Trigger reasons for frame capture
+ */
+type FrameTriggerReason = "scene_change" | "user_request" | "timer" | "initial";
+/**
+ * Video frame data structure sent to/from the client
+ */
+interface VideoFrame {
+    type: "video_frame";
+    sessionId: string;
+    sequence: number;
+    timestamp: number;
+    triggerReason: FrameTriggerReason;
+    previousFrameRef?: string;
+    image: {
+        data: string;
+        format: string;
+        width: number;
+        height: number;
+    };
+}
+/**
+ * Audio data structure
+ */
+interface AudioData {
+    type: "audio";
+    sessionId: string;
+    data: string;
+    format: string;
+    sampleRate?: number;
+    duration?: number;
+    timestamp: number;
+}
+/**
+ * Backend configuration for video processing
+ */
+interface VideoAgentConfig {
+    /** Maximum frames to keep in context buffer for conversation history */
+    maxContextFrames: number;
+}
+/**
+ * Frame context for maintaining visual conversation history
+ */
+interface FrameContext {
+    sequence: number;
+    timestamp: number;
+    triggerReason: FrameTriggerReason;
+    frameHash: string;
+    description?: string;
+}
+export interface VideoAgentOptions {
+    model: LanguageModel;
+    transcriptionModel?: TranscriptionModel;
+    speechModel?: SpeechModel;
+    instructions?: string;
+    stopWhen?: NonNullable<Parameters<typeof streamText>[0]["stopWhen"]>;
+    tools?: Record<string, Tool>;
+    endpoint?: string;
+    voice?: string;
+    speechInstructions?: string;
+    outputFormat?: string;
+    /** Configuration for streaming speech generation */
+    streamingSpeech?: Partial<StreamingSpeechConfig>;
+    /** Configuration for conversation history memory limits */
+    history?: Partial<HistoryConfig>;
+    /** Maximum audio input size in bytes (default: 10 MB) */
+    maxAudioInputSize?: number;
+    /** Maximum frame input size in bytes (default: 5 MB) */
+    maxFrameInputSize?: number;
+    /** Maximum frames to keep in context buffer (default: 10) */
+    maxContextFrames?: number;
+    /** Session ID for this video agent instance */
+    sessionId?: string;
+}
+export declare class VideoAgent extends EventEmitter {
+    private socket?;
+    private tools;
+    private model;
+    private transcriptionModel?;
+    private speechModel?;
+    private instructions;
+    private stopWhen;
+    private endpoint?;
+    private isConnected;
+    private conversationHistory;
+    private voice;
+    private speechInstructions?;
+    private outputFormat;
+    private isProcessing;
+    private isDestroyed;
+    private sessionId;
+    private frameSequence;
+    private lastFrameTimestamp;
+    private lastFrameHash?;
+    private frameContextBuffer;
+    private currentFrameData?;
+    private videoConfig;
+    private inputQueue;
+    private processingQueue;
+    private currentStreamAbortController?;
+    private historyConfig;
+    private maxAudioInputSize;
+    private maxFrameInputSize;
+    private streamingSpeechConfig;
+    private currentSpeechAbortController?;
+    private speechChunkQueue;
+    private nextChunkId;
+    private isSpeaking;
+    private pendingTextBuffer;
+    private speechQueueDonePromise?;
+    private speechQueueDoneResolve?;
+    constructor(options: VideoAgentOptions);
+    /**
+     * Generate a unique session ID
+     */
+    private generateSessionId;
+    /**
+     * Simple hash function for frame comparison
+     */
+    private hashFrame;
+    /**
+     * Ensure the agent has not been destroyed. Throws if it has.
+     */
+    private ensureNotDestroyed;
+    /**
+     * Get current video agent configuration
+     */
+    getConfig(): VideoAgentConfig;
+    /**
+     * Update video agent configuration
+     */
+    updateConfig(config: Partial<VideoAgentConfig>): void;
+    private setupListeners;
+    /**
+     * Handle client ready signal
+     */
+    private handleClientReady;
+    /**
+     * Handle incoming video frame
+     */
+    private handleVideoFrame;
+    /**
+     * Add frame to context buffer
+     */
+    private addFrameToContext;
+    /**
+     * Request client to capture and send a frame
+     */
+    requestFrameCapture(reason: FrameTriggerReason): void;
+    /**
+     * Clean up all in-flight state when the connection drops.
+     */
+    private cleanupOnDisconnect;
+    registerTools(tools: Record<string, Tool>): void;
+    /**
+     * Transcribe audio data to text using the configured transcription model
+     */
+    transcribeAudio(audioData: Buffer | Uint8Array): Promise<string>;
+    /**
+     * Generate speech from text using the configured speech model
+     */
+    generateSpeechFromText(text: string, abortSignal?: AbortSignal): Promise<Uint8Array>;
+    /**
+     * Interrupt ongoing speech generation and playback
+     */
+    interruptSpeech(reason?: string): void;
+    /**
+     * Interrupt both the current LLM stream and ongoing speech
+     */
+    interruptCurrentResponse(reason?: string): void;
+    /**
+     * Extract complete sentences from text buffer
+     */
+    private extractSentences;
+    /**
+     * Trim conversation history to stay within configured limits
+     */
+    private trimHistory;
+    /**
+     * Queue a text chunk for speech generation
+     */
+    private queueSpeechChunk;
+    /**
+     * Generate audio for a single chunk
+     */
+    private generateChunkAudio;
+    /**
+     * Process the speech queue and send audio chunks in order
+     */
+    private processSpeechQueue;
+    /**
+     * Process text delta for streaming speech
+     */
+    private processTextForStreamingSpeech;
+    /**
+     * Flush any remaining text in the buffer to speech
+     */
+    private flushStreamingSpeech;
+    /**
+     * Process incoming audio data: transcribe and generate response
+     */
+    private processAudioInput;
+    connect(url?: string): Promise<void>;
+    /**
+     * Attach an existing WebSocket (server-side usage)
+     */
+    handleSocket(socket: WebSocket): void;
+    /**
+     * Send text input for processing (bypasses transcription)
+     */
+    sendText(text: string): Promise<string>;
+    /**
+     * Send audio data to be transcribed and processed
+     */
+    sendAudio(audioData: string): Promise<void>;
+    /**
+     * Send raw audio buffer to be transcribed and processed
+     */
+    sendAudioBuffer(audioBuffer: Buffer | Uint8Array): Promise<void>;
+    /**
+     * Send a video frame with optional text query for vision analysis
+     */
+    sendFrame(frameData: string, query?: string, options?: {
+        width?: number;
+        height?: number;
+        format?: string;
+    }): Promise<string>;
+    /**
+     * Enqueue a text input for serial processing
+     */
+    private enqueueTextInput;
+    /**
+     * Enqueue a multimodal input (text + frame) for serial processing
+     */
+    private enqueueMultimodalInput;
+    /**
+     * Drain the input queue, processing one request at a time
+     */
+    private drainInputQueue;
+    /**
+     * Build the message content array for multimodal input
+     */
+    private buildMultimodalContent;
+    /**
+     * Process multimodal input (text + video frame)
+     */
+    private processMultimodalInput;
+    /**
+     * Process user input with streaming text generation
+     */
+    private processUserInput;
+    /**
+     * Handle individual stream chunks
+     */
+    private handleStreamChunk;
+    /**
+     * Process the full stream result and return the response text
+     */
+    private processStreamResult;
+    /**
+     * Send a message via WebSocket if connected
+     */
+    private sendWebSocketMessage;
+    /**
+     * Start listening for voice/video input
+     */
+    startListening(): void;
+    /**
+     * Stop listening for voice/video input
+     */
+    stopListening(): void;
+    /**
+     * Clear conversation history
+     */
+    clearHistory(): void;
+    /**
+     * Get current conversation history
+     */
+    getHistory(): ModelMessage[];
+    /**
+     * Set conversation history
+     */
+    setHistory(history: ModelMessage[]): void;
+    /**
+     * Get frame context buffer
+     */
+    getFrameContext(): FrameContext[];
+    /**
+     * Get session ID
+     */
+    getSessionId(): string;
+    /**
+     * Internal helper to close and clean up the current socket
+     */
+    private disconnectSocket;
+    /**
+     * Disconnect from WebSocket and stop all in-flight work
+     */
+    disconnect(): void;
+    /**
+     * Permanently destroy the agent, releasing all resources
+     */
+    destroy(): void;
+    /**
+     * Check if agent is connected to WebSocket
+     */
+    get connected(): boolean;
+    /**
+     * Check if agent is currently processing a request
+     */
+    get processing(): boolean;
+    /**
+     * Check if agent is currently speaking
+     */
+    get speaking(): boolean;
+    /**
+     * Get the number of pending speech chunks in the queue
+     */
+    get pendingSpeechChunks(): number;
+    /**
+     * Check if agent has been permanently destroyed
+     */
+    get destroyed(): boolean;
+    /**
+     * Get current frame sequence number
+     */
+    get currentFrameSequence(): number;
+    /**
+     * Check if there is visual context available
+     */
+    get hasVisualContext(): boolean;
+}
+export type { VideoFrame, AudioData, VideoAgentConfig, FrameContext, FrameTriggerReason, };
+export type { StreamingSpeechConfig, HistoryConfig } from "./types";
+//# sourceMappingURL=VideoAgent.d.ts.map
--- a/dist/VideoAgent.d.ts.map
+++ b/dist/VideoAgent.d.ts.map
--- a/dist/VideoAgent.js
+++ b/dist/VideoAgent.js
--- a/dist/VideoAgent.js.map
+++ b/dist/VideoAgent.js.map
--- a/dist/VoiceAgent.d.ts
+++ b/dist/VoiceAgent.d.ts
@@ -0,0 +1,220 @@
+import { WebSocket } from "ws";
+import { EventEmitter } from "events";
+import { streamText, LanguageModel, type Tool, type ModelMessage, type TranscriptionModel, type SpeechModel } from "ai";
+import { type StreamingSpeechConfig, type HistoryConfig } from "./types";
+export interface VoiceAgentOptions {
+    model: LanguageModel;
+    transcriptionModel?: TranscriptionModel;
+    speechModel?: SpeechModel;
+    instructions?: string;
+    stopWhen?: NonNullable<Parameters<typeof streamText>[0]["stopWhen"]>;
+    tools?: Record<string, Tool>;
+    endpoint?: string;
+    voice?: string;
+    speechInstructions?: string;
+    outputFormat?: string;
+    /** Configuration for streaming speech generation */
+    streamingSpeech?: Partial<StreamingSpeechConfig>;
+    /** Configuration for conversation history memory limits */
+    history?: Partial<HistoryConfig>;
+    /** Maximum audio input size in bytes (default: 10 MB) */
+    maxAudioInputSize?: number;
+}
+export declare class VoiceAgent extends EventEmitter {
+    private socket?;
+    private tools;
+    private model;
+    private transcriptionModel?;
+    private speechModel?;
+    private instructions;
+    private stopWhen;
+    private endpoint?;
+    private isConnected;
+    private conversationHistory;
+    private voice;
+    private speechInstructions?;
+    private outputFormat;
+    private isProcessing;
+    private isDestroyed;
+    private inputQueue;
+    private processingQueue;
+    private currentStreamAbortController?;
+    private historyConfig;
+    private maxAudioInputSize;
+    private streamingSpeechConfig;
+    private currentSpeechAbortController?;
+    private speechChunkQueue;
+    private nextChunkId;
+    private isSpeaking;
+    private pendingTextBuffer;
+    private speechQueueDonePromise?;
+    private speechQueueDoneResolve?;
+    constructor(options: VoiceAgentOptions);
+    /**
+     * Ensure the agent has not been destroyed. Throws if it has.
+     */
+    private ensureNotDestroyed;
+    private setupListeners;
+    /**
+     * Clean up all in-flight state when the connection drops.
+     */
+    private cleanupOnDisconnect;
+    registerTools(tools: Record<string, Tool>): void;
+    /**
+     * Transcribe audio data to text using the configured transcription model
+     */
+    transcribeAudio(audioData: Buffer | Uint8Array): Promise<string>;
+    /**
+     * Generate speech from text using the configured speech model
+     * @param abortSignal Optional signal to cancel the speech generation
+     */
+    generateSpeechFromText(text: string, abortSignal?: AbortSignal): Promise<Uint8Array>;
+    /**
+     * Interrupt ongoing speech generation and playback (barge-in support).
+     * This only interrupts TTS — the LLM stream is left running.
+     */
+    interruptSpeech(reason?: string): void;
+    /**
+     * Interrupt both the current LLM stream and ongoing speech.
+     * Use this for barge-in scenarios where the entire response should be cancelled.
+     */
+    interruptCurrentResponse(reason?: string): void;
+    /**
+     * Extract complete sentences from text buffer
+     * Returns [extractedSentences, remainingBuffer]
+     */
+    private extractSentences;
+    /**
+     * Trim conversation history to stay within configured limits.
+     * Removes oldest messages (always in pairs to preserve user/assistant turns).
+     */
+    private trimHistory;
+    /**
+     * Queue a text chunk for speech generation
+     */
+    private queueSpeechChunk;
+    /**
+     * Generate audio for a single chunk
+     */
+    private generateChunkAudio;
+    /**
+     * Process the speech queue and send audio chunks in order
+     */
+    private processSpeechQueue;
+    /**
+     * Process text delta for streaming speech.
+     * Call this as text chunks arrive from LLM.
+     */
+    private processTextForStreamingSpeech;
+    /**
+     * Flush any remaining text in the buffer to speech
+     * Call this when stream ends
+     */
+    private flushStreamingSpeech;
+    /**
+     * Process incoming audio data: transcribe and generate response
+     */
+    private processAudioInput;
+    connect(url?: string): Promise<void>;
+    /**
+     * Attach an existing WebSocket (server-side usage).
+     * Use this when a WS server accepts a connection and you want the
+     * agent to handle messages on that socket.
+     */
+    handleSocket(socket: WebSocket): void;
+    /**
+     * Send text input for processing (bypasses transcription).
+     * Requests are queued and processed serially to prevent race conditions.
+     */
+    sendText(text: string): Promise<string>;
+    /**
+     * Send audio data to be transcribed and processed
+     * @param audioData Base64 encoded audio data
+     */
+    sendAudio(audioData: string): Promise<void>;
+    /**
+     * Send raw audio buffer to be transcribed and processed
+     */
+    sendAudioBuffer(audioBuffer: Buffer | Uint8Array): Promise<void>;
+    /**
+     * Enqueue a text input for serial processing.
+     * This ensures only one processUserInput runs at a time, preventing
+     * race conditions on conversationHistory, fullText accumulation, etc.
+     */
+    private enqueueInput;
+    /**
+     * Drain the input queue, processing one request at a time.
+     */
+    private drainInputQueue;
+    /**
+     * Process user input with streaming text generation.
+     * Handles the full pipeline: text -> LLM (streaming) -> TTS -> WebSocket.
+     *
+     * This method is designed to be called serially via drainInputQueue().
+     */
+    private processUserInput;
+    /**
+     * Generate speech for full text at once (non-streaming fallback)
+     * Useful when you want to bypass streaming speech for short responses
+     */
+    generateAndSendSpeechFull(text: string): Promise<void>;
+    /**
+     * Send a message via WebSocket if connected.
+     * Gracefully handles send failures (e.g., socket closing mid-send).
+     */
+    private sendWebSocketMessage;
+    /**
+     * Start listening for voice input
+     */
+    startListening(): void;
+    /**
+     * Stop listening for voice input
+     */
+    stopListening(): void;
+    /**
+     * Clear conversation history
+     */
+    clearHistory(): void;
+    /**
+     * Get current conversation history
+     */
+    getHistory(): ModelMessage[];
+    /**
+     * Set conversation history (useful for restoring sessions)
+     */
+    setHistory(history: ModelMessage[]): void;
+    /**
+     * Internal helper to close and clean up the current socket.
+     */
+    private disconnectSocket;
+    /**
+     * Disconnect from WebSocket and stop all in-flight work.
+     */
+    disconnect(): void;
+    /**
+     * Permanently destroy the agent, releasing all resources.
+     * After calling this, the agent cannot be reused.
+     */
+    destroy(): void;
+    /**
+     * Check if agent is connected to WebSocket
+     */
+    get connected(): boolean;
+    /**
+     * Check if agent is currently processing a request
+     */
+    get processing(): boolean;
+    /**
+     * Check if agent is currently speaking (generating/playing audio)
+     */
+    get speaking(): boolean;
+    /**
+     * Get the number of pending speech chunks in the queue
+     */
+    get pendingSpeechChunks(): number;
+    /**
+     * Check if agent has been permanently destroyed
+     */
+    get destroyed(): boolean;
+}
+//# sourceMappingURL=VoiceAgent.d.ts.map
--- a/dist/VoiceAgent.d.ts.map
+++ b/dist/VoiceAgent.d.ts.map
@@ -0,0 +1 @@
+{"version":3,"file":"VoiceAgent.d.ts","sourceRoot":"","sources":["../src/VoiceAgent.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,SAAS,EAAE,MAAM,IAAI,CAAC;AAC/B,OAAO,EAAE,YAAY,EAAE,MAAM,QAAQ,CAAC;AACtC,OAAO,EACL,UAAU,EACV,aAAa,EAEb,KAAK,IAAI,EACT,KAAK,YAAY,EAGjB,KAAK,kBAAkB,EACvB,KAAK,WAAW,EACjB,MAAM,IAAI,CAAC;AACZ,OAAO,EAEL,KAAK,qBAAqB,EAC1B,KAAK,aAAa,EAInB,MAAM,SAAS,CAAC;AAEjB,MAAM,WAAW,iBAAiB;IAChC,KAAK,EAAE,aAAa,CAAC;IACrB,kBAAkB,CAAC,EAAE,kBAAkB,CAAC;IACxC,WAAW,CAAC,EAAE,WAAW,CAAC;IAC1B,YAAY,CAAC,EAAE,MAAM,CAAC;IACtB,QAAQ,CAAC,EAAE,WAAW,CAAC,UAAU,CAAC,OAAO,UAAU,CAAC,CAAC,CAAC,CAAC,CAAC,UAAU,CAAC,CAAC,CAAC;IACrE,KAAK,CAAC,EAAE,MAAM,CAAC,MAAM,EAAE,IAAI,CAAC,CAAC;IAC7B,QAAQ,CAAC,EAAE,MAAM,CAAC;IAClB,KAAK,CAAC,EAAE,MAAM,CAAC;IACf,kBAAkB,CAAC,EAAE,MAAM,CAAC;IAC5B,YAAY,CAAC,EAAE,MAAM,CAAC;IACtB,oDAAoD;IACpD,eAAe,CAAC,EAAE,OAAO,CAAC,qBAAqB,CAAC,CAAC;IACjD,2DAA2D;IAC3D,OAAO,CAAC,EAAE,OAAO,CAAC,aAAa,CAAC,CAAC;IACjC,yDAAyD;IACzD,iBAAiB,CAAC,EAAE,MAAM,CAAC;CAC5B;AAED,qBAAa,UAAW,SAAQ,YAAY;IAC1C,OAAO,CAAC,MAAM,CAAC,CAAY;IAC3B,OAAO,CAAC,KAAK,CAA4B;IACzC,OAAO,CAAC,KAAK,CAAgB;IAC7B,OAAO,CAAC,kBAAkB,CAAC,CAAqB;IAChD,OAAO,CAAC,WAAW,CAAC,CAAc;IAClC,OAAO,CAAC,YAAY,CAAS;IAC7B,OAAO,CAAC,QAAQ,CAA4D;IAC5E,OAAO,CAAC,QAAQ,CAAC,CAAS;IAC1B,OAAO,CAAC,WAAW,CAAS;IAC5B,OAAO,CAAC,mBAAmB,CAAsB;IACjD,OAAO,CAAC,KAAK,CAAS;IACtB,OAAO,CAAC,kBAAkB,CAAC,CAAS;IACpC,OAAO,CAAC,YAAY,CAAS;IAC7B,OAAO,CAAC,YAAY,CAAS;IAC7B,OAAO,CAAC,WAAW,CAAS;IAG5B,OAAO,CAAC,UAAU,CAA2F;IAC7G,OAAO,CAAC,eAAe,CAAS;IAGhC,OAAO,CAAC,4BAA4B,CAAC,CAAkB;IAGvD,OAAO,CAAC,aAAa,CAAgB;IACrC,OAAO,CAAC,iBAAiB,CAAS;IAGlC,OAAO,CAAC,qBAAqB,CAAwB;IACrD,OAAO,CAAC,4BAA4B,CAAC,CAAkB;IACvD,OAAO,CAAC,gBAAgB,CAAqB;IAC7C,OAAO,CAAC,WAAW,CAAK;IACxB,OAAO,CAAC,UAAU,CAAS;IAC3B,OAAO,CAAC,iBAAiB,CAAM;IAG/B,OAAO,CAAC,sBAAsB,CAAC,CAAgB;IAC/C,OAAO,CAAC,sBAAsB,CAAC,CAAa;gBAEhC,OAAO,EAAE,iBAAiB;IA8BtC;;OAEG;IACH,OAAO,CAAC,kBAAkB;IAM1B,OAAO,CAAC,cAAc;IAuDtB;;OAEG;IACH,OAAO,CAAC,mBAAmB;IA8BpB,aAAa,CAAC,KAAK,EAAE,MAAM,CAAC,MAAM,EAAE,IAAI,CAAC;IAIhD;;OAEG;IACU,eAAe,CAAC,SAAS,EAAE,MAAM,GAAG,UAAU,GAAG,OAAO,CAAC,MAAM,CAAC;IAuC7E;;;OAGG;IACU,sBAAsB,CACjC,IAAI,EAAE,MAAM,EACZ,WAAW,CAAC,EAAE,WAAW,GACxB,OAAO,CAAC,UAAU,CAAC;IAiBtB;;;OAGG;IACI,eAAe,CAAC,MAAM,GAAE,MAAsB,GAAG,IAAI;IAgC5D;;;OAGG;IACI,wBAAwB,CAAC,MAAM,GAAE,MAAsB,GAAG,IAAI;IAUrE;;;OAGG;IACH,OAAO,CAAC,gBAAgB;IA8CxB;;;OAGG;IACH,OAAO,CAAC,WAAW;IAmCnB;;OAEG;IACH,OAAO,CAAC,gBAAgB;IAsCxB;;OAEG;YACW,kBAAkB;IAwBhC;;OAEG;YACW,kBAAkB;IA+FhC;;;OAGG;IACH,OAAO,CAAC,6BAA6B;IAarC;;;OAGG;IACH,OAAO,CAAC,oBAAoB;IAO5B;;OAEG;YACW,iBAAiB;IAiDlB,OAAO,CAAC,GAAG,CAAC,EAAE,MAAM,GAAG,OAAO,CAAC,IAAI,CAAC;IA8BjD;;;;OAIG;IACI,YAAY,CAAC,MAAM,EAAE,SAAS,GAAG,IAAI;IAc5C;;;OAGG;IACU,QAAQ,CAAC,IAAI,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC;IAQpD;;;OAGG;IACU,SAAS,CAAC,SAAS,EAAE,MAAM,GAAG,OAAO,CAAC,IAAI,CAAC;IAKxD;;OAEG;IACU,eAAe,CAAC,WAAW,EAAE,MAAM,GAAG,UAAU,GAAG,OAAO,CAAC,IAAI,CAAC;IAM7E;;;;OAIG;IACH,OAAO,CAAC,YAAY;IAOpB;;OAEG;YACW,eAAe;IAmB7B;;;;;OAKG;YACW,gBAAgB;IAuT9B;;;OAGG;IACU,yBAAyB,CAAC,IAAI,EAAE,MAAM,GAAG,OAAO,CAAC,IAAI,CAAC;IA8BnE;;;OAGG;IACH,OAAO,CAAC,oBAAoB;IA2B5B;;OAEG;IACH,cAAc;IAKd;;OAEG;IACH,aAAa;IAKb;;OAEG;IACH,YAAY;IAKZ;;OAEG;IACH,UAAU,IAAI,YAAY,EAAE;IAI5B;;OAEG;IACH,UAAU,CAAC,OAAO,EAAE,YAAY,EAAE;IAIlC;;OAEG;IACH,OAAO,CAAC,gBAAgB;IAmBxB;;OAEG;IACH,UAAU;IAIV;;;OAGG;IACH,OAAO;IAQP;;OAEG;IACH,IAAI,SAAS,IAAI,OAAO,CAEvB;IAED;;OAEG;IACH,IAAI,UAAU,IAAI,OAAO,CAExB;IAED;;OAEG;IACH,IAAI,QAAQ,IAAI,OAAO,CAEtB;IAED;;OAEG;IACH,IAAI,mBAAmB,IAAI,MAAM,CAEhC;IAED;;OAEG;IACH,IAAI,SAAS,IAAI,OAAO,CAEvB;CACF"}
--- a/dist/VoiceAgent.js
+++ b/dist/VoiceAgent.js
--- a/dist/VoiceAgent.js.map
+++ b/dist/VoiceAgent.js.map
--- a/dist/index.d.ts
+++ b/dist/index.d.ts
@@ -0,0 +1,4 @@
+export { VoiceAgent, type VoiceAgentOptions } from "./VoiceAgent";
+export { VideoAgent, type VideoAgentOptions, type VideoFrame, type AudioData, type VideoAgentConfig, type FrameContext, type FrameTriggerReason, } from "./VideoAgent";
+export { type SpeechChunk, type StreamingSpeechConfig, type HistoryConfig, type StopWhenCondition, DEFAULT_STREAMING_SPEECH_CONFIG, DEFAULT_HISTORY_CONFIG, DEFAULT_MAX_AUDIO_SIZE, } from "./types";
+//# sourceMappingURL=index.d.ts.map
--- a/dist/index.d.ts.map
+++ b/dist/index.d.ts.map
@@ -0,0 +1 @@
+{"version":3,"file":"index.d.ts","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AACA,OAAO,EAAE,UAAU,EAAE,KAAK,iBAAiB,EAAE,MAAM,cAAc,CAAC;AAClE,OAAO,EACH,UAAU,EACV,KAAK,iBAAiB,EACtB,KAAK,UAAU,EACf,KAAK,SAAS,EACd,KAAK,gBAAgB,EACrB,KAAK,YAAY,EACjB,KAAK,kBAAkB,GAC1B,MAAM,cAAc,CAAC;AAGtB,OAAO,EACH,KAAK,WAAW,EAChB,KAAK,qBAAqB,EAC1B,KAAK,aAAa,EAClB,KAAK,iBAAiB,EACtB,+BAA+B,EAC/B,sBAAsB,EACtB,sBAAsB,GACzB,MAAM,SAAS,CAAC"}
--- a/dist/index.js
+++ b/dist/index.js
@@ -0,0 +1,14 @@
+"use strict";
+Object.defineProperty(exports, "__esModule", { value: true });
+exports.DEFAULT_MAX_AUDIO_SIZE = exports.DEFAULT_HISTORY_CONFIG = exports.DEFAULT_STREAMING_SPEECH_CONFIG = exports.VideoAgent = exports.VoiceAgent = void 0;
+// Agents
+var VoiceAgent_1 = require("./VoiceAgent");
+Object.defineProperty(exports, "VoiceAgent", { enumerable: true, get: function () { return VoiceAgent_1.VoiceAgent; } });
+var VideoAgent_1 = require("./VideoAgent");
+Object.defineProperty(exports, "VideoAgent", { enumerable: true, get: function () { return VideoAgent_1.VideoAgent; } });
+// Shared types
+var types_1 = require("./types");
+Object.defineProperty(exports, "DEFAULT_STREAMING_SPEECH_CONFIG", { enumerable: true, get: function () { return types_1.DEFAULT_STREAMING_SPEECH_CONFIG; } });
+Object.defineProperty(exports, "DEFAULT_HISTORY_CONFIG", { enumerable: true, get: function () { return types_1.DEFAULT_HISTORY_CONFIG; } });
+Object.defineProperty(exports, "DEFAULT_MAX_AUDIO_SIZE", { enumerable: true, get: function () { return types_1.DEFAULT_MAX_AUDIO_SIZE; } });
+//# sourceMappingURL=index.js.map
--- a/dist/index.js.map
+++ b/dist/index.js.map
@@ -0,0 +1 @@
+{"version":3,"file":"index.js","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":";;;AAAA,SAAS;AACT,2CAAkE;AAAzD,wGAAA,UAAU,OAAA;AACnB,2CAQsB;AAPlB,wGAAA,UAAU,OAAA;AASd,eAAe;AACf,iCAQiB;AAHb,wHAAA,+BAA+B,OAAA;AAC/B,+GAAA,sBAAsB,OAAA;AACtB,+GAAA,sBAAsB,OAAA"}
--- a/dist/types.d.ts
+++ b/dist/types.d.ts
@@ -0,0 +1,46 @@
+import type { streamText } from "ai";
+/**
+ * Represents a chunk of text to be converted to speech
+ */
+export interface SpeechChunk {
+    id: number;
+    text: string;
+    audioPromise?: Promise<Uint8Array | null>;
+}
+/**
+ * Configuration for streaming speech behavior
+ */
+export interface StreamingSpeechConfig {
+    /** Minimum characters before generating speech for a chunk */
+    minChunkSize: number;
+    /** Maximum characters per chunk (will split at sentence boundary before this) */
+    maxChunkSize: number;
+    /** Whether to enable parallel TTS generation */
+    parallelGeneration: boolean;
+    /** Maximum number of parallel TTS requests */
+    maxParallelRequests: number;
+}
+/**
+ * Configuration for conversation history memory management
+ */
+export interface HistoryConfig {
+    /** Maximum number of messages to keep in history. When exceeded, oldest messages are trimmed. Set to 0 for unlimited. */
+    maxMessages: number;
+    /** Maximum total character count across all messages. When exceeded, oldest messages are trimmed. Set to 0 for unlimited. */
+    maxTotalChars: number;
+}
+/**
+ * Default streaming speech configuration
+ */
+export declare const DEFAULT_STREAMING_SPEECH_CONFIG: StreamingSpeechConfig;
+/**
+ * Default history configuration
+ */
+export declare const DEFAULT_HISTORY_CONFIG: HistoryConfig;
+/** Default maximum audio input size (10 MB) */
+export declare const DEFAULT_MAX_AUDIO_SIZE: number;
+/**
+ * Default stop condition type from streamText
+ */
+export type StopWhenCondition = NonNullable<Parameters<typeof streamText>[0]["stopWhen"]>;
+//# sourceMappingURL=types.d.ts.map
--- a/dist/types.d.ts.map
+++ b/dist/types.d.ts.map
@@ -0,0 +1 @@
+{"version":3,"file":"types.d.ts","sourceRoot":"","sources":["../src/types.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EAAE,UAAU,EAAE,MAAM,IAAI,CAAC;AAErC;;GAEG;AACH,MAAM,WAAW,WAAW;IACxB,EAAE,EAAE,MAAM,CAAC;IACX,IAAI,EAAE,MAAM,CAAC;IACb,YAAY,CAAC,EAAE,OAAO,CAAC,UAAU,GAAG,IAAI,CAAC,CAAC;CAC7C;AAED;;GAEG;AACH,MAAM,WAAW,qBAAqB;IAClC,8DAA8D;IAC9D,YAAY,EAAE,MAAM,CAAC;IACrB,iFAAiF;IACjF,YAAY,EAAE,MAAM,CAAC;IACrB,gDAAgD;IAChD,kBAAkB,EAAE,OAAO,CAAC;IAC5B,8CAA8C;IAC9C,mBAAmB,EAAE,MAAM,CAAC;CAC/B;AAED;;GAEG;AACH,MAAM,WAAW,aAAa;IAC1B,yHAAyH;IACzH,WAAW,EAAE,MAAM,CAAC;IACpB,6HAA6H;IAC7H,aAAa,EAAE,MAAM,CAAC;CACzB;AAED;;GAEG;AACH,eAAO,MAAM,+BAA+B,EAAE,qBAK7C,CAAC;AAEF;;GAEG;AACH,eAAO,MAAM,sBAAsB,EAAE,aAGpC,CAAC;AAEF,+CAA+C;AAC/C,eAAO,MAAM,sBAAsB,QAAmB,CAAC;AAEvD;;GAEG;AACH,MAAM,MAAM,iBAAiB,GAAG,WAAW,CAAC,UAAU,CAAC,OAAO,UAAU,CAAC,CAAC,CAAC,CAAC,CAAC,UAAU,CAAC,CAAC,CAAC"}
--- a/dist/types.js
+++ b/dist/types.js
@@ -0,0 +1,22 @@
+"use strict";
+Object.defineProperty(exports, "__esModule", { value: true });
+exports.DEFAULT_MAX_AUDIO_SIZE = exports.DEFAULT_HISTORY_CONFIG = exports.DEFAULT_STREAMING_SPEECH_CONFIG = void 0;
+/**
+ * Default streaming speech configuration
+ */
+exports.DEFAULT_STREAMING_SPEECH_CONFIG = {
+    minChunkSize: 50,
+    maxChunkSize: 200,
+    parallelGeneration: true,
+    maxParallelRequests: 3,
+};
+/**
+ * Default history configuration
+ */
+exports.DEFAULT_HISTORY_CONFIG = {
+    maxMessages: 100,
+    maxTotalChars: 0, // unlimited by default
+};
+/** Default maximum audio input size (10 MB) */
+exports.DEFAULT_MAX_AUDIO_SIZE = 10 * 1024 * 1024;
+//# sourceMappingURL=types.js.map
--- a/dist/types.js.map
+++ b/dist/types.js.map
@@ -0,0 +1 @@
+{"version":3,"file":"types.js","sourceRoot":"","sources":["../src/types.ts"],"names":[],"mappings":";;;AAmCA;;GAEG;AACU,QAAA,+BAA+B,GAA0B;IAClE,YAAY,EAAE,EAAE;IAChB,YAAY,EAAE,GAAG;IACjB,kBAAkB,EAAE,IAAI;IACxB,mBAAmB,EAAE,CAAC;CACzB,CAAC;AAEF;;GAEG;AACU,QAAA,sBAAsB,GAAkB;IACjD,WAAW,EAAE,GAAG;IAChB,aAAa,EAAE,CAAC,EAAE,uBAAuB;CAC5C,CAAC;AAEF,+CAA+C;AAClC,QAAA,sBAAsB,GAAG,EAAE,GAAG,IAAI,GAAG,IAAI,CAAC"}
--- a/dist/utils/StreamBuffer.d.ts
+++ b/dist/utils/StreamBuffer.d.ts
@@ -0,0 +1 @@
+//# sourceMappingURL=StreamBuffer.d.ts.map
--- a/dist/utils/StreamBuffer.d.ts.map
+++ b/dist/utils/StreamBuffer.d.ts.map
@@ -0,0 +1 @@
+{"version":3,"file":"StreamBuffer.d.ts","sourceRoot":"","sources":["../../src/utils/StreamBuffer.ts"],"names":[],"mappings":""}
--- a/dist/utils/StreamBuffer.js
+++ b/dist/utils/StreamBuffer.js
@@ -0,0 +1,2 @@
+"use strict";
+//# sourceMappingURL=StreamBuffer.js.map
--- a/dist/utils/StreamBuffer.js.map
+++ b/dist/utils/StreamBuffer.js.map
@@ -0,0 +1 @@
+{"version":3,"file":"StreamBuffer.js","sourceRoot":"","sources":["../../src/utils/StreamBuffer.ts"],"names":[],"mappings":""}
--- a/example/serve-client.js
+++ b/example/serve-client.js
@@ -0,0 +1,29 @@
+const http = require('http');
+const fs = require('fs');
+const path = require('path');
+
+const PORT = 3000;
+
+// Create a simple HTTP server to serve the voice client HTML
+const server = http.createServer((req, res) => {
+  if (req.url === '/' || req.url === '/index.html') {
+    const htmlPath = path.join(__dirname, 'voice-client.html');
+    fs.readFile(htmlPath, (err, data) => {
+      if (err) {
+        res.writeHead(500);
+        res.end('Error loading voice-client.html');
+        return;
+      }
+      res.writeHead(200, {'Content-Type': 'text/html'});
+      res.end(data);
+    });
+  } else {
+    res.writeHead(404);
+    res.end('Not found');
+  }
+});
+
+server.listen(PORT, () => {
+  console.log(`Voice client available at: http://localhost:${PORT}`);
+  console.log(`Make sure to also start the WebSocket server with: npm run ws:server`);
+});
--- a/example/start-test-environment.sh
+++ b/example/start-test-environment.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+
+# Kill any previously running servers
+echo "Cleaning up any existing processes..."
+pkill -f "node.*example/serve-client.js" || true
+pkill -f "tsx.*example/ws-server.ts" || true
+
+echo "Starting WebSocket server..."
+npm run ws:server &
+WS_SERVER_PID=$!
+
+# Sleep to ensure WebSocket server has time to start
+sleep 2
+
+echo "Starting web client server..."
+npm run client &
+CLIENT_SERVER_PID=$!
+
+echo "✅ Test environment started!"
+echo "📱 Open http://localhost:3000 in your browser"
+echo ""
+echo "Press Ctrl+C to shut down both servers"
+
+# Wait for user to Ctrl+C
+trap "kill $WS_SERVER_PID $CLIENT_SERVER_PID; echo 'Servers stopped'; exit" INT
+wait
--- a/example/voice-client.html
+++ b/example/voice-client.html
@@ -258,11 +258,13 @@
    <!-- Connection -->
    <div class="card">
        <div class="row">
-            <input type="text" id="endpoint" value="ws://localhost:8080" placeholder="WebSocket endpoint" />
+            <input type="text" id="endpoint" value="ws://localhost:8081/ws/voice" placeholder="WebSocket endpoint" />
            <button id="connectBtn" class="primary">Connect</button>
            <button id="disconnectBtn" disabled>Disconnect</button>
        </div>
        <div id="status"><span class="status-dot disconnected"></span>Disconnected</div>
+        <div style="font-size: 13px; color: #666; margin-top: 4px;">Debug: Check browser console (F12) for detailed logs
+        </div>
    </div>

    <!-- Input Controls -->
@@ -440,6 +442,7 @@
            setStatus("Playing audio...", "speaking");

            const { bytes, format } = audioQueue.shift();
+            console.log(`Playing audio chunk: ${bytes.length} bytes, format: ${format}`);

            try {
                const ctx = getAudioContext();
@@ -449,7 +452,10 @@
                );

                try {
+                    console.log(`Attempting to decode audio with WebAudio API...`);
                    const audioBuffer = await ctx.decodeAudioData(arrayBuffer.slice(0));
+                    console.log(`Decoded audio successfully: ${audioBuffer.duration.toFixed(2)}s, ${audioBuffer.numberOfChannels} channels, ${audioBuffer.sampleRate}Hz`);
+
                    await new Promise((resolve) => {
                        const source = ctx.createBufferSource();
                        source.buffer = audioBuffer;
@@ -457,29 +463,49 @@
                        currentAudioSource = source;
                        source.onended = resolve;
                        source.start(0);
+                        console.log(`Audio playback started`);
                    });
+                    console.log(`Audio playback completed`);
                    currentAudioSource = null;
-                } catch (_decodeErr) {
+                } catch (decodeErr) {
+                    console.warn(`WebAudio decode failed, falling back to Audio element:`, decodeErr);
                    const mime = getMimeTypeForFormat(format);
+                    console.log(`Using MIME type: ${mime}`);
+
                    const blob = new Blob([bytes], { type: mime });
                    const url = URL.createObjectURL(blob);
                    const audio = new Audio(url);
+
+                    audio.onerror = (e) => console.error(`Audio element error:`, e);
+                    audio.oncanplaythrough = () => console.log(`Audio ready to play: ${audio.duration.toFixed(2)}s`);
+
                    currentAudioElement = audio;
                    await audio.play();
+                    console.log(`Audio element playback started`);
+
                    await new Promise((resolve) => {
-                        audio.onended = resolve;
-                        audio.onerror = resolve;
+                        audio.onended = () => {
+                            console.log(`Audio element playback completed`);
+                            resolve();
+                        };
+                        audio.onerror = (e) => {
+                            console.error(`Audio element playback failed:`, e);
+                            resolve();
+                        };
                    });
                    currentAudioElement = null;
                    URL.revokeObjectURL(url);
                }
            } catch (err) {
+                console.error(`Audio playback error:`, err);
                log(`Audio play error: ${err?.message || err}`);
            } finally {
                isPlaying = false;
                if (audioQueue.length > 0) {
+                    console.log(`${audioQueue.length} more audio chunks in queue, continuing playback`);
                    playNextAudioChunk();
                } else if (connected) {
+                    console.log(`Audio queue empty, returning to ${whisperListening || micShouldRun ? 'listening' : 'connected'} state`);
                    setStatus(whisperListening || micShouldRun ? "Listening..." : "Connected", whisperListening || micShouldRun ? "listening" : "connected");
                }
            }
@@ -521,6 +547,7 @@
                case "webm":
                    return "audio/webm";
                default:
+                    console.log(`Unknown audio format: ${format}, defaulting to mpeg`);
                    return `audio/${format || "mpeg"}`;
            }
        }
@@ -543,7 +570,10 @@
                analyserNode = ctx.createAnalyser();
                analyserNode.fftSize = 256;
                analyserSource.connect(analyserNode);
-            } catch (_) { }
+                console.log('Audio analyser setup complete');
+            } catch (err) {
+                console.error('Audio analyser setup failed:', err);
+            }
        }

        function teardownAnalyser() {
@@ -671,15 +701,26 @@
         */
        async function startWhisperListening() {
            try {
+                console.log("Starting Whisper VAD listening");
                mediaStream = await navigator.mediaDevices.getUserMedia({
                    audio: {
                        channelCount: 1,
                        sampleRate: 16000,
                        echoCancellation: true,
                        noiseSuppression: true,
+                        autoGainControl: true
                    }
                });
+
+                // Log the actual constraints we got
+                const tracks = mediaStream.getAudioTracks();
+                if (tracks.length > 0) {
+                    const settings = tracks[0].getSettings();
+                    console.log('Audio track settings:', settings);
+                    log(`🎵 Audio: ${settings.sampleRate}Hz, ${settings.channelCount}ch`);
+                }
            } catch (err) {
+                console.error("Mic permission error:", err);
                log(`Mic permission failed: ${err?.message || err}`);
                setStatus("Mic permission denied", "disconnected");
                return;
@@ -808,13 +849,45 @@
            whisperSegmentActive = true;
            segmentStartTime = Date.now();

-            const mimeType = MediaRecorder.isTypeSupported("audio/webm;codecs=opus")
-                ? "audio/webm;codecs=opus"
-                : MediaRecorder.isTypeSupported("audio/mp4")
-                    ? "audio/mp4"
-                    : "audio/webm";
+            // Try to choose the best format for Whisper compatibility
+            let mimeType = '';
+            const supportedTypes = [
+                "audio/webm;codecs=opus",  // Best compatibility with Whisper
+                "audio/webm",
+                "audio/ogg;codecs=opus",
+                "audio/mp4",
+                "audio/wav"
+            ];

-            mediaRecorder = new MediaRecorder(mediaStream, { mimeType });
+            for (const type of supportedTypes) {
+                if (MediaRecorder.isTypeSupported(type)) {
+                    mimeType = type;
+                    break;
+                }
+            }
+
+            if (!mimeType) {
+                console.warn("No preferred MIME types supported, using default");
+                mimeType = ''; // Let the browser choose
+            }
+
+            console.log(`Using MediaRecorder with MIME type: ${mimeType || 'browser default'}`);
+
+            // Create recorder with bitrate suitable for speech
+            const recorderOptions = {
+                mimeType: mimeType,
+                audioBitsPerSecond: 128000 // 128kbps is good for speech
+            };
+
+            try {
+                mediaRecorder = new MediaRecorder(mediaStream, recorderOptions);
+                console.log(`MediaRecorder created with mimeType: ${mediaRecorder.mimeType}`);
+            } catch (err) {
+                console.error("MediaRecorder creation failed:", err);
+                // Fallback to default options
+                mediaRecorder = new MediaRecorder(mediaStream);
+                console.log(`Fallback MediaRecorder created with mimeType: ${mediaRecorder.mimeType}`);
+            }

            mediaRecorder.ondataavailable = (event) => {
                if (event.data.size > 0) audioChunks.push(event.data);
@@ -843,27 +916,45 @@
                    return;
                }

-                const blob = new Blob(audioChunks, { type: mediaRecorder.mimeType });
-                audioChunks = [];
+                try {
+                    const blob = new Blob(audioChunks, { type: mediaRecorder.mimeType });
+                    audioChunks = [];

-                const arrayBuffer = await blob.arrayBuffer();
-                const uint8 = new Uint8Array(arrayBuffer);
+                    const arrayBuffer = await blob.arrayBuffer();
+                    const uint8 = new Uint8Array(arrayBuffer);

-                // Base64 encode in chunks to avoid stack overflow
-                let binary = "";
-                const chunkSize = 8192;
-                for (let i = 0; i < uint8.length; i += chunkSize) {
-                    const slice = uint8.subarray(i, Math.min(i + chunkSize, uint8.length));
-                    binary += String.fromCharCode.apply(null, slice);
-                }
-                const base64 = btoa(binary);
+                    // Base64 encode in chunks to avoid stack overflow
+                    let binary = "";
+                    const chunkSize = 8192;
+                    for (let i = 0; i < uint8.length; i += chunkSize) {
+                        const slice = uint8.subarray(i, Math.min(i + chunkSize, uint8.length));
+                        binary += String.fromCharCode.apply(null, slice);
+                    }
+                    const base64 = btoa(binary);

-                resetOutputPanels();
+                    // Log audio size for debugging
+                    console.log(`Prepared audio segment: ${(uint8.length / 1024).toFixed(1)}KB, duration: ${duration}ms, mime: ${mediaRecorder.mimeType}`);

-                if (ws && connected) {
-                    ws.send(JSON.stringify({ type: "audio", data: base64 }));
-                    log(`→ Sent audio segment (${(uint8.length / 1024).toFixed(1)} KB, ${duration}ms) for Whisper`);
-                    transcriptEl.textContent = "🎙️ Transcribing audio...";
+                    resetOutputPanels();
+
+                    if (ws && connected) {
+                        // Add format information to help server decode the audio
+                        const message = {
+                            type: "audio",
+                            data: base64,
+                            format: mediaRecorder.mimeType,
+                            sampleRate: 16000,  // Match the constraint we requested
+                            duration: duration
+                        };
+
+                        console.log(`Sending audio to server: ${(base64.length / 1000).toFixed(1)}KB, format: ${mediaRecorder.mimeType}`);
+                        ws.send(JSON.stringify(message));
+                        log(`→ Sent audio segment (${(uint8.length / 1024).toFixed(1)} KB, ${duration}ms) for Whisper`);
+                        transcriptEl.textContent = "🎙️ Transcribing audio...";
+                    }
+                } catch (err) {
+                    console.error("Error processing audio segment:", err);
+                    log(`❌ Error processing audio: ${err.message || err}`);
                }
            };

@@ -921,6 +1012,18 @@
        // ── Server Message Handler ──────────────────────────────────────────
        function handleServerMessage(msg) {
            switch (msg.type) {
+                // ── Transcription feedback ────────────────
+                case "transcription_result":
+                    console.log(`Received transcription: "${msg.text}", language: ${msg.language || 'unknown'}`);
+                    transcriptEl.textContent = msg.text;
+                    log(`🎙️ Transcription: ${msg.text}`);
+                    break;
+
+                case "transcription_error":
+                    console.error(`Transcription error: ${msg.error}`);
+                    transcriptEl.textContent = `⚠️ ${msg.error}`;
+                    log(`❌ Transcription error: ${msg.error}`);
+                    break;
                // ── Stream lifecycle ────────────────────
                case "stream_start":
                    assistantEl.textContent = "";
@@ -1022,8 +1125,9 @@

                case "audio_chunk": {
                    const bytes = decodeBase64ToBytes(msg.data);
-                    audioQueue.push({ bytes, format: msg.format || "opus" });
-                    log(`🔊 Audio chunk #${msg.chunkId ?? "?"} (${bytes.length} bytes, ${msg.format || "opus"})`);
+                    audioQueue.push({ bytes, format: msg.format || "mp3" });
+                    log(`🔊 Audio chunk #${msg.chunkId ?? "?"} (${bytes.length} bytes, ${msg.format || "mp3"})`);
+                    console.log(`Received audio chunk #${msg.chunkId ?? "?"}: ${bytes.length} bytes, format: ${msg.format || "mp3"}`);
                    playNextAudioChunk();
                    break;
                }
@@ -1096,9 +1200,12 @@

            ws.onmessage = (event) => {
                try {
+                    console.log(`← Received WebSocket message: ${event.data.length} bytes`);
                    const msg = JSON.parse(event.data);
+                    console.log('Message parsed:', msg.type);
                    handleServerMessage(msg);
-                } catch {
+                } catch (err) {
+                    console.error('Parse error:', err);
                    log("Received non-JSON message");
                }
            };
@@ -1146,4 +1253,4 @@
    </script>
 </body>

-</html>
+</html>
--- a/example/ws-server-2.ts
+++ b/example/ws-server-2.ts
@@ -96,10 +96,16 @@ Use tools when needed to provide accurate information.`,
        console.log(`[ws-server] 🔊 Audio chunk #${chunkId}: ${uint8Array.length} bytes (${format})`);
    });

+    agent.on("history_trimmed", ({ removedCount, reason }: { removedCount: number; reason: string }) => {
+        console.log(`[ws-server] 🧹 History trimmed: removed ${removedCount} messages (${reason})`);
+    });
+
    agent.on("error", (err: Error) => console.error("[ws-server] ❌ Error:", err.message));

    agent.on("disconnected", () => {
-        console.log("[ws-server] ✗ client disconnected\n");
+        // Permanently release all agent resources for this connection
+        agent.destroy();
+        console.log("[ws-server] ✗ client disconnected (agent destroyed)\n");
    });

    // Hand the accepted socket to the agent – this is the key line.
--- a/example/ws-server.ts
+++ b/example/ws-server.ts
@@ -40,6 +40,7 @@ const wss = new WebSocketServer({ port, host });
 wss.on("listening", () => {
    console.log(`[ws-server] listening on ${endpoint}`);
    console.log("[ws-server] Waiting for connections...\n");
+    console.log(`[ws-server] 🌐 Open voice-client.html in your browser and connect to ${endpoint}`);
 });

 wss.on("connection", (socket) => {
@@ -55,12 +56,12 @@ Keep responses concise and conversational since they will be spoken aloud.
 Use tools when needed to provide accurate information.`,
        voice: "alloy",
        speechInstructions: "Speak in a friendly, natural conversational tone.",
-        outputFormat: "opus",
+        outputFormat: "mp3",   // Using mp3 for better browser compatibility
        streamingSpeech: {
-            minChunkSize: 40,
-            maxChunkSize: 180,
-            parallelGeneration: true,
-            maxParallelRequests: 2,
+            minChunkSize: 20,          // Smaller chunks for faster streaming
+            maxChunkSize: 150,         // Not too large to ensure timely audio delivery
+            parallelGeneration: true,  // Generate audio chunks in parallel
+            maxParallelRequests: 3,    // Allow up to 3 concurrent TTS requests
        },
        tools: {
            getWeather: weatherTool,
@@ -96,6 +97,50 @@ Use tools when needed to provide accurate information.`,
        console.log(`[ws-server] 🔊 Audio chunk #${chunkId}: ${uint8Array.length} bytes (${format})`);
    });

+    // Log raw WebSocket messages for debugging
+    const originalSend = socket.send.bind(socket);
+
+    // Define a wrapper function to log messages before sending
+    function loggedSend(data: any): void {
+        let dataSize = 'unknown size';
+        if (typeof data === 'string') {
+            dataSize = `${data.length} chars`;
+        } else if (data instanceof Buffer || data instanceof ArrayBuffer) {
+            dataSize = `${Buffer.byteLength(data)} bytes`;
+        } else if (data instanceof Uint8Array) {
+            dataSize = `${data.byteLength} bytes`;
+        }
+        console.log(`[ws-server] → Sending WebSocket data (${dataSize})`);
+    }
+
+    // Create a proxy for the send method that logs but preserves original signatures
+    socket.send = function (data: any, optionsOrCallback?: any, callback?: (err?: Error) => void): void {
+        loggedSend(data);
+
+        if (typeof optionsOrCallback === 'function') {
+            // Handle the (data, callback) signature
+            return originalSend(data, optionsOrCallback);
+        } else if (optionsOrCallback) {
+            // Handle the (data, options, callback) signature
+            return originalSend(data, optionsOrCallback, callback);
+        } else {
+            // Handle the (data) signature
+            return originalSend(data);
+        }
+    };
+
+    socket.on('message', (data: any) => {
+        let dataSize = 'unknown size';
+        if (data instanceof Buffer) {
+            dataSize = `${data.length} bytes`;
+        } else if (data instanceof ArrayBuffer) {
+            dataSize = `${data.byteLength} bytes`;
+        } else if (typeof data === 'string') {
+            dataSize = `${data.length} chars`;
+        }
+        console.log(`[ws-server] ← Received WebSocket data (${dataSize})`);
+    });
+
    agent.on("transcription", ({ text, language }: { text: string; language?: string }) => {
        console.log(`[ws-server] 📝 Transcription (${language || "unknown"}): ${text}`);
    });
@@ -116,10 +161,16 @@ Use tools when needed to provide accurate information.`,
        console.log(`[ws-server] 🔊 Queued speech chunk #${id}: ${text.substring(0, 50)}...`);
    });

+    agent.on("history_trimmed", ({ removedCount, reason }: { removedCount: number; reason: string }) => {
+        console.log(`[ws-server] 🧹 History trimmed: removed ${removedCount} messages (${reason})`);
+    });
+
    agent.on("error", (err: Error) => console.error("[ws-server] ❌ Error:", err.message));

    agent.on("disconnected", () => {
-        console.log("[ws-server] ✗ client disconnected\n");
+        // Permanently release all agent resources for this connection
+        agent.destroy();
+        console.log("[ws-server] ✗ client disconnected (agent destroyed)\n");
    });

    // Hand the accepted socket to the agent – this is the key line.
--- a/output.mp3
+++ b/output.mp3
--- a/output_chunk_0.mp3
+++ b/output_chunk_0.mp3
--- a/package.json
+++ b/package.json
@@ -1,13 +1,20 @@
 {
  "name": "voice-agent-ai-sdk",
-  "version": "0.0.1",
+  "version": "0.2.1-beta.0",
  "description": "Voice AI Agent with ai-sdk",
-  "main": "src/index.ts",
+  "main": "dist/index.js",
+  "types": "dist/index.d.ts",
+  "files": [
+    "dist",
+    "README.md",
+    "LICENSE"
+  ],
  "scripts": {
    "build": "tsc",
    "dev": "tsc -w",
    "demo": "tsx example/demo.ts",
    "ws:server": "tsx example/ws-server.ts",
+    "client": "node example/serve-client.js",
    "prepublishOnly": "pnpm build"
  },
  "keywords": [
@@ -15,23 +22,38 @@
    "websocket",
    "ai",
    "agent",
-    "tools"
+    "tools",
+    "tts",
+    "speech",
+    "ai-sdk",
+    "streaming"
  ],
  "author": "Bijit Mondal",
  "license": "MIT",
+  "repository": {
+    "type": "git",
+    "url": "git+https://github.com/Bijit-Mondal/voiceAgent.git"
+  },
+  "bugs": {
+    "url": "https://github.com/Bijit-Mondal/voiceAgent/issues"
+  },
+  "homepage": "https://github.com/Bijit-Mondal/voiceAgent#readme",
  "packageManager": "pnpm@10.27.0",
-  "devDependencies": {
-    "@ai-sdk/openai": "^3.0.28",
-    "@types/node": "^25.2.3",
-    "@types/ws": "^8.18.1",
-    "tsx": "^4.20.5",
-    "typescript": "^5.9.3"
+  "peerDependencies": {
+    "ai": "^6.0.0"
  },
  "dependencies": {
-    "ai": "^6.0.85",
    "dotenv": "^17.2.3",
    "ws": "^8.19.0",
    "zod": "^4.3.6",
    "zod-to-json-schema": "^3.25.1"
+  },
+  "devDependencies": {
+    "@ai-sdk/openai": "^3.0.28",
+    "@types/node": "^25.2.3",
+    "@types/ws": "^8.18.1",
+    "ai": "^6.0.85",
+    "tsx": "^4.20.5",
+    "typescript": "^5.9.3"
  }
-}
+}
--- a/src/VideoAgent.ts
+++ b/src/VideoAgent.ts
--- a/src/VoiceAgent.ts
+++ b/src/VoiceAgent.ts
@@ -11,29 +11,14 @@ import {
  type TranscriptionModel,
  type SpeechModel,
 } from "ai";
-
-/**
- * Represents a chunk of text to be converted to speech
- */
-interface SpeechChunk {
-  id: number;
-  text: string;
-  audioPromise?: Promise<Uint8Array | null>;
-}
-
-/**
- * Configuration for streaming speech behavior
- */
-interface StreamingSpeechConfig {
-  /** Minimum characters before generating speech for a chunk */
-  minChunkSize: number;
-  /** Maximum characters per chunk (will split at sentence boundary before this) */
-  maxChunkSize: number;
-  /** Whether to enable parallel TTS generation */
-  parallelGeneration: boolean;
-  /** Maximum number of parallel TTS requests */
-  maxParallelRequests: number;
-}
+import {
+  type SpeechChunk,
+  type StreamingSpeechConfig,
+  type HistoryConfig,
+  DEFAULT_STREAMING_SPEECH_CONFIG,
+  DEFAULT_HISTORY_CONFIG,
+  DEFAULT_MAX_AUDIO_SIZE,
+} from "./types";

 export interface VoiceAgentOptions {
  model: LanguageModel; // AI SDK Model for chat (e.g., openai('gpt-4o'))
@@ -48,6 +33,10 @@ export interface VoiceAgentOptions {
  outputFormat?: string; // Audio output format (e.g., 'mp3', 'opus', 'wav')
  /** Configuration for streaming speech generation */
  streamingSpeech?: Partial<StreamingSpeechConfig>;
+  /** Configuration for conversation history memory limits */
+  history?: Partial<HistoryConfig>;
+  /** Maximum audio input size in bytes (default: 10 MB) */
+  maxAudioInputSize?: number;
 }

 export class VoiceAgent extends EventEmitter {
@@ -65,6 +54,18 @@ export class VoiceAgent extends EventEmitter {
  private speechInstructions?: string;
  private outputFormat: string;
  private isProcessing = false;
+  private isDestroyed = false;
+
+  // Concurrency: queue incoming requests so they run serially
+  private inputQueue: Array<{ text: string; resolve: (v: string) => void; reject: (e: unknown) => void }> = [];
+  private processingQueue = false;
+
+  // Abort controller for the current LLM stream so we can cancel it on interrupt/disconnect
+  private currentStreamAbortController?: AbortController;
+
+  // Memory management
+  private historyConfig: HistoryConfig;
+  private maxAudioInputSize: number;

  // Streaming speech state
  private streamingSpeechConfig: StreamingSpeechConfig;
@@ -74,6 +75,10 @@ export class VoiceAgent extends EventEmitter {
  private isSpeaking = false;
  private pendingTextBuffer = "";

+  // Promise-based signal for speech queue completion (replaces busy-wait polling)
+  private speechQueueDonePromise?: Promise<void>;
+  private speechQueueDoneResolve?: () => void;
+
  constructor(options: VoiceAgentOptions) {
    super();
    this.model = options.model;
@@ -86,18 +91,31 @@ export class VoiceAgent extends EventEmitter {
    this.voice = options.voice || "alloy";
    this.speechInstructions = options.speechInstructions;
    this.outputFormat = options.outputFormat || "mp3";
+    this.maxAudioInputSize = options.maxAudioInputSize ?? DEFAULT_MAX_AUDIO_SIZE;
    if (options.tools) {
      this.tools = { ...options.tools };
    }

    // Initialize streaming speech config with defaults
    this.streamingSpeechConfig = {
-      minChunkSize: 50,
-      maxChunkSize: 200,
-      parallelGeneration: true,
-      maxParallelRequests: 3,
+      ...DEFAULT_STREAMING_SPEECH_CONFIG,
      ...options.streamingSpeech,
    };
+
+    // Initialize history config with defaults
+    this.historyConfig = {
+      ...DEFAULT_HISTORY_CONFIG,
+      ...options.history,
+    };
+  }
+
+  /**
+   * Ensure the agent has not been destroyed. Throws if it has.
+   */
+  private ensureNotDestroyed(): void {
+    if (this.isDestroyed) {
+      throw new Error("VoiceAgent has been destroyed and cannot be used");
+    }
  }

  private setupListeners() {
@@ -106,26 +124,34 @@ export class VoiceAgent extends EventEmitter {
    this.socket.on("message", async (data) => {
      try {
        const message = JSON.parse(data.toString());
+        console.log(`Received WebSocket message of type: ${message.type}`);

        // Handle transcribed text from the client/STT
        if (message.type === "transcript") {
-          // Interrupt ongoing speech when user starts speaking (barge-in)
-          if (this.isSpeaking) {
-            this.interruptSpeech("user_speaking");
+          if (typeof message.text !== "string" || !message.text.trim()) {
+            this.emit("warning", "Received empty or invalid transcript message");
+            return;
          }
-          await this.processUserInput(message.text);
+          // Interrupt ongoing speech when user starts speaking (barge-in)
+          this.interruptCurrentResponse("user_speaking");
+          console.log(`Processing transcript: "${message.text}"`);
+          await this.enqueueInput(message.text);
        }
        // Handle raw audio data that needs transcription
-        if (message.type === "audio") {
-          // Interrupt ongoing speech when user starts speaking (barge-in)
-          if (this.isSpeaking) {
-            this.interruptSpeech("user_speaking");
+        else if (message.type === "audio") {
+          if (typeof message.data !== "string" || !message.data) {
+            this.emit("warning", "Received empty or invalid audio message");
+            return;
          }
-          await this.processAudioInput(message.data);
+          // Interrupt ongoing speech when user starts speaking (barge-in)
+          this.interruptCurrentResponse("user_speaking");
+          console.log(`Received audio data (${message.data.length / 1000}KB) for processing, format: ${message.format || 'unknown'}`);
+          await this.processAudioInput(message.data, message.format);
        }
        // Handle explicit interrupt request from client
-        if (message.type === "interrupt") {
-          this.interruptSpeech(message.reason || "client_request");
+        else if (message.type === "interrupt") {
+          console.log(`Received interrupt request: ${message.reason || "client_request"}`);
+          this.interruptCurrentResponse(message.reason || "client_request");
        }
      } catch (err) {
        console.error("Failed to process message:", err);
@@ -136,6 +162,8 @@ export class VoiceAgent extends EventEmitter {
    this.socket.on("close", () => {
      console.log("Disconnected");
      this.isConnected = false;
+      // Clean up all in-flight work when the socket closes
+      this.cleanupOnDisconnect();
      this.emit("disconnected");
    });

@@ -145,6 +173,39 @@ export class VoiceAgent extends EventEmitter {
    });
  }

+  /**
+   * Clean up all in-flight state when the connection drops.
+   */
+  private cleanupOnDisconnect(): void {
+    // Abort ongoing LLM stream
+    if (this.currentStreamAbortController) {
+      this.currentStreamAbortController.abort();
+      this.currentStreamAbortController = undefined;
+    }
+    // Abort ongoing speech generation
+    if (this.currentSpeechAbortController) {
+      this.currentSpeechAbortController.abort();
+      this.currentSpeechAbortController = undefined;
+    }
+    // Clear speech state
+    this.speechChunkQueue = [];
+    this.pendingTextBuffer = "";
+    this.isSpeaking = false;
+    this.isProcessing = false;
+    // Resolve any pending speech-done waiters
+    if (this.speechQueueDoneResolve) {
+      this.speechQueueDoneResolve();
+      this.speechQueueDoneResolve = undefined;
+      this.speechQueueDonePromise = undefined;
+    }
+    // Reject any queued inputs
+    for (const item of this.inputQueue) {
+      item.reject(new Error("Connection closed"));
+    }
+    this.inputQueue = [];
+    this.processingQueue = false;
+  }
+
  public registerTools(tools: Record<string, Tool>) {
    this.tools = { ...this.tools, ...tools };
  }
@@ -157,17 +218,38 @@ export class VoiceAgent extends EventEmitter {
      throw new Error("Transcription model not configured");
    }

-    const result = await transcribe({
-      model: this.transcriptionModel,
-      audio: audioData,
-    });
+    console.log(`Sending ${audioData.byteLength} bytes to Whisper for transcription`);

-    this.emit("transcription", {
-      text: result.text,
-      language: result.language,
-    });
+    try {
+      // Note: The AI SDK transcribe function only accepts these parameters
+      // We can't directly pass language or temperature to it
+      const result = await transcribe({
+        model: this.transcriptionModel,
+        audio: audioData,
+        // If we need to pass additional options to OpenAI Whisper,
+        // we would need to do it via providerOptions if supported
+      });

-    return result.text;
+      console.log(`Whisper transcription result: "${result.text}", language: ${result.language || 'unknown'}`);
+
+      this.emit("transcription", {
+        text: result.text,
+        language: result.language,
+      });
+
+      // Also send transcription to client for immediate feedback
+      this.sendWebSocketMessage({
+        type: "transcription_result",
+        text: result.text,
+        language: result.language,
+      });
+
+      return result.text;
+    } catch (error) {
+      console.error("Whisper transcription failed:", error);
+      // Re-throw to be handled by the caller
+      throw error;
+    }
  }

  /**
@@ -195,7 +277,8 @@ export class VoiceAgent extends EventEmitter {
  }

  /**
-   * Interrupt ongoing speech generation and playback (barge-in support)
+   * Interrupt ongoing speech generation and playback (barge-in support).
+   * This only interrupts TTS — the LLM stream is left running.
   */
  public interruptSpeech(reason: string = "interrupted"): void {
    if (!this.isSpeaking && this.speechChunkQueue.length === 0) {
@@ -213,6 +296,13 @@ export class VoiceAgent extends EventEmitter {
    this.pendingTextBuffer = "";
    this.isSpeaking = false;

+    // Resolve any pending speech-done waiters so processUserInput can finish
+    if (this.speechQueueDoneResolve) {
+      this.speechQueueDoneResolve();
+      this.speechQueueDoneResolve = undefined;
+      this.speechQueueDonePromise = undefined;
+    }
+
    // Notify clients to stop audio playback
    this.sendWebSocketMessage({
      type: "speech_interrupted",
@@ -222,6 +312,20 @@ export class VoiceAgent extends EventEmitter {
    this.emit("speech_interrupted", { reason });
  }

+  /**
+   * Interrupt both the current LLM stream and ongoing speech.
+   * Use this for barge-in scenarios where the entire response should be cancelled.
+   */
+  public interruptCurrentResponse(reason: string = "interrupted"): void {
+    // Abort the LLM stream first
+    if (this.currentStreamAbortController) {
+      this.currentStreamAbortController.abort();
+      this.currentStreamAbortController = undefined;
+    }
+    // Then interrupt speech
+    this.interruptSpeech(reason);
+  }
+
  /**
   * Extract complete sentences from text buffer
   * Returns [extractedSentences, remainingBuffer]
@@ -272,17 +376,68 @@ export class VoiceAgent extends EventEmitter {
    return [sentences, remaining];
  }

+  /**
+   * Trim conversation history to stay within configured limits.
+   * Removes oldest messages (always in pairs to preserve user/assistant turns).
+   */
+  private trimHistory(): void {
+    const { maxMessages, maxTotalChars } = this.historyConfig;
+
+    // Trim by message count
+    if (maxMessages > 0 && this.conversationHistory.length > maxMessages) {
+      const excess = this.conversationHistory.length - maxMessages;
+      // Remove from the front, ensuring we remove at least `excess` messages
+      // Round up to even number to preserve turn pairs
+      const toRemove = excess % 2 === 0 ? excess : excess + 1;
+      this.conversationHistory.splice(0, toRemove);
+      this.emit("history_trimmed", { removedCount: toRemove, reason: "max_messages" });
+    }
+
+    // Trim by total character count
+    if (maxTotalChars > 0) {
+      let totalChars = this.conversationHistory.reduce((sum, msg) => {
+        const content = typeof msg.content === "string" ? msg.content : JSON.stringify(msg.content);
+        return sum + content.length;
+      }, 0);
+
+      let removedCount = 0;
+      while (totalChars > maxTotalChars && this.conversationHistory.length > 2) {
+        const removed = this.conversationHistory.shift();
+        if (removed) {
+          const content = typeof removed.content === "string" ? removed.content : JSON.stringify(removed.content);
+          totalChars -= content.length;
+          removedCount++;
+        }
+      }
+      if (removedCount > 0) {
+        this.emit("history_trimmed", { removedCount, reason: "max_total_chars" });
+      }
+    }
+  }
+
  /**
   * Queue a text chunk for speech generation
   */
  private queueSpeechChunk(text: string): void {
    if (!this.speechModel || !text.trim()) return;

+    // Wrap chunk ID to prevent unbounded growth in very long sessions
+    if (this.nextChunkId >= Number.MAX_SAFE_INTEGER) {
+      this.nextChunkId = 0;
+    }
+
    const chunk: SpeechChunk = {
      id: this.nextChunkId++,
      text: text.trim(),
    };

+    // Create the speech-done promise if not already present
+    if (!this.speechQueueDonePromise) {
+      this.speechQueueDonePromise = new Promise<void>((resolve) => {
+        this.speechQueueDoneResolve = resolve;
+      });
+    }
+
    // Start generating audio immediately (parallel generation)
    if (this.streamingSpeechConfig.parallelGeneration) {
      const activeRequests = this.speechChunkQueue.filter(c => c.audioPromise).length;
@@ -310,13 +465,16 @@ export class VoiceAgent extends EventEmitter {
    }

    try {
+      console.log(`Generating audio for chunk ${chunk.id}: "${chunk.text.substring(0, 50)}${chunk.text.length > 50 ? '...' : ''}"`);
      const audioData = await this.generateSpeechFromText(
        chunk.text,
        this.currentSpeechAbortController.signal
      );
+      console.log(`Generated audio for chunk ${chunk.id}: ${audioData.length} bytes`);
      return audioData;
    } catch (error) {
      if ((error as Error).name === "AbortError") {
+        console.log(`Audio generation aborted for chunk ${chunk.id}`);
        return null; // Cancelled, don't report as error
      }
      console.error(`Failed to generate audio for chunk ${chunk.id}:`, error);
@@ -332,6 +490,7 @@ export class VoiceAgent extends EventEmitter {
    if (this.isSpeaking) return;
    this.isSpeaking = true;

+    console.log(`Starting speech queue processing with ${this.speechChunkQueue.length} chunks`);
    this.emit("speech_start", { streaming: true });
    this.sendWebSocketMessage({ type: "speech_stream_start" });

@@ -339,6 +498,8 @@ export class VoiceAgent extends EventEmitter {
      while (this.speechChunkQueue.length > 0) {
        const chunk = this.speechChunkQueue[0];

+        console.log(`Processing speech chunk #${chunk.id} (${this.speechChunkQueue.length - 1} remaining)`);
+
        // Ensure audio generation has started
        if (!chunk.audioPromise) {
          chunk.audioPromise = this.generateChunkAudio(chunk);
@@ -348,13 +509,17 @@ export class VoiceAgent extends EventEmitter {
        const audioData = await chunk.audioPromise;

        // Check if we were interrupted while waiting
-        if (!this.isSpeaking) break;
+        if (!this.isSpeaking) {
+          console.log(`Speech interrupted during chunk #${chunk.id}`);
+          break;
+        }

        // Remove from queue after processing
        this.speechChunkQueue.shift();

        if (audioData) {
          const base64Audio = Buffer.from(audioData).toString("base64");
+          console.log(`Sending audio chunk #${chunk.id} (${audioData.length} bytes, ${this.outputFormat})`);

          // Send audio chunk via WebSocket
          this.sendWebSocketMessage({
@@ -373,6 +538,8 @@ export class VoiceAgent extends EventEmitter {
            text: chunk.text,
            uint8Array: audioData,
          });
+        } else {
+          console.log(`No audio data generated for chunk #${chunk.id}`);
        }

        // Start generating next chunks in parallel
@@ -383,26 +550,40 @@ export class VoiceAgent extends EventEmitter {
            this.speechChunkQueue.length
          );

-          for (let i = 0; i < toStart; i++) {
-            const nextChunk = this.speechChunkQueue.find(c => !c.audioPromise);
-            if (nextChunk) {
-              nextChunk.audioPromise = this.generateChunkAudio(nextChunk);
+          if (toStart > 0) {
+            console.log(`Starting parallel generation for ${toStart} more chunks`);
+            for (let i = 0; i < toStart; i++) {
+              const nextChunk = this.speechChunkQueue.find(c => !c.audioPromise);
+              if (nextChunk) {
+                nextChunk.audioPromise = this.generateChunkAudio(nextChunk);
+              }
            }
          }
        }
      }
+    } catch (error) {
+      console.error("Error in speech queue processing:", error);
+      this.emit("error", error);
    } finally {
      this.isSpeaking = false;
      this.currentSpeechAbortController = undefined;

+      // Signal that the speech queue is fully drained
+      if (this.speechQueueDoneResolve) {
+        this.speechQueueDoneResolve();
+        this.speechQueueDoneResolve = undefined;
+        this.speechQueueDonePromise = undefined;
+      }
+
+      console.log(`Speech queue processing complete`);
      this.sendWebSocketMessage({ type: "speech_stream_end" });
      this.emit("speech_complete", { streaming: true });
    }
  }

  /**
-   * Process text deltra for streaming speech
-   * Call this as text chunks arrive from LLM
+   * Process text delta for streaming speech.
+   * Call this as text chunks arrive from LLM.
   */
  private processTextForStreamingSpeech(textDelta: string): void {
    if (!this.speechModel) return;
@@ -431,7 +612,7 @@ export class VoiceAgent extends EventEmitter {
  /**
   * Process incoming audio data: transcribe and generate response
   */
-  private async processAudioInput(base64Audio: string): Promise<void> {
+  private async processAudioInput(base64Audio: string, format?: string): Promise<void> {
    if (!this.transcriptionModel) {
      this.emit("error", new Error("Transcription model not configured for audio input"));
      return;
@@ -439,20 +620,55 @@ export class VoiceAgent extends EventEmitter {

    try {
      const audioBuffer = Buffer.from(base64Audio, "base64");
-      this.emit("audio_received", { size: audioBuffer.length });
+
+      // Validate audio size to prevent memory issues
+      if (audioBuffer.length > this.maxAudioInputSize) {
+        const sizeMB = (audioBuffer.length / (1024 * 1024)).toFixed(1);
+        const maxMB = (this.maxAudioInputSize / (1024 * 1024)).toFixed(1);
+        this.emit("error", new Error(
+          `Audio input too large (${sizeMB} MB). Maximum allowed: ${maxMB} MB`
+        ));
+        return;
+      }
+
+      if (audioBuffer.length === 0) {
+        this.emit("warning", "Received empty audio data");
+        return;
+      }
+
+      this.emit("audio_received", { size: audioBuffer.length, format });
+      console.log(`Processing audio input: ${audioBuffer.length} bytes, format: ${format || 'unknown'}`);

      const transcribedText = await this.transcribeAudio(audioBuffer);
+      console.log(`Transcribed text: "${transcribedText}"`);

      if (transcribedText.trim()) {
-        await this.processUserInput(transcribedText);
+        await this.enqueueInput(transcribedText);
+      } else {
+        this.emit("warning", "Transcription returned empty text");
+        this.sendWebSocketMessage({
+          type: "transcription_error",
+          error: "Whisper returned empty text"
+        });
      }
    } catch (error) {
      console.error("Failed to process audio input:", error);
      this.emit("error", error);
+      this.sendWebSocketMessage({
+        type: "transcription_error",
+        error: `Transcription failed: ${(error as Error).message || String(error)}`
+      });
    }
  }

  public async connect(url?: string): Promise<void> {
+    this.ensureNotDestroyed();
+
+    // Clean up any existing connection first
+    if (this.socket) {
+      this.disconnectSocket();
+    }
+
    return new Promise((resolve, reject) => {
      try {
        // Use provided URL, configured endpoint, or default URL
@@ -481,6 +697,13 @@ export class VoiceAgent extends EventEmitter {
   * agent to handle messages on that socket.
   */
  public handleSocket(socket: WebSocket): void {
+    this.ensureNotDestroyed();
+
+    // Clean up any existing connection first
+    if (this.socket) {
+      this.disconnectSocket();
+    }
+
    this.socket = socket;
    this.isConnected = true;
    this.setupListeners();
@@ -488,10 +711,15 @@ export class VoiceAgent extends EventEmitter {
  }

  /**
-   * Send text input for processing (bypasses transcription)
+   * Send text input for processing (bypasses transcription).
+   * Requests are queued and processed serially to prevent race conditions.
   */
  public async sendText(text: string): Promise<string> {
-    return this.processUserInput(text);
+    this.ensureNotDestroyed();
+    if (!text || !text.trim()) {
+      throw new Error("Text input cannot be empty");
+    }
+    return this.enqueueInput(text);
  }

  /**
@@ -499,6 +727,7 @@ export class VoiceAgent extends EventEmitter {
   * @param audioData Base64 encoded audio data
   */
  public async sendAudio(audioData: string): Promise<void> {
+    this.ensureNotDestroyed();
    await this.processAudioInput(audioData);
  }

@@ -506,26 +735,65 @@ export class VoiceAgent extends EventEmitter {
   * Send raw audio buffer to be transcribed and processed
   */
  public async sendAudioBuffer(audioBuffer: Buffer | Uint8Array): Promise<void> {
+    this.ensureNotDestroyed();
    const base64Audio = Buffer.from(audioBuffer).toString("base64");
    await this.processAudioInput(base64Audio);
  }

  /**
-   * Process user input with streaming text generation
-   * Handles the full pipeline: text -> LLM (streaming) -> TTS -> WebSocket
+   * Enqueue a text input for serial processing.
+   * This ensures only one processUserInput runs at a time, preventing
+   * race conditions on conversationHistory, fullText accumulation, etc.
+   */
+  private enqueueInput(text: string): Promise<string> {
+    return new Promise<string>((resolve, reject) => {
+      this.inputQueue.push({ text, resolve, reject });
+      this.drainInputQueue();
+    });
+  }
+
+  /**
+   * Drain the input queue, processing one request at a time.
+   */
+  private async drainInputQueue(): Promise<void> {
+    if (this.processingQueue) return;
+    this.processingQueue = true;
+
+    try {
+      while (this.inputQueue.length > 0) {
+        const item = this.inputQueue.shift()!;
+        try {
+          const result = await this.processUserInput(item.text);
+          item.resolve(result);
+        } catch (error) {
+          item.reject(error);
+        }
+      }
+    } finally {
+      this.processingQueue = false;
+    }
+  }
+
+  /**
+   * Process user input with streaming text generation.
+   * Handles the full pipeline: text -> LLM (streaming) -> TTS -> WebSocket.
+   *
+   * This method is designed to be called serially via drainInputQueue().
   */
  private async processUserInput(text: string): Promise<string> {
-    if (this.isProcessing) {
-      this.emit("warning", "Already processing a request, queuing...");
-    }
    this.isProcessing = true;

+    // Create an abort controller for this LLM stream so it can be cancelled
+    this.currentStreamAbortController = new AbortController();
+    const streamAbortSignal = this.currentStreamAbortController.signal;
+
    try {
      // Emit text event for incoming user input
      this.emit("text", { role: "user", text });

-      // Add user message to conversation history
+      // Add user message to conversation history and trim if needed
      this.conversationHistory.push({ role: "user", content: text });
+      this.trimHistory();

      // Use streamText for streaming responses with tool support
      const result = streamText({
@@ -534,6 +802,7 @@ export class VoiceAgent extends EventEmitter {
        messages: this.conversationHistory,
        tools: this.tools,
        stopWhen: this.stopWhen,
+        abortSignal: streamAbortSignal,
        onChunk: ({ chunk }) => {
          // Emit streaming chunks for real-time updates
          // Note: onChunk only receives a subset of stream events
@@ -782,18 +1051,19 @@ export class VoiceAgent extends EventEmitter {
        }
      }

-      // Add assistant response to conversation history
+      // Add assistant response to conversation history and trim
      if (fullText) {
        this.conversationHistory.push({ role: "assistant", content: fullText });
+        this.trimHistory();
      }

      // Ensure any remaining speech is flushed (in case text-end wasn't triggered)
      this.flushStreamingSpeech();

-      // Wait for all speech chunks to complete before signaling response complete
-      // This ensures audio playback can finish
-      while (this.speechChunkQueue.length > 0 || this.isSpeaking) {
-        await new Promise(resolve => setTimeout(resolve, 100));
+      // Wait for all speech chunks to complete using promise-based signaling
+      // (replaces the previous busy-wait polling loop)
+      if (this.speechQueueDonePromise) {
+        await this.speechQueueDonePromise;
      }

      // Send the complete response
@@ -808,8 +1078,16 @@ export class VoiceAgent extends EventEmitter {
      });

      return fullText;
+    } catch (error) {
+      // Clean up speech state on error so the agent isn't stuck in a broken state
+      this.pendingTextBuffer = "";
+      if (this.speechChunkQueue.length > 0 || this.isSpeaking) {
+        this.interruptSpeech("stream_error");
+      }
+      throw error;
    } finally {
      this.isProcessing = false;
+      this.currentStreamAbortController = undefined;
    }
  }

@@ -848,11 +1126,33 @@ export class VoiceAgent extends EventEmitter {
  }

  /**
-   * Send a message via WebSocket if connected
+   * Send a message via WebSocket if connected.
+   * Gracefully handles send failures (e.g., socket closing mid-send).
   */
  private sendWebSocketMessage(message: Record<string, unknown>): void {
-    if (this.socket && this.isConnected) {
-      this.socket.send(JSON.stringify(message));
+    if (!this.socket || !this.isConnected) return;
+
+    try {
+      if (this.socket.readyState === WebSocket.OPEN) {
+        // Skip logging huge audio data for better readability
+        if (message.type === "audio_chunk" || message.type === "audio") {
+          const { data, ...rest } = message as any;
+          console.log(`Sending WebSocket message: ${message.type}`,
+            data ? `(${(data.length / 1000).toFixed(1)}KB audio data)` : "",
+            rest
+          );
+        } else {
+          console.log(`Sending WebSocket message: ${message.type}`);
+        }
+
+        this.socket.send(JSON.stringify(message));
+      } else {
+        console.warn(`Cannot send message, socket state: ${this.socket.readyState}`);
+      }
+    } catch (error) {
+      // Socket may have closed between the readyState check and send()
+      console.error("Failed to send WebSocket message:", error);
+      this.emit("error", error);
    }
  }

@@ -895,14 +1195,44 @@ export class VoiceAgent extends EventEmitter {
  }

  /**
-   * Disconnect from WebSocket
+   * Internal helper to close and clean up the current socket.
+   */
+  private disconnectSocket(): void {
+    if (!this.socket) return;
+
+    // Stop all in-flight work tied to this connection
+    this.cleanupOnDisconnect();
+
+    try {
+      this.socket.removeAllListeners();
+      if (this.socket.readyState === WebSocket.OPEN ||
+        this.socket.readyState === WebSocket.CONNECTING) {
+        this.socket.close();
+      }
+    } catch {
+      // Ignore close errors — socket may already be dead
+    }
+    this.socket = undefined;
+    this.isConnected = false;
+  }
+
+  /**
+   * Disconnect from WebSocket and stop all in-flight work.
   */
  disconnect() {
-    if (this.socket) {
-      this.socket.close();
-      this.socket = undefined;
-      this.isConnected = false;
-    }
+    this.disconnectSocket();
+  }
+
+  /**
+   * Permanently destroy the agent, releasing all resources.
+   * After calling this, the agent cannot be reused.
+   */
+  destroy() {
+    this.isDestroyed = true;
+    this.disconnectSocket();
+    this.conversationHistory = [];
+    this.tools = {};
+    this.removeAllListeners();
  }

  /**
@@ -932,4 +1262,11 @@ export class VoiceAgent extends EventEmitter {
  get pendingSpeechChunks(): number {
    return this.speechChunkQueue.length;
  }
+
+  /**
+   * Check if agent has been permanently destroyed
+   */
+  get destroyed(): boolean {
+    return this.isDestroyed;
+  }
 }
--- a/src/index.ts
+++ b/src/index.ts
@@ -1 +1,22 @@
+// Agents
 export { VoiceAgent, type VoiceAgentOptions } from "./VoiceAgent";
+export {
+    VideoAgent,
+    type VideoAgentOptions,
+    type VideoFrame,
+    type AudioData,
+    type VideoAgentConfig,
+    type FrameContext,
+    type FrameTriggerReason,
+} from "./VideoAgent";
+
+// Shared types
+export {
+    type SpeechChunk,
+    type StreamingSpeechConfig,
+    type HistoryConfig,
+    type StopWhenCondition,
+    DEFAULT_STREAMING_SPEECH_CONFIG,
+    DEFAULT_HISTORY_CONFIG,
+    DEFAULT_MAX_AUDIO_SIZE,
+} from "./types";
--- a/src/types.ts
+++ b/src/types.ts
@@ -0,0 +1,60 @@
+import type { streamText } from "ai";
+
+/**
+ * Represents a chunk of text to be converted to speech
+ */
+export interface SpeechChunk {
+    id: number;
+    text: string;
+    audioPromise?: Promise<Uint8Array | null>;
+}
+
+/**
+ * Configuration for streaming speech behavior
+ */
+export interface StreamingSpeechConfig {
+    /** Minimum characters before generating speech for a chunk */
+    minChunkSize: number;
+    /** Maximum characters per chunk (will split at sentence boundary before this) */
+    maxChunkSize: number;
+    /** Whether to enable parallel TTS generation */
+    parallelGeneration: boolean;
+    /** Maximum number of parallel TTS requests */
+    maxParallelRequests: number;
+}
+
+/**
+ * Configuration for conversation history memory management
+ */
+export interface HistoryConfig {
+    /** Maximum number of messages to keep in history. When exceeded, oldest messages are trimmed. Set to 0 for unlimited. */
+    maxMessages: number;
+    /** Maximum total character count across all messages. When exceeded, oldest messages are trimmed. Set to 0 for unlimited. */
+    maxTotalChars: number;
+}
+
+/**
+ * Default streaming speech configuration
+ */
+export const DEFAULT_STREAMING_SPEECH_CONFIG: StreamingSpeechConfig = {
+    minChunkSize: 50,
+    maxChunkSize: 200,
+    parallelGeneration: true,
+    maxParallelRequests: 3,
+};
+
+/**
+ * Default history configuration
+ */
+export const DEFAULT_HISTORY_CONFIG: HistoryConfig = {
+    maxMessages: 100,
+    maxTotalChars: 0, // unlimited by default
+};
+
+/** Default maximum audio input size (10 MB) */
+export const DEFAULT_MAX_AUDIO_SIZE = 10 * 1024 * 1024;
+
+/**
+ * Default stop condition type from streamText
+ */
+export type StopWhenCondition = NonNullable<Parameters<typeof streamText>[0]["stopWhen"]>;
--- a/src/utils/StreamBuffer.ts
+++ b/src/utils/StreamBuffer.ts
Author	SHA1	Message	Date
Bijit Mondal	bbe354b70b	0.2.1-beta.0	2026-02-19 16:06:17 +05:30
Bijit Mondal	6ab04788e1	0.2.0	2026-02-19 16:03:56 +05:30
Bijit Mondal	ac505c4ed9	Refactor VoiceAgent: Extract types and default configurations into separate types.ts file; remove unused StreamBuffer file	2026-02-19 16:01:25 +05:30
Bijit Mondal	ce10d521f3	feat: add dist directory with compiled files and type definitions - Created dist/index.js and dist/index.d.ts for main entry points. - Added source maps for index.js and index.d.ts. - Introduced dist/utils/StreamBuffer.js and StreamBuffer.d.ts with source maps. - Updated package.json to point main and types to dist files. - Included additional files in package.json for distribution. - Added peerDependencies and updated devDependencies.	2026-02-14 14:39:23 +05:30
Bijit Mondal	637d57fb41	feat: add serve-client and start-test-environment scripts, enhance voice-client with debugging info	2026-02-14 14:20:07 +05:30
Bijit Mondal	8e8dd9d9f6	WIP	2026-02-14 12:26:47 +05:30
Bijit Mondal	7725f66e39	feat: enhance README with VoiceAgent usage example and configuration options	2026-02-13 17:36:18 +05:30
				`@@ -0,0 +1 @@`
				{"version":3,"file":"VoiceAgent.d.ts","sourceRoot":"","sources":["../src/VoiceAgent.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,SAAS,EAAE,MAAM,IAAI,CAAC;AAC/B,OAAO,EAAE,YAAY,EAAE,MAAM,QAAQ,CAAC;AACtC,OAAO,EACL,UAAU,EACV,aAAa,EAEb,KAAK,IAAI,EACT,KAAK,YAAY,EAGjB,KAAK,kBAAkB,EACvB,KAAK,WAAW,EACjB,MAAM,IAAI,CAAC;AACZ,OAAO,EAEL,KAAK,qBAAqB,EAC1B,KAAK,aAAa,EAInB,MAAM,SAAS,CAAC;AAEjB,MAAM,WAAW,iBAAiB;IAChC,KAAK,EAAE,aAAa,CAAC;IACrB,kBAAkB,CAAC,EAAE,kBAAkB,CAAC;IACxC,WAAW,CAAC,EAAE,WAAW,CAAC;IAC1B,YAAY,CAAC,EAAE,MAAM,CAAC;IACtB,QAAQ,CAAC,EAAE,WAAW,CAAC,UAAU,CAAC,OAAO,UAAU,CAAC,CAAC,CAAC,CAAC,CAAC,UAAU,CAAC,CAAC,CAAC;IACrE,KAAK,CAAC,EAAE,MAAM,CAAC,MAAM,EAAE,IAAI,CAAC,CAAC;IAC7B,QAAQ,CAAC,EAAE,MAAM,CAAC;IAClB,KAAK,CAAC,EAAE,MAAM,CAAC;IACf,kBAAkB,CAAC,EAAE,MAAM,CAAC;IAC5B,YAAY,CAAC,EAAE,MAAM,CAAC;IACtB,oDAAoD;IACpD,eAAe,CAAC,EAAE,OAAO,CAAC,qBAAqB,CAAC,CAAC;IACjD,2DAA2D;IAC3D,OAAO,CAAC,EAAE,OAAO,CAAC,aAAa,CAAC,CAAC;IACjC,yDAAyD;IACzD,iBAAiB,CAAC,EAAE,MAAM,CAAC;CAC5B;AAED,qBAAa,UAAW,SAAQ,YAAY;IAC1C,OAAO,CAAC,MAAM,CAAC,CAAY;IAC3B,OAAO,CAAC,KAAK,CAA4B;IACzC,OAAO,CAAC,KAAK,CAAgB;IAC7B,OAAO,CAAC,kBAAkB,CAAC,CAAqB;IAChD,OAAO,CAAC,WAAW,CAAC,CAAc;IAClC,OAAO,CAAC,YAAY,CAAS;IAC7B,OAAO,CAAC,QAAQ,CAA4D;IAC5E,OAAO,CAAC,QAAQ,CAAC,CAAS;IAC1B,OAAO,CAAC,WAAW,CAAS;IAC5B,OAAO,CAAC,mBAAmB,CAAsB;IACjD,OAAO,CAAC,KAAK,CAAS;IACtB,OAAO,CAAC,kBAAkB,CAAC,CAAS;IACpC,OAAO,CAAC,YAAY,CAAS;IAC7B,OAAO,CAAC,YAAY,CAAS;IAC7B,OAAO,CAAC,WAAW,CAAS;IAG5B,OAAO,CAAC,UAAU,CAA2F;IAC7G,OAAO,CAAC,eAAe,CAAS;IAGhC,OAAO,CAAC,4BAA4B,CAAC,CAAkB;IAGvD,OAAO,CAAC,aAAa,CAAgB;IACrC,OAAO,CAAC,iBAAiB,CAAS;IAGlC,OAAO,CAAC,qBAAqB,CAAwB;IACrD,OAAO,CAAC,4BAA4B,CAAC,CAAkB;IACvD,OAAO,CAAC,gBAAgB,CAAqB;IAC7C,OAAO,CAAC,WAAW,CAAK;IACxB,OAAO,CAAC,UAAU,CAAS;IAC3B,OAAO,CAAC,iBAAiB,CAAM;IAG/B,OAAO,CAAC,sBAAsB,CAAC,CAAgB;IAC/C,OAAO,CAAC,sBAAsB,CAAC,CAAa;gBAEhC,OAAO,EAAE,iBAAiB;IA8BtC;;OAEG;IACH,OAAO,CAAC,kBAAkB;IAM1B,OAAO,CAAC,cAAc;IAuDtB;;OAEG;IACH,OAAO,CAAC,mBAAmB;IA8BpB,aAAa,CAAC,KAAK,EAAE,MAAM,CAAC,MAAM,EAAE,IAAI,CAAC;IAIhD;;OAEG;IACU,eAAe,CAAC,SAAS,EAAE,MAAM,GAAG,UAAU,GAAG,OAAO,CAAC,MAAM,CAAC;IAuC7E;;;OAGG;IACU,sBAAsB,CACjC,IAAI,EAAE,MAAM,EACZ,WAAW,CAAC,EAAE,WAAW,GACxB,OAAO,CAAC,UAAU,CAAC;IAiBtB;;;OAGG;IACI,eAAe,CAAC,MAAM,GAAE,MAAsB,GAAG,IAAI;IAgC5D;;;OAGG;IACI,wBAAwB,CAAC,MAAM,GAAE,MAAsB,GAAG,IAAI;IAUrE;;;OAGG;IACH,OAAO,CAAC,gBAAgB;IA8CxB;;;OAGG;IACH,OAAO,CAAC,WAAW;IAmCnB;;OAEG;IACH,OAAO,CAAC,gBAAgB;IAsCxB;;OAEG;YACW,kBAAkB;IAwBhC;;OAEG;YACW,kBAAkB;IA+FhC;;;OAGG;IACH,OAAO,CAAC,6BAA6B;IAarC;;;OAGG;IACH,OAAO,CAAC,oBAAoB;IAO5B;;OAEG;YACW,iBAAiB;IAiDlB,OAAO,CAAC,GAAG,CAAC,EAAE,MAAM,GAAG,OAAO,CAAC,IAAI,CAAC;IA8BjD;;;;OAIG;IACI,YAAY,CAAC,MAAM,EAAE,SAAS,GAAG,IAAI;IAc5C;;;OAGG;IACU,QAAQ,CAAC,IAAI,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC;IAQpD;;;OAGG;IACU,SAAS,CAAC,SAAS,EAAE,MAAM,GAAG,OAAO,CAAC,IAAI,CAAC;IAKxD;;OAEG;IACU,eAAe,CAAC,WAAW,EAAE,MAAM,GAAG,UAAU,GAAG,OAAO,CAAC,IAAI,CAAC;IAM7E;;;;OAIG;IACH,OAAO,CAAC,YAAY;IAOpB;;OAEG;YACW,eAAe;IAmB7B;;;;;OAKG;YACW,gBAAgB;IAuT9B;;;OAGG;IACU,yBAAyB,CAAC,IAAI,EAAE,MAAM,GAAG,OAAO,CAAC,IAAI,CAAC;IA8BnE;;;OAGG;IACH,OAAO,CAAC,oBAAoB;IA2B5B;;OAEG;IACH,cAAc;IAKd;;OAEG;IACH,aAAa;IAKb;;OAEG;IACH,YAAY;IAKZ;;OAEG;IACH,UAAU,IAAI,YAAY,EAAE;IAI5B;;OAEG;IACH,UAAU,CAAC,OAAO,EAAE,YAAY,EAAE;IAIlC;;OAEG;IACH,OAAO,CAAC,gBAAgB;IAmBxB;;OAEG;IACH,UAAU;IAIV;;;OAGG;IACH,OAAO;IAQP;;OAEG;IACH,IAAI,SAAS,IAAI,OAAO,CAEvB;IAED;;OAEG;IACH,IAAI,UAAU,IAAI,OAAO,CAExB;IAED;;OAEG;IACH,IAAI,QAAQ,IAAI,OAAO,CAEtB;IAED;;OAEG;IACH,IAAI,mBAAmB,IAAI,MAAM,CAEhC;IAED;;OAEG;IACH,IAAI,SAAS,IAAI,OAAO,CAEvB;CACF"}
				`@@ -0,0 +1 @@`
				`{"version":3,"file":"index.d.ts","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AACA,OAAO,EAAE,UAAU,EAAE,KAAK,iBAAiB,EAAE,MAAM,cAAc,CAAC;AAClE,OAAO,EACH,UAAU,EACV,KAAK,iBAAiB,EACtB,KAAK,UAAU,EACf,KAAK,SAAS,EACd,KAAK,gBAAgB,EACrB,KAAK,YAAY,EACjB,KAAK,kBAAkB,GAC1B,MAAM,cAAc,CAAC;AAGtB,OAAO,EACH,KAAK,WAAW,EAChB,KAAK,qBAAqB,EAC1B,KAAK,aAAa,EAClB,KAAK,iBAAiB,EACtB,+BAA+B,EAC/B,sBAAsB,EACtB,sBAAsB,GACzB,MAAM,SAAS,CAAC"}`
				`@@ -0,0 +1 @@`
				`{"version":3,"file":"index.js","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":";;;AAAA,SAAS;AACT,2CAAkE;AAAzD,wGAAA,UAAU,OAAA;AACnB,2CAQsB;AAPlB,wGAAA,UAAU,OAAA;AASd,eAAe;AACf,iCAQiB;AAHb,wHAAA,+BAA+B,OAAA;AAC/B,+GAAA,sBAAsB,OAAA;AACtB,+GAAA,sBAAsB,OAAA"}`
				`@@ -0,0 +1 @@`
				{"version":3,"file":"types.d.ts","sourceRoot":"","sources":["../src/types.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EAAE,UAAU,EAAE,MAAM,IAAI,CAAC;AAErC;;GAEG;AACH,MAAM,WAAW,WAAW;IACxB,EAAE,EAAE,MAAM,CAAC;IACX,IAAI,EAAE,MAAM,CAAC;IACb,YAAY,CAAC,EAAE,OAAO,CAAC,UAAU,GAAG,IAAI,CAAC,CAAC;CAC7C;AAED;;GAEG;AACH,MAAM,WAAW,qBAAqB;IAClC,8DAA8D;IAC9D,YAAY,EAAE,MAAM,CAAC;IACrB,iFAAiF;IACjF,YAAY,EAAE,MAAM,CAAC;IACrB,gDAAgD;IAChD,kBAAkB,EAAE,OAAO,CAAC;IAC5B,8CAA8C;IAC9C,mBAAmB,EAAE,MAAM,CAAC;CAC/B;AAED;;GAEG;AACH,MAAM,WAAW,aAAa;IAC1B,yHAAyH;IACzH,WAAW,EAAE,MAAM,CAAC;IACpB,6HAA6H;IAC7H,aAAa,EAAE,MAAM,CAAC;CACzB;AAED;;GAEG;AACH,eAAO,MAAM,+BAA+B,EAAE,qBAK7C,CAAC;AAEF;;GAEG;AACH,eAAO,MAAM,sBAAsB,EAAE,aAGpC,CAAC;AAEF,+CAA+C;AAC/C,eAAO,MAAM,sBAAsB,QAAmB,CAAC;AAEvD;;GAEG;AACH,MAAM,MAAM,iBAAiB,GAAG,WAAW,CAAC,UAAU,CAAC,OAAO,UAAU,CAAC,CAAC,CAAC,CAAC,CAAC,UAAU,CAAC,CAAC,CAAC"}
				`@@ -0,0 +1 @@`
				`{"version":3,"file":"types.js","sourceRoot":"","sources":["../src/types.ts"],"names":[],"mappings":";;;AAmCA;;GAEG;AACU,QAAA,+BAA+B,GAA0B;IAClE,YAAY,EAAE,EAAE;IAChB,YAAY,EAAE,GAAG;IACjB,kBAAkB,EAAE,IAAI;IACxB,mBAAmB,EAAE,CAAC;CACzB,CAAC;AAEF;;GAEG;AACU,QAAA,sBAAsB,GAAkB;IACjD,WAAW,EAAE,GAAG;IAChB,aAAa,EAAE,CAAC,EAAE,uBAAuB;CAC5C,CAAC;AAEF,+CAA+C;AAClC,QAAA,sBAAsB,GAAG,EAAE,GAAG,IAAI,GAAG,IAAI,CAAC"}`
				`@@ -0,0 +1 @@`
				`{"version":3,"file":"StreamBuffer.d.ts","sourceRoot":"","sources":["../../src/utils/StreamBuffer.ts"],"names":[],"mappings":""}`
				`@@ -0,0 +1 @@`
				`{"version":3,"file":"StreamBuffer.js","sourceRoot":"","sources":["../../src/utils/StreamBuffer.ts"],"names":[],"mappings":""}`