Mirror of https://github.com/Bijit-Mondal/VoiceAgent.git
# voice-agent-ai-sdk
Streaming voice/text agent SDK built on AI SDK with optional WebSocket transport.
## Current status
- Streaming text generation is implemented via `streamText`.
- Tool calling is supported in-stream.
- Speech synthesis is implemented with chunked streaming TTS.
- Audio transcription is supported (when `transcriptionModel` is configured).
- WebSocket protocol events are emitted for stream, tool, and speech lifecycle.
## Prerequisites
- Node.js 20+
- pnpm
- OpenAI API key
## Setup
1. Install dependencies:

   ```bash
   pnpm install
   ```

2. Configure environment variables in `.env`:

   ```
   OPENAI_API_KEY=your_openai_api_key
   VOICE_WS_ENDPOINT=ws://localhost:8080
   ```

`VOICE_WS_ENDPOINT` is optional for text-only usage.
## VoiceAgent usage (as in the demo)
Minimal end-to-end example using AI SDK tools, streaming text, and streaming TTS:
```ts
import "dotenv/config";
import { VoiceAgent } from "./src";
import { tool } from "ai";
import { z } from "zod";
import { openai } from "@ai-sdk/openai";

const weatherTool = tool({
  description: "Get the weather in a location",
  inputSchema: z.object({ location: z.string() }),
  execute: async ({ location }) => ({ location, temperature: 72, conditions: "sunny" }),
});

const agent = new VoiceAgent({
  model: openai("gpt-4o"),
  transcriptionModel: openai.transcription("whisper-1"),
  speechModel: openai.speech("gpt-4o-mini-tts"),
  instructions: "You are a helpful voice assistant.",
  voice: "alloy",
  speechInstructions: "Speak in a friendly, natural conversational tone.",
  outputFormat: "mp3",
  streamingSpeech: {
    minChunkSize: 40,
    maxChunkSize: 180,
    parallelGeneration: true,
    maxParallelRequests: 2,
  },
  endpoint: process.env.VOICE_WS_ENDPOINT,
  tools: { getWeather: weatherTool },
});

agent.on("text", ({ role, text }) => {
  const prefix = role === "user" ? "👤" : "🤖";
  console.log(prefix, text);
});
agent.on("chunk:text_delta", ({ text }) => process.stdout.write(text));
agent.on("speech_start", ({ streaming }) => console.log("speech_start", streaming));
agent.on("audio_chunk", ({ chunkId, format, uint8Array }) => {
  console.log("audio_chunk", chunkId, format, uint8Array.length);
});

await agent.sendText("What's the weather in San Francisco?");

if (process.env.VOICE_WS_ENDPOINT) {
  await agent.connect(process.env.VOICE_WS_ENDPOINT);
}
```
## Configuration options
The agent accepts:
- `model` (required): chat model
- `transcriptionModel` (optional): STT model
- `speechModel` (optional): TTS model
- `instructions` (optional): system prompt
- `stopWhen` (optional): stopping condition
- `tools` (optional): AI SDK tools map
- `endpoint` (optional): WebSocket endpoint
- `voice` (optional): TTS voice, default `alloy`
- `speechInstructions` (optional): style instructions for TTS
- `outputFormat` (optional): audio format, default `mp3`
- `streamingSpeech` (optional): `minChunkSize`, `maxChunkSize`, `parallelGeneration`, `maxParallelRequests`
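To make the `streamingSpeech` options concrete, here is a minimal sketch of one plausible chunking policy: accumulate streamed words, flush at a sentence boundary once `minChunkSize` is reached, and flush unconditionally past `maxChunkSize`. The function name and the boundary heuristic are illustrative assumptions, not the SDK's actual algorithm.

```typescript
// Illustrative sketch: group streamed text into TTS-sized chunks.
// minChunkSize/maxChunkSize mirror the streamingSpeech options; the
// sentence-boundary heuristic is an assumption for illustration.
interface ChunkOptions {
  minChunkSize: number;
  maxChunkSize: number;
}

function chunkStreamedText(text: string, opts: ChunkOptions): string[] {
  const chunks: string[] = [];
  let buffer = "";
  for (const word of text.split(/\s+/).filter(Boolean)) {
    buffer = buffer ? `${buffer} ${word}` : word;
    const atSentenceEnd = /[.!?]$/.test(buffer);
    // Flush at a sentence boundary once minChunkSize is reached,
    // or unconditionally once maxChunkSize is exceeded.
    if (
      (buffer.length >= opts.minChunkSize && atSentenceEnd) ||
      buffer.length >= opts.maxChunkSize
    ) {
      chunks.push(buffer);
      buffer = "";
    }
  }
  if (buffer) chunks.push(buffer); // trailing partial chunk
  return chunks;
}

const chunks = chunkStreamedText(
  "Hello there. The weather in San Francisco is sunny and 72 degrees. Enjoy your day!",
  { minChunkSize: 40, maxChunkSize: 180 }
);
console.log(chunks);
```

Smaller chunks start audio sooner at the cost of more TTS requests; `parallelGeneration` with `maxParallelRequests` lets queued chunks be synthesized concurrently.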
## Common methods
- `sendText(text)` – process text input (streamed response)
- `sendAudio(base64Audio)` – process base64 audio input
- `sendAudioBuffer(buffer)` – process raw audio buffer input
- `transcribeAudio(buffer)` – transcribe audio directly
- `generateAndSendSpeechFull(text)` – non-streaming TTS fallback
- `interruptSpeech(reason)` – interrupt streaming speech (barge-in)
- `connect(url?)` / `handleSocket(ws)` – WebSocket usage
## Key events (from demo)
- `text` – user/assistant messages
- `chunk:text_delta` – streaming text deltas
- `chunk:tool_call` / `tool_result` – tool lifecycle
- `speech_start` / `speech_complete` / `speech_interrupted`
- `speech_chunk_queued` / `audio_chunk` / `audio`
- `connected` / `disconnected`
## Run (text-only check)
This validates LLM + tool + streaming speech without requiring WebSocket:
```bash
pnpm demo
```
Expected logs include `text`, `chunk:text_delta`, tool events, and speech chunk events.
## Run (WebSocket check)
1. Start the local WS server:

   ```bash
   pnpm ws:server
   ```

2. In another terminal, run the demo:

   ```bash
   pnpm demo
   ```
The demo will:

- run `sendText()` first (text-only sanity check), then
- connect to `VOICE_WS_ENDPOINT` if provided, and
- emit streaming protocol messages (`text_delta`, `tool_call`, `audio_chunk`, `response_complete`, etc.).
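A client consuming these protocol messages can dispatch on the `type` field. The sketch below uses the event names above, but the payload shapes (`text` on `text_delta`, `toolName` on `tool_call`, and so on) are assumptions for illustration, not the SDK's documented wire format.

```typescript
// Illustrative sketch: dispatch incoming protocol messages by type.
// Payload field names are assumptions, not the SDK's actual schema.
type ProtocolMessage =
  | { type: "text_delta"; text: string }
  | { type: "tool_call"; toolName: string }
  | { type: "audio_chunk"; chunkId: number; format: string; data: string }
  | { type: "response_complete" };

function handleMessage(raw: string, out: string[]): void {
  const msg = JSON.parse(raw) as ProtocolMessage;
  switch (msg.type) {
    case "text_delta":
      out.push(msg.text); // accumulate streaming text
      break;
    case "tool_call":
      out.push(`[tool:${msg.toolName}]`);
      break;
    case "audio_chunk":
      out.push(`[audio #${msg.chunkId} ${msg.format}]`);
      break;
    case "response_complete":
      out.push("[done]");
      break;
  }
}

const log: string[] = [];
handleMessage('{"type":"text_delta","text":"Hello"}', log);
handleMessage('{"type":"response_complete"}', log);
console.log(log); // ["Hello", "[done]"]
```

In a real client the `out` array would be replaced by UI updates and an audio playback queue; the dispatch structure stays the same.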
## Browser voice client (HTML)
A simple browser client is available at `example/voice-client.html`.
What it does:

- captures microphone speech using the Web Speech API (speech-to-text)
- sends the transcript to the agent via WebSocket (`type: "transcript"`)
- receives streaming `audio_chunk` messages and plays them in order
How to use:

- Start your agent server/WebSocket endpoint.
- Open `example/voice-client.html` in a browser (Chrome/Edge recommended).
- Connect to `ws://localhost:8080` (or your endpoint), then click Start Mic.
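Playing `audio_chunk` messages "in order" matters because chunks generated in parallel may arrive out of order. One way a client can handle this is to buffer chunks and release them strictly by `chunkId`; the sketch below assumes sequential integer `chunkId`s starting at 0, which is an assumption about the protocol, not something the SDK guarantees.

```typescript
// Illustrative sketch: buffer out-of-order audio chunks and release
// them strictly in chunkId order for playback. Sequential integer
// chunkIds starting at 0 are an assumption.
class OrderedChunkQueue {
  private pending = new Map<number, string>();
  private nextId = 0;

  // Stores a chunk and returns every chunk now ready to play, in order.
  push(chunkId: number, data: string): string[] {
    this.pending.set(chunkId, data);
    const ready: string[] = [];
    while (this.pending.has(this.nextId)) {
      ready.push(this.pending.get(this.nextId)!);
      this.pending.delete(this.nextId);
      this.nextId++;
    }
    return ready;
  }
}

const queue = new OrderedChunkQueue();
console.log(queue.push(1, "b")); // [] – chunk 0 not seen yet
console.log(queue.push(0, "a")); // ["a", "b"] – both now playable
```

In the browser, each released chunk would be base64-decoded and appended to an audio playback pipeline instead of being returned as a string.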
## Scripts
- `pnpm build` – build TypeScript
- `pnpm dev` – watch TypeScript
- `pnpm demo` – run demo client
- `pnpm ws:server` – run local test WebSocket server
## Notes
- If `VOICE_WS_ENDPOINT` is empty, the WebSocket connect is skipped.
- The sample WS server sends a mock `transcript` message for end-to-end testing.
- Streaming TTS uses chunk queueing and supports interruption (`interrupt`).