Real-time voice AI agent powered by LiveKit (WebRTC) — Sarvam STT, Kimi K2.5 LLM, Sarvam TTS.
Overview
The Voice Agent is a real-time conversational AI powered by LiveKit (open-source WebRTC). It provides an end-to-end streaming pipeline:
Mic (WebRTC) → Sarvam STT → Kimi K2.5 LLM (414 tok/s) → Sarvam TTS → Speaker (WebRTC)
All stages stream concurrently: the LLM pushes sentence chunks to TTS while it is still generating, and audio plays back while TTS is still synthesizing. This minimizes time-to-first-audio.
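The hand-off from the LLM to TTS described above can be sketched as a small buffer that groups streamed tokens into sentence-sized chunks. This is an illustrative sketch only, not the agent's actual implementation: the `SentenceBuffer` class, the `chunkTokens` helper, and the boundary rule (a chunk ends at `.`, `!`, `?`, or the danda `।`) are all assumptions.

```javascript
// Hypothetical sketch: group streamed LLM tokens into sentence chunks for TTS.
// The boundary rule below is an assumption, not the agent's documented behavior.
const SENTENCE_END = /[.!?।]\s*$/;

class SentenceBuffer {
  constructor() {
    this.buffer = "";
  }

  // Append one token; return a complete sentence, or null if none is ready yet.
  push(token) {
    this.buffer += token;
    if (!SENTENCE_END.test(this.buffer)) return null;
    const sentence = this.buffer.trim();
    this.buffer = "";
    return sentence;
  }

  // Return whatever is left when the LLM stream ends (or null if empty).
  flush() {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest || null;
  }
}

// Convenience wrapper: run a whole token sequence through the buffer.
function chunkTokens(tokens) {
  const buf = new SentenceBuffer();
  const sentences = [];
  for (const token of tokens) {
    const sentence = buf.push(token);
    if (sentence) sentences.push(sentence); // would be sent to TTS immediately
  }
  const rest = buf.flush();
  if (rest) sentences.push(rest);
  return sentences;
}

// chunkTokens(["Hello", " there", ".", " How", " can", " I", " help", "?"])
// → ["Hello there.", "How can I help?"]
```

In a streaming setting, each sentence returned by `push` would be forwarded to TTS as soon as it completes, so synthesis of the first sentence can start while the LLM is still producing the second. A production chunker would also need to handle cases this rule gets wrong, such as decimals and abbreviations.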
Default stack:
- STT: Sarvam saaras:v3 (streaming WebSocket, 23 languages)
- LLM: Kimi K2.5 via Clarifai (414 tok/s, 256K context)
- TTS: Sarvam bulbul:v3 (streaming WebSocket, 39 voices)
- Transport: LiveKit (WebRTC, self-hosted)
Architecture
Browser (livekit-client SDK)
↓ WebRTC audio
LiveKit Server (self-hosted, port 7880)
↓ dispatches to
LiveKit Agent Process (backend/agent.py)
→ Sarvam STT (streaming transcription)
→ Kimi K2.5 via Clarifai (streaming LLM, 414 tok/s)
→ Sarvam TTS (streaming synthesis)
↓ WebRTC audio back to browser
The backend's role is session management: it creates sessions, generates LiveKit tokens, and tracks usage. The actual voice processing happens in the LiveKit agent process.
Quickstart
1. Create a session:
curl -X POST https://api.callmissed.com/v1/voice/sessions \
  -H "Authorization: Bearer cm_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "system_prompt": "You are a helpful assistant.",
    "voice": "shubh",
    "language": "en-IN",
    "llm_model": "kimi-k2.5"
  }'
Response:
{
  "id": "uuid",
  "ws_url": "ws://livekit-server:7880",
  "token": "eyJhbGciOi...",
  "status": "created"
}
2. Connect via LiveKit client:
import { Room, RoomEvent, Track } from "livekit-client";

const room = new Room();

// Play the agent's audio as soon as its track is subscribed.
room.on(RoomEvent.TrackSubscribed, (track, pub, participant) => {
  if (track.kind === Track.Kind.Audio) {
    const el = track.attach();
    document.body.appendChild(el);
  }
});

// Log final transcript segments for both sides of the conversation.
room.on(RoomEvent.TranscriptionReceived, (segments, participant) => {
  for (const seg of segments) {
    if (seg.final) {
      const who = participant?.isLocal ? "You" : "Agent";
      console.log(who + ": " + seg.text);
    }
  }
});

// `session` is the parsed JSON response from step 1.
await room.connect(session.ws_url, session.token);
await room.localParticipant.setMicrophoneEnabled(true);

The agent joins automatically, greets the user, and responds to speech.
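Step 1 can also be done from JavaScript instead of curl. The sketch below mirrors the curl request shown earlier (same URL, headers, and JSON body); the helper names `buildSessionRequest` and `createSession` are hypothetical, and the error handling is an assumption, since the API's error responses are not documented here.

```javascript
// Hypothetical helpers mirroring the curl request from step 1.
const API_BASE = "https://api.callmissed.com/v1";

// Build the URL and fetch options for POST /voice/sessions.
function buildSessionRequest(apiKey, config) {
  return {
    url: `${API_BASE}/voice/sessions`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(config),
    },
  };
}

// Send the request and return the parsed response ({ id, ws_url, token, status }).
// `fetchImpl` is injectable so the function can be exercised without a network.
async function createSession(apiKey, config, fetchImpl = fetch) {
  const { url, init } = buildSessionRequest(apiKey, config);
  const res = await fetchImpl(url, init);
  if (!res.ok) {
    // Assumption: any non-2xx status is treated as a failure.
    throw new Error(`Session creation failed: HTTP ${res.status}`);
  }
  return res.json();
}
```

The returned object can be passed to `room.connect(session.ws_url, session.token)` as in step 2. Making this call from your own server rather than from the browser keeps the API key out of client-side code.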
Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| system_prompt | string | "You are a helpful voice assistant..." | System prompt for LLM |
| voice | string | shubh | Sarvam TTS voice ID (39 voices available) |
| language | string | en-IN | Language code for STT and TTS |
| llm_model | string | kimi-k2.5 | LLM model (kimi-k2.5, sarvam-105b, or any OpenRouter model). kimi-k2.5-fast is currently under maintenance. |
| tts_provider | string | sarvam | TTS provider (currently only sarvam) |
| max_duration_seconds | int | 300 | Max session duration (30-3600) |
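As an illustration of the table above, a client could apply the documented defaults and check the documented constraints before sending a request. This is a sketch under assumptions: `buildSessionConfig` is a hypothetical helper, and whether the server applies exactly the same validation is not confirmed here. `system_prompt` is deliberately left out of the defaults so that the server's own default applies when it is omitted.

```javascript
// Defaults copied from the configuration table; system_prompt is omitted on
// purpose so the server-side default is used when the caller does not set one.
const SESSION_DEFAULTS = {
  voice: "shubh",
  language: "en-IN",
  llm_model: "kimi-k2.5",
  tts_provider: "sarvam",
  max_duration_seconds: 300,
};

// Hypothetical client-side helper: merge overrides and pre-check constraints.
function buildSessionConfig(overrides = {}) {
  const config = { ...SESSION_DEFAULTS, ...overrides };

  const duration = config.max_duration_seconds;
  if (!Number.isInteger(duration) || duration < 30 || duration > 3600) {
    throw new RangeError("max_duration_seconds must be an integer between 30 and 3600");
  }
  if (config.tts_provider !== "sarvam") {
    throw new RangeError('tts_provider must be "sarvam" (the only provider listed)');
  }
  return config;
}

// Example: a ten-minute Hindi session, everything else left at its default.
const exampleConfig = buildSessionConfig({ language: "hi-IN", max_duration_seconds: 600 });
```

Checking the range locally gives faster feedback than a round trip to the API, but the server's response remains the source of truth.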
Features
- Interruption handling: speak while the agent is talking and it stops immediately and listens to you
- STT-based turn detection: Sarvam's server-side VAD detects speech start/end with 50ms endpointing
- Preemptive generation: the LLM starts generating before STT fully confirms the transcript
- Streaming pipeline: each stage streams to the next, with no buffering between stages
- Session management: REST API for creating, listing, and deleting sessions and for retrieving transcripts
- Per-model pricing: usage tracked and billed per model (Clarifai $0.52/$2.30 per 1M tokens)
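To make the per-model pricing concrete, here is a rough cost estimate for a session's LLM usage. It assumes that the two Clarifai figures are the input and output rates respectively ($0.52 per 1M input tokens, $2.30 per 1M output tokens); the list above does not say which is which, so treat that split, and the `estimateLlmCost` helper itself, as assumptions.

```javascript
// Assumption: $0.52 is the input rate and $2.30 the output rate, per 1M tokens.
const CLARIFAI_RATES = { inputPerMillion: 0.52, outputPerMillion: 2.3 };

// Estimate LLM cost in USD for a given number of input and output tokens.
function estimateLlmCost(inputTokens, outputTokens, rates = CLARIFAI_RATES) {
  const total =
    inputTokens * rates.inputPerMillion + outputTokens * rates.outputPerMillion;
  return total / 1_000_000;
}

// Example: 20,000 input tokens and 5,000 output tokens.
const sessionCost = estimateLlmCost(20_000, 5_000);
console.log(sessionCost.toFixed(4)); // prints "0.0219"
```

Under these assumptions, 20,000 input tokens cost about $0.0104 and 5,000 output tokens about $0.0115, for roughly $0.0219 in total. STT and TTS usage would be billed separately and is not covered by this sketch.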
Legacy WebSocket
The direct WebSocket endpoint is still available for backward compatibility:
WS /ws/voice-agent?key=cm_your_api_key
Send a config message after connecting, then stream PCM audio. This uses the custom backend pipeline (not LiveKit). See the Session API for the recommended LiveKit-based approach.