VOICE AGENT

Real-time voice AI agent powered by LiveKit (WebRTC) — Sarvam STT, Kimi K2.5 LLM, Sarvam TTS.

Overview

The Voice Agent is a real-time conversational AI powered by LiveKit (open-source WebRTC). It provides an end-to-end streaming pipeline:

text
Mic (WebRTC) → Sarvam STT → Kimi K2.5 LLM (414 tok/s) → Sarvam TTS → Speaker (WebRTC)

All stages stream concurrently — the LLM pushes sentence chunks to TTS while still generating, and audio plays back while TTS is still synthesizing. This minimizes time-to-first-audio.
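To illustrate how sentence chunks can leave the LLM stage while generation is still running, here is a minimal sketch. It is not the agent's actual implementation: `SentenceChunker` is a hypothetical helper, and the sentence boundary rule (`.`, `!`, or `?` followed by whitespace) is an assumption chosen for brevity.

```javascript
// Illustrative only: buffers streamed LLM tokens and releases complete
// sentences, so a TTS stage could start before generation finishes.
class SentenceChunker {
  constructor() {
    this.buffer = "";
  }

  // Add one token; return any sentences this token completed.
  push(token) {
    this.buffer += token;
    const sentences = [];
    // Assumed rule: a sentence ends at ., !, or ? followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.buffer.slice(start, end).trim());
      start = end;
    }
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Return whatever text remains when the token stream ends.
  flush() {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }
}

// Usage: simulate a token stream and collect what would be sent to TTS.
const chunker = new SentenceChunker();
const spoken = [];
for (const token of ["Hello", " there.", " How", " are", " you?", " Fine"]) {
  spoken.push(...chunker.push(token));
}
spoken.push(...chunker.flush());
console.log(spoken);
```

The first sentence is released as soon as the token after it arrives, which is the property that keeps time-to-first-audio low.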

Default stack:

  • STT: Sarvam saaras:v3 (streaming WebSocket, 23 languages)
  • LLM: Kimi K2.5 via Clarifai (414 tok/s, 256K context)
  • TTS: Sarvam bulbul:v3 (streaming WebSocket, 39 voices)
  • Transport: LiveKit (WebRTC, self-hosted)

Architecture

text
Browser (livekit-client SDK)
  ↓ WebRTC audio
LiveKit Server (self-hosted, port 7880)
  ↓ dispatches to
LiveKit Agent Process (backend/agent.py)
  → Sarvam STT (streaming transcription)
  → Kimi K2.5 via Clarifai (streaming LLM, 414 tok/s)
  → Sarvam TTS (streaming synthesis)
  ↓ WebRTC audio back to browser

The backend's role is session management — it creates sessions, generates LiveKit tokens, and tracks usage. The actual voice processing happens in the LiveKit agent process.

Quickstart

1. Create a session:

bash
curl -X POST https://api.callmissed.com/v1/voice/sessions \
  -H "Authorization: Bearer cm_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "system_prompt": "You are a helpful assistant.",
    "voice": "shubh",
    "language": "en-IN",
    "llm_model": "kimi-k2.5"
  }'

Response:

json
{
  "id": "uuid",
  "ws_url": "ws://livekit-server:7880",
  "token": "eyJhbGciOi...",
  "status": "created"
}
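The same request can be made from JavaScript. The sketch below mirrors the curl call above using the standard `fetch` API. The helper names `buildCreateSessionRequest` and `createSession` are hypothetical, and treating any non-2xx status as an error is an assumption, since the error format is not documented here.

```javascript
// Hypothetical helper: builds the fetch() arguments for creating a session.
function buildCreateSessionRequest(apiKey, config) {
  return {
    url: "https://api.callmissed.com/v1/voice/sessions",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(config),
    },
  };
}

// Sends the request and resolves to the parsed session
// ({ id, ws_url, token, status } per the response shown above).
async function createSession(apiKey, config) {
  const { url, init } = buildCreateSessionRequest(apiKey, config);
  const res = await fetch(url, init);
  if (!res.ok) {
    // Assumption: any non-2xx status means the session was not created.
    throw new Error(`Session creation failed: HTTP ${res.status}`);
  }
  return res.json();
}

// Build (but do not send) the request for the example configuration.
const request = buildCreateSessionRequest("cm_your_api_key", {
  system_prompt: "You are a helpful assistant.",
  voice: "shubh",
  language: "en-IN",
  llm_model: "kimi-k2.5",
});
console.log(request.url, request.init.method);
```

Calling `createSession` from your own server rather than from the browser keeps the API key out of client-side code.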

2. Connect via LiveKit client:

javascript
import { Room, RoomEvent, Track } from "livekit-client";

const room = new Room();

room.on(RoomEvent.TrackSubscribed, (track, pub, participant) => {
  if (track.kind === Track.Kind.Audio) {
    const el = track.attach();
    document.body.appendChild(el);
  }
});

room.on(RoomEvent.TranscriptionReceived, (segments, participant) => {
  for (const seg of segments) {
    if (seg.final) {
      const who = participant?.isLocal ? "You" : "Agent";
      console.log(who + ": " + seg.text);
    }
  }
});

// `session` is the parsed JSON response from step 1.
await room.connect(session.ws_url, session.token);
await room.localParticipant.setMicrophoneEnabled(true);

The agent joins automatically, greets the user, and responds to speech.

Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| system_prompt | string | "You are a helpful voice assistant..." | System prompt for the LLM |
| voice | string | shubh | Sarvam TTS voice ID (39 voices available) |
| language | string | en-IN | Language code for STT and TTS |
| llm_model | string | kimi-k2.5 | LLM model (kimi-k2.5, sarvam-105b, or any OpenRouter model). kimi-k2.5-fast is currently under maintenance. |
| tts_provider | string | sarvam | TTS provider (currently only sarvam) |
| max_duration_seconds | int | 300 | Max session duration in seconds (30-3600) |
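A client can apply these defaults and catch out-of-range values before sending a request. The sketch below is one way to do that. `normalizeSessionConfig` is a hypothetical helper, and how the server itself handles invalid values (rejecting or clamping) is not stated here, so the checks are a client-side assumption. `system_prompt` is left unset so the server's own default applies.

```javascript
// Defaults taken from the configuration fields listed above.
const SESSION_DEFAULTS = {
  voice: "shubh",
  language: "en-IN",
  llm_model: "kimi-k2.5",
  tts_provider: "sarvam",
  max_duration_seconds: 300,
};

// Hypothetical helper: merges overrides onto the defaults and validates them.
function normalizeSessionConfig(overrides = {}) {
  const config = { ...SESSION_DEFAULTS, ...overrides };

  const duration = config.max_duration_seconds;
  if (!Number.isInteger(duration) || duration < 30 || duration > 3600) {
    throw new RangeError("max_duration_seconds must be an integer from 30 to 3600");
  }
  if (config.tts_provider !== "sarvam") {
    throw new Error('tts_provider currently supports only "sarvam"');
  }
  return config;
}

// Usage: override only the session length.
const config = normalizeSessionConfig({ max_duration_seconds: 600 });
console.log(config);
```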

Features

  • Interruption handling: if you speak while the agent is talking, it stops immediately and listens
  • STT-based turn detection: Sarvam's server-side VAD detects speech start and end with 50 ms endpointing
  • Preemptive generation: the LLM starts generating before STT fully confirms the transcript
  • Streaming pipeline: each stage streams to the next, with no buffering between stages
  • Session management: REST API for creating, listing, and deleting sessions and for retrieving transcripts
  • Per-model pricing: usage is tracked and billed per model (Clarifai $0.52/$2.30 per 1M tokens)
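As a rough illustration of per-token billing, the sketch below estimates LLM cost from token counts. It assumes the two Clarifai figures are the input and output rates respectively, which the list above does not state explicitly. `estimateLlmCost` is a hypothetical helper and covers LLM tokens only, not STT or TTS usage.

```javascript
// Rates from the feature list, in USD per 1M tokens.
// Assumption: the first figure is the input rate, the second the output rate.
const KIMI_RATES = { inputPerMillion: 0.52, outputPerMillion: 2.3 };

// Hypothetical helper: estimated LLM cost in USD for one session.
function estimateLlmCost(inputTokens, outputTokens, rates = KIMI_RATES) {
  return (
    (inputTokens / 1_000_000) * rates.inputPerMillion +
    (outputTokens / 1_000_000) * rates.outputPerMillion
  );
}

// Usage: 20,000 input tokens and 5,000 output tokens.
// 0.02 * 0.52 + 0.005 * 2.30 = 0.0104 + 0.0115
const cost = estimateLlmCost(20_000, 5_000);
console.log(cost.toFixed(4)); // "0.0219"
```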

Legacy WebSocket

The direct WebSocket endpoint is still available for backward compatibility:

text
WS /ws/voice-agent?key=cm_your_api_key

Send a config message after connecting, then stream PCM audio. This uses the custom backend pipeline (not LiveKit). See the Session API for the recommended LiveKit-based approach.
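A minimal sketch of building the legacy endpoint URL follows. The host and the `wss` scheme are assumptions (borrowed from the REST API's base URL), and `buildLegacyVoiceAgentUrl` is a hypothetical helper. The config message's schema is not documented on this page, so the sketch stops at the URL.

```javascript
// Assumption: the legacy endpoint shares the REST API's host and uses TLS (wss).
const LEGACY_WS_BASE = "wss://api.callmissed.com";

// Hypothetical helper: builds the legacy WebSocket URL with the API key
// passed as the `key` query parameter.
function buildLegacyVoiceAgentUrl(apiKey, base = LEGACY_WS_BASE) {
  const url = new URL("/ws/voice-agent", base);
  url.searchParams.set("key", apiKey); // percent-encodes the key if needed
  return url.toString();
}

const legacyUrl = buildLegacyVoiceAgentUrl("cm_your_api_key");
console.log(legacyUrl);
```

The resulting string can be passed to the standard `WebSocket` constructor.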
