Voice Agent — CallMissed Docs

VOICE AGENT

Real-time voice AI agent built on LiveKit (WebRTC), with Sarvam STT, Kimi K2.5 LLM, and Sarvam TTS.

Overview

The Voice Agent is a real-time conversational AI powered by LiveKit (open-source WebRTC). It provides an end-to-end streaming pipeline:

text

Mic (WebRTC) → Sarvam STT → Kimi K2.5 LLM (414 tok/s) → Sarvam TTS → Speaker (WebRTC)

All stages stream concurrently: the LLM pushes sentence chunks to TTS while it is still generating, and audio plays back while TTS is still synthesizing. This minimizes time-to-first-audio.
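The sentence-chunk handoff between the LLM and TTS can be pictured with a small buffer that releases each sentence as soon as it is complete. This is a minimal sketch, not the agent's actual implementation: `createSentenceChunker`, its boundary rule (a terminator followed by whitespace), and the inclusion of the Devanagari danda are all assumptions made for illustration.

```javascript
// Hypothetical helper: accumulates streamed LLM text deltas and releases
// complete sentences so a TTS stage could start speaking early.
function createSentenceChunker() {
  let buffer = "";
  // Assumed boundary rule: ., !, ? or the danda, followed by whitespace.
  const boundary = /^[\s\S]*?[.!?।]\s+/;
  return {
    // Add a text delta; returns any sentences completed by it.
    push(delta) {
      buffer += delta;
      const done = [];
      let match;
      while ((match = buffer.match(boundary))) {
        done.push(match[0].trim());
        buffer = buffer.slice(match[0].length);
      }
      return done;
    },
    // Call once the LLM stream ends to release the trailing text.
    flush() {
      const rest = buffer.trim();
      buffer = "";
      return rest ? [rest] : [];
    },
  };
}
```

Because a sentence is only released once trailing whitespace arrives, a decimal such as `3.5` is not split; the final sentence of a response is released by `flush()`.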

Default stack:

STT: Sarvam saaras:v3 (streaming WebSocket, 23 languages)
LLM: Kimi K2.5 via Clarifai (414 tok/s, 256K context)
TTS: Sarvam bulbul:v3 (streaming WebSocket, 39 voices)
Transport: LiveKit (WebRTC, self-hosted)

Architecture

text

Browser (livekit-client SDK)
  ↓ WebRTC audio
LiveKit Server (self-hosted, port 7880)
  ↓ dispatches to
LiveKit Agent Process (backend/agent.py)
  → Sarvam STT (streaming transcription)
  → Kimi K2.5 via Clarifai (streaming LLM, 414 tok/s)
  → Sarvam TTS (streaming synthesis)
  ↓ WebRTC audio back to browser

The backend handles session management only: it creates sessions, generates LiveKit tokens, and tracks usage. Voice processing itself happens in the LiveKit agent process.
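LiveKit access tokens are JWTs, so a client can read the token's payload (without verifying it) to see, for example, when it expires. The sketch below assumes the token carries a standard `exp` claim; `decodeJwtPayload` and `secondsUntilExpiry` are illustrative names, not part of any CallMissed or LiveKit SDK.

```javascript
// Hypothetical helper: decodes the payload (middle part) of a JWT.
// It does NOT verify the signature; use it for inspection only.
function decodeJwtPayload(token) {
  const parts = token.split(".");
  if (parts.length !== 3) {
    throw new Error("not a JWT: expected 3 dot-separated parts");
  }
  // Convert base64url to padded base64 before decoding.
  let b64 = parts[1].replace(/-/g, "+").replace(/_/g, "/");
  while (b64.length % 4 !== 0) b64 += "=";
  const bytes = Uint8Array.from(atob(b64), (ch) => ch.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes));
}

// Seconds left before the token's `exp` claim, or null if it has none.
function secondsUntilExpiry(token, nowMs = Date.now()) {
  const { exp } = decodeJwtPayload(token);
  if (typeof exp !== "number") return null;
  return exp - Math.floor(nowMs / 1000);
}
```

A client could call `secondsUntilExpiry(session.token)` before connecting and request a new session if the value is zero or negative.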

Quickstart

1. Create a session:

bash

curl -X POST https://api.callmissed.com/v1/voice/sessions \
  -H "Authorization: Bearer cm_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "system_prompt": "You are a helpful assistant.",
    "voice": "shubh",
    "language": "en-IN",
    "llm_model": "kimi-k2.5"
  }'

Response:

json

{
  "id": "uuid",
  "ws_url": "ws://livekit-server:7880",
  "token": "eyJhbGciOi...",
  "status": "created"
}
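The same request can be issued from JavaScript. The sketch below mirrors the curl call above using the standard `fetch` API; the helper names `buildSessionRequest` and `createVoiceSession` are hypothetical, and error handling is limited to a status check because the error response format is not documented here.

```javascript
// Hypothetical helper: builds the URL and fetch options for session creation.
function buildSessionRequest(apiKey, config = {}) {
  return {
    url: "https://api.callmissed.com/v1/voice/sessions",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(config),
    },
  };
}

// Sends the request and resolves to the parsed response
// ({ id, ws_url, token, status } in the example above).
async function createVoiceSession(apiKey, config = {}, fetchImpl = fetch) {
  const { url, init } = buildSessionRequest(apiKey, config);
  const res = await fetchImpl(url, init);
  if (!res.ok) {
    throw new Error(`session creation failed: HTTP ${res.status}`);
  }
  return res.json();
}
```

Keep the API key on a server you control where possible, and hand only the returned `ws_url` and `token` to the browser.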

2. Connect via LiveKit client:

javascript

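import { Room, RoomEvent, Track } from "livekit-client";

// `session` is the parsed JSON response from step 1 (it carries ws_url and token).
const room = new Room();

// Play the agent's audio as soon as its track is subscribed.
room.on(RoomEvent.TrackSubscribed, (track, pub, participant) => {
  if (track.kind === Track.Kind.Audio) {
    const el = track.attach(); // creates an <audio> element bound to the track
    document.body.appendChild(el);
  }
});

// Log finalized transcript segments for both the user and the agent.
room.on(RoomEvent.TranscriptionReceived, (segments, participant) => {
  for (const seg of segments) {
    if (seg.final) {
      const who = participant?.isLocal ? "You" : "Agent";
      console.log(who + ": " + seg.text);
    }
  }
});

// Join the room, then publish the microphone so the agent can hear the user.
await room.connect(session.ws_url, session.token);
await room.localParticipant.setMicrophoneEnabled(true);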

The agent joins automatically, greets the user, and responds to speech.
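To keep a running transcript instead of logging to the console, the transcription handler can delegate to a small pure function. This is a sketch only: `formatFinalSegments` is a hypothetical helper, and it relies solely on the `final` and `text` segment fields used in the example above.

```javascript
// Hypothetical helper: turns one TranscriptionReceived batch into
// display lines, skipping interim (non-final) segments.
function formatFinalSegments(segments, isLocal) {
  const who = isLocal ? "You" : "Agent";
  return segments
    .filter((seg) => seg.final)
    .map((seg) => `${who}: ${seg.text}`);
}
```

Inside the handler, something like `transcript.push(...formatFinalSegments(segments, Boolean(participant?.isLocal)))` would append each finalized line to an array named `transcript`.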

Configuration

Field	Type	Default	Description
`system_prompt`	string	"You are a helpful voice assistant..."	System prompt for LLM
`voice`	string	`shubh`	Sarvam TTS voice ID (39 voices available)
`language`	string	`en-IN`	Language code for STT and TTS
`llm_model`	string	`kimi-k2.5`	LLM model (`kimi-k2.5`, `sarvam-105b`, or any OpenRouter model). `kimi-k2.5-fast` is currently under maintenance.
`tts_provider`	string	`sarvam`	TTS provider (currently only `sarvam`)
`max_duration_seconds`	int	300	Maximum session duration in seconds (allowed range: 30 to 3600)
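A client may want to apply the table's defaults and range check locally before sending the request. The sketch below is an assumption about how that could look, not server behavior: `SESSION_DEFAULTS` and `buildSessionConfig` are hypothetical names, and `system_prompt` is deliberately left out so that the server's own default applies.

```javascript
// Defaults copied from the configuration table (system_prompt omitted).
const SESSION_DEFAULTS = {
  voice: "shubh",
  language: "en-IN",
  llm_model: "kimi-k2.5",
  tts_provider: "sarvam",
  max_duration_seconds: 300,
};

// Hypothetical helper: merges overrides into the defaults and validates
// the two constraints stated in the table.
function buildSessionConfig(overrides = {}) {
  const config = { ...SESSION_DEFAULTS, ...overrides };
  const d = config.max_duration_seconds;
  if (!Number.isInteger(d) || d < 30 || d > 3600) {
    throw new RangeError("max_duration_seconds must be an integer from 30 to 3600");
  }
  if (config.tts_provider !== "sarvam") {
    throw new Error("tts_provider must be \"sarvam\" (the only provider listed)");
  }
  return config;
}
```

For example, `buildSessionConfig({ max_duration_seconds: 600 })` keeps every other default and extends the session limit to ten minutes.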

Features

Interruption handling: speak while the agent is talking and it stops immediately and listens
STT-based turn detection: Sarvam's server-side VAD detects speech start and end with 50 ms endpointing
Preemptive generation: the LLM starts generating before STT fully confirms the transcript
Streaming pipeline: each stage streams to the next, with no buffering between stages
Session management: REST API for creating, listing, and deleting sessions and for retrieving transcripts
Per-model pricing: usage is tracked and billed per model (Clarifai $0.52/$2.30 per 1M tokens)
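For rough budgeting, the Clarifai figures above can be turned into a per-session estimate. The sketch assumes that $0.52 is the price per 1M input tokens and $2.30 the price per 1M output tokens; the page does not say which figure is which, so treat that mapping, along with the names `CLARIFAI_RATES` and `estimateCostUsd`, as assumptions.

```javascript
// ASSUMPTION: first figure = input rate, second = output rate (USD per 1M tokens).
const CLARIFAI_RATES = { inputPerMillion: 0.52, outputPerMillion: 2.3 };

// Hypothetical helper: estimated LLM cost in USD for a token count pair.
function estimateCostUsd(inputTokens, outputTokens, rates = CLARIFAI_RATES) {
  return (
    (inputTokens / 1e6) * rates.inputPerMillion +
    (outputTokens / 1e6) * rates.outputPerMillion
  );
}
```

This covers LLM tokens only; any STT, TTS, or transport charges are outside what the figures above describe.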

Legacy WebSocket

The direct WebSocket endpoint is still available for backward compatibility:

text

WS /ws/voice-agent?key=cm_your_api_key

Send a config message after connecting, then stream PCM audio. This uses the custom backend pipeline (not LiveKit). See the Session API for the recommended LiveKit-based approach.
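Two pieces of client-side plumbing for the legacy endpoint can be sketched without knowing the full protocol: building the connection URL and converting Web Audio float samples to PCM. The host, the config message schema, and the exact PCM format are not specified on this page, so the base URL parameter and the 16-bit signed sample format below are assumptions; `buildLegacyVoiceUrl` and `floatTo16BitPcm` are hypothetical helpers.

```javascript
// Hypothetical helper: builds the legacy WebSocket URL from a base such as
// "wss://your-host" (the real host is not given in this guide).
function buildLegacyVoiceUrl(base, apiKey) {
  const url = new URL("/ws/voice-agent", base);
  url.searchParams.set("key", apiKey);
  return url.toString();
}

// ASSUMPTION: the endpoint accepts 16-bit signed PCM. Converts float
// samples in [-1, 1] (as produced by Web Audio) to an Int16Array.
function floatTo16BitPcm(samples) {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range input
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

After opening `new WebSocket(buildLegacyVoiceUrl(base, key))`, a client would send its config message first and then the `buffer` of each converted chunk as binary frames.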
