Speech to Text

Transcribe audio to text with our saaras model, which supports 22 Indic languages and English.

Overview

The Speech to Text API transcribes audio files into text. It uses our saaras:v3 model, which supports 22 Indian languages plus English and can auto-detect the spoken language.

Endpoint: POST /v1/audio/transcriptions

How transcription works

1. Your app: upload an audio file (WAV/MP3) to POST /v1/audio/transcriptions.
2. CallMissed gateway: validate the key, detect the language (or use the language field if it is set), and apply mode.
3. saaras:v3: run speech recognition across 22 Indic languages + English.
4. Your app: receive the text (plus word timestamps in verbose_json).

Tip: Leave language unset and saaras:v3 auto-detects it. Set mode=translate to get English text out of any supported language in a single call.
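As a minimal sketch of the tip above, the helper below assembles the arguments for `client.audio.transcriptions.create()`. The name `build_transcription_kwargs` is hypothetical, and sending `mode` through the SDK's `extra_body` option is an assumption about how the gateway reads non-OpenAI fields, since `mode` is not part of the SDK's own signature.

```python
from typing import Any, BinaryIO, Optional


def build_transcription_kwargs(
    audio: BinaryIO,
    language: Optional[str] = None,
    mode: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble keyword arguments for client.audio.transcriptions.create()."""
    kwargs: dict[str, Any] = {"model": "saaras:v3", "file": audio}
    if language is not None:
        # Only send a language when the caller pins one; omitting the field
        # leaves detection to the service.
        kwargs["language"] = language
    if mode is not None:
        # Assumption: the gateway accepts `mode` as an extra form field.
        # `extra_body` is the OpenAI SDK's hook for non-standard fields.
        kwargs["extra_body"] = {"mode": mode}
    return kwargs
```

With the client from Basic Usage, `client.audio.transcriptions.create(**build_transcription_kwargs(f, mode="translate"))` would then request English output without naming the source language.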

Basic Usage

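from openai import OpenAI

# Point the OpenAI SDK at the CallMissed gateway.
client = OpenAI(
    api_key="cm_your_key",  # replace with your own key
    base_url="https://api.callmissed.com/v1"
)

# Open the audio in binary mode; language is omitted, so it is auto-detected.
with open("audio.wav", "rb") as f:
    response = client.audio.transcriptions.create(
        model="saaras:v3",
        file=f
    )

# Print the transcript.
print(response.text)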

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| model | string | saaras:v3 |
| file | file | Audio file (WAV, MP3, etc.) |
| language | string | Language code (auto-detected if omitted) |
| mode | string | Output mode (see Output Modes below) |
| response_format | string | json, text, or verbose_json |
| timestamp_granularities[] | array | ["word"] for word-level timestamps (OpenAI-compatible) |
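To illustrate word-level timestamps, here is a small sketch that formats the words from a verbose_json result. It assumes the payload follows the OpenAI verbose_json shape, with a `words` list whose entries carry `word`, `start`, and `end` (in seconds); the page only says the field is OpenAI-compatible, so treat that shape as an assumption. The function name is hypothetical.

```python
from typing import Any


def format_word_timestamps(payload: dict[str, Any]) -> list[str]:
    """Return one 'start to end: word' line per recognized word."""
    lines = []
    # `words` may be absent or null when word timestamps were not requested.
    for item in payload.get("words") or []:
        lines.append(f"{item['start']:.2f}s to {item['end']:.2f}s: {item['word']}")
    return lines
```

The payload here would come from a request made with `response_format="verbose_json"` and `timestamp_granularities=["word"]`, for example the dict produced by calling `model_dump()` on the SDK response object.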

Output Modes

| Mode | Description |
| --- | --- |
| transcribe | Standard transcription (default) |
| translate | Transcribe and translate to English |
| verbatim | Exact transcription including filler words |
| translit | Transliteration to Latin script |
| codemix | Code-mixed output (Indic + English) |
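One way to see how the modes differ is to run the same file through several of them and compare the text. The sketch below takes an already configured OpenAI SDK client as an argument; `compare_modes` is a hypothetical helper, and passing `mode` via `extra_body` is an assumption about how the gateway receives this non-OpenAI field.

```python
from typing import Any, Iterable

# Mode names as listed in the table above.
OUTPUT_MODES = ("transcribe", "translate", "verbatim", "translit", "codemix")


def compare_modes(
    client: Any, path: str, modes: Iterable[str] = OUTPUT_MODES
) -> dict[str, str]:
    """Transcribe `path` once per mode and return {mode: text}."""
    results: dict[str, str] = {}
    for mode in modes:
        if mode not in OUTPUT_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        # Reopen the file for every call so each upload starts at byte 0.
        with open(path, "rb") as audio:
            response = client.audio.transcriptions.create(
                model="saaras:v3",
                file=audio,
                extra_body={"mode": mode},  # assumption: sent as a form field
            )
        results[mode] = response.text
    return results
```

Each mode costs one request, so on long recordings it may be worth passing only the two or three modes you actually want to compare.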

Deepgram feature parameters

When you select a Deepgram model (deepgram-nova-3, deepgram-nova-2, deepgram-flux-general-en, etc.), these extra form fields are accepted. They are ignored for non-Deepgram models. Model-restricted features are dropped automatically when the chosen model doesn't support them.

| Parameter | Type | Description |
| --- | --- | --- |
| diarize | boolean | Label each speaker ([Speaker 0], [Speaker 1], …) |
| utterances | boolean | Segment the transcript into utterances |
| utt_split | number | Silence gap (seconds) used to split utterances |
| paragraphs | boolean | Split the transcript into paragraphs |
| numerals | boolean | Write numbers as digits (e.g. "five" → "5") |
| measurements | boolean | Abbreviate measurement units (English) |
| dictation | boolean | Convert spoken "comma"/"period" to punctuation (English) |
| profanity_filter | boolean | Mask recognized profanity with **** |
| filler_words | boolean | Keep "uh"/"um" (Nova / Nova-2 / Nova-3) |
| multichannel | boolean | Transcribe each audio channel independently |
| detect_entities | boolean | Tag entities like names and locations (English) |
| detect_language | string | true to auto-detect, or repeat with codes to restrict |
| redact | string | pci, pii, phi, numbers, or a specific entity type (repeatable) |
| keyterm | string | Boost recognition of a term/phrase (Nova-3 + Flux; repeatable) |
| keywords | string | keyword:intensifier boost/suppress (Nova-2 / legacy; repeatable) |
| search | string | Phonetically search the audio for a term (repeatable) |
| replace | string | find:replacement substitution (repeatable) |
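Several of these fields are repeatable, which a plain dictionary cannot express because each key appears only once. The sketch below flattens an options dictionary into a list of (name, value) pairs so that a list value becomes one form field per item. The helper name is hypothetical, and encoding booleans as the strings "true" and "false" is an assumption about what the gateway expects.

```python
from typing import Iterable, Union

FieldValue = Union[bool, int, float, str, Iterable[str]]


def to_form_fields(options: dict[str, FieldValue]) -> list[tuple[str, str]]:
    """Flatten options into (name, value) pairs, repeating list-valued names."""
    fields: list[tuple[str, str]] = []
    for name, value in options.items():
        # bool must be checked before int, because bool is a subclass of int.
        if isinstance(value, bool):
            fields.append((name, "true" if value else "false"))
        elif isinstance(value, (str, int, float)):
            fields.append((name, str(value)))
        else:
            # Repeatable field: emit one pair per item, in order.
            fields.extend((name, str(item)) for item in value)
    return fields
```

A list of pairs like this can be handed to HTTP clients that accept a sequence for form data, such as the `data=` argument of `requests.post` alongside `files=` for the audio.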

Dialects & locales

Deepgram models accept locale-specific language codes so you can pin a dialect for best accuracy. Pass the code in the language field. Each model's exact dialect list is published in the dialects array on GET /v1/models. Examples:

  • English: en-US, en-GB, en-IN, en-AU, en-NZ, en-CA, en-IE
  • Spanish: es, es-419 (Latin America)
  • Portuguese: pt-BR, pt-PT
  • Chinese: zh-CN, zh-TW, zh-HK (Cantonese)
  • Multilingual code-switching: multi (Nova-3, Nova-2, Flux multilingual)
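Since each model publishes its own dialect list, a client can check a locale before sending it. The sketch below works on an already decoded GET /v1/models response. It assumes an OpenAI-style body with a `data` list whose entries have an `id` and a `dialects` array; only the `dialects` array is stated on this page, so the rest of the shape is an assumption. Both helper names and the fallback to the base language are illustrative choices, not documented gateway behavior.

```python
from typing import Any, Optional


def dialects_for(models_payload: dict[str, Any], model_id: str) -> list[str]:
    """Return the dialect codes published for `model_id`, or [] if unknown."""
    for model in models_payload.get("data", []):
        if model.get("id") == model_id:
            return list(model.get("dialects", []))
    return []


def pick_language(preferred: str, dialects: list[str]) -> Optional[str]:
    """Pick the preferred locale, else its base language, else None."""
    if preferred in dialects:
        return preferred
    # Illustrative fallback: "en-ZA" falls back to "en" when that is listed.
    base = preferred.split("-")[0]
    return base if base in dialects else None
```

A `None` result can be treated as a signal to leave the language field unset rather than send a code the model does not list.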