Skip to main content

Getting Started

API SPEED: BEST PRACTICES

How to make CallMissed API calls feel as fast as the playground — streaming, model choice, connection reuse, and Kimi instant mode.

Why the playground feels faster

A single playground call and a typical API call hit the exact same endpoint at https://api.callmissed.com/v1/chat/completions. When the API feels slower, three compounding factors are usually at work:

SettingPlayground defaultCommon API defaultEffect
Streamingstream: truestream: falseNon-streaming waits for the *whole* generation before any byte returns
Modelauto (fast free model)anthropic/claude-sonnet-4.6Bigger model → 2-3× wall-clock for the same prompt
ConnectionOne persistent HTTP/2 connectionNew TCP+TLS per callAdds ~150-300ms handshake to every request

Same endpoint, very different perceived speed. Below: how to close the gap.

1

Use stream: true

With stream: false your client waits for the full generation. With stream: true first-byte-time is sub-100ms regardless of model size — tokens flow as the model produces them.

python [Python]
from openai import OpenAI

client = OpenAI(
    api_key="cm_your_key",
    base_url="https://api.callmissed.com/v1",
)

# stream=True is the default in the Anthropic SDK; explicit here.
stream = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Explain quicksort in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
ts [TypeScript]
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "cm_your_key",
  baseURL: "https://api.callmissed.com/v1",
});

const stream = await client.chat.completions.create({
  model: "gpt-oss-120b",
  messages: [{ role: "user", content: "Explain quicksort in one paragraph." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
2

Pick the right model

Same prompt, three different routes — measured first-token and total times:

ModelRouteTTFBTotal
gpt-oss-120bCloudflare Workers AI~50ms~1.5s
autoOpenRouter free~90ms~2.0s
anthropic/claude-sonnet-4.6OpenRouter~70ms~4.8s
anthropic/claude-haiku-4.5OpenRouter~80ms~2.0s

For latency-sensitive integrations (autocomplete, agent tool loops), prefer `gpt-oss-120b` or `kimi-k2.6` — both Cloudflare-hosted, both OpenAI-compatible, both sub-2s on small prompts.

3

Reasoning_effort matrix (per model — verified empirically)

Reasoning models can spend 100+ tokens "thinking" before producing visible content. On short answers, that's both a wall-clock and a credit cost the user never sees. Use reasoning_effort to dial it down — or off, where supported.

json
{
  "model": "kimi-k2.6",
  "messages": [{ "role": "user", "content": "What is 2+2?" }],
  "reasoning_effort": "none"
}

Behaviour per model — verified live against each upstream on 2026-05-01:

Model`"none"``"low"``"medium"``"high"``"minimal"`
kimi-k2.5✅ off"none"
kimi-k2.6✅ off"none"
gemma-4-26b-a4b-it✅ off"none"
gpt-oss-120b"low""low"
nemotron-3-super"low""low"
glm-4.7-flash"low""low"
sarvam-30b / sarvam-105b"low""low"
mistral-small-3.1(no reasoning surface — silently dropped)
OpenRouter (openai/*, anthropic/*, …)forwarded as-is — underlying model gates valid values

Legend: ✅ = sent to upstream verbatim · ↓ = the API maps it to the next-supported value before forwarding.

Concrete numbers on kimi-k2.6 answering "What is 2+2?":

ModeAnswerCompletion tokens
Default (thinking on)2 + 2 = 4101
reasoning_effort: "none"2 + 2 = 49

Same answer, 11× fewer tokens and a proportionally shorter wall-clock. Pass per-request based on whether you need the trace.

4

Reuse the HTTP connection

CallMissed serves HTTP/2. A persistent connection multiplexes many requests with no extra TLS handshake per call. Most modern HTTP clients do this *if you reuse the same client instance*:

python [Python]
# Bad — new connection per call (~150-300ms TLS overhead each time)
def ask(prompt):
    return OpenAI(api_key=K, base_url=URL).chat.completions.create(...)

# Good — reuse one client (and its underlying connection pool)
client = OpenAI(api_key=K, base_url=URL)
def ask(prompt):
    return client.chat.completions.create(...)
ts [TypeScript]
// Bad
async function ask(prompt: string) {
  const c = new OpenAI({ apiKey: K, baseURL: URL });
  return c.chat.completions.create(...);
}

// Good — module-scoped client
const client = new OpenAI({ apiKey: K, baseURL: URL });
async function ask(prompt: string) {
  return client.chat.completions.create(...);
}

Three back-to-back streaming calls on a reused HTTP/2 connection clock in around 400-460ms each for a small prompt; the first call without reuse pays an extra TLS handshake on top.

Checklist

Before reporting "the API feels slow," verify:

  • [ ] stream: true is set on every chat completion request
  • [ ] Model is one of gpt-oss-120b, kimi-k2.6, anthropic/claude-haiku-4.5, or auto (avoid Sonnet/Opus for latency-bound paths)
  • [ ] For Kimi: reasoning_effort: "none" is passed when you don't need the reasoning trace
  • [ ] One OpenAI() / Anthropic() client instance is shared across requests
  • [ ] HTTP client supports HTTP/2 (the official OpenAI / Anthropic SDKs do)

If all four are checked and you still see latency that doesn't match the playground, share the request and the upstream model — there are very few request shapes that beat playground default for the same model.

Was this page helpful?