What's a realistic SIP-to-first-audio latency budget on Indian telephony in 2026?

On Plivo, Exotel or Twilio Mumbai PoPs with PCMA codec, expect 40–60ms SIP ingress, 40ms frame buffering, 150–200ms VAD with semantic endpointing, 120–180ms STT final flush, 180–280ms LLM first-token with prompt caching, 90–130ms TTS first chunk, 40ms egress. That sums to a 480ms p50 first-audible and 720ms p95. Sub-400ms is achievable in best case but the variance kills the median experience.

Does Deepgram Nova-3 beat AssemblyAI Universal-2 on Hinglish code-switching?

On English-heavy Hinglish, Nova-3 wins on latency (90ms first-partial vs 140ms) and is competitive on WER. On Hindi-heavy Hinglish with frequent code-switching mid-utterance, both struggle — Nova-3 misroutes language detection on roughly 1 in 6 utterances. Sarvam Saaras v2 outperforms both on heavy Hinglish at the cost of 20–30ms additional latency, and that is the right choice if Hindi is dominant in your traffic.

How much latency does prompt caching actually save on the LLM stage?

On a 3,800-token system prompt with a 200-token turn delta, Anthropic prompt caching drops first-token from ~420ms to ~180ms on warm cache hits — roughly a 60% reduction. OpenAI's implicit caching is less aggressive but still meaningful. Cost savings are typically 70–80% on the cached portion. Every production voice stack in 2026 should be using prompt caching; many still are not. Two HTTP headers and a code review.

Should I use Opus or PCMA for the SIP codec?

PCMA on Indian PSTN traffic. Opus sounds better and gives STT cleaner input, but Indian carriers transcode Opus back to G.711 at the PSTN edge, so you pay Opus's 20–40ms encoding latency for no audio-quality gain. Stay on PCMA unless your traffic is end-to-end WebRTC (in which case Opus is correct). For mixed traffic, route per call origin.

What's the right VAD silence threshold for Indian voice AI?

There is no single right answer — there is a per-language right answer. English: 200–250ms. Hindi: 280–350ms because median inter-phrase pauses are longer. Tamil and Telugu: 300–380ms. Run a semantic endpointer in addition to silence detection — that lets you drop silence threshold to 150ms without false end-of-speech triggers, because the semantic model holds for incomplete utterances regardless of pause length. This is the single biggest latency lever after region routing.

How do speech-to-speech models like GPT Realtime change the architecture?

They collapse STT + LLM + TTS into a single model with first-audible under 250ms — about 200ms faster than the best-tuned pipeline. The architecture simplifies massively. The tradeoffs: less control over voice (TTS is baked in), no transcript intermediate for audit logs, harder to swap components, and DPDP compliance becomes trickier because the boundary between processing stages disappears. We expect the production answer in 2026–27 to be hybrid — speech-to-speech for low-stakes turns, classical pipeline for compliance-sensitive turns.

What does the per-call cost look like on a sub-500ms architecture?

At 100,000 minutes/day scale on the reference architecture (Plivo + Deepgram/Sarvam + tiered LLM with caching + Cartesia/Bulbul), all-in cost lands at roughly ₹4.20–6.80 per call-minute depending on language mix and reasoning depth. English-heavy deployments cluster at the lower end; Hindi-heavy and reasoning-heavy at the higher end. Detailed breakdown is in our [voice AI pricing post for India](/voice-ai-pricing-india). For [telephony integration depth](/integrations/telephony) the SIP layer is a smaller share than most teams expect.

Sub-500ms Latency Voice AI India 2026 Architecture

A platform architect at a Mumbai fintech opened a Loom from his QA lead at 11:42 on a Tuesday night. Three calls, all on the same Plivo trunk, all routed to the same agent stack. On the first call the bot replied 380ms after the caller stopped speaking. Crisp. Human. On the second, 1,140ms. Awkward. On the third, 1,890ms — the caller said "hello?" twice before the bot answered. Same code path, same prompt, same model. He scrubbed the logs. STT first-token at 220ms on call one, 410ms on call two, 690ms on call three. The variance was not in his code. It was in eight stages of a pipeline he had not measured end-to-end.

This is the post we wished existed when the team first started chasing low latency ai on Indian telephony. Not a benchmark, not a vendor scorecard — those exist at our voice AI latency benchmarks post and the foundational low-latency primer. This is the architecture. The millisecond budget, stage by stage. The STT, LLM and TTS choices that survive Patna Hindi at 9pm on a 2G fallback. The endpointing tuning that stops the bot from talking over the caller. And the five mistakes we still see senior teams make in 2026.

This is written for the lead engineer or voice platform architect at an Indian fintech, healthcare network or telco who has read the marketing pages, run a demo, and now has to decide whether the stack can actually do 200,000 calls a day at sub-500ms perceived latency without the CFO asking why GPU spend tripled.

What "sub-500ms latency" actually means

The number gets thrown around like it is one number. It is at least three.

End-to-end round-trip latency. The total time from the last phoneme of the caller's utterance leaving their phone to the first phoneme of the bot's reply arriving back at their phone. This is what the caller experiences as the silence between turns. Sub-500ms here is the goal.

First-audible-response latency. The time from end-of-user-speech to the first audio byte playing out on the caller's handset. This is shorter than end-to-end because TTS streams — the first 80ms of the reply plays while the rest is still being generated. On a well-tuned stack, first-audible can be 280–380ms even when full end-to-end is 500–700ms.

Perceived latency. What the caller actually notices. Driven by first-audible plus the prosody of the opening phoneme. A bot that starts with "Umm" or "So" at 320ms feels faster than one that starts with a crisp "Yes" at 280ms — because human listeners forgive filler. Perceived is the metric that closes deals; first-audible is the metric the architect can control.

Most vendor pitches quote first-audible and call it round-trip. Most buyer expectations are calibrated against end-to-end. Get this distinction wrong in your SLA and you will be arguing about a metric that nobody agrees on for the next six months.

The end-to-end latency budget on Indian telephony

Here is the budget that a well-engineered voice AI stack actually spends, broken down stage by stage, on a real Indian carrier — measured across roughly 2.3 million production minutes on Jio, Airtel and Vi over the last three quarters.

Stage	Best case	Typical	Worst case	What dominates
SIP ingress + jitter buffer	20ms	40ms	90ms	Carrier RTT to PoP, codec negotiation
Audio frame buffering (20ms frames)	20ms	40ms	60ms	Frame alignment for STT
VAD end-of-speech detection	80ms	200ms	450ms	Silence threshold + min_silence config
STT final-transcript flush	60ms	120ms	280ms	Model size, language, code-switching
LLM first-token	120ms	280ms	700ms	Prompt size, KV-cache hit, model
TTS first-audio chunk	60ms	140ms	380ms	Model, voice, language, region
SIP egress + carrier delivery	20ms	40ms	90ms	RTT + codec encoding
End-to-end total	380ms	860ms	2,050ms	—
First-audible (with streaming)	280ms	480ms	920ms	—

The honest read on this table: best case sub-500ms is achievable on Indian telephony in 2026. Typical case is not. The gap between best and typical is almost entirely VAD tuning, LLM choice, and region routing. Three knobs, in that order.

The VAD line is the one most teams underestimate. A 200ms silence threshold sounds aggressive on paper. On a real call with a caller who pauses mid-sentence to think, it triggers a false end-of-speech, the bot interrupts, the caller restarts, latency on the next turn doubles. The number that matters is not the threshold — it is the variance of the threshold across caller demographics.

The SIP layer — where Indian telephony starts the clock

The latency budget begins at the carrier. Before STT runs, before the LLM thinks, before TTS speaks, the audio has already spent 40–90ms in transit.

Provider	Mumbai PoP RTT	Singapore PoP RTT	Default codec	Jitter (p95)
Plivo (India)	18ms	64ms	PCMA	22ms
Exotel	22ms	71ms	PCMA	28ms
Twilio (Mumbai)	26ms	68ms	PCMU/Opus	31ms
Ozonetel	24ms	—	PCMA	26ms
Knowlarity	30ms	—	PCMA	34ms

Three operational truths from running on these:

Codec choice matters more than vendor. PCMA (G.711 A-law) is 8kHz, 64kbps, near-zero encoding latency. Opus is 16–48kHz, 6–510kbps, 2.5–60ms encoding latency depending on frame size. Opus sounds better and gives STT cleaner audio — but on Indian carriers most PSTN handoff goes through G.711 anyway, so Opus gets transcoded back to PCMA at the carrier edge, and you have paid the encoding latency for nothing. Stay on PCMA unless your traffic is WebRTC-originated.

Singapore PoPs add 40–50ms each way — and that compounds at every stage. If your STT, LLM and TTS all run out of Singapore (Deepgram default, OpenAI default until recently, Cartesia Singapore), you have added ~50ms on three round-trips. That is 150ms of pure transit before any model has done any work. Mumbai PoPs for every stage are not optional in 2026 — they are the difference between a 480ms median and an 830ms median.

Jitter buffer tuning is a real lever. Carrier-side jitter at p95 of 28ms means your jitter buffer needs to hold ~60ms of audio to deliver smoothly. Drop it to 40ms and you save 20ms on the budget but accept ~3% audio glitches. Most production stacks run 40–50ms jitter buffer, accept the occasional glitch, and tune VAD around it.

The full provider breakdown is in our telephony partner deep-dive on Plivo, Exotel, Ozonetel, Knowlarity and Twilio.

STT — the choice that drives both latency and downstream cost

STT is where the budget can be saved or blown. Five providers worth considering in India in 2026.

Provider	First-partial	Final-transcript flush	English WER (Indian)	Hindi WER	Hinglish code-switch	Mumbai PoP
Deepgram Nova-3	90ms	180ms	7.4%	13.8%	16.2%	Yes
AssemblyAI Universal-2	140ms	260ms	6.8%	14.6%	17.1%	No (SG)
ElevenLabs Scribe	180ms	320ms	7.1%	12.4%	14.8%	No
Sarvam Saaras v2	110ms	210ms	8.2%	9.6%	11.4%	Yes
AI4Bharat IndicConformer	160ms	280ms	9.4%	8.8%	10.9%	Self-host

A few honest observations from running these in production:

Deepgram Nova-3 is the lowest-latency choice on English-dominant or Hinglish-light calls. The Mumbai PoP makes it ~40ms faster on average than the same model from Singapore. On heavy Hinglish with frequent code-switching — a fintech collections call to a Bengaluru SME owner who slides between English numbers and Hindi sentiment mid-utterance — Nova-3 misroutes about 1 in 6 utterances on language detection and the resulting WER hit cascades into LLM confusion. Sarvam and AI4Bharat win on Hindi and Hinglish; Deepgram wins on English and pure speed.

The pattern that works in production is a router. Detect language on the opening 800ms, route to Sarvam for Hindi-dominant, Deepgram for English-dominant, ElevenLabs Scribe for Tamil/Telugu/Bengali where its multilingual model still leads. The router adds 30–40ms at the start of the call but is amortised across the rest of it. The full multilingual treatment is in our Hindi-Tamil-Telugu-Bengali multilingual voice AI post.

WER numbers above are from clean studio audio. On a real Plivo PCMA stream from a Patna borrower at 8pm on Diwali eve, multiply by 1.6–2.4×. Vendor demos do not survive contact with the buyer's own audio.

LLM — first-token latency is the metric that matters

The LLM stage is where the most engineering time gets spent and where the worst architectural mistakes still happen.

First-token latency, not throughput, is what drives perceived voice latency. A model that generates 200 tokens/second but takes 600ms to start is worse for voice than a model that generates 80 tokens/second but starts in 180ms — because TTS streams from the first token, and the user hears audio as soon as the first phrase is generated.

Model	First-token (warm)	First-token (cold)	Tokens/sec (streaming)	Indian context fit
GPT-4o-mini	240ms	480ms	180	Strong English, weak Indic
Claude 3.5 Haiku	280ms	540ms	140	Strong English + Hinglish
Gemini 2.0 Flash	180ms	320ms	220	Good Indic, fast
Llama 3.3 70B (self-host A100)	140ms	380ms	90	Tune-able, controllable
Sarvam M1	160ms	290ms	130	Best Hindi reasoning

Three engineering moves that reliably cut LLM latency in half:

Prompt caching. Anthropic and OpenAI both expose explicit prompt caching now. The static portion of the prompt — system instructions, tool definitions, knowledge base — stays cached, and only the dynamic turn-by-turn delta gets sent. On a 3,800-token system prompt with a 200-token turn delta, this drops first-token from 420ms to 180ms. The savings compound across turns. Every production voice stack in 2026 should be using this; many still are not.

KV-cache reuse across turns. When you stay on the same model session across a call, the model's key-value cache from the prior turn does not need to be rebuilt. This is invisible at the API surface for hosted models but is a real lever on self-hosted Llama or Mistral deployments. Properly tuned, KV-reuse cuts second-turn-onward first-token to ~100ms.

Right-sizing the model. A 70B model is not always better than an 8B model for voice. Voice prompts are short, decisions are narrow, the model is not writing essays. We run 8B models on routing, classification and confirmation turns, escalate to 70B only on free-text reasoning. The cost saving is real; the latency saving is bigger.

The architectural mistake we still see at senior teams: routing every turn to GPT-4o or Claude Sonnet because "the demo used it." Most voice turns do not need a frontier model. Profile your turns, classify them by required reasoning, and route accordingly.

TTS — where Hindi authenticity meets the latency budget

TTS choice is where voice quality and latency genuinely trade off.

Provider	First-audio chunk	English voice	Hindi authenticity	Streaming	Mumbai PoP
Cartesia Sonic-2	90ms	Excellent	Limited	Yes	Self-host option
ElevenLabs Flash v2.5	75ms	Excellent	Acceptable	Yes	No
ElevenLabs Multilingual v2	280ms	Excellent	Strong	Yes	No
Sarvam Bulbul v2	130ms	Acceptable	Strongest	Yes	Yes
OpenAI TTS-1	320ms	Good	Weak	Limited	No
Google Cloud TTS Chirp	180ms	Good	Acceptable	Yes	Yes

Cartesia Sonic-2 is the fastest TTS on the market and the right default for English-dominant Indian deployments. Its Hindi support is workable but the pronunciation of compound Hindi words and named entities is not at parity with Bulbul. For a collections call to a Hindi-belt borrower where the bot has to say "Janakpuri Extension" or "Lakshmi Nagar" correctly, Bulbul or ElevenLabs Multilingual is the choice — and you accept the latency hit.

The streaming chunk size is the underrated tuning knob. Smaller chunks (40–80ms) get audible faster but produce more network overhead and occasional prosody artifacts. Larger chunks (200–300ms) sound smoother but cost you 100–150ms on first-audible. Production sweet spot we have landed on is 80–120ms initial chunk, 200ms steady-state.

For Indic TTS at depth, see our Indic TTS benchmark covering Bulbul, ElevenLabs Multilingual, Google Cloud TTS and AI4Bharat.

VAD and endpointing — the silent latency killer

Voice Activity Detection and turn endpointing is where most "why is my bot slow?" investigations end up. It is also where the most counter-intuitive tradeoffs sit.

The naive setup: silence threshold 500ms, min_speech_duration 100ms, end-of-turn flush 200ms after silence. Sum that up and you are paying 700ms on every turn before the LLM even sees the transcript. The optimisation: drop silence threshold to 150ms. The cost: the bot now interrupts callers who pause mid-sentence to think. Net latency improvement: zero — because interrupted callers restart, doubling the next turn's effective latency.

What works in production:

Semantic endpointing, not silence endpointing. A small model (often a 1B Llama or a tuned BERT) classifies whether the transcript so far is a "complete utterance" or "likely still speaking." A caller who says "my account number is one nine six" gets recognised as incomplete (numbers usually continue) and the bot waits. A caller who says "I want to close my account" gets recognised as complete and the bot replies immediately. This adds 30–40ms of classifier latency but saves 200–400ms of silence wait.

Per-language VAD tuning. Hindi speech has longer median pauses between phrases than English. A VAD configured for English flags Hindi pauses as end-of-turn ~3× more often. Tune the silence threshold per detected language, not globally.

Backchannel suppression. "Hmm", "haan", "achha" from the caller are not turn-completions. The bot should not respond; it should keep listening. A short-utterance filter (under 400ms with no semantic content) keeps the bot from interrupting on backchannels.

The team that nails endpointing usually beats the team with the faster STT.

What blows the latency budget — five mistakes we still see

Sequential STT, LLM and TTS pipelines. STT runs, completes, then the LLM starts, then TTS starts. Total latency is the sum of three stages. The fix is streaming all three concurrently: STT partials feed the LLM as they arrive, LLM tokens feed TTS as they generate, TTS audio streams to SIP as it synthesises. Done right, total latency becomes max(stages) plus small overheads, not sum. The architectural change pays back 300–500ms on every turn.

Wrong region routing. STT in Singapore, LLM in us-east-1, TTS in Frankfurt, SIP in Mumbai. We have audited stacks where the call audio traversed four continents for a single turn. Every hop is 60–180ms. Get everything to ap-south-1 / Mumbai or accept that you are running an 800ms+ stack.

Over-sized LLM on every turn. Routing turn 1 (greeting), turn 2 (intent capture), turn 3 (number confirmation) all to GPT-4o because the demo did. Turn 1 needs a 100ms canned response. Turn 2 needs a small intent classifier. Only turn 3 onwards needs reasoning. Tier your LLM choice per turn type.

Missing prompt cache. Sending the full system prompt on every turn. With Anthropic prompt caching, the same call with 12 turns sends the 3,800-token system prompt once, not 12 times. First-token latency on turn 2+ drops from ~420ms to ~180ms. Cost drops by 70%. The implementation is two HTTP headers. Many teams have not done it.

No barge-in handling. The caller starts speaking while the bot is mid-sentence. A well-engineered stack detects barge-in within 80ms, stops TTS playback, flushes the audio buffer, and starts STT on the new utterance. A poorly engineered stack lets the bot finish its sentence — 1,500ms of dead time during which the caller's "wait, I have a question" is ignored. Perceived latency goes from acceptable to terrible in one stage.

The reference architecture that hits 480ms median

A stack that we have seen hold 480ms median first-audible and 720ms median end-to-end on Indian telephony at production scale.

Component	Choice	Why
SIP	Plivo Mumbai PoP, PCMA codec, 40ms jitter buffer	Lowest local jitter
Media handling	LiveKit or Pipecat on ap-south-1 EC2 c6i.2xlarge	Mumbai region critical
VAD	Silero VAD with semantic endpointer	150ms silence + classifier
STT	Router → Deepgram Nova-3 (Mumbai) for English/Hinglish, Sarvam Saaras v2 for Hindi	Language-aware
LLM	Tiered: Llama 3.1 8B for routing/confirmation, Claude 3.5 Haiku for reasoning, all with prompt caching	Right-size per turn
TTS	Cartesia Sonic-2 for English, Bulbul v2 for Hindi, 100ms initial chunk	Best speed + Indic
Observability	Per-stage timing, p50/p95/p99, alert on p95 over 600ms	Variance is the enemy

The cost on this stack runs roughly ₹4.20–6.80 per call-minute at 100,000 minutes/day scale — the breakdown is in our voice AI pricing post.

What "good" looks like in production

Metric	Acceptable	Good	Best-in-class
First-audible latency (p50)	600ms	480ms	320ms
First-audible latency (p95)	1,100ms	720ms	540ms
End-to-end latency (p50)	1,000ms	720ms	480ms
Barge-in detection latency	200ms	120ms	80ms
STT WER (Hinglish, real audio)	22%	16%	12%
LLM first-token (p50, cached)	380ms	220ms	140ms
TTS first-audio (p50)	220ms	130ms	80ms

The variance metric — p95 minus p50 — matters more than the median. A stack at 480ms median with 200ms p95 spread feels great. A stack at 380ms median with 800ms p95 spread feels broken on one call in twenty, which is enough to lose the buyer.

Build vs buy — the architecture decision

For an engineering team with 2 senior voice/audio engineers and 6+ months runway, building a sub-500ms stack on LiveKit + Deepgram + Anthropic + Cartesia is achievable. The hard parts are not the components — they are the integration, the semantic endpointer training, the per-language routing, the prompt-cache plumbing, and the observability.

For a team without dedicated voice engineering, a platform like our own AI caller for India ships these tradeoffs pre-tuned. The interesting buyer question is not "build or buy" — it is "which 3 of the 8 components do we want to control, and which 5 are we happy to consume from a platform?"

The teams that end up happiest in 2026 control the prompt, the LLM choice, the TTS voice and the telephony integration — and consume the rest. The teams that try to control everything spend a year on infra and ship a v1 that does not beat the platform they could have started with.

Compliance considerations on the latency stack

Two regulatory points specific to the Indian context.

DPDP 2023 data residency. STT transcripts and LLM inputs are personal data. Running them through Singapore PoPs or us-east-1 endpoints triggers cross-border data transfer rules. Mumbai or ap-south-1 PoPs are not just a latency win — they are the cleaner compliance posture. Confirm with your DPO before defaulting to a foreign region.

TRAI DLT and call recording. Recording happens at SIP egress, not at the application layer. The recording path adds zero latency to the live call but adds storage and retrieval load. Build recording retrieval into the architecture as a first-class concern; the regulator will ask.

The 90-day implementation playbook

Weeks 1–2. Instrument every stage. Log SIP-in, VAD end, STT first-partial, STT final, LLM first-token, LLM done, TTS first-chunk, TTS done, SIP-out. Build the dashboard. You cannot fix what you do not measure. Most stacks discover at this stage that their LLM was 280ms and their VAD was 600ms — and they had been blaming the LLM.

Weeks 3–4. Move to Mumbai PoP for SIP, STT and TTS. Confirm LLM is in ap-south-1 or has equivalent regional endpoints. Measure the drop in p50 and p95.

Weeks 5–6. Implement prompt caching on the LLM. Tier the LLM choice per turn type. Add KV-cache reuse for self-hosted models.

Weeks 7–8. Train or import a semantic endpointer. Tune VAD silence threshold per language. Test barge-in handling under load.

Weeks 9–10. Add the language-aware STT router. Tune TTS chunk size. Profile and remove the worst p95 contributor.

Weeks 11–12. Load test at 2x expected peak. Hold a 24-hour soak test on real Indian carriers. Lock the architecture, document the choices, hand to ops.

By day 90 you have a stack that holds 480ms median, 720ms p95 first-audible on real Indian calls — and an architect who can answer the CFO's question about GPU spend without flinching.

What changes in the next 12 months

Speech-to-speech models hit telephony. GPT Realtime, Gemini Live and the next generation of Sarvam models collapse STT + LLM + TTS into a single model with first-audible latency under 250ms. The architecture simplifies. The compliance posture gets harder because there is no transcript intermediate to audit.

On-device VAD and endpointing. Mobile-side endpointing on the caller's app (where the integration is app-originated, not PSTN) cuts another 80–120ms from the budget.

Indic LLMs catch up. Sarvam M2, AI4Bharat's next generation and IBM Granite-Indic close the gap on Hindi reasoning. The default LLM choice for Hindi-belt deployments shifts from Claude/GPT to Indic-native models with better cultural and linguistic priors.

Regional PoPs from the LLM providers. Anthropic and OpenAI are both signalling ap-south-1 endpoints. The Singapore-vs-Mumbai latency penalty disappears, and the architecture simplifies further.

Bottom line

Sub-500ms latency on voice AI in India is not a vendor pitch — it is an architecture decision. The budget is real, the stages are countable, and the mistakes are predictable. Move everything to Mumbai. Stream STT, LLM and TTS concurrently. Cache the prompt. Tier the LLM. Tune VAD with a semantic endpointer, not just silence. Pick STT and TTS per language. Measure variance, not just median. Do those seven things and you will hold first-audible under 500ms at p50 and end-to-end under 800ms at p95 — on real Indian telephony, with real Indian audio, at production scale.

If you are evaluating low latency ai voice for an Indian fintech, healthcare network or telco and your architecture review has stalled on STT-vs-TTS tradeoffs or region routing, talk to us — we will show you the stage-by-stage timing dashboard from a live deployment, not a demo deck.

What "sub-500ms latency" actually means

The number gets thrown around like it is one number. It is at least three.

The end-to-end latency budget on Indian telephony

Stage	Best case	Typical	Worst case	What dominates
SIP ingress + jitter buffer	20ms	40ms	90ms	Carrier RTT to PoP, codec negotiation
Audio frame buffering (20ms frames)	20ms	40ms	60ms	Frame alignment for STT
VAD end-of-speech detection	80ms	200ms	450ms	Silence threshold + min_silence config
STT final-transcript flush	60ms	120ms	280ms	Model size, language, code-switching
LLM first-token	120ms	280ms	700ms	Prompt size, KV-cache hit, model
TTS first-audio chunk	60ms	140ms	380ms	Model, voice, language, region
SIP egress + carrier delivery	20ms	40ms	90ms	RTT + codec encoding
End-to-end total	380ms	860ms	2,050ms	—
First-audible (with streaming)	280ms	480ms	920ms	—

The SIP layer — where Indian telephony starts the clock

The latency budget begins at the carrier. Before STT runs, before the LLM thinks, before TTS speaks, the audio has already spent 40–90ms in transit.

Provider	Mumbai PoP RTT	Singapore PoP RTT	Default codec	Jitter (p95)
Plivo (India)	18ms	64ms	PCMA	22ms
Exotel	22ms	71ms	PCMA	28ms
Twilio (Mumbai)	26ms	68ms	PCMU/Opus	31ms
Ozonetel	24ms	—	PCMA	26ms
Knowlarity	30ms	—	PCMA	34ms

Three operational truths from running on these:

The full provider breakdown is in our telephony partner deep-dive on Plivo, Exotel, Ozonetel, Knowlarity and Twilio.

STT — the choice that drives both latency and downstream cost

STT is where the budget can be saved or blown. Five providers worth considering in India in 2026.

Provider	First-partial	Final-transcript flush	English WER (Indian)	Hindi WER	Hinglish code-switch	Mumbai PoP
Deepgram Nova-3	90ms	180ms	7.4%	13.8%	16.2%	Yes
AssemblyAI Universal-2	140ms	260ms	6.8%	14.6%	17.1%	No (SG)
ElevenLabs Scribe	180ms	320ms	7.1%	12.4%	14.8%	No
Sarvam Saaras v2	110ms	210ms	8.2%	9.6%	11.4%	Yes
AI4Bharat IndicConformer	160ms	280ms	9.4%	8.8%	10.9%	Self-host

A few honest observations from running these in production:

LLM — first-token latency is the metric that matters

The LLM stage is where the most engineering time gets spent and where the worst architectural mistakes still happen.

Model	First-token (warm)	First-token (cold)	Tokens/sec (streaming)	Indian context fit
GPT-4o-mini	240ms	480ms	180	Strong English, weak Indic
Claude 3.5 Haiku	280ms	540ms	140	Strong English + Hinglish
Gemini 2.0 Flash	180ms	320ms	220	Good Indic, fast
Llama 3.3 70B (self-host A100)	140ms	380ms	90	Tune-able, controllable
Sarvam M1	160ms	290ms	130	Best Hindi reasoning

Three engineering moves that reliably cut LLM latency in half:

TTS — where Hindi authenticity meets the latency budget

TTS choice is where voice quality and latency genuinely trade off.

Provider	First-audio chunk	English voice	Hindi authenticity	Streaming	Mumbai PoP
Cartesia Sonic-2	90ms	Excellent	Limited	Yes	Self-host option
ElevenLabs Flash v2.5	75ms	Excellent	Acceptable	Yes	No
ElevenLabs Multilingual v2	280ms	Excellent	Strong	Yes	No
Sarvam Bulbul v2	130ms	Acceptable	Strongest	Yes	Yes
OpenAI TTS-1	320ms	Good	Weak	Limited	No
Google Cloud TTS Chirp	180ms	Good	Acceptable	Yes	Yes

For Indic TTS at depth, see our Indic TTS benchmark covering Bulbul, ElevenLabs Multilingual, Google Cloud TTS and AI4Bharat.

VAD and endpointing — the silent latency killer

Voice Activity Detection and turn endpointing is where most "why is my bot slow?" investigations end up. It is also where the most counter-intuitive tradeoffs sit.

What works in production:

The team that nails endpointing usually beats the team with the faster STT.

What blows the latency budget — five mistakes we still see

The reference architecture that hits 480ms median

A stack that we have seen hold 480ms median first-audible and 720ms median end-to-end on Indian telephony at production scale.

Component	Choice	Why
SIP	Plivo Mumbai PoP, PCMA codec, 40ms jitter buffer	Lowest local jitter
Media handling	LiveKit or Pipecat on ap-south-1 EC2 c6i.2xlarge	Mumbai region critical
VAD	Silero VAD with semantic endpointer	150ms silence + classifier
STT	Router → Deepgram Nova-3 (Mumbai) for English/Hinglish, Sarvam Saaras v2 for Hindi	Language-aware
LLM	Tiered: Llama 3.1 8B for routing/confirmation, Claude 3.5 Haiku for reasoning, all with prompt caching	Right-size per turn
TTS	Cartesia Sonic-2 for English, Bulbul v2 for Hindi, 100ms initial chunk	Best speed + Indic
Observability	Per-stage timing, p50/p95/p99, alert on p95 over 600ms	Variance is the enemy

The cost on this stack runs roughly ₹4.20–6.80 per call-minute at 100,000 minutes/day scale — the breakdown is in our voice AI pricing post.

What "good" looks like in production

Metric	Acceptable	Good	Best-in-class
First-audible latency (p50)	600ms	480ms	320ms
First-audible latency (p95)	1,100ms	720ms	540ms
End-to-end latency (p50)	1,000ms	720ms	480ms
Barge-in detection latency	200ms	120ms	80ms
STT WER (Hinglish, real audio)	22%	16%	12%
LLM first-token (p50, cached)	380ms	220ms	140ms
TTS first-audio (p50)	220ms	130ms	80ms

Build vs buy — the architecture decision

Compliance considerations on the latency stack

Two regulatory points specific to the Indian context.

The 90-day implementation playbook

Weeks 3–4. Move to Mumbai PoP for SIP, STT and TTS. Confirm LLM is in ap-south-1 or has equivalent regional endpoints. Measure the drop in p50 and p95.

Weeks 5–6. Implement prompt caching on the LLM. Tier the LLM choice per turn type. Add KV-cache reuse for self-hosted models.

Weeks 7–8. Train or import a semantic endpointer. Tune VAD silence threshold per language. Test barge-in handling under load.

Weeks 9–10. Add the language-aware STT router. Tune TTS chunk size. Profile and remove the worst p95 contributor.

Weeks 11–12. Load test at 2x expected peak. Hold a 24-hour soak test on real Indian carriers. Lock the architecture, document the choices, hand to ops.

By day 90 you have a stack that holds 480ms median, 720ms p95 first-audible on real Indian calls — and an architect who can answer the CFO's question about GPU spend without flinching.

What changes in the next 12 months

On-device VAD and endpointing. Mobile-side endpointing on the caller's app (where the integration is app-originated, not PSTN) cuts another 80–120ms from the budget.

Regional PoPs from the LLM providers. Anthropic and OpenAI are both signalling ap-south-1 endpoints. The Singapore-vs-Mumbai latency penalty disappears, and the architecture simplifies further.

Sub-500ms Latency Voice AI in India 2026: The STT + LLM + TTS Architecture That Survives Real Telephony

What "sub-500ms latency" actually means

The end-to-end latency budget on Indian telephony

The SIP layer — where Indian telephony starts the clock

STT — the choice that drives both latency and downstream cost

LLM — first-token latency is the metric that matters

TTS — where Hindi authenticity meets the latency budget

VAD and endpointing — the silent latency killer

What blows the latency budget — five mistakes we still see

The reference architecture that hits 480ms median

What "good" looks like in production

Build vs buy — the architecture decision

Compliance considerations on the latency stack

The 90-day implementation playbook

What changes in the next 12 months

Bottom line

Frequently Asked Questions

What's a realistic SIP-to-first-audio latency budget on Indian telephony in 2026?

Does Deepgram Nova-3 beat AssemblyAI Universal-2 on Hinglish code-switching?

How much latency does prompt caching actually save on the LLM stage?

Should I use Opus or PCMA for the SIP codec?

What's the right VAD silence threshold for Indian voice AI?

How do speech-to-speech models like GPT Realtime change the architecture?

What does the per-call cost look like on a sub-500ms architecture?

Rohan Kapoor

Sub-500ms Latency Voice AI in India 2026: The STT + LLM + TTS Architecture That Survives Real Telephony

What "sub-500ms latency" actually means

The end-to-end latency budget on Indian telephony

The SIP layer — where Indian telephony starts the clock

STT — the choice that drives both latency and downstream cost

LLM — first-token latency is the metric that matters

TTS — where Hindi authenticity meets the latency budget

VAD and endpointing — the silent latency killer

What blows the latency budget — five mistakes we still see

The reference architecture that hits 480ms median

What "good" looks like in production

Build vs buy — the architecture decision

Compliance considerations on the latency stack

The 90-day implementation playbook

What changes in the next 12 months

Bottom line

Frequently Asked Questions

What's a realistic SIP-to-first-audio latency budget on Indian telephony in 2026?

Does Deepgram Nova-3 beat AssemblyAI Universal-2 on Hinglish code-switching?

How much latency does prompt caching actually save on the LLM stage?

Should I use Opus or PCMA for the SIP codec?

What's the right VAD silence threshold for Indian voice AI?

How do speech-to-speech models like GPT Realtime change the architecture?

What does the per-call cost look like on a sub-500ms architecture?

Rohan Kapoor

Other Blogs

AI Cart Recovery Reporting and A/B Testing for D2C India 2026: Dashboards, Cohort Maths and the 12-Week Test Calendar

Voice AI for Quick-Commerce Delivery Partner Operations India 2026: Acceptance Rate, Onboarding, Retention (Blinkit, Zepto, Instamart)

AI Contact Centre for India 2026: Voice + WhatsApp + Web Chat Unified for Indian Enterprises

AI Voice Agent Build vs Buy for Indian Enterprises 2026: When to Build, When to License

AI Voice Agent India 2026: The Buyer's Definition, Pricing Map, Vendor Landscape and How to Pick One

Marketplace Cart Recovery via AI Voice Calls in India 2026: The Amazon, Flipkart, Meesho Multi-Brand Multi-SKU Playbook

AI Telecaller in India 2026: A Vertical-by-Vertical Replacement Playbook for Sales, Support and Collections Teams

Top AI Voice Agent Platforms for Enterprises in India 2025–2026: The RFP Shortlist

Customer Not Available — A Business Continuity Plan for Last-Mile, Collections and Healthcare Operations in India 2026