Sub-500ms Latency Voice AI in India 2026: The STT + LLM + TTS Architecture That Survives Real Telephony

A platform architect at a Mumbai fintech opened a Loom from his QA lead at 11:42 on a Tuesday night. Three calls, all on the same Plivo trunk, all routed to the same agent stack. On the first call the bot replied 380ms after the caller stopped speaking. Crisp. Human. On the second, 1,140ms. Awkward. On the third, 1,890ms — the caller said "hello?" twice before the bot answered. Same code path, same prompt, same model. He scrubbed the logs. STT first-token at 220ms on call one, 410ms on call two, 690ms on call three. The variance was not in his code. It was in eight stages of a pipeline he had not measured end-to-end.
This is the post we wished existed when the team first started chasing low latency ai on Indian telephony. Not a benchmark, not a vendor scorecard — those exist at our voice AI latency benchmarks post and the foundational low-latency primer. This is the architecture. The millisecond budget, stage by stage. The STT, LLM and TTS choices that survive Patna Hindi at 9pm on a 2G fallback. The endpointing tuning that stops the bot from talking over the caller. And the five mistakes we still see senior teams make in 2026.
This is written for the lead engineer or voice platform architect at an Indian fintech, healthcare network or telco who has read the marketing pages, run a demo, and now has to decide whether the stack can actually do 200,000 calls a day at sub-500ms perceived latency without the CFO asking why GPU spend tripled.
What "sub-500ms latency" actually means
The number gets thrown around like it is one number. It is at least three.
End-to-end round-trip latency. The total time from the last phoneme of the caller's utterance leaving their phone to the first phoneme of the bot's reply arriving back at their phone. This is what the caller experiences as the silence between turns. Sub-500ms here is the goal.
First-audible-response latency. The time from end-of-user-speech to the first audio byte playing out on the caller's handset. This is shorter than end-to-end because TTS streams — the first 80ms of the reply plays while the rest is still being generated. On a well-tuned stack, first-audible can be 280–380ms even when full end-to-end is 500–700ms.
Perceived latency. What the caller actually notices. Driven by first-audible plus the prosody of the opening phoneme. A bot that starts with "Umm" or "So" at 320ms feels faster than one that starts with a crisp "Yes" at 280ms — because human listeners forgive filler. Perceived is the metric that closes deals; first-audible is the metric the architect can control.
Most vendor pitches quote first-audible and call it round-trip. Most buyer expectations are calibrated against end-to-end. Get this distinction wrong in your SLA and you will be arguing about a metric that nobody agrees on for the next six months.
The end-to-end latency budget on Indian telephony
Here is the budget that a well-engineered voice AI stack actually spends, broken down stage by stage, on a real Indian carrier — measured across roughly 2.3 million production minutes on Jio, Airtel and Vi over the last three quarters.
| Stage | Best case | Typical | Worst case | What dominates |
|---|---|---|---|---|
| SIP ingress + jitter buffer | 20ms | 40ms | 90ms | Carrier RTT to PoP, codec negotiation |
| Audio frame buffering (20ms frames) | 20ms | 40ms | 60ms | Frame alignment for STT |
| VAD end-of-speech detection | 80ms | 200ms | 450ms | Silence threshold + min_silence config |
| STT final-transcript flush | 60ms | 120ms | 280ms | Model size, language, code-switching |
| LLM first-token | 120ms | 280ms | 700ms | Prompt size, KV-cache hit, model |
| TTS first-audio chunk | 60ms | 140ms | 380ms | Model, voice, language, region |
| SIP egress + carrier delivery | 20ms | 40ms | 90ms | RTT + codec encoding |
| End-to-end total | 380ms | 860ms | 2,050ms | — |
| First-audible (with streaming) | 280ms | 480ms | 920ms | — |
The honest read on this table: best case sub-500ms is achievable on Indian telephony in 2026. Typical case is not. The gap between best and typical is almost entirely VAD tuning, LLM choice, and region routing. Three knobs, in that order.
The VAD line is the one most teams underestimate. A 200ms silence threshold sounds aggressive on paper. On a real call with a caller who pauses mid-sentence to think, it triggers a false end-of-speech, the bot interrupts, the caller restarts, latency on the next turn doubles. The number that matters is not the threshold — it is the variance of the threshold across caller demographics.
The SIP layer — where Indian telephony starts the clock
The latency budget begins at the carrier. Before STT runs, before the LLM thinks, before TTS speaks, the audio has already spent 40–90ms in transit.
| Provider | Mumbai PoP RTT | Singapore PoP RTT | Default codec | Jitter (p95) |
|---|---|---|---|---|
| Plivo (India) | 18ms | 64ms | PCMA | 22ms |
| Exotel | 22ms | 71ms | PCMA | 28ms |
| Twilio (Mumbai) | 26ms | 68ms | PCMU/Opus | 31ms |
| Ozonetel | 24ms | — | PCMA | 26ms |
| Knowlarity | 30ms | — | PCMA | 34ms |
Three operational truths from running on these:
Codec choice matters more than vendor. PCMA (G.711 A-law) is 8kHz, 64kbps, near-zero encoding latency. Opus is 16–48kHz, 6–510kbps, 2.5–60ms encoding latency depending on frame size. Opus sounds better and gives STT cleaner audio — but on Indian carriers most PSTN handoff goes through G.711 anyway, so Opus gets transcoded back to PCMA at the carrier edge, and you have paid the encoding latency for nothing. Stay on PCMA unless your traffic is WebRTC-originated.
Singapore PoPs add 40–50ms each way — and that compounds at every stage. If your STT, LLM and TTS all run out of Singapore (Deepgram default, OpenAI default until recently, Cartesia Singapore), you have added ~50ms on three round-trips. That is 150ms of pure transit before any model has done any work. Mumbai PoPs for every stage are not optional in 2026 — they are the difference between a 480ms median and an 830ms median.
Jitter buffer tuning is a real lever. Carrier-side jitter at p95 of 28ms means your jitter buffer needs to hold ~60ms of audio to deliver smoothly. Drop it to 40ms and you save 20ms on the budget but accept ~3% audio glitches. Most production stacks run 40–50ms jitter buffer, accept the occasional glitch, and tune VAD around it.
The full provider breakdown is in our telephony partner deep-dive on Plivo, Exotel, Ozonetel, Knowlarity and Twilio.
STT — the choice that drives both latency and downstream cost
STT is where the budget can be saved or blown. Five providers worth considering in India in 2026.
| Provider | First-partial | Final-transcript flush | English WER (Indian) | Hindi WER | Hinglish code-switch | Mumbai PoP |
|---|---|---|---|---|---|---|
| Deepgram Nova-3 | 90ms | 180ms | 7.4% | 13.8% | 16.2% | Yes |
| AssemblyAI Universal-2 | 140ms | 260ms | 6.8% | 14.6% | 17.1% | No (SG) |
| ElevenLabs Scribe | 180ms | 320ms | 7.1% | 12.4% | 14.8% | No |
| Sarvam Saaras v2 | 110ms | 210ms | 8.2% | 9.6% | 11.4% | Yes |
| AI4Bharat IndicConformer | 160ms | 280ms | 9.4% | 8.8% | 10.9% | Self-host |
A few honest observations from running these in production:
Deepgram Nova-3 is the lowest-latency choice on English-dominant or Hinglish-light calls. The Mumbai PoP makes it ~40ms faster on average than the same model from Singapore. On heavy Hinglish with frequent code-switching — a fintech collections call to a Bengaluru SME owner who slides between English numbers and Hindi sentiment mid-utterance — Nova-3 misroutes about 1 in 6 utterances on language detection and the resulting WER hit cascades into LLM confusion. Sarvam and AI4Bharat win on Hindi and Hinglish; Deepgram wins on English and pure speed.
The pattern that works in production is a router. Detect language on the opening 800ms, route to Sarvam for Hindi-dominant, Deepgram for English-dominant, ElevenLabs Scribe for Tamil/Telugu/Bengali where its multilingual model still leads. The router adds 30–40ms at the start of the call but is amortised across the rest of it. The full multilingual treatment is in our Hindi-Tamil-Telugu-Bengali multilingual voice AI post.
WER numbers above are from clean studio audio. On a real Plivo PCMA stream from a Patna borrower at 8pm on Diwali eve, multiply by 1.6–2.4×. Vendor demos do not survive contact with the buyer's own audio.
LLM — first-token latency is the metric that matters
The LLM stage is where the most engineering time gets spent and where the worst architectural mistakes still happen.
First-token latency, not throughput, is what drives perceived voice latency. A model that generates 200 tokens/second but takes 600ms to start is worse for voice than a model that generates 80 tokens/second but starts in 180ms — because TTS streams from the first token, and the user hears audio as soon as the first phrase is generated.
| Model | First-token (warm) | First-token (cold) | Tokens/sec (streaming) | Indian context fit |
|---|---|---|---|---|
| GPT-4o-mini | 240ms | 480ms | 180 | Strong English, weak Indic |
| Claude 3.5 Haiku | 280ms | 540ms | 140 | Strong English + Hinglish |
| Gemini 2.0 Flash | 180ms | 320ms | 220 | Good Indic, fast |
| Llama 3.3 70B (self-host A100) | 140ms | 380ms | 90 | Tune-able, controllable |
| Sarvam M1 | 160ms | 290ms | 130 | Best Hindi reasoning |
Three engineering moves that reliably cut LLM latency in half:
Prompt caching. Anthropic and OpenAI both expose explicit prompt caching now. The static portion of the prompt — system instructions, tool definitions, knowledge base — stays cached, and only the dynamic turn-by-turn delta gets sent. On a 3,800-token system prompt with a 200-token turn delta, this drops first-token from 420ms to 180ms. The savings compound across turns. Every production voice stack in 2026 should be using this; many still are not.
KV-cache reuse across turns. When you stay on the same model session across a call, the model's key-value cache from the prior turn does not need to be rebuilt. This is invisible at the API surface for hosted models but is a real lever on self-hosted Llama or Mistral deployments. Properly tuned, KV-reuse cuts second-turn-onward first-token to ~100ms.
Right-sizing the model. A 70B model is not always better than an 8B model for voice. Voice prompts are short, decisions are narrow, the model is not writing essays. We run 8B models on routing, classification and confirmation turns, escalate to 70B only on free-text reasoning. The cost saving is real; the latency saving is bigger.
The architectural mistake we still see at senior teams: routing every turn to GPT-4o or Claude Sonnet because "the demo used it." Most voice turns do not need a frontier model. Profile your turns, classify them by required reasoning, and route accordingly.
TTS — where Hindi authenticity meets the latency budget
TTS choice is where voice quality and latency genuinely trade off.
| Provider | First-audio chunk | English voice | Hindi authenticity | Streaming | Mumbai PoP |
|---|---|---|---|---|---|
| Cartesia Sonic-2 | 90ms | Excellent | Limited | Yes | Self-host option |
| ElevenLabs Flash v2.5 | 75ms | Excellent | Acceptable | Yes | No |
| ElevenLabs Multilingual v2 | 280ms | Excellent | Strong | Yes | No |
| Sarvam Bulbul v2 | 130ms | Acceptable | Strongest | Yes | Yes |
| OpenAI TTS-1 | 320ms | Good | Weak | Limited | No |
| Google Cloud TTS Chirp | 180ms | Good | Acceptable | Yes | Yes |
Cartesia Sonic-2 is the fastest TTS on the market and the right default for English-dominant Indian deployments. Its Hindi support is workable but the pronunciation of compound Hindi words and named entities is not at parity with Bulbul. For a collections call to a Hindi-belt borrower where the bot has to say "Janakpuri Extension" or "Lakshmi Nagar" correctly, Bulbul or ElevenLabs Multilingual is the choice — and you accept the latency hit.
The streaming chunk size is the underrated tuning knob. Smaller chunks (40–80ms) get audible faster but produce more network overhead and occasional prosody artifacts. Larger chunks (200–300ms) sound smoother but cost you 100–150ms on first-audible. Production sweet spot we have landed on is 80–120ms initial chunk, 200ms steady-state.
For Indic TTS at depth, see our Indic TTS benchmark covering Bulbul, ElevenLabs Multilingual, Google Cloud TTS and AI4Bharat.
VAD and endpointing — the silent latency killer
Voice Activity Detection and turn endpointing is where most "why is my bot slow?" investigations end up. It is also where the most counter-intuitive tradeoffs sit.
The naive setup: silence threshold 500ms, min_speech_duration 100ms, end-of-turn flush 200ms after silence. Sum that up and you are paying 700ms on every turn before the LLM even sees the transcript. The optimisation: drop silence threshold to 150ms. The cost: the bot now interrupts callers who pause mid-sentence to think. Net latency improvement: zero — because interrupted callers restart, doubling the next turn's effective latency.
What works in production:
Semantic endpointing, not silence endpointing. A small model (often a 1B Llama or a tuned BERT) classifies whether the transcript so far is a "complete utterance" or "likely still speaking." A caller who says "my account number is one nine six" gets recognised as incomplete (numbers usually continue) and the bot waits. A caller who says "I want to close my account" gets recognised as complete and the bot replies immediately. This adds 30–40ms of classifier latency but saves 200–400ms of silence wait.
Per-language VAD tuning. Hindi speech has longer median pauses between phrases than English. A VAD configured for English flags Hindi pauses as end-of-turn ~3× more often. Tune the silence threshold per detected language, not globally.
Backchannel suppression. "Hmm", "haan", "achha" from the caller are not turn-completions. The bot should not respond; it should keep listening. A short-utterance filter (under 400ms with no semantic content) keeps the bot from interrupting on backchannels.
The team that nails endpointing usually beats the team with the faster STT.
What blows the latency budget — five mistakes we still see
Sequential STT, LLM and TTS pipelines. STT runs, completes, then the LLM starts, then TTS starts. Total latency is the sum of three stages. The fix is streaming all three concurrently: STT partials feed the LLM as they arrive, LLM tokens feed TTS as they generate, TTS audio streams to SIP as it synthesises. Done right, total latency becomes max(stages) plus small overheads, not sum. The architectural change pays back 300–500ms on every turn.
Wrong region routing. STT in Singapore, LLM in us-east-1, TTS in Frankfurt, SIP in Mumbai. We have audited stacks where the call audio traversed four continents for a single turn. Every hop is 60–180ms. Get everything to ap-south-1 / Mumbai or accept that you are running an 800ms+ stack.
Over-sized LLM on every turn. Routing turn 1 (greeting), turn 2 (intent capture), turn 3 (number confirmation) all to GPT-4o because the demo did. Turn 1 needs a 100ms canned response. Turn 2 needs a small intent classifier. Only turn 3 onwards needs reasoning. Tier your LLM choice per turn type.
Missing prompt cache. Sending the full system prompt on every turn. With Anthropic prompt caching, the same call with 12 turns sends the 3,800-token system prompt once, not 12 times. First-token latency on turn 2+ drops from ~420ms to ~180ms. Cost drops by 70%. The implementation is two HTTP headers. Many teams have not done it.
No barge-in handling. The caller starts speaking while the bot is mid-sentence. A well-engineered stack detects barge-in within 80ms, stops TTS playback, flushes the audio buffer, and starts STT on the new utterance. A poorly engineered stack lets the bot finish its sentence — 1,500ms of dead time during which the caller's "wait, I have a question" is ignored. Perceived latency goes from acceptable to terrible in one stage.
The reference architecture that hits 480ms median
A stack that we have seen hold 480ms median first-audible and 720ms median end-to-end on Indian telephony at production scale.
| Component | Choice | Why |
|---|---|---|
| SIP | Plivo Mumbai PoP, PCMA codec, 40ms jitter buffer | Lowest local jitter |
| Media handling | LiveKit or Pipecat on ap-south-1 EC2 c6i.2xlarge | Mumbai region critical |
| VAD | Silero VAD with semantic endpointer | 150ms silence + classifier |
| STT | Router → Deepgram Nova-3 (Mumbai) for English/Hinglish, Sarvam Saaras v2 for Hindi | Language-aware |
| LLM | Tiered: Llama 3.1 8B for routing/confirmation, Claude 3.5 Haiku for reasoning, all with prompt caching | Right-size per turn |
| TTS | Cartesia Sonic-2 for English, Bulbul v2 for Hindi, 100ms initial chunk | Best speed + Indic |
| Observability | Per-stage timing, p50/p95/p99, alert on p95 over 600ms | Variance is the enemy |
The cost on this stack runs roughly ₹4.20–6.80 per call-minute at 100,000 minutes/day scale — the breakdown is in our voice AI pricing post.
What "good" looks like in production
| Metric | Acceptable | Good | Best-in-class |
|---|---|---|---|
| First-audible latency (p50) | 600ms | 480ms | 320ms |
| First-audible latency (p95) | 1,100ms | 720ms | 540ms |
| End-to-end latency (p50) | 1,000ms | 720ms | 480ms |
| Barge-in detection latency | 200ms | 120ms | 80ms |
| STT WER (Hinglish, real audio) | 22% | 16% | 12% |
| LLM first-token (p50, cached) | 380ms | 220ms | 140ms |
| TTS first-audio (p50) | 220ms | 130ms | 80ms |
The variance metric — p95 minus p50 — matters more than the median. A stack at 480ms median with 200ms p95 spread feels great. A stack at 380ms median with 800ms p95 spread feels broken on one call in twenty, which is enough to lose the buyer.
Build vs buy — the architecture decision
For an engineering team with 2 senior voice/audio engineers and 6+ months runway, building a sub-500ms stack on LiveKit + Deepgram + Anthropic + Cartesia is achievable. The hard parts are not the components — they are the integration, the semantic endpointer training, the per-language routing, the prompt-cache plumbing, and the observability.
For a team without dedicated voice engineering, a platform like our own AI caller for India ships these tradeoffs pre-tuned. The interesting buyer question is not "build or buy" — it is "which 3 of the 8 components do we want to control, and which 5 are we happy to consume from a platform?"
The teams that end up happiest in 2026 control the prompt, the LLM choice, the TTS voice and the telephony integration — and consume the rest. The teams that try to control everything spend a year on infra and ship a v1 that does not beat the platform they could have started with.
Compliance considerations on the latency stack
Two regulatory points specific to the Indian context.
DPDP 2023 data residency. STT transcripts and LLM inputs are personal data. Running them through Singapore PoPs or us-east-1 endpoints triggers cross-border data transfer rules. Mumbai or ap-south-1 PoPs are not just a latency win — they are the cleaner compliance posture. Confirm with your DPO before defaulting to a foreign region.
TRAI DLT and call recording. Recording happens at SIP egress, not at the application layer. The recording path adds zero latency to the live call but adds storage and retrieval load. Build recording retrieval into the architecture as a first-class concern; the regulator will ask.
The 90-day implementation playbook
Weeks 1–2. Instrument every stage. Log SIP-in, VAD end, STT first-partial, STT final, LLM first-token, LLM done, TTS first-chunk, TTS done, SIP-out. Build the dashboard. You cannot fix what you do not measure. Most stacks discover at this stage that their LLM was 280ms and their VAD was 600ms — and they had been blaming the LLM.
Weeks 3–4. Move to Mumbai PoP for SIP, STT and TTS. Confirm LLM is in ap-south-1 or has equivalent regional endpoints. Measure the drop in p50 and p95.
Weeks 5–6. Implement prompt caching on the LLM. Tier the LLM choice per turn type. Add KV-cache reuse for self-hosted models.
Weeks 7–8. Train or import a semantic endpointer. Tune VAD silence threshold per language. Test barge-in handling under load.
Weeks 9–10. Add the language-aware STT router. Tune TTS chunk size. Profile and remove the worst p95 contributor.
Weeks 11–12. Load test at 2x expected peak. Hold a 24-hour soak test on real Indian carriers. Lock the architecture, document the choices, hand to ops.
By day 90 you have a stack that holds 480ms median, 720ms p95 first-audible on real Indian calls — and an architect who can answer the CFO's question about GPU spend without flinching.
What changes in the next 12 months
Speech-to-speech models hit telephony. GPT Realtime, Gemini Live and the next generation of Sarvam models collapse STT + LLM + TTS into a single model with first-audible latency under 250ms. The architecture simplifies. The compliance posture gets harder because there is no transcript intermediate to audit.
On-device VAD and endpointing. Mobile-side endpointing on the caller's app (where the integration is app-originated, not PSTN) cuts another 80–120ms from the budget.
Indic LLMs catch up. Sarvam M2, AI4Bharat's next generation and IBM Granite-Indic close the gap on Hindi reasoning. The default LLM choice for Hindi-belt deployments shifts from Claude/GPT to Indic-native models with better cultural and linguistic priors.
Regional PoPs from the LLM providers. Anthropic and OpenAI are both signalling ap-south-1 endpoints. The Singapore-vs-Mumbai latency penalty disappears, and the architecture simplifies further.
Bottom line
Sub-500ms latency on voice AI in India is not a vendor pitch — it is an architecture decision. The budget is real, the stages are countable, and the mistakes are predictable. Move everything to Mumbai. Stream STT, LLM and TTS concurrently. Cache the prompt. Tier the LLM. Tune VAD with a semantic endpointer, not just silence. Pick STT and TTS per language. Measure variance, not just median. Do those seven things and you will hold first-audible under 500ms at p50 and end-to-end under 800ms at p95 — on real Indian telephony, with real Indian audio, at production scale.
If you are evaluating low latency ai voice for an Indian fintech, healthcare network or telco and your architecture review has stalled on STT-vs-TTS tradeoffs or region routing, talk to us — we will show you the stage-by-stage timing dashboard from a live deployment, not a demo deck.
Frequently Asked Questions
Tags :
Rohan architects voice AI deployments for Indian enterprises — STT/LLM/TTS pipelines, telephony integration, and DPDP/TRAI/RBI-aligned call flows. Background in conversational AI and SIP infrastructure.





