AI Voice Agent Build vs Buy for Indian Enterprises 2026: When to Build, When to License

A VP of Engineering at a 1,200-headcount NBFC walked into our office in late April with a deck his CTO had asked him to defend in two weeks. The deck argued the firm should build its own outbound voice agent on top of an open-weights LLM and a self-hosted STT model, route it through their existing Exotel trunk, and skip platforms entirely. The number on slide four was ₹2.4 crore over twelve months — a third of what three commercial vendors had quoted. The number on slide eleven, in smaller font, was "Year-2 headcount: 11 FTE." He did not believe slide eleven. Neither did his CFO. But he could not articulate why, and the board meeting was on a Friday.
That conversation is the reason this post exists. The build-versus-buy decision for voice agents inside a regulated Indian enterprise is not really about software costs. It is about who owns the regulatory burden, who owns the Hindi WER drift in Patna, and who staffs the on-call rotation when a DLT header expires mid-campaign. Most build-side decks underweight all three. Most buy-side decks overweight the first.
The thesis
This post argues a single position: for almost every 500+ headcount Indian enterprise running outbound voice today, the right answer is to license the platform, own the prompts, own the data layer, and reserve build-side investment for the two or three flows that genuinely justify it. Pure build is correct in roughly one in twenty cases, usually where there are proprietary tone constraints, exotic compliance, or deep ERP coupling. Pure buy without owning prompts and data is correct in roughly zero cases at this headcount, because you will otherwise be re-platforming every eighteen months. The rest of the post explains how to tell which case you are in and lays out a 12-week decision playbook you can hand to your CTO.
Why this matters in 2026 and not 2024
Three things shifted between 2024 and now that make this decision harder, not easier.
DPDP 2023 enforcement guidance from the Data Protection Board came into operational shape in early 2026. Purpose-bound consent — the requirement that you can prove, per call, that the data subject's consent covered this specific purpose — moved from a theoretical clause to something auditors actually ask for. A homegrown voice stack that logs consent in five different places across five different microservices is now a compliance liability, not a feature. Platforms have started shipping consent-ledger primitives. Your build team has not.
TRAI DLT scrubbing rules tightened in March 2026, particularly around content templates for transactional vs service vs promotional categorisation. Voice agents that say anything resembling promotional content — "would you like to upgrade?", "we have a new offer" — now need pre-approved content templates registered against the right header on the DLT portal. If your build team has not touched the DLT API in the last quarter, they will be surprised by how much it has changed.
The STT/TTS market collapsed in price but fragmented in quality. Indian-language STT pricing on the major providers dropped roughly 40% between Q4 2025 and Q2 2026. But the quality gap between vendor models on Patna Hindi versus Delhi Hindi widened, not narrowed. Build teams who benchmarked on a Delhi-Hindi golden set in 2024 are running production on assumptions that no longer hold. We have audited four such deployments in 2026 and three of them had real-world Hindi WER between 19% and 26% in eastern UP and Bihar — twice the WER on their golden set.
If you have not re-run the build-vs-buy math since these three shifts, you are working off stale numbers.
The mechanism: what an outbound voice agent stack actually contains
People underestimate this. The build-side proposal in your inbox probably has seven boxes on the architecture diagram. The real stack has somewhere between eighteen and twenty-six components, depending on use case. Here is the honest decomposition.
The eight layers of a production voice agent stack
| Layer | What it does | Build complexity | Drift risk |
|---|---|---|---|
| Telephony / SIP | Outbound dialing, trunk management, DTMF, call legs | Low (vendor-fronted) | Low |
| DLT scrubbing | Header validation, content template match, opt-out check at dial time | Medium | High — TRAI rules change quarterly |
| Dialer / pacing | Predictive vs progressive, abandon rate caps, retries | Medium | Medium |
| STT | Real-time transcription, language detect, code-switch handling | High | Very high — Hindi WER drifts by region |
| Dialog / LLM | Intent, slot-filling, conversation policy, refusal handling | High | High — model versions deprecate |
| TTS | Voice synthesis, prosody, code-switch pronunciation | Medium | Medium |
| Integration | CRM writeback, LMS hooks, NACH triggers, payment links | Medium-High | Low once stable |
| Consent + audit | DPDP purpose binding, recording disclosure, retention | Medium | High — regulator interpretation evolving |
Each of these layers has its own provider market, its own SLA, its own failure mode, and its own on-call rotation. A build team that proposes to own all eight is proposing to staff eight specialisations. A buy team that owns none of them ends up unable to debug their own production incidents.
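To make the DLT scrubbing row concrete, here is a minimal sketch of the three dial-time checks the table names: header validity, content-template match, and opt-out. Every name in it (`Header`, `pre_dial_check`, the in-memory opt-out set) is hypothetical. A real deployment would query the operator DLT portals rather than local data, and this is not any portal's actual API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Header:
    """Hypothetical local mirror of a registered DLT header."""
    header_id: str
    expires_on: date
    approved_template_ids: frozenset


def pre_dial_check(header, template_id, msisdn, opted_out, today):
    """Return (ok, reason). Checks run in the order the table lists them."""
    if today > header.expires_on:
        return False, "header_expired"
    if template_id not in header.approved_template_ids:
        return False, "template_not_approved"
    if msisdn in opted_out:
        return False, "subscriber_opted_out"
    return True, "ok"


# Illustrative data only.
header = Header("HDR-001", date(2026, 6, 30), frozenset({"TPL-EMI-01"}))
opted_out = {"+919800000001"}

print(pre_dial_check(header, "TPL-EMI-01", "+919800000002", opted_out, date(2026, 6, 1)))
print(pre_dial_check(header, "TPL-UPSELL-07", "+919800000002", opted_out, date(2026, 6, 1)))
print(pre_dial_check(header, "TPL-EMI-01", "+919800000002", opted_out, date(2026, 7, 1)))
```

The logic is trivial. The drift risk in the table comes from keeping the header and template data behind it current.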
Where the real engineering hours go
The slide-four cost in most build decks assumes engineering effort is dominated by the dialog layer — prompt engineering, intent classification, conversation policy. In practice, across four deployments we have audited, the breakdown looks more like this:
- 15–20% on the dialog layer itself
- 25–30% on telephony + DLT plumbing
- 20–25% on STT/TTS tuning for regional Hindi and code-switching
- 15–20% on CRM/ERP integration and idempotency
- 10–15% on consent ledger, audit logging, and DPDP artefacts
- 5–10% on dashboards, agent supervisor tools, and ops UX
The dialog layer, the part that feels like the "AI work", is one of the smaller line items: under a fifth of the total effort, and less than either telephony and DLT plumbing or STT/TTS tuning. Build teams discover this in month four, when the demo is impressive but 30% of calls fail to connect because the DLT header rotated.
Latency budget, in milliseconds
Realistic budget for a natural-feeling outbound conversation is 700–900ms round-trip from end-of-user-utterance to start-of-agent-audio. That decomposes roughly to: 150–250ms STT finalisation, 200–350ms LLM completion (with streaming), 100–200ms TTS first-byte, plus 100–150ms of network and SIP overhead. Each layer eats into this budget. If your build team picks an STT provider with 400ms median latency on Hindi (several do), you will never recover the conversational feel even with a sub-200ms LLM. Most platforms publish a budget and enforce it as an SLA. Most build teams discover the budget exists only after the first production call sounds like a walkie-talkie.
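The budget arithmetic above is easy to check for yourself. The sketch below sums the component ranges quoted in the paragraph and then swaps in a hypothetical STT provider with a flat 400 ms latency. The dictionary keys and function names are ours, chosen for illustration.

```python
# Component ranges in milliseconds, taken from the paragraph above.
BUDGET_MS = {
    "stt_finalisation": (150, 250),
    "llm_completion": (200, 350),
    "tts_first_byte": (100, 200),
    "network_and_sip": (100, 150),
}
TARGET_MS = (700, 900)  # round-trip target range from the text


def total_range(components):
    """Sum the lower bounds and the upper bounds across all layers."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


def worst_case_headroom(components, target_max):
    """Milliseconds left (negative means over budget) if every layer hits its upper bound."""
    return target_max - total_range(components)[1]


print(total_range(BUDGET_MS))                          # (550, 950)
print(worst_case_headroom(BUDGET_MS, TARGET_MS[1]))    # -50

# Assumption for illustration: an STT provider with a flat 400 ms on Hindi.
slow_stt = {**BUDGET_MS, "stt_finalisation": (400, 400)}
print(total_range(slow_stt))                           # (800, 1100)
```

Under these figures the slow STT alone pushes the best case to 800 ms, so the target is met only if every other layer lands at its lower bound.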
What goes wrong on the build side
These are the seven failure modes we have seen in 2025–26 across four NBFC, two hospital-chain, and three D2C build attempts. Listed in roughly the order they bite.
Underestimating the DLT churn. Build team treats DLT registration as a one-time setup. In practice headers get rotated, templates get rejected, and content categorisation gets re-flagged on roughly a six-to-ten-week cadence. Without a dedicated ops engineer who understands the Jio/Airtel/Vi DLT portals, campaigns stop mid-flight. We have seen three teams discover this only after a regulator complaint.
The Hindi WER cliff. STT chosen on a Delhi-Hindi benchmark looks fine. The first 5,000 calls into Tier-2 UP, Bihar, and rural Maharashtra reveal WER in the 18–26% range. The dialog layer was designed assuming 8% WER. Slot extraction breaks. The team starts patching with regex fallbacks. By month six the codebase is a pile of regex.
Consent ledger as an afterthought. DPDP purpose binding requires that for every call, you can produce: the consent artefact, its scope, its timestamp, and the audit trail showing the call was made under that scope. Build teams typically log consent in the CRM and call logs in a separate event store. Stitching them post-hoc for a regulator request takes weeks. Platforms ship this as a primitive.
LLM provider deprecation roulette. A model your team built against in Q1 2026 may be tier-shifted, price-changed, or quality-shifted by Q4. Build teams without an abstraction layer end up re-prompting from scratch. Platforms either pin model versions or maintain regression suites across versions. Yours probably does not.
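One way to picture the abstraction layer and regression suite mentioned here is the sketch below. It is an assumption-laden illustration, not any vendor's SDK: `ChatModel`, `StubModel`, the model ids, and the two regression cases are all hypothetical, and the stub replies stand in for real provider calls so the example runs offline.

```python
from typing import Callable, Protocol


class ChatModel(Protocol):
    """Hypothetical provider-agnostic interface."""
    model_id: str

    def complete(self, prompt: str) -> str: ...


class StubModel:
    """Stand-in for a real provider client."""

    def __init__(self, model_id: str, reply_fn: Callable[[str], str]):
        self.model_id = model_id
        self._reply_fn = reply_fn

    def complete(self, prompt: str) -> str:
        return self._reply_fn(prompt)


# Regression cases: (prompt, substring the reply must contain).
REGRESSION_CASES = [
    ("Customer says: I already paid yesterday", "verify"),
    ("Customer says: stop calling me", "opt-out"),
]


def regression_failures(model: ChatModel) -> list:
    """Return the prompts whose replies no longer contain the expected marker."""
    return [prompt for prompt, expected in REGRESSION_CASES
            if expected not in model.complete(prompt).lower()]


pinned = StubModel("provider-x-2026-01", lambda p: (
    "I will register your opt-out." if "stop" in p else "Let me verify that payment."))
candidate = StubModel("provider-x-2026-09", lambda p: "Let me verify that payment.")

print(regression_failures(pinned))     # []
print(regression_failures(candidate))  # ['Customer says: stop calling me']
```

Because the dialog code depends only on the interface, a new model version is a new object to run through the same suite before it reaches production.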
Telephony cost surprise. Outbound trunks bill per second with a minimum-pulse, and call drop / busy / SIM-off ratios in India run 35–50% depending on the geography. Build teams budget cost-per-connected-minute. They get billed cost-per-attempted-call-second. The variance is usually 1.6–2.1x against budget.
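A rough sketch of why the two billing views diverge, under stated assumptions: every attempt is billed for at least a minimum pulse, failed attempts still consume that pulse, and the per-second rate is the same throughout (so it cancels out of the ratio). The input figures are illustrative and not taken from any trunk contract. Retries and multi-leg calls, which widen the gap further, are not modelled.

```python
def billed_seconds(duration_s, min_pulse_s):
    """Seconds billed for one attempt under a minimum-pulse rule."""
    return max(duration_s, min_pulse_s)


def cost_ratio(attempts, fail_ratio, connected_s, failed_s, min_pulse_s):
    """Billed seconds across all attempts divided by budgeted connected seconds."""
    connected = attempts * (1 - fail_ratio)
    failed = attempts * fail_ratio
    budgeted = connected * connected_s
    billed = (connected * billed_seconds(connected_s, min_pulse_s)
              + failed * billed_seconds(failed_s, min_pulse_s))
    return billed / budgeted


# Illustrative inputs: 40% of attempts fail, connected calls last 90 s,
# failed attempts ring for 10 s, and the minimum pulse is 30 s.
ratio = cost_ratio(attempts=100_000, fail_ratio=0.4,
                   connected_s=90, failed_s=10, min_pulse_s=30)
print(round(ratio, 2))  # 1.22
```

Even this simplified model lands about 22% over a budget built on connected minutes alone.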
On-call rotation. Outbound campaigns run between 10:30 AM and 8 PM IST. Production incidents — STT provider 500s, TTS regional outage, CRM webhook timeouts — happen inside that window, every single day. A serious build needs a three-person rotation. Most build proposals budget for one engineer "on rotation as needed". This is the headcount line on slide eleven that finance disbelieves.
The supervisor UX nobody wanted to build. Ops teams need to listen to live calls, override the agent, mark calls for QA, and segment by header / campaign / region. This is a small product in itself. Build teams discover it in month five when the ops lead refuses to use the agent because she cannot see what it is doing.
The numbers, with realistic ranges
Costs are the part where build-side decks lie to themselves most. Here is what the math actually looks like for a mid-size outbound deployment doing 300,000 to 1 million calls per month.
Build-side annual costs, plausible Indian ranges
| Line item | Low | High | Notes |
|---|---|---|---|
| Engineering FTE (3–6 people) | ₹1.2 Cr | ₹2.8 Cr | Year 1 build, Year 2 maintenance and drift |
| STT (Hindi + 2 regional) | ₹35 L | ₹95 L | Depends on call volume and provider |
| LLM inference | ₹25 L | ₹80 L | Streaming, average 12–18 turns per call |
| TTS | ₹15 L | ₹45 L | Indic voices, premium tier |
| Telephony / trunk | ₹60 L | ₹2.2 Cr | Depends on volume and answer-seizure ratio (ASR) |
| DLT ops + compliance | ₹12 L | ₹28 L | One ops engineer, partial |
| Observability + tooling | ₹8 L | ₹22 L | Logging, traces, recordings retention |
| Total | ₹2.75 Cr | ₹7.7 Cr | Sum of the rows above; excludes opportunity cost of slow ramp |
The bottom of the range, about ₹2.75 Cr, assumes everything goes right, the team is in place from day one, and the use case is narrow. The top end assumes a multi-flow, multi-language deployment. Both ranges exclude the cost of being eight months later to production than a buy path, which on collections or cart recovery is typically ₹1–3 Cr in unrealised recovery.
Buy-side annual costs
| Use case | Calls / month | Realistic annual platform spend |
|---|---|---|
| EMI reminders, narrow flow | 200k–400k | ₹35 L – ₹70 L |
| Cart recovery, multi-SKU | 300k–600k | ₹50 L – ₹95 L |
| Insurance renewal + upsell | 400k–800k | ₹65 L – ₹1.3 Cr |
| Collections, multi-bucket | 500k–1.2M | ₹90 L – ₹1.9 Cr |
| Multi-flow enterprise rollout | 1M+ | ₹1.4 Cr – ₹3.2 Cr |
Add to the buy column ₹40–80 L of in-house engineering for prompt ownership, data layer, and integration glue — which you should be doing whether you build or buy, and which we will come to.
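As a quick worked version of that addition, the sketch below adds the ₹40–80 L in-house range to each row of the buy-side table. The figures are copied from the table and the sentence above. The helper names are ours.

```python
LAKH_PER_CRORE = 100

# Annual platform spend in lakh rupees (low, high), copied from the buy-side table.
PLATFORM_SPEND_L = {
    "EMI reminders, narrow flow": (35, 70),
    "Cart recovery, multi-SKU": (50, 95),
    "Insurance renewal + upsell": (65, 130),
    "Collections, multi-bucket": (90, 190),
    "Multi-flow enterprise rollout": (140, 320),
}
IN_HOUSE_L = (40, 80)  # in-house engineering range from the text


def all_in_range(use_case):
    """Platform spend plus in-house engineering, in lakh rupees."""
    low, high = PLATFORM_SPEND_L[use_case]
    return low + IN_HOUSE_L[0], high + IN_HOUSE_L[1]


def fmt(lakh):
    """Render lakh amounts the way the tables do (L below one crore, Cr above)."""
    if lakh >= LAKH_PER_CRORE:
        return f"₹{lakh / LAKH_PER_CRORE:.2f} Cr"
    return f"₹{lakh} L"


for use_case in PLATFORM_SPEND_L:
    low, high = all_in_range(use_case)
    print(f"{use_case}: {fmt(low)} to {fmt(high)}")
```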
What "good" performance looks like
Pure benchmark numbers, plausible ranges across the Indian deployments we have audited:
- Connect rate (call answered by a human): 32–48% on cold lists, 55–72% on warm.
- Intent resolution within the call: 58–74% for narrow flows (reminders, OTPs), 38–52% for open-ended (sales, support triage).
- Average handle time: 70–110 seconds for reminders, 140–220 seconds for collections.
- Hindi WER on Delhi/Mumbai/Bangalore: 6–11%. On Patna/Lucknow/Jodhpur: 14–24%.
- DLT pass rate at dial: should be >98%; below 95% means your scrubbing is broken.
- Compliance: 100% recording disclosure for IRDAI-governed sales calls, no exceptions.
If a vendor quotes 4% Hindi WER, ask for the test set composition. If they cannot produce it, the number is from a demo set.
When to build, when to buy, when to do both
This is the framing that matters most. The honest answer is rarely binary.
When pure build is correct
Three conditions usually converge:
- You have proprietary tone or persona constraints no platform will honour — typically luxury brands, regulated financial sales scripts, or vernacular voice signatures that are part of your brand IP.
- You have regulatory edge cases that mainstream platforms do not handle: IRDAI sales recording with sector-specific disclosure phrasing, hospital chains under NMC (formerly MCI) advertising rules, or a state-level licensing condition.
- You have deep ERP/CRM coupling that the platform's integration model cannot express — usually because your system of record is a 20-year-old core banking system or an in-house claims engine with non-standard auth.
If two of three apply, build the layers that touch the constraint and buy the rest. If all three apply, you may genuinely be a pure-build case. We have seen perhaps two such teams in the last eighteen months, both in core banking.
When pure buy is correct
You are below the 500-headcount threshold, the use case is narrow (reminders, OTPs, lead qualification), the volume is under 200k calls a month, and your engineering bandwidth is committed elsewhere. Pure buy gets you to production in six to ten weeks. The platform handles DLT, DPDP, STT drift, and the on-call rotation. You pay a premium for not owning the stack. The premium is worth it.
The hybrid path almost everyone should take
For the 500+ headcount Indian enterprise running multi-flow outbound voice — which is the persona this post is written for — the right shape is:
- License the platform for telephony, DLT, STT/TTS, dialog orchestration, consent ledger, supervisor UX. This is roughly 70–80% of the stack.
- Own the prompts, the conversation policy, and the per-campaign tuning. These are your IP and they should not live in the vendor's repo.
- Own the data layer — call outcomes, transcripts, recordings, consent artefacts — in your own warehouse. Most platforms will stream events out. Insist on this in the MSA.
- Build the 2–3 flows that touch your proprietary edge — the IRDAI-compliant renewal script, the core-banking webhook, the regional-language phonebook for your brand names — as plugins or pre/post-processors on top of the platform.
This shape gives you platform leverage on the 70% of work that does not differentiate you, and ownership of the 30% that does. It also makes you re-platformable — if the vendor fails in year three, you have your prompts, your data, and your integration code. You re-platform in eight weeks, not eight months.
A four-column comparison table for your deck
| Dimension | Pure build | Pure buy | Hybrid (recommended) |
|---|---|---|---|
| Time to first live flow | 6–10 months | 6–10 weeks | 8–12 weeks |
| Year-1 cost (mid volume) | ₹3.5–5.5 Cr | ₹70 L – ₹1.4 Cr | ₹1.0–1.8 Cr |
| Regulatory burden owner | You | Vendor | Shared, vendor leads |
| Re-platforming cost (Year 3) | Re-write | High lock-in | Low — your prompts + data |
| Hindi WER drift owner | You | Vendor | Vendor, you monitor |
| Failure mode | Schedule slip | Vendor lock-in | Coordination overhead |
Compliance and regulatory: what the buy path actually offloads
This is the section build-side decks under-cost. Let's go through what regulatory work you are no longer doing if you buy.
TRAI DLT. A serious platform maintains live integration with the Jio/Airtel/Vi DLT portals, monitors header rotation, validates content templates at dial-time, and re-routes when a header expires. This is roughly 15–25% of one engineer's time, all year, every year. On a build, this is your engineer.
DPDP 2023 consent. Purpose-bound consent requires a ledger that ties every call to a specific consent artefact. Platforms now ship this as a first-class object — consent scope, timestamp, source-of-truth pointer, retention window. You configure it; you do not build it. Build teams are still arguing about whether to store consent in the CRM or in the event store.
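For a sense of what such a ledger entry might hold, here is a minimal sketch using the four attributes named above (scope, timestamp, source-of-truth pointer, retention window). The `withdrawn_at` field and the coverage rule are our assumptions for illustration. The class and field names are hypothetical and do not describe any platform's actual schema or a legal standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ConsentRecord:
    subject_id: str
    purposes: frozenset          # consent scope
    granted_at: datetime         # timestamp
    source_ref: str              # source-of-truth pointer (hypothetical URI)
    retention: timedelta         # retention window for the artefact
    withdrawn_at: Optional[datetime] = None  # assumption: not named in the text


def call_is_covered(consent, purpose, call_time):
    """True if the purpose is in scope and the call falls after grant and before any withdrawal."""
    if purpose not in consent.purposes:
        return False
    if call_time < consent.granted_at:
        return False
    if consent.withdrawn_at is not None and call_time >= consent.withdrawn_at:
        return False
    return True


consent = ConsentRecord(
    subject_id="cust-123",
    purposes=frozenset({"emi_reminder"}),
    granted_at=datetime(2026, 1, 10, 9, 0),
    source_ref="crm://consents/abc",
    retention=timedelta(days=365),
)
call_time = datetime(2026, 3, 1, 11, 0)
print(call_is_covered(consent, "emi_reminder", call_time))  # True
print(call_is_covered(consent, "upsell_offer", call_time))  # False
```

The point of a single record type is that the per-call question ("was this call made under this scope?") becomes one lookup instead of a join across the CRM and the event store.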
IRDAI sales call disclosure. Insurance renewal and upsell calls require recorded, disclosed, consent-confirmed conversations with specific phrasing under IRDAI master circulars. Platforms operating in the insurance vertical have pre-built modules for this. A build team will read three master circulars and get the phrasing wrong on the first audit. This was the failure mode in two of the IRDAI deployments we saw audited in 2025.
RBI Fair Practices Code for collections. Bucketed collections (early, mid, late, recovery) have different permissible language and timing under RBI's FPC for NBFCs. The platform's policy engine encodes this. Your build team will encode it once and then under-maintain it.
Sectoral nuance: hospitals, gold loan, NBFC. Each vertical has its own quirks. Healthcare under NMC (formerly MCI) advertising restrictions cannot upsell on calls. Gold loan top-up under RBI's recent gold loan circulars has specific disclosure requirements. NBFC microfinance has interest rate disclosure rules. Multi-vertical platforms maintain these. Single-purpose build teams do not.
If your firm is in BFSI, NBFC, insurance, or healthcare, the regulatory load alone tilts the math toward buy. The vendor amortises the compliance engineering across hundreds of customers. You amortise it across your own three flows. The unit economics never recover.
The 12-week build-vs-buy decision playbook
This is the playbook you can paste into a Notion doc, share with your CTO, and run. We have run versions of it with eleven enterprises in the last fifteen months. It works.
Week 1–2: define the use case and the constraint set
- Write down the top three outbound flows by business impact. Not all flows. The top three.
- For each, write the upstream system of record, the downstream action it must trigger, and the regulatory regime it sits under.
- Define your hard constraints: latency budget, language coverage, retention period, consent model, on-call SLA.
- Define your soft constraints: tone, brand voice, persona.
Week 3–4: audit your existing stack
- Map every system the agent will touch: CRM, telephony, payment, LMS, ticketing.
- Identify which integrations are well-documented APIs and which are custom hacks.
- Pull six months of call logs from your current human team. Compute the actual Hindi/regional language distribution. This is your golden set.
- Estimate annual call volume, peak-day volume, and concurrency requirement.
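Computing the language distribution from the call logs can be as simple as the sketch below. The row fields (`language`, `region`) and the sample rows are hypothetical. Your export will have its own schema, and the language labels usually have to come from agent tagging or a language-ID pass.

```python
from collections import Counter

# Hypothetical call-log rows; a real export would come from telephony or the CRM.
call_logs = [
    {"call_id": "c1", "language": "hi", "region": "Bihar"},
    {"call_id": "c2", "language": "hi", "region": "Bihar"},
    {"call_id": "c3", "language": "hi", "region": "Delhi"},
    {"call_id": "c4", "language": "mr", "region": "Maharashtra"},
]


def language_region_share(rows):
    """Share of calls per (language, region) pair, largest first."""
    counts = Counter((r["language"], r["region"]) for r in rows)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.most_common()}


for (language, region), share in language_region_share(call_logs).items():
    print(f"{language} / {region}: {share:.0%}")
```

Sampling the golden set in proportion to these shares keeps a Delhi-heavy benchmark from hiding what happens in Bihar.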
Week 5–6: vendor RFP and build proposal in parallel
- Issue an RFP to three to five vendors. Demand: pricing transparency, SLA, DLT/DPDP posture, your-golden-set WER test, data-layer event stream, exit clause.
- In parallel, ask your engineering team to write a build proposal for the same scope. Insist on: per-layer cost, three-year TCO, headcount plan, drift maintenance plan.
- Have both proposals reviewed by someone outside the team who has shipped voice in production before. (We are happy to do this for free; so are several others.)
Week 7–8: bake-off on your data
- Force every vendor to run their stack on your golden set, not their demo set. Measure: WER, intent resolution, latency, DLT pass rate.
- Have your build team produce a working prototype on the same set with their proposed stack. Measure the same things.
- The bake-off result is usually decisive — and usually surprising. We have seen build teams that were confident come back with 19% Hindi WER against a platform's 9%. We have also seen the opposite.
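For the WER measurement in the bake-off, the standard definition is word-level edit distance divided by the number of reference words. The sketch below implements that definition directly. The romanised sample utterances are made up for illustration, and a real evaluation would also normalise casing, punctuation, and numerals before scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


print(wer("mera emi kal due hai", "mera emi kal hai"))          # 0.2
print(wer("kal payment kar dunga", "kal pement kar dunga"))     # 0.25
```

Running the same function over every vendor's transcripts and your own prototype's gives numbers that are comparable by construction.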
Week 9–10: the hybrid scoping
- Whichever way the bake-off goes, identify the two or three components that you should own regardless: the prompts, the data layer, the regulatory-edge plugin. Scope these.
- Write the MSA terms that protect ownership: prompt portability, event-stream guarantee, retention rights, exit assistance.
- Get sign-off from legal and CISO on these terms specifically.
Week 11: pilot scope and success criteria
- Pick one flow. One. The flow with the cleanest data and the most patient business owner.
- Define success criteria in three numbers: connect rate, intent resolution, cost per resolved call. Not five numbers. Three.
- Set the pilot duration: six to eight weeks of production calls, not a four-day "demo."
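The three success numbers can be computed from four raw counts, as in the sketch below. The denominators are our assumption: connect rate over attempted calls, intent resolution over connected calls, and cost over resolved calls. Agree these definitions with the vendor before the pilot starts, since a different denominator changes the number. The sample counts are illustrative.

```python
from dataclasses import dataclass


@dataclass
class PilotCounts:
    attempted: int
    connected: int         # answered by a human
    resolved: int          # intent resolved within the call
    total_cost_inr: float  # all-in pilot spend for the period


def pilot_metrics(c: PilotCounts) -> dict:
    """The three pilot success numbers under the denominators described above."""
    return {
        "connect_rate": c.connected / c.attempted,
        "intent_resolution": c.resolved / c.connected,
        "cost_per_resolved_call_inr": c.total_cost_inr / c.resolved,
    }


counts = PilotCounts(attempted=10_000, connected=4_000, resolved=2_600,
                     total_cost_inr=130_000)
print(pilot_metrics(counts))
# {'connect_rate': 0.4, 'intent_resolution': 0.65, 'cost_per_resolved_call_inr': 50.0}
```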
Week 12: decision and contracting
- Run the decision meeting with the CTO, CFO, the business owner, and CISO present.
- Present the four-column table. Defend the recommended path.
- Sign the contract — or kick off the build — with a 90-day pilot exit clause either way.
If you cannot defend the recommendation in twenty minutes to that room, you have not done the work above. Go back to week one.
For a sharper view of what to ask vendors in weeks 5–6, our enterprise RFP shortlist post breaks down the questions that separate serious vendors from re-sellers. For the TCO math underpinning weeks 9–10, the honest TCO comparison digs deeper into the per-line-item numbers.
What changes in the next 12 months
Three shifts will matter to this decision before mid-2027.
DPDP enforcement will get its first major case. When it does, the bar for purpose-bound consent ledgers will move from "documented" to "auditable in 48 hours." Platforms with consent-ledger primitives will be in a better position than build teams patching event stores. If you are in build mode now and have not designed for this, you are taking on contingent liability.
The STT market will likely undergo a second price drop and a consolidation. Two of the four major Indian-language STT providers will probably either be acquired or pivot away from real-time. Build teams with hard dependencies on a specific provider will need to migrate. Platforms with abstracted STT routing will absorb the migration. Plan for this in your MSA.
LLM-side, model deprecation cycles are tightening from twelve months to nine. Build teams without a regression suite across model versions will eat this. Platforms with regression suites will too, but they will eat it on your behalf. This is one of the largest hidden costs of pure build that build decks systematically ignore.
The shift you will not see coming is the one that matters most. Plan for re-platformability, not for the current best vendor.
Bottom line
For a 500+ headcount Indian enterprise in 2026, build vs buy is the wrong framing. The right framing is: which 70% do you license, which 30% do you own, and how do you contract so you stay re-platformable. Pure build is correct in roughly one in twenty cases, and even those cases are usually hybrids in disguise. Pure buy without prompt and data ownership locks you in for the wrong reasons. The hybrid path — platform for the commodity layers, in-house for the prompts and the data and the two or three edge flows — gets you to production in eight to twelve weeks, costs a third of pure build, and leaves you portable. Run the 12-week playbook. If you still disagree with the recommendation at the end of week 12, build. You will at least have done it with the right numbers.
If you want a second opinion on a build proposal already on your desk, the team at caller.digital has reviewed eleven such decks in the last fifteen months across BFSI, insurance, and healthcare. We will tell you when to walk away from a vendor, including from us. Pricing transparency lives on the pricing page. Integration depth is documented on the CRM integrations and telephony integrations pages.