Voice AI Call QA & Scoring in India 2026: Auditing 100% of Calls Instead of Sampling 2%

Anjali Deshpande heads QA for a Mumbai NBFC that does about 4.2 lakh collection calls a month across two in-house floors and one outsourced partner in Indore. On a Tuesday afternoon in April she is sitting in the compliance head's cabin reading an RBI inspection report that has put a quiet, careful sentence into her week: "field examination of borrower complaints disclosed instances of abusive language and third-party disclosure by recovery agents not surfaced by the regulated entity's internal call audit programme." Three borrower complaints, all from the same product line, all within a six-week window. Her team had cleared every one of those agents in the routine monthly review.
The arithmetic is brutal and not really her fault. Her eleven QA analysts listen to roughly 8,000 calls a month — about 1.9% of the volume. They pick those calls with a stratified random sample that is, by any reasonable standard, well-designed. None of the three complained calls fell in the sample. The RBI inspector did not care. The report's language is the language of the Fair Practices Code: the regulated entity is expected to monitor recovery conduct. Sampling 2% is not monitoring. Sampling 2% is hoping.
This post is about what the other 98% looks like once you can actually hear it.
The thesis
Voice AI call QA scoring is no longer a cost-cutting story. In 2026, for any Indian contact centre operating under RBI, IRDAI, SEBI or TRAI oversight, it is an audit-defensibility story. The cost of AI-scoring a five-minute Hindi-English call has fallen to roughly ₹1.5–4, against ₹18–32 for a senior human reviewer doing the same work. That economics shift turns 100% review from "nice to have" into the default regulators will quietly start to expect — and that internal audit committees are already asking about. This post lays out how the audit actually works end-to-end, what it catches that manual sampling misses, where it generates false positives, the compliance dimensions a QA rubric must cover, and the build-vs-buy choices a 500–5,000 seat operation has to make in the next twelve months.
Why this matters now
Three things have changed in the last eighteen months that turn 100% AI audit from a vendor pitch into a board-level expectation.
The first is regulatory tone. RBI's 2024 update to the Fair Practices Code on recovery agents — and the supervisory guidance that followed — made clear that the obligation to monitor is on the regulated entity, not on the sampling methodology. The standard inspectors apply is "did you have a reasonable mechanism to detect this conduct," and a 2% manual sample is increasingly hard to defend as reasonable when commercially available AI tools score every call. IRDAI's outbound-sales guidance for insurance and SEBI's supervision of AMC and broking call centres run on the same logic. A compliance head in 2026 who says "we sample randomly" is one inspector away from a finding.
The second is unit economics. ASR pricing for Indian-language voice has fallen by roughly 70% in two years. An evaluator LLM call against a structured rubric, about 1,500–2,500 tokens in and 400–600 tokens out, costs paise, not rupees. Scoring a five-minute call on a 30-criterion rubric end-to-end now runs ₹1.5–4 depending on language mix and vendor. A senior human QA analyst in Mumbai or Bangalore, fully loaded, costs ₹65,000–95,000 a month and audits 700–900 calls. The per-call delta is roughly 8–10x.
The third is what AI scoring actually catches. Manual QA, even when well-run, is dominated by adherence checks — did the agent say the disclosure, did they capture the promise-to-pay, did they follow the rebuttal tree. The breaches that hurt — third-party disclosure, threatening tone in the last 30 seconds of a long call, calls placed before 8am — are rare-event problems. Rare events do not survive 2% sampling. They survive 100% review. We have seen collection floors where the post-audit breach-detection rate moved roughly 3.2x — not because agents got worse, but because the surveillance lens widened.
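The rare-event point can be made concrete with a few lines of arithmetic. The sketch below assumes a simple random sample (a simplification of the stratified design described earlier) and computes the exact chance that a sample contains none of a given set of breach calls. The function name and inputs are illustrative, not taken from any audit tool.

```python
def prob_sample_misses_all(total_calls: int, sampled_calls: int, breach_calls: int) -> float:
    """Chance that a simple random sample contains none of the breach calls.

    This is the hypergeometric probability C(N-k, n) / C(N, n), written as a
    running product so it stays cheap for large call volumes.
    """
    if not 0 <= sampled_calls <= total_calls:
        raise ValueError("sampled_calls must be between 0 and total_calls")
    if not 0 <= breach_calls <= total_calls:
        raise ValueError("breach_calls must be between 0 and total_calls")
    prob = 1.0
    for i in range(breach_calls):
        unsampled = total_calls - sampled_calls - i
        if unsampled <= 0:
            return 0.0  # the sample is large enough that a breach must be included
        prob *= unsampled / (total_calls - i)
    return prob


# Figures from the opening example: 4.2 lakh calls, 8,000 sampled, 3 complained calls.
miss = prob_sample_misses_all(420_000, 8_000, 3)
print(f"Chance the sample misses all three calls: {miss:.1%}")
```

Under these assumptions the sample misses all three complained calls about 94% of the time, which is why a well-designed 2% sample can still clear every agent involved.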
How the audit actually works
A working 100% audit pipeline is four stages, each with its own failure surface. It is worth understanding what each stage actually does before deciding to buy it, build it, or hybrid it.
Stage 1: Capture and segmentation. Calls land in the recording store, usually the dialer's S3 bucket or the on-prem recorder. The pipeline pulls each completed call within minutes of disposition, attaches the metadata that matters (agent ID, campaign, customer segment, call duration, disposition code, language tag if the dialer captured one) and pushes the audio into a queue. Most failures at this stage are not glamorous: missing recordings on dropped calls, agent IDs not threaded through from CRM, calls under 15 seconds that should be excluded but aren't.
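A minimal intake check for this stage might look like the sketch below. The `CallRecord` fields and the decision labels are hypothetical; only the 15-second cutoff and the three failure cases come from the text.

```python
from dataclasses import dataclass
from typing import Optional

MIN_DURATION_SECONDS = 15  # calls shorter than this are excluded from audit


@dataclass
class CallRecord:
    call_id: str
    agent_id: Optional[str]       # threaded through from the CRM; may be missing
    recording_uri: Optional[str]  # None when the recorder dropped the call
    duration_seconds: float
    disposition: str


def intake_decision(call: CallRecord) -> str:
    """Decide whether a completed call can enter the audit queue."""
    if not call.recording_uri:
        return "reject:missing_recording"
    if not call.agent_id:
        return "reject:missing_agent_id"
    if call.duration_seconds < MIN_DURATION_SECONDS:
        return "skip:too_short"
    return "enqueue"
```

Logging the reject and skip reasons separately is useful in practice: a rising count of missing agent IDs is an integration problem, not a QA problem.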
Stage 2: Transcription. ASR converts audio to a time-stamped, speaker-diarized transcript. For Indian contact centres this is the single biggest accuracy lever. A Hindi-belt collections floor calling Bihar, eastern UP and Jharkhand will see word error rates 1.6–2.4x what a vendor's Delhi-Hindi demo showed. Code-switching between Hindi and English is normal in NBFC calls — borrowers say "main payment kar dunga next Tuesday" and a model trained on pure-Hindi or pure-English corpora chokes on the boundary. Speaker diarization — knowing which words came from the agent and which from the customer — matters disproportionately, because almost every compliance rule applies to the agent side only.
Stage 3: Rubric evaluation. The transcript, with metadata, is passed to an evaluator — typically a constrained LLM call against a structured rubric. The rubric is the QA team's old scorecard, translated into yes/no/score-1-5 criteria with explicit evidence requirements. The evaluator returns a JSON object: each criterion, the score, the citing utterance, and a confidence value. This is the layer where most of the QA team's judgment lives — it is also the layer where most of the hallucination risk lives. A rubric that asks "was the agent rude" without grounding "rude" in specific phrases will get you a confident, polite hallucination on roughly 6–10% of calls.
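Whatever model sits behind the evaluator, its output should be validated before anything is stored. The sketch below assumes a hypothetical response shape, a top-level `results` list whose items carry `criterion_id`, `score`, `evidence` and `confidence`, and rejects responses that skip a criterion or return a malformed confidence value. It is not any vendor's schema.

```python
import json


def parse_evaluator_output(raw: str, rubric_ids: set) -> list:
    """Parse one evaluator response and raise ValueError if it is malformed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"evaluator did not return valid JSON: {exc}") from exc
    if not isinstance(payload, dict) or not isinstance(payload.get("results"), list):
        raise ValueError("expected an object with a 'results' list")

    seen = set()
    for item in payload["results"]:
        if not isinstance(item, dict):
            raise ValueError("each result must be an object")
        cid = item.get("criterion_id")
        if cid not in rubric_ids:
            raise ValueError(f"unknown criterion: {cid!r}")
        if cid in seen:
            raise ValueError(f"duplicate criterion: {cid!r}")
        seen.add(cid)
        conf = item.get("confidence")
        if isinstance(conf, bool) or not isinstance(conf, (int, float)):
            raise ValueError(f"{cid}: confidence must be a number")
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"{cid}: confidence must be between 0 and 1")
        if not isinstance(item.get("evidence"), str):
            raise ValueError(f"{cid}: evidence must be a string")

    missing = rubric_ids - seen
    if missing:
        raise ValueError(f"criteria not scored: {sorted(missing)}")
    return payload["results"]
```

Rejecting incomplete responses outright, rather than filling in defaults, keeps a silently skipped criterion from being recorded as a pass.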
Stage 4: Routing and human-in-loop. Calls are bucketed. High-confidence clean calls are auto-passed and the score is logged. High-confidence breach calls go straight to a remediation queue with a clip attached. The middle band — low-confidence on any criterion, or any flagged hard-compliance item — gets routed to a human reviewer. A well-tuned pipeline sends 8–14% of calls to humans, down from the 100% of sampled calls a manual team reviews today. Your eleven QA analysts go from listening for breaches to adjudicating the model's uncertain calls — a higher-value job that also keeps the model honest.
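The bucketing rule described above is small enough to write down directly. In this sketch the 0.8 confidence threshold, the field names and the queue labels are assumptions for illustration; the three-way split follows the text.

```python
def route_call(results: list, hard_compliance_ids: set, min_confidence: float = 0.8) -> str:
    """Bucket a scored call.

    results: one dict per rubric criterion with keys
             'criterion_id', 'breach' (bool) and 'confidence' (0 to 1).
    """
    any_low_confidence = any(r["confidence"] < min_confidence for r in results)
    breaches = [r for r in results if r["breach"]]
    hard_item_flagged = any(r["criterion_id"] in hard_compliance_ids for r in breaches)

    # Middle band: any uncertain criterion, or any flagged hard-compliance item.
    if any_low_confidence or hard_item_flagged:
        return "human_review"
    # High-confidence breach on a non-hard item goes straight to remediation.
    if breaches:
        return "remediation"
    # High-confidence clean call.
    return "auto_pass"
```

The share of calls landing in `human_review` is the number to watch while tuning `min_confidence`; the text's target band for it is 8–14%.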
The QA dimensions a working rubric covers
A serious rubric for an Indian collections or sales floor has ten dimensions, not the three or four that vendor demos focus on.
| Dimension | What is being checked | Why it bites in India |
|---|---|---|
| Script adherence | Agent followed approved opening, rebuttals, closing | Drift is constant; manual QA catches the obvious cases, misses subtle skipped clauses |
| Mandatory disclosures | Recording consent, agent name, company name, purpose of call | RBI Recovery Agent norms require all four; manual sampling misses skipped disclosures on roughly 1 in 9 calls |
| Prohibited words and phrases | Threats, abuse, caste/religious slurs, third-party disclosure | Single phrase can trigger a complaint; AI catches what humans miss in long calls |
| Tone and sentiment | Aggression markers, sustained raised voice, sarcastic register | Indian-directness vs aggression is the hardest line — see failure modes below |
| Customer interruptions | Agent talked over the customer, did not let them complete | Common in collections; correlates with later complaints |
| Dead air and hold violations | Silence > 30s without notice, hold > 90s without check-in | Hold abuse is a quiet but routine complaint vector |
| Hot-transfer correctness | Transfer to right queue, warm hand-off, context shared | Failed transfers drive repeat calls and CSAT loss |
| Promise-to-pay capture | PTP date, amount, mode confirmed and logged | The single most common revenue-relevant miss in collections |
| CSAT proxy | Customer's closing sentiment, willingness to continue | Useful as a leading indicator before CSAT surveys arrive |
| Regulatory window adherence | Call time within permitted hours, frequency caps respected | RBI recovery: no calls before 8am or after 7pm; trivially auto-checkable |
A QA platform that ships with five generic criteria is not a QA platform. It is a sentiment dashboard.
What goes wrong
The failure modes are predictable enough that they are now the first thing to ask any vendor about.
ASR errors on Hindi and regional languages create false flags. Whisper-class models trained on predominantly English global corpora are confident on Delhi Hindi and break on Bhojpuri-influenced or Marwari-influenced Hindi. When the transcript says "[unintelligible] paisa nahi denge" instead of "abhi paisa nahi denge," the evaluator may read a refusal as a threat. The fix is twofold: a fine-tuned Indian-language ASR (Whisper-large-v3 with India-specific fine-tuning, or a stack derived from AI4Bharat's open models), and a confidence threshold on every flag tied to ASR confidence on the cited span. A breach citing a low-confidence transcript span should be routed to a human reviewer, not auto-flagged.
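Tying a flag to ASR confidence on its cited span can be sketched as follows. The code assumes the ASR returns a per-word confidence, which many engines expose in some form; the 0.80 threshold and the function names are illustrative.

```python
def min_span_confidence(words: list, start: int, end: int) -> float:
    """Lowest ASR confidence inside the cited span.

    words: list of (token, asr_confidence) pairs for the whole transcript.
    start, end: word indexes of the cited span, end exclusive.
    """
    span = words[start:end]
    if not span:
        raise ValueError("cited span is empty")
    return min(conf for _, conf in span)


def disposition_for_flag(words: list, start: int, end: int, threshold: float = 0.80) -> str:
    """Auto-flag only when every word in the cited evidence was transcribed confidently."""
    if min_span_confidence(words, start, end) >= threshold:
        return "auto_flag"
    return "human_review"


transcript = [("[unintelligible]", 0.31), ("paisa", 0.96), ("nahi", 0.97), ("denge", 0.94)]
print(disposition_for_flag(transcript, 0, 4))
```

Using the minimum rather than the mean is a deliberate choice here: one badly transcribed word can change the meaning of the whole span.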
US-English sentiment models flag Indian directness as rudeness. A borrower saying "tum log roz call karte ho, paisa nahi hai abhi" is direct, not abusive. An agent saying "madam, aapko samajhna padega" is firm, not threatening. Off-the-shelf sentiment APIs trained on US customer-service corpora misclassify the firm register of Indian collections calls as aggression on 12–20% of calls. The fix is a tone classifier fine-tuned on labelled Indian collection audio, or — pragmatically — moving "tone" from auto-flag to human-review-required until a domain-specific model is in place.
Over-flagging hold when the customer asked for it. A working rubric distinguishes "agent put customer on unannounced hold for 110 seconds" from "customer said 'one minute' and the agent waited." Without that distinction, you get a flood of false breaches, the floor loses faith in the audit, and the system stops being used. The rubric should require the evaluator to cite the trigger for the hold before scoring the duration.
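One way to enforce the trigger-before-duration rule is a deterministic pre-check that runs before the evaluator scores the hold. The sketch below is a crude keyword version: the phrase list, the 10-second lookback and the labels are all assumptions, and a production system would more likely have the evaluator cite the trigger utterance. The 90-second limit comes from the rubric table.

```python
# Hypothetical phrases a customer might use to ask the agent to wait.
CUSTOMER_PAUSE_PHRASES = ("one minute", "ek minute", "ek second", "hold on")


def classify_hold(utterances: list, hold_start: float, hold_end: float,
                  lookback_seconds: float = 10.0,
                  max_unannounced_seconds: float = 90.0) -> str:
    """Classify a silent stretch using who triggered it, then how long it lasted.

    utterances: dicts with 'speaker', 'text' and 'end' (seconds from call start).
    """
    if hold_end < hold_start:
        raise ValueError("hold_end must not precede hold_start")
    for u in utterances:
        ended_just_before = hold_start - lookback_seconds <= u["end"] <= hold_start
        if ended_just_before and u["speaker"] == "customer":
            text = u["text"].lower()
            if any(phrase in text for phrase in CUSTOMER_PAUSE_PHRASES):
                return "customer_requested"
    if hold_end - hold_start > max_unannounced_seconds:
        return "breach_candidate"
    return "within_limit"
```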
Evaluator LLM hallucinating breaches. The most damaging failure. An evaluator asked to score "did the agent abuse the customer" with no rubric grounding will, on a small percentage of calls, return "Yes — agent said you should pay now" with high confidence. This is a hallucinated paraphrase, not a citation. The fix is to require the evaluator to return the exact transcript span as evidence for every breach, and to programmatically verify that the span appears verbatim in the transcript before persisting the flag. Spans that fail verification are dropped. This single check removes the majority of false positives.
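The span check itself is a few lines. The sketch below normalises whitespace and case before matching, which is an assumption about how strict "verbatim" needs to be; the `evidence` field name is hypothetical.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences do not fail the match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_flags(transcript: str, flags: list) -> tuple:
    """Split evaluator flags into (kept, dropped) by checking each cited span.

    A flag is kept only if its 'evidence' string appears in the transcript.
    """
    haystack = _normalize(transcript)
    kept, dropped = [], []
    for flag in flags:
        span = _normalize(flag.get("evidence", ""))
        if span and span in haystack:
            kept.append(flag)
        else:
            dropped.append(flag)
    return kept, dropped


transcript = "Agent: madam, aapko samajhna padega. Customer: paisa nahi hai abhi."
flags = [
    {"criterion_id": "abuse", "evidence": "you should pay now"},      # paraphrase, not in transcript
    {"criterion_id": "tone", "evidence": "aapko samajhna padega"},    # real citation
]
kept, dropped = verify_flags(transcript, flags)
print([f["criterion_id"] for f in kept], [f["criterion_id"] for f in dropped])
```

Dropped flags are worth logging rather than discarding silently, since a rising drop rate is an early sign that a prompt or model change has degraded citation quality.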
Cross-speaker bleed in mono recordings. When the agent and customer share a single channel (mono recording) and diarization fails, words attributed to the wrong speaker create breaches that did not happen. Stereo, per-leg recording at the telephony layer fixes this at source. If the telephony partner does not support per-leg recording, the entire QA stack rests on diarization quality, which on Indian-language calls is meaningfully worse than on English.
Rubric drift across product lines. A collections rubric is not a customer-support rubric is not an outbound-sales rubric. Teams that ship one rubric across all campaigns end up with high false-positive rates on the campaigns it wasn't designed for. The fix is one rubric per campaign type, versioned, with explicit change logs.
The numbers that matter
What "good" looks like in 2026, against measured baselines on Indian floors.
Cost per call audited. Manual senior-analyst review: ₹18–32 per call fully loaded (salary + supervision + tooling + lost calls during review). AI-only review with no human-in-loop: ₹1.5–4 per call (ASR + LLM evaluator + storage). Hybrid with 10–14% human escalation: ₹3.5–6 per call. The hybrid number is the one most defensible operations are converging to.
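A rough way to sanity-check the hybrid figure is a simple additive model: every call pays the AI cost, and the escalated share also pays for a human review. This is an assumed model for illustration, not the costing method behind the numbers above, and it treats an escalated review as costing the same as a full manual review.

```python
def blended_cost_per_call(ai_cost: float, human_cost: float, escalation_rate: float) -> float:
    """Per-call cost when every call is AI-scored and a fraction is escalated to a human."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return ai_cost + escalation_rate * human_cost


# Midpoints of the ranges quoted above: Rs 2.75 AI, Rs 25 human, 12% escalation.
print(round(blended_cost_per_call(2.75, 25.0, 0.12), 2))
```

With midpoint inputs the model gives about ₹5.75 per call, inside the hybrid band quoted above. The escalation rate is the most sensitive input, which is why false-positive tuning shows up directly in the cost line.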
Agreement with senior human QA. The metric to ask vendors for is Cohen's kappa between AI score and a senior human reviewer on a blind-labelled set. A kappa above 0.75 is the threshold at which audit teams stop second-guessing the AI on adherence and disclosure items. Below 0.65 you are essentially running a triage queue, not an audit. Tone and sentiment dimensions typically sit 0.10–0.15 lower than adherence dimensions — plan for that and weight your rubric accordingly.
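For teams that want to compute the agreement figure themselves rather than take a vendor's number, Cohen's kappa is straightforward to implement. The sketch below works on any two equal-length label lists; returning 1.0 when both raters use a single identical label throughout is a convention chosen here to avoid dividing by zero.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters gave the same single label to every call
    return (observed - expected) / (1.0 - expected)


ai_scores = ["pass", "pass", "fail", "fail"]
human_scores = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(ai_scores, human_scores))  # 0.5
```

Compute kappa per rubric dimension rather than once for the whole scorecard, since the adherence and tone dimensions are expected to differ.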
Breach detection uplift. On the four NBFC and two BPO floors we have measured, moving from 2% manual to 100% AI audit lifts identified breaches by roughly 3.2x. Most of that lift is in low-frequency, high-severity categories — third-party disclosure, threatening tone, calling outside the permitted window — exactly the categories regulators care about and sampling misses.
False-positive rate on first-pass. Out-of-the-box on Indian collection calls, generic models flag 18–28% of calls as containing at least one breach. After two weeks of rubric tuning and confidence calibration, that drops to 6–9%. A floor that does not budget for tuning will swamp its human reviewers with false alerts and abandon the system inside a quarter.
Coverage by language. Hindi-English code-switched calls: 88–94% transcript accuracy on tuned stacks. Marathi, Bengali, Tamil, Telugu: 82–90%. Bhojpuri, Awadhi, Marwari-influenced Hindi: 70–82% — workable for adherence checks, weak for nuanced tone. Plan rubric weights accordingly: do not auto-flag tone on low-coverage circles.
Time from call disposition to score. Best stacks: under 4 minutes. Most stacks: 15–40 minutes. Batch overnight: 6–12 hours. Real-time scoring (during the call) is technically possible but rarely worth the latency cost for QA — useful for live-agent coaching, not audit defensibility.
Build vs buy
Three reasonable paths exist for a 500–5,000 seat Indian operation in 2026, and the right choice depends less on engineering capacity than on rubric ownership.
Open-source baseline. Whisper-large-v3 (fine-tuned on Indian audio if you have labelled data, or AI4Bharat IndicWhisper) for ASR, GPT-4o-mini or Claude Haiku for the evaluator, a Postgres database for scores, a thin Next.js dashboard. A two-engineer team can stand this up in 8–10 weeks. Marginal cost per call: ₹1.5–3. The hidden cost is rubric authoring and maintenance: that is a senior QA leader's job, not an engineering job, and it is the long pole. Pick this path if your QA leader can own a versioned rubric and you have someone in-house who can fine-tune ASR on your labelled audio.
Commercial QA platforms. Vendors in this space — domestic ones built on the conversation-intelligence layer and international ones adapted for India — charge typically ₹4–12 per call audited on a managed basis. You get a tuned ASR stack, a rubric editor, dashboards, integrations to common dialers, and a support team. Pick this path if you cannot dedicate engineering to a two-quarter build, or if you need audit-trail features (immutable logs, regulator export) faster than you can build them. Insist on running their stack against your last quarter's audio before signing — most demos use the vendor's own audio.
Hybrid: vendor ASR + in-house rubric. A growing pattern. Use a managed ASR layer (commercial or open-source-hosted) and own the evaluator-LLM call and rubric layer in-house. This separates the part that needs scale and continuous tuning (ASR) from the part that needs QA-team ownership (the rubric). On a 4.2-lakh-call-month operation, this comes in at roughly ₹2.5–4 per call and gives you full control over what gets flagged and why.
The build-vs-buy comparison across the three paths:
| Dimension | Open-source baseline | Commercial platform | Hybrid (vendor ASR + in-house rubric) |
|---|---|---|---|
| Time to first audited batch | 8–10 weeks | 3–5 weeks | 5–7 weeks |
| Marginal cost per call | ₹1.5–3 | ₹4–12 | ₹2.5–4 |
| Rubric flexibility | Full | Limited to vendor's schema | Full |
| Regulator audit-trail export | Build yourself | Out of box | Build yourself |
| ASR tuning on your audio | Yes, if you have labelled data | Vendor-side, opaque | Vendor-side, partial visibility |
| Best for | Tech-led floors with QA leadership | Compliance-led floors needing speed | Operations-led floors with QA ownership |
Compliance dimensions the rubric must encode
This is the part where audit defensibility actually lives. A rubric that does not explicitly encode regulatory rules will be useless in an inspection.
RBI Fair Practices Code for recovery agents is the heaviest single load for NBFCs and banks. The rubric must auto-check call timing (no calls before 8am or after 7pm IST — and circulars have tightened the second-call window for the same borrower in the same day), absence of threatening or abusive language, no disclosure of debt to third parties (family members, neighbours, employers), and presence of mandatory identification (agent name, agency name, on whose behalf). Each of these should be a discrete rubric item with explicit evidence requirements. See the RBI Fair Practices Code playbook for the per-clause rubric mapping.
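The call-timing item is the easiest of these to automate. The sketch below uses the 8am and 7pm limits as stated in this article; they should be confirmed against the current circular before being encoded. IST is modelled as a fixed UTC+5:30 offset, which is safe because India does not observe daylight saving time.

```python
from datetime import datetime, time, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))


def within_permitted_window(call_start: datetime,
                            earliest: time = time(8, 0),
                            latest: time = time(19, 0)) -> bool:
    """True if the call started within the permitted hours, evaluated in IST.

    call_start must be timezone-aware so that dialer timestamps stored in UTC
    are converted rather than misread as local time.
    """
    if call_start.tzinfo is None:
        raise ValueError("call_start must be timezone-aware")
    local = call_start.astimezone(IST).time()
    return earliest <= local <= latest


# 02:29 UTC is 07:59 IST, one minute before the window opens.
print(within_permitted_window(datetime(2026, 4, 7, 2, 29, tzinfo=timezone.utc)))  # False
```

Rejecting naive timestamps is the important design choice: most timing false flags come from a dialer logging UTC while the check assumes local time.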
IRDAI conduct rules for insurance sales calls require disclosed recording consent at the start of the call, accurate product description, no mis-selling claims, and a verifiable need-analysis trail. The rubric for an insurance outbound floor will look different from a collections rubric — adherence weight is higher, tone weight is lower.
TRAI DLT consent capture matters most on outbound. The rubric should verify that the consent header was read where required, that the call was placed on a DLT-registered template, and that opt-out requests were captured and logged. The mechanics are covered in the TRAI DLT compliance guide.
SEBI norms for AMC tele-sales and broking call centres require accurate risk disclosure, no guaranteed-return language, and KYC verification — all auto-checkable items if the rubric is written for them. For BFSI floors running multiple regulated lines, expect to maintain at least four rubric variants — one per regulator.
The point is not that AI scoring solves compliance. The point is that a rubric written down, versioned, and applied to 100% of calls is itself a piece of audit-defensible evidence. The inspector wants to see your monitoring mechanism. "We score every call on a 30-item rubric and route flagged calls to human review with documented remediation" is a defensible answer. "We sample 2% randomly" is increasingly not.
Implementation playbook
A phased rollout that has worked on three of the four floors we have implemented this on.
Weeks 1–2: rubric authoring. The QA leader and compliance head co-author a versioned rubric per campaign type. Each item has: criterion, scoring scale, evidence requirement, severity weight, regulatory citation if applicable. Aim for 25–35 items per rubric. Resist the temptation to ship a 60-item rubric in v1; you will spend the rest of the year tuning items nobody reads.
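It helps to hold the rubric as data rather than as a document, so that versions can be diffed and validated. The sketch below encodes the five fields listed above; the class names, the two scale labels and the validation rules are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_SCALES = {"yes_no", "score_1_5"}


@dataclass(frozen=True)
class RubricItem:
    criterion_id: str
    criterion: str
    scale: str                                 # one of ALLOWED_SCALES
    evidence_requirement: str                  # what the evaluator must cite
    severity_weight: float
    regulatory_citation: Optional[str] = None  # filled only where a rule applies


@dataclass(frozen=True)
class Rubric:
    campaign_type: str
    version: str
    items: tuple  # tuple of RubricItem


def validate_rubric(rubric: Rubric) -> list:
    """Return a list of problems; an empty list means the rubric is usable."""
    problems = []
    ids = [item.criterion_id for item in rubric.items]
    if len(ids) != len(set(ids)):
        problems.append("duplicate criterion ids")
    for item in rubric.items:
        if item.scale not in ALLOWED_SCALES:
            problems.append(f"{item.criterion_id}: unknown scale {item.scale!r}")
        if not item.evidence_requirement.strip():
            problems.append(f"{item.criterion_id}: missing evidence requirement")
        if item.severity_weight <= 0:
            problems.append(f"{item.criterion_id}: severity weight must be positive")
    return problems
```

Running `validate_rubric` in whatever review step gates a rubric change keeps an item without an evidence requirement from reaching the evaluator.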
Weeks 2–4: ASR baseline and labelled set. Pull 500–800 random calls from the last quarter, transcribe them with the chosen ASR, have two senior QA analysts hand-correct 200 of them. This gives you a labelled set for measuring ASR accuracy by campaign, language and circle. It also gives you the gold-label set for measuring evaluator kappa later.
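Measuring ASR accuracy on the hand-corrected set usually means word error rate. A minimal implementation, assuming whitespace tokenisation and no text normalisation beyond what the caller applies, is sketched below.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length.

    reference: the hand-corrected transcript.
    hypothesis: the raw ASR output for the same call.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


ref = "main payment kar dunga next tuesday"
hyp = "main payment kar dunga text tuesday"
print(round(word_error_rate(ref, hyp), 3))  # 0.167
```

Report the figure per campaign, language and circle rather than as a single floor-wide average, since the average hides the low-coverage dialects.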
Weeks 3–6: evaluator wiring. Implement the rubric as a structured prompt, force JSON output with citation spans, programmatically verify spans against the transcript, and route low-confidence calls to a human queue. Test on the labelled set. Iterate the prompt until kappa on adherence and disclosure items is above 0.75. Tone and sentiment will lag — accept it and weight them lower in v1.
Weeks 5–8: shadow run. Run the AI audit alongside the existing manual sample for four weeks. Do not act on AI findings yet. Compare what the AI flagged that the human sample missed, and what the human sample flagged that the AI missed. Most of the early calibration happens here.
Weeks 8–12: cutover with human-in-loop. Replace random sampling with 100% AI audit plus human review of flagged calls. Re-deploy the eleven manual analysts to adjudicating low-confidence calls and remediation. Track flagged-but-cleared rates — if humans clear more than 30% of AI flags, the rubric needs tightening.
Weeks 12–24: tuning and rubric versioning. Monthly rubric reviews. Quarterly ASR re-tuning on the new labelled data the human queue has produced. Build a regulator-export feature: an inspector should be able to request "all calls flagged for third-party disclosure in March 2026" and get a packaged export in under an hour.
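The regulator export is, at its core, a parameterised query over the flag store. The sketch below uses Python's built-in `sqlite3` as a stand-in for whatever database holds the scores; the table layout and column names are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS flags (
    call_id        TEXT NOT NULL,
    agent_id       TEXT NOT NULL,
    called_at      TEXT NOT NULL,  -- ISO 8601 local time, e.g. 2026-03-14T10:05:00+05:30
    category       TEXT NOT NULL,
    evidence       TEXT NOT NULL,
    rubric_version TEXT NOT NULL
)
"""


def export_flags(conn: sqlite3.Connection, category: str, year_month: str) -> list:
    """Return every flag in one category for one month (year_month like '2026-03')."""
    cursor = conn.execute(
        "SELECT call_id, agent_id, called_at, evidence, rubric_version "
        "FROM flags "
        "WHERE category = ? AND substr(called_at, 1, 7) = ? "
        "ORDER BY called_at",
        (category, year_month),
    )
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

Storing the rubric version on every flag matters for the export: it lets an inspector see which version of the rule a call was scored against.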
For the dialer-side and ASR-vendor selection that feeds this pipeline, see the voice-AI vendor RFP scoring rubric. For the experimentation discipline that should sit alongside QA on outbound campaigns, see the A/B testing playbook.
What changes in the next 12 months
Three shifts are already visible and will reshape this market by mid-2027.
Real-time scoring will move from a vendor checkbox to a usable feature, but only for live-agent coaching, not audit. The latency and cost of running the evaluator on every utterance are falling, and floors that pair AI scoring with whisper-coaching (a live nudge to the agent based on a tone or disclosure breach) are seeing per-agent CSAT lifts of 3–5 points in pilots. The audit use case will stay post-call because audit needs a complete call.
Regulator-facing audit trails will become a standard product feature. The platforms that ship an immutable, tamper-evident log of every score and every human override will win against the platforms that ship better dashboards. RBI and SEBI inspectors are asking for this already on a case-by-case basis; by 2027 expect it to be the default ask.
Indian-language ASR will close the gap meaningfully on Bhojpuri, Awadhi and Marwari-Hindi as the labelled-data flywheel finally turns. The vendor that ships sub-15% WER on Patna collections audio gets the NBFC market by default. We are watching this number closely.
Bottom line
Sampling 2% of your calls is no longer monitoring; it is hope dressed up as methodology. The cost of auditing every call has fallen far enough that the conversation in 2026 is not whether to do it but how to do it without flooding your QA team with false flags. The answer is a versioned rubric written by your compliance and QA leaders together, a tuned Indian-language ASR with diarization, an evaluator that cites verbatim spans and a human-in-loop routing layer that turns your eleven analysts from sample listeners into uncertainty adjudicators. Get this right and the next inspection report reads differently — and so does the cost-per-audit line in next quarter's finance review.