How we tested
- 120 intents across billing, shipping, returns, account, and policy questions.
- Tool-calling flows: knowledge retrieval + CRM lookup + status update.
- Metrics: deflection rate, accuracy, handle time, refusal correctness, and cost per 1k tickets (scored as in the sketch below).
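To make the scoring concrete, here is a minimal aggregation sketch. The `TicketResult` shape and field names are illustrative, not our actual harness:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class TicketResult:
    intent: str            # e.g. "billing", "shipping", "returns"
    deflected: bool        # resolved without human handoff
    correct: bool          # answer matched the reference/policy
    refusal_correct: bool  # refused when policy required, answered otherwise
    handle_time_s: float   # wall-clock time to resolution
    cost_usd: float        # model + tool-call spend for this ticket

def summarize(results: list[TicketResult]) -> dict[str, float]:
    """Aggregate per-ticket scores into the metrics listed above."""
    n = len(results)
    return {
        "deflection_rate": sum(r.deflected for r in results) / n,
        "accuracy": sum(r.correct for r in results) / n,
        "refusal_correctness": sum(r.refusal_correct for r in results) / n,
        "avg_handle_time_s": mean(r.handle_time_s for r in results),
        "cost_per_1k_tickets": sum(r.cost_usd for r in results) / n * 1000,
    }
```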
Headline results
- Claude 4.6: Best policy adherence and the strongest deflection on complex tickets; slower than the minis but the safest choice.
- GPT-5.x: Strong tool reliability with a balanced quality/speed profile; great for workflows that take actions.
- Gemini: Competitive on straightforward Q&A; weaker on strict policy adherence and tool precision in our tests.
- GPT-4.1-mini: Lowest cost and latency; good for triage and simple intents.
Which model for which job
- Tier-1 triage + FAQs: GPT-4.1-mini or local small models for cost + speed.
- Policy-heavy with actions: Claude 4.6 or GPT-5.x with strict system prompts and approvals.
- Multi-turn with retrieval + tools: GPT-5.x excelled in tool-calling stability; Claude was strong on refusal safety (see the routing sketch below).
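As a rough illustration of that routing, a rule like the following captures the tiers. The model IDs and the `pick_model` helper are placeholders, not a production router:

```python
def pick_model(intent: str, needs_actions: bool, policy_sensitive: bool) -> str:
    """Illustrative routing rule matching the tiers above (model IDs are placeholders)."""
    if policy_sensitive and needs_actions:
        # Policy-heavy flows with side effects: premium model plus human approval.
        return "claude-4.6"   # or "gpt-5.x" with strict system prompts
    if needs_actions:
        # Multi-turn retrieval + tool calling.
        return "gpt-5.x"
    # Tier-1 triage and FAQs.
    return "gpt-4.1-mini"
```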
Cost snapshot (per 1k tickets)
- GPT-4.1-mini: lowest, ideal for triage.
- GPT-5.x: mid, justified by tool reliability.
- Claude 4.6: higher, justified when policy risk is high (see the cost formula below).
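The snapshot is relative on purpose; absolute numbers depend on your token volumes and the providers' current prices. One way to reproduce the math with your own figures, where every input below is hypothetical:

```python
def cost_per_1k_tickets(
    input_tokens_per_ticket: int,
    output_tokens_per_ticket: int,
    usd_per_1m_input: float,
    usd_per_1m_output: float,
) -> float:
    """Cost model behind the snapshot; plug in your own token counts and prices."""
    per_ticket = (
        input_tokens_per_ticket / 1_000_000 * usd_per_1m_input
        + output_tokens_per_ticket / 1_000_000 * usd_per_1m_output
    )
    return per_ticket * 1000

# Example with made-up numbers: 3k input + 500 output tokens per ticket.
# print(cost_per_1k_tickets(3000, 500, 2.00, 8.00))
```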
Match the model to the workflow—don’t pay for reasoning where you just need speed, and don’t skimp on safety for policy-heavy actions.
FAQ
Did you include retrieval?
Yes—each model used RAG over a 300-article help center plus CRM actions.
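Roughly, each ticket went through a retrieve-then-act loop like the sketch below. The `retriever`, `crm`, and `llm` objects are stand-ins for whatever clients you use, not a specific SDK:

```python
def answer_ticket(question: str, retriever, crm, llm) -> str:
    """Sketch of the RAG + CRM flow: retrieve help-center articles, look up the
    customer, then let the model draft a grounded reply. All objects are stand-ins."""
    articles = retriever.search(question, k=4)        # top help-center passages
    customer = crm.lookup(ticket_question=question)   # order/account context
    prompt = (
        "Answer using only the context below.\n\n"
        f"Help-center context:\n{articles}\n\n"
        f"Customer record:\n{customer}\n\n"
        f"Question: {question}"
    )
    return llm.complete(prompt)
```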
What about hallucinations?
Claude had the fewest policy violations; GPT-5.x improved with refusal rules and approvals.
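The refusal rules and approvals amounted to an approval gate in front of risky tools: the model can propose a restricted action but never executes it on its own. A minimal sketch of that pattern, with illustrative action names:

```python
RESTRICTED_ACTIONS = {"issue_refund", "change_address", "close_account"}

def execute_action(action: str, payload: dict, approved_by_human: bool) -> str:
    """Approval gate: restricted actions require explicit human sign-off."""
    if action in RESTRICTED_ACTIONS and not approved_by_human:
        return f"REFUSED: '{action}' requires human approval before execution."
    # ... dispatch to the real CRM/tool here ...
    return f"EXECUTED: {action} with {payload}"
```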
Can we mix models?
Yes—use minis for triage, premium models for complex actions, and route dynamically.
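One simple way to route dynamically is to let the cheap model own triage and escalate when its own confidence is low. The mapping and threshold below are examples, not recommendations:

```python
MODEL_ROUTES = {
    "triage": "gpt-4.1-mini",
    "faq": "gpt-4.1-mini",
    "actions": "gpt-5.x",
    "policy_actions": "claude-4.6",
}

def route(intent_tier: str, confidence: float, escalate_below: float = 0.7) -> str:
    """Escalate to the premium tier when the cheap model's confidence is low."""
    model = MODEL_ROUTES.get(intent_tier, "gpt-5.x")
    if model == "gpt-4.1-mini" and confidence < escalate_below:
        return MODEL_ROUTES["actions"]
    return model
```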
How often to retest?
Quarterly or on major model releases; keep regression suites and evals versioned.
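For the regression suite, a simple check is to compare each new eval run against a versioned baseline file and flag metrics that slipped. The metric keys and tolerance here are illustrative:

```python
import json

def check_regression(new_scores: dict, baseline_path: str, tolerance: float = 0.02) -> list[str]:
    """Compare a fresh eval run against the versioned baseline; flag metrics
    that dropped by more than the tolerance. Paths and keys are illustrative."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    regressions = []
    for metric, old_value in baseline.items():
        new_value = new_scores.get(metric, 0.0)
        if new_value < old_value - tolerance:
            regressions.append(f"{metric}: {old_value:.3f} -> {new_value:.3f}")
    return regressions
```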
