How we tested
- 120 intents across billing, shipping, returns, account, and policy questions.
- Tool-calling flows: knowledge retrieval + CRM lookup + status update.
- Metrics: deflection rate, accuracy, handle time, refusal correctness, and cost per 1k tickets (scored as in the sketch below).
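To make the scoring concrete, here is a minimal aggregation sketch. The `TicketResult` shape and field names are illustrative, not our actual harness:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class TicketResult:
    intent: str            # e.g. "billing", "shipping", "returns"
    deflected: bool        # resolved without human handoff
    correct: bool          # answer matched the reference/policy
    refusal_correct: bool  # refused when policy required, answered otherwise
    handle_time_s: float   # wall-clock time to resolution
    cost_usd: float        # model + tool-call spend for this ticket

def summarize(results: list[TicketResult]) -> dict[str, float]:
    """Aggregate per-ticket scores into the metrics listed above."""
    n = len(results)
    return {
        "deflection_rate": sum(r.deflected for r in results) / n,
        "accuracy": sum(r.correct for r in results) / n,
        "refusal_correctness": sum(r.refusal_correct for r in results) / n,
        "avg_handle_time_s": mean(r.handle_time_s for r in results),
        "cost_per_1k_tickets": sum(r.cost_usd for r in results) / n * 1000,
    }
```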
Headline results
- Claude 4.6: Best policy adherence and the strongest deflection on complex tickets; slower than the minis but the safest choice.
- GPT-5.x: Strong tool reliability with a balanced quality/speed profile; great for workflows that take actions.
- Gemini: Competitive on straightforward Q&A; weaker on strict policy adherence and tool precision in our tests.
- GPT-4.1-mini: Lowest cost and latency; good for triage and simple intents.
Which model for which job
- Tier-1 triage + FAQs: GPT-4.1-mini or local small models for cost + speed.
- Policy-heavy with actions: Claude 4.6 or GPT-5.x with strict system prompts and approvals.
- Multi-turn with retrieval + tools: GPT-5.x excelled in tool-calling stability; Claude was strong on refusal safety (see the routing sketch below).
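As a rough illustration of that routing, a rule like the following captures the tiers. The model IDs and the `pick_model` helper are placeholders, not a production router:

```python
def pick_model(intent: str, needs_actions: bool, policy_sensitive: bool) -> str:
    """Illustrative routing rule matching the tiers above (model IDs are placeholders)."""
    if policy_sensitive and needs_actions:
        # Policy-heavy flows with side effects: premium model plus human approval.
        return "claude-4.6"   # or "gpt-5.x" with strict system prompts
    if needs_actions:
        # Multi-turn retrieval + tool calling.
        return "gpt-5.x"
    # Tier-1 triage and FAQs.
    return "gpt-4.1-mini"
```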
Cost snapshot (per 1k tickets)
- GPT-4.1-mini: lowest, ideal for triage.
- GPT-5.x: mid, justified by tool reliability.
- Claude 4.6: higher, justified when policy risk is high (see the cost formula below).
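The snapshot is relative on purpose; absolute numbers depend on your token volumes and the providers' current prices. One way to reproduce the math with your own figures, where every input below is hypothetical:

```python
def cost_per_1k_tickets(
    input_tokens_per_ticket: int,
    output_tokens_per_ticket: int,
    usd_per_1m_input: float,
    usd_per_1m_output: float,
) -> float:
    """Cost model behind the snapshot; plug in your own token counts and prices."""
    per_ticket = (
        input_tokens_per_ticket / 1_000_000 * usd_per_1m_input
        + output_tokens_per_ticket / 1_000_000 * usd_per_1m_output
    )
    return per_ticket * 1000

# Example with made-up numbers: 3k input + 500 output tokens per ticket.
# print(cost_per_1k_tickets(3000, 500, 2.00, 8.00))
```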
Match the model to the workflow—don’t pay for reasoning where you just need speed, and don’t skimp on safety for policy-heavy actions.
FAQ
Did you include retrieval?
Yes—each model used RAG over a 300-article help center plus CRM actions.
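Roughly, each ticket went through a retrieve-then-act loop like the sketch below. The `retriever`, `crm`, and `llm` objects are stand-ins for whatever clients you use, not a specific SDK:

```python
def answer_ticket(question: str, retriever, crm, llm) -> str:
    """Sketch of the RAG + CRM flow: retrieve help-center articles, look up the
    customer, then let the model draft a grounded reply. All objects are stand-ins."""
    articles = retriever.search(question, k=4)        # top help-center passages
    customer = crm.lookup(ticket_question=question)   # order/account context
    prompt = (
        "Answer using only the context below.\n\n"
        f"Help-center context:\n{articles}\n\n"
        f"Customer record:\n{customer}\n\n"
        f"Question: {question}"
    )
    return llm.complete(prompt)
```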
What about hallucinations?
Claude had the fewest policy violations; GPT-5.x improved with refusal rules and approvals.
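The refusal rules and approvals amounted to an approval gate in front of risky tools: the model can propose a restricted action but never executes it on its own. A minimal sketch of that pattern, with illustrative action names:

```python
RESTRICTED_ACTIONS = {"issue_refund", "change_address", "close_account"}

def execute_action(action: str, payload: dict, approved_by_human: bool) -> str:
    """Approval gate: restricted actions require explicit human sign-off."""
    if action in RESTRICTED_ACTIONS and not approved_by_human:
        return f"REFUSED: '{action}' requires human approval before execution."
    # ... dispatch to the real CRM/tool here ...
    return f"EXECUTED: {action} with {payload}"
```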
Can we mix models?
Yes—use minis for triage, premium models for complex actions, and route dynamically.
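One simple way to route dynamically is to let the cheap model own triage and escalate when its own confidence is low. The mapping and threshold below are examples, not recommendations:

```python
MODEL_ROUTES = {
    "triage": "gpt-4.1-mini",
    "faq": "gpt-4.1-mini",
    "actions": "gpt-5.x",
    "policy_actions": "claude-4.6",
}

def route(intent_tier: str, confidence: float, escalate_below: float = 0.7) -> str:
    """Escalate to the premium tier when the cheap model's confidence is low."""
    model = MODEL_ROUTES.get(intent_tier, "gpt-5.x")
    if model == "gpt-4.1-mini" and confidence < escalate_below:
        return MODEL_ROUTES["actions"]
    return model
```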
How often to retest?
Quarterly or on major model releases; keep regression suites and evals versioned.
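For the regression suite, a simple check is to compare each new eval run against a versioned baseline file and flag metrics that slipped. The metric keys and tolerance here are illustrative:

```python
import json

def check_regression(new_scores: dict, baseline_path: str, tolerance: float = 0.02) -> list[str]:
    """Compare a fresh eval run against the versioned baseline; flag metrics
    that dropped by more than the tolerance. Paths and keys are illustrative."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    regressions = []
    for metric, old_value in baseline.items():
        new_value = new_scores.get(metric, 0.0)
        if new_value < old_value - tolerance:
            regressions.append(f"{metric}: {old_value:.3f} -> {new_value:.3f}")
    return regressions
```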
