Common mistakes
- No acceptance criteria or output schema.
- Prompts that hide requirements instead of stating them.
- No retrieval grounding for factual answers.
- Missing refusal rules or safety rails.
- Unversioned prompts; changes ship without tests.
- No evals or only manual spot checks.
- Ignoring latency/cost by overusing heavy models.
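The first mistake on this list, shipping with no output schema, is cheap to fix. Below is a minimal sketch of an output-schema check; the field names (`answer`, `citations`, `confidence`) and quality bars are illustrative assumptions, not a fixed standard.

```python
import json

# Illustrative schema: required fields and their expected types.
REQUIRED_FIELDS = {"answer": str, "citations": list, "confidence": float}

def validate_output(raw: str) -> dict:
    """Parse a model response and enforce the output schema and quality bars."""
    data = json.loads(raw)  # raises on malformed JSON
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    return data

good = '{"answer": "42", "citations": ["doc-1"], "confidence": 0.9}'
print(validate_output(good)["answer"])  # prints "42"
```

A check like this runs after every model call, so malformed responses fail loudly at the boundary instead of leaking into downstream code.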
Production patterns to use
- Schema + criteria: JSON outputs with explicit fields and quality bars.
- Retrieval: provide sources, require citations, and refuse when evidence is insufficient.
- Tooling: use function calls for structured actions instead of parsing free text.
- Safety: refusal rules, PII stripping, and approvals for risky actions.
- Evals: golden + synthetic tests before every release.
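The retrieval pattern above can be sketched as a grounding-plus-refusal wrapper. This is a deliberately naive keyword match standing in for a real retriever; the function name and return shape are assumptions for illustration.

```python
def answer_with_sources(query: str, sources: dict[str, str]) -> dict:
    """Cite only sources that plausibly support the query; refuse otherwise."""
    # Naive grounding check: a real system would use a retriever/reranker here.
    hits = [sid for sid, text in sources.items()
            if any(word in text.lower() for word in query.lower().split())]
    if not hits:
        # Refusal rule: no grounded sources means no answer.
        return {"refused": True, "reason": "insufficient sources"}
    return {"refused": False, "citations": hits}

docs = {"doc-1": "The capital of France is Paris."}
print(answer_with_sources("capital of France", docs))
```

The important part is the shape, not the matching logic: every answer carries citations, and the refusal branch is explicit rather than left to the model's discretion.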
How we ship safely
- Define success: schema, constraints, and edge cases.
- Add retrieval, citations, and refusal logic.
- Version prompts + track lineage in Git.
- Run evals and regression tests; gate releases.
- Monitor logs, alerts, and cost; iterate quickly.
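The "run evals and gate releases" step can be as small as a golden-case gate. In this sketch, `run_prompt` is a stand-in for a real model call (here it just uppercases input so the example is runnable); the case data and function names are hypothetical.

```python
# Golden cases: (input, expected output) pairs curated from real traffic.
GOLDEN_CASES = [("hello", "HELLO"), ("ship it", "SHIP IT")]

def run_prompt(prompt: str, case_input: str) -> str:
    # Stand-in for a model call; the "task" here is uppercasing.
    return case_input.upper()

def gate_release(prompt: str) -> bool:
    """Allow a release only if every golden case still passes."""
    return all(run_prompt(prompt, inp) == want for inp, want in GOLDEN_CASES)

print(gate_release("candidate prompt v2"))  # prints True
```

Wired into CI, a failing gate blocks the prompt change the same way a failing unit test blocks a code change.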
Prompts are product code. Treat them with specs, tests, and monitoring—or expect surprises in production.
FAQ
Do small prompts need tests?
Yes—lightweight evals catch regressions and cost spikes early.
Which models to target?
Use small/fast models for simple tasks; reserve heavy models for reasoning.
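One way to apply that rule is a routing function in front of the model call. The heuristics and model names below (`small-fast-model`, `heavy-model`, the keyword list) are placeholder assumptions; real routing criteria depend on your task mix.

```python
def pick_model(task: str) -> str:
    """Route simple tasks to a small model; reserve the heavy model for reasoning."""
    # Hypothetical heuristics: long inputs or reasoning keywords go heavy.
    HEAVY_HINTS = ("analyze", "prove", "plan", "debug")
    if len(task.split()) > 50 or any(h in task.lower() for h in HEAVY_HINTS):
        return "heavy-model"       # placeholder model name
    return "small-fast-model"      # placeholder model name

print(pick_model("Summarize this line"))          # prints small-fast-model
print(pick_model("Analyze the failure modes"))    # prints heavy-model
```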
How to manage versions?
Keep prompts in Git with changelog and linked eval results.
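Alongside Git history, a content hash makes a handy version id to stamp on logs and eval results, so every model output can be traced to the exact prompt text that produced it. A minimal sketch:

```python
import hashlib

def prompt_version(prompt_text: str) -> str:
    """Content hash used as the prompt's version id in changelogs and eval logs."""
    return hashlib.sha256(prompt_text.encode()).hexdigest()[:12]

print(prompt_version("You are a helpful assistant."))
```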
What about multi-turn?
Constrain memory, reset state intentionally, and test flows end-to-end.
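"Constrain memory" can be as simple as trimming the conversation before each call. This sketch assumes the common role/content message format; the turn limit is an arbitrary example value.

```python
def trim_history(history: list[dict], max_turns: int = 4) -> list[dict]:
    """Keep system messages plus only the most recent user/assistant messages."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    # Each turn is roughly one user + one assistant message.
    return system + rest[-max_turns * 2:]

history = [{"role": "system", "content": "Be concise."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(10)]
print(len(trim_history(history, max_turns=2)))  # prints 5
```

Deliberate trimming like this doubles as the "reset state intentionally" step: the cutoff is explicit and testable rather than an accident of context-window limits.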
