
Risk model and policy mapping (PII, actions, approvals)
Before building agents, map your organization's policies to technical controls. Ask:
What can go wrong?
Identify failure modes:
Policy → technical controls
Translate policies into enforceable rules:
Document this mapping. It becomes your acceptance criteria.
Prompt & tool contracts (least-privilege)
Design agents with narrow, explicit capabilities. Don't give a customer-service agent access to your entire API—scope tools to minimum necessary permissions.
Prompt contracts
Define clear interfaces for each agent:
Example:
Agent: Order Processor
Input: { orderId: string, action: "cancel" | "refund" }
Output: { success: boolean, message: string, auditLog: string }
Constraints:
- Order must belong to authenticated user
- Refunds ≤$10K auto-approve; >$10K route to human
- All actions logged to audit table
Tool least-privilege
Provide agents only the tools they need:
Customer service agent: Read orders, create support tickets (no delete)
Analyst agent: Read-only database access (no write)
Automation agent: Execute approved workflows (no arbitrary code)
Enforce through API keys with scoped permissions, not through prompts alone—prompts can be jailbroken.
Evals you need (accuracy, jailbreak, toxicity, cost)
Continuous evaluation prevents silent degradation. Build automated test suites:
Accuracy evals
Test agent outputs against golden datasets:
Does the agent extract correct information?
Are calculations accurate?
Do responses match expected format?
Run on every deployment. Regression should trigger rollback.
Safety evals
Test for adversarial behavior:
Jailbreak attempts: Can users trick the agent into ignoring rules?
Prompt injection: Can users manipulate the agent's instructions?
PII leakage: Does the agent expose sensitive data?
Maintain a "red team" dataset of known attacks. Add new attack vectors as discovered.
Quality evals
Measure subjective quality:
Relevance: Does the response address the query?
Coherence: Is the response logically consistent?
Toxicity: Does the agent generate harmful content?
Use LLM-as-judge or human raters. Set acceptance thresholds (e.g., ≥4.0/5.0 average).
Cost evals
Track operational costs:
Tokens per interaction
API calls per workflow
Average cost per user session
Set budget alerts. If costs spike, investigate prompt inefficiencies or abuse.
Minimal eval suite checklist
Before production:
[ ] 100+ accuracy test cases (happy path + edge cases)
[ ] 50+ safety test cases (jailbreaks, injections)
[ ] Cost per interaction measured and within budget
[ ] Quality spot-checked by human raters (n≥50)
[ ] All evals automated in CI/CD
Safety gates & rollback strategies
Even with evals, things break. Build layered defenses:
Pre-flight checks
Before executing actions, validate:
Input schema compliance
User authorization
Rate limits not exceeded
Known-bad patterns not present
Reject invalid requests before they reach the agent.
Runtime guardrails
While agents run, monitor:
Token usage (abort if exceeds threshold)
Confidence scores (route low-confidence to human)
Execution time (timeout if too slow)
Implement circuit breakers—if error rate crosses threshold, disable agent and route to fallback.
Post-execution validation
After agent completes, check:
Output schema compliance
Sensitive data redaction
Audit log completeness
Don't return outputs that fail validation. Log failures and alert ops.
Rollback strategy
When issues are detected:
1. Immediate: Disable agent, route traffic to fallback (static responses, human queue)
2. Triage: Review logs, identify root cause
3. Fix: Update prompts, retrain models, patch code
4. Re-eval: Run full eval suite
5. Gradual re-deploy: Canary → pilot → full rollout
Maintain version history. Fast rollback is essential.
Human-in-the-loop UI patterns (approve/annotate/retry)
Agents augment humans, not replace them. Design HITL workflows that keep humans in control:
Approval workflows
Route high-stakes decisions to humans:
Present agent recommendation + confidence + reasoning
Show relevant context (order history, customer profile)
Provide approve/reject/modify actions
Track approval latency (SLA monitoring)
Annotation workflows
Humans correct agent mistakes to improve future performance:
Show agent output vs. expected output
Provide easy correction interface (edit, select correct option)
Feed corrections back into retraining pipeline
Measure annotation quality (inter-rater agreement) to ensure reliable ground truth.
Retry workflows
When agents fail, let humans retry with adjustments:
Show error message and context
Allow manual parameter tweaks (temperature, prompt modifications)
Re-run agent with new settings
Log retry attempts for later analysis
HITL reviewer flow (state machine)
[Agent Completes]
├─→ High confidence → Auto-approve → [Done]
├─→ Medium confidence → Human review → Approve/Reject → [Done]
└─→ Low confidence → Human override → Manual completion → [Done]
[Human Review]
├─→ Approve: Log acceptance, execute action
├─→ Reject: Log rejection reason, route to manual queue
└─→ Modify: Annotate correction, re-run agent, log update
Observability: success, fallback, cost
Instrument everything. You can't improve what you don't measure.
Success metrics
Track:
Completion rate: % of agent runs that succeed
Accuracy: % of outputs matching expected results (from evals)
Latency: P50/P95/P99 response times
User satisfaction: Thumbs up/down, CSAT surveys
Fallback metrics
When agents fail:
Fallback rate: % of requests routed to human/static fallback
Fallback reasons: Categorize failures (low confidence, timeout, error)
Recovery time: How long until agent restored after incident
Cost metrics
Monitor spending:
Token usage: Tokens per request, daily/monthly totals
API costs: Dollars per interaction, by model/provider
Infrastructure: Compute, storage, bandwidth
Set budgets and alerts. Cost spikes often indicate abuse or inefficiency.
Run logs
Store comprehensive logs for every agent execution:
Timestamp, user ID, session ID
Input (prompt, parameters, context)
Output (response, confidence, tokens used)
Actions taken (API calls, database writes)
Success/failure status and error messages
Enable searchability (Elasticsearch, CloudWatch Logs Insights). Logs are your debugging and audit trail.