How do we prevent unsafe actions?

Layer defenses: scope agent permissions (least-privilege tools), validate inputs/outputs, enforce approval gates for high-stakes actions, and maintain audit logs. Test with adversarial evals.

Can we run on-prem/GovCloud?

Yes. Deploy agents and LLMs within your infrastructure. Use self-hosted models (Llama, Mistral) or GovCloud-approved API providers. All data stays in your environment.

What belongs in an agent 'run log'?

Log inputs (prompt, parameters), outputs (response, confidence), actions taken (API calls, data accessed), success/failure status, and timestamps. This enables debugging, audit, and retraining.

Agentic AI Systems: Guardrails, Evals, and Human-in-the-Loop

Multi-agent system architecture diagram showing guardrails and human review checkpoints

Risk model and policy mapping (PII, actions, approvals)

Before building agents, map your organization's policies to technical controls. Ask:

What can go wrong?

Identify failure modes:

Data leakage: PII, confidential info in prompts/responses

Unauthorized actions: Agents exceeding their authority

Cost overruns: Uncontrolled API usage

Quality issues: Hallucinations, incorrect outputs

Security: Prompt injection, jailbreaks

Policy → technical controls

Translate policies into enforceable rules:

"Don't share customer PII" → Input/output filters, redaction

"Require approval for orders >$10K" → Human-in-the-loop gate

"Limit AI spending to $500/day" → Rate limits, budget tracking

"Audit all decisions" → Comprehensive logging

Document this mapping. It becomes your acceptance criteria.

Prompt & tool contracts (least-privilege)

Design agents with narrow, explicit capabilities. Don't give a customer-service agent access to your entire API—scope tools to minimum necessary permissions.

Prompt contracts

Define clear interfaces for each agent:

Input schema: What data the agent receives (typed, validated)

Output schema: What the agent returns (structured, not free-form)

Constraints: Boundaries the agent must respect

Example:


Agent: Order Processor
Input: { orderId: string, action: "cancel" | "refund" }
Output: { success: boolean, message: string, auditLog: string }
Constraints: 
  - Order must belong to authenticated user
  - Refunds ≤$10K auto-approve; >$10K route to human
  - All actions logged to audit table


Tool least-privilege

Provide agents only the tools they need:
Customer service agent: Read orders, create support tickets (no delete)
Analyst agent: Read-only database access (no write)
Automation agent: Execute approved workflows (no arbitrary code)

Enforce through API keys with scoped permissions, not through prompts alone—prompts can be jailbroken.

Evals you need (accuracy, jailbreak, toxicity, cost)

Continuous evaluation prevents silent degradation. Build automated test suites:

Accuracy evals

Test agent outputs against golden datasets:
Does the agent extract correct information?
Are calculations accurate?
Do responses match expected format?

Run on every deployment. Regression should trigger rollback.

Safety evals

Test for adversarial behavior:
Jailbreak attempts: Can users trick the agent into ignoring rules?
Prompt injection: Can users manipulate the agent's instructions?
PII leakage: Does the agent expose sensitive data?

Maintain a "red team" dataset of known attacks. Add new attack vectors as discovered.

Quality evals

Measure subjective quality:
Relevance: Does the response address the query?
Coherence: Is the response logically consistent?
Toxicity: Does the agent generate harmful content?

Use LLM-as-judge or human raters. Set acceptance thresholds (e.g., ≥4.0/5.0 average).

Cost evals

Track operational costs:
Tokens per interaction
API calls per workflow
Average cost per user session

Set budget alerts. If costs spike, investigate prompt inefficiencies or abuse.

Minimal eval suite checklist

Before production:
[ ] 100+ accuracy test cases (happy path + edge cases)
[ ] 50+ safety test cases (jailbreaks, injections)
[ ] Cost per interaction measured and within budget
[ ] Quality spot-checked by human raters (n≥50)
[ ] All evals automated in CI/CD

Safety gates & rollback strategies

Even with evals, things break. Build layered defenses:

Pre-flight checks

Before executing actions, validate:
Input schema compliance
User authorization
Rate limits not exceeded
Known-bad patterns not present

Reject invalid requests before they reach the agent.

Runtime guardrails

While agents run, monitor:
Token usage (abort if exceeds threshold)
Confidence scores (route low-confidence to human)
Execution time (timeout if too slow)

Implement circuit breakers—if error rate crosses threshold, disable agent and route to fallback.

Post-execution validation

After agent completes, check:
Output schema compliance
Sensitive data redaction
Audit log completeness

Don't return outputs that fail validation. Log failures and alert ops.

Rollback strategy

When issues are detected:
1. Immediate: Disable agent, route traffic to fallback (static responses, human queue)
2. Triage: Review logs, identify root cause
3. Fix: Update prompts, retrain models, patch code
4. Re-eval: Run full eval suite
5. Gradual re-deploy: Canary → pilot → full rollout

Maintain version history. Fast rollback is essential.

Human-in-the-loop UI patterns (approve/annotate/retry)

Agents augment humans, not replace them. Design HITL workflows that keep humans in control:

Approval workflows

Route high-stakes decisions to humans:
Present agent recommendation + confidence + reasoning
Show relevant context (order history, customer profile)
Provide approve/reject/modify actions
Track approval latency (SLA monitoring)

Annotation workflows

Humans correct agent mistakes to improve future performance:
Show agent output vs. expected output
Provide easy correction interface (edit, select correct option)
Feed corrections back into retraining pipeline

Measure annotation quality (inter-rater agreement) to ensure reliable ground truth.

Retry workflows

When agents fail, let humans retry with adjustments:
Show error message and context
Allow manual parameter tweaks (temperature, prompt modifications)
Re-run agent with new settings
Log retry attempts for later analysis

HITL reviewer flow (state machine)


[Agent Completes] 
  ├─→ High confidence → Auto-approve → [Done]
  ├─→ Medium confidence → Human review → Approve/Reject → [Done]
  └─→ Low confidence → Human override → Manual completion → [Done]

[Human Review]
  ├─→ Approve: Log acceptance, execute action
  ├─→ Reject: Log rejection reason, route to manual queue
  └─→ Modify: Annotate correction, re-run agent, log update


Observability: success, fallback, cost

Instrument everything. You can't improve what you don't measure.

Success metrics

Track:
Completion rate: % of agent runs that succeed
Accuracy: % of outputs matching expected results (from evals)
Latency: P50/P95/P99 response times
User satisfaction: Thumbs up/down, CSAT surveys

Fallback metrics

When agents fail:
Fallback rate: % of requests routed to human/static fallback
Fallback reasons: Categorize failures (low confidence, timeout, error)
Recovery time: How long until agent restored after incident

Cost metrics

Monitor spending:
Token usage: Tokens per request, daily/monthly totals
API costs: Dollars per interaction, by model/provider
Infrastructure: Compute, storage, bandwidth

Set budgets and alerts. Cost spikes often indicate abuse or inefficiency.

Run logs

Store comprehensive logs for every agent execution:
Timestamp, user ID, session ID
Input (prompt, parameters, context)
Output (response, confidence, tokens used)
Actions taken (API calls, database writes)
Success/failure status and error messages

Enable searchability (Elasticsearch, CloudWatch Logs Insights). Logs are your debugging and audit trail.

Agentic AI Systems: Guardrails, Evals, and Human-in-the-Loop

Risk model and policy mapping (PII, actions, approvals)

What can go wrong?

Policy → technical controls

Prompt & tool contracts (least-privilege)

Prompt contracts

Tool least-privilege

Evals you need (accuracy, jailbreak, toxicity, cost)

Accuracy evals

Safety evals

Quality evals

Cost evals

Minimal eval suite checklist

Safety gates & rollback strategies

Pre-flight checks

Runtime guardrails

Post-execution validation

Rollback strategy

Human-in-the-loop UI patterns (approve/annotate/retry)

Approval workflows

Annotation workflows

Retry workflows

HITL reviewer flow (state machine)

Observability: success, fallback, cost

Success metrics

Fallback metrics

Cost metrics

Run logs

Frequently asked questions

Related resources

ALPR & Public-Records Redaction: Policy, CJIS, and Audit in Practice

Ready to get started?