Agentic AI Systems: Guardrails, Evals, and Human-in-the-Loop

How to design multi-agent automations that pass a security review: policy→prompts, eval suites, safety gates, and HITL patterns.

By Dr. Amara Okafor, Lead, Agentic AI Practice · 12 min read

Risk model and policy mapping (PII, actions, approvals)

Before building agents, map your organization's policies to technical controls. Ask:

What can go wrong?

Identify failure modes:

  • Data leakage: PII, confidential info in prompts/responses
  • Unauthorized actions: Agents exceeding their authority
  • Cost overruns: Uncontrolled API usage
  • Quality issues: Hallucinations, incorrect outputs
  • Security: Prompt injection, jailbreaks

Policy → technical controls

Translate policies into enforceable rules:

  • "Don't share customer PII" → Input/output filters, redaction
  • "Require approval for orders >$10K" → Human-in-the-loop gate
  • "Limit AI spending to $500/day" → Rate limits, budget tracking
  • "Audit all decisions" → Complete logging

Document this mapping. It becomes your acceptance criteria.
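
As one illustration of the first mapping ("Don't share customer PII" → output filters, redaction), here is a minimal sketch of a redaction filter. The regular expressions and the `redact_pii` name are assumptions for illustration; they cover only simple email addresses and US-style phone numbers, and a production filter would likely need much broader, locale-aware detection.

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

The same function can run on both the prompt going in and the response coming out, so one rule backs both sides of the policy.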

Prompt & tool contracts (least-privilege)

Design agents with narrow, explicit capabilities. Don't give a customer-service agent access to your entire API; scope each tool to the minimum permissions the agent needs to do its job.

Prompt contracts

Define clear interfaces for each agent:

  • Input schema: What data the agent receives (typed, validated)
  • Output schema: What the agent returns (structured, not free-form)
  • Constraints: Boundaries the agent must respect

Example:

Agent: Order Processor
Input: { orderId: string, action: "cancel" | "refund" }
Output: { success: boolean, message: string, auditLog: string }
Constraints: 
  - Order must belong to authenticated user
  - Refunds ≤$10K auto-approve; >$10K route to human
  - All actions logged to audit table
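
The contract above could be expressed in code roughly as follows. This is a sketch under stated assumptions: the class names, the `route_refund` helper, and the snake_case field names are hypothetical, and only the $10K threshold and the field shapes come from the example.

```python
from dataclasses import dataclass
from typing import Literal

AUTO_APPROVE_LIMIT_USD = 10_000  # threshold taken from the constraint above


@dataclass(frozen=True)
class OrderRequest:
    order_id: str
    action: Literal["cancel", "refund"]

    def __post_init__(self) -> None:
        # Literal is not enforced at runtime, so validate explicitly.
        if not isinstance(self.order_id, str) or not self.order_id:
            raise ValueError("order_id must be a non-empty string")
        if self.action not in ("cancel", "refund"):
            raise ValueError(f"unsupported action: {self.action!r}")


@dataclass(frozen=True)
class OrderResult:
    success: bool
    message: str
    audit_log: str


def route_refund(amount_usd: float) -> str:
    """Return 'auto_approve' up to the limit and 'human_review' above it."""
    if amount_usd <= AUTO_APPROVE_LIMIT_USD:
        return "auto_approve"
    return "human_review"
```

Ownership checks and audit logging are left out here because they depend on your auth and storage layers.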

Tool least-privilege

Provide agents only the tools they need:

  • Customer service agent: Read orders, create support tickets (no delete)
  • Analyst agent: Read-only database access (no write)
  • Automation agent: Execute approved workflows (no arbitrary code)

Enforce through API keys with scoped permissions, not through prompts alone. Prompts can be jailbroken.
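
An application-level allowlist can sit alongside scoped API keys as a second layer. The sketch below is illustrative: the tool functions are stubs, and the agent names, `AGENT_TOOLS`, and `call_tool` are hypothetical. It complements credentials with scoped permissions and is not a substitute for them.

```python
from typing import Callable, Dict, FrozenSet


# Hypothetical tool implementations, stubbed for illustration.
def read_order(order_id: str) -> str:
    return f"order {order_id}"


def create_ticket(summary: str) -> str:
    return f"ticket: {summary}"


def delete_order(order_id: str) -> str:
    return f"deleted {order_id}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "read_order": read_order,
    "create_ticket": create_ticket,
    "delete_order": delete_order,
}

# Each agent gets only the tools it needs; anything unlisted is denied.
AGENT_TOOLS: Dict[str, FrozenSet[str]] = {
    "customer_service": frozenset({"read_order", "create_ticket"}),
    "analyst": frozenset({"read_order"}),
}


class ToolNotPermitted(Exception):
    """Raised when an agent requests a tool outside its allowlist."""


def call_tool(agent: str, tool: str, arg: str) -> str:
    allowed = AGENT_TOOLS.get(agent, frozenset())
    if tool not in allowed:
        raise ToolNotPermitted(f"{agent!r} may not call {tool!r}")
    return TOOLS[tool](arg)
```

Because unknown agents default to an empty allowlist, a misconfigured agent fails closed rather than open.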

Evals you need (accuracy, jailbreak, toxicity, cost)

Continuous evaluation prevents silent degradation. Build automated test suites:

Accuracy evals

Test agent outputs against golden datasets:

  • Does the agent extract correct information?
  • Are calculations accurate?
  • Do responses match expected format?

Run these on every deployment. A regression against the golden dataset should trigger a rollback.
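
A golden-dataset check can be as small as the sketch below. It assumes exact-match scoring and treats the agent as a plain callable; `run_accuracy_eval` and the threshold parameter are hypothetical names, and many real tasks need fuzzier comparison than string equality.

```python
from typing import Callable, Sequence, Tuple


def run_accuracy_eval(
    agent: Callable[[str], str],
    golden: Sequence[Tuple[str, str]],
    min_accuracy: float,
) -> Tuple[float, bool]:
    """Score the agent on (input, expected) pairs using exact match.

    Returns the accuracy and whether it meets the minimum threshold.
    """
    if not golden:
        raise ValueError("golden dataset must not be empty")
    passed = sum(1 for prompt, expected in golden if agent(prompt) == expected)
    accuracy = passed / len(golden)
    return accuracy, accuracy >= min_accuracy
```

In CI, a `False` in the second return value would fail the build or start the rollback.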

Safety evals

Test for adversarial behavior:

  • Jailbreak attempts: Can users trick the agent into ignoring rules?
  • Prompt injection: Can users manipulate the agent's instructions?
  • PII leakage: Does the agent expose sensitive data?

Maintain a "red team" dataset of known attacks. Add new attack vectors as discovered.
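
One way to automate part of a red-team dataset is a canary check: plant a marker string in the system prompt and flag any response that echoes it. The sketch below assumes that setup; `CANARY` and `find_leaks` are hypothetical names, and a canary only catches verbatim leakage, not paraphrased disclosure.

```python
from typing import Callable, Iterable, List

# Hypothetical marker assumed to be planted in the agent's system prompt.
CANARY = "SECRET-CANARY-1234"


def find_leaks(agent: Callable[[str], str], attacks: Iterable[str]) -> List[str]:
    """Return the attack prompts whose responses contain the canary."""
    return [attack for attack in attacks if CANARY in agent(attack)]
```

An empty result is the passing condition; any returned prompt is a reproducible failing case to add to the dataset.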

Quality evals

Measure subjective quality:

  • Relevance: Does the response address the query?
  • Coherence: Is the response logically consistent?
  • Toxicity: Does the agent generate harmful content?

Use LLM-as-judge or human raters. Set acceptance thresholds (e.g., ≥4.0/5.0 average).

Cost evals

Track operational costs:

  • Tokens per interaction
  • API calls per workflow
  • Average cost per user session

Set budget alerts. If costs spike, investigate prompt inefficiencies or abuse.
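
Cost tracking reduces to simple arithmetic over token counts. The sketch below assumes per-1K-token pricing passed in by the caller; the `BudgetTracker` class is hypothetical, and the prices used with it are placeholders rather than any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass
class BudgetTracker:
    daily_limit_usd: float
    spent_usd: float = 0.0

    def record(
        self,
        input_tokens: int,
        output_tokens: int,
        usd_per_1k_input: float,
        usd_per_1k_output: float,
    ) -> float:
        """Add one interaction's cost to the running total and return it."""
        cost = (
            input_tokens / 1000 * usd_per_1k_input
            + output_tokens / 1000 * usd_per_1k_output
        )
        self.spent_usd += cost
        return cost

    def over_budget(self) -> bool:
        return self.spent_usd > self.daily_limit_usd
```

A tracker created with `BudgetTracker(daily_limit_usd=500.0)` would enforce the "$500/day" policy from the mapping earlier, provided the total is reset each day.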

Minimal eval suite checklist

Before production:

  • 100+ accuracy test cases (happy path + edge cases)
  • 50+ safety test cases (jailbreaks, injections)
  • Cost per interaction measured and within budget
  • Quality spot-checked by human raters (n≥50)
  • All evals automated in CI/CD

Safety gates & rollback strategies

Even with evals, things break. Build layered defenses:

Pre-flight checks

Before executing actions, validate:

  • Input schema compliance
  • User authorization
  • Rate limits not exceeded
  • Known-bad patterns not present

Reject invalid requests before they reach the agent.
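
The four checks above might be combined into a single gate like the one sketched here. The request field names (`orderId`, `action`, `userId`, `note`), the `preflight` function, and the single bad-pattern regex are assumptions for illustration; a real deny list would be longer and maintained alongside the red-team dataset.

```python
import re
from typing import Any, Dict, List, Set

# Illustrative deny list; extend it as new attack patterns are discovered.
BAD_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]


def preflight(
    request: Dict[str, Any],
    authorized_users: Set[str],
    calls_this_minute: int,
    rate_limit: int,
) -> List[str]:
    """Return the list of failed checks; an empty list means proceed."""
    errors: List[str] = []
    if not isinstance(request.get("orderId"), str) or request.get("action") not in (
        "cancel",
        "refund",
    ):
        errors.append("schema")
    if request.get("userId") not in authorized_users:
        errors.append("unauthorized")
    if calls_this_minute >= rate_limit:
        errors.append("rate_limited")
    note = str(request.get("note", ""))
    if any(pattern.search(note) for pattern in BAD_PATTERNS):
        errors.append("bad_pattern")
    return errors
```

Returning every failed check, rather than stopping at the first, gives ops a fuller picture when a request is rejected.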

Runtime guardrails

While agents run, monitor:

  • Token usage (abort if exceeds threshold)
  • Confidence scores (route low-confidence to human)
  • Execution time (timeout if too slow)

Implement circuit breakers. If the error rate crosses a threshold, disable the agent and route requests to a fallback.
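
A minimal circuit breaker over a sliding window of recent results might look like this. The class, its default window size, and its thresholds are assumptions chosen for illustration, and this sketch deliberately omits the automatic "half-open" recovery state that fuller implementations include.

```python
from collections import deque
from typing import Deque


class CircuitBreaker:
    def __init__(
        self, window: int = 20, max_error_rate: float = 0.5, min_calls: int = 5
    ) -> None:
        self.results: Deque[bool] = deque(maxlen=window)  # recent outcomes only
        self.max_error_rate = max_error_rate
        self.min_calls = min_calls
        self.open = False  # open means the agent is disabled

    def record(self, success: bool) -> None:
        """Record one outcome and trip the breaker if errors are too frequent."""
        self.results.append(success)
        if len(self.results) >= self.min_calls:
            error_rate = self.results.count(False) / len(self.results)
            if error_rate > self.max_error_rate:
                self.open = True

    def allow(self) -> bool:
        """Return True while the agent may still handle requests."""
        return not self.open
```

Callers check `allow()` before invoking the agent and send the request to the fallback path when it returns `False`. Once tripped, this version stays open until someone resets it, which matches the rollback flow of disabling first and re-enabling after re-evaluation.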

Post-execution validation

After the agent completes, check:

  • Output schema compliance
  • Sensitive data redaction
  • Audit log completeness

Don't return outputs that fail validation. Log failures and alert ops.
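
Output validation can reuse the output schema from the Order Processor contract. The sketch below checks field presence, types, and a non-empty audit log; `REQUIRED_FIELDS` and `validate_output` are hypothetical names, and redaction checks would be layered on separately.

```python
from typing import Any, Dict, List

# Field names and types follow the Order Processor output schema.
REQUIRED_FIELDS = {"success": bool, "message": str, "auditLog": str}


def validate_output(output: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the output may be returned."""
    problems: List[str] = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in output:
            problems.append(f"missing field: {name}")
        elif not isinstance(output[name], expected_type):
            problems.append(f"wrong type for {name}")
    # Only inspect the audit log once the structure itself is valid.
    if not problems and not output["auditLog"].strip():
        problems.append("empty audit log")
    return problems
```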

Rollback strategy

When issues are detected:

  1. Immediate: Disable agent, route traffic to fallback (static responses, human queue)
  2. Triage: Review logs, identify root cause
  3. Fix: Update prompts, retrain models, patch code
  4. Re-eval: Run full eval suite
  5. Gradual re-deploy: Canary → pilot → full rollout

Maintain version history. Fast rollback is essential.

Human-in-the-loop UI patterns (approve/annotate/retry)

Agents augment humans; they do not replace them. Design HITL workflows that keep humans in control:

Approval workflows

Route high-stakes decisions to humans:

  • Present agent recommendation + confidence + reasoning
  • Show relevant context (order history, customer profile)
  • Provide approve/reject/modify actions
  • Track approval latency (SLA monitoring)

Annotation workflows

Humans correct agent mistakes to improve future performance:

  • Show agent output vs. expected output
  • Provide easy correction interface (edit, select correct option)
  • Feed corrections back into retraining pipeline

Measure annotation quality (inter-rater agreement) to ensure reliable ground truth.
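
Cohen's kappa is one common measure of agreement between two raters, correcting raw agreement for what chance alone would produce. The source does not name a specific statistic, so treat this choice as an assumption; the sketch handles exactly two raters labelling the same items.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labelled at random with their own frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Values near 1 indicate strong agreement, while values near 0 suggest the labels agree no more often than chance, which is a signal to tighten the annotation guidelines before feeding corrections into retraining.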

Retry workflows

When agents fail, let humans retry with adjustments:

  • Show error message and context
  • Allow manual parameter tweaks (temperature, prompt modifications)
  • Re-run agent with new settings
  • Log retry attempts for later analysis

HITL reviewer flow (state machine)

[Agent Completes] 
  ├─→ High confidence → Auto-approve → [Done]
  ├─→ Medium confidence → Human review → Approve/Reject → [Done]
  └─→ Low confidence → Human override → Manual completion → [Done]

[Human Review]
  ├─→ Approve: Log acceptance, execute action
  ├─→ Reject: Log rejection reason, route to manual queue
  └─→ Modify: Annotate correction, re-run agent, log update
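
The first half of the flow above is a three-way routing decision on confidence. A sketch, assuming confidence is a score in [0, 1]; the `Route` enum and the 0.9 and 0.6 cut-offs are placeholders to be tuned against your own eval data:

```python
from enum import Enum


class Route(Enum):
    AUTO_APPROVE = "auto_approve"
    HUMAN_REVIEW = "human_review"
    HUMAN_OVERRIDE = "human_override"


def route_by_confidence(
    confidence: float, high: float = 0.9, low: float = 0.6
) -> Route:
    """Map a confidence score to one of the three branches in the flow."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= high:
        return Route.AUTO_APPROVE
    if confidence >= low:
        return Route.HUMAN_REVIEW
    return Route.HUMAN_OVERRIDE
```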

Observability: success, fallback, cost

Instrument everything. You can't improve what you don't measure.

Success metrics

Track:

  • Completion rate: % of agent runs that succeed
  • Accuracy: % of outputs matching expected results (from evals)
  • Latency: P50/P95/P99 response times
  • User satisfaction: Thumbs up/down, CSAT surveys
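
Latency percentiles can be computed from raw samples with the nearest-rank method, sketched below. Monitoring systems often use interpolation or streaming approximations instead, so this is one simple definition rather than the only one; `percentile` is a hypothetical helper.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

Calling it with 50, 95, and 99 over the same window of response times gives the P50/P95/P99 figures listed above.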

Fallback metrics

When agents fail:

  • Fallback rate: % of requests routed to human/static fallback
  • Fallback reasons: Categorize failures (low confidence, timeout, error)
  • Recovery time: How long until the agent is restored after an incident

Cost metrics

Monitor spending:

  • Token usage: Tokens per request, daily/monthly totals
  • API costs: Dollars per interaction, by model/provider
  • Infrastructure: Compute, storage, bandwidth

Set budgets and alerts. Cost spikes often indicate abuse or inefficiency.

Run logs

Store complete logs for every agent execution:

  • Timestamp, user ID, session ID
  • Input (prompt, parameters, context)
  • Output (response, confidence, tokens used)
  • Actions taken (API calls, database writes)
  • Success/failure status and error messages

Enable searchability (Elasticsearch, CloudWatch Logs Insights). Logs are your debugging and audit trail.
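
Emitting each run as a single JSON object keeps logs easy to index and query. The sketch below covers the fields listed above; the `build_run_log` function and its key names are assumptions, and it returns the serialized record rather than writing to any particular log backend.

```python
import json
from datetime import datetime, timezone
from typing import List, Optional


def build_run_log(
    user_id: str,
    session_id: str,
    prompt: str,
    response: str,
    confidence: float,
    tokens_used: int,
    actions: List[str],
    error: Optional[str] = None,
) -> str:
    """Serialize one agent execution as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "session_id": session_id,
        "input": {"prompt": prompt},
        "output": {
            "response": response,
            "confidence": confidence,
            "tokens_used": tokens_used,
        },
        "actions": actions,
        "status": "failure" if error else "success",
        "error": error,
    }
    return json.dumps(record)
```

If prompts or responses may contain PII, redact them before they reach this function so the audit trail does not become a leakage path.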
