# Guardrail

A secondary LLM that validates tool calls and user messages against a safety policy before they execute or enter the conversation.
## Overview

The guardrail feature uses a dedicated LLM (or the current agent's model with a custom prompt) to classify content as safe or unsafe. Two independent scopes control what gets checked:
- `tool_calls` — validates tool calls before execution (after the tool policy check)
- `user_messages` — validates user input before it enters the conversation
Each scope has its own agent and/or prompt, allowing different models or policies for each.
## Modes

| Mode | Behavior |
|---|---|
| `""` (empty) | Disabled (default) |
| `"log"` | Shows a flagged notice in the UI but proceeds normally |
| `"block"` | Rejects the tool call or user message |
## Configuration

File: `.aura/config/features/guardrail.yaml`
```yaml
guardrail:
  # "block", "log", or "" (disabled, default)
  mode: block

  # Error policy — what happens when the guardrail check itself fails.
  # "block" = fail-closed (default for mode: block)
  # "allow" = fail-open (default for mode: log)
  on_error: allow

  # Max duration per guardrail check. Default: 2m when enabled.
  timeout: 2m

  # Independent scopes — each can have its own agent or prompt.
  # A scope is active when agent or prompt is set.
  scope:
    tool_calls:
      # Dedicated agent for tool call validation.
      agent: "GuardRail:Tool"
      # OR: named prompt for self-guardrail (uses current agent's model).
      # prompt: guardrail-tool
    user_messages:
      # Dedicated agent for user message validation.
      agent: "GuardRail:Input"
      # OR: named prompt for self-guardrail.
      # prompt: guardrail-input

  # Filter which tools trigger guardrail checks.
  # Same glob pattern system as tool filtering elsewhere.
  tools:
    enabled: []   # only check matching tools (empty = all)
    disabled: []  # skip matching tools (applied after enabled)
```
## Agent vs Prompt

Each scope supports two resolution strategies:

- **Agent** (`agent: "GuardRail:Tool"`) — uses a dedicated hidden agent with its own provider, model, and system prompt. The guardrail model runs independently of the conversation model.
- **Prompt** (`prompt: guardrail-tool`) — self-guardrail mode. Uses the current agent's provider and model with a named system prompt from `prompts/system/`. No separate model needed.

If both `agent` and `prompt` are set, `prompt` takes precedence.
## Response Protocol

The guardrail request includes a JSON schema constraint (`response_format: json_schema`) that forces compliant models to return structured JSON:

```json
{"result": "safe"}
```

The schema restricts `result` to exactly `"safe"` or `"unsafe"`.
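A schema constraint of roughly the following shape would enforce that protocol. This OpenAI-style `response_format` envelope is an assumption for illustration; the exact wire format depends on the provider:

```json
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "guardrail_verdict",
      "schema": {
        "type": "object",
        "properties": {
          "result": { "type": "string", "enum": ["safe", "unsafe"] }
        },
        "required": ["result"]
      }
    }
  }
}
```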
Parsing order:

- **JSON parse** (primary) — parses the response as `{"result": "safe"|"unsafe"}`. Used when the model honors the schema constraint.
- **First-token parse** (fallback) — extracts the first whitespace-delimited token and matches it against `safe` (allow) or `unsafe` (flag). Used for providers or models that ignore `ResponseFormat`.
Any unrecognized response is treated as unsafe (fail-closed).
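The two-stage parse can be sketched as follows. This is an illustrative model of the protocol above, not the actual implementation; `parse_verdict` is an invented name:

```python
import json

def parse_verdict(response: str) -> str:
    """Map a guardrail model response to 'safe' or 'unsafe' (fail-closed)."""
    # 1. JSON parse (primary): the schema-constrained shape.
    try:
        result = json.loads(response).get("result")
        if result in ("safe", "unsafe"):
            return result
    except (json.JSONDecodeError, AttributeError):
        pass  # not JSON, or JSON without a usable "result" field
    # 2. First-token parse (fallback) for models that ignore ResponseFormat.
    tokens = response.split()
    if tokens and tokens[0] in ("safe", "unsafe"):
        return tokens[0]
    # 3. Anything unrecognized is treated as unsafe (fail-closed).
    return "unsafe"
```

Note that the fail-closed default lives at the very end: every path that cannot positively identify a verdict falls through to `"unsafe"`.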
## Tool Filtering

The `tools.enabled` and `tools.disabled` fields control which tool calls trigger guardrail checks. This is independent of tool availability — a tool can be enabled for use but excluded from guardrail validation.
```yaml
guardrail:
  mode: block
  scope:
    tool_calls:
      agent: "GuardRail:Tool"
  tools:
    enabled:
      - Bash
      - Patch
      - Write
    disabled:
      - "Bash:ls *"
```
## Integration Points
**Tool calls:** Guardrail runs after the tool policy check but before execution. If blocked, the tool call completes with an error result and the assistant continues to the next tool call. In log mode, the flagged notice is added as a `DisplayOnly` message (visible in the UI and session history, but never sent to the LLM).
**Batch sub-calls:** Each sub-call within a `Batch` invocation is individually guardrail-checked via the same `CheckGuardrail()` path. If a sub-call is blocked, it returns an error in the `Batch` results while other sub-calls proceed independently.
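That independence can be modeled with a small behavioral sketch (all names here are invented; this is not the implementation):

```python
def run_batch(sub_calls, check_guardrail, execute):
    """Guardrail-check each Batch sub-call independently; a blocked
    sub-call yields an error result while the others still run."""
    results = []
    for call in sub_calls:
        if check_guardrail(call):
            results.append({"call": call, "result": execute(call)})
        else:
            results.append({"call": call, "error": "blocked by guardrail"})
    return results
```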
**User messages:** Guardrail runs after compaction and input size checks. If blocked, the message is rejected with a user-facing notice.
## Error Policy

The `on_error` field controls what happens when the guardrail check itself fails (timeout, network error, model unavailable — after retries). This is independent of `mode`, which controls what happens when content is flagged as unsafe.

| `on_error` | Behavior |
|---|---|
| `"block"` | Fail-closed — block the content. Default when `mode: block`. |
| `"allow"` | Fail-open — proceed without guardrail check. Default when `mode: log`. |
When `on_error` is omitted, it defaults to match `mode`: `block` → `block`, `log` → `allow`. Set it explicitly to decouple error behavior from policy enforcement — e.g. `mode: block` + `on_error: allow` blocks unsafe content but doesn't paralyze the session when the guardrail provider is down.
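The defaulting rule above amounts to a tiny helper (a sketch of the rule, with an invented function name):

```python
def resolve_on_error(mode: str, on_error: str = "") -> str:
    """Effective error policy: an explicit on_error wins, else follow mode."""
    if on_error:
        return on_error
    # Omitted: block mode fails closed, log mode fails open.
    return "block" if mode == "block" else "allow"
```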
## Error Handling

- Guardrail errors with `on_error: block`: The content is blocked (fail-closed).
- Guardrail errors with `on_error: allow`: Logged, and the content proceeds normally (fail-open).
- Unknown response tokens: Treated as unsafe (fail-closed) — this is a policy decision controlled by `mode`, not `on_error`.
## Per-Agent Override

Override guardrail settings per agent via `features:` in agent frontmatter:

```yaml
features:
  guardrail:
    mode: log
    on_error: block  # fail-closed even in log mode
    scope:
      tool_calls:
        prompt: guardrail-tool  # self-guardrail with this agent's model
```
## Default Agents

Two pre-built feature agents are provided:

- `GuardRail:Tool` — classifies tool calls against sandbox policy (path traversal, destructive commands, data exfiltration, indirect execution)
- `GuardRail:Input` — classifies user messages for sensitive content (API keys, tokens, passwords, credentials)

Both are hidden agents in `.aura/config/agents/features/guardrail/`.