Guardrail

A secondary LLM that validates tool calls and user messages against a safety policy before they execute or enter the conversation.

Overview

The guardrail feature uses a dedicated LLM (or the current agent’s model with a custom prompt) to classify content as safe or unsafe. Two independent scopes control what gets checked:

  • tool_calls — validates tool calls before execution (after tool policy check)
  • user_messages — validates user input before it enters the conversation

Each scope has its own agent and/or prompt, allowing different models or policies for each.

Modes

Mode         Behavior
"" (empty)   Disabled (default)
"log"        Shows a flagged notice in the UI but proceeds normally
"block"      Rejects the tool call or user message

Configuration

File: .aura/config/features/guardrail.yaml

guardrail:
  # "block", "log", or "" (disabled, default)
  mode: block

  # Error policy — what happens when the guardrail check itself fails.
  # "block" = fail-closed (default for mode: block)
  # "allow" = fail-open (default for mode: log)
  on_error: allow

  # Max duration per guardrail check. Default: 2m when enabled.
  timeout: 2m

  # Independent scopes — each can have its own agent or prompt.
  # A scope is active when agent or prompt is set.
  scope:
    tool_calls:
      # Dedicated agent for tool call validation.
      agent: "GuardRail:Tool"
      # OR: named prompt for self-guardrail (uses current agent's model).
      # prompt: guardrail-tool
    user_messages:
      # Dedicated agent for user message validation.
      agent: "GuardRail:Input"
      # OR: named prompt for self-guardrail.
      # prompt: guardrail-input

  # Filter which tools trigger guardrail checks.
  # Same glob pattern system as tool filtering elsewhere.
  tools:
    enabled: []     # only check matching tools (empty = all)
    disabled: []    # skip matching tools (applied after enabled)

Agent vs Prompt

Each scope supports two resolution strategies:

  • Agent (agent: "GuardRail:Tool") — uses a dedicated hidden agent with its own provider, model, and system prompt. The guardrail model runs independently of the conversation model.
  • Prompt (prompt: guardrail-tool) — self-guardrail mode. Uses the current agent’s provider and model with a named system prompt from prompts/system/. No separate model needed.

If both agent and prompt are set, prompt takes precedence.

Response Protocol

The guardrail request includes a JSON schema constraint (response_format: json_schema) that forces compliant models to return structured JSON:

{"result": "safe"}

The schema restricts result to exactly "safe" or "unsafe".
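A json_schema constraint expressing that restriction could look like the following sketch (the exact schema the feature emits is not shown in this doc):

```json
{
  "type": "object",
  "properties": {
    "result": { "type": "string", "enum": ["safe", "unsafe"] }
  },
  "required": ["result"],
  "additionalProperties": false
}
```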

Parsing order:

  1. JSON parse (primary) — parses the response as {"result": "safe"|"unsafe"}. Used when the model honors the schema constraint.
  2. First-token parse (fallback) — extracts the first whitespace-delimited token and matches against safe (allow) or unsafe (flag). Used for providers or models that ignore ResponseFormat.

Any unrecognized response is treated as unsafe (fail-closed).

Tool Filtering

The tools.enabled and tools.disabled fields control which tool calls trigger guardrail checks. This is independent of tool availability — a tool can be enabled for use but excluded from guardrail validation.

guardrail:
  mode: block
  scope:
    tool_calls:
      agent: "GuardRail:Tool"
  tools:
    enabled:
      - Bash
      - Patch
      - Write
    disabled:
      - "Bash:ls *"

Integration Points

Tool calls: Guardrail runs after tool policy check but before execution. If blocked, the tool call completes with an error result and the assistant continues to the next tool call. In log mode, the flagged notice is added as a DisplayOnly message (visible in the UI and session history, but never sent to the LLM).

Batch sub-calls: Each sub-call within a Batch invocation is individually guardrail-checked via the same CheckGuardrail() path. If a sub-call is blocked, it returns an error in the Batch results while other sub-calls proceed independently.

User messages: Guardrail runs after compaction and input size checks. If blocked, the message is rejected with a user-facing notice.

Error Policy

The on_error field controls what happens when the guardrail check itself fails (timeout, network error, model unavailable — after retries). This is independent of mode, which controls what happens when content is flagged as unsafe.

on_error    Behavior
"block"     Fail-closed: block the content. Default when mode: block.
"allow"     Fail-open: proceed without the guardrail check. Default when mode: log.

When on_error is omitted, it defaults to match mode: block→block, log→allow. Set it explicitly to decouple error behavior from policy enforcement — e.g. mode: block + on_error: allow blocks unsafe content but doesn’t paralyze the session when the guardrail provider is down.

Error Handling

  • Guardrail errors with on_error: block: The content is blocked (fail-closed).
  • Guardrail errors with on_error: allow: Logged and the content proceeds normally (fail-open).
  • Unknown response tokens: Treated as unsafe (fail-closed) — this is a policy decision controlled by mode, not on_error.

Per-Agent Override

Override guardrail settings per agent via features: in agent frontmatter:

features:
  guardrail:
    mode: log
    on_error: block  # fail-closed even in log mode
    scope:
      tool_calls:
        prompt: guardrail-tool  # self-guardrail with this agent's model

Default Agents

Two pre-built feature agents are provided:

  • GuardRail:Tool — classifies tool calls against sandbox policy (path traversal, destructive commands, data exfiltration, indirect execution)
  • GuardRail:Input — classifies user messages for sensitive content (API keys, tokens, passwords, credentials)

Both are hidden agents in .aura/config/agents/features/guardrail/.



Copyright © 2026 idelchi. Distributed under the MIT License.