Agentic Control Layers: Production Reliability Patterns for LLM Agents
Most agentic AI demos end the same way: a smooth video of an LLM that reads a calendar, books a flight, and emails a confirmation. The demos rarely show what happens when the agent picks the wrong calendar, books a flight to the wrong airport, or — our personal favorite — politely emails a confirmation that it could not actually do the thing it claimed to have done.
Running LLM agents in production is not a model problem; it is a control problem. The model is fine. The patterns below are the ones we run for clients who put LLM agents in front of real customers, real money, and real data. None of them are theoretical.
The agent failure modes you actually see
Before the patterns, the failure modes. In two years of running agents in production for clients, the same five issues account for ~90% of incidents:
- Wrong tool, confidently called. The agent invokes `deleteUser` when the user asked to "remove this from my dashboard."
- Right tool, wrong arguments. The agent calls `transferFunds` with the recipient and sender swapped.
- Hallucinated tool. The agent invents a tool that does not exist, then reports success when nothing happened.
- Infinite tool-call loops. The agent repeatedly calls the same tool with slightly different arguments until something kills the budget.
- Silent failure. A tool errors, the agent fabricates a plausible result, the user sees a green checkmark.
Every pattern below addresses one or more of these. None of them rely on the model getting smarter.
Layer 1: Tool whitelisting and scope per session
The model should never see tools it is not supposed to use in this session. "Filtering at the prompt level" — telling the model "do not use `deleteUser`" — is not control. It is a suggestion.
Real control:
- The orchestrator constructs the tool set per session, based on the authenticated user's role.
- Tools the user is not authorized for are physically absent from the request payload.
- A read-only session sees only read-only tools.
- A "guest" session sees a sharply reduced set, even if the underlying account technically has more permissions.
Implementation is mundane: a `getToolsForSession(user, context)` function that returns the JSON tool definitions to attach to the API call. The reason to make it a single, explicit chokepoint is auditing — every agent action can be traced back to "these tools were available, this user was authenticated, this was the prompt."
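A minimal sketch of that chokepoint in TypeScript; the `Role` type, registry entries, and `readOnly` flag are illustrative assumptions, not a fixed API:

```typescript
// Hypothetical tool registry: each entry carries the JSON Schema the
// LLM API expects, plus the roles allowed to see it.
type Role = "admin" | "member" | "guest";

interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema for the arguments
  allowedRoles: Role[];
  readOnly: boolean;
}

const TOOL_REGISTRY: ToolDefinition[] = [
  { name: "searchOrders", description: "Search orders", parameters: {}, allowedRoles: ["admin", "member", "guest"], readOnly: true },
  { name: "deleteUser", description: "Delete a user account", parameters: {}, allowedRoles: ["admin"], readOnly: false },
];

interface SessionContext {
  role: Role;
  readOnly: boolean; // true for read-only and guest sessions
}

// Unauthorized tools are absent from the returned array, so they never
// appear in the request payload at all -- not merely "discouraged".
function getToolsForSession(ctx: SessionContext): ToolDefinition[] {
  return TOOL_REGISTRY.filter(
    (tool) =>
      tool.allowedRoles.includes(ctx.role) && (!ctx.readOnly || tool.readOnly)
  );
}
```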
Layer 2: Argument validation before execution
Every tool call from the LLM goes through a schema validator before it touches a real system. Zod, Pydantic, JSON Schema — pick your flavor. The validator should:
- Reject malformed arguments with a structured error the model can read and retry from.
- Enforce business invariants the schema can express (`amount > 0`, `recipient != sender`, `date in future`).
- Reject types that look right but are wrong (`"userId": "user_42"` when the schema requires a UUID).
We have caught more bugs at this layer than at any other. The LLM will confidently emit `{"amount": "100 USD"}` when the tool expects `{"amount": 100, "currency": "USD"}`. A 5-line validator catches it; a missing validator lets it through to a function that does something unexpected.
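A minimal Zod sketch for a hypothetical `transferFunds` tool, encoding both the type checks and the business invariants named above:

```typescript
import { z } from "zod";

// Schema for a hypothetical transferFunds tool. The refinement encodes a
// business invariant the type system alone cannot: recipient != sender.
const transferArgs = z
  .object({
    amount: z.number().positive(),  // rejects {"amount": "100 USD"}
    currency: z.string().length(3), // e.g. "USD"
    sender: z.string().uuid(),      // rejects "user_42"
    recipient: z.string().uuid(),
  })
  .refine((a) => a.recipient !== a.sender, {
    message: "recipient must differ from sender",
  });

// Returns validated args, or a structured error the model can read and
// retry from -- never a raw exception.
function validateTransferArgs(rawArgs: unknown) {
  const result = transferArgs.safeParse(rawArgs);
  return result.success
    ? { args: result.data }
    : { error: "invalid_arguments", issues: result.error.issues };
}
```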
Layer 3: Budget caps — tokens, tool calls, wall clock
Every session has three caps that are enforced by the orchestrator, not by the model:
- Token cap. Maximum input + output tokens per session. Once exceeded, the session terminates with a structured error.
- Tool call cap. Maximum number of tool invocations per session. Prevents the loop-until-budget-explodes failure mode.
- Wall-clock cap. Maximum seconds per session. Catches stalled tool calls and runaway loops in real time.
These are not "we should add these someday." They are non-negotiable for any production agent. We have seen sessions go from $0.02 of normal usage to $400 of stuck-loop usage in 90 seconds. The caps prevent that.
The caps must be configurable per route. A "summarize this email" session might cap at 8K tokens and 3 tool calls. An "investigate this customer complaint" session might cap at 100K tokens and 50 tool calls. Pick the cap that matches the workflow's worst plausible run.
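One way to express this, as a sketch; the route names, numbers, and `checkBudget` signature are illustrative:

```typescript
// Per-route budget caps, enforced by the orchestrator, never by the model.
interface BudgetCaps {
  maxTokens: number; // input + output, per session
  maxToolCalls: number;
  maxWallClockMs: number;
}

const ROUTE_CAPS: Record<string, BudgetCaps> = {
  "summarize-email":       { maxTokens: 8_000,   maxToolCalls: 3,  maxWallClockMs: 30_000 },
  "investigate-complaint": { maxTokens: 100_000, maxToolCalls: 50, maxWallClockMs: 600_000 },
};

class BudgetExceeded extends Error {}

// Called by the orchestrator after every model turn and every tool call.
function checkBudget(
  caps: BudgetCaps,
  used: { tokens: number; toolCalls: number; startedAtMs: number }
): void {
  if (used.tokens > caps.maxTokens) throw new BudgetExceeded("token_cap");
  if (used.toolCalls > caps.maxToolCalls) throw new BudgetExceeded("tool_call_cap");
  if (Date.now() - used.startedAtMs > caps.maxWallClockMs) throw new BudgetExceeded("wall_clock_cap");
}
```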
Layer 4: Fallback chains and graceful degradation
Real agents call real APIs that fail. Rate limits, timeouts, partial outages, expired tokens. The agent must not just stop — it must degrade.
Patterns we use:
- Retry with backoff at the tool layer, not at the model layer. The model should not be retrying its own tool calls; the tool wrapper should retry transparently.
- Surface tool failures to the model as structured tool results, not as exceptions. `{"error": "rate_limited", "retry_after_seconds": 30}` is something the model can reason about. A raw stack trace is not.
- Have a "human in the loop" fallback path for high-stakes actions. The agent does not retry a failed wire transfer; it surfaces the failure and asks for human intervention.
- Define explicit terminal states. "I could not complete this task because X" is a valid agent response and should be treated as a successful session, not a failure to retry.
The single biggest win here is the structured error format. When tool errors are well-shaped, the model handles them gracefully. When they are raw exceptions or empty results, the model papers over the failure with invented results.
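A sketch of a tool wrapper along these lines; the retry policy and error shapes are illustrative, not prescriptive:

```typescript
// Hypothetical tool wrapper: retries transparently with exponential
// backoff, and converts terminal failures into structured results the
// model can reason about instead of raw exceptions.
type ToolResult =
  | { ok: true; data: unknown }
  | { ok: false; error: string; retry_after_seconds?: number };

async function callWithRetry(
  fn: () => Promise<unknown>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<ToolResult> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { ok: true, data: await fn() };
    } catch {
      if (attempt === maxAttempts) {
        // Terminal failure: well-shaped, so the model can plan around it.
        return { ok: false, error: "tool_failed_after_retries" };
      }
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
  return { ok: false, error: "unreachable" };
}
```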
Layer 5: Telemetry — log everything, alert on the right things
Every agent session emits structured logs covering:
- The user, the session ID, the route, the toolset.
- Every model call: input tokens, output tokens, model version, latency.
- Every tool call: tool name, arguments (with PII redaction), result, latency, success/failure.
- Final session state: completed, terminated by budget cap, terminated by error, terminated by user.
These flow into whatever observability stack you have — Honeycomb, Datadog, Grafana, ClickHouse-on-a-budget. The pattern matters more than the tool.
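As a sketch, one plausible event shape covering the fields above; the names are illustrative, not a standard:

```typescript
// One structured event per session, emitted to whatever sink you run.
interface AgentSessionEvent {
  sessionId: string;
  userId: string;
  route: string;
  toolset: string[]; // the whitelist this session actually saw
  modelCalls: { model: string; inputTokens: number; outputTokens: number; latencyMs: number }[];
  toolCalls: { tool: string; args: unknown; ok: boolean; latencyMs: number }[]; // args pre-redacted
  finalState: "completed" | "budget_cap" | "error" | "user_terminated";
}
```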
Alerts that pay for themselves:
- Tool call failure rate spike on a single tool (the tool's upstream is broken)
- Budget cap hit rate spike on a route (the route is misbehaving or under attack)
- p95 session latency over threshold (the model or a tool is degrading)
- Same user hitting budget cap repeatedly (potential abuse)
Alerts that are noise:
- Individual tool failures (handled by retry/fallback)
- Individual budget terminations (expected)
- Token usage above some absolute threshold (uninformative without context)
Layer 6: Output guardrails
The last line of defense is the agent's output. Even if every layer above passed, the agent's final message to the user can contain PII it should not reveal, claims it cannot back up, or instructions that violate policy.
Patterns:
- PII redaction on output, especially for shared/multi-user contexts. The agent learned about employee salaries from a CRM tool? Strip those numbers from the message before showing it to the requester unless they are authorized.
- Citation enforcement for factual responses. If the agent makes a claim, the orchestrator checks that the claim references a retrieved chunk or tool result. Uncited claims are stripped or flagged.
- Profanity / policy filters at the boundary, even for "trusted" internal agents. They catch the edge cases your prompt did not anticipate.
These are not censorship layers. They are correctness layers. The model is great at sounding right; the guardrails check whether it actually is.
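A deliberately naive sketch of the redaction pass; the regexes are placeholders, not a production PII detector, but the shape of the layer is the point:

```typescript
interface GuardrailContext {
  requesterAuthorizedForPII: boolean;
}

// Runs on the agent's final message, after the model is done.
function applyOutputGuardrails(message: string, ctx: GuardrailContext): string {
  let out = message;
  if (!ctx.requesterAuthorizedForPII) {
    out = out.replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[redacted email]");
    out = out.replace(/\$\d[\d,]*(\.\d+)?/g, "[redacted amount]");
  }
  return out;
}
```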
What you do not need
A few things that get sold as "agent infrastructure" and that we have found do not move the needle for typical production use cases:
- Heavy multi-agent orchestration frameworks. A single agent with good tool design beats a five-agent crew for almost every business workflow. Multi-agent is fashionable; single-agent ships.
- "Agentic memory" stores beyond conversation history. A clean conversation + retrieval over your domain data is enough for most use cases. Long-term agent memory is a rabbit hole.
- Self-improving prompt loops. The agent rewriting its own prompts sounds clever and produces unstable behavior in production.
If you are building, focus on the six layers above. Get them right and the agent is reliable. Skip them and the model's quality stops mattering.
A reference architecture
Putting it together, a production agentic system looks like:
```
[User request]
   │
   ▼
[Orchestrator: session, auth, budget caps]
   │
   ├─► [getToolsForSession()] ──► tool whitelist
   │
   ▼
[LLM API call with tool definitions]
   │
   ▼
[Tool call from model]
   │
   ▼
[Argument validator]
   │
   ▼
[Tool wrapper: retry, timeout, structured errors]
   │
   ▼
[Real tool execution]
   │
   ▼
[Result → back to LLM, with telemetry emit]
   │
   ▼
[Loop until done or budget hit]
   │
   ▼
[Output guardrails: PII, citation, policy]
   │
   ▼
[Response to user]
```
This is not novel. It is the production-engineered version of what every "build an agent in 50 lines" tutorial leaves out. Every box in the diagram corresponds to an actual incident class we have seen in client systems.
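A skeleton of the diagram's loop in TypeScript; every `declare` statement marks a stand-in that is assumed rather than provided (`callModel` for your LLM client, `executeTool` for the validated, retrying tool wrapper):

```typescript
declare function getToolsForSession(ctx: unknown): unknown[];
declare function callModel(
  messages: unknown[],
  tools: unknown[]
): Promise<{ type: "tool_call"; name: string; args: unknown } | { type: "final"; text: string }>;
declare function executeTool(name: string, args: unknown): Promise<unknown>;
declare function applyOutputGuardrails(text: string): string;

async function runSession(ctx: unknown, userMessage: string, maxToolCalls = 25): Promise<string> {
  const tools = getToolsForSession(ctx);
  const messages: unknown[] = [{ role: "user", content: userMessage }];
  for (let calls = 0; calls < maxToolCalls; calls++) { // tool-call budget cap
    const step = await callModel(messages, tools);
    if (step.type === "final") {
      return applyOutputGuardrails(step.text); // last line of defense
    }
    // Validation, retries, and telemetry live inside executeTool.
    const result = await executeTool(step.name, step.args);
    messages.push({ role: "tool", name: step.name, content: result });
  }
  return "I could not complete this task: tool call budget exceeded."; // explicit terminal state
}
```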
The honest summary
Agents in production are an engineering problem with an LLM-shaped hole in the middle. The model is one component, not the system. The system is everything around it: the auth boundary, the tool surface, the validators, the budget caps, the fallback chains, the telemetry, the guardrails.
When clients ask "what is the secret to making agents work?" the answer is genuinely boring: build the layers. The teams whose agents work are the ones who treat agentic systems as distributed-systems problems with strict invariants — and the teams whose agents do not are the ones still hoping the next model release will save them.
It will not. The next model release will be better at calling the wrong tool more confidently. The layers are the work.