Key takeaways:
- AI agents don’t fail like traditional software; they fail unpredictably.
- MLOps won’t cut it: AgentOps is a different discipline.
- Define AI-writable boundaries before an agent defines them for you.
The next production outage might not come from a code bug or an infrastructure failure. It could come from an AI agent making a decision we never anticipated, in a workflow we didn’t realize was AI-writable.
The question isn’t whether AI agents will touch our production systems. It’s whether our architecture is ready when they arrive. By AI agents, I am referring to autonomous or semi-autonomous software systems powered by large language models (LLMs) that can independently take actions within production infrastructure.
AI is moving from the background to the forefront of software engineering. Where problems were once manually troubleshot and verified by humans, we now see AI agents taking on troubleshooting and resolution tasks independently. That shift brings real power, and real risk.
Identifying the gaps in how these agents operate is essential if we want to harness that power while maintaining customer trust. Monitoring and verifying agent outputs, and setting up guardrails to prevent repeated failures, will be critical. Without that, teams won’t just miss issues. They risk actively misrepresenting system state and eroding customer trust.
The root cause of such failures is often not a surface-level model error or a prompt injection attack. It can be something far harder to detect, like a hallucination occurring deep in the agent’s processing pipeline during concurrent operations.
To stay at the forefront of AI adoption, engineering leaders need to treat AI not as an assistant but as an equal partner in their system designs. That means rethinking architecture from the ground up, not just improving test coverage or letting models run unchecked.
What makes AgentOps different from MLOps?
Many engineering leaders assume Machine Learning Operations (MLOps) practices extend naturally to agents. They don’t.
MLOps focuses on deploying and monitoring machine learning models – systems that provide predictions within defined boundaries. Agent operations (AgentOps) manages autonomous actors that make decisions, invoke tools, and follow dynamic execution paths in real-time environments.
The fundamental difference is agency itself. A machine learning model waits for input and returns a prediction. An agent observes its environment, plans actions, executes tools, and adapts based on outcomes, often without waiting for human approval.
This has real architectural consequences. Unlike stateless models, agents maintain conversational memory and context across interactions. They remember previous decisions, learn from outcomes, and adjust behavior accordingly. Managing that persistent, evolving state at scale is a first-order systems design problem, especially when coordinating multiple agents that need to share context and maintain consistency.
Machine learning models follow deterministic inference paths. Agents construct their execution trajectory on the fly, choosing which tools to invoke, in what sequence, based on runtime conditions. We can’t predict the exact path an agent will take any more than we can predict how a person will approach a novel problem.
A model has a relatively predictable compute and cost profile. An agent might take three tool invocations to answer one query and 30 for a superficially similar one. That unpredictability makes capacity planning and cost control substantially more difficult.
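One practical response to that unpredictability is to put a hard ceiling on what any single request may consume. The sketch below is a minimal, illustrative agent loop with a per-request tool-call budget; the names (`Budget`, `plan_next_step`, `invoke_tool`) are our own placeholders, not any particular framework's API.

```python
import dataclasses


@dataclasses.dataclass
class Budget:
    """Hypothetical per-request ceiling on agent resource use."""
    max_tool_calls: int = 10


class BudgetExceeded(Exception):
    pass


def run_agent(plan_next_step, invoke_tool, budget: Budget) -> int:
    """Drive a stubbed agent loop, aborting once the budget is spent.

    `plan_next_step` returns the next tool step, or None when the agent
    decides it is done; `invoke_tool` executes it. Both stand in for
    whatever planning and tool layer your stack actually provides.
    """
    calls = 0
    while True:
        step = plan_next_step()
        if step is None:  # agent decided it is done
            return calls
        if calls >= budget.max_tool_calls:
            raise BudgetExceeded(f"hit cap of {budget.max_tool_calls} tool calls")
        invoke_tool(step)
        calls += 1
```

A cap like this does not make agent cost predictable, but it converts an open-ended cost distribution into a bounded one, which is what capacity planning actually needs.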
For engineering leaders making investment decisions, this gap is consequential. MLOps infrastructure focuses on model versioning and deployment pipelines. AgentOps requires additional layers for tool orchestration, state management, behavioral evaluation, and dynamic execution monitoring. Organizations that recognize this early invest appropriately. Those that assume MLOps extends naturally find out only when production incidents force rapid architectural retrofitting.
Reducing blast radius through architectural isolation
One practical approach to managing agent risk is implementing what I call a “testing pods” approach: small, isolated, and iterative environments that mirror production topology but operate at reduced scale with comprehensive instrumentation. They offer a reliable way to estimate the cost and time implications of using AI agents before those agents encounter real workloads.
The principle is straightforward: create environments where agents can exhibit production-grade behavior without production-scale consequences. Each testing pod runs against a subset of actual issues, but with several important design characteristics. Testing pods maintain their own databases, completely isolated from production state. Anything that goes wrong cannot cascade into the systems that leadership relies on for decision-making.
This isolation lets teams observe how agents behave under realistic conditions, surface failure modes before they cause real damage, and tune guardrails based on evidence rather than assumption. The cost of running these environments is real, but it is far lower than the cost of an agent-driven incident in production.
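The core of the pattern is that pod state and production state never share a connection string. A minimal sketch, using a throwaway SQLite file to stand in for the pod's own database (a real pod would also mirror services, queues, and instrumentation):

```python
import os
import sqlite3
import tempfile


def make_testing_pod() -> sqlite3.Connection:
    """Create an isolated scratch database for a testing pod.

    The pod gets its own database file in its own temp directory, so an
    agent misbehaving inside the pod cannot touch real data. Illustrative
    only; schema and seed data here are invented for the example.
    """
    path = os.path.join(tempfile.mkdtemp(prefix="agent-pod-"), "pod.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, body TEXT)")
    # Seed with a small subset of sampled, anonymized production issues.
    conn.executemany(
        "INSERT INTO issues (body) VALUES (?)",
        [("login page returns 500",), ("export job unusually slow",)],
    )
    conn.commit()
    return conn


pod = make_testing_pod()
issue_count = pod.execute("SELECT COUNT(*) FROM issues").fetchone()[0]
```

The design choice worth copying is the default: the agent under test is handed the pod connection and nothing else, so "can it reach production?" is answered by construction rather than by policy.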
The boundary between AI-writable and human-controlled systems
One of the most consequential architectural decisions we will make is defining which parts of our system should be AI-writable versus strictly human-controlled. Get this boundary wrong, and we are either constraining the agent’s usefulness or creating the conditions for catastrophic failure.
In my experience, the boundary conversation is the one engineering leaders avoid the longest, because it forces a real argument about who owns what.
The systems best suited to agent writes are those where errors are visible, contained, and reversible. Some domains should remain strictly off-limits to autonomous agent writes. Even with high accuracy, agents should not have direct write access to systems that update financial records, or to healthcare and other sensitive data stores. These decisions should route through human-in-the-loop approval workflows.
Similarly, agents should never modify access controls, user permissions, or security policies. These are foundational trust boundaries that require human judgment about intent and context. Infrastructure settings, deployment configurations, and system parameters should be immutable from the agent’s perspective. Agents can recommend changes, but implementation should remain a human-controlled process through standard change management workflows.
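Making the boundary explicit can be as simple as a single gate that every agent write must pass through. The following is a sketch under assumed names (the resource classes and `gate_write` function are ours, not from any framework); the one property worth preserving is the default-deny at the end.

```python
from enum import Enum, auto


class Decision(Enum):
    ALLOW = auto()
    REQUIRE_HUMAN = auto()
    DENY = auto()


# Hypothetical resource classifications for illustration.
AI_WRITABLE = {"ticket_labels", "draft_replies"}
HUMAN_APPROVAL = {"financial_records", "patient_data"}
OFF_LIMITS = {"access_controls", "deploy_config", "security_policy"}


def gate_write(resource: str) -> Decision:
    """Explicit boundary check applied to every agent-initiated write."""
    if resource in OFF_LIMITS:
        return Decision.DENY
    if resource in HUMAN_APPROVAL:
        return Decision.REQUIRE_HUMAN
    if resource in AI_WRITABLE:
        return Decision.ALLOW
    # Default-deny: anything not explicitly classified is not AI-writable.
    return Decision.DENY
```

Routing every write through one such function also gives you a single place to audit, log, and later tighten the boundary, instead of policy scattered across tool implementations.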
It is worth making this boundary an explicit architectural decision rather than an implicit assumption that different teams interpret differently. That clarity prevents the gradual boundary erosion that leads to agents having access to systems they should never touch.
Making agent reasoning visible through observability
The most sophisticated architectural patterns are useless if we cannot understand why an agent made specific decisions. Observability infrastructure for agents differs significantly from traditional application monitoring. We are not just tracking inputs, outputs, and error rates. We are making autonomous reasoning processes visible and debuggable.
That means capturing not just what the agent did, but why. When an agent categorizes an issue as low priority, the logs should show what features it extracted from the ticket text, which previous examples it considered similar, and what reasoning led to the final classification.
When something goes wrong, we need to reconstruct its thought process, not just its actions. This requires building observability from the ground up, not bolting it on after deployment. Every tool invocation, every state transition, and every reasoning step needs instrumentation. That creates overhead, but it is the price of making autonomous systems debuggable and trustworthy.
In practice, agent architecture needs three distinct logging layers.
- Execution logs cover what happened.
- Reasoning logs explain why it happened.
- Performance logs measure how efficiently it happened.
Traditional application logging handles the first. Agent-specific infrastructure must handle the other two. Skip any of these layers and, when something breaks, we won’t have the full picture of what went wrong, or enough information to stop the same AI-induced error from happening again.
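The three layers can share one record shape and differ only in what they carry. A minimal sketch, with plain lists standing in for real log sinks and all field names invented for the example:

```python
import json
import time


def log_step(layer: str, agent_id: str, payload: dict, sink: list) -> None:
    """Emit one structured record to the given layer's sink.

    `layer` is one of "execution", "reasoning", or "performance".
    Sinks here are plain lists; in production they would be whatever
    your logging pipeline provides.
    """
    record = {"ts": time.time(), "layer": layer, "agent": agent_id, **payload}
    sink.append(json.dumps(record))


execution, reasoning, performance = [], [], []

# One tool invocation produces one record in each layer:
log_step("execution", "triage-bot",
         {"tool": "search_tickets", "args": {"q": "login 500"}}, execution)
log_step("reasoning", "triage-bot",
         {"why": "ticket text resembles prior auth outages"}, reasoning)
log_step("performance", "triage-bot",
         {"latency_ms": 180, "tokens": 412}, performance)
```

Keeping the envelope identical across layers means a single query can later join "what happened" to "why" and "at what cost" for any step of any agent run.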
What this means for engineering leaders
The strategic question is not whether AI agents will integrate into our systems. They will. The question is whether our architecture will be ready when they arrive, or whether we will be retrofitting safety mechanisms into production systems while incidents pile up.
For organizations just starting agent deployments, the foundation matters more than sophisticated features. Start by capturing real user interactions with systems. These become test cases for agent behavior. Even a few hundred representative examples give a baseline for measuring whether changes improve or degrade quality. Without that foundation, every deployment is a guess.
Before any agent version reaches production, it should demonstrate acceptable performance against an evaluation dataset. That single architectural decision prevents most catastrophic regressions. It is the difference between discovering problems in staging and discovering them when executives are making decisions based on corrupted dashboards.
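A promotion gate of this kind fits in a few lines. The sketch below assumes the agent is a callable, the evaluation set is a list of (query, expected) pairs, and exact-match accuracy is the metric; all names and the threshold are illustrative, and real gates usually use richer scoring.

```python
def passes_gate(agent_fn, eval_set, threshold: float = 0.9) -> bool:
    """Return True only if the candidate agent clears `threshold`
    accuracy on a fixed evaluation set. Illustrative exact-match scoring."""
    correct = sum(1 for query, expected in eval_set if agent_fn(query) == expected)
    return correct / len(eval_set) >= threshold


# A toy evaluation set of (ticket text, expected category) pairs.
eval_set = [
    ("password reset loop", "auth"),
    ("invoice PDF is blank", "billing"),
    ("dashboard shows 404", "frontend"),
    ("login page 500", "auth"),
]


def candidate_agent(query: str) -> str:
    # Stand-in for the agent version under evaluation.
    return "auth" if "login" in query or "password" in query else "billing"


promote = passes_gate(candidate_agent, eval_set, threshold=0.75)
```

The point is not the scoring function but where the check sits: in the deployment path, so a regression fails the gate in CI rather than surfacing in an executive dashboard.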
For organizations scaling existing deployments, the priority shifts toward modularizing, encapsulating, and reusing workflows wherever possible. Production insights should flow automatically back into development. Queries that confuse the agent become evaluation test cases. Errors that triggered alerts become scenarios in our test suite. This approach transforms static deployments into continuously improving systems that get better with each interaction.
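That feedback loop can be made concrete with a small harvesting step. A sketch under assumed shapes: incidents pair the query that confused the agent with the answer a human eventually gave, and the evaluation set is the same (query, expected) list the promotion gate consumes.

```python
def harvest_failures(incident_log, eval_set):
    """Fold production failures back into the evaluation set.

    `incident_log` is a list of (query, human_answer) pairs; duplicates
    of queries already in the eval set are skipped. Names and shapes are
    illustrative, not from any particular tool.
    """
    existing = {query for query, _ in eval_set}
    for query, human_answer in incident_log:
        if query not in existing:
            eval_set.append((query, human_answer))
            existing.add(query)
    return eval_set
```

Run on a schedule, a step like this is what turns a static evaluation set into one that tracks the failure modes your agents actually exhibit.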
The testing pod pattern scales with you. As we deploy more agents across more workflows, we need parallel testing environments to validate behavioral changes before they reach production. That infrastructure is expensive to build, but it pays for itself the first time it catches a regression that would have affected business operations.
The core principle: design for safe autonomy
Resilience comes from designing architectural structures that can safely handle autonomy. As Kelsey Hightower, a well-known Kubernetes contributor and former Google Cloud engineer, has noted in discussions on AI and infrastructure, the fundamentals of reliable systems engineering have not changed. Underneath the autonomous behavior, agents are still software that manipulates data, and the discipline required to make that software dependable remains the same.
In my experience, most agent projects fall short not because the technology isn’t ready, but because the surrounding infrastructure isn’t prepared for autonomous systems. Once we close that gap between prototype and production-ready infrastructure, AI agents stop being clever demos and start becoming trusted components that compound in value and efficiency for the overall software system over time.
That transition is the real work of engineering leadership in the age of AI agents.