How LLMs became Walmart’s on-call engineer

LLMs aren’t just productivity tools.
May 13, 2026

Key takeaways:

  • Natural language is the new dashboard. Walmart put an LLM between support staff and complex infrastructure, using MCP to help cut triage time by 85%.
  • AI eliminated the grunt work, not the engineers. Routine triage no longer pulls senior engineers away from strategic work.
  • Scale without headcount. Walmart’s real breakthrough wasn’t hiring more support staff – it was making complexity disappear for the people already there.

At Walmart, retail shrink – loss due to theft, fraud, or operational errors – costs billions annually. Our team operates an AI-powered Shrink Avoidance solution deployed across 5,000+ stores, monitoring over 100,000 checkout lanes worldwide.

The scale of this operation is massive: millions of transactions daily, petabytes of video and scan data processed in real time, and critical dependencies on hardware and software working in perfect harmony.

This system is a complex, real-time network of graphics processing unit (GPU)-powered edge servers, thousands of cameras at self-checkouts, over 20 microservices processing scan and video data, and infrastructure spanning BigQuery, Prometheus, Splunk, and internal platforms.

When anything breaks – a misaligned camera, a failing GPU, a network timeout, or a service degradation – our ability to detect potential theft is compromised. Even minor issues can cascade across dozens of stores, affecting customer experience and loss prevention effectiveness.

Scaling support without scaling people

Each month, we handle 200–300 production incidents and 150–200 alerts spread across thousands of lanes and stores. Our first line of defense is a non-technical L2 support team, which until recently followed a detailed 40–50-step manual playbook for triage. On average, each issue took 15 minutes to investigate before escalating to engineering.

The manual playbook approach had significant limitations. Support staff needed to check multiple dashboards, cross-reference disparate data sources, and often lacked the context to distinguish between normal operational variance and genuine incidents.

This led to alert fatigue, inconsistent triage quality, and frequent escalations that pulled senior engineers away from strategic work. As our footprint expanded, we recognized that hiring more support staff wasn’t sustainable.

The manual approach was time-consuming, repetitive, and a drain on engineering resources. We needed a fundamental shift: a system that could handle routine triage without requiring deep technical knowledge or access to engineering tools.

Building an LLM-powered operations assistant

We built a solution using large language models (LLMs) and the Model Context Protocol (MCP). MCP is an open protocol that lets LLMs such as OpenAI’s GPT-4o, Anthropic’s Claude, and others securely connect to external tools and data sources. In our system, it is the layer that exposes our telemetry and infrastructure to the model so it can answer complex operational questions in natural language.

The key insight was recognizing that LLMs excel at synthesizing information from multiple sources and presenting it in an accessible format. Rather than building yet another dashboard or requiring support staff to learn complex query languages, we could leverage LLMs as an intelligent intermediary between our technical infrastructure and non-technical operators.

Our L2 support team can now ask simple questions like:

  • What’s wrong with Store 1182?
  • Why is Lane 6 not detecting scans?
  • Is the camera for self-checkout #8 in Store 3410 working properly?

Behind the scenes, MCP fetches real-time data from BigQuery (scan events), Prometheus (infrastructure metrics), Splunk (application logs), and edge systems (virtual machine and camera health). LLMs interpret the data, apply context, and return a plain-language summary of what’s wrong and often how to fix it.
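
To make the mechanics concrete, here is a minimal sketch of what such a setup can look like, written against the open-source MCP Python SDK (the `mcp` package). The server name, tool names, parameters, and canned return values are illustrative placeholders rather than our production code; a real implementation would run parameterized queries against BigQuery, Prometheus, Splunk, and the edge platforms.

```python
# Minimal illustrative sketch of an MCP server exposing telemetry tools.
# Assumes the open-source MCP Python SDK: pip install mcp
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("shrink-ops-assistant")  # hypothetical server name

@mcp.tool()
def get_lane_scan_summary(store_id: int, lane_id: int, hours: int = 1) -> dict:
    """Summarize recent scan events for one checkout lane.

    A real implementation would run a parameterized BigQuery query;
    this sketch returns a canned shape so it stays self-contained.
    """
    return {
        "store_id": store_id,
        "lane_id": lane_id,
        "window_hours": hours,
        "scan_events": 0,        # zero scans in an active hour is a red flag
        "last_event_ts": None,
    }

@mcp.tool()
def get_edge_health(store_id: int) -> dict:
    """Report GPU, camera, and service health for a store's edge server.

    A real implementation would query Prometheus and the camera platform.
    """
    return {
        "store_id": store_id,
        "gpu_ok": True,
        "cameras_offline": [8],  # hypothetical: self-checkout #8 camera down
        "services_degraded": [],
    }

if __name__ == "__main__":
    # An MCP-aware LLM client connects (stdio by default), reads the tool
    # descriptions, and decides which tools to call for a question like
    # "Why is Lane 6 not detecting scans?"
    mcp.run()
```

The design point that matters is that the model never touches BigQuery or Prometheus directly: it only sees named tools with typed parameters and documented behavior, which is what makes the natural-language layer safe to put in front of non-technical operators.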

Building with AI, not just using it

We used GitHub Copilot and LLMs to generate the code for this system:

  • MCP servers and tool definitions to query and unify disparate telemetry sources.
  • Adapters for BigQuery, Prometheus, Splunk, and edge platforms.
  • API integrations, error handling, and authentication flows.
  • Prompt engineering to convert data into actionable diagnostics (a simplified sketch follows this list).
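
The prompt-engineering item is easy to underestimate. Below is a simplified, hypothetical sketch of how telemetry gathered by the adapters might be framed so the model responds with an operator-friendly diagnosis instead of raw metrics; the template wording and field names are illustrative, not our production prompt.

```python
# Hypothetical sketch of framing telemetry for the LLM (not a production prompt).
import json

DIAGNOSTIC_PROMPT = """\
You are an operations assistant for a retail shrink-detection system.
Given the telemetry below, explain in plain language, for a non-technical
support agent, what is likely wrong and what to do next. If the data looks
like normal operational variance, say so explicitly.

Telemetry (JSON):
{telemetry}

Answer with: 1) likely cause, 2) impact, 3) recommended next step.
"""

def build_diagnostic_prompt(scan_summary: dict, edge_health: dict) -> str:
    """Combine adapter outputs into a single prompt for the model."""
    telemetry = {"scan_summary": scan_summary, "edge_health": edge_health}
    return DIAGNOSTIC_PROMPT.format(telemetry=json.dumps(telemetry, indent=2))
```

The filled prompt is sent alongside the operator’s question, and the model’s plain-language answer is what the L2 agent sees.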

This approach demonstrates a shift from traditional software development to prompting systems into existence. Our engineers focused on designing the architecture and validating outputs rather than writing boilerplate code. Development velocity increased dramatically: what would have taken weeks of traditional coding was accomplished in days.

The iterative nature of working with AI code generation also meant we could experiment with different approaches quickly, refining the system based on real-world usage patterns.

Importantly, this wasn’t about replacing engineering judgment. Our team still made critical architectural decisions, validated security requirements, and ensured proper error handling. AI accelerated the implementation, but human expertise remained essential for ensuring production-readiness.

Measurable impact of MCP

Previously, diagnosing issues required toggling between complex dashboards and specialized knowledge. With MCP:

  • L2 support receives instant, actionable answers without scripting or dashboards.
  • Triage time dropped from 15+ minutes to under two minutes (85% reduction).
  • Engineering teams focus on innovation instead of firefighting.
  • Store uptime improved, and shrink-detection reliability is stronger.

Non-technical support staff are now empowered to work with technical systems effectively. This translates to real-world productivity gains, improved reliability, and measurable return on investment.

What’s next?

MCP currently runs locally on engineers’ machines for testing and iteration. Our next milestones include:

  • Building a user-friendly interface: L2 support will access MCP through a simple UI without needing engineering tools or credentials. The interface will support conversational queries, historical incident lookup, and guided troubleshooting workflows.
  • Deploying to production: we’re moving MCP into a secure, scalable environment with role-based access control, monitoring, and enterprise-grade infrastructure. This includes implementing audit logging for all queries and actions, ensuring compliance with internal security policies, and building redundancy to guarantee high availability.
  • Expanding automation: beyond diagnostics, we’re enabling MCP to execute known fixes automatically – restarting services, realigning camera feeds, or revalidating GPU health – so incidents can resolve with minimal human intervention. We’re starting with low-risk automated remediation actions and gradually expanding scope as we build confidence in the system’s reliability. A sketch of this allowlist-style gating follows below.
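
For the automation milestone, the guardrail that matters most is limiting what the model is allowed to execute on its own. The sketch below shows one plausible shape for that gating: an explicit allowlist of low-risk actions plus an audit-log entry for every request, whether it runs or is refused. The function and action names are hypothetical illustrations, not our deployed code.

```python
# Hypothetical sketch of allowlisted, audited remediation (illustrative only).
import datetime
import logging

logger = logging.getLogger("mcp.remediation.audit")

# Only actions on this allowlist may run without an engineer approving them.
LOW_RISK_ACTIONS = {
    "restart_scan_service",
    "realign_camera_feed",
    "revalidate_gpu_health",
}

def execute_remediation(action: str, store_id: int, requested_by: str) -> str:
    """Run a known fix only if it is on the low-risk allowlist.

    Every request is audit-logged, approved or not.
    """
    timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    if action not in LOW_RISK_ACTIONS:
        logger.warning("REFUSED %s on store %s by %s at %s",
                       action, store_id, requested_by, timestamp)
        return f"'{action}' is not an approved low-risk action; escalating to engineering."

    logger.info("EXECUTED %s on store %s by %s at %s",
                action, store_id, requested_by, timestamp)
    # Here the real system would call the relevant edge or service API.
    return f"'{action}' completed for store {store_id}."
```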

We’re also exploring how this approach can extend beyond incident management into proactive monitoring, capacity planning, and even predictive maintenance. The same natural language interface that helps diagnose current problems could help us anticipate future ones.

Key takeaways for engineering leaders

We asked ourselves: what if LLMs could be more than assistants? What if they could run operations?

The answer is they can. With MCP, LLMs became our on-call engineer, diagnostic expert, and operations copilot. For engineering leaders managing complex systems at scale, this approach offers several insights:

  • Reduce cognitive load: enable non-technical teams to handle technical triage by abstracting complexity through natural language interfaces.
  • Accelerate time-to-resolution: automated diagnostics cut incident response time by over 80% in our case.
  • Build with AI tools: code generation through LLMs allowed our team to build faster and iterate more effectively.
  • Focus engineering effort: free senior engineers from repetitive operational work to focus on strategic initiatives.

LLMs aren’t just productivity tools. When integrated thoughtfully into operational workflows, they can fundamentally transform how teams manage infrastructure at scale.