Consumer adoption remains relatively low as well. Despite the use of words like “revolution,” data shows that most users check in weekly rather than daily, unlike the more frequent use of social networks or platforms such as Google. This suggests that gen AI hasn’t yet become a true consumer habit. People use it occasionally—sometimes intensely—but it hasn’t become a bedrock of daily life. The hype, in other words, is still ahead of reality.
Despite these reports, we believe that gen AI represents a fundamental shift on the same scale as the internet or the smartphone. The internet gave us roughly two decades of innovation and company-building. The smartphone revolution created 15 years of growth fueled by mobile apps. We believe gen AI will drive a similar era of transformation—perhaps a decade or more of new value creation.
Such a pattern is common with new technologies: excessive optimism, followed by disillusionment, then real value creation, a progression often articulated through the Gartner hype cycle. In our view, many of the leading AI proponents are overhyping when they make bold statements that entire swaths of the economy will shortly be replaced by AI. That’s because real, functional AI in established companies is hard work: it takes relatively clean data, process mapping, and deep experimentation—and even then often requires a human in the loop. However, we are seeing real, behind-the-scenes progress, and the examples we see suggest that using multi-agent systems to automate and replace repetitive tasks may lead to a longer, more foundational ramp in productivity.
Our view is that leaders shouldn’t try to guess what is going to happen in 10 years. Instead, they should ask what they can realistically achieve in the next two. Based on the projects we have done since late 2024, agentic AI is proving to be the real game changer (at least in the short term), providing real value to companies. The reality is also that the financial gains per project are good, but none of them are eye-popping. These incremental gains are akin to Lean, a comparison that Microsoft CEO Satya Nadella has also made.
Based on our work on a portfolio of projects that are successfully implementing agentic AI systems, we’ve found that succeeding with this technology requires cutting through the hype, understanding what the technology can actually do, and matching that capability to clear value-creating opportunities. It also requires a practical, experiment-and-learn approach to implementing multi-agent systems.
The Rise of the Agentic Workflow Paradigm
The past few years have seen AI technology mature rapidly through at least three distinct phases:
- Prompting (2022): Early enthusiasm was centered on “power prompts.” In proofs of concept (POCs), prompts seemed to work. But in production, reliability quickly dropped. Business processes typically require 95–99% accuracy. Based on experience over more than 50 cases, we’d guess prompting alone rarely surpassed 70%.
- Retrieval-Augmented Generation (RAG, 2023): RAG improved stability by anchoring gen AI outputs to knowledge bases. Again, POCs looked promising, but production complexity often exposed weaknesses, leading to unacceptably low accuracy.
- Agentic systems (2024 to the present): The most recent advances involve networks of small, specialized agents. Some route questions. Others perform narrowly defined tasks. Still others check and correct outputs. Crucially, falling token costs now make cascading multi-agent systems commercially viable. This layered design significantly improves reliability, as the illustrative arithmetic below suggests.
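To see why layering helps, consider a rough, back-of-the-envelope calculation (the figures are our assumptions for illustration, not measurements): if a task agent is right 90% of the time and a validation agent catches and corrects 80% of its errors, the combined system reaches roughly 98% accuracy.

```python
# Illustrative only: hypothetical accuracy figures, not measured data.
task_accuracy = 0.90        # a single task agent gets 90% of cases right
checker_catch_rate = 0.80   # a validation agent catches (and fixes) 80% of errors

# The errors that survive are those the task agent makes AND the checker misses.
residual_error = (1 - task_accuracy) * (1 - checker_catch_rate)
print(f"Net accuracy: {1 - residual_error:.0%}")  # -> 98%
```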
This fall saw an explosion of agentic commercial initiatives. OpenAI launched partnerships with Stripe and Shopify. Google announced its Agent Payments Protocol, which automates the purchase and transaction process. While companies may be tempted to follow the tech giants’ lead, this may not be where the first wave of sustainable value will emerge. In a recent Bain survey of consumers, 76% reported that they wouldn’t be comfortable using agentic systems for purchases—most cited concerns about security and privacy as the reason for their reticence.
Customer-facing contexts are a bad fit for the current capabilities of AI agents. They’re messy and unpredictable; inputs are unstructured, tone and context shift constantly, and regulators and consumers have little tolerance for hallucinations or errors. Multi-agent systems can reach high levels of accuracy, but doing so requires treating each individual agent rather like a toddler. You wouldn’t ask a toddler to set the dinner table. But if you break up the task and guide them step by step—“first, put one plate down,” “now, add the forks,” “next, the glasses”—the toddler can make a meaningful contribution. Crucially, the environment must also be controlled: no noisy siblings, no distracting pets, and only one parent giving instructions. By developing multi-agent systems structured the way you would instruct a toddler—breaking the task up, feeding tasks one by one, checking each for accuracy—we are building remarkably accurate systems, as the minimal sketch below illustrates.
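In code, the “toddler” pattern looks roughly like this. The `call_llm` and `check` functions are hypothetical stand-ins for real model calls and validators; the point is the shape of the loop: one narrow instruction at a time, each verified before the next.

```python
# A minimal sketch of the "toddler" pattern; call_llm() and check() are
# hypothetical stand-ins for a real model call and a real validation agent.
def call_llm(instruction: str) -> str:
    return f"done: {instruction}"  # placeholder model call

def check(result: str) -> bool:
    return result.startswith("done")  # placeholder validation

steps = [
    "Put one plate at each seat.",
    "Add a fork to the left of each plate.",
    "Place a glass above each plate.",
]

for step in steps:
    result = call_llm(step)                  # one narrow instruction per call
    if not check(result):                    # verify before moving on
        result = call_llm(f"Redo carefully: {step}")  # one controlled retry
```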
Notably, these systems are often for back-end processes, where perfection isn’t essential because you have a human in the loop. By contrast, while front-end experiments can be inspiring, they are unlikely to be the first area of real enterprise value creation. Back-end and operational processes are fertile ground because they are structured and repetitive—much better suited for agentic workflow automation. Tightly scoped tasks, well-defined environments, and structured inputs can yield projects that make a meaningful contribution.
Building Agentic Systems at the Enterprise Level
At enterprise scale, designing these systems is conceptually simple but operationally demanding. As a general framework for building multi-agent systems: 1) a task is sent to a router agent (built, for example, with Google’s Agent Development Kit) that, like a parent instructing a toddler, divides the task into subtasks; 2) the subtasks are then completed by individual task agents, each handling a smaller part of the task, like the parent telling one toddler to set the glasses on the table and another to set the forks; after which 3) the results of these subtasks are checked by a validation agent; and 4) if an error is discovered, an improvement agent recommends an adjustment.
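To make the four roles concrete, here is a minimal, framework-agnostic sketch. The function bodies are hypothetical stand-ins for LLM-backed agents, not the implementation we used:

```python
# Hypothetical sketch of the router -> task -> validation -> improvement loop.
def route(task: str) -> list[str]:               # 1) router agent splits the task
    return [f"{task} / subtask {i}" for i in (1, 2)]

def run(subtask: str) -> str:                    # 2) task agent does one small piece
    return f"result of {subtask}"

def validate(subtask: str, result: str) -> bool: # 3) validation agent checks output
    return result.startswith("result")

def improve(subtask: str) -> str:                # 4) improvement agent adjusts the ask
    return f"{subtask} (retry with adjusted instructions)"

def handle(task: str) -> list[str]:
    results = []
    for sub in route(task):
        result = run(sub)
        if not validate(sub, result):
            result = run(improve(sub))           # one corrective pass, then move on
        results.append(result)
    return results

print(handle("set the table"))
```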
A fast-growing ecosystem of tools, methodologies, and services supports this approach, and these are excellent for non-core processes. But when you move into core operations—where data integrity and control over hallucinations are essential—you need custom-coded agents, deeper integration with enterprise systems, and stronger controls and guardrails.
A Case Example: Reinventing Field Operations
To illustrate, consider a project we did with a major European internet provider. Our goal was to cut both the time and the cost of resolving service calls. Most people have experienced calling a help desk about a broken connection, repeating information multiple times, and eventually waiting for a technician. What happens (or fails to happen) behind the scenes is revealing: technicians often arrive with incomplete context, forced to troubleshoot from scratch. This leads to long downtime—sometimes over a month—and thousands of wasted operator hours.
We decided to start small. We focused on building a system centered around helping technicians do their tasks faster and better—a helper in the process, not a standalone agent. As part of this effort, we integrated data from more than 15 information systems, providing the technicians with a brief of the reported failures and a history of which solutions had been tried. This gave them an overview of the task—say, troubleshooting a customer’s connection—that they could read or listen to en route to the job. That way they could start working to fix it as soon as they arrived, saving time that was often wasted getting up to speed on the problem.
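The integration pattern looks roughly like the sketch below. The system names, fields, and functions are illustrative stand-ins, not the client’s actual integrations:

```python
# Illustrative sketch: pull ticket context from several (hypothetical) back-end
# systems and condense it into one pre-visit brief for the technician.
def fetch_fault_reports(ticket_id: str) -> list[str]:
    return ["intermittent outages reported since Tuesday"]  # stand-in for system 1

def fetch_fix_history(ticket_id: str) -> list[str]:
    return ["line reset attempted", "modem swapped"]        # stand-in for system 2

def summarize(context: str) -> str:
    return f"Brief: {context}"  # stand-in for an LLM summarization call

def build_brief(ticket_id: str) -> str:
    context = " | ".join(fetch_fault_reports(ticket_id) + fetch_fix_history(ticket_id))
    return summarize(context)   # readable, or text-to-speech, en route to the job
```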
Next, we created a feature that generated recommendations for the next best action for resolution. Another feature included a conversational interface that allowed the technician to query the underlying IT systems of the internet company in natural language to find the root causes. Finally, we automated many simple and repetitive actions: e.g., correcting CRM records when the wrong household was linked or triggering network resets when a switch malfunctioned in the central neighborhood connector box. This saved the technician an immense amount of time, as they no longer had to call the internal call center for help with the small changes needed to make the fix.
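One way to picture those automated actions is as a small registry of well-scoped tools the assistant can trigger on the technician’s behalf. The tool names and signatures below are hypothetical:

```python
# Hypothetical tool registry for repetitive fixes; in production each call
# would sit behind validation, permissions, and audit logging.
def correct_crm_household(ticket_id: str, household_id: str) -> str:
    return f"CRM record for {ticket_id} relinked to household {household_id}"

def reset_network_switch(switch_id: str) -> str:
    return f"reset issued for switch {switch_id}"

TOOLS = {
    "correct_crm_household": correct_crm_household,
    "reset_network_switch": reset_network_switch,
}

def dispatch(tool_name: str, **kwargs) -> str:
    return TOOLS[tool_name](**kwargs)

print(dispatch("reset_network_switch", switch_id="NB-0042"))
```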
Over eight months, we worked iteratively—mapping processes, fixing pain points, and adding functionality step by step, with weekly feedback from field technicians testing the tablet-based solution.
The results:
- 60% reduction in resolution time
- More than a million euros in recurring annual savings
- Significant improvement in customer net promoter score
Based on these results, the client wanted to scale the solution to seven more regions. This required a good deal more work: The methodology and some agent components were reusable, but every region had different IT systems, and each rollout required new integrations and data mappings. Each additional region took roughly half the effort of the original build.
The Challenges of Implementing Multi-Agent Systems
As illustrated above, multi-agent system implementations can successfully create real value, but too few people talk about the real work of implementing them. What were the realities and roadblocks we faced?
Testing fast vs. scaling up.
Did we build it in a nice, scalable architecture from the beginning? As much as we might like to claim so, it would have been impossible. Just as innovators iteratively discover product-market fit, the multi-agent system’s use cases and solution evolved iteratively as we engaged in a cycle of rapid experimentation. At the same time, the technology, methodologies, and services for building these systems were evolving rapidly.
We did not start with the full system. Instead, we started out with one LLM plus RAG as a central piece to solve the first use case. As we tested, we learned that we needed to break the system into smaller agents pursuing more specialized tasks to increase reliability; slowly, this evolved into a fully agentic system. Eventually, we developed a highly reliable, functioning system that delivers value. With this knowledge and the results we have now, we are rebuilding it into an architecture that is much more robust for scaling to other parts of the company and can be more easily maintained.
Problem zones vs. root causes.
As a rule of thumb, we found that leaders and middle managers know roughly which processes take a lot of time or effort, but they have ill-informed opinions on where the complexity and opportunity lie. Only the operators know this. The implication is that you need to do two things before you can really start to build: 1) spend sufficient time understanding the issues from a manager’s perspective, but also 2) talk with operators about what they understand to be the root cause of any given problem.
For example, managers would point us to parts of the process where time or resources were wasted (e.g., in the shared service center) and tell us to look into finding the right “knowledge items” for the operator to more quickly resolve an issue. When we started working directly with the operators, however, we found that half of the operators found the “knowledge items” within 10 seconds, while the other half took minutes to find the same information because they weren’t good at searching the system. This is not a problem that agentic AI can solve well—it’s a training issue. But we also discovered that leaders and managers had completely overlooked that the operators spend roughly 50% of their time filling out the CRM after customer calls. This is a good problem for an agent, which can transcribe the call and put all the information in the right fields, vastly speeding up the process and improving data quality. The operator only has to check and press OK.
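The shape of that agent is simple to sketch. The field names and functions below are hypothetical stand-ins for real speech-to-text and LLM extraction calls:

```python
# Illustrative sketch: draft CRM fields from a call transcript, with the
# operator reviewing before anything is written back to the CRM.
def transcribe(call_audio: bytes) -> str:
    return "Customer reports intermittent outages since Tuesday."  # ASR stand-in

def extract_fields(transcript: str) -> dict:
    # Stand-in for an LLM extraction call constrained to a fixed schema.
    return {"issue_type": "intermittent outage", "since": "Tuesday"}

def draft_crm_entry(call_audio: bytes) -> dict:
    fields = extract_fields(transcribe(call_audio))
    fields["status"] = "draft"  # nothing is saved until the operator presses OK
    return fields
```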
IT systems don’t slow you down—people do.
The most effort-intensive and complex parts of our work were getting into the right management discussions, winning over stakeholders, and identifying and resolving the dependencies our work created. Integrating with a dozen IT systems to make the solution work is complex from a technology perspective, but the real challenge is that all of these systems have their own development teams with different timelines, priorities, and roadmaps. Making API endpoints available and testing them might be two weeks’ worth of work. Getting onto the respective roadmaps of these systems took much, much longer. Most teams we were coordinating with deprioritized this work for several months, claiming (probably with validity from their viewpoint) that other work took priority.
Models can and will hallucinate.
Agents are still quite unstable and can hallucinate, necessitating strong guardrails and checks in the form of LLM-as-a-judge (the validation agent). The system prompt of the agent has to be strong enough, and yet light enough, for the agent to execute tasks properly. This requires nuance, time, and data science and data engineering skills to make these agentic systems work reliably enough to let them run in your business. In other words, good developers and business experience are still very relevant.
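A minimal sketch of the LLM-as-a-judge guardrail, assuming a generic `call_llm` stand-in rather than any particular vendor’s API:

```python
# Minimal LLM-as-a-judge sketch: a second model grades the first model's
# answer against explicit criteria before it touches the business process.
def call_llm(prompt: str) -> str:
    return "APPROVE"  # placeholder for a real model call

def judge(task: str, answer: str) -> bool:
    verdict = call_llm(
        "You are a strict validator. Reply APPROVE or REJECT.\n"
        f"Task: {task}\nAnswer: {answer}"
    )
    return verdict.strip().upper().startswith("APPROVE")

def guarded(task: str, answer: str) -> str:
    # Rejected answers go to a human instead of flowing through automatically.
    return answer if judge(task, answer) else "ESCALATE_TO_HUMAN"
```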
The New Discipline of Agentic Transformation
What more general lessons can we extract from the case? In many ways, it feels like rediscovering Lean—reengineering work from the ground up. The difference is that today’s toolset is vastly more powerful, enabling not just incremental optimization but full process redesign, even across departments.
Beyond data and AI capabilities, success depends on deep process literacy—understanding the current state, envisioning the future state, and translating that into small, buildable steps. In that sense, we’re seeing the return of “Lean Black Belts,” but this time, they’re powered by gen AI.
The work is detailed and methodical, not glamorous. You must go step by step. The approach scales in method, not in magic. Each new business area requires fresh analysis and custom adaptation.
Fully autonomous agents remain far off. For now, the most effective setup keeps a human in the loop—making the operator smarter, faster, and better equipped.
Progress is incremental at first. Only when core systems connect and information flows smoothly do major efficiency gains appear.
Technology also evolves faster than projects. The tools we used eight months ago are already outdated. That’s why we focus on use cases with payback within a year—before the underlying technology shifts.
More importantly, companies must build internal capability—data engineers, data scientists, gen AI UX designers, and what some now call context engineers or gen AI black belts: people who deeply understand processes and can break transformation into achievable steps. Building these capabilities will allow you to build new agentic workflows faster as the tech evolves, which can be a real differentiator against the competition.
Finally, while these initiatives will eventually integrate into normal business operations, it’s essential to start them with strong governance—one that unites technical and business perspectives. That balance is what turns experimentation into transformation.
The Decade Ahead
Gen AI’s hype may have outpaced adoption, but its potential is real. Like the internet and smartphone revolutions before it, this platform shift will reshape industries—not through overnight disruption, but through years of disciplined reinvention.
The organizations that win won’t just adopt tools; they’ll build the capability to continuously reinvent themselves through them.
