In March 2025, a research organization called METR published a finding that got less attention than it deserved. They had been measuring something unfashionable: how long AI systems could work on tasks before they broke down. Not what they could do in a single interaction. METR wanted to know how long these systems could sustain coherent, useful effort.
Their methodology was rigorous: 170 tasks across software, research, and problem-solving. Human baselines from 236 runs by domain experts. Statistical modeling to predict success probability based on task duration.
The headline finding: task-completion capabilities had been doubling every seven months for six years straight.
In mid-2024, leading AI systems could reliably complete tasks that would take a human expert about nineteen minutes. By early 2025, that had stretched to nearly an hour. The newest systems now reach over two hours. The curve is accelerating.
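To see what that doubling implies, here is a minimal back-of-the-envelope sketch in Python. It assumes a clean seven-month doubling from the nineteen-minute mid-2024 baseline reported above; the fact that recent systems sit above this naive projection is what "accelerating" means here.

```python
# Naive projection of METR's long-run trend.
# Assumptions: a 19-minute reliable-task horizon in mid-2024,
# doubling every 7 months. Recent systems sit above this line.

BASELINE_MINUTES = 19
DOUBLING_MONTHS = 7

def projected_horizon(months_after_baseline: float) -> float:
    """Task horizon in minutes under steady exponential doubling."""
    return BASELINE_MINUTES * 2 ** (months_after_baseline / DOUBLING_MONTHS)

for months in (0, 7, 14, 21):
    print(f"+{months:2d} months: ~{projected_horizon(months):.0f} minutes")
# +0: ~19 min, +7: ~38 min, +14: ~76 min, +21: ~152 min
```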
But beneath that headline lies a more complicated story.
The Gap Between Demos and Delivery
In August 2025, METR published a follow-up that should have made everyone pause. They tested AI on eighteen real tasks—the kind of work that actually matters in organizations.
Two results stand out:
- Automated test pass rate: 38%.
- Ready to use without human cleanup: 0%.
Zero. Not a single piece of work produced by the AI was ready to use without human intervention. Even when automated checks passed, every output had gaps: missing documentation, incomplete verification, quality issues that would never survive review in a real organization.
Average cleanup time: 26 minutes—roughly one-third of the original task duration.
This gap between automated benchmarks and real-world usability is the first problem nobody talks about. AI systems are getting better at the narrow things benchmarks measure. They are improving much more slowly at the full scope of what “completing a task” actually means in practice.
The distinction matters for anyone planning investments. A system that passes a test but produces unusable output is not a system that can work autonomously. It is a system that requires someone to clean up after it.
The Productivity Paradox
Here is the second problem. In July 2025, METR ran a controlled study with sixteen experienced professionals. The question: does AI actually make people faster?
Expected result before the study: 24% speedup.
Actual result: 19% slowdown.
People using AI took longer to complete tasks than people working without assistance. And here is the disturbing part: after the study, participants still believed they had gotten a 20% speedup. Their perception was completely inverted from reality.
Five factors explained the slowdown:
- Debugging overhead. Time saved generating work was lost correcting mistakes.
- Context switching costs. Moving between the AI’s suggestions and the actual work created cognitive load.
- Learning curve. Even experienced professionals spent time figuring out how to use AI effectively.
- Hidden quality requirements. AI output required documentation, verification, and formatting that the systems did not provide.
- Real-world standards. Mature organizations have quality bars that AI-generated work systematically failed to meet.
This is not a story about AI being useless. It is a story about the gap between vendor demos and organizational reality. The productivity paradox suggests that for experienced professionals on mature processes, current AI assistance may create more work than it saves.
Why AI Systems Degrade Over Time
Why do AI systems struggle with longer tasks? The answer is architectural, but the implications are strategic.
Think of AI as having a working memory—a limited capacity for holding information about the current task. As that capacity fills up, performance degrades. The system loses track of earlier decisions. It contradicts itself. It forgets what it was trying to accomplish.
Anyone who has used an AI assistant for more than thirty minutes knows this firsthand. The system starts sharp. An hour in, it has forgotten decisions made twenty minutes ago. It introduces errors it would have caught earlier. It loses the thread.
Research has quantified this. As the amount of information AI needs to track grows, accuracy can drop by 20-30 percentage points. On complex reasoning tasks, one study showed accuracy falling from 82% to 22% as task complexity increased.
The mathematics is unforgiving. Small errors compound. If there is even a 1% chance of losing a critical detail each time the system processes new information, after 100 interactions the chance of retaining that detail drops to 37%.
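The arithmetic checks out. A two-line sketch, assuming each processing step carries an independent 1% chance of dropping the detail:

```python
# Chance of still holding a critical detail after many processing steps,
# assuming an independent 1% chance of dropping it at each step.
p_keep_per_step = 0.99
steps = 100
print(f"{p_keep_per_step ** steps:.1%}")  # ~36.6%, i.e. roughly 37%
```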
This is not a bug in any particular product. It is an inherent limitation of how current AI systems work. And it creates a hard ceiling on what they can own versus what they can only assist with.
Not All Tasks Are Equal
METR’s research revealed another complication: AI capability varies dramatically by domain.
Analytical and structured tasks like data analysis and report generation show strong capability—systems can handle work that would take humans one to three hours.
Tasks requiring interaction with external systems—browsing, coordinating across platforms, handling visual information—show capability 40-100x lower. They are roughly two years behind in maturity.
Physical-world applications like autonomous vehicles improve much more slowly than digital tasks.
The implication: “AI can work for hours” is domain-specific. A system that handles a two-hour analytical task may struggle with a twenty-minute coordination task. The capability curve is not uniform.
For organizations, this means careful matching of AI deployment to task characteristics. The hype cycle treats AI as a general-purpose solution. The reality is highly specific.
What the Trend Actually Means
METR described their findings as potentially “one of the most important trends in human history.” That may be overselling it. But the trend has held for six years, and the direction is clear.
Current AI achieves near-perfect success on tasks taking humans less than four minutes. It achieves less than 10% success on tasks taking more than four hours. The interesting zone—and the zone that matters for organizational decisions—is somewhere in between.
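METR condenses that in-between zone into a single statistic: the task duration at which a model's predicted success rate crosses 50%, estimated from a statistical fit over task length. Here is a hedged sketch of what such a model looks like; the logistic form mirrors the general approach, but the horizon and steepness values below are invented for illustration and are not METR's fitted parameters.

```python
import math

# Toy logistic model of success probability versus task duration.
# Shape only: success falls off with the log of task length.
# HORIZON_MINUTES and STEEPNESS are invented, not METR's fitted values.

HORIZON_MINUTES = 60   # duration at which predicted success is 50% (assumed)
STEEPNESS = 1.7        # how fast success drops around that horizon (assumed)

def p_success(task_minutes: float) -> float:
    """Predicted probability of completing a task of the given length."""
    x = STEEPNESS * (math.log(task_minutes) - math.log(HORIZON_MINUTES))
    return 1.0 / (1.0 + math.exp(x))

for minutes in (4, 60, 240):
    print(f"{minutes:4d}-minute task: {p_success(minutes):.0%} predicted success")
```

With these invented parameters, the curve reproduces the shape described above: near-certain success at four minutes, roughly even odds around an hour, and under ten percent at four hours.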
Here is my honest assessment: we have AI that can sustain effort for one to two hours on well-defined tasks in specific domains. These systems cannot do this reliably. They cannot do this without human oversight. They cannot do this in ways that meet real organizational quality standards without cleanup.
But the trajectory suggests these limitations may not be permanent. Every major AI company optimizes for sustained operation. Every serious adopter builds workarounds for current limitations. Every enterprise platform adds infrastructure for longer-running AI work.
The question is not whether AI will eventually work for hours. The question is when—and whether the current generation of solutions will get there, or whether we are still waiting for a breakthrough that has not happened yet.
In Part 2, we will examine the three approaches that have emerged for extending AI working duration: fresh-start cycling, selective memory, and team-based coordination.
References
Research Papers
- Building Production-Ready AI with Scalable Long-Term Memory — arxiv.org/abs/2504.19413
- Multi-Graph Based Memory Architecture for AI — arxiv.org/abs/2601.03236
- Measuring AI in Production — arxiv.org/abs/2512.04123
Industry Reports & Whitepapers
- Failure Modes in AI Systems — Microsoft
- Lessons from 2025 on AI and Trust — Google Cloud
- State of AI Engineering — LangChain
- Benchmark vs. Real-World Evaluation — METR
Technical Documentation
- How We Built Our Multi-Agent Research System — Anthropic
- Model Context Protocol Specification — modelcontextprotocol.io
- Fresh-Start Cycling Documentation (“Ralph Wiggum”) — Geoffrey Huntley (ghuntley.com/ralph/)
