Part 2 | From memory to navigation: Scaling autonomous agents beyond retrieval

Introduction

In an earlier post, I examined how eight independent research teams arrived at the same insight: instead of building memory systems around the model, you train the model itself to handle memory as a learned skill. Post-memory training, the use of reinforcement learning in the post-training phase, produces agents that decide what to store, delete, consolidate, and retrieve, all optimized for task success.

But what has happened since is more interesting. The research has collided with two other developments. Recursive Language Models reframe context management as navigation rather than retrieval. Production systems like OpenClaw are proving these ideas work outside benchmarks. The combined picture points to something bigger than improved memory.

It points to a more accessible path for scaling autonomous agents, one that does not require fine-tuning, scarce compute, or deep ML expertise. Here is how the navigation piece and the production evidence fit together.

Context as navigation: the RLM paradigm shift

Here is what caught my attention more recently. While the post-memory training community was teaching agents to manage context, a parallel line of research was reframing the problem entirely.

Recursive Language Models, introduced by Alex Zhang, Tim Kraska, and Omar Khattab, propose that context engineering is not a retrieval problem. It is a navigation problem. The difference matters.

Traditional RAG treats the knowledge base as a database you query. You embed chunks, compute similarity scores, and pull the top-k results into the prompt. The model passively receives whatever the retrieval system decides is relevant. The model itself has no agency over what it sees.
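
To make the passive-retrieval pattern concrete, here is a minimal sketch. It assumes a toy bag-of-words vector in place of a real embedding model, and the names `embed`, `cosine`, and `retrieve_top_k` are illustrative, not taken from any library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Score every chunk against the query and return the k best, highest first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Invoices are archived after ninety days.",
    "The deployment pipeline runs nightly tests.",
    "Archived invoices can be restored by an administrator.",
]
# The model only ever sees what this call returns; it cannot ask for more.
prompt_context = retrieve_top_k("How do I restore archived invoices?", chunks, k=2)
```

The point of the sketch is the control flow: the ranking happens entirely outside the model, which receives `prompt_context` and nothing else.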

RLMs flip this. The model gets access to a persistent Python REPL environment. Long inputs are loaded as variables. The model can inspect them, grep through them, partition them, and launch recursive sub-queries, spinning up fresh LLM instances that process chunks in parallel and return results. The model navigates its context the way a developer navigates a codebase: by exploring, filtering, and selectively reading what it needs.
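
The navigation loop described above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: in an actual RLM the model writes its own exploration code inside the REPL, whereas here the strategy is hard-coded, and `stub_llm` is a hypothetical stand-in for a real model call. The helper names `grep`, `partition`, and `recursive_query` are also assumptions.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def grep(context: str, pattern: str) -> list[str]:
    """Return the lines of the context that match a regular expression."""
    return [line for line in context.splitlines() if re.search(pattern, line)]


def partition(context: str, max_chars: int) -> list[str]:
    """Split the context into chunks of whole lines, each at most max_chars long
    (a single line longer than max_chars becomes its own chunk)."""
    chunks, current = [], ""
    for line in context.splitlines(keepends=True):
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


def stub_llm(question: str, text: str) -> str:
    """Hypothetical model call: ignores the question and echoes lines with 'ERROR'."""
    return "\n".join(grep(text, r"ERROR"))


def recursive_query(question: str, context: str, window: int = 200) -> str:
    """Answer directly if the context fits the window; otherwise partition it,
    run sub-queries in parallel, and recurse on the combined partial answers."""
    chunks = partition(context, window)
    if len(context) <= window or len(chunks) == 1:
        return stub_llm(question, context)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda c: recursive_query(question, c, window), chunks))
    combined = "\n".join(p for p in partials if p)
    # Stop recursing once the partial answers no longer shrink the context.
    if len(combined) >= len(context):
        return stub_llm(question, combined)
    return recursive_query(question, combined, window)


# The long input lives in a variable; it is never placed in a single prompt.
log = "\n".join(
    f"{i:04d} ERROR disk full" if i % 50 == 0 else f"{i:04d} INFO ok"
    for i in range(200)
)
answer = recursive_query("Which lines report errors?", log)
```

Each sub-query sees only a chunk that fits the assumed `window`, which is how the total input can exceed what any single call could hold.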

The numbers back this up. RLMs process inputs up to two orders of magnitude beyond the model’s native context window without degradation. On real-world datasets exceeding 1.5 million characters, RLMs significantly outperform standard LLMs and common long-context scaffolds. A post-trained version, RLM-Qwen3-8B, outperforms its base model by 28.3% on average and approaches GPT-5 quality on three long-context tasks — from an 8B model.

What makes this complementary to post-memory training is the separation of concerns. Post-memory training teaches agents what to remember and what to forget — the editorial decisions about state management. RLMs teach agents how to explore context they have not yet committed to memory. One is about managing what is inside the agent’s head. The other is about navigating what is outside it.

The combination is more powerful than either alone. An agent that can navigate large external contexts and manage a compact internal memory state has the two capabilities needed for genuinely long-running autonomous workflows: broad awareness and selective retention.

The OpenClaw moment

Theory is useful. Production evidence is better.

OpenClaw — the open-source AI agent framework that gained over 100,000 GitHub stars in under a week after launching in late January 2026 — is one of the first systems to demonstrate these principles working at scale outside of research benchmarks.

OpenClaw’s architecture embodies the convergence described above. Its memory system uses structured Markdown with timestamps and metadata, stored as plain text files — no proprietary databases, no encrypted blobs. Context-aware lookups happen through semantic search where the agent finds related past conversations even when the user employs completely different words. Memory flows across integrated tools automatically, so information mentioned in a chat becomes available when the agent works in a code editor or browser.
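
As an illustration of what timestamped, plain-text Markdown memory can look like, here is a minimal sketch. The entry layout and the function names `format_entry`, `append_entry`, and `read_entries` are assumptions made for this example; they are not OpenClaw's actual file format or API.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def format_entry(when: datetime, source: str, tags: list[str], text: str) -> str:
    """Render one memory as a Markdown section: timestamp heading, metadata, body."""
    return (
        f"## {when.isoformat()}\n"
        f"- source: {source}\n"
        f"- tags: {', '.join(tags)}\n"
        f"\n{text.strip()}\n\n"
    )


def append_entry(path: Path, entry: str) -> None:
    """Append the entry to a plain-text Markdown file, creating it if needed."""
    with path.open("a", encoding="utf-8") as f:
        f.write(entry)


def read_entries(path: Path) -> list[str]:
    """Split the file back into entries on the '## ' heading marker.
    Assumes entry bodies never contain '## ' themselves."""
    parts = path.read_text(encoding="utf-8").split("## ")
    return ["## " + p for p in parts if p.strip()]


# Hypothetical location; a real deployment would use a stable directory.
memory_file = Path(tempfile.mkdtemp()) / "memory.md"
append_entry(memory_file, format_entry(
    datetime(2026, 2, 1, 9, 30, tzinfo=timezone.utc),
    "chat", ["billing"], "User prefers invoices as PDF."))
append_entry(memory_file, format_entry(
    datetime(2026, 2, 2, 14, 0, tzinfo=timezone.utc),
    "editor", ["billing", "code"], "Invoice export lives in the reports module."))
entries = read_entries(memory_file)
```

Because the store is an ordinary text file, it can be inspected, diffed, and edited with standard tools, which is the property the paragraph above emphasizes.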

What makes OpenClaw relevant to this discussion is not just its memory architecture. It is the accessibility of its approach. OpenClaw agents can autonomously write code to create new skills and maintain long-term memory, all without fine-tuning. The agent learns through use, not through gradient descent. Memory management happens at the application layer, not the model layer, which means any team can deploy and customize it.

Version 2026.2.3 specifically targets the reliability problem that kills long-running workflows: improved tool execution consistency, session management, memory reliability, and agent isolation for stable long-running automation. These are not research features. They are production fixes for the exact failure modes that post-memory training research identifies.

Here is how I read the OpenClaw moment. It is the first widely adopted system that proves you can build agents with durable, self-managed memory without any of the traditional barriers: no fine-tuning costs, no scarce compute requirements, no deep ML expertise. The agent manages its own context. The user configures the workflow. The system runs.

That is what democratization looks like in practice.

Enterprise translation: why this changes the scaling equation

Let me translate this convergence into operational impact.

The cost barrier drops

Post-memory training compute requirements are an order of magnitude below fine-tuning. AgeMem trains on a single 8xA100 node. Memory-R1 needs 152 examples. MemAct’s 51% context reduction and MEM1’s constant memory overhead translate directly to lower inference costs at scale. For an enterprise running 30,000 agent conversations per day at $0.14 each, a 50% context reduction does not just cut costs in half — it changes what is economically viable. Tasks that were too expensive for agents to handle over extended interactions become feasible.
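
The arithmetic behind that example is worth spelling out. The sketch below assumes that cost scales linearly with context size, so a 50% context reduction halves the per-conversation cost; real pricing also includes output tokens and fixed overheads, so treat the result as an upper bound on savings. Amounts are kept in integer cents to avoid floating-point rounding.

```python
def daily_cost_cents(conversations: int, cents_each: int, reduction_pct: int = 0) -> int:
    """Daily spend in cents, assuming cost is proportional to context size."""
    return conversations * cents_each * (100 - reduction_pct) // 100


baseline = daily_cost_cents(30_000, 14)                    # 420_000 cents = $4,200 per day
reduced = daily_cost_cents(30_000, 14, reduction_pct=50)   # 210_000 cents = $2,100 per day
annual_savings = (baseline - reduced) * 365                # 76_650_000 cents = $766,500 per year
```

Under these assumptions the same budget covers twice as many conversations, or conversations twice as long, which is the sense in which the reduction changes what is viable rather than only trimming a bill.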

The expertise barrier drops

Fine-tuning requires ML engineers who understand distributed training, reward design, gradient debugging, and checkpoint management. Post-memory training and RLM-style navigation operate at the application layer. OpenClaw operates with Markdown files and configuration. The expertise shifts from “can you train a model” to “can you design a workflow” — a far larger talent pool.

Reliability over long horizons

This is the bigger prize. The reason 60% of multi-agent pilots fail to scale is not cost — it is reliability degradation. Agents that cannot maintain coherent state over 50+ turns are operationally useless for the workflows that matter most: multi-step research, complex customer service, code migration, incident response.

Post-memory training addresses this directly. MEM1 maintains near-constant performance at 16 objectives. MemAgent sustains accuracy across 3.5 million tokens. RLMs process inputs two orders of magnitude beyond native windows without degradation. The combination — agents that navigate broad context and maintain compact internal state — is what makes multi-hour autonomous workflows viable.

Three things to watch in your stack

Memory as a trainable skill, not a fixed pipeline. The heuristic memory systems in production today (Mem0, Zep, LangChain memory) deliver real value. But learned memory policies outperform hand-designed ones, especially as tasks grow longer. Evaluate whether your memory layer can evolve.

Navigation over retrieval. If your agents access large knowledge bases through RAG alone, you are leaving performance on the table. The RLM pattern (giving agents tools to explore, filter, and recursively query their context) is complementary and often superior for deep analysis tasks. Watch for this capability to appear in agent frameworks over the next 12 months.

Application-layer optimization over model-layer customization. The OpenClaw model (memory management through configuration and runtime learning rather than weight modification) is likely the pattern most organizations will adopt. It does not require you to be a frontier lab.

Limitations and open questions

The convergence is real, but so are the gaps.

Training data scarcity. RL-based memory training requires environments where the agent can practice memory management at scale. Most papers use synthetic tasks or narrow benchmarks. Whether these training signals transfer to the messy diversity of enterprise workflows is unproven.

Reward engineering is fragile. The mem-agent team’s experience is instructive: initial reward designs led to models gaming format rewards instead of solving tasks. They went through multiple iterations before finding a stable recipe. This is not plug-and-play.
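
A toy example shows how a format reward can be gamed. This is a generic illustration under assumed weights, not the mem-agent team's actual reward design; `naive_reward` and `gated_reward` are hypothetical names.

```python
def naive_reward(well_formatted: bool, task_correct: bool) -> float:
    """Additive reward: format and correctness are scored independently,
    so a well-formatted wrong answer still earns partial credit."""
    return 0.5 * well_formatted + 1.0 * task_correct


def gated_reward(well_formatted: bool, task_correct: bool) -> float:
    """Gated reward: the format bonus is only paid when the task is solved."""
    if not task_correct:
        return 0.0
    return 1.0 + (0.5 if well_formatted else 0.0)


# A policy that emits perfectly formatted but wrong answers:
gamed_naive = naive_reward(well_formatted=True, task_correct=False)
gamed_gated = gated_reward(well_formatted=True, task_correct=False)
```

Under the additive scheme the policy collects reward without solving anything, while the gated scheme pays nothing. Gating is only one possible mitigation, and it comes at the cost of a sparser learning signal early in training.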

Navigation overhead. RLMs consistently require more wall-clock time because launching and coordinating sub-LLM calls adds overhead, even when those calls run in parallel. For latency-sensitive applications such as chatbots and real-time customer support, traditional RAG remains the faster option. The RLM trade-off favors accuracy over speed, which makes it best suited to deep analysis and long-running workflows.

Multi-modal memory is early. All eight post-training papers focus on text. Agents working with images, structured data, tool outputs, or mixed modalities need memory systems that handle heterogeneous content. Nobody has solved this yet.

Safety surface. Memory that persists and evolves creates attack vectors. Adversarial inputs could poison long-term memory. Over-retention raises privacy concerns in regulated industries. OpenClaw’s 2026.2.3 release addresses some of this with stronger prompt injection protection, but the broader safety implications of learned memory policies remain underexplored.

Where this is heading

Three developments have converged in the span of months. Post-memory training teaches agents what to remember. Recursive Language Models teach agents how to navigate what they have not yet remembered. And production systems like OpenClaw prove that these capabilities can be delivered without the traditional barriers of fine-tuning, scarce compute, and deep ML expertise.

The trajectory is clear. In Phase 1 (now), heuristic memory systems and RAG are the production standard. They work. Use them. In Phase 2 (2026–2027), RL-trained memory modules become available as plug-in components, small specialized models that handle memory management for larger reasoning models. RLM-style navigation becomes a standard agent capability. In Phase 3 (2027+), memory training and context navigation merge into standard post-training pipelines, alongside instruction tuning and reasoning RL.

Whether RL on memory turns out to be the answer or one piece of a larger puzzle remains open. Something entirely different might emerge next year. But the number of independent teams that arrived at this approach — combined with the speed at which production systems like OpenClaw have adopted the principles — is difficult to dismiss. That kind of convergence usually means something.

For practitioners building today, the actionable takeaway is this: architect your systems so the memory layer is modular and replaceable, and give your agents tools to navigate their context, not just receive it passively. The heuristic systems you deploy now will eventually be swapped for trained ones. If your memory management is hardwired into your pipeline, that swap will be expensive. If it is a clean interface, it will be an upgrade.
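
One way to keep the memory layer replaceable is to have agent code depend on a small interface rather than on a concrete backend. The sketch below is illustrative: `MemoryStore`, `KeywordMemory`, and `run_turn` are hypothetical names, and the word-overlap recall is a deliberately simple heuristic.

```python
from typing import Protocol


class MemoryStore(Protocol):
    """Hypothetical interface: anything with these two methods can back the agent."""

    def remember(self, text: str) -> None: ...

    def recall(self, query: str, limit: int = 3) -> list[str]: ...


class KeywordMemory:
    """Heuristic backend: keeps everything and recalls by word overlap."""

    def __init__(self) -> None:
        self._items: list[str] = []

    def remember(self, text: str) -> None:
        self._items.append(text)

    def recall(self, query: str, limit: int = 3) -> list[str]:
        words = set(query.lower().split())
        scored = [(len(words & set(item.lower().split())), item) for item in self._items]
        hits = [item for score, item in sorted(scored, key=lambda s: -s[0]) if score > 0]
        return hits[:limit]


def run_turn(memory: MemoryStore, user_message: str) -> list[str]:
    """Agent logic sees only the interface, so the backend can be swapped later."""
    recalled = memory.recall(user_message)
    memory.remember(user_message)
    return recalled


memory = KeywordMemory()
first = run_turn(memory, "deploy the billing service")
second = run_turn(memory, "billing service status")
```

A trained memory module would implement the same two methods, so replacing `KeywordMemory` would not require touching `run_turn`.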

The agents that scale in production will not be those with the biggest context windows or the most expensive models. They will be the ones that have learned what to remember, and how to find what they have not.
