Despite their remarkable capabilities, modern AI models share a surprisingly human limitation: they tend to forget. When tasked with extended conversations, intricate multi-step reasoning, or projects spanning several days, AI assistants often lose track of earlier details. This phenomenon, known as “context degradation,” has quietly emerged as a major hurdle in developing AI agents that perform consistently in real-world scenarios.
A collaborative research team from China and Hong Kong has proposed an innovative approach to tackle this challenge. Their newly introduced system, called GAM, is designed to maintain long-term information without overwhelming the AI model. The fundamental concept is straightforward yet effective: divide memory into two distinct functions, one that records everything comprehensively, and another that selectively retrieves the most relevant information precisely when needed.
Initial experiments with GAM show promising results, and the system arrives at a pivotal moment: the AI industry is shifting its focus from mere prompt engineering to the broader, more nuanced discipline of context engineering.
Why Expanding Context Windows Alone Falls Short
At the core of every large language model (LLM) lies a fixed “working memory” capacity, commonly referred to as the context window. As conversations lengthen, earlier information is either truncated, summarized, or discarded altogether. This limitation has been well-known among AI researchers, prompting a surge in efforts since early 2023 to enlarge context windows and enable models to process more information in a single input.
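The truncation described above can be made concrete with a toy sketch (an illustration of the general behavior, not any specific model's implementation): once the token budget is exhausted, the oldest turns are silently dropped.

```python
# Toy illustration of a fixed context window: the oldest turns are
# discarded once the token budget is hit. Token counting here is a
# crude whitespace split, purely for demonstration.
def fit_to_window(turns: list[str], budget: int) -> list[str]:
    """Keep only the most recent turns whose combined token count fits `budget`."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk backward from the newest turn
        tokens = len(turn.split())        # crude per-turn token count
        if used + tokens > budget:
            break                         # everything earlier is discarded
        kept.append(turn)
        used += tokens
    return list(reversed(kept))           # restore chronological order

history = ["turn one is here", "turn two is here", "turn three is here"]
print(fit_to_window(history, budget=8))   # only the last two turns survive
```

Anything that falls outside the window is simply gone, which is exactly the failure mode the rest of this article is about.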
For instance, Mistral’s Mixtral 8x7B launched with a 32K-token window, roughly a few dozen pages of text, while MosaicML’s MPT-7B-StoryWriter-65k+ offered roughly double that capacity. Google’s Gemini 1.5 Pro and Anthropic’s Claude 3 pushed the boundaries further, offering context windows of 128K and 200K tokens respectively, with extensions reaching up to one million tokens. Microsoft also advanced from a 2K-token limit in earlier Phi models to a 128K-token window in Phi-3.
However, simply increasing context size is not a panacea. Even models capable of handling hundreds of pages struggle to recall details from the beginning of long dialogues. Larger context windows introduce new challenges: as input length grows, the model’s ability to focus on and accurately interpret distant tokens diminishes, leading to degraded performance.
Moreover, longer inputs can dilute the signal-to-noise ratio, where including every detail paradoxically hampers the quality of responses compared to more focused prompts. Additionally, processing extensive context increases latency, slowing down response times and imposing practical limits on how much context can be effectively utilized.
The High Cost of Memory in AI Systems
For many organizations, expanding context windows comes with significant financial implications. API costs scale with the number of input tokens, making large prompts expensive. While caching strategies can mitigate some costs, routinely overloading models with excessive context remains economically unsustainable. This creates a fundamental tension: memory is vital for enhancing AI capabilities, yet scaling it indiscriminately is neither technically nor financially viable.
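The economics are easy to see with back-of-envelope arithmetic. The per-token price below is a hypothetical placeholder, not any provider's actual rate; what matters is that cost scales linearly with input tokens times turns.

```python
# Back-of-envelope cost of replaying a long history on every request.
# The price is an illustrative assumption, not a real provider's rate.
PRICE_PER_MILLION_INPUT_TOKENS = 5.0  # hypothetical USD rate

def prompt_cost(context_tokens: int, turns: int) -> float:
    """Cost of resending `context_tokens` of history on each of `turns` requests."""
    return context_tokens * turns / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

# Replaying a 200K-token context over a 50-turn session:
full = prompt_cost(200_000, 50)
# A focused 4K-token context over the same 50 turns:
focused = prompt_cost(4_000, 50)
print(f"full history: ${full:.2f}, focused context: ${focused:.2f}")
```

Even at modest rates, the 50x difference between replaying everything and sending only what is relevant compounds quickly across users and sessions.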
Alternative solutions like summarization and Retrieval-Augmented Generation (RAG) offer partial relief but are far from perfect. Summaries often omit subtle yet crucial details, and traditional RAG methods, effective for static documents, falter when information spans multiple sessions or evolves dynamically. Even advanced variants like agentic RAG and RAG 2.0, which improve retrieval control, still treat retrieval as a fix rather than addressing memory as the core issue.
Lessons from Software Engineering: Just-in-Time Memory Management
Since retrieval alone cannot solve the memory bottleneck, a different strategy is needed, one inspired by decades-old software engineering principles. GAM adopts a Just-in-Time (JIT) compilation approach to memory. Instead of compressing and precomputing a fixed memory snapshot, GAM maintains a complete, lossless archive of all interactions alongside a minimal set of cues. When the AI requires information, it dynamically “compiles” a tailored context, assembling only the relevant details on demand.
This dual-layered memory system prevents premature compression or guesswork about what information will be important later. By combining a full historical record with intelligent, on-the-fly retrieval, GAM ensures that the AI receives precisely the right context at the right time.
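The dual-layer idea can be sketched in a few lines. This is our illustration of the concept, not the paper's implementation: a lossless log sits alongside cheap per-entry cues, and a working context is assembled only when a task arrives.

```python
# Minimal sketch of just-in-time context "compilation": a lossless
# archive plus lightweight cues, with the working context assembled
# on demand. The cue and scoring schemes are deliberately crude.
class JITMemory:
    def __init__(self):
        self.archive = []   # full, lossless record of every exchange
        self.cues = []      # one short cue per exchange for cheap scanning

    def record(self, exchange: str) -> None:
        self.archive.append(exchange)
        self.cues.append(exchange[:40])  # crude cue: a truncated preview

    def compile_context(self, query: str, budget: int = 2) -> list[str]:
        """Pull only the archive entries whose cues share words with the query."""
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(cue.lower().split())), i)
            for i, cue in enumerate(self.cues)
        ]
        top = sorted(scored, reverse=True)[:budget]
        return [self.archive[i] for score, i in top if score > 0]

mem = JITMemory()
mem.record("User prefers invoices in PDF format")
mem.record("Project deadline moved to Friday")
mem.record("User's favorite color is green")
print(mem.compile_context("project deadline status"))
```

Nothing is ever deleted from `archive`; only the compiled context is small. That separation between what is stored and what is shown to the model is the core of the JIT analogy.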
GAM’s Architecture: A Dual-Agent Memory Framework
GAM’s design centers on separating memory storage from retrieval, embodied in two specialized agents: the “memorizer” and the “researcher.”
The Memorizer: Comprehensive and Unbiased Recording
The memorizer’s role is to capture every interaction in its entirety, converting each exchange into a concise memo while preserving the full session in a structured, searchable archive. It avoids aggressive compression or premature filtering, instead organizing data into metadata-rich pages that facilitate efficient retrieval. Optional lightweight summaries are generated for quick reference, but no information is discarded, ensuring a complete and faithful record.
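A toy memorizer in the spirit described above might look like the following. The class and field names are our assumptions for illustration; the point is the shape of the data: verbatim content, searchable metadata, and a memo that supplements rather than replaces the record.

```python
# Toy memorizer: each exchange is kept verbatim inside a metadata-rich
# page, with an optional lightweight memo for quick reference.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Page:
    page_id: int
    session: str
    timestamp: str
    content: str                                  # full, lossless exchange
    memo: str = ""                                # short summary, never a replacement
    keywords: list = field(default_factory=list)  # metadata to aid retrieval

class Memorizer:
    def __init__(self):
        self.pages: list[Page] = []

    def record(self, session: str, content: str, memo: str = "") -> Page:
        page = Page(
            page_id=len(self.pages),
            session=session,
            timestamp=datetime.now(timezone.utc).isoformat(),
            content=content,
            memo=memo or content[:60],
            keywords=sorted({w.lower().strip(".,") for w in content.split()}),
        )
        self.pages.append(page)  # append-only: nothing is ever discarded
        return page

m = Memorizer()
p = m.record("s1", "Client approved the Q3 budget.", memo="Q3 budget approved")
print(p.page_id, p.memo, p.keywords)
```

The append-only store is what makes the "unbiased" claim work: no component ever has to guess, at write time, what will matter later.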
The Researcher: Intelligent and Iterative Retrieval
When the AI needs to act, the researcher takes charge by devising a search strategy that combines semantic embeddings with keyword-based methods like BM25. It navigates through the memorizer’s page store, performing layered searches that blend vector similarity, keyword matching, and direct lookups. The researcher evaluates results, identifies missing pieces, and iterates until it assembles a coherent, task-specific briefing, much like a human analyst synthesizing notes and primary sources.
This Just-in-Time memory pipeline allows GAM to generate rich, contextually relevant information on demand, avoiding the pitfalls of brittle, precomputed summaries. The synergy between a complete archive and an active retrieval engine is what sets GAM apart.
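The hybrid scoring at the heart of that retrieval loop can be sketched as follows. A real system would use learned embeddings and a proper BM25 index; here a bag-of-words cosine stands in for the semantic channel and raw term overlap for the keyword channel, blended by a weight `alpha` (all of this is our simplification, not GAM's actual scorer).

```python
# Sketch of hybrid retrieval: blend a "semantic" similarity score with a
# keyword-overlap score for every page, then rank. Bag-of-words cosine
# is a deliberately crude stand-in for embedding similarity.
import math
from collections import Counter

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_search(query: str, pages: list[str], alpha: float = 0.5):
    """Score each page by alpha * semantic + (1 - alpha) * keyword overlap."""
    q = Counter(query.lower().split())
    results = []
    for page in pages:
        d = Counter(page.lower().split())
        semantic = _cosine(q, d)                       # embedding stand-in
        keyword = len(set(q) & set(d)) / max(len(q), 1)  # BM25 stand-in
        results.append((alpha * semantic + (1 - alpha) * keyword, page))
    return sorted(results, reverse=True)               # best match first

pages = [
    "The deployment failed because the API key expired",
    "Lunch options near the office were discussed",
]
best_score, best_page = hybrid_search("why did the deployment fail", pages)[0]
print(best_page)
```

The iterative part of the researcher, evaluating results and re-querying for missing pieces, would wrap a loop around `hybrid_search`, refining the query until the briefing is complete.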
Benchmarking GAM: Surpassing RAG and Large Context Models
The research team evaluated GAM against traditional RAG pipelines and large-context models such as GPT-4o-mini and Qwen2.5-14B across four rigorous benchmarks designed to test long-term memory and reasoning:
- LoCoMo: Assesses an agent’s ability to maintain and recall information over extended, multi-session conversations involving single-hop, multi-hop, temporal reasoning, and open-domain tasks.
- HotpotQA: A multi-hop question-answering benchmark derived from Wikipedia, adapted with memory-intensive contexts of 56K, 224K, and 448K tokens to test handling of noisy, sprawling inputs.
- RULER: Measures retrieval accuracy, multi-hop state tracking, aggregation over long sequences, and question-answering performance within a 128K-token context.
- NarrativeQA: Requires answering questions based on entire books or movie scripts, with sampled contexts averaging 87K tokens.
Across all tests, GAM consistently outperformed competitors. Its most notable success was on RULER, where it achieved over 90% accuracy. In contrast, RAG methods faltered due to loss of critical details in summaries, and large-context models struggled as early information “faded” despite being technically present.
These results underscore that simply enlarging context windows is insufficient. GAM’s strength lies in precise, intelligent retrieval rather than indiscriminate token accumulation.
GAM’s Role in the Era of Context Engineering
Often, the root cause of AI’s memory challenges is poorly structured context rather than inherent model limitations. GAM addresses this by guaranteeing that no information is permanently lost and that relevant data can be reliably retrieved even after extended periods. This innovation coincides with the growing emphasis on context engineering-the practice of carefully shaping all inputs an AI model receives, including instructions, history, retrieved documents, tools, user preferences, and output formats.
While context engineering is rapidly overtaking prompt engineering in importance, other research efforts are exploring alternative memory solutions. For example, Anthropic is developing curated, evolving context states; DeepSeek experiments with encoding memory as images; and another Chinese research group proposes “semantic operating systems” for lifelong adaptive memory.
GAM’s philosophy is distinct: it avoids premature data loss and leverages intelligent retrieval. By preserving every detail and employing a dedicated research engine to extract relevant information at runtime, GAM offers a dependable memory system ideal for AI agents managing multi-day projects, ongoing workflows, or long-term relationships.
The Future of AI Memory: Why GAM Matters
Just as increasing computational power alone does not guarantee better algorithms, expanding context windows is not a standalone solution for AI’s long-term memory challenges. Meaningful advancement requires reimagining memory as an engineering problem that benefits from structure and strategy rather than brute force.
As AI agents evolve from impressive prototypes to essential tools, their capacity to remember extensive histories with accuracy and precision becomes critical. Businesses demand AI systems capable of tracking complex, evolving tasks, maintaining continuity, and recalling past interactions flawlessly. GAM provides a practical blueprint for this future, signaling a shift in AI’s next frontier: not larger models, but smarter memory architectures and context management systems that empower truly reliable intelligence.
