← Notes
blog · May 22, 2026

Agent Context Management and Engineering

Agent

Summary

“Context engineering” for agents is broader than prompt engineering. Prompt engineering optimizes the wording, structure, and demonstrations inside a prompt; context engineering optimizes the entire information payload and runtime that shape an agent’s behavior at inference time: conversation state, retrieval, memory, tool schemas, intermediate artifacts, summaries, caches, and multi-agent communication. Recent surveys now explicitly frame context engineering as a discipline that spans context retrieval/generation, processing, management, and system-level implementations such as RAG, memory systems, tool-integrated reasoning, and multi-agent systems. In parallel, agent-memory surveys argue that memory should be treated as a first-class systems primitive rather than a vague synonym for “long context.” (1)

The last three years of research have converged on five high-confidence findings.

  1. sheer context-window size is not enough: long-context models still exhibit position bias, “lost in the middle” effects (2), and steep performance decay as reasoning complexity rises.
  2. external memory systems help, but naive vector-store retrieval is often too shallow for temporal, multi-hop, or action-conditioned tasks.
  3. better results increasingly come from structured memory—session decomposition, temporal indices, graph memory, reflection loops, and learned or agentic memory operations—rather than from simply “retrieve top-k chunks.”
  4. tool-use itself has become a context problem: large tool catalogs, long tool responses, and multi-turn trajectories can degrade performance sharply.
  5. benchmarks are moving from static recall toward memory-in-action settings, where an agent must both remember and use memory to plan, act, and update state over time.

The “context engineering” problem should be treated as a systems-and-evaluation problem, not just a modeling problem. Focus on the design/implementations of a memory taxonomy, retrieval/indexing choices, context assembly and compression, state persistence (3), tool schemas, observability, cost/latency tradeoffs, and benchmark-driven evaluation.

What matters mostWhy it matters
Benchmark before optimizingLoCoMo (13), LongMemEval (29), LongBench v2, BFCL(30), ToolSandbox, LongFuncEval (32), and MemoryArena (33) measure different failure modes; optimizing one can miss the others.
Prefer structure over brute forceIn 2024–2026, the most interesting gains come from graph memory (6), temporal retrieval, learned memory actions, reflection, and better assembly/compression.

What Context Engineering Means for Agents

A practical definition that matches the recent literature is: context engineering is the systematic design, optimization, and governance of the information that an LLM receives and produces during inference. That includes prompt content, retrieved evidence, session state, memory, tool definitions, tool outputs, routing decisions, caches, and compaction policies.

For agents, it helps to separate five concepts.

  • Prompt engineering is the craft of writing instructions, exemplars, role specifications, XML structure, or chain/pipeline prompts without changing model weights.
  • State is the structured runtime object for an ongoing run or conversation: message history, partial plans, tool outputs, counters, and control flags.
  • Memory is information persisted beyond the immediate active context, so it can be recalled later across turns or sessions.
  • Tool use is active externalization: the agent acquires new context or takes action by calling functions, APIs, search, code, browsers, or MCP servers.
  • Multi-agent context adds partitioning and communication: who sees which context, what is shared versus private, and how messages or memory fragments are coordinated across agents.

Recent agent-memory work also sharpens the memory vocabulary. The 2025 survey argues that traditional “short-term vs long-term” labels are too coarse, and proposes thinking in terms of forms, functions, and dynamics, with functional categories such as working memory, factual memory, and experiential memory. (4)

  1. Factual Memory: The agent’s declarative knowledge base, established to ensure consistency, coherence, and adaptability by recalling explicit facts, user preferences, and environmental states. This system answers the question: “What does the agent know?”
  2. Experiential Memory: The agent’s procedural and strategic knowledge, accumulated to enable continual learning and self-evolution by abstracting from past trajectories, failures, and successes. This system answers: “How does the agent improve?”
  3. Working Memory: The agent’s capacity-limited, dynamically controlled scratchpad for active context management during a single task or session. This system answers: “What is the agent thinking about now?”

One of the clearest early blueprints for agent context handling is the retrieve–reflect–plan loop in Generative Agents (10), which stores a stream of experiences, retrieves a relevant subset, synthesizes higher-level reflections, and uses them for planning. That paper, together with later systems such as Voyager (11), MemGPT, MemoryBank (18), and A-Mem (21), shifted the field from “fit more tokens” toward “decide what deserves to be in context.”

flowchart LR
    U[User turn or environment event] --> S[Runtime state manager]
    S --> W[Working context]
    S --> T[Tool router]
    S --> M[(Persistent memory)]
    T --> R[Retriever or search layer]
    R --> X[Reranker or compressor]
    M --> R
    W --> A[Context assembler]
    X --> A
    A --> L[Reasoning model]
    L --> O[Answer or action]
    O --> T
    O --> C[Compaction and summarization]
    C --> M

Recent Research and Benchmarks

The last three years produced a coherent research arc. Early 2023 work established the architectural motif: Generative Agents (10) formalized memory streams, reflection, and planning; Voyager (11) showed lifelong skill accumulation and retrieval in an embodied setting; Toolformer and related tool-use work pushed models toward self initiated API usage; and Self-RAG (17) turned retrieval into an adaptive, on-demand process rather than a fixed pre-retrieval step. At the same time, Lost in the Middle (2) and LongBench (26) demonstrated that long context is not equivalent to robust long-context reasoning, especially when relevant evidence is buried or distributed.

In 2024–2025, the center of gravity shifted toward memory systems and benchmark realism. LoCoMo (13) and LongMemEval (29) moved evaluation from single long documents to long, multi-session interactions with temporal reasoning, updates, and abstention. HippoRAG introduced graph-and-PageRank-style associative retrieval; MemoRAG introduced global-memory-augmented retrieval (6); A-Mem organized memory as linked notes rather than flat chunks; Reflective Memory Management added prospective and retrospective reflection (21); and LongBench v2 raised difficulty toward deeper real-world reasoning. This period also saw aggressive work on compression and efficiency—LLMLingua (24), LongLLMLingua, Selective Context, and RetrievalAttention—because long context is expensive even when accuracy improves.

In 2026, the research moves from “better recall” to memory as policy and memory-in-action. AgeMem learns memory operations as tool-like actions integrated into the agent policy (14); MemoryArena evaluates whether memory actually improves future multi-session decision making (33); ASTRA-bench stresses tool-based planning with messy personal context (34); M2CL and Epistemic Context Learning treat multi-agent discussion as a context-coherence and trust-estimation problem; and recent memory preprints such as MemMachine and decoupled retrieval frameworks emphasize preserving episodic ground truth and composing retrieval contexts rather than merely ranking chunks by similarity. Some of these 2026 works are still preprints, but they are strong indicators of current research direction. (56)

ThemeRepresentative papers and methodsWhy they matter
Memory streams and reflectionGenerative Agents introduced observation → retrieval → reflection → planning. (10)Still the clearest conceptual template for agent memory loops.
Lifelong skill memoryVoyager stores executable skills and retrieves them in Minecraft. (11)Shows that memory can be procedural, not just textual.
Adaptive retrievalSelf-RAG retrieves on demand and critiques its own generations with reflection tokens. (17)Important for deciding when extra context is necessary.
Human-like long-term memoryMemoryBank stores, recalls, updates memories and user portraits. (18)Good reference point for personalization-oriented memory.
Graph/associative retrievalHippoRAG combines LLMs, knowledge graphs, and Personalized PageRank; HippoRAG 2 extends this toward continual non-parametric learning.Strong for multi-hop and associative recall, where plain vector top-k often fails.
Global-memory RAGMemoRAG uses a long-range “global memory” model to guide retrieval. (20)Useful when tasks need holistic understanding of a corpus before pinpoint retrieval.
Agentic memory organizationA-Mem dynamically links notes using Zettelkasten-like organization. (21)Reframes memory from flat storage into evolving relational structure.
Reflective memory updateReflective Memory Management adds prospective summary and retrospective RL-style retrieval refinement. (22)Makes memory writing and retrieval adaptive rather than static.
Learned memory actionsAgeMem integrates store/retrieve/update/summarize/discard into the agent policy. (14)One of the clearest signs that memory control is becoming a learning problem.
Efficient long-context handlingLLMLingua, LongLLMLingua, Selective Context, RetrievalAttention. (24)The engineering reality: context quality must be balanced against cost and latency.
BenchmarkWhat it measuresWhy you should know it
Lost in the Middle (2)Position bias in long contextsCanonical evidence that longer windows do not imply equal access to all positions.
LongBench and LongBench v2 (26)Long-context QA, summarization, few-shot, code, and deeper reasoningGood general-purpose long-context benchmark family.
∞Bench and BABILong (27)Very-long-context and distributed-fact reasoningStress tests for extreme lengths and reasoning under sparse evidence.
LoCoMo (13)Very long-term conversational memory, QA, summarization, multimodal dialogueImportant for multi-session, interference-heavy conversational memory.
LongMemEval (29)Information extraction, multi-session reasoning, temporal reasoning, updates, abstentionOne of the most useful memory benchmarks for assistant-style agents.
BFCL (30)Function-calling/tool-use accuracy, including AST and executable correctnessEssential if your internship touches tool-heavy agents.
τ-bench (31) and ToolSandbox (70)Stateful tool-agent-user interaction and on-policy conversational tool useMove evaluation past single static prompts.
LongFuncEval (32)Tool calling under long catalogs, long responses, and long dialoguesDirectly relevant to context engineering for enterprise agents.
MemoryArena and MemoryAgentBench (33)Memory-in-action, test-time learning, selective forgetting, interdependent tasksImportant emerging benchmarks because they connect memory to action quality.
ASTRA-bench (34)Tool-use planning with personal user contextStrong fit for personalized assistants and context-aware planning.

The major open problems are remarkably stable across papers. Existing systems still struggle with temporal reasoning, knowledge updates, selecting the right granularity for memory writes, assembling non-redundant but sufficient evidence, robust tool use under long responses or large toolsets, and fair evaluation when memory quality and acting quality are tightly coupled. Benchmarks are improving, but cross-benchmark transfer remains weak: systems that do well on static long-context recall can still perform poorly when memory must support future action. (29)

Industry and Open-Source Landscape

Industry practice has moved decisively toward explicit context/runtime features. OpenAI’s current platform exposes persistent conversation state, server-side compaction for long-running interactions, automatic prompt caching, tool search to avoid loading entire tool catalogs up front, and MCP/connectors for external services (39); Anthropic exposes prompt caching with automatic or explicit cache breakpoints and has publicly framed “effective context engineering” as the core mental model for agent quality (40); Google’s Gemini/Vertex stack similarly supports implicit and explicit context caching with resource IDs, expiration policies, and cost/latency benefits (41). The common pattern is clear: major vendors increasingly productize context management as a first-class API concern rather than leaving it entirely to application code.

Open-source frameworks differ mainly in how opinionated they are about state, memory, and orchestration. LangGraph emphasizes long-running, stateful workflows with low-level graph control (37); AutoGen popularized multi-agent conversation but is now in maintenance mode (42); Haystack offers modular pipelines with explicit control over retrieval, routing, memory, and generation (43); Semantic Kernel positions itself as enterprise middleware for agents and multi-agent systems (44); LlamaIndex is strongest as a data orchestration and retrieval layer (45); Letta is explicitly memory-first and treats certain memory blocks as always-in-context; CrewAI packages multi-agent orchestration with built-in memory abstractions (38).

The memory-layer market has also become more specialized. Letta productizes the MemGPT-style distinction between always-visible core memory and retrievable external memory. Mem0 positions itself as a universal, self-improving memory layer with scoped memory types and a production-focused paper claiming large token and latency savings relative to full-context baselines (46). Zep emphasizes “context engineering” and graph-based assembly of personalized context from chat history, documents, business data, and events, with low-latency retrieval (35). MemMachine is a newer open-source entrant that stresses preserving full conversational episodes rather than aggressively extracting lossy facts. (56)

Framework or productCore abstractionContext and memory postureBest use case
OpenAI Agents SDK / Responses / Conversations (39)Agent runs over persistent conversations and toolsStrong platform-native state, compaction, caching, MCP, tool searchProduct teams building tool-heavy agents quickly
Anthropic Claude API (40)Prompting + tools + cachingStrong caching and context-design guidance; less opinionated OSS runtimeTeams optimizing cost/latency in long multi-turn flows
Vertex AI / Gemini context cache (41)Explicit or implicit cached context objectsGood cloud-native context reuse with TTLs and managed lifecycleEnterprise workloads already on Google Cloud
LangGraph (37)Stateful graph runtimeExplicit control over long-running state and orchestrationResearchers and engineers who want control over execution graphs
Haystack (43)Modular pipelines and agentsStrong retrieval/routing/transparency orientationRetrieval-heavy production systems
Semantic Kernel (44)Enterprise middleware for agentsBroad connector/tool orchestration; good for structured enterprise appsC#, Python, Java enterprise stacks
Letta (38)Memory-first stateful agentsCore memory pinned in context; recall/archival memory retrievablePersistent personalized assistants
Mem0 (46) / Zep (35) / MemMachine (56)Dedicated memory layerStrong focus on long-term memory, graph/context assembly, production retrievalAdd-on memory subsystem for existing agents

A striking industry convergence is interoperability around tools and external context. MCP has emerged as a standard protocol for exposing tools and resources to LLM applications (47), and OpenAI, Anthropic, and others now support or integrate with it. This matters for context engineering because it standardizes how tools and resources enter the agent’s context, but it does not solve the harder problem of which tools/resources should be loaded, preserved, summarized, or hidden during long-horizon execution.

Technical Design Dimensions

A useful technical decomposition has six dimensions: memory type, retrieval method, indexing structure, context processing, orchestration policy, and systems tradeoff. On memory type, the practical triad is working memory, factual/semantic memory, and episodic/experiential memory. Working memory sits in the active prompt or short-lived scratchpad; factual memory stores stable facts, preferences, or schema-like knowledge; episodic memory stores temporally grounded interactions, observations, and action traces. The difficulty is not defining the tiers; it is deciding what moves between them, when consolidation occurs, and how lossy summaries can be without destroying future retrieval value. (4)

On retrieval, dense vector retrieval remains the default, but recent work keeps showing that it is insufficient on its own for associativity, temporal reasoning, or exact-match-sensitive tasks. That is why production systems and recent papers increasingly use hybrid retrieval—dense plus sparse/BM25, often plus metadata filters and reranking. Pinecone (49), Weaviate, Qdrant, Milvus, Chroma, and Vespa all now expose hybrid or multi-vector patterns; Qdrant and Milvus also support dense+sparse or multi-vector fields in the same logical object. In research, HippoRAG (6), GraphRAG-style systems, and A-Mem push further by making retrieval graph- or structure-aware rather than purely embedding-similarity-based (21).

On indexing, the main choices are vector ANN indices, sparse/inverted indices, graph indices, temporal/session-aware partitions, and namespacing. FAISS remains the canonical library for efficient dense ANN search (50); pgvector gives a relational option with HNSW and IVFFlat tradeoffs (57); FAISS and pgvector remain common when you want explicit control, while managed stores trade that control for faster operational ramp-up. HNSW typically offers a stronger speed–recall tradeoff than IVFFlat at the cost of more memory and slower builds, which matters for whether your system is read-heavy or write-heavy.

On context processing, the major techniques are chunking, reranking, summarization, compression, and stitching. The literature now makes a strong case that chunking is not a boring preprocessing step; it is a modeling decision. Coarse chunks preserve more context but increase noise; fine chunks improve precision but can destroy temporal or causal dependencies. Reranking narrows candidate sets before generation (51); LLMLingua/LongLLMLingua and Selective Context compress prompts to reduce cost while preserving salient information (24); MemoryArena (33) and MemMachine (56) implicitly show why preserving whole episodes or contextual neighborhoods can matter when evidence spans turns.

On orchestration, context engineering chooses between single-shot assembly, iterative retrieval, reflection loops, graph walks, or tool-mediated acquisition. IRCoT and later iterative RAG methods show why “retrieve once, then read” underperforms on multi-step tasks: reasoning can change the retrieval query (52). Self-RAG pushes that insight into learned behavior (17); LongFuncEval shows that the same logic applies when the “documents” are long tool responses or tool catalogs (32). In other words, context assembly increasingly looks like a control problem, not a static template.

On systems tradeoffs, the key triangle is quality, latency, cost. Long contexts increase token cost and inference time; RetrievalAttention highlights the KV-cache and quadratic-attention bottlenecks (53); prompt/context caching reduces repeated compute; compaction avoids dragging obsolete branches forward; and semantic caching can avoid whole-model calls for near-duplicate queries. The best systems are therefore not the ones that maximize context size; they are the ones that maximize useful evidence per token.

Finally, evaluation should be multi-axis. Across public benchmarks, useful metrics include retrieval relevance, downstream QA correctness, temporal reasoning and update handling, abstention when evidence is missing, tool-call AST/executable correctness, trajectory or task success, and operational metrics such as tokens, latency, and cache hit rate.

Implementation Patterns and Tooling

A production-quality agent usually implements three loops: a write loop that decides what memory to persist, a read loop that selects and assembles the minimum useful context for the current step, and a control loop that decides whether the next step should be reasoning, retrieval, compaction, delegation, or tool use. OpenAI’s Conversations API plus compaction and prompt caching are examples of vendor-native support for the read/control loops; Anthropic and Google provide analogous caching primitives; memory layers such as Letta (38), Mem0 (46), and Zep (35) specialize the write/read loops.

A robust implementation pattern is: store raw events first, derive views later. Write the canonical event stream—messages, tool arguments, tool outputs, environment observations—into durable storage; then derive session summaries, facts, graph edges, or embeddings asynchronously. This pattern is increasingly favored because lossy extraction at ingestion time can permanently discard evidence that later turns need. MemMachine explicitly argues for ground-truth-preserving episodic storage (56), while modern context caches, compaction systems, and semantic caches can operate on top of raw event logs or derived artifacts.

For storage backends, the choice should follow your dominant access pattern. Use relational storage when you need transactions, joins, or compliance-friendly lineage; add pgvector if you want lightweight ANN inside Postgres (57). Use FAISS when you want local/offline control over vector search. Use Qdrant, Weaviate, Pinecone, Milvus, Chroma, or Vespa when you need higher-level retrieval features such as hybrid search, named vectors, metadata filtering, multi-vector retrieval, or managed operations at scale. Use Redis or Upstash as a semantic cache in front of expensive model calls when repeated or near-duplicate queries are common.

ComponentRecommended defaultStrong alternativesMain tradeoff
Durable state storePostgres + JSONB + object storageSQLite for prototyping; cloud KV for simple casesRelational stores make lineage and updates easier, but pure vector systems can be simpler for retrieval-first prototypes. (57)
Vector retrievalpgvector or Qdrant for small-to-mid scalePinecone, Weaviate, Milvus, Vespa, FAISSManaged systems reduce ops; self-managed systems give more control. (57)
Hybrid retrievalDense + sparse + metadata filter + rerankerGraph memory for multi-hop-heavy tasksHigher quality, but more moving parts. (60)
Memory write policyAppend raw events, derive summaries asynchronouslyImmediate fact extraction for latency-critical appsRaw-event preservation is safer; immediate extraction is cheaper online. (61)
Context reductionCompaction + summarization + semantic cachePrompt caching and explicit context cachesCaches reduce repeated compute; compaction reduces future prompt size. (62)
ObservabilityLangSmith or OpenTelemetry-compatible tracingAgentOps, custom spansYou need traces for memory reads/writes, tool calls, token use, and latency. (63)

At code level, the most valuable pattern is to make context assembly explicit and testable. Instead of one monolithic “build_prompt()” function, separate: candidate retrieval, reranking, deduplication, summary/compression, policy filters, formatting, and provenance tagging. That modularity is what lets you swap retrieval policies, measure token budgets, and debug why the model saw one memory fragment but not another. (29)

Consistency and scaling issues are often underestimated. If multiple agents can write shared memory, you need provenance, timestamps, and visibility rules; collaborative-memory work shows why dynamic, asymmetric permissions matter in multi-user/multi-agent environments (65). If you use semantic caching, you need careful similarity thresholds and TTLs to avoid serving stale or subtly wrong results. If you use compaction or summarization, you need regression tests to verify that critical entities, timestamps, and commitments survive the reduction step.

Research Directions

The most credible internship projects are narrowly framed, benchmarked, and systems-conscious.

Research question or projectSuggested methodDatasets / benchmarksSuccess criterion
Temporal memory retrieval for assistant dialoguesAdd session decomposition + time-aware query expansion + temporal rerankingLongMemEval (29), LoCoMo (13)Improve temporal-reasoning and update categories without increasing token cost by more than 20%
Graph memory vs flat vector memoryCompare flat top-k retrieval, graph memory, and hybrid graph+vector retrievalLoCoMo (13), HotpotQA-style multi-hop sets, HippoRAG evaluation recipes (6)Higher multi-hop accuracy and fewer redundant passages in assembled context
Learned memory actionsImplement AgeMem-style memory operations as tool actions; train or optimize with offline RL or bandit feedbackMemoryArena (33), MemoryAgentBench (68)Better task success per token than hand-written memory heuristics
Ground-truth-preserving episodic memoryStore raw episodes, then compare against extractive fact memoryLoCoMo, LongMemEval-S, MemMachine-style ablationsBetter recall on evidence spanning multiple turns, especially temporal and multi-hop questions
Compression for tool-heavy agentsApply LLMLingua/Selective Context-style compression to long tool responses and compare against raw inclusionLongFuncEval (32), ToolSandbox (70)Lower latency and tokens with minimal drop in tool-call correctness
Shared/private memory for multi-agent systemsBuild shared memory with access-control metadata and compare to fully shared or fully isolated memoryCollaborative Memory setups, ASTRA-bench (34), MemoryArena (33)Higher task success with zero unauthorized leakage in controlled tests
Context-aware multi-agent discussionReproduce or simplify M2CL(72) / Epistemic Context Learning ideasMath/reasoning benchmarks plus ASTRA-like personal context casesBetter consensus quality under fixed token budget than standard debate
Memory write selectionLearn when not to store memory, using importance/novelty/reuse signalsLongMemEval (29), MemoryAgentBench (68)Same or better accuracy with fewer stored memories and lower retrieval noise
Retrieval by decoupling and aggregationTest whether fine-grained search + coarse-grained context assembly beats plain top-k chunksLongMemEval (29), multi-hop QA, 2026 decoupled-retrieval settings (xMemory (74))Higher answer faithfulness and fewer missing prerequisites in assembled evidence
Cache-aware orchestrationAdd prompt/context caching, semantic cache routing, and compaction to a baseline agent runtimeAny multi-turn assistant benchmark + production-like tracesSignificant cost/latency reduction with no unacceptable drop in benchmark score

A strong methodological pattern for all ten is the same. Start with a transparent baseline, instrument every memory read/write/tool call, evaluate on at least one static benchmark and one interactive or multi-session benchmark, and report not just quality but latency, tokens, memory size growth, and failure categories. This experimental style aligns well with where the field is moving and is usually stronger than adding yet another generic agent loop. (33)

Practical Roadmap and Risks

A good 10–12 week learning plan is to move from conceptual clarity to one benchmarked subsystem. Weeks 1–2 should be reading and vocabulary: the context-engineering survey, the agent-memory survey (eg. (1), (4)), Generative Agents (10), Lost in the Middle (2), LoCoMo (13), and LongMemEval (29). Weeks 3–5 should be retrieval and memory infrastructure: implement dense, sparse, and hybrid search; compare vector-only vs reranked retrieval; add session-aware chunking. Weeks 6–8 should be context policies: summarization, compaction, memory write rules, and tool-response compression. Weeks 9–12 should be benchmarked experimentation, ablations, and write-up.

flowchart LR
    A[Weeks 1-2\nRead surveys and core papers] --> B[Weeks 3-5\nImplement retrieval and storage baselines]
    B --> C[Weeks 6-8\nAdd memory policies, compression, and caching]
    C --> D[Weeks 9-10\nRun benchmarks and ablations]
    D --> E[Weeks 11-12\nWrite report, polish demo, prepare interview narrative]

Three mini-projects are especially internship-friendly.

  • Mini-project A: a benchmarked long-term memory assistant using LoCoMo/LongMemEval, with vector, hybrid, and graph-memory variants.
    • Deliverables: reproducible benchmark scripts, trace dashboard, and a short technical memo.
  • Mini-project B: a tool-heavy agent on BFCL (30)/LongFuncEval with prompt caching, semantic caching, and response compression.
    • Deliverables: latency/cost comparison, failure taxonomy, and a demo notebook.
  • Mini-project C: a multi-agent shared-memory sandbox with private/shared memory and access control, evaluated on a simplified ASTRA-style personalized planner.
    • Deliverables: architecture diagram, policy tests, and an end-to-end demo.

The main risks and evaluation pitfalls are not abstract. Tool and MCP integrations expand the attack surface for prompt injection, unsafe tool use, and sensitive-data exposure; OpenAI’s own MCP/connectors guidance explicitly flags these risks. Persistence also creates data-governance problems: cached and stored context has retention, expiration, and access-control implications. On the modeling side, overly aggressive summarization or compaction can silently delete commitments, dates, or provenance; semantic caches can return plausible but stale outputs; and LLM-as-judge metrics can reward polished answers even when evidence is missing. Finally, benchmark mismatch remains a real problem: strong results on static long-context QA do not guarantee good performance when memory must guide later actions.

Open questions and limitations. The 2026 literature is moving quickly, and several promising directions cited above—such as AgeMem, MemoryArena (33), ASTRA-bench (34), decoupled retrieval for memory (74), and MemMachine (56)—are recent preprints rather than long-established conference benchmarks. They are highly relevant for internship ideation, but their results and standardization may still shift. The highest-confidence foundations remain the older 2023–2025 works and the benchmark families that are already widely reused.

Reference

  • [1] Mei, L., Yao, J., Ge, Y., Wang, Y., Bi, B., Cai, Y., Liu, J., Li, M., Li, Z., Zhang, D., Zhou, C., Mao, J., Xia, T., Guo, J., & Liu, S. (2025). A Survey of Context Engineering for Large Language Models. ArXiv, abs/2507.13334.
  • [2] Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157-173.
  • [3] https://developers.openai.com/api/docs/guides/conversation-state
  • [4] Hu, Y., Liu, S., Yue, Y., Zhang, G., Liu, B., Zhu, F., Lin, J., Guo, H., Dou, S., Xi, Z., Jin, S., Tan, J., Yin, Y., Liu, J., Zhang, Z., Sun, Z., Zhu, Y., Sun, H., Peng, B., Cheng, Z., Fan, X., Guo, J., Yu, X., Zhou, Z., Hu, Z., Huo, J., Wang, J., Niu, Y., Wang, Y., Yin, Z., Hu, X., Liao, Y., Li, Q., Wang, K., Zhou, W., Liu, Y., Cheng, D., Zhang, Q., Gui, T., Pan, S., Zhang, Y., Torr, P., Dou, Z., Wen, J., Huang, X., Jiang, Y., & Yan, S. (2025). Memory in the Age of AI Agents. ArXiv, abs/2512.13564.
  • [6] Gutierrez, B. J., Shu, Y., Gu, Y., Yasunaga, M., & Su, Y. (2024). HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. The Thirty-Eighth Annual Conference on Neural Information Processing Systems. Retrieved from https://openreview.net/forum?id=hkujvAPVsg
  • [10] Park, J.S., O’Brien, J., Cai, C.J., Morris, M.R., Liang, P., & Bernstein, M.S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology.
  • [11] Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., … Anandkumar, A. (2024). Voyager: An Open-Ended Embodied Agent with Large Language Models. Transactions on Machine Learning Research. Retrieved from https://openreview.net/forum?id=ehfRiF0R3a
  • [13] Maharana, A., Lee, D.-H., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024, August). Evaluating Very Long-Term Conversational Memory of LLM Agents. In L.-W. Ku, A. Martins, & V. Srikumar (Eds), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13851–13870). doi:10.18653/v1/2024.acl-long.747
  • [14] Yu, Y., Yao, L., Xie, Y., Tan, Q.S., Feng, J., Li, Y., & Wu, L. (2026). Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents. ArXiv, abs/2601.01885.
  • [17] Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2024). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. The Twelfth International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=hSyW5go0v8
  • [18] Zhong, W., Guo, L., Gao, Q., Ye, H., & Wang, Y. (2024). MemoryBank: enhancing large language models with long-term memory. Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence. doi:10.1609/aaai.v38i17.29946
  • [20] Qian, H., Liu, Z., Zhang, P., Mao, K., Lian, D., Dou, Z., & Huang, T. (2025). MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation. Proceedings of the ACM on Web Conference 2025, 2366–2377. Presented at the Sydney NSW, Australia. doi:10.1145/3696410.3714805
  • [21] Xu, W., Liang, Z., Mei, K., Gao, H., Tan, J., & Zhang, Y. (2026). A-Mem: Agentic Memory for LLM Agents. The Thirty-Ninth Annual Conference on Neural Information Processing Systems. Retrieved from https://openreview.net/forum?id=FiM0M8gcct
  • [22] Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Rajan Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, and Tomas Pfister. 2025. In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8416–8439, Vienna, Austria. Association for Computational Linguistics.
  • [24] Jiang, H., Wu, Q., Lin, C.-Y., Yang, Y., & Qiu, L. (2023, December). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. In H. Bouamor, J. Pino, & K. Bali (Eds), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 13358–13376). doi:10.18653/v1/2023.emnlp-main.825
  • [26] Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., … Li, J. (2024, August). LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. In L.-W. Ku, A. Martins, & V. Srikumar (Eds), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3119–3137). doi:10.18653/v1/2024.acl-long.172
  • [27] Zhang, X., Chen, Y., Hu, S., Xu, Z., Chen, J., Hao, M., … Sun, M. (2024, August). ∞Bench: Extending Long Context Evaluation Beyond 100K Tokens. In L.-W. Ku, A. Martins, & V. Srikumar (Eds), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 15262–15277). doi:10.18653/v1/2024.acl-long.814
  • [29] Wu, D., Wang, H., Yu, W., Zhang, Y., Chang, K.-W., & Yu, D. (2025). LongMemEval: Benchmarking chat assistants on long-term interactive memory. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR).
  • [30] Patil, S. G., Mao, H., Yan, F., Ji, C. C.-J., Suresh, V., Stoica, I., & Gonzalez, J. E. (13—19 Jul 2025). The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, … J. Zhu (Eds), Proceedings of the 42nd International Conference on Machine Learning (pp. 48371–48392). Retrieved from https://proceedings.mlr.press/v267/patil25a.html
  • [31] Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2025). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR).
  • [32] Kate, K., Pedapati, T., Basu, K., Rizk, Y., Chenthamarakshan, V., Chaudhury, S., Agarwal, M., & Abdelaziz, I. (2025). LongFuncEval: Measuring the effectiveness of long context models for function calling. ArXiv, abs/2505.10570.
  • [33] He, Z., Wang, Y., Zhi, C., Hu, Y., Chen, T., Yin, L., Chen, Z., Wu, T., Ouyang, S., Wang, Z., Pei, J., McAuley, J., Choi, Y., & Pentland, A.’. (2026). MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks. ArXiv, abs/2602.16313.
  • [34] Xiu, Z., Sun, D.Q., Cheng, K., Patel, M.J., Date, J., Zhang, Y., Lu, J., Attia, O., Vemulapalli, R., Tuzel, O., Cao, M., & Bengio, S. (2026). ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context. ArXiv, abs/2603.01357.
  • [35] Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory. ArXiv, abs/2501.13956.
  • [37] https://docs.langchain.com/oss/python/langgraph/overview
  • [38] https://docs.letta.com/guides/core-concepts/memory/memory-blocks/
  • [39] https://developers.openai.com/api/docs/guides/agents
  • [40] https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
  • [41] https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/context-cache/context-cache-overview
  • [42] https://github.com/microsoft/autogen
  • [43] https://github.com/deepset-ai/haystack
  • [44] https://learn.microsoft.com/en-us/semantic-kernel/overview/
  • [45] https://www.llamaindex.ai/
  • [46] https://docs.mem0.ai/introduction
  • [47] https://modelcontextprotocol.io/specification/2025-06-18
  • [49] https://docs.pinecone.io/guides/search/hybrid-search
  • [50] https://faiss.ai/index.html
  • [51] https://www.pinecone.io/learn/series/rag/rerankers/
  • [52] Trivedi, H., Balasubramanian, N., Khot, T., & Sabharwal, A. (2022). Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10014–10037, Toronto, Canada. Association for Computational Linguistics.
  • [53] Liu, D., Chen, M., Lu, B., Jiang, H., Han, Z., Zhang, Q., … Qiu, L. (2026). RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval. The Thirty-Ninth Annual Conference on Neural Information Processing Systems. Retrieved from https://openreview.net/forum?id=8z3cOVER4z
  • [56] Wang, S., Yu, E., Love, O., Zhang, T., Wong, T., Scargall, S., & Fan, C. (2026). MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents. arXiv [Cs.AI]. Retrieved from http://arxiv.org/abs/2604.04853
  • [57] https://github.com/pgvector/pgvector
  • [60] https://docs.pinecone.io/guides/search/hybrid-search
  • [61] https://arxiv.org/abs/2604.04853
  • [62] https://developers.openai.com/api/docs/guides/compaction
  • [63] https://docs.langchain.com/langsmith/home
  • [65] Rezazadeh, A., Li, Z., Lou, A., Zhao, Y., Wei, W., & Bao, Y. (2025). Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control. ArXiv, abs/2505.18279.
  • [68] Hu, Y., Wang, Y., & McAuley, J. (2026). Evaluating memory in LLM agents via incremental multi-turn interactions. In Proceedings of The Fourteenth International Conference on Learning Representations. (ICLR)
  • [70] Lu, J., Holleis, T., Zhang, Y., Aumayer, B., Nan, F., Bai, H., Ma, S., Ma, S., Li, M., Yin, G., Wang, Z., & Pang, R. (2025). ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. Findings of the Association for Computational Linguistics: NAACL 2025. https://aclanthology.org/2025.findings-naacl.65/
  • [72] Hua, X., Yue, S., Li, X., Zhao, Y., Zhang, J., & Ren, J. (2026). Context learning for multi-agent discussion. In Proceedings of The Fourteenth International Conference on Learning Representations (ICLR)
  • [74] Hu, Z., Zhu, Q., Yan, H., He, Y., & Gui, L. (2026). Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation. ArXiv, abs/2602.02007.
  • [75] https://developers.openai.com/api/docs/guides/prompt-caching