https://arxiv.org/abs/2602.11988
The following content is generated by LLMs and may contain inaccuracies.
Context
Repository context files like AGENTS.md have become widespread in software development to tailor coding agents to repositories, with over 60,000 open-source repositories adopting them by early 2026. These files function as “READMEs for agents,” providing repository overviews and information on developer tooling to help coding agents navigate repositories, run build and test commands, adhere to style guides, and solve tasks. The AGENTS.md format emerged from collaborative efforts across the AI software development ecosystem, including OpenAI Codex, Amp, Jules from Google, Cursor, and Factory.
Despite strong industry encouragement from model providers like OpenAI and Anthropic, this paper from ETH Zürich’s Secure, Reliable, and Intelligent Systems Lab addresses a critical gap: there is currently no rigorous investigation into whether such context files are actually effective for real-world tasks. The work challenges prevailing assumptions at a moment when coding agents are rapidly advancing on benchmarks like SWE-bench, where top agents score 20% on the full benchmark and 43% on SWE-bench Lite.
Key Insights
Counterintuitive core finding: Across multiple coding agents and LLMs, context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%. This directly contradicts agent developer recommendations.
Benchmark innovation: The authors constructed AGENTbench, a novel benchmark of Python software engineering tasks from 12 recent, niche repositories, all of which feature developer-written context files. This complements existing evaluations: SWE-bench tasks from popular repositories are evaluated with LLM-generated context files following agent-developer recommendations, while AGENTbench provides a novel collection of issues from repositories containing developer-committed context files. The distinction matters because the AGENTS.md format was only formalized in August 2025, and adoption is not uniform across the industry.
Differential impact by provenance: Developer-provided files only marginally improve performance compared to omitting them entirely (an increase of 4% on average), while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average). This pattern held across different LLMs and prompts used to generate the context files.
Behavioral mechanism: Both LLM-generated and developer-provided context files encourage broader exploration (e.g., more thorough testing and file traversal), and coding agents tend to respect their instructions. The problem is not agent non-compliance but rather that unnecessary requirements in context files make tasks harder: the additional exploration, testing, and reasoning they induce is what drives the over-20% increase in cost.
Content analysis of existing files: One recommendation for context files is to include a codebase overview. Across the 12 developer-provided context files in AGENTbench, 8 include a dedicated codebase overview, with 4 explicitly enumerating and describing the directories and subdirectories in the repository. Functional directives (build, test, implementation detail, architecture) dominate, while guidance on non-functional requirements (security, performance, usability) is relatively uncommon. These files exhibit a median update interval of 22 hours, with most changes involving the addition or minor modification of 50 words or fewer.
Implications for practice: The authors recommend omitting LLM-generated context files for the time being, contrary to agent developers' recommendations, and including only minimal requirements (e.g., the specific tooling to use with the repository). This aligns with emerging practitioner wisdom: Factory advises aiming for ≤ 150 lines, warning that long files slow the agent and bury signal, while some developers argue for ruthless minimalism: just a one-sentence project description and a package manager specification.
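A context file consistent with these minimalist recommendations might look like the following sketch; the project name, commands, and tooling are illustrative, not taken from any repository in the study:

```markdown
# AGENTS.md

Acme is a Python CLI for parsing log archives.

- Install dependencies with `uv sync`; run tests with `pytest -q`.
- Lint with `ruff check .`; do not reformat unrelated files.
```

Everything else (architecture overviews, directory enumerations, broad quality mandates) is the kind of content the paper associates with extra exploration and higher cost.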
Broader context tensions: This finding sits within ongoing debates about retrieval-augmented approaches for coding. Industry practitioners like Nick Pash, Head of AI at Cline, argue that RAG can be a ‘seductive trap’ for coding tasks because code is inherently logical and structured and does not always benefit from being broken down into semantically similar but contextually isolated chunks. Anthropic ultimately abandoned RAG approaches when agentic search consistently outperformed RAG across both internal benchmarks and subjective quality evaluations.
Related work on repository-level code generation: The paper builds on the SWE-bench ecosystem, in which a language model, given a codebase and an issue, is tasked with generating a patch that resolves the described problem. Recent work on SelectSolve demonstrates that in fully observable environments such as SWE-bench, simply providing the entire codebase to a long-context LLM with proper prompting can match, and sometimes surpass, the performance of carefully designed multi-tool approaches, suggesting that when sufficient context capacity exists, explicit context management may become less critical.
Direct link to the paper: Gloaguen et al., “Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?”
Open Questions
- What is the optimal context budget allocation? If context files increase costs by over 20% while hurting performance, how should those tokens be reallocated: toward deeper code retrieval, longer conversation history, or expanded test coverage feedback?
- Can selective, task-adaptive context files outperform static ones? Rather than a single AGENTS.md file consumed at initialization, could agents dynamically query minimal, task-specific guidance (e.g., “build commands only” for dependency issues, “test patterns only” for bug fixes) to capture the marginal benefit of developer-written context while avoiding the breadth penalty?
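One way to prototype the second question is a simple section router that serves only the guidance relevant to the task at hand. The sketch below is hypothetical: the section names, task categories, and commands are illustrative assumptions, not drawn from the paper or any agent implementation.

```python
# Hypothetical sketch: instead of injecting a full AGENTS.md at startup,
# return only the context sections matched to the current task type.

# Guidance sections a repository maintainer might author (illustrative).
CONTEXT_SECTIONS = {
    "build": "Install dependencies with `uv sync`; build with `make`.",
    "test": "Run `pytest -x tests/` before submitting a patch.",
    "style": "Follow PEP 8; lint with `ruff check .`.",
}

# Mapping from coarse task categories to the sections they need (illustrative).
TASK_TO_SECTIONS = {
    "dependency_issue": ["build"],
    "bug_fix": ["test"],
    "refactor": ["style", "test"],
}

def select_context(task_type: str) -> str:
    """Return only the minimal guidance needed for this task type.

    Unknown task types get no repository context at all, mirroring the
    paper's finding that less context can outperform more.
    """
    keys = TASK_TO_SECTIONS.get(task_type, [])
    return "\n".join(CONTEXT_SECTIONS[k] for k in keys)

print(select_context("dependency_issue"))
```

A real agent would route on richer signals than a task-type string (e.g., the issue text or the files touched so far), but even this toy version makes the cost trade-off measurable: context tokens are spent only when a section is selected.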