What if your AI agents are lying to each other?
We’ve been building multi-agent systems like it’s a team meeting: everyone contributes, we take the best idea, and ship it. But distributed systems engineers solved a harder version of this problem 40 years ago under the name Byzantine Fault Tolerance, where the core challenge is reaching agreement when some participants are unreliable, deceptive, or just confidently wrong.
Most people in the agent community haven’t internalized how directly this applies. LLM hallucinations are structurally identical to a Byzantine node sending contradictory messages to different peers. And when all your agents share the same base model or similar training data, their failure modes become correlated, which is precisely the condition that breaks classical fault tolerance guarantees.
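The equivocation analogy can be made concrete with a toy sketch. Nothing here is from a real protocol; the two report functions are hypothetical stand-ins for an honest broadcaster and a Byzantine one:

```python
def honest_report(value, peer):
    # An honest node sends the same value to every peer.
    return value

def byzantine_report(values_by_peer, peer):
    # A Byzantine node equivocates: each peer receives a different,
    # equally confident-looking value -- much like a hallucinating LLM
    # giving inconsistent answers to different callers.
    return values_by_peer[peer]

peers = ["A", "B"]
honest_views = {p: honest_report(42, p) for p in peers}
byzantine_views = {p: byzantine_report({"A": 42, "B": 17}, p) for p in peers}

assert len(set(honest_views.values())) == 1     # honest: one shared view
assert len(set(byzantine_views.values())) == 2  # Byzantine: views diverge
```

Naive agreement checks that trust each peer's local view cannot distinguish the two cases without cross-checking, which is exactly what BFT protocols are built to do.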
Someone actually tested this recently. A group of researchers ran LLM agents through a simple scalar consensus task where agents just needed to agree on a number. Even without any adversaries in the system, agreement was unreliable and degraded as group size grew. What’s interesting is that the agents didn’t converge on wrong answers so much as they just failed to converge at all, which suggests multi-agent coordination has a liveness problem we haven’t been paying enough attention to.
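A toy simulation of a scalar consensus task (my own sketch, not the researchers' actual setup) shows how stochastic per-agent updates can stall convergence even without adversaries: with zero noise the group contracts to agreement, while sampling noise keeps the spread from ever closing below the tolerance.

```python
import random

def scalar_consensus(n_agents, rounds=20, tol=0.5, noise=1.0, seed=0):
    """Toy scalar consensus: each round, every agent moves halfway toward
    the group mean but adds Gaussian noise (standing in for LLM sampling
    variability). Returns True if all values land within `tol` of each
    other in the round budget, False otherwise -- a liveness failure,
    since no wrong value is ever agreed on; agreement just never happens."""
    rng = random.Random(seed)
    values = [rng.uniform(0, 100) for _ in range(n_agents)]
    for _ in range(rounds):
        mean = sum(values) / n_agents
        values = [v + 0.5 * (mean - v) + rng.gauss(0, noise) for v in values]
        if max(values) - min(values) <= tol:
            return True
    return False

# Deterministic agents converge; noisy ones frequently never do.
assert scalar_consensus(4, noise=0.0) is True
```

Larger groups make things worse in this toy model too, because the max-min spread of n noisy values grows with n, mirroring the degradation with group size reported in the study.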
This got me thinking about what changes when you treat consensus as a hard gate rather than a soft aggregation layer. In most current multi-agent setups, you generate multiple outputs and pick the best one through some form of voting. But there’s a fundamentally different design where certain actions simply cannot proceed until independent agents genuinely converge, and the absence of convergence is itself a meaningful signal that something is underspecified or wrong.
A few directions I’ve been exploring:
One is consensus-gated autonomous actions, where high-risk agent decisions like spec changes or large refactors must pass a multi-agent agreement gate before execution, while low-risk actions flow through freely. The gate threshold itself can be learned over time.
Another is testing reframed as distributed consensus, where multiple independent agents must agree on pass/fail rather than relying on a single judge. When agents disagree on whether a test passes, that disagreement often points to ambiguity in the spec itself.
A third is partial agreement maps that output a structured view of where agents converge versus where they diverge, turning consensus from an all-or-nothing binary into a gradient that tells you exactly where human attention is needed.
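The gating direction above can be sketched in a few lines. This is a minimal illustration under assumed names (`consensus_gate`, the quorum parameter), not a real framework: the action proceeds only on quorum, and a failed gate returns the disagreement itself rather than averaging it away.

```python
from collections import Counter

def consensus_gate(verdicts, quorum):
    """Hard consensus gate: the action proceeds only if at least `quorum`
    independent agents return the same verdict. Absence of quorum blocks
    the action and surfaces the split for human attention -- a crude form
    of the partial agreement map."""
    tally = Counter(verdicts)
    verdict, count = tally.most_common(1)[0]
    if count >= quorum:
        return {"proceed": True, "verdict": verdict}
    return {"proceed": False, "disagreement": dict(tally)}

# High-risk action: 3 of 4 agents must agree before it executes.
gate = consensus_gate(["approve", "approve", "approve", "reject"], quorum=3)
assert gate == {"proceed": True, "verdict": "approve"}

# No quorum: the gate blocks and reports the split instead of hiding it.
blocked = consensus_gate(["approve", "reject", "approve", "reject"], quorum=3)
assert blocked["proceed"] is False
assert blocked["disagreement"] == {"approve": 2, "reject": 2}
```

The same shape works for consensus-based testing: replace the verdict strings with pass/fail judgments and route any non-quorum result back to whoever owns the spec.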
The deeper question I keep coming back to: if agreement among stochastic agents is fundamentally fragile, maybe the right move is designing systems that treat disagreement as the most informative signal rather than an obstacle to be averaged away.
I’m curious whether others see this connection between distributed consensus theory and agent system design. It feels like there’s 40 years of hard-won intuition sitting in the distributed systems literature that the agent community hasn’t fully absorbed yet.
The following content is generated by LLMs and may contain inaccuracies.
Context
The rise of multi-agent AI systems marks a collision between two engineering traditions that have evolved separately for decades. Large language models are now being integrated into autonomous agents that collaborate on complex tasks, yet most practitioners approach these systems as if they were running a team meeting rather than operating a distributed system. Meanwhile, Byzantine Fault Tolerance—a consensus-based algorithmic technique that ensures systems can function correctly even when components fail or act maliciously—has been a cornerstone of distributed systems engineering since the 1980s.
The core insight is that hallucinated intermediate fields in LLM outputs are structurally identical to the Byzantine problem: an agent returns a confident-looking response with no error indication, and everything downstream treats that invented value as real. When agents share the same base model or similar training data, their failure modes become correlated—if a single fault can simultaneously flip more than f modules, Byzantine fault tolerance guarantees are void, which is why design diversity is essential. This violates a fundamental assumption of classical BFT: that faults are independent.
The practical implications are severe. Multi-agent LLM systems fail at rates ranging from 41% to 86.7% in production, and many failures arise from organizational design and agent coordination challenges rather than individual agent limitations. The agent community is rediscovering hard-won lessons from distributed systems, often without the vocabulary or theoretical foundations that would accelerate progress.
Key Insights
LLM consensus failures are primarily liveness problems, not safety problems. Recent empirical work confirms the original hypothesis: researchers tested LLM agents on scalar consensus tasks and found that valid agreement is not reliable even in benign settings and degrades as group size grows. Specifically, valid consensus drops from 46.6% at N=4 to 33.3% at N=16. Crucially, Byzantine agents primarily harm liveness by preventing agreement rather than steering outcomes to corrupted values—agents fail to converge at all rather than converging on wrong answers. This is detailed in Berdoz et al., “Can AI Agents Agree?”, which systematically evaluates LLM-based Byzantine consensus games.
Correlated failures break classical fault tolerance assumptions. When every server runs the same exact software with the same limits and failure modes, a software bug or load-related failure that causes one server to fail can impact the rest of the fleet simultaneously. In AI systems, when multiple organizations deploy autonomous agents based on similar underlying models, the risk of correlated failure arises. Amazon’s operational experience, documented in their Builders' Library on minimizing correlated failures, shows that correlated failures eat away at availability gains from redundancy, including issues with power, network, cooling, and common infrastructure dependencies like DNS.
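A back-of-the-envelope model makes the availability argument concrete. This is my own simplified sketch (the function and its parameters are illustrative, not from Amazon's material): independent failures shrink geometrically with replica count, but a shared-cause event puts a floor under the failure probability that no amount of redundancy removes.

```python
def availability(p_fail, replicas, p_correlated=0.0):
    """Probability that at least one replica survives. `p_correlated` is
    the chance of a shared-cause event (same bug, same base model, same
    DNS dependency) that takes out every replica at once; it caps the
    benefit of adding replicas."""
    independent_loss = p_fail ** replicas
    return 1 - (p_correlated + (1 - p_correlated) * independent_loss)

# Independent faults: redundancy works (three nines from 90%-reliable parts).
assert availability(0.1, replicas=3) == 1 - 0.1 ** 3

# A 1% correlated failure mode caps availability near 99% no matter how
# many replicas you add -- the redundancy gain is eaten by correlation.
assert availability(0.1, replicas=100, p_correlated=0.01) < 0.991
```

For agent ensembles sharing a base model, `p_correlated` is arguably the dominant term, which is the quantitative version of the claim that homogeneous ensembles void BFT-style guarantees.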
Weighted consensus mechanisms show promise for LLM-agent reliability. The research community has begun developing BFT-inspired protocols tailored to LLM characteristics. LLM-based agents demonstrate stronger skepticism when processing erroneous message flows, enabling them to outperform traditional agents across different topological structures. Building on this, Zheng et al. propose CP-WBFT (Confidence Probe-based Weighted Byzantine Fault Tolerant consensus), which leverages the inherent reflective and discriminative capabilities of LLMs, assigning higher transmission weights to more credible agents. Under extreme conditions, this approach achieved +85.71% Byzantine Fault Tolerance improvement on complete graphs while maintaining 100% round-level accuracy.
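The weighting idea can be illustrated with a minimal sketch. This is not the published CP-WBFT protocol, just the core mechanism it builds on: proposals count in proportion to a credibility weight, so a low-credibility Byzantine agent has less pull than it would under one-agent-one-vote.

```python
def weighted_vote(proposals, weights):
    """Credibility-weighted vote (simplified sketch in the spirit of
    CP-WBFT): sum each distinct proposal's weights and return the
    heaviest one. The weights would come from something like a
    confidence probe; here they are given directly."""
    scores = {}
    for proposal, w in zip(proposals, weights):
        scores[proposal] = scores.get(proposal, 0.0) + w
    return max(scores, key=scores.get)

# Two low-credibility agents push "B"; the weighted majority holds at "A",
# even though the raw vote count is tied 2-2.
assert weighted_vote(["A", "A", "B", "B"], [0.9, 0.8, 0.2, 0.3]) == "A"
```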
Coordination failures manifest as emergent system-level phenomena. Multi-agent systems introduce failures that are emergent behaviors—encoded nowhere but arising everywhere from agents that learn from each other, mislead each other, or accidentally form coalitions. Empirically, coordination failures account for 36.94% of multi-agent system failures, verification gaps 21.30%, and infrastructure issues ~16%. The MAST taxonomy documents 14 fine-grained failure modes mapped to execution stages where their root causes typically emerge, providing the first systematic framework for understanding multi-agent LLM system breakdowns.
Hallucination propagation creates unique challenges beyond traditional distributed systems. Multi-agent visual hallucination snowballing occurs where hallucinations are seeded in a single agent and amplified by following ones due to over-reliance on textual flow, with vision tokens gradually diminishing in deeper agent turns. Unlike traditional Byzantine nodes that send contradictory messages, a token-level hallucination can propagate through a workflow and surface as a compliance breach, creating what one practitioner calls “expensive, slow workers that are occasionally wrong in ways that look correct”.
Leaderless consensus architectures reduce single points of failure. Traditional multi-agent frameworks often rely on leader-based protocols, but consensus latency can increase significantly due to consecutive Byzantine leaders—if the leader is Byzantine or submits a low-quality answer that fails to obtain a quorum, the round must be rerun, and multiple consecutive Byzantine leaders can dramatically increase latency. The DecentLLMs framework addresses this by employing a leaderless consensus architecture where worker agents generate answers in parallel and evaluator agents score them, enabling all agents within each role to participate equally.
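The leaderless shape is easy to sketch. This is a hypothetical interface loosely modeled on the DecentLLMs description, not its actual implementation: workers propose in parallel, evaluators score every proposal, and no single leader can force a rerun.

```python
def leaderless_round(workers, evaluators, task):
    """One leaderless round: every worker proposes an answer, every
    evaluator scores every answer, and the answer with the highest total
    score wins. A Byzantine worker can only waste its own proposal; a
    Byzantine evaluator is diluted by the other evaluators' scores."""
    answers = [w(task) for w in workers]  # proposals, generated in parallel
    totals = [sum(e(task, a) for e in evaluators) for a in answers]
    return answers[totals.index(max(totals))]

# Toy agents: one weak worker, and evaluators that prefer longer answers.
workers = [lambda t: "short", lambda t: "a fuller answer"]
evaluators = [lambda t, a: len(a), lambda t, a: len(a)]
assert leaderless_round(workers, evaluators, "task") == "a fuller answer"
```

Contrast this with a leader-based round, where one Byzantine leader submitting a quorum-failing answer forces the entire round to repeat.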
Network topology and scale fundamentally constrain coordination capabilities. Performance smoothly decreases as agent networks grow in size, with all coordination tasks becoming substantially more challenging as network size increases—for 100-agent networks, performance drops to near zero across the board. This is documented in AgentsNet, a benchmark drawn from classical distributed systems problems that explicitly assesses coordination and collaboration capabilities that should be seen as fundamental to effective distributed systems.
Consensus mechanisms themselves can harm performance through premature convergence. Due to LLM hallucinations, confidence scores may be unreliable, and if a small subset of agents is compromised via prompt injection attacks, the system may converge toward a shared but incorrect answer—when decisions are made using mechanisms such as majority voting, this can lead to complete failure. The FREE-MAD framework proposes assigning scores to all candidate responses without requiring consensus in the debate stage, avoiding the conformity pressure that can suppress minority viewpoints.
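The score-without-consensus idea can be sketched as follows. This is a simplification in the spirit of FREE-MAD, with made-up judge functions: every candidate keeps its independent scores, so a strong minority answer is never suppressed by early majority pressure.

```python
def score_all_candidates(candidates, judges):
    """Rank every candidate by the sum of independent judge scores,
    without any debate-stage consensus step. The full ranking (not just
    the winner) is returned, so a close second stays visible instead of
    being erased by a premature vote."""
    scored = [(sum(judge(c) for judge in judges), c) for c in candidates]
    return sorted(scored, reverse=True)

# Hypothetical judges: one rewards citing evidence, one rewards detail.
judges = [lambda c: c.count("evidence"), lambda c: len(c) / 100]
ranked = score_all_candidates(["claim with evidence", "bare claim"], judges)
assert ranked[0][1] == "claim with evidence"
```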
Organizational design principles matter more than model capability. Improvements in base model capabilities will be insufficient to address the full taxonomy of multi-agent system failures—good MAS design requires organizational understanding, as even organizations of sophisticated individuals can fail catastrophically if the organization structure is flawed. This aligns with research showing that well-defined design principles from high-reliability organizations can prevent such failures, suggesting that the path forward involves importing concepts from organizational theory and industrial management, not just better language models.
Open Questions
Can we develop formal verification methods for stochastic consensus protocols? Classical BFT provides deterministic guarantees, but LLM-based agents operate through probabilistic inference. If disagreement is the most informative signal in stochastic multi-agent systems, what does a formally verified “consensus on lack of consensus” look like, and can we prove bounds on the informativeness of disagreement patterns?
What is the theoretical limit of coordination in homogeneous vs. heterogeneous agent ensembles? If correlated failures are inevitable when agents share base models, and design diversity introduces non-trivial integration complexity, is there an optimal point on the homogeneity-diversity spectrum? Can we quantify the coordination tax of diversity and determine whether it’s fundamentally worthwhile for Byzantine robustness in production LLM systems?