Human-in-the-loop approval scaling in agentic systems

Published at: 2026-02-17

In agentic applications, to ensure that agent actions, in particular write operations, are executed in a safe and trusted manner, we often put a human in the loop.

The naive practice for a setup with one agentic loop plus MCP tool servers is to require human confirmation for every tool call. However, this approach does not scale: it quickly turns the human into a tedious “Confirm” executor who may blindly approve without actually reviewing the action.
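
To make the failure mode concrete, here is a minimal sketch (in Go, with a hypothetical ToolCall type, not any real MCP client API) of this naive gate: every single tool call blocks on an interactive yes/no prompt, which is exactly what wears the reviewer down.

```go
// A minimal sketch of per-tool-call confirmation, assuming a hypothetical
// ToolCall type. Every call funnels through confirm() before execution.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// ToolCall is a hypothetical representation of one MCP tool invocation.
type ToolCall struct {
	Tool string
	Args map[string]string
}

// confirm blocks on stdin for every single call, which is the pattern
// that turns the human into a "Confirm" executor.
func confirm(call ToolCall) bool {
	fmt.Printf("Agent wants to call %s with %v. Approve? [y/N] ", call.Tool, call.Args)
	line, _ := bufio.NewReader(os.Stdin).ReadString('\n')
	return strings.TrimSpace(strings.ToLower(line)) == "y"
}

func main() {
	call := ToolCall{Tool: "fs.write", Args: map[string]string{"path": "main.go"}}
	if confirm(call) {
		fmt.Println("executing", call.Tool)
	} else {
		fmt.Println("rejected", call.Tool)
	}
}
```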

One potential mitigation is a diff-based review approach: provide a summary of the ongoing executions and ask for confirmation. However, this remains non-scalable when the agent produces a huge diff (e.g. writing 10k lines of code and wanting to commit), which a human cannot review efficiently.

Another idea is to prepare some sort of intent overview and contract for the user to review, but it seems generally hard to prepare the tool call sequence in advance because it is non-deterministic and depends on the context. It can also happen that a single catastrophic tool call is simply missed in such a review.

The following content is generated by LLMs and may contain inaccuracies.

Context

Human-in-the-loop (HITL) approval for agentic systems addresses a fundamental tension in AI safety: balancing autonomy with control. As agents gain write permissions—whether modifying codebases, executing financial transactions, or altering production systems—the risk of cascading failures grows. Traditional per-action approval gates create approval fatigue, degrading the very oversight they’re meant to provide. This challenge intensifies as agents integrate with Model Context Protocol (MCP) tool servers, where tool composition can generate unbounded action sequences.

Key Insights

Hierarchical approval boundaries: Rather than uniform gating, systems could implement trust tiers based on reversibility and blast radius. Anthropic’s Constitutional AI work suggests learned policies can classify actions by consequence severity. Read operations and idempotent writes might auto-approve, while irreversible operations (deletions, external API calls) trigger review. This mirrors capability-based security patterns where permissions are granular rather than binary.
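
A minimal sketch of what such tiering could look like, assuming a hypothetical rule-based classifier (the paragraph above frames this as a learned policy, which this sketch does not implement):

```go
// A minimal sketch of tiered approval. Actions carry a hypothetical
// ActionClass label; only irreversible ones escalate to a human.
package main

import "fmt"

type ActionClass int

const (
	ReadOnly ActionClass = iota
	IdempotentWrite
	Irreversible
)

type Action struct {
	Name  string
	Class ActionClass
}

// needsHumanReview encodes the trust tiers: read operations and idempotent
// writes auto-approve, irreversible operations trigger review.
func needsHumanReview(a Action) bool {
	return a.Class == Irreversible
}

func main() {
	actions := []Action{
		{"fs.read", ReadOnly},
		{"cache.set", IdempotentWrite},
		{"db.dropTable", Irreversible},
	}
	for _, a := range actions {
		if needsHumanReview(a) {
			fmt.Printf("%s -> escalate for human review\n", a.Name)
		} else {
			fmt.Printf("%s -> auto-approve\n", a.Name)
		}
	}
}
```

The design choice is that approval is no longer binary per call: the default path auto-approves, and human attention is reserved for the small class of actions with a large blast radius.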

Semantic compression for review: The 10k-line diff problem isn’t unique to agents—code review research tackles this via change impact analysis. Agents could pre-compute intent summaries using formal specifications or property-based testing. Instead of reviewing raw diffs, humans approve high-level invariants (“maintains API compatibility,” “preserves data integrity”). Microsoft’s Copilot Workspace experiments with this by generating editable task plans before execution.
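
One way to read this is that the unit of approval becomes a named invariant plus an automated check, rather than a raw diff. The sketch below assumes hypothetical Invariant and check types; in practice the checks would be test suites, analyzers, or formal specifications:

```go
// A minimal sketch of invariant-level approval: the human approves the
// stated invariants, and the system verifies each one mechanically after
// execution, instead of asking the human to review the raw diff.
package main

import "fmt"

// Invariant pairs a human-readable claim with an automated check.
type Invariant struct {
	Claim string
	Check func() bool // placeholder; really a test suite or analyzer
}

func reviewByInvariants(approvedClaims map[string]bool, invs []Invariant) bool {
	for _, inv := range invs {
		if !approvedClaims[inv.Claim] {
			fmt.Println("not approved by human:", inv.Claim)
			return false
		}
		if !inv.Check() {
			fmt.Println("violated after execution:", inv.Claim)
			return false
		}
	}
	return true
}

func main() {
	invs := []Invariant{
		{"maintains API compatibility", func() bool { return true }},
		{"preserves data integrity", func() bool { return true }},
	}
	approved := map[string]bool{
		"maintains API compatibility": true,
		"preserves data integrity":    true,
	}
	fmt.Println("commit allowed:", reviewByInvariants(approved, invs))
}
```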

Auditable sandboxing with rollback: Non-determinism makes pre-approval contracts fragile, but post-hoc auditing with cheap rollback changes the calculus. Systems like Deno’s permission model prove that runtime permission prompts can work when paired with clear scope boundaries. For agents, execution in isolated environments with speculative checkpointing lets humans review outcomes rather than intentions, then commit or revert atomically.
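
A minimal sketch of the checkpoint-then-commit-or-revert idea, using a hypothetical in-memory Workspace as a stand-in for a sandboxed filesystem or database snapshot:

```go
// A minimal sketch of speculative execution with a checkpoint and an
// atomic commit/revert decision taken after the human reviews the outcome.
package main

import "fmt"

// Workspace holds state the agent mutates speculatively.
type Workspace struct {
	files map[string]string
}

// checkpoint copies the current state so a later revert is cheap.
func (w *Workspace) checkpoint() map[string]string {
	snap := make(map[string]string, len(w.files))
	for k, v := range w.files {
		snap[k] = v
	}
	return snap
}

func main() {
	ws := &Workspace{files: map[string]string{"main.go": "package main"}}
	snap := ws.checkpoint()

	// The agent acts inside the sandbox; the human reviews the outcome.
	ws.files["main.go"] = "package main // rewritten by agent"

	humanApproves := false // decision from the outcome review
	if !humanApproves {
		ws.files = snap // revert atomically to the checkpoint
	}
	fmt.Println(ws.files["main.go"])
}
```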

Open Questions

  • Can we develop a “differential trust calculus” that dynamically adjusts approval thresholds based on agent track record, action reversibility, and environmental context, similar to credit scoring for automation?
  • What design patterns from transactional databases (two-phase commit, optimistic concurrency) could apply to multi-step agent workflows with deferred human approval gates?
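
As a purely illustrative sketch of the second question, a two-phase-commit-flavored workflow could stage every step first and gate the final commit on a single deferred human approval; all names below are hypothetical and do not refer to an existing API:

```go
// A minimal sketch of a two-phase flow with a deferred approval gate:
// every step prepares (stages its effect), one human decision covers the
// whole workflow, and only then does everything commit.
package main

import "fmt"

// Step stages an effect and can later commit or roll it back.
type Step struct {
	Name     string
	Prepare  func() error
	Commit   func()
	Rollback func()
}

func runWithDeferredApproval(steps []Step, approve func() bool) error {
	// Phase 1: prepare every step; any failure rolls back what was staged.
	for i, s := range steps {
		if err := s.Prepare(); err != nil {
			for j := i - 1; j >= 0; j-- {
				steps[j].Rollback()
			}
			return err
		}
	}
	// Deferred approval gate: one decision for the whole workflow.
	if !approve() {
		for j := len(steps) - 1; j >= 0; j-- {
			steps[j].Rollback()
		}
		return fmt.Errorf("workflow rejected by human")
	}
	// Phase 2: commit everything.
	for _, s := range steps {
		s.Commit()
	}
	return nil
}

func main() {
	steps := []Step{{
		Name:     "write-config",
		Prepare:  func() error { fmt.Println("staged config"); return nil },
		Commit:   func() { fmt.Println("committed config") },
		Rollback: func() { fmt.Println("rolled back config") },
	}}
	_ = runWithDeferredApproval(steps, func() bool { return true })
}
```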
