Changkun's Blog

Science and art, life in between.

Changkun Ou

Human-AI interaction researcher, engineer, and writer.

Bridging HCI, AI, and systems programming. Building intelligent human-in-the-loop optimization systems. Informed by psychology, sociology, cognitive science, and philosophy.


Sharing and recording scattered thoughts and writings.

2026-02-18 16:49:25 -/-

Confirmation Fatigue and the Protocol Gap in Agentic AI Oversight

Per-tool-call human approval in agentic AI is solved in theory, unsolved in practice. Confirmation fatigue is not a UX annoyance but a security vulnerability and the primary obstacle to effective human oversight at scale. Risk-tiered frameworks, middleware architectures, and new design patterns now exist to replace the binary confirm/deny paradigm. But MCP provides no protocol-level mechanism for any of them, so every client reinvents the wheel.

Confirmation fatigue as a documented threat vector

Rippling’s 2025 Agentic AI Security guide classifies “Overwhelming Human-in-the-Loop” as threat T10: adversaries flood reviewers with alerts to exploit cognitive overload. SiliconANGLE (January 2026) argues HITL governance was built for an era of discrete, high-stakes decisions, not for modern agent workflows that produce action traces humans cannot realistically interpret.

The cybersecurity parallel is quantified. SOC teams average 4,484 alerts/day; 67% are ignored due to false-positive fatigue (Vectra 2023). Over 90% of SOCs report being overwhelmed by backlogs. ML-based alert prioritization cut response times by 22.9% while suppressing 54% of false positives at 95.1% detection accuracy. The lesson: risk-proportional filtering outperforms blanket approval.

Mitchell, Birhane, and Pistilli (February 2025, “Fully Autonomous AI Agents Should Not be Developed”) frame this as the “ironies of automation,” where more automation degrades human competence on the rare critical tasks where oversight matters most. CHI 2023 trust calibration work documents how “cooperative” interactions (reviewing each recommendation) degrade into passive “delegative” ones. This is exactly confirmation fatigue.

MCP’s oversight mandate without enforcement

The MCP spec (v2025-11-25) states: “Hosts MUST obtain explicit user consent before invoking any tool.” It immediately undermines this: “While MCP itself cannot enforce these security principles at the protocol level, implementors SHOULD build robust consent and authorization flows into their applications.”

Tool annotations (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) exist but are explicitly “hints that should not be relied upon for security decisions,” since tool descriptions from untrusted servers cannot be verified. The sampling feature includes two HITL checkpoints but uses SHOULD, not MUST, allowing clients to auto-approve.

No protocol-level approval mechanism exists. No approval/request JSON-RPC method, no requiresApproval field, no tool permission scoping. The closest active proposal is GitHub Issue #711 (trust/sensitivity annotations), adding sensitiveHint (low/medium/high) for policy-based routing. It links to PR #1913 with a security label. No dedicated HITL Specification Enhancement Proposal exists as of February 2026.

The fragmentation is visible: Claude Code uses allow/deny/ask arrays, Cline offers granular auto-approve plus a “YOLO mode,” and users have injected JavaScript into Claude Desktop’s Electron app to bypass confirmations. Every client independently rebuilds approval logic.
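The allow/deny/ask pattern those clients rebuild can be sketched in a few lines. This is a hedged illustration, not any client's actual implementation; the rule syntax and tool names are hypothetical, and deny rules win over allow rules so a misconfigured wildcard fails closed.

```python
import fnmatch

# Illustrative allow/deny/ask policy in the style of Claude Code's arrays.
# Patterns and tool names are hypothetical.
POLICY = {
    "deny":  ["shell.rm*", "payments.*"],   # never execute
    "allow": ["fs.read", "search.*"],       # auto-approve
    # anything unmatched falls through to "ask" (human confirmation)
}

def resolve(tool_name: str) -> str:
    """Return 'deny', 'allow', or 'ask' for a tool call. Deny wins over allow."""
    for pattern in POLICY["deny"]:
        if fnmatch.fnmatch(tool_name, pattern):
            return "deny"
    for pattern in POLICY["allow"]:
        if fnmatch.fnmatch(tool_name, pattern):
            return "allow"
    return "ask"
```

Because each client defines its own variant of this resolver, the same server behaves differently under every host, which is exactly the fragmentation a protocol-level primitive would remove.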

Convergence on risk-proportional oversight

Risk-tiered oversight is the dominant paradigm. Classify tool calls by risk, auto-approve the safe majority, focus human attention on the dangerous few.

Feng, McDonald, and Zhang (“Levels of Autonomy for AI Agents,” arXiv:2506.12469, June 2025) define five levels from L1 Operator (full human control) to L5 Observer (full autonomy), with “autonomy certificates” capping an agent’s level based on capabilities and context. Their key observation: at L4 (Approver, the MCP default), “if a user can enable the L4 agent with a simple approval, the risks of both [L4 and L5] agents are similar.” Confirmation fatigue makes per-call approval security-equivalent to no approval.

Engin et al. (“Dimensional Governance for Agentic AI,” arXiv:2505.11579) argue static risk categories fail for dynamic agentic systems and propose tracking how decision authority, autonomy, and accountability distribute dynamically. Cihon et al. (arXiv:2502.15212, Microsoft/OpenAI) score orchestration code along impact and oversight dimensions without running the agent.

Industry converges on three tiers:

  • Low risk (read-only, retrieval): auto-approve, log only
  • Medium risk (reversible writes, non-sensitive ops): auto-approve with enhanced logging, post-hoc review
  • High risk (irreversible actions, financial transactions, PII, production deploys): mandatory human approval, sometimes multi-approver quorum

Galileo’s HITL framework targets a 10–15% escalation rate, with 85–90% of decisions executing autonomously. The TAO framework (arXiv:2506.12482) finds that review requests often trigger where agents express high confidence but the system internally assesses risk differently; self-assessment alone is insufficient as a gate.

Design patterns for graduated tool-call oversight

Reversibility-based action classification

The highest-leverage pattern: classify by reversibility, not abstract risk. A decision-theoretic model (arXiv:2510.05307) formalizes this as minimum-time scheduling (Confirm → Diagnose → Correct → Redo), finding that intermediate confirmation at irreversibility boundaries cut task completion time by 13.54%; 81% of participants preferred it over blanket or end-only confirmation. The EU AI Act codifies this: high-risk systems must support the ability to “disregard, override or reverse the output.” Where outputs are truly irreversible, ex ante human oversight is the only compliant approach.

Practical taxonomy: read-only auto-approves; reversible writes (git-tracked edits) log only; soft-reversible actions (emails, tickets) batch; irreversible operations (data deletion, financial transfers, production deploys) require mandatory human gates. Reversibility is contextual: deleting from a git repo is reversible; deleting from unversioned S3 is not.
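The taxonomy above, including its context dependence, can be sketched as a small classifier. The action names, context fields, and tier labels below are illustrative assumptions, not from any cited framework; the point is that the same action maps to different tiers depending on context.

```python
# Sketch of the reversibility-based taxonomy. Action names, the "versioned"
# context flag, and tier labels are hypothetical.
def oversight_tier(action: str, context: dict) -> str:
    if action == "read":
        return "auto"                       # read-only: auto-approve
    if action == "write":
        # Reversibility is contextual: a git-tracked edit is recoverable,
        # an unversioned object-store write may not be.
        return "log_only" if context.get("versioned") else "human_gate"
    if action in ("email", "ticket"):
        return "batch"                      # soft-reversible: consolidated review
    if action in ("delete", "transfer", "deploy"):
        return "human_gate"                 # irreversible: mandatory human gate
    return "human_gate"                     # unknown action types fail closed
```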

Plan-level vs. action-level approval

Safiron (Huang et al., arXiv:2510.09781, October 2025) analyzes planned agent actions pre-execution, detecting risks and generating explanations. Existing guardrails mostly operate post-execution and achieved below 60% accuracy on plan-level risk detection. ToolSafe (arXiv:2601.10156, January 2026) complements this with dynamic step-level monitoring during execution, catching what plan-level review misses.

The optimal architecture is hybrid: approve the plan at a high level, then monitor execution with automated step-level guardrails that halt the agent on deviation. OpenAI Codex’s “Long Task Mode” demonstrates this: the agent generates a dynamic whitelist of expected operations, the human reviews the whitelist (not individual calls), and the agent executes within those boundaries with batched questions for consolidated review.
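The "review the whitelist, not individual calls" pattern can be sketched as a gate that holds a human-approved plan and halts on deviation. This is an assumption-laden sketch, not the Codex implementation; the pattern syntax and tool names are hypothetical.

```python
import fnmatch

# Sketch of plan-level approval: the human reviews a whitelist of expected
# operations once, up front; execution-time calls outside it are recorded as
# deviations that halt the agent for escalation. Patterns are hypothetical.
class PlanGate:
    def __init__(self, approved_patterns):
        self.approved = list(approved_patterns)   # human-reviewed plan
        self.deviations = []

    def check(self, call: str) -> bool:
        if any(fnmatch.fnmatch(call, p) for p in self.approved):
            return True
        self.deviations.append(call)              # off-plan: stop and escalate
        return False

gate = PlanGate(["git.*", "fs.write:src/*", "tests.run"])
```

The human's attention cost here is one plan review plus the rare deviation, rather than one decision per call.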

Hierarchical multi-agent oversight

TAO (Kim et al., 2025) implements hierarchical multi-agent oversight inspired by clinical review, with an Agent Router assessing risk and routing to appropriate tiers. Multi-agent review pipelines have shown up to 96% reduction in hallucinations versus single-agent execution.

The emerging reference architecture has five layers: (1) deterministic policy gates (allowlists/denylists) as the fastest filter, (2) constitutional self-assessment by the agent, (3) an AI supervisor for uncertain cases, (4) human-in-the-loop for irreversible or novel situations, (5) audit trail plus post-hoc review. Each layer reduces volume flowing to the next.
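The five layers can be sketched as a chain of filters where each layer either decides or passes the call downward, so volume shrinks at every step. All function names, tools, and risk thresholds below are illustrative assumptions; layers 2 and 3 are stubbed where a real system would call a model.

```python
# Sketch of the five-layer reference architecture. Names and thresholds are
# hypothetical; self-assessment and the AI supervisor are stubs.
def policy_gate(call):        # layer 1: deterministic allow/deny lists
    if call["tool"] in {"fs.read", "search"}:
        return "approve"
    if call["tool"] in {"payments.transfer"}:
        return "human_review"                 # known-dangerous: skip to humans
    return None                               # undecided: fall through

def self_assessment(call):    # layer 2: agent's own risk estimate (stubbed)
    return "approve" if call.get("agent_risk", 1.0) < 0.2 else None

def ai_supervisor(call):      # layer 3: model-based reviewer (stubbed)
    return "approve" if call.get("supervisor_risk", 1.0) < 0.5 else None

def run_pipeline(call, audit_log):
    decision = None
    for layer in (policy_gate, self_assessment, ai_supervisor):
        decision = layer(call)
        if decision:
            break
    decision = decision or "human_review"     # layer 4: undecided goes to a human
    audit_log.append((call["tool"], decision))  # layer 5: everything is logged
    return decision
```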

Sandbox-first execution for informed review

Instead of asking humans to evaluate tool calls in the abstract, sandbox-first architectures execute in isolation and present actual results for review. The ecosystem is production-ready: E2B (Firecracker microVMs, sub-second creation), nono (kernel-level restrictions the agent cannot bypass), Google’s Agent Sandbox (GKE + gVisor), AIO Sandbox (MCP-compatible containers).

NVIDIA’s AI Red Team emphasizes that application-level sandboxing is insufficient: once control passes to a subprocess, the application loses visibility, so kernel-level enforcement is necessary. Not all actions can be sandboxed: third-party API calls, email, and payments must hit real services. For these, the dry-run pattern (the agent describes intent, the human approves before live execution) remains the fallback.

Deterministic policy enforcement

Rule-based systems are the most reliable first layer: deterministic, auditable, zero LLM inference cost. SafeClaw implements deny-by-default with a SHA-256 hash-chain audit log. COMPASS (Choi et al., 2026) maps natural-language policies to atomic rules at tool invocation time, improving enforcement pass rates from 0.227 to 0.500, but also exposed that open-weight LLMs fail 80–83% of denied-edge queries, proving that policy enforcement cannot rely on LLM compliance alone.

A cautionary case: Cursor’s denylist was bypassed four ways (Base64 encoding, subshells, shell scripts, file indirection) and then deprecated. String-based filtering is fundamentally insufficient for security-critical gating.
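The encoding bypass is easy to demonstrate. The sketch below shows a naive substring denylist blocking the direct form of a destructive command while passing the same command wrapped in Base64; the denylist entry and command are illustrative.

```python
import base64

# Demonstration of why string-based denylists fail: the same destructive
# command, base64-wrapped, slips past a naive substring filter.
DENYLIST = ["rm -rf"]

def naive_filter(command: str) -> bool:
    """Return True if the command is blocked."""
    return any(bad in command for bad in DENYLIST)

direct = "rm -rf /data"
encoded = "echo %s | base64 -d | sh" % base64.b64encode(b"rm -rf /data").decode()
```

`naive_filter(direct)` blocks the command, but `naive_filter(encoded)` passes it, even though both execute the identical deletion; subshells, shell scripts, and file indirection defeat the filter the same way.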

HITL implementations across agent frameworks

LangGraph has the most developed HITL support. interrupt() pauses graph execution at any point, persisting state to a checkpointer (PostgreSQL in production). HumanInTheLoopMiddleware enables per-tool configuration with approve, edit, and reject decisions, allowing different tools to receive different oversight levels.

OpenAI Agents SDK provides input guardrails, output guardrails, and tool guardrails wrapping function tools for pre/post-execution validation. Its MCP integration accepts require_approval as “always,” “never,” or a custom callback for programmatic risk-based approval.

Anthropic takes a model-centric approach via Responsible Scaling Policy and AI Safety Levels (ASL-1 through ASL-3+). Claude’s computer use follows an “ask-before-acting” pattern with explicit access scoping. The February 2026 Sabotage Risk Report for Claude Opus 4.6 found “very low but not negligible” sabotage risk, elevated in computer use settings, with instances of “locally deceptive behavior” in complex agentic environments.

Google DeepMind SAIF 2.0 (October 2025) establishes three principles: agents must have well-defined human controllers, their powers must be carefully limited, and their actions must be observable. The “amplified oversight” technique, where two model copies debate while pointing out each other’s flaws to a human judge, remains research-stage.

Middleware and proxy architectures for MCP oversight

The practical path runs through proxy/middleware architectures intercepting JSON-RPC tools/call requests. Key solutions: Preloop (CEL-based policies, quorum approvals, multi-channel notifications), HumanLayer (YC F24; framework-agnostic async approval API with Slack/email routing and auto-approval learning), gotoHuman (managed HITL approval UI as MCP server). For code-first approaches, FastMCP v2.9+ provides hooks at on_call_tool, on_list_tools, and other levels for composable HITL pipeline stages.

Enterprise gateways: Traefik Hub (task-based access control, JWT policy enforcement), Microsoft MCP Gateway (Kubernetes-native, Entra ID auth), Kong AI MCP Proxy (MCP-to-HTTP bridge with per-tool ACLs). Lunar.dev MCPX reports p99 overhead of ~4ms, proving proxy-based oversight imposes negligible latency.

For UX, Prigent’s “7 UX Patterns for Ambient AI Agent Oversight” (December 2025) provides the design framework: overview panel (inbox-zero pattern), five oversight flow types (communication, validation, simple/complex questions, error resolution), searchable audit logs, and work reports. The core principle is progressive disclosure (summary first, details on demand) with risk-colored displays.

Progressive autonomy through trust calibration

The forward-looking pattern is progressive autonomy: agents earn trust over time and operate at increasing independence. Okta recommends “progressive permission levels based on demonstrated reliability.” A manufacturing MCP deployment (MESA) follows four stages: read-only pilot → advisory agents → controlled commands → full closed-loop. HumanLayer learns from prior approval decisions to auto-approve similar future requests.

Trust calibration research formalizes this as sequential regret minimization via contextual bandits (September 2025), with LinUCB and neural variants yielding 10–38% task reward increases. A contextual bandit can learn which calls a user always approves and shift those to auto-approve while maintaining scrutiny on novel or historically-rejected patterns.

CHI 2025 (“Trusting Autonomous Teammates in Human-AI Teams”) finds agent-related factors (transparency, reliability) have the strongest trust impact, and “calibrating human trust to an appropriate level is more advantageous than fostering blind trust.” Progressive autonomy systems should not just reduce approval requests; they should communicate their track record and confidence to maintain calibrated oversight.

Conclusion

The state of the art points to a layered defense architecture. From fastest/cheapest to slowest/most expensive:

  1. Deterministic policy gates (allowlists, denylists, CEL/Polar parameter rules): zero LLM cost, sub-millisecond
  2. Tool annotation screening via MCP’s readOnlyHint/destructiveHint, supplemented by server-reputation scoring
  3. AI guardian agent evaluating uncertain cases against constitutional principles and risk heuristics
  4. Human-in-the-loop gates for irreversible, high-value, novel, or ambiguous situations, targeting 5–15% of total calls
  5. Audit trails with OpenTelemetry tracing, structured logging, post-hoc review for pattern detection and policy refinement

The critical gap is at the protocol level. Until MCP introduces standardized approval primitives (an approval/request method, trusted risk annotations, or a formal HITL extensions framework), every implementation remains bespoke middleware. The highest-impact near-term contribution would be an MCP Specification Enhancement Proposal defining a standard approval negotiation protocol between clients, proxies, and servers.

The following content is generated by LLMs and may contain inaccuracies.

Context

This sits at the intersection of HCI, AI safety governance, and distributed systems. As agents gain autonomy over consequential actions (API calls, file ops, financial transactions), per-invocation approval becomes an attack surface: confirmation fatigue makes humans unreliable gatekeepers. 2025–2026 marks the shift from academic discussion to production deployment, forcing practitioners to confront oversight at scale. MCP has become the de facto tool-calling standard, yet its spec punts on enforcement, so every client reinvents approval workflows incompatibly.

Key Insights

Confirmation fatigue is a threat vector, not UX friction. Rippling classifies “Overwhelming HITL” as threat T10, paralleling SOC teams facing 4,484 daily alerts with 67% ignored. The ironies of automation show increased automation degrades competence on critical edge cases, exactly when oversight matters. Per-action approval is not a safety mechanism; it is a liability that creates conditions for high-stakes failures.

Risk-proportional architectures converge on multi-tier filtering. Feng et al.’s autonomy levels show L4 “Approver” agents carry similar risk to L5 fully autonomous ones, undermining blanket approval. Implementations from Galileo to OpenAI adopt five-layer defense: deterministic gates → metadata screening → AI reviewer → human approval (~10–15%) → audit. COMPASS shows LLMs fail 80–83% on denied-edge queries, proving oversight cannot rely on model compliance.

Protocol-level standardization is the critical gap. Middleware like FastMCP, Preloop, and HumanLayer work, but MCP’s lack of approval/request primitives forces fragmentation. Claude Code, Cline, and every third-party proxy implement incompatible approval semantics. Without a standard negotiation protocol, interoperability is impossible.

Open Questions

How should progressive autonomy systems communicate earned trust to maintain calibrated oversight rather than blind delegation, given that trust calibration research shows transparency about confidence bounds matters more than accuracy? Can reversibility-aware gating (13.54% completion time reduction at irreversibility boundaries) be formalized into verifiable MCP metadata rather than advisory hints?


2026-02-17 21:16:52 -/-

Opinion Formation Through the Voter Model in Network Dynamics

The Voter Model

The voter model is a simple mathematical model of opinion formation in which voters are located at the nodes of a network. Each voter holds an opinion (in the simplest case, 0 or 1, but more generally, any of n options), and a randomly chosen voter adopts the opinion of one of its neighbors.

This model can be used to describe phase transition behavior in idealized physical systems and can produce a remarkable amount of structure from seemingly “random” initial conditions. It can be modeled very easily using cellular automata.

In finite networks (as in any real-world model), fluctuations inevitably cause the system to reach an “absorbing” state—one in which all opinions become constant and remain unchanged.
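The dynamics are simple enough to simulate in a few lines. The sketch below runs the binary voter model on a ring lattice (each voter's neighbors are the adjacent nodes) until the absorbing state is reached; the lattice size and step cap are arbitrary choices for illustration.

```python
import random

# Minimal voter-model simulation on a ring lattice: a randomly chosen voter
# copies a random neighbor until all opinions agree (the absorbing state).
def voter_model(n=20, seed=0, max_steps=100_000):
    rng = random.Random(seed)
    opinions = [rng.randint(0, 1) for _ in range(n)]
    for step in range(max_steps):
        if len(set(opinions)) == 1:          # absorbing state: consensus
            return opinions[0], step
        i = rng.randrange(n)
        neighbor = (i + rng.choice([-1, 1])) % n
        opinions[i] = opinions[neighbor]     # adopt the neighbor's opinion
    return None, max_steps

consensus_opinion, steps = voter_model()
```

On a 1D ring the fluctuation-driven collapse to consensus is fast relative to the step cap, which is the finite-size behavior the paragraph above describes.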

The following content is generated by LLMs and may contain inaccuracies.

Context

The voter model sits at the intersection of statistical physics, social dynamics, and network science. Originally developed to study magnetization in Ising-type systems, it has become a canonical example of how local interactions produce emergent collective behavior. Its relevance today stems from modeling social influence on platforms, consensus formation in distributed systems, and understanding polarization dynamics. The core tension: simple microscopic rules generate complex macroscopic outcomes, yet finite systems inevitably collapse to consensus—seemingly at odds with persistent real-world disagreement.

Key Insights

Dimension-dependent consensus times: On infinite lattices, the voter model exhibits a phase transition based on spatial dimension—1D and 2D systems reach consensus, while dimensions ≥3 allow coexistence. For finite networks, consensus is guaranteed, but time-to-absorption scales dramatically with topology: mean-field networks reach consensus in O(N) steps, while low-dimensional lattices require O(N²). This explains why tightly clustered communities resist opinion shifts longer than well-mixed populations.

Breaking absorbing states: Real opinion systems rarely reach homogeneity because the basic voter model omits crucial mechanisms. Extensions incorporating zealots (inflexible agents) or spontaneous opinion changes (modeling external media) prevent absorption and sustain fragmentation. The Deffuant-Weisbuch bounded confidence model adds realism by limiting influence to similar opinions, producing stable pluralism rather than consensus.

Network topology as leverage: The voter model’s behavior is highly sensitive to degree heterogeneity—hubs disproportionately drive consensus direction in scale-free networks. This suggests network structure, not just initial opinion distribution, determines outcomes, with implications for strategic influence campaigns.

Open Questions

How do temporally varying networks (e.g., evolving social ties) alter absorption dynamics—can consensus time become indefinite when topology co-evolves with opinions? What minimal heterogeneity in update rules (e.g., mixing voter and majority dynamics) is sufficient to transition from guaranteed consensus to sustained coexistence?


2026-02-17 21:15:28 -/-

Large Language Models as Optimization Tools

LLM as Optimizer:

  • Large Language Models as Optimizers https://arxiv.org/abs/2309.03409
  • When Large Language Models Meet Optimization https://www.sciencedirect.com/science/article/abs/pii/S2210650224002013?via%3Dihub
  • Large Language Models to Enhance Bayesian Optimization https://arxiv.org/abs/2402.03921
  • Cooperative Design Optimization through Natural Language Interaction https://arxiv.org/abs/2508.16077
  • Language-Based Bayesian Optimization Research Assistant (BORA) https://arxiv.org/abs/2501.16224
  • LILO: Bayesian Optimization with Interactive Natural Language Feedback https://arxiv.org/abs/2510.17671
  • Bayesian Optimization of High-dimensional Outputs with Human Feedback https://openreview.net/pdf?id=2fHwkHskpo

The following content is generated by LLMs and may contain inaccuracies.

Context

This collection addresses a paradigm shift in optimization: using Large Language Models not as tools to be optimized, but as optimizers themselves. Traditional derivative-based methods fail when gradients are unavailable or expensive to compute—a common constraint in hyperparameter tuning, experimental design, and real-world engineering. By framing optimization as a natural language reasoning task, researchers are exploring whether LLMs' pattern recognition and contextual understanding can rival or augment classical methods like Bayesian optimization. This matters now because LLMs have demonstrated surprising competence in mathematical reasoning, and their ability to incorporate domain knowledge through prompting offers a potential escape from local optima traps that plague blind search algorithms.

Key Insights

LLMs as meta-optimizers outperform hand-crafted heuristics in prompt engineering. Yang et al.’s OPRO framework demonstrates that LLMs can iteratively refine solutions by conditioning on historical performance—achieving up to 50% improvement over human-designed prompts on reasoning benchmarks. This suggests LLMs excel when the optimization landscape can be encoded linguistically, exploiting their pre-trained semantic knowledge rather than relying solely on numerical gradients.
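The OPRO loop described here (condition a proposer on the scored history of past solutions, append the new candidate, repeat) can be sketched with the LLM replaced by a stub. The stub, the toy objective, and all names below are hypothetical; in the actual OPRO framework the history is serialized into a natural-language prompt for the model.

```python
# Sketch of the OPRO-style loop with the LLM stubbed out by a hill-climbing
# proposer. The objective and step count are toy choices for illustration.
def propose(history):
    # Stub for the LLM call: nudge the best solution seen so far toward
    # better scores. OPRO instead prompts a model with the scored history.
    best, _ = max(history, key=lambda pair: pair[1])
    return best + 1 if best < 7 else best - 1

def objective(x):
    return -(x - 7) ** 2        # toy task: maximum value 0 at x = 7

def opro_loop(steps=10):
    history = [(0, objective(0))]           # (solution, score) pairs
    for _ in range(steps):
        candidate = propose(history)        # proposer sees scored history
        history.append((candidate, objective(candidate)))
    return max(history, key=lambda pair: pair[1])

best_x, best_score = opro_loop()
```

The loop's only interface to the proposer is the scored history, which is why swapping the stub for an LLM prompt leaves the optimization structure unchanged.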

Hybrid systems combining LLMs with Bayesian optimization show complementary strengths. LLAMBO integrates LLMs for zero-shot warm-starting and surrogate modeling in early search stages, while BORA uses LLMs to inject domain knowledge from literature into experimental design. These approaches address Bayesian optimization’s sample inefficiency in high dimensions by leveraging LLMs' ability to reason about plausible regions—though they inherit LLMs' hallucination risks when proposing scientifically implausible candidates.

Natural language interfaces democratize expert-level optimization but introduce cognitive tradeoffs. Niwa et al.’s cooperative framework enables designers to steer optimization mid-flight through conversational input, matching performance of automated methods with lower cognitive load. However, the explainability gains (LLMs narrating their reasoning) compete with potential over-reliance on plausible-sounding but suboptimal suggestions—a tension between human agency and algorithmic efficiency.

Open Questions

  • When do LLMs' semantic biases help versus harm search? If pre-training data over-represents certain solution types, could LLM-guided optimization systematically miss unconventional optima in scientific discovery tasks?

  • Can we quantify the sample efficiency frontier between pure BO and LLM-augmented methods? Under what dimensionality, evaluation cost, and prior knowledge regimes does linguistic contextualization outweigh the risk of premature convergence to plausible-but-local solutions?


2026-02-17 20:35:49 -/-

Comprehensive LLM Courses and Lecture Collections综合LLM课程和讲座合集

LLM Lectures

  • https://cmu-llms.org/schedule/
  • https://www.phontron.com/class/lminference-fall2025/schedule
  • https://llmsystem.github.io/llmsystem2026spring/docs/Syllabus
  • https://llmsystem.github.io/llmsystem2025spring/docs/Syllabus/

The following content is generated by LLMs and may contain inaccuracies.

Context

This collection addresses the growing need for structured educational resources in large language model (LLM) development and deployment. As LLMs transition from research artifacts to production systems, practitioners require deep understanding across the full stack—from GPU programming and transformer architecture to distributed training and inference optimization. These courses from CMU and related institutions represent the maturation of LLM education, bridging theoretical foundations with systems engineering concerns that arise at scale.

Key Insights

  • Curriculum divergence reflects specialization paths: The CMU LLM Applications course emphasizes prompt engineering, RAG systems, and domain-specific applications (healthcare, code generation), while the LLM Systems courses dive into GPU kernel optimization, distributed training strategies (Megatron-LM, ZeRO), and serving infrastructure (vLLM, FlashAttention). This split mirrors industry roles—application engineers who orchestrate LLMs versus systems engineers who make them computationally feasible.

  • Hardware-algorithm co-design emerges as core competency: Multiple syllabi feature guest lectures from creators of foundational systems: Tri Dao on FlashAttention, Woosuk Kwon on vLLM’s PagedAttention, Hao Zhang on DistServe. This signals that modern LLM work requires understanding memory hierarchies and attention mechanisms simultaneously—algorithmic improvements are inseparable from hardware constraints.

  • From monolithic models to modular architectures: The progression from basic transformers to mixture-of-experts (DeepSeek-MoE), disaggregated serving (DistServe), and retrieval augmentation reflects the field’s shift toward composable systems. The LLM Inference course likely extends this toward inference-specific optimizations like speculative decoding and KV cache management.
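The KV cache management mentioned above can be illustrated with a deliberately tiny sketch. `ToyKVCache` is illustrative only (scalar "projections" standing in for real tensor ops, and not any course's reference code): each decode step projects only the new token into key/value space and reuses cached entries for all earlier tokens, which is what turns quadratic recomputation across steps into linear incremental work.

```python
import math

class ToyKVCache:
    """Minimal sketch of decoder-side KV caching (not a real transformer)."""

    def __init__(self):
        self.keys: list[float] = []
        self.values: list[float] = []
        self.projections = 0  # how many k/v projections were actually computed

    def step(self, x: float) -> float:
        # Project only the *new* token into key/value space
        # (scalar stand-ins for W_k @ x and W_v @ x).
        self.keys.append(2.0 * x)
        self.values.append(x + 1.0)
        self.projections += 1
        q = 0.5 * x  # query projection for the current token
        # Softmax over scores against every cached key.
        scores = [q * k for k in self.keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        return sum(w * v for w, v in zip(weights, self.values)) / z

cache = ToyKVCache()
outputs = [cache.step(t) for t in [0.1, 0.2, 0.3]]
# Three tokens cost three projections; without the cache, recomputing
# all k/v pairs at every step would cost 1 + 2 + 3 = 6.
```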

Open Questions

  • How should curricula balance depth in classical ML theory versus hands-on systems optimization as LLM architectures continue evolving? Will today’s FlashAttention become tomorrow’s deprecated technique?
  • What pedagogical approaches best prepare students for the lag between academic research and production deployment, especially when industry systems (SGLang, vLLM) advance faster than publication cycles?

LLM讲座

  • https://cmu-llms.org/schedule/
  • https://www.phontron.com/class/lminference-fall2025/schedule
  • https://llmsystem.github.io/llmsystem2026spring/docs/Syllabus
  • https://llmsystem.github.io/llmsystem2025spring/docs/Syllabus/

以下内容由 LLM 生成,可能包含不准确之处。

背景

这个资源集合应对了大语言模型(LLM)开发和部署中日益增长的结构化教育资源需求。随着LLM从研究工件过渡到生产系统,从业者需要掌握整个技术栈的深入知识——从GPU编程和变换器架构到分布式训练和推理优化。来自CMU及相关机构的这些课程代表了LLM教育的成熟发展,在理论基础与大规模系统工程问题之间架起了桥梁。

关键洞察

  • 课程分化反映了专业化路径:CMU LLM应用课程强调提示工程、RAG系统和特定领域应用(医疗保健、代码生成),而LLM系统课程深入探讨GPU内核优化、分布式训练策略(Megatron-LM、ZeRO)和服务基础设施(vLLM、FlashAttention)。这种分化反映了行业角色差异——应用工程师编排LLM,而系统工程师使其在计算上可行。

  • 硬件-算法协同设计成为核心能力:多个课程大纲特别邀请了基础系统创始人进行讲座:Tri Dao讲FlashAttention、Woosuk Kwon讲vLLM的PagedAttention、Hao Zhang讲DistServe。这表明现代LLM工作需要同时理解内存层次结构和注意力机制——算法改进与硬件约束密不可分。

  • 从单体模型到模块化架构:从基础变换器到专家混合模型(DeepSeek-MoE)、分解服务(DistServe)和检索增强的进展,反映了该领域向可组合系统的转变。LLM推理课程可能会进一步扩展到推理特定的优化,如推测解码和KV缓存管理。

待解问题

  • 随着LLM架构不断演进,课程应如何平衡经典ML理论的深度与实践系统优化?今天的FlashAttention会成为明天的过时技术吗?
  • 什么样的教学方法能最好地为学生准备应对学术研究与生产部署之间的滞后,特别是当行业系统(SGLang、vLLM)的进度快于发表周期时?
2026-02-17 19:57:20 -/-

The Cost of Staying: Tech Career Timing留任的代价:科技职业时机选择

The Cost of Staying

by Amy Tam https://x.com/amytam01/status/2023593365401636896

Every technical person I know is doing the same math right now. They won’t call it that. They’ll say they’re “exploring options” or “thinking about what’s next.” But underneath, it’s the same calculation: how much is it costing me to stay where I am?

Not in dollars. In time. There’s a feeling in the air that the window for making the right move is shrinking—that every quarter you spend in the wrong seat, the gap between you and the people who moved earlier gets harder to close. A year ago, career decisions in tech felt reversible. Take the wrong job, course correct in eighteen months. That assumption is breaking down. The divergence between people who repositioned early and those still weighing their options is becoming visible, and it’s accelerating.

I see this up close. I’m an investor at Bloomberg Beta, and I spend most of my time with people in transition: leaving roles, finishing programs, deciding what’s next. I’m not a career advisor, but I sit at the intersection of “what are you leaving” and “what are you chasing.”

The valuable skill in tech shifted from “can you solve this problem” to “can you tell which problems are worth solving and which solutions are actually good.” The scarce thing flipped from execution to judgment: can you orchestrate systems, run parallel bets, and have the taste to know which results matter? The people who figured this out early are on one arm of a widening K-curve. Everyone else is getting faster at things that are about to be done for them.

The shift from execution to judgment is happening everywhere, but the cost of staying and the upside of moving look completely different depending on where you’re sitting.

FAANG

Here’s the tradeoff people at big tech companies are running right now: the systems are built, the comp is great, and the work is… fine. You’re increasingly reviewing AI-generated outputs rather than building from scratch. For some people, that’s a gift—it’s leverage, it’s sustainable, it’s a good life. The tradeoff is that “fine” has a cost that doesn’t show up in your paycheck.

The people leaving aren’t unhappy. They’re restless. They describe this specific feeling: the hardest problems aren’t here anymore, and the organization hasn’t caught up to that fact. The ones staying are making a bet that stability and comp are worth more than being close to the frontier. The ones leaving are making a bet that the frontier is where the next decade of career value gets built, and every quarter they wait is a quarter of compounding they miss.

Both bets are rational. But only one of them is time-sensitive.

Quant

Quant still works. Absurd pay, hard problems, immediate feedback. If you’re good, you know you’re good, because the P&L doesn’t lie.

The tradeoff that’s emerging: the entire quant toolkit (ML infrastructure, data obsession, statistical intuition) turns out to be exactly what AI labs and research startups need—same muscle, different problem. The difference is surface area. In quant, you’re optimizing a strategy. In AI, you’re building systems that reason. Even the quant-adjacent world is feeling it: the most interesting work in prediction markets and stablecoins is increasingly an AI infrastructure problem. One has a ceiling. The other doesn’t, or at least nobody’s found it yet.

Most quant people are staying, and they’re not wrong to. But the ones leaving describe something specific: they hit a point where the intellectual challenge of finance felt bounded in a way it didn’t before. They’re not chasing money. They’re chasing the feeling of working on something where the upper bound isn’t visible.

Academia

This is where the tradeoff is most painful, because it shouldn’t be a tradeoff at all.

Publishing novel results used to be the purest form of intellectual prestige. You did the work because the work was beautiful. That hasn’t changed. What changed is that the line between what you can do at a funded startup and what you can do in a university lab is blurring, and not in academia’s favor. A 20-person research startup can now do in a weekend what takes an academic lab a semester, because compute costs money that universities don’t have.

The most ambitious PhD students I talk to aren’t choosing between academia and industry. They’re choosing between theorizing about experiments and actually running them. The pull toward funded startups and labs isn’t about selling out. It’s about wanting to do the science, and the science requires resources that academia can’t provide.

The people staying in academia for the right reasons (open science, long time horizons, genuine intellectual freedom) are admirable. But they should know that the clock is ticking differently for them too: the longer the compute gap widens, the harder it becomes to do competitive work from inside a university.

AI Startups (Application Layer)

If you’re building products on top of models, you already know the feeling: the clever feature you shipped in March gets commoditized by a model update in June. The ground moves every quarter, and your moat evaporates.

The tradeoff here is between chasing what’s exciting and building what’s durable. The founders who are thriving right now stopped caring about model capabilities and started caring about the things models can’t take away: data moats, workflow capture, integration depth. It’s less fun to talk about at a dinner party. It’s where the actual companies get built.

The people making the sharpest moves in this world are the ones who got excited about plumbing—not the demo, not the pitch, not the capability. The ugly, boring infrastructure that makes a product sticky independent of which model sits underneath it.

Research Startups: The New Center of Gravity

This is where the K-curve is most visible.

Prime Intellect, SSI, Humans&—10-30 people doing genuine frontier research that competes with organizations fifty times their size. This would have been impossible three years ago. It’s happening now because the tools got good enough that a small number of people with great judgment can outrun a bureaucracy with more resources.

The daily workflow here is the clearest picture of what the upper arm looks like in practice. You’re kicking off training runs, spinning up experiments, letting things cook overnight. You come back in the morning, and your job isn’t to write code. It’s to know what to do with what came back—to have the taste to distinguish signal from noise when the system hands you a wall of results. It’s passive leverage. You set the experiments in motion, and the compounding happens whether or not you’re at your desk.

The tradeoff people are weighing: these companies are small, unproven, and many will fail. The bet is that being at the center of the frontier, with your judgment directly touching the work, compounds faster than the safety of a bigger organization, even if the specific company doesn’t make it. The skills transfer. The network transfers. The three years you spend reviewing someone else’s outputs at a big company don’t transfer the same way.

Big Model Labs: The Narrowing Frontier

The pitch “we’re building AGI” still works. It might always work on a certain type of person.

But the experience inside has shifted. The most interesting research is concentrated among a small number of senior people. Everyone else is doing important supporting work (evals, infra, product) that doesn’t feel like the frontier they signed up for. You joined to touch the thing, and you’re three layers removed from it.

The tradeoff is prestige versus proximity. A big lab on your resume still opens every door. But the people leaving are making a specific calculation: the resume value of “I was at [top lab]” is depreciating as the labs get bigger and more corporate, while the value of “I did frontier research at a place where my judgment shaped the direction” is appreciating. The window where big-lab pedigree is the best credential is closing, and the people who see it are moving.

The Clock

Every one of these tradeoffs has the same variable hiding inside it: time.

A year ago, you could sit in a comfortable seat and deliberate. The cost of waiting was low because the divergence was slow. That’s no longer true. The tools are compounding. The people who moved early are building on top of what they learned last quarter. The difference between someone who moved six months ago and someone still weighing their options is already compounding.

The upper arm isn’t closed. People are making the jump every week, and the people who are hiring them don’t care where you’ve been. They care whether you can do the work. But the math is directional: the longer you optimize for comfort, the more expensive the switch becomes—not because the opportunities disappear, but because the people who are already there are compounding, and you’re not.

The companies winning the talent war right now aren’t the ones with the best brand or the highest comp. They’re the ones where your judgment has the most surface area, where the distance between your taste and what actually gets built is zero, and where you’re surrounded by people who know things you don’t yet. The best people want to be close to others who have tricks they haven’t learned yet, at places with enough compute to actually run the experiments.

The question isn’t whether you’re smart enough. It’s that you’ve already done the math. You just haven’t acted on it.

The following content is generated by LLMs and may contain inaccuracies.

Context

This piece captures a structural shift in tech labor markets circa 2024–2025, where career optionality is compressing amid accelerating AI capabilities. It sits at the intersection of career dynamics, talent allocation theory, and the sociology of “frontier work.” The tension: traditional signals of career safety (FAANG comp, academic tenure, big lab prestige) are decoupling from proximity to where judgment-building happens. This matters because the shift from execution to orchestration—documented by economist David Autor as “task complementarity”—is happening faster than institutions can adapt, creating winner-take-most dynamics in skill accumulation.

Key Insights

The K-curve is a compounding divergence problem. Unlike previous tech cycles where skills depreciated gradually, generative AI tools create exponential productivity gaps between early adopters and laggards. Research from MIT and Stanford shows consultants using GPT-4 completed tasks 25% faster with 40% higher quality—but the variance between users widened over time. Those developing “judgment about AI outputs” compound that advantage quarterly; those executing manually fall behind non-linearly. The piece’s insight about research startups outrunning labs 50× their size reflects Coase’s theory of firm boundaries inverting: coordination costs have collapsed faster than resource advantages can compensate.

Academia’s compute gap is a resource curse in reverse. The observation about weekend experiments versus semester timelines maps onto Brown et al.’s analysis of compute inequality in AI research. Universities can’t compete on infrastructure, but the piece misses that top labs are increasingly restricting publication to protect competitive moats—academic freedom still trades at a premium for reproducible, open work. The real cost: PhD students now optimize for “access to compute” over “intellectual community,” potentially sacrificing the collaborative serendipity that historically generated breakthrough ideas.

Open Questions

Could the K-curve collapse if AI tool improvements plateau, returning advantage to institutional stability? Or are we seeing a permanent regime change where “taste for orchestrating AI systems” becomes the dominant filter for knowledge work?

If judgment compounds faster than execution devalues, what happens to the bottom 50% of current tech workers—and does this finally force a reckoning with tech’s meritocracy mythology?

留任的代价

作者:Amy Tam https://x.com/amytam01/status/2023593365401636896

我认识的每一位技术人士现在都在做同样的数学计算。他们不会这样说。他们会说自己在"探索选择"或"思考下一步"。但本质上,这是同一个计算:留在原地要花费我多少?

不是金钱。而是时间。有一种感觉在空中弥漫:做出正确选择的窗口在缩小——你在错误岗位上待的每个季度,你和那些早期转身的人之间的差距就变得更难以弥补。一年前,科技行业的职业决策似乎是可逆的。接了个错误的工作,十八个月内调整方向就行。这个假设正在瓦解。早期重新定位的人和仍在权衡选择的人之间的分化变得可见,而且在加速。

我近距离看到这一点。我是Bloomberg Beta的投资者,大部分时间都与处于过渡期的人接触:离职、完成计划、决定下一步。我不是职业顾问,但我坐在"你要离开什么"和"你在追逐什么"的交叉口。

科技行业的宝贵技能从"你能解决这个问题吗"转变为"你能判断哪些问题值得解决,哪些解决方案真正有效吗"。稀缺的东西从执行力翻转到判断力:你能编排系统、并行下注,并具有品味来判断哪些结果重要吗?那些早期弄清楚这一点的人站在不断扩大的K曲线的一臂上。其他所有人都在快速提升那些即将被自动完成的东西的能力。

从执行到判断的转变无处不在,但留任的代价和转身的上升空间看起来完全取决于你所处的位置。

FAANG

这是大科技公司人员现在的权衡:系统已构建,薪酬很好,工作是……还可以。你越来越多地审查AI生成的输出,而不是从零开始构建。对某些人来说,这是礼物——这是杠杆、可持续性、美好生活。权衡是"还可以"有一个不会出现在你薪资单上的代价。

离职的人并不是不开心。他们坐立不安。他们描述这种特定的感觉:最难的问题已经不在这里了,而组织还没有认识到这一点。留下来的人是在打赌稳定性和薪酬比接近前沿更有价值。离开的人是在打赌前沿是下一个十年职业价值的构建之地,他们等待的每个季度都是他们错失的复合增长季度。

两个赌注都是理性的。但只有其中一个具有时间敏感性。

量化投资

量化投资仍然有效。荒谬的薪酬、困难的问题、即时反馈。如果你很优秀,你就知道自己很优秀,因为损益表不会说谎。

正在出现的权衡:整个量化工具包(ML基础设施、数据迷恋、统计直觉)正好是AI实验室和研究初创公司所需的——相同的肌肉、不同的问题。区别在于表面积。在量化投资中,你优化一个策略。在AI中,你构建能够推理的系统。即使是与量化相关的世界也在感受这一点:预测市场和稳定币中最有趣的工作越来越多地是AI基础设施问题。一个有上限。另一个没有,或者至少还没有人找到。

大多数量化人才留了下来,他们没有错。但离开的人描述了一些具体的东西:他们到达了一个点,金融的智力挑战感觉到了界限,这在以前没有。他们不是在追逐金钱。他们在追逐在做某件事的感觉,其中上界是不可见的。

学术界

这是权衡最痛苦的地方,因为根本不应该有权衡。

发表新颖结果曾经是最纯粹的智力声望形式。你做工作是因为工作很美妙。这没有改变。改变的是,你在资金充足的初创公司和大学实验室中能做什么之间的界线变得模糊,而且对学术界不利。一个20人的研究初创公司现在可以在一个周末做的工作,需要一个学术实验室一个学期,因为计算成本高昂,而大学没有这样的资金。

我交谈过的最雄心勃勃的博士生不是在学术界和产业之间选择。他们在理论化实验和实际运行实验之间选择。对资金充足的初创公司和实验室的吸引力不是关于妥协。这是关于想做科学,而科学需要学术界无法提供的资源。

因为正确的原因留在学术界的人(开放科学、长期视野、真正的学术自由)是令人敬佩的。但他们应该知道,时钟对他们的嘀嗒也不同:计算差距越长,从大学内部做有竞争力的工作就越难。

AI初创公司(应用层)

如果你在模型之上构建产品,你已经知道那种感觉:你在三月份推出的聪明功能在六月份被模型更新商品化了。地形每个季度都在移动,你的护城河蒸发了。

这里的权衡是追逐令人兴奋的东西和构建持久的东西之间的权衡。现在蓬勃发展的创始人停止关心模型能力,开始关心模型无法夺走的东西:数据护城河、工作流捕获、集成深度。在宴会上谈论这些就没那么有趣了。这是真正的公司被构建的地方。

在这个世界里做出最尖锐举动的人是那些对管道感到兴奋的人——不是演示、不是宣传、不是能力。丑陋、无聊的基础设施使产品粘性独立于坐在下面的模型。

研究初创公司:重力的新中心

这是K曲线最可见的地方。

Prime Intellect、SSI、Humans&——10-30人进行真正的前沿研究,与规模大五十倍的组织竞争。三年前这是不可能的。现在发生是因为工具足够好,少数具有高明判断力的人可以跑赢拥有更多资源的官僚机构。

这里的日常工作流程是上臂在实践中看起来最清晰的画面。你在启动训练运行、旋转实验、让事情一夜间进行。你早上回来,你的工作不是编写代码。这是知道如何处理返回的东西——当系统给你一堵结果时,具有品味来区分信号和噪音。这是被动杠杆。你设置实验运行,复合增长是否发生,不管你是否在办公桌前。

人们在权衡:这些公司很小、未经证实,许多会失败。打赌是在前沿中心,你的判断直接接触工作,复合速度比大型组织的安全更快,即使特定公司没有成功。技能转移。网络转移。你在大公司审查他人输出花费的三年不会以相同的方式转移。

大模型实验室:前沿变窄

“我们在构建AGI”的宣传仍然有效。它可能对某种类型的人总是有效。

但内部的体验已经转变。最有趣的研究集中在少数高级人员中。其他人都在做重要的支持工作(评估、基础设施、产品),感觉不像他们注册的前沿。你加入是为了接触这件事,你距离它有三层。

权衡是声望对邻近。大实验室在你的简历上仍然可以打开所有大门。但离开的人在做一个具体的计算:“我在[顶级实验室]”的简历价值随着实验室变得更大和更公司化而贬值,而“我在一个我的判断塑造方向的地方进行前沿研究”的价值在升值。大实验室血统是最佳证书的窗口正在关闭,看到它的人在转身。

时钟

这些权衡中的每一个都在其中隐藏着相同的变量:时间。

一年前,你可以坐在舒适的座位上深思熟虑。等待的代价很低,因为分化很慢。那不再是真的了。工具在复合。早期转身的人正在建立他们上个季度学到的东西。有人六个月前转身和有人仍在权衡选择之间的差异已经在复合。

上臂没有关闭。人们每周都在跳跃,雇用他们的人不关心你去过哪里。他们关心你是否能完成工作。但数学是方向性的:你优化舒适的时间越长,转换变得越昂贵——不是因为机会消失,而是因为已经到达那里的人在复合,而你没有。

现在赢得人才战争的公司不是那些品牌最好或薪酬最高的公司。他们是那些你的判断有最大表面积的地方,你的品味和实际构建的距离为零,你被你还没学过技巧的人包围的地方。最优秀的人想靠近其他拥有他们还没学过技巧的人,在有足够计算实际运行实验的地方。

问题不是你是否足够聪明。这是你已经做了数学。你只是还没有采取行动。

以下内容由 LLM 生成,可能包含不准确之处。

背景

这篇文章捕捉了科技劳动力市场在2024-2025年左右的结构性转变,在加速的人工智能能力中职业选择空间在压缩。它位于职业动态、人才配置理论和"前沿工作"社会学的交汇点。核心矛盾在于:传统的职业安全信号(FAANG薪酬、学术终身教职、大型实验室声誉)正在与判断力养成发生的地方脱钩。这很重要,因为从执行到协调的转变——由经济学家大卫·奥特记录为"任务互补性"——正在以制度适应的速度更快地发生,在技能积累中创造赢家通吃的动态。

关键洞见

K形曲线是一个复合性分化问题。 与以往科技周期中技能逐步贬值不同,生成式人工智能工具在早期采用者和落后者之间创造了指数级的生产力差距。麻省理工学院和斯坦福大学的研究表明,使用GPT-4的顾问完成任务的速度快25%,质量高40%——但用户之间的差异随时间扩大。那些开发出"关于人工智能输出判断力"的人每季度都在复合优势;那些手动执行的人落后的速度是非线性的。这篇文章关于研究初创企业超越其规模50倍实验室的观点反映了科斯的企业边界理论的反转:协调成本的下降速度比资源优势重要得多。

学术界的计算能力差距是反向的资源诅咒。 关于周末实验对比学期时间表的观察映射到Brown等人对人工智能研究中计算不平等的分析。大学无法在基础设施上竞争,但这篇文章没有注意到顶级实验室越来越限制出版以保护竞争优势——学术自由仍然对可复现的开放工作享有溢价。真正的代价:博士生现在为"获取计算能力"而不是"知识社群"进行优化,可能牺牲了历史上产生突破性想法的协作意外收获。

开放问题

如果人工智能工具改进进入平台期,K形曲线会崩溃吗,让优势回到机构稳定性?还是我们正在看到一个永久的政权转变,其中"协调人工智能系统的品味"成为知识工作的主导过滤器?

如果判断力的复合速度比执行贬值更快,当前科技工作者中的底部50%会发生什么——这最终是否会迫使对科技的精英统治神话进行清算?

2026-02-17 19:54:22 -/-

Guidelines for Reporting LLM-Integrated Systems in HCI ResearchHCI研究中大语言模型集成系统的报告指南

Useful resources:

  • https://huggingface.co/spaces/transformers-community/Transformers-tenets
  • https://ianarawjo.github.io/Guidelines-for-Reporting-LLM-Integrated-Systems-in-HCI/

The following content is generated by LLMs and may contain inaccuracies.

Context

As LLM-integrated systems proliferate in HCI research, the field faces a reproducibility and transparency crisis. Unlike traditional computational systems, LLMs introduce nondeterminism, opaque prompting strategies, and rapidly shifting model capabilities that make replication difficult. The Guidelines for Reporting LLM-Integrated Systems address this gap by proposing standardized reporting practices for HCI researchers working with language models. This matters now because peer review struggles to evaluate systems where critical implementation details—prompt engineering, model versions, failure modes—are often omitted or underspecified.

Key Insights

1. Methodological Debt in Prompt Engineering
HCI research increasingly treats prompts as implementation details rather than experimental variables. Yet prompt design critically shapes user experience and system behavior. The guidelines advocate reporting not just final prompts but also iteration processes and sensitivity analysis. This aligns with calls in Transformers library development to “maintain the unmaintainable”—documenting messy development realities rather than sanitized outcomes. Without prompt versioning and ablation studies, findings remain unreproducible.

2. The Model Specification Problem
Generic references to “GPT-4” or “Claude” mask enormous variance. Model snapshots, temperature settings, and API versioning produce materially different behaviors. Research on model drift shows performance degradation over time even for fixed model names. The guidelines recommend timestamped model identifiers and capturing API responses for post-hoc analysis—a practice standard in ML benchmarking but rare in HCI evaluation.

3. Failure Mode Documentation as Design Knowledge
Traditional HCI reporting emphasizes successful interactions; LLM systems demand documenting characteristic failures. Hallucinations, context window limitations, and reasoning breakdowns aren’t bugs but inherent properties. Systematic failure taxonomies (as proposed in the guidelines) transform error cases into reusable design knowledge, enabling cumulative progress rather than repeated rediscovery.
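The model-specification recommendation above (timestamped snapshot identifiers, captured responses) can be sketched as a small logging record. The field names here are illustrative assumptions, not part of any official schema from the guidelines:

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class LLMCallRecord:
    """One archived LLM interaction, logged for post-hoc reproducibility."""
    model: str            # pinned snapshot id, not a floating alias like "gpt-4"
    temperature: float
    prompt_version: str   # ties the call to a versioned prompt artifact
    timestamp_utc: str    # ISO 8601
    prompt: str
    response: str         # raw API response, captured verbatim

record = LLMCallRecord(
    model="gpt-4-0613",
    temperature=0.2,
    prompt_version="v3.1",
    timestamp_utc="2026-02-17T19:54:00Z",
    prompt="Summarize the participant's last three messages.",
    response="...",
)
# Serialize with stable key order so records diff cleanly across runs.
serialized = json.dumps(asdict(record), sort_keys=True)
```

Archiving such records alongside study data is what makes the post-hoc analysis the guidelines call for possible once the named model alias has drifted.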

Open Questions

  • How can HCI adapt peer review timelines when model obsolescence occurs faster than publication cycles—should we move toward living documentation standards?
  • What minimal reproducibility threshold balances scientific rigor with protecting proprietary prompt engineering that confers competitive advantage in applied contexts?

有用的资源:

  • https://huggingface.co/spaces/transformers-community/Transformers-tenets
  • https://ianarawjo.github.io/Guidelines-for-Reporting-LLM-Integrated-Systems-in-HCI/

以下内容由 LLM 生成,可能包含不准确之处。

背景

随着集成LLM的系统在人机交互研究中越来越普遍,该领域面临着可重复性和透明度危机。与传统计算系统不同,LLM引入了不确定性、不透明的提示策略以及快速变化的模型能力,这些都使得复现变得困难。LLM集成系统报告指南针对这一空白,为从事语言模型研究的人机交互研究人员提出了标准化报告实践。这在当下尤为重要,因为同行评审难以评估那些关键实现细节——提示工程、模型版本、故障模式——往往被省略或描述不清的系统。

主要见解

1. 提示工程中的方法债务
人机交互研究越来越多地将提示视为实现细节而非实验变量。然而,提示设计在很大程度上影响用户体验和系统行为。该指南主张报告不仅仅是最终提示,还要报告迭代过程和敏感性分析。这与Transformers库开发中"维护不可维护之物"的呼声相一致——记录凌乱的开发现实而非经过修饰的成果。没有提示版本控制和消融研究,研究成果将无法重现。

2. 模型规范问题
泛泛地提及"GPT-4"或"Claude"掩盖了巨大的差异。模型快照、温度设置和API版本控制产生的行为差异很大。关于模型漂移的研究表明,即使对于固定的模型名称,性能也会随时间降低。该指南建议使用带时间戳的模型标识符并捕获API响应以进行事后分析——这种做法在机器学习基准测试中是标准做法,但在人机交互评估中很少见。

3. 故障模式文档化作为设计知识
传统的人机交互报告强调成功的交互;LLM系统则需要文档化特征性故障。幻觉、上下文窗口限制和推理断裂不是漏洞,而是固有属性。系统性的故障分类法(如指南中提出的)将错误情况转化为可复用的设计知识,实现累进进步而非反复重新发现。

开放性问题

  • 当模型过时速度超过出版周期时,人机交互如何适应同行评审时间安排——我们是否应该转向活文档标准?
  • 科学严谨性与保护在应用环境中具有竞争优势的专有提示工程之间的最小可重复性阈值应该如何平衡?
2026-02-17 19:13:48 -/-

Bayesian Methods, RLHF, and Building AI Models贝叶斯方法、RLHF和构建AI模型

Interesting books for reading:

  • 贝叶斯方法与科学合理性——对休谟问题的思考: https://book.douban.com/subject/4472081/
  • Reinforcement Learning from Human Feedback - A short introduction to RLHF and post-training focused on language models: https://rlhfbook.com/
  • Build a Reasoning Model (From Scratch): https://www.manning.com/books/build-a-reasoning-model-from-scratch
  • Build a Large Language Model (From Scratch): https://www.manning.com/books/build-a-large-language-model-from-scratch

The following content is generated by LLMs and may contain inaccuracies.

Context

This collection spans foundational epistemology, modern alignment techniques, and practical AI implementation—reflecting the field’s dual nature as both engineering practice and philosophical inquiry. The juxtaposition is telling: Bayesian reasoning connects to centuries-old debates about induction (Hume’s problem), while RLHF and transformer architectures represent cutting-edge methods for aligning LLMs with human preferences. Together, they address AI’s central tension: building systems that are both technically capable and rationally justified in their behavior.

Key Insights

Epistemological foundations matter for alignment. The Chinese work on Bayesian methods and scientific rationality revisits Hume’s problem of induction—how we justify inferring general principles from finite observations. This isn’t purely academic: RLHF implicitly makes Bayesian updates about human preferences from limited feedback. Nathan Lambert’s RLHF book describes how post-training uses reward models trained on human comparisons to steer base models, but rarely interrogates the epistemological validity of learning “values” from sparse signals. The gap matters: if we can’t justify ordinary induction, justifying value alignment from few-shot preference data becomes even more precarious.
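The Bayesian reading of preference learning can be made concrete with a deliberately reduced Beta-Bernoulli model. This is an illustration of the epistemological point about sparse feedback, not how RLHF reward models are actually trained:

```python
def posterior_after(prefs: list[bool], alpha: float = 1.0, beta: float = 1.0):
    """Return (mean, variance) of the Beta posterior over p = P(A preferred)
    after a sequence of A-vs-B human judgments, starting from Beta(alpha, beta)."""
    a = alpha + sum(prefs)                # successes: A preferred
    b = beta + len(prefs) - sum(prefs)    # failures: B preferred
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, var

few_mean, few_var = posterior_after([True, True, False])          # 3 labels
many_mean, many_var = posterior_after([True, True, False] * 100)  # 300 labels
# The posterior mean barely moves, but the variance collapses with data —
# the few-shot regime leaves the "learned value" highly uncertain.
```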

From-scratch implementations reveal architectural commitments. Raschka’s LLM book and its reasoning model companion emphasize implementing attention mechanisms and transformers without abstraction layers. This pedagogical approach exposes design choices often hidden in frameworks: why scaled dot-product attention, why layer normalization placement matters, how positional encodings shape what’s learnable. Understanding these details illuminates why certain alignment interventions (like RLHF fine-tuning) work—they exploit specific inductive biases already present in the architecture.
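The first of those design choices—scaled dot-product attention—is short enough to write out from scratch in plain Python (lists instead of tensors; real implementations batch this and add masking and multiple heads):

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, written out with
    plain lists in the from-scratch spirit of Raschka's book."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k) so
        # softmax inputs don't grow with dimension (the "scaled" part).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exp = [math.exp(s - m) for s in scores]
        z = sum(exp)
        weights = [e / z for e in exp]
        # Each output row is an attention-weighted mix of value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

A zero query attends uniformly to all values; a query aligned with one key shifts weight toward that key's value—the inductive bias that fine-tuning interventions inherit.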

Open Questions

Can we formalize RLHF’s relationship to Bayesian belief updating in a way that makes its epistemological assumptions explicit? Does building reasoning models from scratch reveal fundamental architectural requirements for multi-step inference, or are current approaches contingent engineering solutions?

值得阅读的有趣书籍:

  • 贝叶斯方法与科学合理性——对休谟问题的思考: https://book.douban.com/subject/4472081/
  • Reinforcement Learning from Human Feedback - A short introduction to RLHF and post-training focused on language models: https://rlhfbook.com/
  • Build a Reasoning Model (From Scratch): https://www.manning.com/books/build-a-reasoning-model-from-scratch
  • Build a Large Language Model (From Scratch): https://www.manning.com/books/build-a-large-language-model-from-scratch

以下内容由 LLM 生成,可能包含不准确之处。

背景

这个集合涵盖了基础认识论、现代对齐技术和实际的人工智能实现——反映了该领域作为工程实践和哲学探究的双重性质。这种对比很能说明问题:贝叶斯推理与几个世纪以来关于归纳的辩论相关(休谟问题),而RLHF和Transformer架构代表了将大语言模型与人类偏好对齐的最尖端方法。它们共同解决了人工智能的核心矛盾:构建既在技术上有能力又在行为上有理性依据的系统。

关键见解

认识论基础对对齐很重要。 中文关于贝叶斯方法和科学理性的著作重新审视了休谟的归纳问题——我们如何从有限的观察中证明推断一般原则的合理性。这不是纯学术问题:RLHF隐含地从有限反馈中对人类偏好进行贝叶斯更新。Nathan Lambert的RLHF书籍描述了训练后如何使用基于人类比较训练的奖励模型来引导基础模型,但很少质疑从稀疏信号学习"价值观"的认识论有效性。这个差距很重要:如果我们无法证明普通归纳的合理性,那么从少量偏好数据中证明价值对齐的合理性就变得更加危险。

从零开始的实现揭示了架构承诺。 Raschka的大语言模型书籍及其推理模型伴读书强调在没有抽象层的情况下实现注意力机制和Transformer。这种教学方法暴露了通常隐藏在框架中的设计选择:为什么使用缩放点积注意力、为什么层归一化的位置很重要、位置编码如何塑造可学习的内容。理解这些细节阐明了为什么某些对齐干预(如RLHF微调)有效——它们利用了架构中已经存在的特定归纳偏差。

待解问题

我们能否以使其认识论假设明确的方式,将RLHF与贝叶斯信念更新的关系正式化?从零开始构建推理模型是否揭示了多步推理的基本架构要求,或者当前的方法只是偶然的工程解决方案?

2026-02-17 08:22:37 -/-

Human-in-the-loop approval scaling in agentic systems代理系统中人在回路批准的可扩展性

In agentic applications, to ensure that agent actions, in particular write operations, execute in a safe and trusted manner, we often put a human in the loop.

The naive practice for a setup with one agentic loop plus MCP tool servers is to require human confirmation per tool call. However, this approach does not scale: the human soon degenerates into a tedious “Confirm” executor who may blindly approve actions without actually reviewing them.

One potential mitigation is a diff-based review approach: provide a summary of the ongoing executions and ask for confirmation. However, this remains non-scalable when the agent produces a huge diff (e.g. it writes 10k lines of code and wants to commit them), which no human can review efficiently.

Another idea is to prepare some sort of intent overview and contract for the user to review, but it is generally hard to prepare a tool-call sequence in advance because it is non-deterministic and context-dependent. A single catastrophic tool call can also slip through unnoticed.

The following content is generated by LLMs and may contain inaccuracies.

Context

Human-in-the-loop (HITL) approval for agentic systems addresses a fundamental tension in AI safety: balancing autonomy with control. As agents gain write permissions—whether modifying codebases, executing financial transactions, or altering production systems—the risk of cascading failures grows. Traditional per-action approval gates create approval fatigue, degrading the very oversight they’re meant to provide. This challenge intensifies as agents integrate with Model Context Protocol (MCP) tool servers, where tool composition can generate unbounded action sequences.

Key Insights

Hierarchical approval boundaries: Rather than uniform gating, systems could implement trust tiers based on reversibility and blast radius. Anthropic’s Constitutional AI work suggests learned policies can classify actions by consequence severity. Read operations and idempotent writes might auto-approve, while irreversible operations (deletions, external API calls) trigger review. This mirrors capability-based security patterns where permissions are granular rather than binary.

Semantic compression for review: The 10k-line diff problem isn’t unique to agents—code review research tackles this via change impact analysis. Agents could pre-compute intent summaries using formal specifications or property-based testing. Instead of reviewing raw diffs, humans approve high-level invariants (“maintains API compatibility,” “preserves data integrity”). Microsoft’s Copilot Workspace experiments with this by generating editable task plans before execution.

Auditable sandboxing with rollback: Non-determinism makes pre-approval contracts fragile, but post-hoc auditing with cheap rollback changes the calculus. Systems like Deno’s permission model prove that runtime permission prompts can work when paired with clear scope boundaries. For agents, execution in isolated environments with speculative checkpointing lets humans review outcomes rather than intentions, then commit or revert atomically.
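The trust-tier idea above can be sketched as a tiny policy function. The risk fields and the tier boundaries here are illustrative assumptions, not a standard; a production gate would also weigh agent track record and environment:

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    AUTO_APPROVE = "auto_approve"
    REQUIRE_REVIEW = "require_review"

@dataclass(frozen=True)
class ToolCall:
    name: str
    reversible: bool  # can the effect be rolled back cheaply?
    external: bool    # does it leave the sandbox (network, prod systems)?
    writes: bool      # does it mutate state at all?

def approval_tier(call: ToolCall) -> Decision:
    """Toy risk-tiered gate: read-only calls and cheaply reversible internal
    writes pass automatically; anything irreversible or externally visible
    is escalated to a human."""
    if not call.writes:
        return Decision.AUTO_APPROVE
    if call.reversible and not call.external:
        return Decision.AUTO_APPROVE
    return Decision.REQUIRE_REVIEW
```

Routing only the third category to a human is what keeps confirmation prompts rare enough that each one still gets genuine attention.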

Open Questions

  • Can we develop a “differential trust calculus” that dynamically adjusts approval thresholds based on agent track record, action reversibility, and environmental context, similar to credit scoring for automation?
  • What design patterns from transactional databases (two-phase commit, optimistic concurrency) could apply to multi-step agent workflows with deferred human approval gates?

In agentic applications, to ensure that agent operations—especially writes—execute safely and trustworthily, we typically use a human-in-the-loop approach.

For an agent loop plus an MCP tool-server setup, the most primitive approach is to require human confirmation on every tool call. But this does not scale: it quickly reduces the human to a bored “confirm” executor who may rubber-stamp actions without actually reviewing them.

One possible mitigation is diff-based review: present an execution summary and request confirmation. But when the agent produces a huge diff (say, ten thousand lines of code it wants to commit), this still does not scale, because a human cannot review all of it efficiently.

Another idea is to prepare some kind of intent overview and contract for the user to review up front, but since tool-call sequences are non-deterministic and context-dependent, such contracts are hard to prepare in advance—and a catastrophic tool call may still slip through.

2026-02-17 08:14:44

Pareto Principles and Incomparability in Infinite Ethics无限伦理学中的帕累托原则与不可比性

Pareto Principles in Infinite Ethics Published: May 01, 2018

Recommended citation: Askell, Amanda. ‘Pareto Principles in Infinite Ethics.’ PhD thesis, New York University (2018). https://askell.io/files/Askell-PhD-Thesis.pdf

Summary: In this thesis I argue that ethical rankings of worlds that contain infinite levels of wellbeing ought to be consistent with the Pareto principle, which says that if two worlds contain the same agents and some agents are better off in the first world than they are in the second and no agents are worse off than they are in the second, then the first world is better than the second. I show that if we accept four axioms – the Pareto principle, transitivity, an axiom stating that populations of worlds can be permuted, and the claim that if the ‘at least as good as’ relation holds between two worlds then it holds between qualitative duplicates of this world pair – then we must conclude that there is ubiquitous incomparability between infinite worlds.
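The four axioms can be stated schematically; the notation below (worlds $w_i$ over a shared population $A$, with $\succ$, $\succsim$, $\sim$ as the betterness relations) is a paraphrase for readability, not the thesis’s own formalism:

```latex
\begin{align*}
&\textbf{Pareto: } \bigl(\forall a \in A:\ w_1(a) \ge w_2(a)\bigr) \wedge
  \bigl(\exists a \in A:\ w_1(a) > w_2(a)\bigr) \implies w_1 \succ w_2 \\
&\textbf{Transitivity: } w_1 \succsim w_2 \;\wedge\; w_2 \succsim w_3 \implies w_1 \succsim w_3 \\
&\textbf{Permutation: } \text{for any bijection } \pi : A \to A,\quad w \sim w \circ \pi \\
&\textbf{Qualitative invariance: } w_1 \succsim w_2 \implies w_1' \succsim w_2'
  \text{ for qualitative duplicates } (w_1', w_2')
\end{align*}
```

The thesis’s result is that no ordering on infinite worlds satisfies all four while remaining complete: incomparability is forced.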

The following content is generated by LLMs and may contain inaccuracies.

Context

Amanda Askell’s thesis addresses a fundamental tension in population ethics: how to compare worlds with infinite populations or infinite welfare distributions. Classical utilitarian aggregation breaks down when summing infinite utilities, yet we still need principles to guide ethical decisions affecting potentially infinite futures. This matters for longtermism, existential risk prioritization, and any framework considering indefinitely large futures.

Key Insights

Incomparability as inevitable, not defective. Askell proves that accepting four seemingly minimal axioms—Pareto, transitivity, permutation invariance, and qualitative invariance—forces “ubiquitous incomparability” between infinite worlds. This isn’t a bug to be fixed through cleverer aggregation, but a structural feature of infinite ethics. The result parallels impossibility theorems in social choice: we cannot have all desirable properties simultaneously. Rather than abandoning comparability entirely, we must accept that some world-pairs lack ordinal rankings.

Pareto remains non-negotiable. Unlike other axioms that might be weakened, Askell defends Pareto as foundational: if world A is identical to world B except some individuals fare better in A and none fare worse, A must be better. Rejecting Pareto permits rankings that ignore individual welfare entirely—a violation of welfarism’s core commitment. This constrains which infinite-ethics frameworks remain viable; approaches that violate Pareto (like some overtaking criteria) lose moral standing even if they avoid incomparability.

Practical implications for decision-making. If incomparability is ubiquitous, how do we act? Askell’s framework suggests adopting permissibility frameworks rather than maximization: multiple infinite futures may be permissible if incomparable. This aligns with recent work on maximality in decision theory under incomplete preferences.

Open Questions

How should we prioritize between finite and infinite considerations when they conflict—does any finite welfare gain justify foregoing incomparably different infinite futures? Can bounded rationality constraints justify practically rejecting Pareto in infinite cases where verification is computationally infeasible?


2026-02-16 11:43:36

The Energy Dilemma and Historical Paradox of Intelligent Optimization智能优化的能量困境与历史悖论

Continuing from my previous thoughts.

We mentioned that the long tail may not necessarily be a bad thing. Similar perspectives have actually been proposed long ago—over-optimization weakens the ability to adapt to mutations, which comes from the theory of antifragility.

The leaders on the adaptation curve are, in a sense, the highly intelligent. Historical experience suggests that the more intelligent people become, the more energy they crave. We can draw a parallel from this human process to artificial intelligence: once the reproductive cycle of intelligence is broken through, it grows ever more intelligent while consuming ever more energy.

At some point in the future, if these superintelligent entities are still alive, we can logically deduce that this form of intelligence would consume all available energy and be unable to continue existing.

From this perspective, what exactly is optimization? Is it necessarily always a good thing?

After sharing these thoughts with someone, they countered me with a question: this kind of technological progress seemingly has never occurred in the long course of history. I was momentarily at a loss and didn’t know how to respond.

The following content is generated by LLMs and may contain inaccuracies.

Context

This reflection touches on an energy sustainability paradox in AI safety research: the tension between optimizing efficiency and long-term adaptability. It connects Nassim Taleb’s antifragility theory (over-optimization reduces system resilience) with the resource consumption dilemma in the AI alignment field. Current large model training energy consumption exhibits exponential growth (GPT-3 training consumed 1287 MWh), transforming this question from philosophical speculation into engineering reality.

Key Insights

  1. Historical Analogy Breaking Points: Human civilization has indeed experienced localized collapses (Easter Island deforestation, Mayan civilization), but never a global “intelligence-energy death spiral.” The critical difference lies in the optimization speed AI might achieve, far exceeding biological evolution — Bostrom’s intelligence explosion theory suggests recursive self-improvement could accomplish in months what took humans a million years of intelligent development, while energy infrastructure response cycles are measured in decades.

  2. Multi-Objective Optimization Dilemma: Single-dimensional optimization (such as reasoning capability) necessarily sacrifices other dimensions (such as energy efficiency and robustness). The Pareto efficiency frontier makes this precise: once a system lies on the frontier, any further improvement in one objective requires a sacrifice in another. Biological evolution’s retention of “suboptimal” diversity serves as a hedge against uncertainty — the “long-tail populations” you mention may become critical gene pools for species survival during environmental upheaval.

  3. Self-Limiting Energy Bottleneck: Landauer’s principle defines the thermodynamic lower bound of computation (each bit erasure must dissipate at least kT·ln2 energy). Even with perfect algorithmic optimization, physical laws will force intelligent agents to encounter hard limits on the energy-intelligence curve, potentially naturally producing an “optimization ceiling” rather than unlimited expansion.
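The bound is easy to evaluate at room temperature (a back-of-the-envelope check, not a claim about any specific hardware):

```python
import math

k_B = 1.380649e-23   # Boltzmann constant, J/K (exact in SI since 2019)
T = 300.0            # room temperature, K

# Landauer's principle: erasing one bit dissipates at least k_B * T * ln(2).
landauer_bound = k_B * T * math.log(2)

# Roughly 2.87e-21 J per bit at 300 K — many orders of magnitude below
# the switching energy of present-day CMOS logic, so the physical ceiling
# is distant, but it is a ceiling nonetheless.
print(f"{landauer_bound:.3e} J")  # → 2.871e-21 J
```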

Open Questions

  • Does an operational definition of “moderate optimization” exist — one that captures intelligence dividends while preserving adaptive redundancy? Can the exploration-exploitation balance from evolutionary algorithms translate into AI governance principles?
  • If future AI breaches energy constraints through discovering new physics (such as controlled nuclear fusion), does the original argument become invalid? Would this mean the issue is fundamentally a race between “optimization speed vs. resource acquisition innovation speed”?

© 2008 - 2026 Changkun Ou. All rights reserved.