This is an interesting article. The authors argue that AI alignment should not be understood as “making AI maximize human preferences.” The current mainstream approach to AI alignment, they contend, over-relies on the concept of “preference,” treating preferences as a sufficient expression of human values, the basis of rational behavior, and the target AI should optimize for. They call this route the preferentist approach and systematically critique its four core assumptions: that human behavior can be modeled as maximizing preference satisfaction, that rational agents should maximize expected utility, that aligning with an individual means matching their preferences, and that aligning with multiple people means aggregating their preferences.
The authors’ core alternative claim is that AI systems should align with the normative standards required by their social roles, rather than directly with human preferences. For example, a general-purpose AI assistant should not merely satisfy what users currently want; it should conform to the normative ideal of a “good assistant”: respecting user autonomy, avoiding manipulation, acknowledging that user values may change, and refusing to cooperate when doing so would harm others.
The article’s supporting logic unfolds across roughly four layers:
- Preferences cannot sufficiently represent human values. Human preferences are dynamic, context-dependent, and influenced by social norms. Moreover, some values are incommensurable and cannot simply be compressed into a reward function or utility function.
- Expected utility maximization is not the only standard of rationality. The authors argue that EUT-style global consistency is neither necessary for rational agents nor necessarily a good design goal for safe AI. Locally consistent, instrumental, and correctable systems may better suit safety and practical needs.
- Methods like RLHF are commonly mislabeled as “learning human preferences.” The authors point out that many annotators do not express their genuine preferences as users, but rather make judgments about normative standards like helpfulness, harmlessness, and truthfulness. Preferences thus function more like training signals for learning norms than like final alignment targets.
- Multi-agent alignment cannot be solved through simple preference aggregation. Because social values are plural and conflicting, preference aggregation may be neither computable nor fair, and could suppress minority lifestyles. The authors advocate a more “contractarian” approach, in which stakeholders negotiate the legal, moral, and social norms AI should follow in different social roles.
This article doesn’t claim preferences are useless; rather, it argues that preferences can only serve as clues for understanding human values, norms, and reasons, not as the final goal of AI alignment. The true alignment target should be normative standards negotiated through society and suited to the specific AI’s role. The authors state explicitly in the conclusion that preferences can function as proxy signals for values but should not become the alignment target itself.
The following content is generated by LLMs and may contain inaccuracies.
Beyond Preferences in AI Alignment — In-Depth Exploration
Source Paper: Tan Zhi-Xuan, Micah Carroll, Matija Franklin & Hal Ashton, Beyond Preferences in AI Alignment, published in Philosophical Studies (Revised November 2024).
Context
This paper sits at the intersection of AI safety, decision theory, political philosophy, and value pluralism, and it appears at a critical moment: RLHF (Reinforcement Learning from Human Feedback) has become the industry standard for LLM alignment, yet scholarly reflection on its theoretical foundations has not yet permeated mainstream engineering practice.
The mainstream approach to AI alignment currently presupposes three premises: that preferences can adequately represent human values, that human rationality can be understood as maximizing preference satisfaction, and that AI systems should be aligned to the preferences of one or multiple humans. This presupposed system, which the authors term the preferentist approach, forms the object of their critique.
The tension lies in the fundamental gap between operational simplicity (“find out what humans want and optimize for it”) and the genuine complexity of values. As AI systems are deployed in high-stakes domains such as healthcare, education, and law, the cost of this gap ceases to be abstract. Although the relevant discussions have accumulated considerable depth (Gabriel 2020; Hadfield-Menell & Hadfield 2018; etc.), mainstream AI alignment practice has yet to genuinely absorb the substance of these critiques.
Key Insights
1. The Four Pillars of the Preferentist Approach and Their Fractures
The authors summarize the preferentist approach as four core propositions: ① rational choice theory as a descriptive framework (human behavior can be modeled as approximately maximizing preference satisfaction, representable as utility or reward functions); ② expected utility theory as a normative standard (rational agents can be characterized as maximizing expected utility, and AI systems should likewise be designed and analyzed accordingly); ③ aligning with a single individual means matching their preferences; ④ aligning with multiple people means aggregating their preferences.
The authors first examine the limitations of rational choice theory as a descriptive model, pointing out that preferences cannot capture the “thick semantic content” of human values, and that utility representations overlook the incommensurability that may exist among these values.
2. Fundamental Limitations of Preference Representation: Incommensurability and Incompleteness
A scalar reward function is structurally incapable of representing the preference incompleteness that arises from pluralistic value systems. Empirical research shows that such incompleteness is not merely possible but an actual phenomenon. This means a utility function is at best an approximate representation of human preferences, not a precise expression of them.
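To see why completeness fails structurally, here is a minimal sketch (not from the paper; the value dimensions, options, and numbers are invented for illustration): under two incommensurable value dimensions, Pareto dominance gives only a partial order, so two options can be genuinely incomparable, whereas any scalar utility must either rank one above the other or declare them indifferent.

```python
from dataclasses import dataclass

@dataclass
class Option:
    name: str
    honesty: float   # hypothetical value dimension 1
    kindness: float  # hypothetical value dimension 2

def pareto_prefers(a: Option, b: Option) -> bool:
    """a dominates b only if it is at least as good on every dimension
    and strictly better on at least one -- a partial order."""
    return (a.honesty >= b.honesty and a.kindness >= b.kindness
            and (a.honesty > b.honesty or a.kindness > b.kindness))

blunt  = Option("blunt reply",  honesty=0.9, kindness=0.3)
gentle = Option("gentle reply", honesty=0.4, kindness=0.9)

# Neither option dominates the other: they are incomparable, not indifferent.
assert not pareto_prefers(blunt, gentle)
assert not pareto_prefers(gentle, blunt)

# Any scalar utility collapses this incomparability into a ranking or a tie,
# which asserts more than the underlying values license.
def scalar_utility(o: Option, w: float = 0.5) -> float:
    return w * o.honesty + (1 - w) * o.kindness

print(scalar_utility(blunt), scalar_utility(gentle))  # forced comparison
```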
The authors propose transitioning toward alternative frameworks that better handle “resource-limited human cognition,” “incommensurable values,” and the “constructed nature of preferences.”
As a partial technical alternative, more expressive representations already exist: temporal logics and reward machines can express values with temporal structure, avoiding some limitations of traditional scalar reward functions.
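As a rough illustration of the reward-machine idea (the states, events, and reward values below are invented, not drawn from the paper), reward is attached to transitions of a small automaton whose state summarizes history, so a requirement like “obtain consent before acting” can be rewarded in a way that a Markovian scalar reward over single observations cannot express.

```python
# Minimal reward machine: reward depends on an automaton state that tracks
# history, not just the current observation. States/events are hypothetical.
TRANSITIONS = {
    # (machine_state, event): (next_machine_state, reward)
    ("start",     "ask_consent"): ("consented", 0.0),
    ("start",     "act"):         ("violated", -1.0),  # acted without consent
    ("consented", "act"):         ("done",     +1.0),  # consent first, then act
}

def run(events, state="start"):
    total = 0.0
    for event in events:
        state, reward = TRANSITIONS.get((state, event), (state, 0.0))
        total += reward
    return state, total

print(run(["ask_consent", "act"]))  # ('done', 1.0)
print(run(["act"]))                 # ('violated', -1.0)
```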
3. EUT Is Neither the Sole Standard of Rationality Nor Suitable as a Design Goal for Safe AI
The authors criticize the normativity of EUT for both humans and AI, invoking arguments that rational agents need not comply with EUT, and pointing out that EUT remains silent on which preferences are normatively acceptable.
The authors do not deny that ensuring the safety of globally coherent agents is theoretically possible (e.g., by maintaining uncertainty over utility functions, or carefully balancing utilities across different contexts); nor do they argue that incompleteness is a necessary condition for instrumental AI. However, if the goal is to build systems that can safely respect our preferences and values, keeping options open and moving beyond the default assumption of “globally coherent agents” is reasonable.
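For intuition on “maintaining uncertainty over utility functions,” here is a hedged sketch (the candidate utilities, numbers, and decision rule are assumptions for illustration, not the authors’ proposal): an agent that keeps a posterior over several candidate utility functions can defer to the user when the candidates disagree sharply, rather than acting on a single confident objective.

```python
import statistics

# Hypothetical posterior over candidate utility functions, each scoring the
# same proposed action ("delete the user's old files") differently.
candidate_utilities = {
    "u_tidy_workspace": +0.8,   # reads the request literally
    "u_preserve_data":  -0.9,   # treats user data as nearly irreplaceable
    "u_neutral":         0.0,
}
posterior = {"u_tidy_workspace": 0.4, "u_preserve_data": 0.4, "u_neutral": 0.2}

expected = sum(posterior[name] * score for name, score in candidate_utilities.items())
spread = statistics.pstdev(candidate_utilities.values())

# Assumed decision rule: act only if the expected score is positive AND the
# candidate utilities roughly agree; otherwise ask instead of optimizing.
DISAGREEMENT_THRESHOLD = 0.5
decision = "act" if expected > 0 and spread < DISAGREEMENT_THRESHOLD else "ask_user"
print(round(expected, 2), round(spread, 2), decision)  # defers: ask_user
```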
4. RLHF as Learning Normative Standards Rather Than Genuine Preferences
RLHF faces numerous technical challenges (from preference elicitation and scalable oversight to overoptimization and training stability), yet the authors' critique is more foundational: any alignment method that uses reward to represent human preferences or values will suffer from the representational limitations discussed above.
Research shows that annotators exercise considerable discretion in interpreting alignment principles (such as helpfulness, harmlessness, and honesty), and these judgments often vary significantly across annotators. This suggests that human judgment in RLHF is better understood as survey measurement than as observation of stable underlying preferences; preference modeling is essentially an exercise in survey design.
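To make the “survey measurement” reading concrete: RLHF reward models are typically trained with a Bradley-Terry-style objective over pairwise comparisons, so disagreement among annotators is simply folded into a single noisy preference probability. The toy numbers below are invented; the point is that the fitted reward gap summarizes a mixture of normative judgments and response noise rather than a stable individual preference.

```python
import math

# Toy comparisons of the same (prompt, response A, response B) pair from three
# annotators: two judge A "more helpful," one judges A "too risky" (hypothetical).
labels = [1, 1, 0]  # 1 = A preferred, 0 = B preferred

# Bradley-Terry: P(A preferred) = sigmoid(r_A - r_B). Fitting one reward gap to
# mixed labels treats annotator disagreement as mere response noise.
p_hat = sum(labels) / len(labels)              # empirical preference rate
reward_gap = math.log(p_hat / (1 - p_hat))     # implied r_A - r_B
print(round(p_hat, 3), round(reward_gap, 3))   # 0.667, 0.693
```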
A complementary critique comes from a sociotechnical perspective: mainstream RLHF practice lacks explicit definitions of concepts like “helpfulness” and “harmlessness,” leaving them for crowd workers to interpret on their own. This evasion of normative questions leads to inconsistent standards and a dilution of ethical norms.
5. Multi-Agent Alignment: The Inherent Dilemmas of Preference Aggregation
Although an increasing number of researchers recognize the insufficiency of directly aggregating preferences (Critch & Krueger 2020, Gabriel 2020, Korinek & Balwit 2022), mainstream alignment techniques still tend toward aggregating preferences across individuals, overlooking the contested and plural nature of human values while conflating specific normative judgments with overall preferences.
Within the framework of social choice theory, research since Condorcet has produced numerous “impossibility theorems” showing that any rule for consistently ranking states on the basis of individual orderings will violate some “quite mild rationality conditions” (Sen 2018).
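The classic Condorcet cycle makes the difficulty concrete: with just three voters and three options, pairwise majority voting can yield an intransitive “social preference,” so no consistent aggregate ranking exists. A minimal sketch:

```python
# Three voters' rankings over options A, B, C (the standard Condorcet example).
ballots = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

def majority_prefers(x: str, y: str) -> bool:
    """True if a strict majority of ballots rank x above y."""
    wins = sum(ballot.index(x) < ballot.index(y) for ballot in ballots)
    return wins > len(ballots) / 2

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")
# All three lines print True: A beats B, B beats C, C beats A -- a cycle.
```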
6. Alternative Approach: Role-Based Norms + Contractualist Negotiation
The authors' core alternative thesis is: AI systems should not be aligned to the preferences of users, developers, or “all humanity,” but rather to normative standards appropriate to their social role, such as that of a general assistant. These standards should be determined through negotiation by all relevant stakeholders, enabling diverse AI systems to serve different purposes and promote mutual benefit while limiting harm against a background of value pluralism.
As a concrete pathway, the authors argue that contractualist and agreement-based approaches can better handle value disagreement while respecting individuality and the plurality of AI purposes. This reframes the alignment objective: rather than aligning a single powerful AI system with “all humanity’s preferences,” the aim is to align diverse AI systems, each to the normative systems endorsed by its respective stakeholders.
7. Important Precursors and Parallel Research on This Critique
Iason Gabriel (2020) provides crucial theoretical grounding for this work: the alignment target itself requires clarification—there are significant differences between aligning AI to instructions, intentions, revealed preferences, ideal preferences, interests, and values. Principle-based alignment methods have systematic advantages; the core challenge for theorists is not finding the “true” moral principles for AI, but finding fair principles that can gain reflective endorsement despite widespread disagreement on moral beliefs.
In subsequent developments, Resource-Rational Contractualism (RRC) represents a specific technical operationalization of this paper’s contractualist approach: contractualist alignment grounds decisions in agreements that different stakeholders would endorse under appropriate conditions, but achieving such agreements at scale is costly. RRC proposes that AI systems approximate the agreements rational agents would form through a set of normatively-grounded, cognitively-inspired heuristics, enabling RRC-aligned agents to operate efficiently while dynamically adapting to an evolving human social world.
Additionally, “norm inference” as an independent technical direction also resonates with this work: some research attempts to infer normative principles implicit in preference datasets by recovering the rules that best explain observed annotation patterns.
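A hedged sketch of this norm-inference idea (the annotation data and candidate rules below are invented for illustration): score each candidate normative rule by how well it reproduces observed annotation decisions, and keep the best-explaining rule as the inferred norm.

```python
# Hypothetical annotation log: (response traits, label), label=1 means approved.
data = [
    ({"truthful": True,  "blunt": True},  1),
    ({"truthful": True,  "blunt": False}, 1),
    ({"truthful": False, "blunt": False}, 0),
    ({"truthful": False, "blunt": True},  0),
]

# Candidate norms expressed as predicates over response traits (assumed).
candidate_norms = {
    "approve iff truthful":  lambda t: t["truthful"],
    "approve iff not blunt": lambda t: not t["blunt"],
    "approve everything":    lambda t: True,
}

def fit(rule) -> float:
    """Fraction of annotation decisions the candidate rule reproduces."""
    return sum(int(rule(traits)) == label for traits, label in data) / len(data)

scores = {name: fit(rule) for name, rule in candidate_norms.items()}
best = max(scores, key=scores.get)
print(scores, "-> inferred norm:", best)  # "approve iff truthful" fits best
```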
Open Questions
1. The “Meta-Alignment” Problem of Normative Standards
If AI systems should be aligned to “normative standards required by their social role,” who decides what those normative standards are themselves? The contractualist framework presupposes a reasonable negotiation process, but AI system deployment often precedes the completion of any such negotiation. Does this mean all currently deployed systems exist in a state of “provisional alignment”? If the normative standards derived from negotiation themselves contain internal contradictions (e.g., privacy protection vs. public safety), how should AI systems handle conflicting normative demands without degenerating into some form of utility maximization?
2. Is Preference as a “Proxy Signal for Values” Self-Contradictory?
This paper ultimately acknowledges that preferences can serve as clues to understanding human values and norms, but should not become the alignment target itself. However, if preference signals are epistemically noisy and biased (RLHF annotators' judgments reflect norms more than personal preferences; preferences are themselves influenced by the AI system; and so on), does norm inference that uses preferences as signals have a reliable epistemic foundation? Does this constitute a circle: we use noisy preference data to learn norms, while those norms were already embedded in the preference-collection process itself?