Changkun's Blog

Science and art, life in between.

Changkun Ou


Human-AI interaction researcher, engineer, and writer.

Bridging HCI, AI, and systems programming. Building intelligent human-in-the-loop optimization systems. Informed by psychology, sociology, cognitive science, and philosophy.



Machine Learning from Human Preferences and Active Learning

Published at: 2026-02-19

MLHP/RLHF Lectures

  • https://web.stanford.edu/class/cs329h/index.html
  • https://mlhp.stanford.edu/
  • Multi-Objective Bayesian Optimization with Active Preference Learning https://arxiv.org/abs/2311.13460
  • Nearly Optimal Active Preference Learning and Its Application to LLM Alignment https://arxiv.org/abs/2602.01581
  • Personalized Building Climate Control with Contextual Preferential Bayesian Optimization https://arxiv.org/abs/2512.09481
  • Preference Elicitation and Query Learning https://www.cs.cmu.edu/~sandholm/pref_elicit_query_learning.jmlr04.pdf

The following content is generated by LLMs and may contain inaccuracies.

Context

Learning from human preferences has emerged as a critical bottleneck in deploying AI systems that genuinely serve human values. While traditional supervised learning assumes labeled ground truth, preference learning acknowledges that many real-world objectives—from LLM safety to personalized control systems—lack objective labels and must be inferred from comparative judgments. This matters acutely now as RLHF has become the dominant paradigm for LLM alignment, yet the sample efficiency of preference collection remains poor. The tension: preference data is expensive to collect, but passive collection scales poorly with system complexity.
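The comparative-judgment framing above is usually formalized with a pairwise model such as Bradley-Terry, where each item has a latent utility and the probability that one item beats another is a logistic function of their utility gap. A minimal sketch (toy win counts, plain gradient ascent — illustrative, not any paper's implementation):

```python
import numpy as np

def fit_bradley_terry(wins, n_iters=2000, lr=0.1):
    """Infer latent utilities from pairwise comparison counts.

    wins[i][j] = number of times item i was preferred over item j.
    """
    n = wins.shape[0]
    u = np.zeros(n)  # latent utilities, identified only up to a constant
    for _ in range(n_iters):
        # Model: P(i beats j) = sigmoid(u_i - u_j)
        p = 1.0 / (1.0 + np.exp(-(u[:, None] - u[None, :])))
        total = wins + wins.T  # comparisons observed for each pair
        # Log-likelihood gradient: observed wins minus expected wins
        grad = (wins - total * p).sum(axis=1)
        u += lr * grad / max(1, total.sum())
        u -= u.mean()  # pin down the shift degree of freedom
    return u

# Hypothetical data: item 0 usually beats 1 and 2; item 1 usually beats 2.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)
u = fit_bradley_terry(wins)
print(np.argsort(-u))  # items ranked best-to-worst: [0 1 2]
```

The same likelihood underlies the reward model in standard RLHF pipelines, just with a neural network in place of the per-item utility vector.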

Key Insights

Active learning can dramatically reduce labeling costs. Recent work by Zhao & Jun (2026) provides the first instance-dependent complexity bounds for active preference learning in LLM alignment, demonstrating that query selection tailored to preference structure (rather than generic experimental design criteria like D-optimality) improves sample efficiency. This challenges the common practice of applying classical active learning objectives without adapting them to the comparative nature of preferences.
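To make the contrast concrete, here is the generic baseline such work improves on: uncertainty sampling for pairwise queries, which asks about the pair whose predicted outcome is closest to a coin flip under the current utility estimate. This is a hedged illustration of classical active learning, not the instance-dependent criterion of Zhao & Jun:

```python
import numpy as np

def select_query(u, asked):
    """Return the unasked pair (i, j) whose win probability is nearest 0.5.

    u: current utility estimates; asked: set of pairs already queried.
    """
    n = len(u)
    best, best_gap = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            if (i, j) in asked:
                continue
            p = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))  # P(i beats j)
            gap = abs(p - 0.5)
            if gap < best_gap:
                best, best_gap = (i, j), gap
    return best

u = np.array([2.0, 1.9, -1.0, -3.0])
print(select_query(u, asked=set()))  # (0, 1): the near-tied pair
```

A criterion tailored to the preference structure would instead weight pairs by how much resolving them changes the downstream policy or ranking, not just by local label uncertainty.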

Preferences are inherently contextual and heterogeneous. Wang et al. (2025) show that personalized building climate control requires contextual preferential Bayesian optimization to account for both individual differences and environmental factors like outdoor temperature. Similarly, Ozaki et al. (2023) address multi-objective scenarios where decision-maker preferences over Pareto-optimal solutions must be learned interactively. Both works highlight that a single utility function rarely captures real-world complexity.
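The contextual idea can be sketched with a deliberately simple stand-in: a linear utility over features of the decision variable and the context, fit by logistic regression on feature differences of observed comparisons. Everything here (the features, the data, the setpoint scenario) is illustrative and much simpler than the Gaussian-process machinery in Wang et al.:

```python
import numpy as np

def phi(x, c):
    # Hypothetical features: setpoint and a setpoint-context interaction.
    return np.array([x, x * c])

def fit(comparisons, n_iters=3000, lr=0.05):
    """comparisons: list of (x_preferred, x_other, context) triples."""
    w = np.zeros(2)
    for _ in range(n_iters):
        grad = np.zeros(2)
        for x1, x2, c in comparisons:
            d = phi(x1, c) - phi(x2, c)
            p = 1.0 / (1.0 + np.exp(-w @ d))  # P(x1 preferred | c)
            grad += (1.0 - p) * d             # logistic-loss gradient
        w += lr * grad / len(comparisons)
    return w

# Simulated user: prefers warm setpoints when it is cold outside (c = -1)
# and cool setpoints when it is warm outside (c = +1).
data = [(24, 20, -1.0), (24, 22, -1.0), (20, 24, 1.0), (20, 22, 1.0)]
w = fit(data)
score = lambda x, c: w @ phi(x, c)
print(score(24, -1.0) > score(20, -1.0))  # True: prefers 24 when cold
print(score(20, 1.0) > score(24, 1.0))    # True: prefers 20 when warm
```

A single context-free utility could not represent this user at all, which is the point: the context variable lets one model flip its ranking across conditions.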

Preference aggregation introduces normative challenges. The Stanford CS329H curriculum explicitly covers “preference heterogeneity and aggregation” and asks “whose preferences?” These aren’t just technical questions—aggregating preferences involves value judgments about whose feedback counts and how disagreements are resolved, linking machine learning design to social choice theory.
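That the aggregation rule itself embeds a value judgment is easy to demonstrate: two standard rules applied to the same individual rankings can crown different winners. A toy example with illustrative rankings (best-to-worst):

```python
# Seven voters, three options.
rankings = (
    [["a", "b", "c"]] * 3 +
    [["b", "c", "a"]] * 2 +
    [["c", "b", "a"]] * 2
)

def plurality(rankings):
    """Winner by first-place votes only."""
    tally = {}
    for r in rankings:
        tally[r[0]] = tally.get(r[0], 0) + 1
    return max(tally, key=tally.get)

def borda(rankings):
    """Winner by position-weighted points: top gets n-1, next n-2, ..."""
    n = len(rankings[0])
    score = {}
    for r in rankings:
        for pos, item in enumerate(r):
            score[item] = score.get(item, 0) + (n - 1 - pos)
    return max(score, key=score.get)

print(plurality(rankings), borda(rankings))  # a b
```

Plurality picks "a" (most first-place votes), while Borda picks "b" (broad second-place support). Neither answer is wrong; choosing between them is exactly the normative question the curriculum raises.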

Open Questions

  • How do we balance exploration-exploitation tradeoffs when the preference model itself is uncertain? Current active learning methods optimize for either objective uncertainty or preference uncertainty, but not both simultaneously in a principled way.

  • Can we develop theoretically grounded methods for preference aggregation that go beyond majority voting while remaining computationally tractable? The connection to classical impossibility results in social choice (Arrow’s theorem, etc.) suggests fundamental limits that ML practitioners rarely engage with.
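One concrete face of those impossibility results is the Condorcet paradox: pairwise majority voting over perfectly rational individual rankings can yield a collective preference cycle. Three illustrative voters suffice:

```python
rankings = [
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["c", "a", "b"],
]

def majority_prefers(rankings, x, y):
    """True if a strict majority ranks x above y."""
    votes = sum(1 for r in rankings if r.index(x) < r.index(y))
    return votes > len(rankings) / 2

print(majority_prefers(rankings, "a", "b"))  # True
print(majority_prefers(rankings, "b", "c"))  # True
print(majority_prefers(rankings, "c", "a"))  # True -> a > b > c > a, a cycle
```

Any ML system that aggregates preference labels by per-pair majority inherits this: the aggregate "ground truth" may not correspond to any consistent utility function.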


Have thoughts on this?

I'd love to hear from you — questions, corrections, disagreements, or anything else.

hi@changkun.de
© 2008 - 2026 Changkun Ou. All rights reserved.