A few years ago, Sven Mayer, Andreas Butz, and I published a paper examining how human expertise affects the human-AI optimization loop: https://arxiv.org/abs/2302.05665
At the time, the debate around this work centered on which level of expertise is sufficient to achieve a user's goals, whereas our paper focused primarily on evaluating judgments. In retrospect, we were fortunate the paper was published at all. Importantly, the core implications remain largely unchanged even with the advent of LLMs: there is no final answer, because everything involves trade-offs, or what engineers call Pareto sets. LLMs simply push that frontier further out. They make shallow competence look powerful, which obscures the underlying expertise gap rather than diminishing its importance.
I believe treating expertise as a category or threshold is a mistake. Instead, we should view it as a risk-control system. The core capabilities of intelligence remain the same: knowing what is unknown (calibration), spotting hallucinations (error detection), and staying connected to reality (sensemaking).
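As a concrete handle on the calibration piece, here is a minimal sketch of the standard expected calibration error, which measures the gap between how confident a judge (human or model) claims to be and how often they are actually right. The numbers in the toy example are invented for illustration, not data from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: average |accuracy - confidence| over equal-width
    confidence bins, weighted by how many judgments fall in each bin.
    A well-calibrated judge has an ECE close to 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        bin_conf = confidences[mask].mean()   # how sure the judge felt
        bin_acc = correct[mask].mean()        # how often the judge was right
        ece += mask.mean() * abs(bin_acc - bin_conf)
    return ece

# Toy example: overconfident judgments (high stated confidence, mediocre accuracy)
print(expected_calibration_error(
    confidences=[0.9, 0.95, 0.9, 0.85, 0.9, 0.95],
    correct=[1, 0, 1, 0, 0, 1]))
```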
The following content is generated by LLMs and may contain inaccuracies.
Context
This work sits at the intersection of human-computer interaction, AI optimization, and epistemic uncertainty — fields increasingly relevant as LLMs democratize access to AI capabilities while obscuring their limitations. The core tension: in human-AI collaborative optimization, should we treat user expertise as a binary threshold for participation, or as a continuous variable that shapes interaction quality and risk? The original IUI 2023 paper found that novices achieve expert-level output quality but terminate optimization earlier with higher satisfaction, while experts iterate longer, exploring more diverse solutions despite lower satisfaction. This challenges assumptions about “sufficient expertise” and highlights a Pareto frontier problem — there’s no universally optimal stopping point, only trade-offs between solution diversity, iteration cost, and subjective confidence.
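To illustrate the Pareto-frontier framing, here is a minimal sketch that keeps the non-dominated candidates among hypothetical stopping points, each scored on solution diversity, iteration cost, and subjective confidence. The axes and values are illustrative only and are not taken from the study.

```python
from typing import List, Tuple

# Hypothetical scores for candidate stopping points in an optimization session:
# (solution diversity, negative iteration cost, subjective confidence).
# Higher is better on every axis.
Candidate = Tuple[float, float, float]

def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only non-dominated candidates. None of the survivors is 'the' answer,
    which is the sense in which there is no universally optimal stopping point."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other != c)]

stops = [
    (0.2, -3.0, 0.9),    # stop early: low diversity, cheap, high satisfaction
    (0.7, -9.0, 0.6),    # iterate long: diverse, expensive, lower satisfaction
    (0.6, -10.0, 0.5),   # dominated: less diverse, costlier, less satisfying than the long run
]
print(pareto_front(stops))  # the first two remain; neither dominates the other
```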
Key Insights
The shift from expertise-as-threshold to expertise-as-risk-control aligns with emerging work on AI calibration and human trust dynamics, where the gap between perceived and actual model capability creates systematic failures. LLMs amplify this: they produce fluent, confident-sounding outputs that mask underlying brittleness, making shallow competence appear robust and reducing users' ability to detect when models hallucinate or drift from reality. This echoes research on automation bias, where over-reliance on AI tools degrades human metacognitive monitoring.
Framing expertise as calibration (knowing unknowns), error detection (spotting hallucinations), and sensemaking (grounding in reality) connects to Kahneman’s distinction between System 1 and System 2 thinking: experts don’t just produce better solutions — they maintain skeptical, iterative engagement with AI outputs, refusing premature closure. This reframes novice “satisfaction” not as success, but as potentially dangerous overconfidence in underexplored solution spaces.
Open Questions
How might we design interfaces that make expertise gaps visible rather than hidden — e.g., by exposing model uncertainty, alternative solutions, or iteration histories that prompt deeper exploration? Could we quantify the cost of premature optimization termination in domains where unexamined risks compound over time (e.g., medical diagnosis, policy design)?
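One sketch of the first question, assuming only that the interface can sample the model repeatedly (the `ask_model` callable is a hypothetical placeholder, not a real API), is to surface disagreement across samples instead of presenting a single fluent answer:

```python
from collections import Counter
from typing import Callable, List

def disagreement_report(ask_model: Callable[[str], str], prompt: str, n: int = 8) -> str:
    """Sample the model n times and summarize how much the answers disagree.
    High disagreement is shown to the user rather than hidden behind one
    confident-sounding answer, prompting the skepticism experts apply anyway."""
    answers: List[str] = [ask_model(prompt) for _ in range(n)]
    counts = Counter(a.strip().lower() for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / n
    lines = [f"agreement on most common answer: {agreement:.0%}"]
    if agreement < 0.75:  # arbitrary illustrative threshold
        lines.append("warning: answers are unstable; treat with care")
        lines.append("alternatives seen: " + ", ".join(counts))
    return "\n".join(lines)
```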