Useful resources:
- https://huggingface.co/spaces/transformers-community/Transformers-tenets
- https://ianarawjo.github.io/Guidelines-for-Reporting-LLM-Integrated-Systems-in-HCI/
The following content is generated by LLMs and may contain inaccuracies.
Context
As LLM-integrated systems proliferate in HCI research, the field faces a reproducibility and transparency crisis. Unlike traditional computational systems, LLMs introduce nondeterminism, opaque prompting strategies, and rapidly shifting model capabilities that make replication difficult. The Guidelines for Reporting LLM-Integrated Systems address this gap by proposing standardized reporting practices for HCI researchers working with language models. This matters now because peer review struggles to evaluate systems where critical implementation details—prompt engineering, model versions, failure modes—are often omitted or underspecified.
Key Insights
1. Methodological Debt in Prompt Engineering
HCI research increasingly treats prompts as implementation details rather than experimental variables. Yet prompt design critically shapes user experience and system behavior. The guidelines advocate reporting not just final prompts but also iteration processes and sensitivity analysis. This aligns with calls in Transformers library development to “maintain the unmaintainable”—documenting messy development realities rather than sanitized outcomes. Without prompt versioning and ablation studies, findings remain unreproducible.
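The prompt-versioning practice described above could be sketched as follows. This is a minimal illustration, not an implementation from the guidelines; the names `PromptVersion`, `PromptLog`, and `record` are hypothetical, and the content hash is just one way to give each prompt revision a citable identifier.

```python
# Sketch: append-only prompt revision log, so a paper can report the
# iteration process (not just the final prompt). All names are illustrative.
import hashlib
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PromptVersion:
    """One entry in a prompt's revision history."""
    text: str
    rationale: str  # why this revision was made, e.g. pilot feedback

    @property
    def digest(self) -> str:
        # A content hash gives a stable, citable identifier for this revision.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]

@dataclass
class PromptLog:
    """Append-only history; report the full log, not a sanitized final prompt."""
    versions: list = field(default_factory=list)

    def record(self, text: str, rationale: str) -> str:
        version = PromptVersion(text, rationale)
        self.versions.append(version)
        return version.digest

log = PromptLog()
log.record("Summarize the user's message.", "initial draft")
vid = log.record(
    "Summarize the user's message in one sentence.",
    "pilot users found outputs too long",
)
print(vid, len(log.versions))
```

Reporting the `rationale` field alongside each revision is what distinguishes this from ordinary version control: it documents the messy iteration process, which is the part ablation reviewers need.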
2. The Model Specification Problem
Generic references to “GPT-4” or “Claude” mask enormous variance. Model snapshots, temperature settings, and API versioning produce materially different behaviors. Research on model drift shows performance degradation over time even for fixed model names. The guidelines recommend timestamped model identifiers and capturing API responses for post-hoc analysis—a practice standard in ML benchmarking but rare in HCI evaluation.
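The provenance-capture practice recommended here can be sketched with a thin logging wrapper around each model call. This is an assumption-laden illustration: `record_call`, the field names, and the example model identifier are all hypothetical, and real APIs expose snapshot identifiers differently (some return the resolved model name in the response body).

```python
# Sketch: timestamped, append-only record of one LLM API call, so behavior
# can be re-analyzed post hoc even after the hosted model drifts or retires.
import json
import time

def record_call(model_id: str, params: dict, response: dict, path: str) -> dict:
    """Append one call record (JSON Lines) for post-hoc analysis."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "requested_model": model_id,   # a dated snapshot id, never just "GPT-4"
        "params": params,              # temperature, max_tokens, etc.
        "response": response,          # the full raw payload, not just the text
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = record_call(
    "example-model-2024-01-01",            # hypothetical snapshot identifier
    {"temperature": 0.2},
    {"model": "example-model-2024-01-01", "output": "..."},
    "calls.jsonl",
)
```

Storing the whole response payload (rather than only the generated text) is what makes the ML-benchmarking practice transferable to HCI evaluation: finish reasons, token counts, and resolved model names all become auditable.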
3. Failure Mode Documentation as Design Knowledge
Traditional HCI reporting emphasizes successful interactions; LLM systems demand documenting characteristic failures. Hallucinations, context window limitations, and reasoning breakdowns aren’t bugs but inherent properties. Systematic failure taxonomies (as proposed in the guidelines) transform error cases into reusable design knowledge, enabling cumulative progress rather than repeated rediscovery.
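A systematic failure taxonomy could be operationalized as simply as the sketch below. The category names are illustrative examples drawn from this section's prose, not the guidelines' official taxonomy, and hand-tagging of error cases is assumed.

```python
# Sketch: tag observed error cases with a fixed failure taxonomy and
# aggregate them into a reportable frequency profile. Categories are
# illustrative, taken from the surrounding text.
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    HALLUCINATION = "hallucination"              # fluent but factually wrong
    CONTEXT_OVERFLOW = "context_overflow"        # input exceeded context window
    REASONING_BREAKDOWN = "reasoning_breakdown"  # lost multi-step coherence

def failure_profile(tagged_cases: list) -> dict:
    """Aggregate (description, mode) pairs into a frequency table."""
    counts = Counter(mode for _, mode in tagged_cases)
    # Report every category, including zeros, so absences are documented too.
    return {mode.value: counts.get(mode, 0) for mode in FailureMode}

cases = [
    ("cited a nonexistent paper", FailureMode.HALLUCINATION),
    ("truncated a long transcript", FailureMode.CONTEXT_OVERFLOW),
    ("invented a source URL", FailureMode.HALLUCINATION),
]
profile = failure_profile(cases)
print(profile)
```

Even this trivial structure turns scattered anecdotes into a comparable artifact across papers, which is the "reusable design knowledge" the guidelines aim for.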
Open Questions
- How can HCI adapt peer review timelines when model obsolescence occurs faster than publication cycles—should we move toward living documentation standards?
- What minimal reproducibility threshold balances scientific rigor against protecting proprietary prompt engineering, which can confer competitive advantage in applied contexts?