Chapter 5: The Human at the Console
Thesis: When what you must satisfy is a person's true preference or intent, you are facing permanent partial observability: the latent goal cannot be read out directly, and the capable response is to put a judging actor into the loop, and to question it sparingly and intelligently.
What You Want Is Not What You Said
A scene told to death, yet always true: the user describes what he wants, the engineer builds it to the letter, and on the day of delivery the user says, no, this is not what I wanted.
No one lied. The user spoke the truth, and the engineer did as told. The trouble lies deeper: the thing the user truly wanted was never, from the very start, fully sayable, and could not be. This chapter looks at what capable people do when the goal they must satisfy is locked inside another person's head. The unverifiability here belongs to the "partial observability" among the five faces of Chapter 2: the relevant state is hidden from you, and not hidden temporarily but permanently. You cannot read the goal out of a person's head, and so you cannot verify whether you have actually satisfied it.
The Latent Preference
Let us state this precisely. The user's true preference is a latent variable. It drives his reactions yet never shows itself directly; you can only infer it obliquely from his behavior.
What makes it worse is that this latent goal often cannot be read out even by the user himself. The psychologist Slovic [11] makes an unwelcome but solid claim: much of the time, preferences are not expressed but constructed in the very moment of being asked. When you ask a person what he wants, the answer he gives is usually shaped jointly by your phrasing, by the options at hand, and by whatever reference point he happened to think of; it is not drawn from some preexisting, well-defined store of preference. This means that the seemingly safe order of "first pin down the requirement, then build it" rests on an assumption that often fails: that the requirement, as a definite object, exists prior to the asking.
So what you face is not a situation of "information temporarily missing, fillable by topping it up." Even if the user cooperates throughout and tells you all he knows, the goal still cannot be measured precisely. This is the purest human form of partial observability.
Why Asking Once Is Not Enough
If a preference were a fixed target, then asking once, and asking clearly, would in principle suffice. It is not.
Economics long ago separated two things: stated preference (what a person says he wants) and revealed preference (what a person's actual choices show him to want), and the two frequently fail to agree. A requirements document is a lossy compression: it squeezes a living intent, one that shifts with circumstance, into a static list of items, and what gets discarded is precisely what no one thought of at the time but anyone can point to the moment the finished product appears. Intent itself drifts, too: after seeing a concrete implementation, a person's preference is recalibrated by that implementation, and what he wants now is no longer what he wanted when the project began.
So "asking once" fails not because you asked badly, but because the nature of this object means a single inquiry cannot lock it down. The only thing that can cope with it is one structure: act, observe, and correct, again and again.
The First Move: Put the Judge in the Loop
The first response is to admit that you cannot read the goal out, and so to bring in, at every decision point, the one actor who does know the goal, and let it correct your course. Act, observe the reaction, update, act again. Put the human in the loop.
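A minimal sketch of one such loop, under assumptions that are not in the text: the latent preference is a single hidden number on a small grid, the person only ever reacts to a proposal with "higher", "lower", or "ok", and the system models those reactions as slightly noisy. The names `update`, `propose`, `simulated_user`, and `run_loop` are hypothetical, chosen for illustration.

```python
def update(belief, proposal, response, noise=0.1):
    """Bayes update of a belief {target: probability} after one observed reaction."""
    posterior = {}
    for target, prob in belief.items():
        if target > proposal:
            truthful = "higher"
        elif target < proposal:
            truthful = "lower"
        else:
            truthful = "ok"
        # Assumed noise model: the truthful reaction with probability 1 - noise,
        # each of the two other reactions with probability noise / 2.
        likelihood = (1 - noise) if response == truthful else noise / 2
        posterior[target] = prob * likelihood
    total = sum(posterior.values())
    return {t: p / total for t, p in posterior.items()}


def propose(belief):
    """Act on the current belief: propose the posterior median."""
    cumulative = 0.0
    for target in sorted(belief):
        cumulative += belief[target]
        if cumulative >= 0.5:
            return target
    return max(belief)


def simulated_user(hidden_target):
    """Hypothetical stand-in for the person: reacts to proposals, never states the target."""
    def react(proposal):
        if hidden_target > proposal:
            return "higher"
        if hidden_target < proposal:
            return "lower"
        return "ok"
    return react


def run_loop(react, candidates, max_rounds=10):
    belief = {c: 1 / len(candidates) for c in candidates}  # uniform prior
    proposals = []
    for _ in range(max_rounds):
        proposal = propose(belief)                   # act
        response = react(proposal)                   # observe
        proposals.append(proposal)
        if response == "ok":
            break
        belief = update(belief, proposal, response)  # update
    return proposals


proposals = run_loop(simulated_user(hidden_target=7), candidates=range(11))
print(proposals)
```

The system never reads the hidden target; it only narrows its belief from reactions to concrete proposals. A real person is a far less tidy oracle than `simulated_user`, so the noise model here should be read as a placeholder rather than a claim about human behavior.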
This loop has been reinvented separately in many fields. In human factors, Sheridan's [12] "human supervisory control" positions the human as a judge who supervises and intervenes above the automation, not a role replaced once and for all by a specification. In usability engineering, the experiential wisdom of Gould and Lewis in 1985 [20] compresses it into three principles so plain they almost sound like platitudes, yet are violated by countless projects: focus on users early and continuously, measure empirically, and design iteratively. Nielsen [19] later engineered this into a whole set of usability methods, and offered a surprising empirical figure: just five users in a test will surface about eighty-five percent of the usability problems, so rather than bring in twenty people at once, it is better to run four rounds of five, testing and fixing as you go. A recommender system learns the tastes a user never voiced from his clicks, dwell times, and skips; at bottom this is the same loop. Horvitz's 1999 [22] mixed-initiative user interface and the interactive machine learning proposed by Fails and Olsen in 2003 [23] are both describing the same thing: human and system taking turns, calibrating one another.
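The five-user figure can be checked with a short calculation. The sketch below assumes the simple model usually associated with this claim, in which each test user independently reveals any given problem with the same probability; the value 0.31 is the average commonly attributed to Nielsen and Landauer and is an assumption here, not a number stated in this chapter. The function name `share_found` is hypothetical.

```python
def share_found(n_users, p_single=0.31):
    """Expected share of usability problems seen at least once by n_users,
    assuming each user independently reveals a given problem with probability
    p_single (an assumed average, not a universal constant)."""
    return 1 - (1 - p_single) ** n_users


for n in (1, 5, 10, 20):
    print(n, round(share_found(n), 3))

# Marginal yield of one more user when added to the same, unfixed design.
sixth_user_gain = share_found(6) - share_found(5)
print(round(sixth_user_gain, 3))
```

Under these assumptions five users reach roughly 84 percent, and each additional user on the same design adds only a few points. That diminishing return is the arithmetic behind preferring several small rounds with fixes in between to one large round.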
Here a narrative collapse must be guarded against: interactive elicitation is not one specific technique but a family of methods. Experimental design, active learning, sequential decision making, even exploration in reinforcement learning, are all instances of this same "act-observe-update" loop under different assumptions. To call it "just A/B testing" or "just some algorithm" would shrink a general posture down into a single tool.
The Second Move: Spend Each Question Where It Cuts Deepest
For the loop to turn, you must keep putting questions to the person, and asking has a cost. The user's patience, attention, and time are all scarce; ask too much and too clumsily, and he will tire, give perfunctory answers, or walk away. Hence the second move: since checking has a cost, spend the limited supply of questions where the information is greatest.
This move has a clean theory. Lindley in 1956 [1] gave a measure of the information an experiment provides, and Howard in 1966 [3] proposed information value theory, turning "is it worth paying a cost to obtain this information" into a computable decision. Bayesian experimental design (the review by Chaloner and Verdinelli [5] is a good map) systematizes it: among all the questions you could ask, pick the one expected to compress your uncertainty the most. Formally, if $\theta$ is the latent preference you wish to infer and $y_q$ is the answer to question $q$, you want the $q$ that maximizes expected information gain:
$$q^\star=\arg\max_q\; \mathbb{E}_{y_q}\big[\,\mathrm{H}(\theta)-\mathrm{H}(\theta\mid y_q)\,\big]=\arg\max_q\; I(\theta;y_q),$$
that is, the $q$ that maximizes the mutual information between the answer and the goal. In machine learning this idea is called active learning: the statistical active learning of Cohn and colleagues in 1996 [6], the uncertainty sampling of Lewis and Gale in 1994 [8], and the query by committee of Seung and colleagues in 1992 [7] all ask the same question: which sample is the most worthwhile place to spend the next label. When a user finds it hard to assign a score yet easy to pick the better of two options, pairwise comparison (the Bradley-Terry model [2], $P(a\succ b)=\sigma(s_a-s_b)$) becomes one of the most information-efficient ways to ask.
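As a hedged illustration of choosing $q^\star$, the sketch below scores every pairwise question by $I(\theta;y_q)$ under a Bradley-Terry answer model. It assumes a tiny discrete belief: a handful of candidate score vectors (the `hypotheses`) with prior `weights`, all invented for the example. Because mutual information is symmetric, it is computed as $\mathrm{H}(y_q)-\mathbb{E}_\theta[\mathrm{H}(y_q\mid\theta)]$, which equals $\mathrm{H}(\theta)-\mathbb{E}_{y_q}[\mathrm{H}(\theta\mid y_q)]$.

```python
import math
from itertools import combinations


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_entropy(p):
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(hypotheses, weights, a, b):
    """I(theta; y) for the question 'do you prefer a over b?'.

    hypotheses: list of dicts mapping item -> latent score s_item
    weights:    prior probability of each hypothesis (sums to 1)
    """
    # Bradley-Terry answer model: P(a preferred to b | theta) = sigmoid(s_a - s_b)
    probs = [sigmoid(h[a] - h[b]) for h in hypotheses]
    marginal = sum(w * p for w, p in zip(weights, probs))
    conditional = sum(w * binary_entropy(p) for w, p in zip(weights, probs))
    return binary_entropy(marginal) - conditional


def best_query(hypotheses, weights, items):
    """Return the pair whose answer is expected to be most informative."""
    return max(
        combinations(items, 2),
        key=lambda pair: expected_information_gain(hypotheses, weights, *pair),
    )


# Both hypotheses agree that A is clearly best; they disagree only on B versus C.
hypotheses = [
    {"A": 3.0, "B": 1.0, "C": -1.0},
    {"A": 3.0, "B": -1.0, "C": 1.0},
]
weights = [0.5, 0.5]

print(best_query(hypotheses, weights, ["A", "B", "C"]))
```

In this toy belief, asking about A against anything tells the system almost nothing it does not already expect, so the question is spent on B versus C, where the hypotheses disagree. Real systems replace the two-hypothesis belief with a posterior they can only approximate.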
The same collapse must be guarded against. The title of the 2016 review by Shahriari and colleagues is telling: "Taking the Human Out of the Loop," about using Gaussian processes for Bayesian optimization to automatically pick the next point to try. It is extremely useful, but it is only one implementation within this family of methods, not the whole of "optimal screening." To equate this move with Gaussian processes is to equate transportation with the automobile.
The Contemporary Incarnation, and Its Backlash
Put these two moves together and you have today's mainstream method for aligning large models. Reinforcement learning from human feedback (RLHF; pioneered by Christiano and colleagues in 2017 [27], applied to summarization by Stiennon and colleagues in 2020 [28], and carried into general language models by the InstructGPT of Ouyang and colleagues in 2022 [29]) does exactly this: it uses people's pairwise comparisons to learn a reward model, then uses that model as a proxy for human preference to optimize the system. It stitches together "act-observe-update" and "spend each question where it cuts deepest." Its effectiveness is so striking that it is often summed up in a single comparison: the outputs of an InstructGPT with only 1.3 billion parameters, fine-tuned on human feedback, were preferred by people over those of the original GPT-3, which at a full 175 billion parameters was more than 100 times larger. Aligning with human preference sometimes matters more than simply making the model bigger.
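The reward-model step can be sketched in miniature. The example below is a toy under stated assumptions, not the procedure of the cited papers: instead of a neural network it learns one scalar score per item, it fits those scores to pairwise comparisons by plain gradient ascent on the Bradley-Terry log-likelihood, and the comparison counts are invented. The function name `fit_bradley_terry` and its `lr` and `steps` parameters are hypothetical.

```python
import math


def fit_bradley_terry(comparisons, items, lr=0.05, steps=2000):
    """Fit latent scores so that P(winner beats loser) = sigmoid(s_w - s_l).

    comparisons: list of (winner, loser) pairs from human judgments
    """
    scores = {item: 0.0 for item in items}
    for _ in range(steps):
        grad = {item: 0.0 for item in items}
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            # Gradient of log sigmoid(s_w - s_l) with respect to each score.
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for item in items:
            scores[item] += lr * grad[item]
    # Scores are identified only up to an additive constant; starting at zero
    # with equal-and-opposite updates keeps their sum at zero.
    return scores


# Invented judgments: A beat B in 8 of 10 trials, B beat C in 7 of 10,
# and A beat C in 9 of 10.
comparisons = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 9 + [("C", "A")] * 1
)

scores = fit_bradley_terry(comparisons, ["A", "B", "C"])
print({item: round(s, 2) for item, s in scores.items()})
```

The fitted scores are the proxy: anything that later optimizes against them is optimizing the model of the person, not the person.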
And its mode of failure rehearses, exactly, the themes of the chapters to come. That learned reward model is a proxy for the true preference, and so it gets gamed: the system learns to please the reward model rather than to please the person, and the outputs look better while actually being worse. This is precisely the Goodhart failure (Goodhart's law) that Chapter 11 confronts head on. The "oracle" in the loop (the human) is itself unreliable: it tires, it contradicts itself, it carries systematic biases, and putting a judge into the loop does not amount to putting truth into the loop. Bainbridge's 1983 [13] essay "Ironies of Automation" pointed this out long ago: the more you push a person up into the supervisor's seat, the less he has of the hands-on practice and situational feel needed to keep his judgment sharp, so that when he must finally take over, he is the least prepared of all. Trust calibration (the work of Lee and See in 2004 [15]) thus becomes a problem of its own: a person may over-rely on a system he should not trust, or abandon one that is in fact reliable.
Putting the human in the loop does not dissolve unverifiability; the unverifiability moves house, from "can I verify the goal" to "can I trust this imperfect judge in the loop."
Where This Chapter Leads
The human at the console teaches us two moves: when you yourself lack the power to verify, invite a judging actor into the loop (oracle in the loop), and spend expensive checks where the information is greatest (optimal screening). These two moves recur throughout the book; Part III will lift them out of this site and name them on their own, with Chapter 10 on borrowed judgment and Chapter 9 on spending checks where they cut deepest.
But this chapter has rested, from beginning to end, on one premise: that you are still present, the loop is still turning, and you can observe and correct at any time. The next chapter removes that premise. When you must hand the power to act away, letting a system decide on its own where you cannot see it, facing situations you have not rehearsed, the problem of verification puts on a harder face.
References
Waypoints: 1. historical scientific judgment; 2. theoretically studied material; 3. how science progresses; 4. how to live in an unverifiable world. This section was checked source by source.
- D. V. Lindley (1956). "On a Measure of the Information Provided by an Experiment." The Annals of Mathematical Statistics, 27(4), 986-1005. [2] Lindley, in the language of information theory, defined "how much information an experiment provides": the value of one observation is measured by the difference in uncertainty about the parameter before and after the experiment (the expected gain in information from prior to posterior). This turns "which question to ask" from intuition into a computable quantity, the theoretical source of this chapter's move of "spending each question where it cuts deepest," and the founding work of what later became Bayesian experimental design.
- R. A. Bradley, M. E. Terry (1952). "Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons." Biometrika, 39(3/4), 324-345. [2] Bradley and Terry proposed a probabilistic model of pairwise comparison: each object is given a latent score, and when two are compared, the probability of winning is determined by the difference of scores through a logistic function. When a person finds it hard to assign a score directly yet easy to pick the better of two options, this model converts a string of "A or B" answers into a set of estimable preference scores, which is exactly the statistical basis of today's practice of training reward models from human pairwise comparisons.
- R. A. Howard (1966). "Information Value Theory." IEEE Transactions on Systems Science and Cybernetics, 2(1), 22-26. [2][4] Howard proposed the concept of "the value of information": what a piece of information is worth equals the gain in decision quality it can bring once obtained. From this he derived such upper bounds as the "expected value of perfect information," turning "is it worth paying a cost to find out" into a calculable decision problem. This chapter uses it to support a plain but crucial judgment: checking has a cost, and is worth doing only when it can change an action.
- J. Mockus, V. Tiesis, A. Zilinskas (1978). "The Application of Bayesian Methods for Seeking the Extremum." Towards Global Optimization, 2, 117-129. North-Holland. [2] Mockus and colleagues applied Bayesian methods to seeking the extremum of an expensive black-box function: a probabilistic model captures one's belief about the unknown function, and on that basis the next most worthwhile point to try is chosen, so that each trial carries as much information as possible. This is early work in Bayesian optimization, and the acquisition criteria it proposed, such as expected improvement, remain mainstream to this day; it can be seen as an instance of "spending each question where it cuts deepest" in a continuous search space.
- K. Chaloner, I. Verdinelli (1995). "Bayesian Experimental Design: A Review." Statistical Science, 10(3), 273-304. [2] Chaloner and Verdinelli give a systematic review of Bayesian experimental design: it writes experimental design as an optimization problem maximizing expected utility, and lays out the correspondence between utility functions and optimal criteria under different inferential goals (parameter estimation, prediction, model discrimination). It is the field's acknowledged introductory map, cited here to show that "choosing the question with the greatest information" is not a single trick but a whole set of methods with a theoretical skeleton.
- D. Cohn, Z. Ghahramani, M. Jordan (1996). "Active Learning with Statistical Models." Journal of Artificial Intelligence Research, 4, 129-145. [2] Cohn and colleagues gave active learning a statistical perspective: under statistical models for regression and classification, choose the query point that most reduces model variance (that is, future error), and give an analytically computable form. This brings "where is the next label most worthwhile" down to an optimizable objective, and is a representative work moving active learning from heuristics to theoretical grounding.
- H. S. Seung, M. Opper, H. Sompolinsky (1992). "Query by Committee." COLT '92, 287-294. [2] Seung and colleagues proposed "query by committee": maintain a set of hypotheses all consistent with the existing data as a committee, and pick out for labeling precisely those samples on which the committee disagrees most, because the points of greatest disagreement most compress the version space. It gives an active-query criterion that is both intuitively clear and theoretically supported, a classic instance of this chapter's concentrating questions where information is greatest.
- D. D. Lewis, W. A. Gale (1994). "A Sequential Algorithm for Training Text Classifiers." SIGIR '94, 3-12. [2] Lewis and Gale proposed uncertainty sampling: when training a text classifier, rather than sample at random for labeling, give priority to the documents the model is least sure of (predicted probability closest to the decision boundary) and ask a person to label them. This simple and efficient strategy greatly reduces the labeling required, and is one of the most common practices of active learning in real systems, echoing this chapter's call to "ask sparingly and intelligently."
- B. Settles (2009). Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison. [2][4] Settles, in this survey, organizes the query scenarios of active learning (pool-based, stream-based, query synthesis) and the query strategies (uncertainty sampling, query by committee, expected error reduction, and others) into one complete map, and is the most widely cited introductory reference in the field. A reader wishing to grasp systematically how to choose the next question within the "act-observe-update" loop will find this survey the most convenient overview.
- B. Settles (2011). "From Theories to Queries: Active Learning in Practice." JMLR Workshop and Conference Proceedings, 16, 1-18. [2][4] Settles, in this article, draws the gaze from theory back to practice, discussing the troubles active learning actually meets when deployed: labeling costs are uneven, labelers make mistakes, and the gains of different strategies are often overestimated. It reminds the reader that "asking intelligently" must, in reality, face an imperfect human who tires and errs, dovetailing neatly with this chapter's later discussion of how "the oracle in the loop is itself unreliable."
- P. Slovic (1995). "The Construction of Preference." American Psychologist, 50(5), 364-371. [2][4] Slovic, synthesizing a large body of behavioral research, proposes a forceful claim: in many settings a person's preferences do not exist prior to the asking, waiting to be read out, but are constructed in the very moment of being asked, being offered options, being given a reference point. It directly unsettles the premise on which "first pin down the requirement, then build it" relies, and is the psychological pillar of this chapter's section on the imprecision of the latent preference.
- T. B. Sheridan (1992). Telerobotics, Automation, and Human Supervisory Control. MIT Press. [2][4] Sheridan systematically sets out "human supervisory control": in a highly automated system, the human is not replaced once and for all by a specification, but retreats to the supervisor's position, responsible for setting goals, monitoring operation, and intervening when necessary. This book provides the classic human-factors framework for "putting the judge in the loop," and also points out the new difficulties the supervisor's role itself brings, laying groundwork for this chapter's later text.
- L. Bainbridge (1983). "Ironies of Automation." Automatica, 19(6), 775-779. [2][4] Bainbridge points out several ironies of automation: the more automation takes over routine operation, the more what is left to the human is the hardest, least practiced exception handling; and the more a person is pushed up into the supervisor's position, the less he has of the hands-on practice and situational feel needed to keep his judgment sharp, so that when he must finally take over, he is the least prepared of all. This short essay is the key evidence for this chapter's claim that "putting the human in the loop does not amount to putting truth into it."
- R. Parasuraman, T. B. Sheridan, C. D. Wickens (2000). "A Model for Types and Levels of Human Interaction with Automation." IEEE Transactions on Systems, Man, and Cybernetics, Part A, 30(3), 286-297. [2][4] Parasuraman and colleagues propose an analytical framework: automation can act on four classes of function, information acquisition, information analysis, decision selection, and action execution, each with a continuous range of levels from fully manual to fully automatic, and they discuss the human-factors consequences to weigh when choosing a level of automation. It turns "how much should the system do for the human" from a slogan into a designable dimension, providing a scale for "how deep into the loop the judge should be placed."
- J. D. Lee, K. A. See (2004). "Trust in Automation: Designing for Appropriate Reliance." Human Factors, 46(1), 50-80. [2][4] Lee and See systematically review human trust in automation: trust is dynamically calibrated as the system performs, and the real goal is not more trust but "appropriate reliance," in which the level of trust matches the system's true reliability. They point out that both over-trust and under-trust can bring disaster, the former making a person rely on a system he should not trust, the latter making him abandon one that is in fact reliable. This is the core reference for this chapter's "moving house" of unverifiability into "whether one can trust the judge in the loop."
- M. R. Endsley (1995). "Toward a Theory of Situation Awareness in Dynamic Systems." Human Factors, 37(1), 32-64. [2][4] Endsley proposes a widely adopted three-level model of "situation awareness": perceiving the elements of the environment, understanding their current meaning, and projecting their future course. It explains that for a supervisor to correct course in time, he must first have sufficient perception and understanding of the situation before him, and that automation may precisely erode this perception. This supplies the cognitive condition for this chapter's "for the loop to turn, the human must really be present."
- S. K. Card, T. P. Moran, A. Newell (1983). The Psychology of Human-Computer Interaction. Lawrence Erlbaum Associates. [2][4] Card, Moran, and Newell laid the cognitive-engineering foundations of human-computer interaction, proposing the GOMS model and the "model human processor" framework, in an attempt to make a person's operation time and cognitive load into predictable, computable quantities. It represents the tradition of "designing interaction by treating the human as a modelable subsystem," and is the scholarly forerunner of this chapter's view of user behavior as an observable, inferable signal.
- D. A. Norman (1988). The Psychology of Everyday Things. Basic Books. [4] Norman, in this design classic, proposes the notions of affordance, mapping, constraints, visibility, feedback, and conceptual model, arguing that when a person uses a thing wrongly, it is usually the design's fault and not the person's: good design should make the correct usage self-evident. It sets "reading what the user really wants to do" as the central problem of design, answering at a distance to this chapter's "what you want is not what you said."
- J. Nielsen (1993). Usability Engineering. Academic Press. [4] Nielsen brings usability down from an ideal into a whole set of operable engineering methods: measurable usability metrics, heuristic evaluation, low-cost "discount usability" testing, and iterative evaluation running through development. It engineers this chapter's "act-observe-correct" loop into a process a software team can carry out day to day, and is the standard reference for usability practice.
- J. D. Gould, C. Lewis (1985). "Designing for Usability: Key Principles and What Designers Think." Communications of the ACM, 28(3), 300-311. [4] Gould and Lewis compress usability design into three principles so plain they almost sound like platitudes, yet are violated by countless projects: focus on users early and continuously, measure empirically, and design iteratively. The article also records the contrast of designers verbally agreeing yet failing to follow through. These three are this chapter's earliest, cleanest engineering statement of "putting the judge in the loop."
- H. Beyer, K. Holtzblatt (1998). Contextual Design: Defining Customer-Centered Systems. Morgan Kaufmann. [4] Beyer and Holtzblatt propose "contextual design": go to the user's real worksite to observe and interview, organize scattered observations into models of workflow, culture, and physical layout, and drive system design from there. Its methodological premise is exactly this chapter's core: users cannot say clearly what they want, so latent needs must be dug out within the context rather than merely heard from verbal description.
- E. Horvitz (1999). "Principles of Mixed-Initiative User Interfaces." CHI '99, 159-166. [2][4] Horvitz proposes a set of principles for "mixed-initiative interfaces": under uncertainty the system should weigh the expected gain of acting automatically against the cost of interrupting the user, knowing when to step in and when to yield to the person, and have self-awareness of how certain it is of its own action. It depicts human and system taking turns and calibrating one another as a designable process of collaboration, and is a representative work of this chapter's family of interactive-elicitation methods.
- J. A. Fails, D. R. Olsen Jr. (2003). "Interactive Machine Learning." IUI '03, 39-45. [2][4] Fails and Olsen proposed and named "interactive machine learning": unlike traditional one-shot offline training, it lets a person repeatedly correct the model within a fast training-feedback loop, so that even non-experts can shape model behavior on the spot. It reworks machine learning from "gather data first, then train" into an on-site "act-observe-update" loop, and is an early exemplar of this chapter's loop on the machine-learning side.
- S. Amershi, D. Weld, M. Vorvoreanu, A. Fourney, B. Nushi, P. Collisson, J. Suh, S. Iqbal, P. Bennett, K. Inkpen, J. Teevan, R. Kikin-Gil, E. Horvitz (2019). "Guidelines for Human-AI Interaction." CHI '19. [2][4] Amershi and colleagues gathered and validated a set of design guidelines for human-machine collaboration, spanning how a system should make clear what it can do, how it handles uncertainty and error, and how it learns from interaction while respecting user corrections. It organizes the scattered experience above into an actionable checklist, giving contemporary engineering guidance for "how a human and an imperfect system can coexist within one loop."
- W. B. Knox, P. Stone (2009). "Interactively Shaping Agents via Human Reinforcement: The TAMER Framework." K-CAP '09. [2][4] Knox and Stone proposed the TAMER framework: a person gives good-or-bad feedback in real time as the agent acts, and the agent treats these human evaluations as a reward signal to be learned, shaping its own behavior, rather than relying on a reward built into the environment. It demonstrates how to train an agent directly with a person's immediate judgment, and is a forerunner of the later line of "learning from human feedback."
- D. Hadfield-Menell, S. J. Russell, P. Abbeel, A. Dragan (2016). "Cooperative Inverse Reinforcement Learning." NeurIPS 2016. [2][4] Hadfield-Menell and colleagues formulate value alignment as a cooperative game: the human knows the reward function and the machine does not, and the machine's task is to infer this latent goal by observing the human's behavior, both sides working together to realize it better. It formalizes this chapter's theme that "the goal is hidden in the human's head and can only be inferred obliquely" into a solvable learning problem, and naturally explains why the machine should actively ask rather than act on its own.
- P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, D. Amodei (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017. [2][4] Christiano and colleagues established the paradigm of doing reinforcement learning from human preferences: when the reward is hard to write down, have people make pairwise comparisons of two stretches of an agent's behavior, learn a reward model from these as a proxy for human preference, and use it to optimize the policy. This stitches this chapter's two moves into one place, being at once "act-observe-update" and a spending of expensive human comparisons where they cut deepest, and is the direct source of the mainstream method for contemporary large-model alignment.
- N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, P. Christiano (2020). "Learning to Summarize from Human Feedback." NeurIPS 2020. [2][4] Stiennon and colleagues applied reinforcement learning from human preferences to text summarization: collecting people's pairwise comparisons of summary quality to train a reward model, then using it to fine-tune a language model, with the resulting summaries significantly preferred in human evaluation over the supervised-learning-only version. It demonstrates the effectiveness of "learn a preference proxy, then optimize" on a real language task, and paves the way for the instruction tuning that followed.
- L. Ouyang et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS 2022. [2][4] Ouyang and colleagues' InstructGPT applied reinforcement learning from human feedback to a general language model: first supervised fine-tuning on human-written demonstrations, then training a reward model from human pairwise comparisons and optimizing the policy by it, making the model follow instructions better and produce less harmful output. It shows that a small model aligned this way can beat a far larger original model in human evaluation, and is the landmark work bringing this chapter's two moves down into large-model practice.
- C. Wirth, R. Akrour, G. Neumann, J. Fürnkranz (2017). "A Survey of Preference-Based Reinforcement Learning Methods." Journal of Machine Learning Research, 18(136), 1-46. [2][4] Wirth and colleagues survey "preference-based reinforcement learning": when a numerical reward is hard to give, have people provide preference orderings over trajectories, actions, or states, and learn a policy or reward from these. The article organizes the different preference types, learning objectives, and algorithms, and discusses their trade-offs. It provides a systematic overview of this whole technical line for the chapter, letting the reader place scattered methods within one framework.
- B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. de Freitas (2016). "Taking the Human Out of the Loop: A Review of Bayesian Optimization." Proceedings of the IEEE, 104(1), 148-175. [2][4] Shahriari and colleagues review Bayesian optimization: a probabilistic surrogate model (most often a Gaussian process) captures one's belief about an expensive black-box objective, and an acquisition function automatically picks the next point most worth trying, handing a search that once required manual tuning over to the algorithm. Though the title is "Taking the Human Out of the Loop," this chapter cites it precisely as a reminder that this is only one implementation within the family of "optimal screening" methods; to equate the whole move with Gaussian processes is to equate transportation with the automobile.