Chapter 9: Compressing the Unknown

Thesis: Two moves attack uncertainty head-on. On a slice you can inspect, deliver a guaranteed bound (a certificate); spend your limited checks where they dissolve the most uncertainty (optimal screening).

Part II embedded the moves inside four concrete sites and let them play out, entangled with one another. From this part onward the angle changes: each move is lifted out on its own, washed clean of its domain's jargon, presented in pure form, and then laid across all the sites at once. This is the real payload of the book, that "table of the same move under its many vocabularies."

The eight moves pair off, two by two, into four chapters. The pairing is not a convenience; the pairing is itself a claim: each pair pulls on the same, more basic lever, a point that comes due in Chapter 13. This chapter's pair, certificate and optimal screening, together attack the same thing, namely uncertainty. They compress the unknown from opposite ends: at one end, on a slice you can manage to check, you prove a guaranteed bound; at the other, you spend your limited checks where they dissolve the most uncertainty.

And, by the iron rule laid down in Chapter 4, every move proposed and every cross-domain juxtaposition made must be interrogated once more: is this transfer substantive (same mechanism, same failure mode, same trade-off), or merely a pretty metaphor?

Certificate: Prove a Bound on a Slice

The first move in pure form: rather than verify the whole, produce a bounded, independently recheckable local guarantee. What you deliver is not "all of it is right" but "within this range, it is wrong by at most this much," accompanied by a voucher anyone can quickly verify.

Its cross-domain forms are astonishingly consistent.

In machine learning it is called a generalization bound. Valiant's 1984 PAC framework¹ and the VC dimension of Vapnik and Chervonenkis (brought in by Blumer and colleagues in 1989⁴) give a guarantee of the form "with probability at least $1-\delta$, the true error does not exceed the empirical error plus a complexity penalty":

$$R(h)\ \le\ \hat R(h)+\sqrt{\frac{d\big(\ln(2n/d)+1\big)+\ln(4/\delta)}{n}}.$$

You cannot verify the model's behavior on all future data (that is the open world), but on the slice of "samples already seen" you can prove a bound, with a confidence level, that holds for unseen data. Hoeffding's inequality¹⁷ is its probabilistic engine; PAC-Bayes (McAllester⁶) is its refinement.

In software it is called types and proofs. A type system does not prove a program "all correct," only that one property holds (for instance, that it will not treat an integer as a pointer), and in exchange it buys a decidable, mechanically recheckable check (Pierce⁷). The Curry-Howard correspondence (Howard 1980⁸) sets an equals sign between "proof" and "program," and Necula's 1997 proof-carrying code⁹ pushes the move to its limit: untrusted code carries its own safety proof, and the host need only quickly check the certificate, without re-deriving anything itself. Leroy's formally verified CompCert compiler of 2009¹⁰ and the Z3 solver of de Moura and Bjørner in 2008¹¹ are the industrialization of the same idea.

In numerical computation it is called an error bound. Higham's 2002 backward error analysis¹² and Moore's 1966 interval arithmetic¹³ let you compute carrying "an interval guaranteed to contain the true value," so that what you finally deliver is not a floating-point number that may deceive you, but a guaranteed range. In mathematics it is the zeros verified up to height $T$ from Chapter 7: a bound, not a theorem.

The unifying idea is this: a certificate is a local, bounded, independently recheckable guarantee. Its finest feature is that it exploits the verification asymmetry from Chapter 2: producing a certificate may be extremely expensive, while checking one is extremely cheap. Proof-carrying code, the solution to an NP problem, a mathematical proof, all feed on this same dividend.

It has only one standard failure mode, but a common one: the vacuous bound. A guarantee that is true yet useless, such as "the error is at most one hundred percent" or "this model's generalization error is finite," is logically unimpeachable and operationally worthless. The value of a bound lies not in holding, but in being tight enough to act on.

Optimal Screening: Spend Your Checks Where They Count

The second move in pure form: information has a cost, so allocate your limited checks to where, at the margin, they compress the most uncertainty.

Its cross-domain forms are equally tidy. In statistics and science it is called design of experiments: Fisher's 1935 The Design of Experiments¹⁹ and Box and colleagues' Statistics for Experimenters²⁰ teach how to wring the most information from the fewest trials; Lindley in 1956 gave a measure of "the information an experiment provides"²¹, and Chaloner and Verdinelli systematized Bayesian experimental design²²; Wald's 1945 sequential test²³ lets you decide whether to continue as the data come in. Shannon's 1948 information theory¹⁴ is the underlying currency of all of it. In machine learning it is called active learning: which sample is most worth labeling next (Cohn and colleagues³³, Settles³⁴). In software testing it is called fuzzing: which input to throw compute at to trigger a crash (Miller and colleagues' pioneering 1990 experiment³⁵). The modern scale of this move is striking. Since 2016 Google's OSS-Fuzz has continuously fed vast quantities of malformed input into thousands of open-source projects, and has by now uncovered tens of thousands of defects and vulnerabilities. No human testing team could test exhaustively at that scale; it relies precisely on pouring compute, without pause, toward the places most likely to crash. In auditing it is called sampling: which transactions to check to most likely find a problem. In interfaces it is which question to ask the user (Chapter 5).

Behind all of these lies the same optimization problem: maximize the expected information gain between the answer and the unknown,

$$q^\star=\arg\max_q\ I(\theta;y_q).$$

And when the checking itself must be performed repeatedly, and used even as it proceeds, it grows into the tension of exploration and exploitation, that is, the multi-armed bandit problem. Thompson's 1933 sampling²⁴, Robbins's 1952 founding²⁵, Lai and Robbins's 1985 optimal allocation²⁶, and Auer and colleagues' 2002 finite-time analysis²⁷ give the optimal solution to "how many trials to spend reducing the uncertainty about which option"; its regret grows only logarithmically over time,

$$\mathrm{Regret}(T)=O(\ln T).$$

Here one must guard against a narrative path dependence. Optimal screening is a family of methods: design of experiments, active learning, audit sampling, fuzzing, the bandit, all of it. Kushner, Mockus, through to Jones's 1998 efficient global optimization³⁰, Srinivas and colleagues' 2010 GP-UCB³², and on to the Gaussian-process Bayesian optimization in Shahriari's 2016 review³⁶, are all extremely useful, but they are only one implementation within the family, not the whole of "screening." To equate this move with Gaussian processes is to shrink a universal posture down to a single tool.

It too has only one standard failure mode: optimizing a misspecified information measure. You gather information with great efficiency, but about the wrong question, or the "information" you are maximizing simply does not track what you actually care about. The cleverer the screening, the faster a misspecified measure will lead you astray.

Why the Two Moves Pair, and Where They Lead

Set the two moves side by side: the certificate compresses uncertainty on one slice into a guaranteed bound, while screening spends an information budget to observe the slice that most compresses uncertainty. One is "prove it tight where you can check"; the other is "spend the checking where it most needs checking." They squeeze the same enemy from two ends, namely the unknown. This is also the lever they share: under an information budget, manage where, and with how much force, you cut uncertainty. Chapter 13 will formally name this lever.

But sometimes, no matter how you compress or how you screen, it is not enough, because you simply lack the capacity to make the judgment at all. At that point you can no longer shrink the unknown by yourself; you must borrow the judgment from elsewhere. The next pair of moves is precisely about this.

Next chapter: 10. Borrowed Judgment →← 8. The Organization Blind to Itself

References

Waypoints: 1. historical scientific judgment; 2. theoretically studied material; 3. how science progresses; 4. how to live in an unverifiable world. This section was checked source by source.

L. Valiant (1984). "A Theory of the Learnable." Communications of the ACM, 27(11), 1134-1142. [2] Valiant here proposed the "probably approximately correct" (PAC) learning framework, rigorously defining "learning a concept" as obtaining, with high probability and within polynomial time and samples, a hypothesis whose error is small enough. This paper gave the first provable language for "can it be learned, and how many samples does it take," and is the source of this chapter's "generalization bound" move.
V. Vapnik and A. Chervonenkis (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities." Theory of Probability & Its Applications, 16(2), 264-280. [2] This founding work proved the conditions under which empirical frequencies converge uniformly to true probabilities, and from it grew the VC dimension, later named after the two authors, used to characterize the "capacity" of a family of functions. It explains why an error bound proven on finite samples can hold for unseen data, and is the probabilistic and combinatorial bedrock beneath generalization bounds.
V. Vapnik (1995). The Nature of Statistical Learning Theory. Springer. [2] In this book Vapnik organized statistical learning theory into a complete system: with structural risk minimization at its core, it trades off empirical error against model complexity and is thereby led to the support vector machine. It is the standard reading for understanding why a "complexity penalty" appears in a generalization bound, laying out the intuition of the first move clearly and coherently.
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth (1989). "Learnability and the Vapnik-Chervonenkis Dimension." Journal of the ACM, 36(4), 929-965. [2] This paper formally brought the VC dimension into the PAC framework, proving that a concept class is PAC-learnable if and only if its VC dimension is finite, and gave a sample-complexity bound depending on the VC dimension. It is the direct source of the generalization-bound formula in the main text, nailing down the criterion that "finite capacity makes it learnable."
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth (1987). "Occam's Razor." Information Processing Letters, 24(6), 377-380. [2] This short paper gives a learning-theoretic version of "Occam's razor": a hypothesis that can compress the training data short enough will, with high probability, generalize. It turns "simplicity makes it learnable" from a philosophical maxim into a provable proposition, providing a clean footnote to this chapter's motif of "compressing the unknown."
D. McAllester (1999). "PAC-Bayesian Model Averaging." Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT), 164-170. [2] McAllester here proposed the PAC-Bayes bound: it gives a generalization guarantee for a posterior distribution over a family of hypotheses, with the penalty measured by the KL divergence between posterior and prior. It is a refinement of the PAC bound, which is what the main text means in calling it the "refinement" of the first move, and it often gives tighter results than the classical VC bound.
B. Pierce (2002). Types and Programming Languages. MIT Press. [2] This textbook of Pierce's systematically explains the theory and construction of type systems, centered on the two properties of type safety, "progress" and "preservation," and on how they are mechanically checked. It is precisely the standard basis for the main text's line that "a type system does not prove a program all correct, only one property, in exchange for something decidable and recheckable," and is the introductory book for understanding the form the certificate move takes in software.
W. Howard (1980). "The Formulae-as-Types Notion of Construction." In J. Seldin and J. Hindley (eds.), To H. B. Curry: Essays on Combinatory Logic, Lambda Calculus and Formalism, 479-490. Academic Press. [2] Howard's widely circulated manuscript (written in 1969, formally published in 1980) established the correspondence "formulae as types, proofs as programs": the propositions of intuitionistic logic correspond one-to-one with types, and proofs with terms. It is the classic text of the Curry-Howard correspondence, setting an equals sign between "checking the type of a program" and "checking a proof," which is exactly the logical core of the certificate move.
G. Necula (1997). "Proof-Carrying Code." Conference Record of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 106-119. [2][4] Necula proposed "proof-carrying code": an untrusted program carries its own formal proof of its safety, and the host need only quickly verify this certificate, without trusting the code's origin or re-deriving anything. It pushes the verification asymmetry of Chapter 2 to its limit, and is the purest engineering embodiment of this chapter's certificate concept.
X. Leroy (2009). "Formal Verification of a Realistic Compiler." Communications of the ACM, 52(7), 107-115. [2][3] Leroy reported the results of CompCert: a C compiler formally verified in Coq, whose generated code being semantically consistent with the source program is something machine-proven. It shows that "real software with recheckable guarantees" is no fantasy, and is the landmark case of industrializing the certificate move.
L. de Moura and N. Bjørner (2008). "Z3: An Efficient SMT Solver." Tools and Algorithms for the Construction and Analysis of Systems (TACAS), LNCS 4963, 337-340. Springer. [2][4] This paper introduced Z3, an efficient SMT solver that can decide the satisfiability of logical formulas with theories such as arithmetic and arrays, and is widely used in program verification and symbolic execution. It made "automatically producing recheckable certificates" into a ready-to-hand industrial tool, and is one of the engines by which the certificate move spread in practice.
N. Higham (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). SIAM. [2] This authoritative work of Higham's systematically treats the error analysis of numerical algorithms, especially backward error analysis: rather than asking "how far the answer departs from the true value," ask "of exactly which perturbed problem this answer is the exact solution." It is the standard reference for the main text's "error bound" move, teaching how to equip floating-point computation with a trustworthy guarantee.
R. Moore (1966). Interval Analysis. Prentice-Hall. [2] This pioneering work of Moore's established interval arithmetic: each quantity participates in computation carrying an interval guaranteed to contain the true value, so the output is not a number that may deceive but a guaranteed range. It gave numerical computation the idea of "delivering a bound rather than a point," which is exactly the embodiment of the certificate move in that field.
C. Shannon (1948). "A Mathematical Theory of Communication." Bell System Technical Journal, 27(3), 379-423; 27(4), 623-656. [1][2] This paper of Shannon's, founding information theory, measures uncertainty with entropy and gives the fundamental limits of source coding and channel capacity. It is the ultimate source of the notion that "information has a cost and can be measured," which the main text calls the "underlying currency" of optimal screening; this chapter's entire discussion of information gain and compression takes it as its unit.
J. Rissanen (1978). "Modeling by Shortest Data Description." Automatica, 14(5), 465-471. [2] Rissanen proposed the minimum description length (MDL) principle: the best model is the one that encodes the data together with the model itself most shortly. It turns "compression is understanding" into an operable model-selection criterion, resonating precisely with this chapter's motif of "compressing the unknown," and is also an information-theoretic realization of Occam's razor.
M. Li and P. Vitányi (2008). An Introduction to Kolmogorov Complexity and Its Applications (3rd ed.). Springer. [2] This standard textbook systematically treats Kolmogorov complexity: the complexity of an object equals the length of the shortest program that can generate it. This is the uncomputable ideal of "compression," contrasted with the operable approximation that is MDL; it provides the theoretical ceiling for this chapter's compression motif, showing that optimal compression is itself a kind of unverifiable limit.
W. Hoeffding (1963). "Probability Inequalities for Sums of Bounded Random Variables." Journal of the American Statistical Association, 58(301), 13-30. [2] Hoeffding here gives an exponential upper bound on the probability that a sum of bounded random variables deviates from its mean. This inequality is the basic tool for compressing the gap between "the empirical average" and "the true expectation" into a confidence bound, which the main text calls the "probabilistic engine" of generalization bounds; most of the certificate move's concentration arguments begin from it.
E. Candès, J. Romberg, and T. Tao (2006). "Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information." IEEE Transactions on Information Theory, 52(2), 489-509. [2] This founding paper of compressed sensing proves that as long as a signal is sparse enough, it can be reconstructed exactly through convex optimization from far fewer measurements than the classical sampling theorem requires. It is a mathematical exemplar of "spending checks on the cutting edge, wringing all the information from very few observations," echoing this chapter's two threads of compression and optimal screening.
R. Fisher (1935). The Design of Experiments. Oliver and Boyd. [1][3] This classic of Fisher's established the basic principles of modern experimental design: randomization, replication, and blocking, along with the famous "lady tasting tea" thought experiment. It teaches how to wring the most credible information from the fewest trials, and is the source reading for the optimal-screening move in statistics and science.
G. Box, W. Hunter, and J. Hunter (1978). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. John Wiley & Sons. [3][4] This popular practical work by Box and colleagues explains experimental design, data analysis, and model building to the people who actually run experiments, stressing factorial design and the iterative rhythm of sequential learning. It brings Fisher's principles down to the engineering site, and is a practical guide to understanding "how to arrange the checking most economically."
D. Lindley (1956). "On a Measure of the Information Provided by an Experiment." The Annals of Mathematical Statistics, 27(4), 986-1005. [2] Lindley used information theory to give a measure of "how much information an experiment provides," namely the expected information gain between prior and posterior. This is precisely the theoretical prototype of the main text's optimal-screening objective $\arg\max_q I(\theta;y_q)$, turning "which experiment to run" into a maximizable quantity.
K. Chaloner and I. Verdinelli (1995). "Bayesian Experimental Design: A Review." Statistical Science, 10(3), 273-304. [2] This review systematically surveys Bayesian experimental design: with a utility function (often the expected information gain) as the objective, it uniformly derives various optimal-design criteria. It integrates Lindley's measure into a complete framework, and is the reader's entry point for quickly grasping the screening core of "maximizing expected information gain."
A. Wald (1945). "Sequential Tests of Statistical Hypotheses." The Annals of Mathematical Statistics, 16(2), 117-186. [1][2] Wald proposed the sequential probability ratio test: judge as the data come in, and once the evidence is strong enough, stop and draw a conclusion, thereby saving much on average over a fixed-sample test. It turns "whether to keep checking" itself into an optimal decision, and is the forerunner of the "use it as you check" branch of optimal screening.
W. Thompson (1933). "On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples." Biometrika, 25(3-4), 285-294. [1][2] Thompson here proposed the sampling method later named after him: select an option at random according to the posterior probability that "it really is the best," naturally balancing exploration and exploitation. This is one of the earliest solutions to the multi-armed bandit problem, and remains both a simple and a strong strategy for it to this day.
H. Robbins (1952). "Some Aspects of the Sequential Design of Experiments." Bulletin of the American Mathematical Society, 58(5), 527-535. [1][2] This paper of Robbins's formally established the multi-armed bandit problem as a mathematical object and proposed the earliest sequential allocation strategy, founding the research direction of "how to trade off exploration and exploitation." It is the starting point of the whole later bandit literature, and this chapter's discussion of "how many trials to spend reducing which uncertainty" begins here.
T. Lai and H. Robbins (1985). "Asymptotically Efficient Adaptive Allocation Rules." Advances in Applied Mathematics, 6(1), 4-22. [2] Lai and Robbins proved the regret lower bound for the multi-armed bandit: the cumulative regret of any reasonable strategy grows at least logarithmically over time, and they constructed asymptotically optimal allocation rules attaining this lower bound. It established that the main text's $O(\ln T)$ is an insurmountable limit, drawing a ceiling for the entire class of problems.
P. Auer, N. Cesa-Bianchi, and P. Fischer (2002). "Finite-time Analysis of the Multiarmed Bandit Problem." Machine Learning, 47(2-3), 235-256. [1][2] This paper gives simple algorithms such as UCB1, based on "optimism in the face of uncertainty," and proves they have logarithmic regret bounds in finite time (not merely asymptotically). It turns Lai and Robbins's asymptotic result into a concrete, directly usable, analyzable strategy, and is the standard citation for the "upper confidence bound" class of methods.
H. Kushner (1964). "A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise." Journal of Basic Engineering, 86(1), 97-106. [1][2] This early paper of Kushner's characterizes an unknown noisy curve with a probabilistic model, and on that basis selects the next sampling point to seek the maximum, an early form of the Bayesian-optimization idea. It shows that "where to measure once more" can be treated as an optimal decision, and is a pioneering work of optimal screening in global optimization.
J. Mockus, V. Tiesis, and A. Žilinskas (1978). "The Application of Bayesian Methods for Seeking the Extremum." In L. Dixon and G. Szegő (eds.), Towards Global Optimization 2, 117-129. North-Holland. [2] Mockus and colleagues systematically developed Bayesian global optimization and proposed the acquisition function of "expected improvement": under a probabilistic model, select the point most likely to bring improvement for evaluation. It turned Kushner's intuition into a general method, and is the direct predecessor of today's Bayesian optimization.
D. Jones, M. Schonlau, and W. Welch (1998). "Efficient Global Optimization of Expensive Black-Box Functions." Journal of Global Optimization, 13(4), 455-492. [2][4] This paper proposed the EGO algorithm, which uses a Gaussian process to build a surrogate model for an expensive black-box function and then picks the next evaluation point by the expected-improvement criterion, drastically reducing the number of evaluations. It is the landmark work generalizing Bayesian optimization, but the main text also cautions: it is only one implementation within the optimal-screening family of methods, not the whole of it.
C. Rasmussen and C. Williams (2006). Gaussian Processes for Machine Learning. MIT Press. [2][4] This standard textbook systematically treats Gaussian processes: a nonparametric Bayesian method that places a prior on functions themselves and can give predictive uncertainty. It is the theoretical foundation of the surrogate model on which Bayesian optimization depends, and is the core reference for readers who want to understand "how uncertainty is modeled and used to decide where to check next."
N. Srinivas, A. Krause, S. Kakade, and M. Seeger (2010). "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design." Proceedings of the 27th International Conference on Machine Learning (ICML), 1015-1022. [2] Srinivas and colleagues proposed the GP-UCB algorithm, carrying the "upper confidence bound" idea from the bandit over to Gaussian-process optimization, and gave provable bounds for its regret. It stitches Bayesian optimization together with bandit theory, demonstrating precisely that this chapter's two ends, proving a bound and spending the checking, were one lever all along.
D. Cohn, Z. Ghahramani, and M. Jordan (1996). "Active Learning with Statistical Models." Journal of Artificial Intelligence Research, 4, 129-145. [2][4] Cohn and colleagues give a statistical framework for active learning: under a statistical model, select for labeling the sample that most reduces the model's variance (or predictive uncertainty). It turns "which sample is most worth labeling next" into a computable criterion, and is the representative work establishing active learning as a branch of optimal screening.
B. Settles (2009). Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison. [2][4] This widely cited survey of Settles's systematically organizes the various query strategies of active learning, such as uncertainty sampling, query by committee, and expected error reduction, and compares their applicable settings. It is the standard entry point for a quick overview of "how to spend the labeling budget," setting the various implementations of this move side by side for comparison.
B. Miller, L. Fredriksen, and B. So (1990). "An Empirical Study of the Reliability of UNIX Utilities." Communications of the ACM, 33(12), 32-44. [3][4] Miller and colleagues fed randomly generated input into various UNIX utilities, and a fair number of programs crashed or hung as a result, which is the pioneering experiment of fuzzing. It shows that "throwing compute at random or suspect inputs to shake out a fault" is a cheap and effective way of checking, and is the starting point of optimal screening in software testing.
B. Shahriari, K. Swersky, Z. Wang, R. Adams, and N. de Freitas (2016). "Taking the Human Out of the Loop: A Review of Bayesian Optimization." Proceedings of the IEEE, 104(1), 148-175. [2][4] This review comprehensively surveys Gaussian-process-based Bayesian optimization: the surrogate model, the acquisition function, and their applications in settings such as hyperparameter tuning. It is the standard reading for understanding the current state of the field, but the main text uses it to remind readers not to shrink the universal posture of "optimal screening" down to the single tool of "Gaussian processes."

← 8. The Organization Blind to Itself10. Borrowed Judgment →

Chapter 9: Compressing the Unknown

Certificate: Prove a Bound on a Slice

Optimal Screening: Spend Your Checks Where They Count

Why the Two Moves Pair, and Where They Lead

References

Open with WeChat