Benchmark Testing: The Extras

Many people have written benchmarks. In Go Nightly Reading Episode 83, Reliable Performance Testing for Go Programs, we showed how to use tools such as benchstat and perflock for rigorous, reliable performance testing. That session briefly covered how benchmarks take their measurements and how they are implemented, but there was too much material and too little time to go into the principles in depth. So today we share two details that Episode 83 did not cover and that are easy to overlook in strict testing scenarios:

When a benchmark runs, the code under test is usually executed more than b.N times in total. As discussed previously, the testing package runs the code under test several times, gradually predicting how many consecutive iterations (say, 100,000) fit within the required time range (for example, 1 second). But there is an implementation detail worth asking about: why does it search for the largest b.N whose b.N-iteration loop takes ≈ 1s in total, rather than accumulating the execution times of successive runs until t1+t2+…+tn ≈ 1s? The reason is that incremental runs introduce more systematic measurement error. A benchmark is typically unstable in its early iterations (for example, because of cache misses), and accumulating the results of several incremental runs would amplify this error. By contrast, a single consecutive run at the largest b.N that satisfies the required range amortizes this systematic error across the iterations of each test instead of accumulating it.
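
The prediction rounds described above can be observed directly. The following is a minimal sketch, not the testing package's own code: it drives a trivial benchmark through testing.Benchmark and records every b.N the package tries. The helper name countRounds is hypothetical, and the exact sequence of b.N values depends on the machine and the Go version, so no output is shown.

```go
package main

import (
	"fmt"
	"testing"
)

// countRounds (a hypothetical helper) runs a trivial benchmark and returns
// the b.N value of every round plus the total number of loop iterations
// executed across all rounds.
func countRounds() (rounds []int, total int) {
	testing.Benchmark(func(b *testing.B) {
		// The benchmark function is invoked once per prediction round,
		// each time with a different b.N.
		rounds = append(rounds, b.N)
		for i := 0; i < b.N; i++ {
			total++ // stand-in for the code under test
		}
	})
	return rounds, total
}

func main() {
	rounds, total := countRounds()
	final := rounds[len(rounds)-1]
	fmt.Println("b.N per round:   ", rounds)
	fmt.Println("final b.N:       ", final)
	fmt.Println("total executions:", total)
}
```

Only the last round contributes to the reported result, yet the code under test has also run in every earlier round, which is why the total exceeds the final b.N.
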
Does this mean the testing package’s implementation is perfect, so that all we need to do as users is write the benchmark, run it under perflock, and use benchstat to eliminate statistical error? It is not that simple: the testing package’s measurement code has a systematic error of its own, and in extreme scenarios this error can significantly bias the results. Explaining it properly takes more space, so here is an additional article for further reading: Eliminating A Source of Measurement Errors in Benchmarks. It explains what this intrinsic systematic measurement error is and describes several reliable ways to eliminate it when you need to benchmark such scenarios.
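
As a rough illustration of the general idea that measuring code has a cost of its own, the sketch below times a region that contains nothing at all. This is an assumption chosen for illustration and not necessarily the error source the linked article analyzes; the helper name clockOverhead is hypothetical, and the value it reports varies by platform and clock resolution.

```go
package main

import (
	"fmt"
	"time"
)

// clockOverhead (a hypothetical helper) reports the average duration that an
// empty timed region appears to take, using a time.Now/time.Since pair
// repeated `rounds` times. Whatever it returns is cost added by the act of
// measuring, not by any code under test.
func clockOverhead(rounds int) time.Duration {
	if rounds <= 0 {
		return 0
	}
	var sum time.Duration
	for i := 0; i < rounds; i++ {
		start := time.Now()
		sum += time.Since(start) // nothing runs between the two clock reads
	}
	return sum / time.Duration(rounds)
}

func main() {
	fmt.Println("apparent cost of an empty timed region:", clockOverhead(1_000_000))
}
```

When the timer is started once around a long b.N loop, a cost of this size is spread over many iterations. It becomes noticeable mainly when the operation being measured takes about as long as the clock reads themselves.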

Links:
Go Nightly Reading Episode 83, Reliable Performance Testing for Go Programs: https://talkgo.org/t/topic/102
Eliminating A Source of Measurement Errors in Benchmarks: https://github.com/golang-design/research/blob/master/bench-time.md