Wrote a new tool called bench, which integrates and wraps best practices for benchmark testing.
Sharing and recording scattered thoughts and writings.
Is there any way to make these two functions run faster?
Here’s a straightforward optimization: lookup table + linear interpolation:
The benchmark shows an approximately 98% runtime performance improvement after the optimization.
Guess which add implementation has better performance, vec1 or vec2?
The answer is that pass-by-value is faster, and the reason is inlining optimization, not escape analysis as many might guess. Although the pointer implementation returns a pointer, it does so only to support method chaining; the returned pointer refers to a value that is already on the stack, so there is no escape. Test results:
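The post's vec1 and vec2 definitions are not shown, so this is a sketch of what the two compared implementations presumably look like (the field names and 4-component layout are assumptions): vec1.add passes and returns by value, while vec2.add takes pointers and returns the receiver so calls can be chained.

```go
package main

import "fmt"

// Assumed 4-component vectors; the post's actual definitions are not shown.
type vec1 struct{ x, y, z, w float64 }
type vec2 struct{ x, y, z, w float64 }

// Pass-by-value: copies the receiver and argument, returns a new value.
// Small enough for the compiler to inline and optimize aggressively.
func (v vec1) add(u vec1) vec1 {
	return vec1{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
}

// Pass-by-pointer: mutates the receiver in place and returns the same
// pointer purely so calls can be chained; the receiver already lives on
// the caller's stack, so returning it does not cause a heap escape.
func (v *vec2) add(u *vec2) *vec2 {
	v.x += u.x
	v.y += u.y
	v.z += u.z
	v.w += u.w
	return v
}

func main() {
	a := vec1{1, 2, 3, 4}
	fmt.Println(a.add(vec1{4, 3, 2, 1})) // {5 5 5 5}

	b := &vec2{1, 2, 3, 4}
	b.add(&vec2{1, 1, 1, 1}).add(&vec2{1, 1, 1, 1}) // method chaining
	fmt.Println(*b)                                 // {3 4 5 6}
}
```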
A practical example: changing from pass-by-pointer to pass-by-value brought a 6–8% performance improvement in a simple rasterizer (see https://github.com/changkun/ddd/commit/60fba104c574f54e11ffaedba7eaa91c8401bce4).
Furthermore, we might ask: is pass-by-value still faster without inlining? We can try adding the //go:noinline compiler directive to both add methods. The results without inlining (old) compared with inlining (new) are:
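A sketch of how the experiment can be set up (the file layout, benchmark name, and output file names below are illustrative, not the post's actual ones): add the directive above each add method, record the -bench output, remove the directive, rerun, and compare the two runs with benchstat.

```go
package main

import "fmt"

type vec struct{ x, y, z, w float64 }

// Inlinable by default: small leaf methods like this are inlined.
func (v vec) add(u vec) vec {
	return vec{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
}

// Same body with inlining disabled. The comparison workflow is roughly:
//   go test -run=none -bench=Add -count=10 > old.txt  // with //go:noinline
//   go test -run=none -bench=Add -count=10 > new.txt  // directive removed
//   benchstat old.txt new.txt
//go:noinline
func (v vec) addNoinline(u vec) vec {
	return vec{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
}

func main() {
	a, b := vec{1, 2, 3, 4}, vec{4, 3, 2, 1}
	// The directive changes only performance, never the result.
	fmt.Println(a.add(b) == a.addNoinline(b)) // true
}
```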
So the next question is: without inlining, why is the pointer version faster? Read more at https://changkun.de/blog/posts/pointers-might-not-be-ideal-for-parameters/
In Go 1.14, time.Timer was optimized from a global heap to per-P heaps, and during task switching in the scheduling loop, each P was solely responsible for checking and running timers that could be woken up. However, in that implementation, the work-stealing process didn’t check timer heaps on Ps that were currently executing (bound to an M) — meaning if a P found itself with nothing to do, even if timers on other Ps needed to be woken up, the idle P would still go to sleep. Fortunately, this was fixed in 1.15. But was that the end of it?
Unfortunately, the per-P heap approach fundamentally still relies on async preemption to force-switch goroutines that monopolize an M, so that timers can always be scheduled within a bounded time. But what is the upper bound? In other words, how high is the wake-up latency of time.Timer?
Obviously, the current async preemption implementation relies on the system monitor (sysmon), whose wake-up period is on the order of 10–20 milliseconds, meaning in the worst case, services with strict real-time requirements (such as live streaming) would be seriously affected.
In the upcoming 1.16, a new fix reduces this tens-of-milliseconds latency directly to the microsecond level — very exciting. The benchmark below shows how to systematically quantify timer latency using average and worst-case latency metrics, along with a comparison of the improved timer latency against results from 1.14 and 1.15.
Let’s talk about the history of operator precedence design in C. In a reminiscence email (https://www.lysator.liu.se/c/dmr-on-or.html), Dennis Ritchie, the father of C, recalled why some operator precedences in today’s C are “wrong” (e.g., both & and && have lower precedence than ==, whereas in Go, & is higher than ==).
From a type system perspective, the final result of expressions involving operators in if/while contexts is a boolean. For the bitwise operator &, the input is numeric and the output is numeric, while == must accept two numeric values to produce a boolean — therefore & must have higher precedence than ==. Similarly, == must be higher than &&.
However, early C didn’t distinguish between & and && or | and || — there were only & and |. At that time, & was interpreted as a logical operator in if and while statements, and as bitwise in expressions. So &, which could be treated as a logical operator, was designed to have lower precedence than ==, e.g., if(a==b & c==d) would execute == first then &.
Later, when && was introduced as a logical operator to split this ambiguous behavior, C already had a user base. Even though raising the precedence of & above == would have been better, such a change was no longer possible — it would silently break existing code behavior (b&c would first compute some value, then compare with a and d using ==). The only option was to place &&’s precedence after &, without correcting & (obviously, Go as a successor could easily make the right design since the distinction between & and && was already well established). But has Go’s design always been flawless? There’s a recent counterexample.
In the upcoming Go 1.16, there’s a similar “historical episode”: after introducing io/fs, the restructured os package added a new File.ReadDir method whose functionality is nearly identical to the existing File.Readdir (note the capitalization difference). Having functions with such similar functionality and names seems to contradict Go’s design philosophy of orthogonal features. Removing the old File.Readdir would make it more intuitive for users, but this faces the same dilemma as C — for compatibility guarantees, any breaking change is unacceptable. Both methods were ultimately kept.
Back in Go 1.14, the testing package introduced a t.Cleanup method that allows registering multiple callback functions in test code, which are executed in reverse order of registration after the test completes. Looking at its implementation, can you register another Cleanup callback nested inside a Cleanup callback? As of Go 1.15, you cannot.
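To see why nesting fails, here is a tiny model of the Cleanup mechanism (an illustrative simplification, not the real testing package internals): the runner takes the registered list once before executing it, so anything registered during cleanup lands in a list that is never read.

```go
package main

import "fmt"

// T models only the Cleanup behavior of testing.T.
type T struct{ cleanups []func() }

// Cleanup registers f to run after the "test" finishes.
func (t *T) Cleanup(f func()) { t.cleanups = append(t.cleanups, f) }

// runCleanups snapshots the registered callbacks once and runs them in
// reverse registration order. Because of the snapshot, a callback
// registered *inside* another cleanup is never run, which models the
// Go 1.15 behavior the post describes.
func (t *T) runCleanups() {
	cs := t.cleanups
	t.cleanups = nil
	for i := len(cs) - 1; i >= 0; i-- {
		cs[i]()
	}
}

func main() {
	t := &T{}
	t.Cleanup(func() { fmt.Println("first registered, runs last") })
	t.Cleanup(func() {
		fmt.Println("second registered, runs first")
		t.Cleanup(func() { fmt.Println("nested: never runs in this model") })
	})
	t.runCleanups()
}
```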
In the upcoming Go 1.16, we will be able to embed resource files directly into compiled binaries. How is it implemented? What is the representation of embedded files? Starting from a broader abstraction, we need an in-memory file system. This further inspires us to think about file system abstraction: what are the minimum requirements for a file system? What operations must a file carry? All the answers to these questions are condensed here.
io/fs.FS:
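The minimum requirement turns out to be a single method: fs.FS is just Open(name string) (fs.File, error), and fs.File in turn only requires Stat, Read, and Close; everything richer (directory reading, globbing) is layered on as optional extension interfaces. A quick demonstration using testing/fstest's in-memory MapFS, which satisfies fs.FS (the file name and contents are made up):

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// An in-memory file system; fstest.MapFS implements fs.FS, so anything
// written against the fs.FS abstraction works on it unchanged.
var fsys fs.FS = fstest.MapFS{
	"hello.txt": &fstest.MapFile{Data: []byte("hello, fs")},
}

// read fetches a file's contents through the fs.FS abstraction only.
func read(name string) (string, error) {
	b, err := fs.ReadFile(fsys, name)
	return string(b), err
}

func main() {
	s, err := read("hello.txt")
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // hello, fs
}
```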
embed.FS:
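embed.FS stores files compiled into the binary and itself implements io/fs.FS, which is why the io/fs abstraction had to come first. A real project would embed assets such as templates or static files; to stay self-contained, this sketch embeds the package's own .go sources (the *.go pattern matches whatever source files sit next to it at build time):

```go
package main

import (
	"embed"
	"fmt"
	"io/fs"
)

// The //go:embed directive must directly precede the variable it fills.
//go:embed *.go
var src embed.FS

// embeddedGoFiles lists the embedded files via the io/fs API, since
// embed.FS is itself an fs.FS.
func embeddedGoFiles() ([]string, error) {
	return fs.Glob(src, "*.go")
}

func main() {
	files, err := embeddedGoFiles()
	if err != nil {
		panic(err)
	}
	fmt.Println("embedded:", files)
}
```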
Possibly the fastest implementation of getting the goroutine ID that works across all Go versions under the Go 1 compatibility guarantee.
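The implementation itself is not shown in this post. The runtime deliberately does not expose goroutine IDs, so the portable, Go-1-compatible baseline is to parse the header of runtime.Stack output, which begins with a line like "goroutine 18 [running]:". This is far slower than the version-specific fast paths the post refers to, but it illustrates what is being computed:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"strconv"
)

// goid returns the current goroutine's ID by parsing the first line of
// runtime.Stack output ("goroutine <id> [running]:").
func goid() int64 {
	var buf [64]byte
	n := runtime.Stack(buf[:], false)
	fields := bytes.Fields(buf[:n])
	id, err := strconv.ParseInt(string(fields[1]), 10, 64)
	if err != nil {
		panic(err)
	}
	return id
}

func main() {
	fmt.Println("current goroutine id:", goid())
	done := make(chan int64)
	go func() { done <- goid() }()
	fmt.Println("another goroutine id:", <-done)
}
```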
Many people have written benchmark tests. In Go Nightly Reading Episode 83, Reliable Performance Testing for Go Programs (https://talkgo.org/t/topic/102), we shared how to use tools like benchstat and perflock for rigorous and reliable performance testing. That session briefly discussed the measurement methodology and implementation principles of benchmarks, but due to time constraints, the coverage wasn’t deep enough. So today, let’s further share two details that weren’t covered in Episode 83 but are easily overlooked in certain strict testing scenarios:
1. b.N. As discussed previously, the testing package runs the code multiple times, gradually predicting how many times the code can be executed consecutively within the required time range (e.g., 1 second, resulting in, say, 100,000 iterations). But there’s an implementation detail: why doesn’t it incrementally accumulate execution times across multiple runs such that t1+t2+…+tn ≈ 1s, rather than searching for the maximum b.N where the total loop time ≈ 1s? The reason is that incremental runs introduce more systematic measurement error: benchmarks are typically unstable in early iterations (e.g., due to cache misses), and accumulating results from multiple incremental runs would further amplify this error. In contrast, finding the maximum b.N whose total consecutive execution time satisfies the required range amortizes (rather than accumulates) this systematic error across the test.
2. Does this mean the testing package’s implementation is perfect, and all we need to do as users is write benchmarks, run under perflock, and use benchstat to eliminate statistical errors? Things aren’t that simple, because the testing package’s measurement program itself also has systematic error, which in extreme scenarios can introduce significant bias. Explaining this requires more space, so here’s an additional article for further reading: Eliminating A Source of Measurement Errors in Benchmarks. In it you can learn what this intrinsic systematic measurement error is, and several reliable approaches to eliminate it when you need to benchmark such scenarios.
Hello world!