Pointers Might Not Be Ideal for Parameters

Published at: 2020-11-05 | Reading: 16 min

We know that passing parameters by pointer avoids copying data, which can benefit performance. Nevertheless, there are always some edge cases we might need to be concerned about. Let’s take this as an example: // vec.go type vec struct { x, y, z, w float64 } func (v vec) addv(u vec) vec { return vec{v.x + u.

Trading Space for Time

Is there any way to make these two functions run faster?

// linear2sRGB is an sRGB encoder
func linear2sRGB(v float64) float64 {
	if v <= 0.0031308 {
		v *= 12.92
	} else {
		v = 1.055*math.Pow(v, 1/2.4) - 0.055
	}
	return v
}

// sRGB2linear is an sRGB decoder
func sRGB2linear(v float64) float64 {
	if v <= 0.04045 {
		v /= 12.92
	} else {
		v = math.Pow((v+0.055)/1.055, 2.4)
	}
	return v
}

Here’s a straightforward optimization, a lookup table combined with linear interpolation:

// Linear2sRGB converts linear inputs to sRGB space.
func Linear2sRGB(v float64) float64 {
	i := v * lutSize
	ifloor := int(i) & (lutSize - 1)
	v0 := lin2sRGBLUT[ifloor]
	v1 := lin2sRGBLUT[ifloor+1]
	i -= float64(ifloor)
	return v0*(1.0-i) + v1*i
}

func linear2sRGB(v float64) float64 {
	if v <= 0.0031308 {
		v *= 12.92
	} else {
		v = 1.055*math.Pow(v, 1/2.4) - 0.055
	}
	return v
}

const lutSize = 1024 // keep a power of 2

var lin2sRGBLUT [lutSize + 1]float64

func init() {
	for i := range lin2sRGBLUT[:lutSize] {
		lin2sRGBLUT[i] = linear2sRGB(float64(i) / lutSize)
	}
	lin2sRGBLUT[lutSize] = lin2sRGBLUT[lutSize-1]
}

func BenchmarkLinear2sRGB(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for j := 0.0; j <= 1.0; j += 0.01 {
			convert.Linear2sRGB(j)
		}
	}
}

The benchmark shows that the lookup-table version cuts the run time by about 98%.

name           old time/op  new time/op  delta
Linear2sRGB-6  6.38µs ± 0%  0.14µs ± 0%  -97.87%  (p=0.000 n=10+8)

idea 2020-11-04 00:00:00

Pass by Value vs. Pass by Pointer

Guess which add implementation has better performance: the value-receiver addv or the pointer-receiver addp?

type vec struct {
	x, y, z, w float64
}

func (v vec) addv(u vec) vec {
	return vec{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
}

func (v *vec) addp(u *vec) *vec {
	v.x, v.y, v.z, v.w = v.x+u.x, v.y+u.y, v.z+u.z, v.w+u.w
	return v
}

func BenchmarkVec(b *testing.B) {
	b.Run("addv", func(b *testing.B) {
		v1 := vec{1, 2, 3, 4}
		v2 := vec{4, 5, 6, 7}
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.addv(v2)
			} else {
				v2 = v2.addv(v1)
			}
		}
	})
	b.Run("addp", func(b *testing.B) {
		v1 := &vec{1, 2, 3, 4}
		v2 := &vec{4, 5, 6, 7}
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.addp(v2)
			} else {
				v2 = v2.addp(v1)
			}
		}
	})
}

The answer is that pass-by-value is faster. The reason is inlining, not escape analysis as many might guess. The pointer implementation returns a pointer solely to support method chaining; the returned pointer is already on the stack, so nothing escapes. Test results:

$ perflock -governor 80% go test -v -run=none -bench=. -count=10 | tee new.txt
$ benchstat new.txt

name         time/op
Vec/addv-16  0.25ns ± 2%
Vec/addp-16  2.20ns ± 0%

name         alloc/op
Vec/addv-16   0.00B
Vec/addp-16   0.00B

name         allocs/op
Vec/addv-16    0.00
Vec/addp-16    0.00

A practical example: changing from pass-by-pointer to pass-by-value brought a 6–8% performance improvement in a simple rasterizer (see https://github.com/changkun/ddd/commit/60fba104c574f54e11ffaedba7eaa91c8401bce4).

Furthermore, we might ask: is pass-by-value still faster without inlining? We can try adding the //go:noinline compiler directive to both add methods. The results without inlining (old) compared with inlining (new) are:

$ perflock -governor 80% go test -v -run=none -bench=. -count=10 | tee old.txt
$ benchstat old.txt new.txt
name         old time/op    new time/op    delta
Vec/addv-16    4.99ns ± 1%    0.25ns ± 2%  -95.05%  (p=0.000 n=9+10)
Vec/addp-16    3.35ns ± 1%    2.20ns ± 0%  -34.37%  (p=0.000 n=10+8)

So the next question is: without inlining, why is the pointer version faster? Read more at https://changkun.de/blog/posts/pointers-might-not-be-ideal-for-parameters/

idea 2020-11-03 00:00:00

A Timer Optimization

In Go 1.14, time.Timer was optimized from a global heap to per-P heaps, and during task switching in the scheduling loop, each P was solely responsible for checking and running timers that could be woken up. However, in that implementation, the work-stealing process didn’t check timer heaps on Ps that were currently executing (bound to an M) — meaning if a P found itself with nothing to do, even if timers on other Ps needed to be woken up, the idle P would still go to sleep. Fortunately, this was fixed in 1.15. But was that the end of it?

Unfortunately, the per-P heap approach fundamentally still relies on async preemption to force-switch goroutines that monopolize an M, so that timers can always be scheduled within a bounded time. But what is the upper bound? In other words, how high is the wake-up latency of time.Timer?

Obviously, the current async preemption implementation relies on the system monitor (sysmon), whose wake-up period is on the order of 10–20 milliseconds, meaning in the worst case, services with strict real-time requirements (such as live streaming) would be seriously affected.

In the upcoming 1.16, a new fix reduces this tens-of-milliseconds latency directly to the microsecond level — very exciting. The benchmark below shows how to systematically quantify timer latency using average and worst-case latency metrics, along with a comparison of the improved timer latency against results from 1.14 and 1.15.

// Benchmark timer latency when the thread that creates the timer is busy with
// other work and the timers must be serviced by other threads.
// https://golang.org/issue/38860
//
//                                        go14.time.bench  go15.time.bench  fix.time.bench
// ParallelTimerLatency-8 \ avg-late-ns      17.3M ± 3%        7.9M ± 0%       0.2M ± 3%
// ParallelTimerLatency-8 \ max-late-ns      18.3M ± 1%        8.2M ± 0%       0.5M ±12%
func BenchmarkParallelTimerLatency(b *testing.B) {
	// allocate memory now to avoid GC interference later.
	timerCount := runtime.GOMAXPROCS(0) - 1
	stats := make([]struct {
		sum   float64
		max   time.Duration
		count int64
		_     [5]int64 // cache line padding
	}, timerCount)
	... // environment guarantees are omitted here

	b.ResetTimer()

	const delay = time.Millisecond
	var wg sync.WaitGroup
	var count int32
	for i := 0; i < b.N; i++ {
		wg.Add(timerCount)
		atomic.StoreInt32(&count, 0)
		for j := 0; j < timerCount; j++ {
			j := j
			expectedWakeup := time.Now().Add(delay)
			time.AfterFunc(delay, func() {
				late := time.Since(expectedWakeup) // actual wakeup time
				stats[j].count++
				stats[j].sum += float64(late.Nanoseconds())
				if late > stats[j].max {
					stats[j].max = late
				}
				atomic.AddInt32(&count, 1)
				for atomic.LoadInt32(&count) < int32(timerCount) { // wait other timers
				}
				wg.Done()
			})
		}

		// spin until all timers fired
		for atomic.LoadInt32(&count) < int32(timerCount) {
		}
		wg.Wait()

		// do work: spin a bit to let other threads go idle before the next round
		now := time.Now()
		for time.Since(now) < time.Millisecond {
		}
	}
	var total float64
	var samples float64
	max := time.Duration(0)
	for _, s := range stats {
		if s.max > max {
			max = s.max
		}
		total += s.sum
		samples += float64(s.count)
	}
	b.ReportMetric(0, "ns/op")
	b.ReportMetric(total/samples, "avg-late-ns")
	b.ReportMetric(float64(max.Nanoseconds()), "max-late-ns")
}

idea 2020-11-02 00:00:00

Operator Precedence

Let’s talk about the history of operator precedence design in C. In a reminiscence email (https://www.lysator.liu.se/c/dmr-on-or.html), Dennis Ritchie, the father of C, recalled why some operator precedences in today’s C are “wrong” (e.g., both & and && have lower precedence than ==, whereas in Go, & is higher than ==).

From a type system perspective, the final result of an expression in an if or while condition is a boolean. The bitwise operator & takes numeric inputs and produces a numeric output, while == takes two numeric values and produces a boolean; therefore & should have higher precedence than ==. For the same reason, == should be higher than &&.

However, early C didn’t distinguish between & and && or between | and ||; there were only & and |. At that time, & was interpreted as a logical operator in if and while statements, and as a bitwise operator in other expressions. So &, which could act as a logical operator, was given lower precedence than ==: for example, if(a==b & c==d) evaluates both == comparisons first and then applies &.

Later, when && was introduced as a logical operator to split this ambiguous behavior, C already had a user base. Even though raising the precedence of & above == would have been better, such a change was no longer possible: it would silently change the behavior of existing code (a==b & c==d would first compute b&c and then compare the result with a and d using ==). The only option was to place && below & in precedence, without correcting &. (Go, as a successor, could easily make the right design, since the distinction between & and && was well established by then.) But has Go’s design always been flawless? There’s a recent counterexample.

In the upcoming Go 1.16, there’s a similar “historical episode”: after io/fs was introduced, the restructured os package gained a new File.ReadDir method whose functionality is nearly identical to the existing File.Readdir (note the capitalization difference). Having methods so similar in both functionality and name seems to contradict Go’s design philosophy of orthogonal features. Removing the old File.Readdir would make it clearer to users which API to use, but this runs into the same dilemma C once faced: for the sake of compatibility guarantees, breaking changes are unacceptable. Both methods were ultimately kept.
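
Returning to the precedence question that opened this note, the difference between the two languages can be made concrete with a few lines of Go. The sketch below evaluates the token sequence x & a == b once the way Go parses it, and once the way C would parse it. The helper names goParse, cParse, and btoi are hypothetical; cParse only emulates C's grouping, it is not C code.

```go
package main

import "fmt"

// btoi mirrors C, where a comparison yields the integer 0 or 1.
func btoi(b bool) int {
	if b {
		return 1
	}
	return 0
}

// goParse evaluates x & a == b with Go's precedence: (x & a) == b.
func goParse(x, a, b int) bool {
	return x&a == b
}

// cParse emulates C's precedence for the same tokens: x & (a == b),
// where any nonzero result counts as true.
func cParse(x, a, b int) bool {
	return x&btoi(a == b) != 0
}

func main() {
	fmt.Println(goParse(6, 2, 2)) // true:  (6 & 2) == 2
	fmt.Println(cParse(6, 2, 2))  // false: 6 & (2 == 2) is 6 & 1, which is 0
}
```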

idea 2020-11-01 00:00:00

Nesting Issues with t.Cleanup

Back in Go 1.14, the testing package introduced a t.Cleanup method that allows registering multiple callback functions in test code, which are executed in reverse order of registration after the test completes. Looking at its implementation, can you register another Cleanup callback nested inside a Cleanup callback? As of Go 1.15, you cannot.

package testing

// T is a type passed to Test functions to manage test state.
type T struct {
	mu          sync.RWMutex
	cleanup     func() // optional function to be called at the end of the test
	...
}

// Cleanup registers a function to be called when the test and all its
// subtests complete. Cleanup functions will be called in last added,
// first called order.
func (t *T) Cleanup(f func()) {
	t.mu.Lock()
	defer t.mu.Unlock()
	oldCleanup := t.cleanup
	t.cleanup = func() {
		if oldCleanup != nil {
			defer func() {
				...
				oldCleanup()
			}()
		}
		...
		f()
	}
	...
}

// runCleanup is called at the end of the test
func (t *T) runCleanup(ph panicHandling) (panicVal interface{}) {
	t.mu.Lock()
	cleanup := t.cleanup
	t.cleanup = nil
	t.mu.Unlock()
	if cleanup == nil {
		return nil
	}
	...

	cleanup()
	return nil
}

idea 2020-10-31 00:00:00

A First Look at io/fs

In the upcoming Go 1.16, we will be able to embed resource files directly into compiled binaries. How is it implemented? What is the representation of embedded files? Starting from a broader abstraction, we need an in-memory file system. This further inspires us to think about file system abstraction: what are the minimum requirements for a file system? What operations must a file carry? All the answers to these questions are condensed here.

io/fs.FS:

package fs

// An FS provides access to a hierarchical file system.
// The FS interface is the minimum implementation required of the file system.
type FS interface {
    // Open opens the named file.
    Open(name string) (File, error)
}

// A File provides access to a single file.
// The File interface is the minimum implementation required of the file.
type File interface {
    Stat() (FileInfo, error)
    Read([]byte) (int, error)
    Close() error
}

embed.FS:

package embed

// An FS is a read-only collection of files, usually initialized with a
// //go:embed directive.
//
// FS implements fs.FS, so it can be used with any package that understands
// file system interfaces, including net/http, text/template, and html/template.
type FS struct {
    // The files list is sorted by name but not by simple string comparison.
    files *[]file
}

// Open opens the named file for reading and returns it as an fs.File.
func (f FS) Open(name string) (fs.File, error) {
    file := f.lookup(name) // returns the named file, or nil if it is not present.
    if file == nil || file.IsDir() {
        ...
    }
    return &openFile{file, 0}, nil
}

// An openFile is a regular file open for reading.
type openFile struct {
    f      *file // the file itself
    offset int64 // current read offset
}

func (f *openFile) Close() error               { return nil }
func (f *openFile) Stat() (fs.FileInfo, error) { return f.f, nil }
func (f *openFile) Read(b []byte) (int, error) {
    if f.offset >= int64(len(f.f.data)) {
        return 0, io.EOF
    }
    ...
    n := copy(b, f.f.data[f.offset:])
    f.offset += int64(n)
    return n, nil
}

// A file is a single file in the FS.
type file struct {
    name string
    data string
    hash [16]byte
}

idea 2020-10-30 00:00:00

Getting Goroutine ID

Possibly the fastest way to get a goroutine ID that works across all Go versions covered by the Go 1 compatibility guarantee.

// Get returns the ID of current goroutine.
//
// This implementation is based on the fact that
// runtime.Stack gives information like:
//
//   goroutine 18446744073709551615 [running]:
//   github.com/changkun/goid.Get...
//
// This format has been stable for more than 10 years.
// Since commit 4dfd7fdde5957e4f3ba1a0285333f7c807c28f03,
// a goroutine ID ends with a white space.
//
// The Go 1 compatibility promise guarantees that all
// versions of Go can use this function.
func Get() (id uint64) {
	var buf [30]byte
	runtime.Stack(buf[:], false)
	for i := 10; buf[i] != ' '; i++ {
		id = id*10 + uint64(buf[i]&15)
	}
	return id
}

idea 2020-10-19 00:00:00

Benchmark Testing: The Extras

Many people have written benchmarks. In Go Nightly Reading Episode 83, Reliable Performance Testing for Go Programs (https://talkgo.org/t/topic/102), we shared how to use tools like benchstat and perflock for rigorous and reliable performance testing. That session briefly discussed the measurement methodology and implementation principles of benchmarks, but because there was a lot of material and limited time, the coverage wasn’t deep enough. So today, let’s share two further details that weren’t covered in Episode 83 but are easily overlooked in certain strict testing scenarios.

First, when running a benchmark, the code under test is usually executed more than b.N times. As discussed previously, the testing package runs the code in multiple rounds, gradually predicting how many times it can be executed consecutively within the required time range (e.g., 1 second, resulting in, say, 100,000 iterations). But there’s an implementation detail: why doesn’t it accumulate execution times across the rounds until t1+t2+…+tn ≈ 1s, and instead searches for the largest b.N whose total loop time ≈ 1s? The reason is that incremental runs introduce more systematic measurement error. Benchmarks are typically unstable in their early iterations (e.g., because of cache misses), and accumulating the results of multiple incremental runs would further amplify this error. In contrast, finding the largest b.N whose consecutive execution time fits the required range amortizes (rather than accumulates) this systematic error over each run.

Second, does this mean the testing package’s implementation is perfect, and all we need to do as users is write benchmarks, run them under perflock, and use benchstat to eliminate statistical errors? Things aren’t that simple, because the testing package’s measurement code itself also has systematic error, which in extreme scenarios can significantly bias the results. Explaining this requires more space, so here’s an additional article for further reading: Eliminating A Source of Measurement Errors in Benchmarks (https://github.com/golang-design/research/blob/master/bench-time.md). In that article, you can learn more about what this intrinsic systematic measurement error is, and about several reliable approaches to eliminate it when you need to benchmark such scenarios.

idea 2020-10-01 00:00:00

Hello

Hello world!
