Wrote a new tool called bench, which integrates and wraps best practices for benchmark testing.
Sharing and recording scattered thoughts and writings.
Is there any way to make these two functions run faster?
Here’s a straightforward optimization: lookup table + linear interpolation:
The benchmark shows an approximately 98% runtime performance improvement after the optimization.
Guess which add implementation has better performance, vec1 or vec2?
The answer is that pass-by-value is faster, and the reason is inlining optimization, not escape analysis as many might guess. Although the pointer implementation returns a pointer, it does so only to support method chaining; the returned pointer refers to a value that is already on the stack, so there is no escape. Test results:
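The post's vec1 and vec2 definitions are not shown, so this is a sketch of what the two compared implementations presumably look like (the field names and 4-component layout are assumptions): vec1.add passes and returns by value, while vec2.add takes pointers and returns the receiver so calls can be chained.

```go
package main

import "fmt"

// Assumed 4-component vectors; the post's actual definitions are not shown.
type vec1 struct{ x, y, z, w float64 }
type vec2 struct{ x, y, z, w float64 }

// Pass-by-value: copies the receiver and argument, returns a new value.
// Small enough for the compiler to inline and optimize aggressively.
func (v vec1) add(u vec1) vec1 {
	return vec1{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
}

// Pass-by-pointer: mutates the receiver in place and returns the same
// pointer purely so calls can be chained; the receiver already lives on
// the caller's stack, so returning it does not cause a heap escape.
func (v *vec2) add(u *vec2) *vec2 {
	v.x += u.x
	v.y += u.y
	v.z += u.z
	v.w += u.w
	return v
}

func main() {
	a := vec1{1, 2, 3, 4}
	fmt.Println(a.add(vec1{4, 3, 2, 1})) // {5 5 5 5}

	b := &vec2{1, 2, 3, 4}
	b.add(&vec2{1, 1, 1, 1}).add(&vec2{1, 1, 1, 1}) // method chaining
	fmt.Println(*b)                                 // {3 4 5 6}
}
```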
A practical example: changing from pass-by-pointer to pass-by-value brought a 6–8% performance improvement in a simple rasterizer (see https://github.com/changkun/ddd/commit/60fba104c574f54e11ffaedba7eaa91c8401bce4).
Furthermore, we might ask: is pass-by-value still faster without inlining? We can try adding the //go:noinline compiler directive to both add methods. The results without inlining (old) compared with inlining (new) are:
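A sketch of how the experiment can be set up (the file layout, benchmark name, and output file names below are illustrative, not the post's actual ones): add the directive above each add method, record the -bench output, remove the directive, rerun, and compare the two runs with benchstat.

```go
package main

import "fmt"

type vec struct{ x, y, z, w float64 }

// Inlinable by default: small leaf methods like this are inlined.
func (v vec) add(u vec) vec {
	return vec{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
}

// Same body with inlining disabled. The comparison workflow is roughly:
//   go test -run=none -bench=Add -count=10 > old.txt  // with //go:noinline
//   go test -run=none -bench=Add -count=10 > new.txt  // directive removed
//   benchstat old.txt new.txt
//go:noinline
func (v vec) addNoinline(u vec) vec {
	return vec{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
}

func main() {
	a, b := vec{1, 2, 3, 4}, vec{4, 3, 2, 1}
	// The directive changes only performance, never the result.
	fmt.Println(a.add(b) == a.addNoinline(b)) // true
}
```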
So the next question is: without inlining, why is the pointer version faster? Read more at https://changkun.de/blog/posts/pointers-might-not-be-ideal-for-parameters/
In Go 1.14, time.Timer was optimized from a global heap to per-P heaps, and during task switching in the scheduling loop, each P was solely responsible for checking and running timers that could be woken up. However, in that implementation, the work-stealing process didn’t check timer heaps on Ps that were currently executing (bound to an M) — meaning if a P found itself with nothing to do, even if timers on other Ps needed to be woken up, the idle P would still go to sleep. Fortunately, this was fixed in 1.15. But was that the end of it?
Unfortunately, the per-P heap approach fundamentally still relies on async preemption to force-switch goroutines that monopolize an M, so that timers can always be scheduled within a bounded time. But what is the upper bound? In other words, how high is the wake-up latency of time.Timer?
Obviously, the current async preemption implementation relies on the system monitor (sysmon), whose wake-up period is on the order of 10–20 milliseconds, meaning in the worst case, services with strict real-time requirements (such as live streaming) would be seriously affected.
In the upcoming 1.16, a new fix reduces this tens-of-milliseconds latency directly to the microsecond level — very exciting. The benchmark below shows how to systematically quantify timer latency using average and worst-case latency metrics, along with a comparison of the improved timer latency against results from 1.14 and 1.15.
Let’s talk about the history of operator precedence design in C. In a reminiscence email (https://www.lysator.liu.se/c/dmr-on-or.html), Dennis Ritchie, the father of C, recalled why some operator precedences in today’s C are “wrong” (e.g., both & and && have lower precedence than ==, whereas in Go, & is higher than ==).
From a type system perspective, the final result of expressions involving operators in if/while contexts is a boolean. For the bitwise operator &, the input is numeric and the output is numeric, while == must accept two numeric values to produce a boolean — therefore & must have higher precedence than ==. Similarly, == must be higher than &&.
However, early C didn’t distinguish between & and && or | and || — there were only & and |. At that time, & was interpreted as a logical operator in if and while statements, and as bitwise in expressions. So &, which could be treated as a logical operator, was designed to have lower precedence than ==, e.g., if(a==b & c==d) would execute == first then &.
Later, when && was introduced as a logical operator to split this ambiguous behavior, C already had a user base. Even though raising the precedence of & above == would have been better, such a change was no longer possible — it would silently break existing code behavior (b&c would first compute some value, then compare with a and d using ==). The only option was to place &&’s precedence after &, without correcting & (obviously, Go as a successor could easily make the right design since the distinction between & and && was already well established). But has Go’s design always been flawless? There’s a recent counterexample.
In the upcoming Go 1.16, there’s a similar “historical episode”: after introducing io/fs, the restructured os package added a new File.ReadDir method whose functionality is nearly identical to the existing File.Readdir (note the capitalization difference). Having functions with such similar functionality and names seems to contradict Go’s design philosophy of orthogonal features. Removing the old File.Readdir would make it more intuitive for users, but this faces the same dilemma as C — for compatibility guarantees, any breaking change is unacceptable. Both methods were ultimately kept.
Back in Go 1.14, the testing package introduced a t.Cleanup method that allows registering multiple callback functions in test code, which are executed in reverse order of registration after the test completes. Looking at its implementation, can you register another Cleanup callback nested inside a Cleanup callback? As of Go 1.15, you cannot.
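To see why nesting fails, here is a tiny model of the Cleanup mechanism (an illustrative simplification, not the real testing package internals): the runner takes the registered list once before executing it, so anything registered during cleanup lands in a list that is never read.

```go
package main

import "fmt"

// T models only the Cleanup behavior of testing.T.
type T struct{ cleanups []func() }

// Cleanup registers f to run after the "test" finishes.
func (t *T) Cleanup(f func()) { t.cleanups = append(t.cleanups, f) }

// runCleanups snapshots the registered callbacks once and runs them in
// reverse registration order. Because of the snapshot, a callback
// registered *inside* another cleanup is never run, which models the
// Go 1.15 behavior the post describes.
func (t *T) runCleanups() {
	cs := t.cleanups
	t.cleanups = nil
	for i := len(cs) - 1; i >= 0; i-- {
		cs[i]()
	}
}

func main() {
	t := &T{}
	t.Cleanup(func() { fmt.Println("first registered, runs last") })
	t.Cleanup(func() {
		fmt.Println("second registered, runs first")
		t.Cleanup(func() { fmt.Println("nested: never runs in this model") })
	})
	t.runCleanups()
}
```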
In the upcoming Go 1.16, we will be able to embed resource files directly into compiled binaries. How is it implemented? What is the representation of embedded files? Starting from a broader abstraction, we need an in-memory file system. This further inspires us to think about file system abstraction: what are the minimum requirements for a file system? What operations must a file carry? All the answers to these questions are condensed here.
io/fs.FS:
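The minimum requirement turns out to be a single method: fs.FS is just Open(name string) (fs.File, error), and fs.File in turn only requires Stat, Read, and Close; everything richer (directory reading, globbing) is layered on as optional extension interfaces. A quick demonstration using testing/fstest's in-memory MapFS, which satisfies fs.FS (the file name and contents are made up):

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// An in-memory file system; fstest.MapFS implements fs.FS, so anything
// written against the fs.FS abstraction works on it unchanged.
var fsys fs.FS = fstest.MapFS{
	"hello.txt": &fstest.MapFile{Data: []byte("hello, fs")},
}

// read fetches a file's contents through the fs.FS abstraction only.
func read(name string) (string, error) {
	b, err := fs.ReadFile(fsys, name)
	return string(b), err
}

func main() {
	s, err := read("hello.txt")
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // hello, fs
}
```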
embed.FS:
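embed.FS stores files compiled into the binary and itself implements io/fs.FS, which is why the io/fs abstraction had to come first. A real project would embed assets such as templates or static files; to stay self-contained, this sketch embeds the package's own .go sources (the *.go pattern matches whatever source files sit next to it at build time):

```go
package main

import (
	"embed"
	"fmt"
	"io/fs"
)

// The //go:embed directive must directly precede the variable it fills.
//go:embed *.go
var src embed.FS

// embeddedGoFiles lists the embedded files via the io/fs API, since
// embed.FS is itself an fs.FS.
func embeddedGoFiles() ([]string, error) {
	return fs.Glob(src, "*.go")
}

func main() {
	files, err := embeddedGoFiles()
	if err != nil {
		panic(err)
	}
	fmt.Println("embedded:", files)
}
```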
Possibly the fastest implementation of getting the goroutine ID that works across all Go versions under the Go 1 compatibility guarantee.
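The implementation itself is not shown in this post. The runtime deliberately does not expose goroutine IDs, so the portable, Go-1-compatible baseline is to parse the header of runtime.Stack output, which begins with a line like "goroutine 18 [running]:". This is far slower than the version-specific fast paths the post refers to, but it illustrates what is being computed:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"strconv"
)

// goid returns the current goroutine's ID by parsing the first line of
// runtime.Stack output ("goroutine <id> [running]:").
func goid() int64 {
	var buf [64]byte
	n := runtime.Stack(buf[:], false)
	fields := bytes.Fields(buf[:n])
	id, err := strconv.ParseInt(string(fields[1]), 10, 64)
	if err != nil {
		panic(err)
	}
	return id
}

func main() {
	fmt.Println("current goroutine id:", goid())
	done := make(chan int64)
	go func() { done <- goid() }()
	fmt.Println("another goroutine id:", <-done)
}
```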
Many people have written benchmark tests. In Go Nightly Reading Episode 83, Reliable Performance Testing for Go Programs (https://talkgo.org/t/topic/102), we shared how to use tools like benchstat and perflock for rigorous and reliable performance testing. That session briefly discussed the measurement methodology and implementation principles of benchmarks, but due to time constraints, the coverage wasn’t deep enough. So today, let’s further share two details that weren’t covered in Episode 83 but are easily overlooked in certain strict testing scenarios:
1. b.N. As discussed previously, the testing package runs the code multiple times, gradually predicting how many times the code can be executed consecutively within the required time range (e.g., 1 second, resulting in, say, 100,000 iterations). But there’s an implementation detail: why doesn’t it incrementally accumulate execution times across multiple runs such that t1+t2+…+tn ≈ 1s, rather than searching for the maximum b.N where the total loop time ≈ 1s? The reason is that incremental runs introduce more systematic measurement error: benchmarks are typically unstable in early iterations (e.g., due to cache misses), and accumulating results from multiple incremental runs would further amplify this error. In contrast, finding the maximum b.N whose total consecutive execution time satisfies the required range amortizes (rather than accumulates) this systematic error across the test.
2. Does this mean the testing package’s implementation is perfect, and all we need to do as users is write benchmarks, run under perflock, and use benchstat to eliminate statistical errors? Things aren’t that simple, because the testing package’s measurement program itself also has systematic error, which in extreme scenarios can introduce significant bias. Explaining this requires more space, so here’s an additional article for further reading: Eliminating A Source of Measurement Errors in Benchmarks. In it you can learn what this intrinsic systematic measurement error is, and several reliable approaches to eliminate it when you need to benchmark such scenarios.
Hello world!