Pointers Might Not Be Ideal for Parameters

Published at: 2020-11-05   |   Reading: 1814 words ~9min

We know that passing parameters by pointer avoids copying data, which ought to benefit performance. Nevertheless, there are edge cases worth our attention.

Let’s take this as an example:

// vec.go
type vec struct {
	x, y, z, w float64
}

func (v vec) addv(u vec) vec {
	return vec{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
}

func (v *vec) addp(u *vec) *vec {
	v.x, v.y, v.z, v.w = v.x+u.x, v.y+u.y, v.z+u.z, v.w+u.w
	return v
}

Which vector addition runs faster?

Intuitively, we might expect vec.addp to be faster than vec.addv because its parameter u is passed as a pointer: no data needs to be copied, whereas vec.addv copies the data both when the arguments are passed and when the result is returned.

However, if we do a micro-benchmark:

func BenchmarkVec(b *testing.B) {
	b.Run("addv", func(b *testing.B) {
		v1 := vec{1, 2, 3, 4}
		v2 := vec{4, 5, 6, 7}
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.addv(v2)
			} else {
				v2 = v2.addv(v1)
			}
		}
	})
	b.Run("addp", func(b *testing.B) {
		v1 := &vec{1, 2, 3, 4}
		v2 := &vec{4, 5, 6, 7}
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.addp(v2)
			} else {
				v2 = v2.addp(v1)
			}
		}
	})
}

Note that the loop alternates between v1 and v2, so each iteration depends on the previous result and the compiler cannot optimize the additions away. Run the benchmark as follows (perflock, from github.com/aclements/perflock, caps the CPU frequency to reduce measurement noise):

$ perflock -governor 80% go test -v -run=none -bench=. -count=10 | tee new.txt
$ benchstat new.txt

benchstat will give the following result:

name         time/op
Vec/addv-16  0.25ns ± 2%
Vec/addp-16  2.20ns ± 0%

name         alloc/op
Vec/addv-16   0.00B     
Vec/addp-16   0.00B     

name         allocs/op
Vec/addv-16    0.00     
Vec/addp-16    0.00

addv turns out to be nearly nine times faster than addp, even though it copies both its arguments and its result. How is this happening?

Inlining Optimization

This is all due to compiler optimizations, and mostly due to inlining.
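
We can ask the compiler to print its inlining decisions using the -m flag. The exact messages vary between Go versions, but the output should look roughly like this:

$ go build -gcflags='-m' .
./vec.go:7:6: can inline vec.addv
./vec.go:11:6: can inline (*vec).addp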

If we disable inlining of addv and addp using the //go:noinline directive:

//go:noinline
func (v vec) addv(u vec) vec {
	return vec{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
}

//go:noinline
func (v *vec) addp(u *vec) *vec {
	v.x, v.y, v.z, v.w = v.x+u.x, v.y+u.y, v.z+u.z, v.w+u.w
	return v
}

Then run the benchmark again and compare it against the previous measurement:

$ perflock -governor 80% go test -v -run=none -bench=. -count=10 | tee old.txt
$ benchstat old.txt new.txt
name         old time/op    new time/op    delta
Vec/addv-16    4.99ns ± 1%    0.25ns ± 2%  -95.05%  (p=0.000 n=9+10)
Vec/addp-16    3.35ns ± 1%    2.20ns ± 0%  -34.37%  (p=0.000 n=10+8)

Note that with inlining disabled, addv (4.99ns) is in fact slower than addp (3.35ns); we will return to why in a moment. With inlining enabled, the optimization transforms the vec.addv call sequence:

v1 := vec{1, 2, 3, 4}
v2 := vec{4, 5, 6, 7}
v1 = v1.addv(v2)

into a direct assignment:

v1 := vec{1, 2, 3, 4}
v2 := vec{4, 5, 6, 7}
v1 = vec{1+4, 2+5, 3+6, 4+7}

And in vec.addp's case, it transforms:

v1 := &vec{1, 2, 3, 4}
v2 := &vec{4, 5, 6, 7}
v1 = v1.addp(v2)

into direct manipulation of the fields:

v1 := vec{1, 2, 3, 4}
v2 := vec{4, 5, 6, 7}
v1.x, v1.y, v1.z, v1.w = v1.x+v2.x, v1.y+v2.y, v1.z+v2.z, v1.w+v2.w 

Addressing Modes

If we check the compiled assembly, the reason reveals itself quickly:

$ mkdir asm && go tool compile -S vec.go > asm/vec.s

The dumped assembly code is as follows:

"".vec.addv STEXT nosplit size=89 args=0x60 locals=0x0 funcid=0x0
	0x0000 00000 (vec.go:7)	TEXT	"".vec.addv(SB), NOSPLIT|ABIInternal, $0-96
	0x0000 00000 (vec.go:7)	FUNCDATA	$0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (vec.go:7)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (vec.go:8)	MOVSD	"".u+40(SP), X0
	0x0006 00006 (vec.go:8)	MOVSD	"".v+8(SP), X1
	0x000c 00012 (vec.go:8)	ADDSD	X1, X0
	0x0010 00016 (vec.go:8)	MOVSD	X0, "".~r1+72(SP)
	0x0016 00022 (vec.go:8)	MOVSD	"".u+48(SP), X0
	0x001c 00028 (vec.go:8)	MOVSD	"".v+16(SP), X1
	0x0022 00034 (vec.go:8)	ADDSD	X1, X0
	0x0026 00038 (vec.go:8)	MOVSD	X0, "".~r1+80(SP)
	0x002c 00044 (vec.go:8)	MOVSD	"".u+56(SP), X0
	0x0032 00050 (vec.go:8)	MOVSD	"".v+24(SP), X1
	0x0038 00056 (vec.go:8)	ADDSD	X1, X0
	0x003c 00060 (vec.go:8)	MOVSD	X0, "".~r1+88(SP)
	0x0042 00066 (vec.go:8)	MOVSD	"".u+64(SP), X0
	0x0048 00072 (vec.go:8)	MOVSD	"".v+32(SP), X1
	0x004e 00078 (vec.go:8)	ADDSD	X1, X0
	0x0052 00082 (vec.go:8)	MOVSD	X0, "".~r1+96(SP)
	0x0058 00088 (vec.go:8)	RET
"".(*vec).addp STEXT nosplit size=73 args=0x18 locals=0x0 funcid=0x0
	0x0000 00000 (vec.go:11)	TEXT	"".(*vec).addp(SB), NOSPLIT|ABIInternal, $0-24
	0x0000 00000 (vec.go:11)	FUNCDATA	$0, gclocals·522734ad228da40e2256ba19cf2bc72c(SB)
	0x0000 00000 (vec.go:11)	FUNCDATA	$1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
	0x0000 00000 (vec.go:12)	MOVQ	"".u+16(SP), AX
	0x0005 00005 (vec.go:12)	MOVSD	(AX), X0
	0x0009 00009 (vec.go:12)	MOVQ	"".v+8(SP), CX
	0x000e 00014 (vec.go:12)	ADDSD	(CX), X0
	0x0012 00018 (vec.go:12)	MOVSD	8(AX), X1
	0x0017 00023 (vec.go:12)	ADDSD	8(CX), X1
	0x001c 00028 (vec.go:12)	MOVSD	16(CX), X2
	0x0021 00033 (vec.go:12)	ADDSD	16(AX), X2
	0x0026 00038 (vec.go:12)	MOVSD	24(AX), X3
	0x002b 00043 (vec.go:12)	ADDSD	24(CX), X3
	0x0030 00048 (vec.go:12)	MOVSD	X0, (CX)
	0x0034 00052 (vec.go:12)	MOVSD	X1, 8(CX)
	0x0039 00057 (vec.go:12)	MOVSD	X2, 16(CX)
	0x003e 00062 (vec.go:12)	MOVSD	X3, 24(CX)
	0x0043 00067 (vec.go:13)	MOVQ	CX, "".~r1+24(SP)
	0x0048 00072 (vec.go:13)	RET

The two implementations use different addressing modes. addv reads its arguments directly from the caller's stack frame via SP-relative operands (e.g., "".u+40(SP)) and writes the sums straight into the return slots on the stack. addp, in contrast, must first MOVQ the pointer parameters into registers (AX and CX), access the fields through register-indirect operands (e.g., 8(AX), 16(CX)), and write the results back through CX before returning. But because addv passes and returns the vectors themselves, every call copies two 32-byte vectors into the callee's frame and a 32-byte result back out, while addp moves only two 8-byte pointers and mutates the data in place. With inlining disabled, this difference in memory access patterns is what makes addv slower than addp.

Conclusion

Is passing by value always faster than passing by pointer? To find out, we can run a broader test across structs of increasing size. This time, we need to use a generator to produce all the cases. Here is how we could do it:

// gen.go

// +build ignore

package main

import (
	"bytes"
	"fmt"
	"go/format"
	"io/ioutil"
	"strings"
	"text/template"
)

var (
	head = `// Code generated by go run gen.go; DO NOT EDIT.
package fields_test

import "testing"
`
	structTmpl = template.Must(template.New("ss").Parse(`
type {{.Name}} struct {
	{{.Properties}}
}

func (s {{.Name}}) addv(ss {{.Name}}) {{.Name}} {
	return {{.Name}}{
		{{.Addv}}
	}
}

func (s *{{.Name}}) addp(ss *{{.Name}}) *{{.Name}} {
	{{.Addp}}
	return s
}
`))
	benchHead = `func BenchmarkVec(b *testing.B) {`
	benchTail = `}`
	benchBody = template.Must(template.New("bench").Parse(`
	b.Run("addv-{{.Name}}", func(b *testing.B) {
		{{.InitV}}
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.addv(v2)
			} else {
				v2 = v2.addv(v1)
			}
		}
	})
	b.Run("addp-{{.Name}}", func(b *testing.B) {
		{{.InitP}}
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.addp(v2)
			} else {
				v2 = v2.addp(v1)
			}
		}
	})
`))
)

type structFields struct {
	Name       string
	Properties string
	Addv       string
	Addp       string
}
type benchFields struct {
	Name  string
	InitV string
	InitP string
}

func main() {
	w := new(bytes.Buffer)
	w.WriteString(head)

	N := 10 // number of generated struct types: s0..s9 carry 1..10 float64 fields

	for i := 0; i < N; i++ {
		var (
			ps   = []string{}
			adv  = []string{}
			adpl = []string{}
			adpr = []string{}
		)
		for j := 0; j <= i; j++ {
			ps = append(ps, fmt.Sprintf("x%d\tfloat64", j))
			adv = append(adv, fmt.Sprintf("s.x%d + ss.x%d,", j, j))
			adpl = append(adpl, fmt.Sprintf("s.x%d", j))
			adpr = append(adpr, fmt.Sprintf("s.x%d + ss.x%d", j, j))
		}
		err := structTmpl.Execute(w, structFields{
			Name:       fmt.Sprintf("s%d", i),
			Properties: strings.Join(ps, "\n"),
			Addv:       strings.Join(adv, "\n"),
			Addp:       strings.Join(adpl, ",") + " = " + strings.Join(adpr, ","),
		})
		if err != nil {
			panic(err)
		}
	}

	w.WriteString(benchHead)
	for i := 0; i < N; i++ {
		nums1, nums2 := []string{}, []string{}
		for j := 0; j <= i; j++ {
			nums1 = append(nums1, fmt.Sprintf("%d", j))
			nums2 = append(nums2, fmt.Sprintf("%d", j+i))
		}
		numstr1 := strings.Join(nums1, ", ")
		numstr2 := strings.Join(nums2, ", ")

		err := benchBody.Execute(w, benchFields{
			Name: fmt.Sprintf("s%d", i),
			InitV: fmt.Sprintf(`v1 := s%d{%s}
v2 := s%d{%s}`, i, numstr1, i, numstr2),
			InitP: fmt.Sprintf(`v1 := &s%d{%s}
			v2 := &s%d{%s}`, i, numstr1, i, numstr2),
		})
		if err != nil {
			panic(err)
		}
	}
	w.WriteString(benchTail)

	out, err := format.Source(w.Bytes())
	if err != nil {
		panic(err)
	}
	if err := ioutil.WriteFile("impl_test.go", out, 0660); err != nil {
		panic(err)
	}
}
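
One practical detail: for the go generate step below to pick up gen.go, some Go file in the package must carry a generate directive. A minimal sketch (the file name here is an assumption, not from the original setup):

// generate_test.go
//go:generate go run gen.go
package fields_test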

If we generate our test code and perform the same benchmark procedure again:

$ go generate
$ perflock -governor 80% go test -v -run=none -bench=. -count=10 | tee inline.txt
$ benchstat inline.txt
name            time/op
Vec/addv-s0-16  0.25ns ± 0%
Vec/addp-s0-16  2.20ns ± 0%
Vec/addv-s1-16  0.49ns ± 1%
Vec/addp-s1-16  2.20ns ± 0%
Vec/addv-s2-16  0.25ns ± 1%
Vec/addp-s2-16  2.20ns ± 0%
Vec/addv-s3-16  0.49ns ± 2%
Vec/addp-s3-16  2.21ns ± 1%
Vec/addv-s4-16  8.29ns ± 0%
Vec/addp-s4-16  2.37ns ± 1%
Vec/addv-s5-16  9.06ns ± 1%
Vec/addp-s5-16  2.74ns ± 1%
Vec/addv-s6-16   9.9ns ± 0%
Vec/addp-s6-16  3.17ns ± 0%
Vec/addv-s7-16  10.9ns ± 1%
Vec/addp-s7-16  3.27ns ± 1%
Vec/addv-s8-16  11.4ns ± 0%
Vec/addp-s8-16  3.29ns ± 0%
Vec/addv-s9-16  13.4ns ± 1%
Vec/addp-s9-16  3.37ns ± 0%

Observe that addv only wins while the struct has at most four float64 fields (s0 through s3, i.e., up to 32 bytes); from s4 onward, addp takes the lead. We could go even further and try a version that disables inlining:

	structTmpl = template.Must(template.New("ss").Parse(`
type {{.Name}} struct {
	{{.Properties}}
}
+//go:noinline
func (s {{.Name}}) addv(ss {{.Name}}) {{.Name}} {
	return {{.Name}}{
		{{.Addv}}
	}
}
+//go:noinline
func (s *{{.Name}}) addp(ss *{{.Name}}) *{{.Name}} {
	{{.Addp}}
	return s
}
`))

Eventually, we end up with the following results:

[Figure: time per operation of addv vs. addp for struct sizes s0–s9, with and without inlining]

TLDR: The figure demonstrates when you should pass-by-value and when pass-by-pointer. If you are certain that your code won't produce any escaping variables, and the size of your argument is at most four machine words (4*8 = 32 bytes), then go for pass-by-value; otherwise, keep using pointers.
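
To make the rule of thumb concrete, here is a minimal sketch (the types and names below are mine, not from the benchmarks above): a 32-byte struct at the pass-by-value threshold, a larger struct that should stay behind a pointer, and an example of an escaping variable that you can confirm with go build -gcflags='-m'.

package geometry

// quat is 4 float64s = 32 bytes: right at the measured threshold,
// so passing and returning it by value is cheap and inlines well.
type quat struct{ x, y, z, w float64 }

func (q quat) add(r quat) quat {
	return quat{q.x + r.x, q.y + r.y, q.z + r.z, q.w + r.w}
}

// mat4 is 16 float64s = 128 bytes: well past the threshold,
// so operate on it through pointers to avoid copying.
type mat4 struct{ m [16]float64 }

func (a *mat4) add(b *mat4) *mat4 {
	for i := range a.m {
		a.m[i] += b.m[i]
	}
	return a
}

var sink *quat

// newEscaping breaks the "no escaping variables" condition: storing &q
// in a package-level variable forces q onto the heap, which the compiler
// reports as "moved to heap" under -gcflags='-m'.
func newEscaping() *quat {
	q := quat{1, 2, 3, 4}
	sink = &q
	return sink
}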

Further Reading Suggestions

  • Changkun Ou. Conduct Reliable Benchmarking in Go. March 26, 2020. https://golang.design/s/gobench
  • Dave Cheney. Mid-stack inlining in Go. May 2, 2020. https://dave.cheney.net/2020/05/02/mid-stack-inlining-in-go
  • Dave Cheney. Inlining optimisations in Go. April 25, 2020. https://dave.cheney.net/2020/04/25/inlining-optimisations-in-go
  • MOVSD. Move or Merge Scalar Double-Precision Floating-Point Value. Last access: 2020-10-27. https://www.felixcloutier.com/x86/movsd
  • ADDSD. Add Scalar Double-Precision Floating-Point Values. Last access: 2020-10-27. https://www.felixcloutier.com/x86/addsd
  • MOVQ. Move Quadword. Last access: 2020-10-27. https://www.felixcloutier.com/x86/movq
  • Author: Changkun Ou
  • Link: https://changkun.de/blog/posts/pointers-might-not-be-ideal-for-parameters/
  • License: All articles in this blog are licensed under CC BY-NC-ND 4.0 unless stated otherwise.