Why for-range behave differently depending on the size of the element

原文地址

https://labs.yulrizka.com/en/why-for-range-behave-differently-depending-on-the-size-of-the-element/

package main

import "testing"

const size = 1000000

type SomeStruct struct {
    ID0 int64
    ID1 int64
    ID2 int64
    ID3 int64
    ID4 int64
    ID5 int64
    ID6 int64
    ID7 int64
    ID8 int64
}

func BenchmarkForVar(b *testing.B) {
    slice := make([]SomeStruct, size)
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for _, s := range slice { // index and value
            _ = s
        }
    }
}
func BenchmarkForCounter(b *testing.B) {
    slice := make([]SomeStruct, size)
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for i := range slice { // only use the index
            s := slice[i]
            _ = s
        }
    }
}

基准测试结果

$ go test -bench .
goos: linux
goarch: amd64
BenchmarkForVar-4               4363        269711 ns/op           0 B/op           0 allocs/op
BenchmarkForCounter-4           4195        285952 ns/op           0 B/op           0 allocs/op
PASS
ok      _/test1    2.685s

并没有太大差异，但是当我们稍微改一下SomeStruct结构

type SomeStruct struct {
    ID0 int64
    ID1 int64
    ID2 int64
    ID3 int64
    ID4 int64
    ID5 int64
    ID6 int64
    ID7 int64
    ID8 int64
    ID9 int64  
}

再进行基准测试

$ go test -bench .
goos: linux
goarch: amd64
BenchmarkForVar-4                282       4264872 ns/op           0 B/op           0 allocs/op
BenchmarkForCounter-4           4363        269761 ns/op           0 B/op           0 allocs/op
PASS
ok      _/test1    3.255s

为什么？问题大概是出在range上，看下汇编。

为了容易看汇编，我们搞一个main.go

package main

func main() {
    const size = 1000000

    slice := make([]SomeStruct, size)
    for _, s := range slice {
        _ = s
    }
}

go tool compile -S main.go type.go | grep -v FUNCDATA | grep -v PCDATA

第一个版本的SomeStruct

"".main STEXT size=93 args=0x0 locals=0x28
    ...
    0x0024 00036 (main_var.go:6)    MOVQ    AX, (SP)
    0x0028 00040 (main_var.go:6)    MOVQ    $1000000, 8(SP)
    0x0031 00049 (main_var.go:6)    MOVQ    $1000000, 16(SP)
    0x003a 00058 (main_var.go:6)    CALL    runtime.makeslice(SB)
    0x003f 00063 (main_var.go:6)    XORL    AX, AX       # set AX = 0
    0x0041 00065 (main_var.go:7)    INCQ    AX           # AX++
    0x0044 00068 (main_var.go:7)    CMPQ    AX, $1000000 # AX < 1000000
    0x004a 00074 (main_var.go:7)    JLT    65               # LOOP
    ...

第二个版本

0x0000 00000 (main_var.go:3)    TEXT    "".main(SB), ABIInternal, $120-0
    ...
    0x0044 00068 (main_var.go:6)    XORL    CX, CX # CX = 0
    0x0046 00070 (main_var.go:7)    JMP    76
    0x0048 00072 (main_var.go:7)    ADDQ    $80, AX
    0x004c 00076 (main_var.go:7)    PCDATA    $0, $2 # setup temporary variable autotmp_7
    0x004c 00076 (main_var.go:7)    LEAQ    ""..autotmp_7+32(SP), DI
    0x0051 00081 (main_var.go:7)    PCDATA    $0, $3
    0x0051 00081 (main_var.go:7)    MOVQ    AX, SI
    0x0054 00084 (main_var.go:7)    PCDATA    $0, $1
    0x0054 00084 (main_var.go:7)    DUFFCOPY    $826 # copy content of the struct
    0x0067 00103 (main_var.go:7)    INCQ    CX
    0x006a 00106 (main_var.go:7)    CMPQ    CX, $1000000
    0x0071 00113 (main_var.go:7)    JLT    72
    ...

明显看到多了一个DUFFCOPY，其实压根都不用思考就知道是这边的问题了。

当然也可以看下SSA

GOSSAFUNC=main go tool compile -S main_var.go type_small.go

对比两个ssa

你说咋优化呢，附上俩

    for i := 0; i < b.N; i++ {
        for i := range slice { // only use the index
            s := slice[i]
            _ = s
        }
    }

        for i := 0; i < len(slice); i++ {
            s := slice[i]
            _ = s
        }

一般情况我们不会单独测，都是性能测试，比如pprof trace这类

When profiling my app, and run top, I see

Showing top 10 nodes out of 31 (cum >= 0.12s)
      flat  flat%   sum%        cum   cum%
    13.93s 63.00% 63.00%     13.93s 63.00%  runtime.duffcopy







end

一个没有高级趣味的人。 email：hushui502@gmail.com