如何在CPU上优化GEMM(上)

如何在CPU上优化GEMM(上)

(TL;DR)TVM提供了抽象接口,用户分别描述算法和算法的实现组织(所谓的调度)。通常,在高性能调度中编写算法会破坏算法的可读性和模块性。尝试各种看似有希望的时间表是很耗时的。在TVM的帮助下,可以有效地尝试这些调度来提高性能。             

本文将演示如何使用TVM优化平方矩阵乘法,并通过简单地添加18行额外的代码来实现比baseline基线快200倍的速度。

在CPU上执行的高强度计算应用程序有两个重要的优化:             

提高内存访问的缓存命中率。高速缓存命中率可以加速复杂的数值计算和热点内存访问。这需要我们将源内存访问模式转换为适合缓存策略的模式。             

SIMD(单指令多数据)或称之为向量处理单元。每次都会处理一小批数据,而不是单个网格。这就要求将循环体中的数据访问模式转换为统一模式,以便LLVM后端能够将其降低到SIMD。             

实际上,使用的所有方法都是本文所述技巧的子集。其中一些已经被TVM抽象自动应用,但有些由于TVM的约束而不能简单地应用。             

下面提到的所有实验结果,都是在配备Intel i7-4770HQ CPU的2015款15寸MacBook上执行的。对于所有x86 CPU,缓存线大小应为64字节。

Preparation and Baseline

本文将演示如何使用TVM优化矩阵乘法。在实际演示之前,首先定义这些变量。然后编写了一个基线实现,这是在TVM中编写矩阵乘法的最简单方法。

import tvm

import tvm.testing

from tvm import te

import numpy

import timeit

 

# The size of the matrix

# (M, K) x (K, N)

# You are free to try out different shapes, sometimes TVM optimization outperforms numpy with MKL.

M = 1024

K = 1024

N = 1024

 

# The default tensor type in tvm

dtype = "float32"

 

# using Intel AVX2(Advanced Vector Extensions) ISA for SIMD

# To get the best performance, please change the following line

# to llvm -mcpu=core-avx2, or specific type of CPU you use

target = "llvm"

ctx = tvm.context(target, 0)

 

# Random generated tensor for testing

a = tvm.nd.array(numpy.random.rand(M, K).astype(dtype), ctx)

b = tvm.nd.array(numpy.random.rand(K, N).astype(dtype), ctx)

 

np_repeat = 100

np_runing_time = timeit.timeit(

    setup="import numpy "

    "M = " + str(M) + " "

    "K = " + str(K) + " "

    "N = " + str(N) + " "

    'dtype = "float32" '

    "a = numpy.random.rand(M, K).astype(dtype) "

    "b = numpy.random.rand(K, N).astype(dtype) ",

    stmt="answer = numpy.dot(a, b)",

    number=np_repeat,

)

print("Numpy running time: %f" % (np_runing_time / np_repeat))

 

answer = numpy.dot(a.asnumpy(), b.asnumpy())

 

# Algorithm

k = te.reduce_axis((0, K), "k")

A = te.placeholder((M, K), name="A")

B = te.placeholder((K, N), name="B")

C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")

 

# Default schedule

s = te.create_schedule(C.op)

func = tvm.build(s, [A, B, C], target=target, name="mmult")

assert func

 

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)

func(a, b, c)

tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

 

evaluator = func.time_evaluator(func.entry_name, ctx, number=1)

print("Baseline: %f" % evaluator(a, b, c).mean)

Out:

Numpy running time: 0.006963

Baseline: 3.516655

In TVM, we can always inspect lower level IR to debug or optimize our schedule. Here is the generated IR using our baseline schedule.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()

  attr = {"global_symbol": "main", "tir.noalias": True}

  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),

             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),

             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}

  buffer_map = {A_1: A, B_1: B, C_1: C} {

  for (x: int32, 0, 1024) {

    for (y: int32, 0, 1024) {

      C_2[((x*1024) + y)] = 0f32

      for (k: int32, 0, 1024) {

        C_2[((x*1024) + y)] = ((float32*)C_2[((x*1024) + y)] + ((float32*)A_2[((x*1024) + k)]*(float32*)B_2[((k*1024) + y)]))

      }

    }

  }

}

Blocking

提高缓存命中率的一个重要技巧是分块——数据块将逐块计算。块内的内存访问是一个具有高内存局部性的小邻域。本文选择了32作为阻塞因子。因此,块将填充32*32*sizeof(float),即缓存中的4KB,其总大小为32KB(一级数据缓存)

bn = 32

s = te.create_schedule(C.op)

 

# Blocking by loop tiling

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

(k,) = s[C].op.reduce_axis

ko, ki = s[C].split(k, factor=4)

 

# Hoist reduction domain outside the blocking loop

s[C].reorder(xo, yo, ko, ki, xi, yi)

 

func = tvm.build(s, [A, B, C], target=target, name="mmult")

assert func

 

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)

func(a, b, c)

tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

 

# By simply tiling the loop 32x32, and hoisting ko, ki outside the blocking loops,

# we can see big speedup compared with the baseline.

evaluator = func.time_evaluator(func.entry_name, ctx, number=10)

print("Opt1: %f" % evaluator(a, b, c).mean)

Out:

Opt1: 0.284967

Here is the generated IR after blocking.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()

  attr = {"global_symbol": "main", "tir.noalias": True}

  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),

             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),

             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}

  buffer_map = {A_1: A, B_1: B, C_1: C} {

  for (x.outer: int32, 0, 32) {

    for (y.outer: int32, 0, 32) {

      for (x.inner.init: int32, 0, 32) {

        for (y.inner.init: int32, 0, 32) {

          C_2[((((x.outer*32768) + (x.inner.init*1024)) + (y.outer*32)) + y.inner.init)] = 0f32

        }

      }

      for (k.outer: int32, 0, 256) {

        for (k.inner: int32, 0, 4) {

          for (x.inner: int32, 0, 32) {

            for (y.inner: int32, 0, 32) {

              C_2[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] = ((float32*)C_2[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] + ((float32*)A_2[((((x.outer*32768) + (x.inner*1024)) + (k.outer*4)) + k.inner)]*(float32*)B_2[((((k.outer*4096) + (k.inner*1024)) + (y.outer*32)) + y.inner)]))

            }

          }

        }

      }

    }

  }

}

Vectorization

另一个重要的技巧是矢量化。当内存访问模式是一致的时,编译器可以检测到这种模式并将连续内存传递给向量处理器。在TVM中,可以使用向量化接口来提示编译器这个模式,这样就可以大大加速它。             

本文选择将内部循环行数据矢量化,因为它对缓存友好。

s = te.create_schedule(C.op)

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

(k,) = s[C].op.reduce_axis

ko, ki = s[C].split(k, factor=4)

 

s[C].reorder(xo, yo, ko, ki, xi, yi)

 

# Vectorization

s[C].vectorize(yi)

 

func = tvm.build(s, [A, B, C], target=target, name="mmult")

assert func

 

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)

func(a, b, c)

tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

 

evaluator = func.time_evaluator(func.entry_name, ctx, number=10)

print("Opt2: %f" % evaluator(a, b, c).mean)

Out:

Opt2: 0.321595

Here is the generated IR after vectorization.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()

  attr = {"global_symbol": "main", "tir.noalias": True}

  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),

             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),

             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}

  buffer_map = {A_1: A, B_1: B, C_1: C} {

  for (x.outer: int32, 0, 32) {

    for (y.outer: int32, 0, 32) {

      for (x.inner.init: int32, 0, 32) {

        C_2[ramp((((x.outer*32768) + (x.inner.init*1024)) + (y.outer*32)), 1, 32)] = broadcast(0f32, 32)

      }

      for (k.outer: int32, 0, 256) {

        for (k.inner: int32, 0, 4) {

          for (x.inner: int32, 0, 32) {

            C_2[ramp((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)B_2[ramp((((k.outer*4096) + (k.inner*1024)) + (y.outer*32)), 1, 32)]))

          }

        }

      }

    }

  }

}

Loop Permutation

上面的IR,可以看到内循环行数据被矢量化,B被转换成PackedB。PackedB的遍历现在是连续的。因此,将研究A的访问模式。在当前调度中,A被逐列访问,这对缓存不友好。如果改变了KI和内轴席的嵌套循环顺序,则矩阵的访问模式更容易缓存。

s = te.create_schedule(C.op)

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

(k,) = s[C].op.reduce_axis

ko, ki = s[C].split(k, factor=4)

 

# re-ordering

s[C].reorder(xo, yo, ko, xi, ki, yi)

s[C].vectorize(yi)

 

func = tvm.build(s, [A, B, C], target=target, name="mmult")

assert func

 

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)

func(a, b, c)

tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

 

evaluator = func.time_evaluator(func.entry_name, ctx, number=10)

print("Opt3: %f" % evaluator(a, b, c).mean)

Out:

Opt3: 0.111657

Here is the generated IR after loop permutation.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()

  attr = {"global_symbol": "main", "tir.noalias": True}

  buffers = {B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),

             C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),

             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}

  buffer_map = {A_1: A, B_1: B, C_1: C} {

  for (x.outer: int32, 0, 32) {

    for (y.outer: int32, 0, 32) {

      for (x.inner.init: int32, 0, 32) {

        C_2[ramp((((x.outer*32768) + (x.inner.init*1024)) + (y.outer*32)), 1, 32)] = broadcast(0f32, 32)

      }

      for (k.outer: int32, 0, 256) {

        for (x.inner: int32, 0, 32) {

          for (k.inner: int32, 0, 4) {

            C_2[ramp((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)B_2[ramp((((k.outer*4096) + (k.inner*1024)) + (y.outer*32)), 1, 32)]))

          }

        }

      }

    }

  }

}

 

原文地址:https://www.cnblogs.com/wujianming-110117/p/14108195.html