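Optimizing Deep Learning GPU Operators with TVM

Efficient deep learning operators are at the core of deep learning systems. These operators are usually hard to optimize and demand considerable effort from HPC experts. TVM, an end-to-end tensor IR/DSL stack, makes this much easier.

This article shows how to write high-performance GPU operator kernels with the help of TVM. It uses depthwise convolution (i.e. topi.nn.depthwise_conv2d_nchw) as an example and demonstrates how to improve on the already hand-optimized CUDA kernel in tensorflow. Across different workloads, the final version is 2x to 4x faster than the optimized kernel in tf-1.2, and 3x to 7x faster with operator fusion enabled. These results were measured on a GTX1080 with filter size = [1, 256, 3, 3], stride = [1, 1], padding = 'SAME'.

Introduction to Depthwise Convolution

Depthwise convolution is an important building block of modern architectures such as Xception and MobileNet. It is an effective way to reduce the computational complexity of deep neural networks.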
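To make the complexity claim concrete, here is a minimal sketch that counts multiply-accumulate operations (MACs) for a dense convolution and for a depthwise convolution with the same shape. The function names and the example shape are illustrative assumptions and are not part of TVM.

```python
def standard_conv_macs(in_channel, out_channel, out_h, out_w, k_h, k_w):
    """MACs of a dense convolution: every output channel reads every input channel."""
    return out_channel * out_h * out_w * in_channel * k_h * k_w


def depthwise_conv_macs(in_channel, channel_multiplier, out_h, out_w, k_h, k_w):
    """MACs of a depthwise convolution: every output channel reads one input channel."""
    out_channel = in_channel * channel_multiplier
    return out_channel * out_h * out_w * k_h * k_w


# Illustrative shape (assumption): 256 channels, 96 x 96 output, 3 x 3 filter.
dense = standard_conv_macs(256, 256, 96, 96, 3, 3)
depthwise = depthwise_conv_macs(256, 1, 96, 96, 3, 3)
ratio = dense // depthwise  # equals in_channel when channel_multiplier == 1
```

With channel_multiplier = 1 the saving factor is simply the number of input channels.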

In TVM, depthwise convolution can be declared as:

# padding stage
PaddedInput = tvm.compute(
    (batch, in_channel, height_after_pad, width_after_pad),
    lambda b, c, i, j: tvm.select(
        tvm.all(i >= pad_top, i - pad_top < in_height, j >= pad_left, j - pad_left < in_width),
        Input[b, c, i - pad_top, j - pad_left], tvm.const(0.0)),
    name="PaddedInput")
# depthconv stage
di = tvm.reduce_axis((0, filter_height), name='di')
dj = tvm.reduce_axis((0, filter_width), name='dj')
Output = tvm.compute(
    (batch, out_channel, out_height, out_width),
    lambda b, c, i, j: tvm.sum(
        PaddedInput[b, c/channel_multiplier, i*stride_h + di, j*stride_w + dj] * Filter[c/channel_multiplier, c%channel_multiplier, di, dj],
        axis=[di, dj]),
    name='DepthwiseConv2d')
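As a cross-check for the declaration, the following is a minimal pure-Python reference written under the same index convention (NCHW input, filter indexed as [in_channel, channel_multiplier, di, dj]). The function name and the nested-list representation are assumptions made for illustration; this is a sketch for validating results on tiny inputs, not the TVM implementation.

```python
def depthwise_conv2d_nchw_ref(inp, flt, stride_h, stride_w,
                              pad_top, pad_left, pad_bottom, pad_right):
    """Reference depthwise convolution on nested lists (hypothetical helper)."""
    batch, in_channel = len(inp), len(inp[0])
    in_h, in_w = len(inp[0][0]), len(inp[0][0][0])
    multiplier = len(flt[0])
    k_h, k_w = len(flt[0][0]), len(flt[0][0][0])
    out_channel = in_channel * multiplier
    out_h = (in_h + pad_top + pad_bottom - k_h) // stride_h + 1
    out_w = (in_w + pad_left + pad_right - k_w) // stride_w + 1

    def padded(b, c, i, j):
        # padding stage: zero outside the original image
        y, x = i - pad_top, j - pad_left
        if 0 <= y < in_h and 0 <= x < in_w:
            return inp[b][c][y][x]
        return 0.0

    out = [[[[0.0] * out_w for _ in range(out_h)]
            for _ in range(out_channel)] for _ in range(batch)]
    for b in range(batch):
        for c in range(out_channel):
            for i in range(out_h):
                for j in range(out_w):
                    acc = 0.0
                    # depthconv stage: reduce over the filter window
                    for di in range(k_h):
                        for dj in range(k_w):
                            acc += (padded(b, c // multiplier,
                                           i * stride_h + di, j * stride_w + dj)
                                    * flt[c // multiplier][c % multiplier][di][dj])
                    out[b][c][i][j] = acc
    return out


# 1 x 1 x 3 x 3 input of ones, 3 x 3 filter of ones, stride 1, padding 1 on each side.
ones_input = [[[[1.0] * 3 for _ in range(3)]]]
ones_filter = [[[[1.0] * 3 for _ in range(3)]]]
result = depthwise_conv2d_nchw_ref(ones_input, ones_filter, 1, 1, 1, 1, 1, 1)
```

In this toy case the center output sees the full 3 x 3 window, while border outputs see fewer nonzero inputs because of the zero padding.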

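General GPU Optimization Guidelines

This part briefly discusses three concepts you should know when optimizing CUDA code: data reuse, shared memory, and bank conflicts.

Data Reuse

In modern computing architectures, loading data from memory costs much more than a single floating-point computation. For that reason, we always want to reuse input data after it has been loaded into registers or shared memory (cache).

There are two forms of data reuse in depthwise convolution: filter reuse and input reuse. Filter reuse happens as the filter slides over the input channel and is applied many times. Input reuse is achieved through tiling. Take 3x3 depthwise convolution as an example: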


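Without tiling, each thread computes 1 output element and loads 3x3 input data. 16 threads together perform 9x16 loads.

With tiling, each thread computes 2x2 output elements and loads 4x4 input data. 4 threads together perform 16x4 loads.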
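The load counts above can be reproduced with a short calculation. The sketch below assumes stride 1, a 4 x 4 output region, and tiles that divide the output evenly; the function name is hypothetical.

```python
def total_input_loads(out_h, out_w, tile_h, tile_w, k_h=3, k_w=3):
    """Return (threads, loads per thread, total loads) for stride-1 tiling."""
    threads = (out_h // tile_h) * (out_w // tile_w)
    # A tile of tile_h x tile_w outputs needs a (tile_h + k_h - 1) x (tile_w + k_w - 1) input patch.
    loads_per_thread = (tile_h + k_h - 1) * (tile_w + k_w - 1)
    return threads, loads_per_thread, threads * loads_per_thread


no_tiling = total_input_loads(4, 4, 1, 1)  # one output element per thread
tiling = total_input_loads(4, 4, 2, 2)     # one 2 x 2 output tile per thread
```

For the same 16 outputs, tiling cuts the total from 144 loads to 64 because neighbouring outputs inside a tile share most of their input window.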

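Shared Memory and Bank Conflicts

Shared memory can be seen as a cache in the GPU. It is on-chip and much faster than global memory.

Shared memory is allocated per block. A common practice is to load data from global memory into shared memory, and then have all threads in the block read data from shared memory.

The size of shared memory is limited (usually 48K), so we must be careful not to overflow it. In addition, allocating too much shared memory to one block limits the number of active blocks per multiprocessor.

Another performance issue with shared memory is bank conflicts. Shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. However, if multiple threads access the same memory bank (causing a bank conflict), the accesses are serialized, which decreases the effective bandwidth.

Shared memory banks are organized so that successive addresses are assigned to successive banks. To avoid bank conflicts, it is best for successive threads to access successive memory addresses, as shown in the figure (each color represents one shared memory bank).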
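The bank mapping can be illustrated with a simplified model. The sketch below assumes 32 banks, addresses counted in 4-byte words, and that threads reading the same word are served by a broadcast; these are modelling assumptions, and the function name is hypothetical.

```python
from collections import Counter


def conflict_degree(word_addresses, num_banks=32):
    """Largest number of distinct words that land in one bank (1 means conflict-free)."""
    # Successive word addresses map to successive banks.
    banks = Counter(addr % num_banks for addr in set(word_addresses))
    return max(banks.values())


warp = range(32)
consecutive = conflict_degree([t for t in warp])    # thread t reads word t
stride_two = conflict_degree([t * 2 for t in warp])  # thread t reads word 2t
stride_32 = conflict_degree([t * 32 for t in warp])  # thread t reads word 32t
```

Under this model, consecutive access is conflict-free, a stride of 2 produces a 2-way conflict, and a stride of 32 serializes all 32 accesses on a single bank.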

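Now let's start optimizing depthwise convolution in TVM.

Schedule Optimization

Compute PaddedInput Inline to Save Memory Allocation

As we can see from part 1, padding is declared explicitly as a separate stage. We compute it inline to avoid redundant memory allocation:

s = tvm.create_schedule(Output.op)
s[PaddedInput].compute_inline()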

Divide One Large Channel into Smaller Blocks

One straightforward schedule for depthwise convolution is to let one cuda block take care of one input channel and the corresponding filter, load them into shared memory, and then compute:

# DepthwiseConv2d here refers to the Output tensor declared above
IS = s.cache_read(PaddedInput, "shared", [DepthwiseConv2d])
FS = s.cache_read(Filter, "shared", [DepthwiseConv2d])
block_y = tvm.thread_axis("blockIdx.y")
block_x = tvm.thread_axis("blockIdx.x")
# bind the dimension of batch (N in NCHW) with block_y
s[Output].bind(Output.op.axis[0], block_y)
# bind the dimension of channel (C in NCHW) with block_x
s[Output].bind(Output.op.axis[1], block_x)

We measured the average time cost of 1000 runs on a GTX 1080 and compared it with depthwise_conv2d in tensorflow. Here is the result:

Input	Filter	stride	tf-1.2 SAME pad (us)	TVM SAME pad (us)
[1, 256, 21, 21]	[256, 1, 3, 3]	[1, 1]	16.1	9.1
[1, 256, 32, 32]	[256, 1, 3, 3]	[1, 1]	34.8	14.5
[1, 256, 64, 64]	[256, 1, 3, 3]	[1, 1]	130.9	98.9
[1, 256, 96, 96]	[256, 1, 3, 3]	[1, 1]	251.6	387.4

As we can see, this schedule performs well with small channel sizes such as 21 x 21 or 32 x 32. However, its performance drops seriously as the channel size grows beyond 64 x 64. One main reason is that too much shared memory allocated to one block limits the number of active blocks per multiprocessor.
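A rough estimate shows how quickly the shared memory footprint grows with the channel size. The sketch below assumes float32 data, stride 1 with SAME padding, one padded input tile plus one 3 x 3 filter per block, and 96 KiB of shared memory per multiprocessor; the last figure is an assumption for illustration, and real occupancy also depends on registers and thread counts. The function names are hypothetical.

```python
def shared_bytes_per_block(tile_h, tile_w, k_h=3, k_w=3, dtype_bytes=4):
    """Shared memory for one padded input tile plus one filter."""
    padded_elems = (tile_h + k_h - 1) * (tile_w + k_w - 1)
    return (padded_elems + k_h * k_w) * dtype_bytes


def max_blocks_by_shared_memory(bytes_per_block, shared_bytes_per_sm):
    """Upper bound on resident blocks per multiprocessor from shared memory alone."""
    return shared_bytes_per_sm // bytes_per_block


SHARED_PER_SM = 96 * 1024  # assumption, not taken from the article
usage = {size: shared_bytes_per_block(size, size) for size in (21, 32, 64, 96)}
limits = {size: max_blocks_by_shared_memory(nbytes, SHARED_PER_SM)
          for size, nbytes in usage.items()}
```

Under these assumptions a 96 x 96 channel needs about 38 KB per block, so only two blocks fit on a multiprocessor, whereas a 32 x 32 tile leaves room for many more.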

We modified the schedule to divide one large channel into smaller blocks. For example, one channel (64 x 64 or 96 x 96) is divided into blocks of 32 x 32, and one cuda block takes care of one 32 x 32 block:

blocking_h = 32
blocking_w = 32
# split the dimension of height (H in NCHW)
bx1, _ = s[Output].split(Output.op.axis[2], factor=blocking_h)
# split the dimension of width (W in NCHW)
bx2, _ = s[Output].split(Output.op.axis[3], factor=blocking_w)
# assign one 32 x 32 block to one cuda block
by = s[Output].fuse(Output.op.axis[0], Output.op.axis[1])
s[Output].bind(by, block_y)
bx = s[Output].fuse(bx1, bx2)
s[Output].bind(bx, block_x)

Here is the new result:

Input	[blocking_h, blocking_w]	tf-1.2 SAME pad (us)	TVM SAME pad (us)
[1, 256, 64, 64]	[32, 32]	130.9	63.4
[1, 256, 96, 96]	[32, 32]	251.6	132.5

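The blocking strategy works! For a 64 x 64 channel size, it brings a 1.6x speedup (98.9us -> 63.4us); for a 96 x 96 channel size, it brings a 2.9x speedup (387.4us -> 132.5us).

Tuning Parameters of Thread Numbers

How should the workload of one cuda block (say, 32x32) be scheduled among its threads? Intuitively, it should look like this:

num_thread_y = 8
num_thread_x = 8
thread_y = tvm.thread_axis((0, num_thread_y), "threadIdx.y")
thread_x = tvm.thread_axis((0, num_thread_x), "threadIdx.x")
# h_dim and w_dim are assumed to be the height and width axes within one block,
# i.e. the inner axes returned by the blocking split (discarded as `_` above)
ty, yi = s[Output].split(h_dim, nparts=num_thread_y)
tx, xi = s[Output].split(w_dim, nparts=num_thread_x)
s[Output].reorder(ty, tx, yi, xi)
s[Output].bind(ty, thread_y)
s[Output].bind(tx, thread_x)

There are two parameters in the schedule: num_thread_y and num_thread_x. How do we determine the best combination? Let's do some experiments first. Below is the result with Filter = [256, 1, 3, 3] and stride = [1, 1]: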

Case	Input	num_thread_y	num_thread_x	TVM SAME pad (us)
1	[1, 256, 32, 32]	8	32	9.7
2	[1, 256, 32, 32]	4	32	8.8
3	[1, 256, 32, 32]	1	32	17.7
4	[1, 256, 32, 32]	32	1	32.5

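Several observations follow from the results above:

Case 2 is faster than case 1. In case 2, each thread computes an 8x1 tile of the output, which corresponds to a 10x3 tile of the input. This gives better data reuse than the 4x1 tile of case 1.
Case 3 is slower than case 2. In case 3, the workload per thread is too large, which leads to a high cost of local memory reads.
Case 4 is slower than case 3. num_thread_x = 32 ensures that there are no bank conflicts, while num_thread_y = 32 does not.

To summarize:

Large tiles are good for data reuse, but bad for local memory reads.
The influence of num_thread_y and num_thread_x on bank conflicts is asymmetric.
Finding the best combination of num_thread_y and num_thread_x means balancing efficient shared memory access (avoiding bank conflicts), data reuse, and local memory reads.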
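The tile shapes quoted for each case follow directly from the thread counts. The sketch below recomputes them for a 32 x 32 block with a 3 x 3 filter and stride 1, and adds a simple reuse ratio (multiply-accumulates per loaded input element). The ratio is an illustrative metric introduced here, not one used in the measurements, and the function name is hypothetical.

```python
def per_thread_tiles(block_h, block_w, num_thread_y, num_thread_x, k_h=3, k_w=3):
    """Return (output tile, input tile, MACs per loaded input element) for one thread."""
    tile_h = block_h // num_thread_y
    tile_w = block_w // num_thread_x
    in_h, in_w = tile_h + k_h - 1, tile_w + k_w - 1
    reuse = (tile_h * tile_w * k_h * k_w) / (in_h * in_w)
    return (tile_h, tile_w), (in_h, in_w), reuse


# (num_thread_y, num_thread_x) for cases 1 to 4 in the table above
cases = {1: (8, 32), 2: (4, 32), 3: (1, 32), 4: (32, 1)}
summary = {case: per_thread_tiles(32, 32, ty, tx) for case, (ty, tx) in cases.items()}
```

Cases 3 and 4 have the same reuse ratio by this measure, which is consistent with the observation that their difference comes from memory access patterns rather than from data reuse.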

So how do we find the best combination? The answer is brute-force search. We can pass num_thread_y and num_thread_x as arguments to the schedule function and try all possible combinations to find the best one. This is easy to do in TVM:

# pseudocode: `...` and do_schedule_as_usual stand for the schedule described above
def schedule_depthwise_conv2d(..., num_thread_y=8, num_thread_x=8):
    num_thread_y = num_thread_y
    num_thread_x = num_thread_x
    do_schedule_as_usual
    return schedule

min_time_cost = float("inf")
for num_thread_y, num_thread_x in all_possible_combinations:
    schedule = schedule_depthwise_conv2d(..., num_thread_y=num_thread_y, num_thread_x=num_thread_x)
    time_cost = test_depthwise_conv2d(..., schedule)
    if time_cost < min_time_cost:
        min_time_cost = time_cost
        optimal_combination = [num_thread_y, num_thread_x]

In fact, this can be seen as a simple auto scheduler.
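A runnable version of the search loop might look like the sketch below. It enumerates thread shapes that divide a 32 x 32 block evenly and keeps the cheapest one. The helper names and the 1024-threads-per-block cap are assumptions, and the cost function is only a stand-in that looks up the four timings reported in the table above; in practice it would build the schedule and time the kernel.

```python
import itertools
import math


def candidate_thread_shapes(block_h, block_w, max_threads=1024):
    """Yield (num_thread_y, num_thread_x) pairs that evenly divide the block."""
    ys = [d for d in range(1, block_h + 1) if block_h % d == 0]
    xs = [d for d in range(1, block_w + 1) if block_w % d == 0]
    for ty, tx in itertools.product(ys, xs):
        if ty * tx <= max_threads:
            yield ty, tx


def brute_force_search(measure, block_h=32, block_w=32):
    """Try every candidate and return the shape with the lowest measured cost."""
    best_cost, best_shape = math.inf, None
    for shape in candidate_thread_shapes(block_h, block_w):
        cost = measure(*shape)
        if cost < best_cost:
            best_cost, best_shape = cost, shape
    return best_shape, best_cost


# Stand-in benchmark: the four timings (us) from the table; other shapes count as unmeasured.
measured_us = {(8, 32): 9.7, (4, 32): 8.8, (1, 32): 17.7, (32, 1): 32.5}
best_shape, best_cost = brute_force_search(
    lambda ty, tx: measured_us.get((ty, tx), math.inf))
```

Restricting candidates to divisors of the block size keeps the search space small (36 shapes for a 32 x 32 block) and avoids partially filled tiles.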

Vthread and Strided Patterns

Vthread (virtual thread) was introduced in TVM to support strided patterns. It can be used like this:

num_vthread_y = 2
num_vthread_x = 2
num_thread_y = 8
num_thread_x = 8
thread_vy = tvm.thread_axis((0, num_vthread_y), "vthread", name="vy")
thread_vx = tvm.thread_axis((0, num_vthread_x), "vthread", name="vx")
thread_y = tvm.thread_axis((0, num_thread_y), "threadIdx.y")
thread_x = tvm.thread_axis((0, num_thread_x), "threadIdx.x")
# split the dimension of height (H in NCHW) twice
tvy, vyi = s[Output].split(h_dim, nparts=num_vthread_y)
ty, yi = s[Output].split(vyi, nparts=num_thread_y)
# split the dimension of width (W in NCHW) twice
tvx, vxi = s[Output].split(w_dim, nparts=num_vthread_x)
tx, xi = s[Output].split(vxi, nparts=num_thread_x)
# bind thread and vthread respectively
s[Output].bind(tvy, thread_vy)
s[Output].bind(tvx, thread_vx)
s[Output].bind(ty, thread_y)
s[Output].bind(tx, thread_x)
s[Output].reorder(tvy, tvx, ty, tx, yi, xi)

Let’s print the IR to see what vthread does:

/* Input = [1, 1, 32, 32], Filter = [1, 1, 3, 3], stride = [1, 1], padding = 'SAME' */

produce DepthwiseConv2d {

// attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 1

// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1

// attr [iter_var(threadIdx.y, Range(min=0, extent=8), threadIdx.y)] thread_extent = 8

// attr [iter_var(threadIdx.x, Range(min=0, extent=8), threadIdx.x)] thread_extent = 8

for (i.inner.inner.inner, 0, 2) {

for (j.inner.inner.inner, 0, 2) {

DepthwiseConv2d[((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner)] = 0.000000f

DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + 512)] = 0.000000f

DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + 16)] = 0.000000f

DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + 528)] = 0.000000f

for (di, 0, 3) {

for (dj, 0, 3) {

DepthwiseConv2d[((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner)] = (DepthwiseConv2d[((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner)] + (tvm_if_then_else(((((((1 - di) - i.inner.inner.inner) <= (((blockIdx.x*16) + threadIdx.y)*2)) && ((((blockIdx.x*16) + threadIdx.y)*2) < ((33 - di) - i.inner.inner.inner))) && (((1 - dj) - j.inner.inner.inner) <= (threadIdx.x*2))) && ((threadIdx.x*2) < ((33 - dj) - j.inner.inner.inner))), Input[(((((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + (di*32)) + dj) + -33)], 0.000000f)*Filter[((di*3) + dj)]))

DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + 512)] = (DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + 512)] + (tvm_if_then_else(((((((-15 - di) - i.inner.inner.inner) <= (((blockIdx.x*16) + threadIdx.y)*2)) && ((((blockIdx.x*16) + threadIdx.y)*2) < ((17 - di) - i.inner.inner.inner))) && (((1 - dj) - j.inner.inner.inner) <= (threadIdx.x*2))) && ((threadIdx.x*2) < ((33 - dj) - j.inner.inner.inner))), Input[(((((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + (di*32)) + dj) + 479)], 0.000000f)*Filter[((di*3) + dj)]))

DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + 16)] = (DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + 16)] + (tvm_if_then_else(((((((1 - di) - i.inner.inner.inner) <= (((blockIdx.x*16) + threadIdx.y)*2)) && ((((blockIdx.x*16) + threadIdx.y)*2) < ((33 - di) - i.inner.inner.inner))) && (((-15 - dj) - j.inner.inner.inner) <= (threadIdx.x*2))) && ((threadIdx.x*2) < ((17 - dj) - j.inner.inner.inner))), Input[(((((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + (di*32)) + dj) + -17)], 0.000000f)*Filter[((di*3) + dj)]))

DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + 528)] = (DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + 528)] + (tvm_if_then_else(((((((-15 - di) - i.inner.inner.inner) <= (((blockIdx.x*16) + threadIdx.y)*2)) && ((((blockIdx.x*16) + threadIdx.y)*2) < ((17 - di) - i.inner.inner.inner))) && (((-15 - dj) - j.inner.inner.inner) <= (threadIdx.x*2))) && ((threadIdx.x*2) < ((17 - dj) - j.inner.inner.inner))), Input[(((((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)*2) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + (di*32)) + dj) + 495)], 0.000000f)*Filter[((di*3) + dj)]))

}
}
}
}
}

Without vthread (just set to 1), the IR is:

/* Input = [1, 1, 32, 32], Filter = [1, 1, 3, 3], stride = [1, 1], padding = 'SAME' */

produce DepthwiseConv2d {

// attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 1

// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1

// attr [iter_var(threadIdx.y, Range(min=0, extent=8), threadIdx.y)] thread_extent = 8

// attr [iter_var(threadIdx.x, Range(min=0, extent=8), threadIdx.x)] thread_extent = 8

for (i.inner.inner.inner, 0, 4) {

for (j.inner.inner.inner, 0, 4) {

DepthwiseConv2d[((((((((blockIdx.y + blockIdx.x)*8) + threadIdx.y)*32) + threadIdx.x)*4) + (i.inner.inner.inner*32)) + j.inner.inner.inner)] = 0.000000f

for (di, 0, 3) {

for (dj, 0, 3) {

DepthwiseConv2d[((((((((blockIdx.y + blockIdx.x)*8) + threadIdx.y)*32) + threadIdx.x)*4) + (i.inner.inner.inner*32)) + j.inner.inner.inner)] = (DepthwiseConv2d[((((((((blockIdx.y + blockIdx.x)*8) + threadIdx.y)*32) + threadIdx.x)*4) + (i.inner.inner.inner*32)) + j.inner.inner.inner)] + (tvm_if_then_else(((((((1 - di) - i.inner.inner.inner) <= (((blockIdx.x*8) + threadIdx.y)*4)) && ((((blockIdx.x*8) + threadIdx.y)*4) < ((33 - di) - i.inner.inner.inner))) && (((1 - dj) - j.inner.inner.inner) <= (threadIdx.x*4))) && ((threadIdx.x*4) < ((33 - dj) - j.inner.inner.inner))), Input[(((((((((((blockIdx.y + blockIdx.x)*8) + threadIdx.y)*32) + threadIdx.x)*4) + (i.inner.inner.inner*32)) + j.inner.inner.inner) + (di*32)) + dj) + -33)], 0.000000f)*Filter[((di*3) + dj)]))

}
}
}
}
}

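As we can see, when num_vthread_y = 2 and num_vthread_x = 2, the 32 x 32 channel is divided into four sub-channels of 16 x 16. Each thread computes four output elements at a time, one element in each sub-channel.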
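The strided pattern can be seen by listing the rows that a single thread covers along one axis. The sketch below mirrors the two nested splits (first by nparts=num_vthread, then by nparts=num_thread) and assumes that the extent divides evenly; the function name is hypothetical.

```python
def rows_for_thread(extent, num_vthread, num_thread, thread_id):
    """Indices along one axis that one thread computes after the two nparts splits."""
    per_vthread = extent // num_vthread    # size of one sub-channel along this axis
    per_thread = per_vthread // num_thread  # contiguous rows per thread inside it
    return [v * per_vthread + thread_id * per_thread + inner
            for v in range(num_vthread)
            for inner in range(per_thread)]


with_vthread = rows_for_thread(32, 2, 8, 0)     # num_vthread_y = 2
without_vthread = rows_for_thread(32, 1, 8, 0)  # num_vthread_y = 1
next_thread = rows_for_thread(32, 2, 8, 1)
```

With vthread, thread 0 handles rows 0, 1, 16 and 17, which matches the offset of 512 elements (16 rows of 32) in the printed IR. Without it, the same thread handles the contiguous rows 0 to 3.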

Below is the result with Filter = [256, 1, 3, 3], stride = [1, 1], blocking_h = 32, blocking_w = 32:

Case	Input	num_thread_y, num_thread_x	num_vthread_y, num_vthread_x	TVM SAME pad (us)
1	[1, 256, 96, 96]	8, 8	1, 1	132.5
2	[1, 256, 96, 96]	8, 8	1, 4	103.1
3	[1, 256, 96, 96]	4, 32	1, 1	95.9
4	[1, 256, 96, 96]	8, 16	1, 2	90.9

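Case 2 is faster than case 1. In case 2, num_thread_x = 8 and num_vthread_x = 4 together ensure that consecutive threads access consecutive memory addresses, which avoids bank conflicts, as illustrated in the figure (each color represents one thread's workload).

In theory, case 3 and case 4 should be equally fast, since they have the same workload per thread and both enjoy efficient shared memory access. In practice, case 4 is slightly faster.

Still remember tensorflow's speed? It is 251.6us, and TVM is now 2.8x faster. Along the path 387.4 -> 132.5 -> 95.9 -> 90.9, blocking helps the most, tuning thread numbers saves 37us, and vthread saves an additional 5us.

In fact, TVM can be much faster than tensorflow with large kernel sizes or a larger channel_multiplier (because of more filter reuse):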

Input	Filter	stride	tf-1.2 SAME pad (us)	TVM SAME pad (us)	TVM speedup
[1, 256, 96, 96]	[256, 1, 3, 3]	[1, 1]	251.6	90.9	2.8x
[1, 256, 96, 96]	[256, 1, 5, 5]	[1, 1]	597.6	128.9	4.6x
[1, 256, 96, 96]	[256, 2, 3, 3]	[1, 1]	659.9	143.7	4.6x
[1, 256, 96, 96]	[256, 2, 5, 5]	[1, 1]	1203.9	170.5	7.1x


Operator Fusion

One typical optimization in deep learning is operator fusion, which computes multiple operators together in a single kernel without saving intermediate results back to global memory. TVM supports this out of the box.

Consider a common pattern in neural networks: depthwise_conv2d + scale_shift + relu. We can fuse the three operators into one by slightly modifying the original schedule:

DepthwiseConv2d = topi.nn.depthwise_conv2d(Input, Filter, stride, padding)
ScaleShift = topi.nn.scale_shift(DepthwiseConv2d, Scale, Shift)
Relu = topi.nn.relu(ScaleShift)
Output = Relu # is no longer DepthwiseConv2d
s[ScaleShift].compute_inline() # this line fuses ScaleShift, explicitly
s[DepthwiseConv2d].set_scope("local") # this line fuses DepthwiseConv2d, implicitly
schedule(Output) # schedule for Output the same way we schedule for DepthwiseConv2d as discussed above
s[DepthwiseConv2d].compute_at(s[Output], tx) # tx is the inner most axis, bound to threadIdx.x

It generates IR like this:

/* Input = [1, 1, 32, 32], Filter = [1, 1, 3, 3], stride = [1, 1], padding = 'SAME' */

produce Relu {

// attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 1

// attr [DepthwiseConv2d] storage_scope = "local"

allocate DepthwiseConv2d[float32 * 1 * 1 * 4 * 4]

// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1

// attr [iter_var(threadIdx.y, Range(min=0, extent=8), threadIdx.y)] thread_extent = 8

// attr [iter_var(threadIdx.x, Range(min=0, extent=8), threadIdx.x)] thread_extent = 8

produce DepthwiseConv2d {

for (i, 0, 4) {

for (j, 0, 4) {

DepthwiseConv2d[((i*4) + j)] = 0.000000f

for (di, 0, 3) {

for (dj, 0, 3) {

DepthwiseConv2d[((i*4) + j)] = (DepthwiseConv2d[((i*4) + j)] + (tvm_if_then_else(((((((1 - di) - i) <= (((blockIdx.x*8) + threadIdx.y)*4)) && ((((blockIdx.x*8) + threadIdx.y)*4) < ((33 - di) - i))) && (((1 - dj) - j) <= (threadIdx.x*4))) && ((threadIdx.x*4) < ((33 - dj) - j))), Input[(((((((((((blockIdx.y + blockIdx.x)*8) + threadIdx.y)*32) + threadIdx.x)*4) + (i*32)) + j) + (di*32)) + dj) + -33)], 0.000000f)*Filter[((di*3) + dj)]))

}
}
}
}
}

for (i2.inner.inner.inner, 0, 4) {

for (i3.inner.inner.inner, 0, 4) {

Relu[((((((((blockIdx.y + blockIdx.x)*8) + threadIdx.y)*32) + threadIdx.x)*4) + (i2.inner.inner.inner*32)) + i3.inner.inner.inner)] = max(((DepthwiseConv2d[((i2.inner.inner.inner*4) + i3.inner.inner.inner)]*Scale[0]) + Shift[0]), 0.000000f)

}
}
}

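As we can see, each thread computes scale_shift and relu before writing the result of depthwise_conv2d to global memory. The fused operator is as fast as a single depthwise_conv2d. Below is the result with Input = [1, 256, 96, 96], Filter = [256, 1, 3, 3], stride = [1, 1], padding = 'SAME':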

tf-1.2 depthwise_conv2d: 251.6 us
tf-1.2 depthwise_conv2d + scale_shift + relu (separate): 419.9 us
TVM depthwise_conv2d: 90.9 us
TVM depthwise_conv2d + scale_shift + relu (fused): 91.5 us

The advantage of operator fusion is obvious.
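The effect of fusion on memory traffic can be sketched in plain Python by counting simulated global memory writes. The function names, the single-channel scalar scale and shift, and the stand-in convolution values are all assumptions made for illustration; this models the idea rather than the generated kernel.

```python
def run_separate(conv_value, n, scale, shift):
    """Three kernels: each writes its full result to (simulated) global memory."""
    conv = [conv_value(i) for i in range(n)]    # kernel 1 writes n values
    scaled = [x * scale + shift for x in conv]  # kernel 2 writes n values
    relu = [max(x, 0.0) for x in scaled]        # kernel 3 writes n values
    return relu, 3 * n


def run_fused(conv_value, n, scale, shift):
    """One kernel: the convolution result stays local until the final write."""
    out = []
    for i in range(n):
        local = conv_value(i)                        # kept in "local" scope
        out.append(max(local * scale + shift, 0.0))  # the only global write
    return out, n


def conv_value(i):
    # Hypothetical stand-in for one depthwise convolution output element.
    return float(i - 4)


separate_out, separate_writes = run_separate(conv_value, 8, 2.0, 1.0)
fused_out, fused_writes = run_fused(conv_value, 8, 2.0, 1.0)
```

Both versions produce the same values, but the fused one writes each element to global memory once instead of three times, which is why the fused timing stays close to that of depthwise_conv2d alone.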

This is not the end; TVM can do operator fusion in a smarter way. You may refer to this and read the source code provided below.
