GPU Programming: Using Streams to Overlap Kernel Execution and Data Transfer

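First, a word about deviceQuery. It is a sample program provided by the CUDA SDK, so you have it once the SDK is installed (the samples may need to be built first).

On my Linux machine it lives here: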

/opt/cuda/cuda70/samples/bin/x86_64/linux/release

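The path differs from machine to machine, of course, but it is usually somewhere under the samples folder.

What I really want to stress: read the deviceQuery output carefully. I spent a whole day fighting with streams, only to find out that I had been flailing around because I did not know which hardware features my GPU supports. Blood, sweat, and tears.

The output for my device: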

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 750 Ti"
  CUDA Driver Version / Runtime Version          7.5 / 7.0
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2047 MBytes (2146762752 bytes)
  ( 5) Multiprocessors, (128) CUDA Cores/MP:     640 CUDA Cores
  GPU Max Clock rate:                            1084 MHz (1.08 GHz)
  Memory Clock rate:                             2700 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 23 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GeForce GTX 750 Ti
Result = PASS

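Pay special attention to the line "Concurrent copy and kernel execution". It says whether the device can run a copy and a kernel at the same time. Some devices have supported this since compute capability 1.1, so a compute capability 5.0 device certainly does. The catch is "Yes with 1 copy engine(s)": with a single copy engine, HtoD and DtoH transfers cannot run in parallel with each other! I was stuck on exactly this, kept suspecting a bug in my own code, and started doubting my life choices. Then I read an article, copied its code, ran it, and got results completely different from the author's. Later I found another article that mentioned "Yes with 2 copy engine(s)", and that is how I finally discovered deviceQuery, a superb tool for inspecting device parameters.

In short, because my device reports "Yes with 1 copy engine(s)", it cannot overlap HtoD with DtoH, but it can overlap a copy with a kernel. Some consolation, at least.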
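The same information can be read from a program rather than from the deviceQuery printout. The following is a minimal sketch, assuming the CUDA runtime API and a build with nvcc; it prints the `asyncEngineCount` field of `cudaDeviceProp`, which is what the "copy engine(s)" figure corresponds to.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        std::printf("No usable CUDA device (%s)\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;

        std::printf("Device %d: %s (compute capability %d.%d)\n",
                    dev, prop.name, prop.major, prop.minor);

        // asyncEngineCount:
        //   0  -> copies cannot overlap kernel execution
        //   1  -> one copy direction can overlap a kernel
        //   2+ -> HtoD and DtoH can also overlap each other
        std::printf("  copy engines:       %d\n", prop.asyncEngineCount);
        std::printf("  concurrent kernels: %s\n",
                    prop.concurrentKernels ? "yes" : "no");
    }
    return 0;
}
```

On a card like the one above, this would be expected to report one copy engine, matching the deviceQuery line.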

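Conditions for overlapping a copy with a kernel:

1. "Concurrent copy and kernel execution" must be Yes. Whether you get 1 or 2 copy engines depends on the device. Expensive cards are expensive for a reason: they have 2!

2. The kernel and the copy must be in different streams, neither of them the default stream, and the copy must be issued with cudaMemcpyAsync.

3. The host data must be in pinned (page-locked) memory. Note that this is about the host buffer, not the device buffer. I got this wrong myself at first; device memory is allocated the same way as always.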
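The three conditions above can be put together in a short sketch. This is an illustration under assumptions, not the code I originally ran: the kernel `scale`, the stream count `kStreams`, and the chunk size are made up for the example. The host buffer is pinned with `cudaMallocHost`, every copy uses `cudaMemcpyAsync`, and each chunk gets its own non-default stream so that a copy for one chunk can overlap the kernel of another.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical error-check helper for this example only.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(e_));                     \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Illustrative kernel: multiplies every element by a factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kStreams = 4;            // assumed number of chunks/streams
    const int chunk = 1 << 20;         // elements per chunk
    const int n = kStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h = nullptr, *d = nullptr;
    CHECK(cudaMallocHost(&h, n * sizeof(float)));  // condition 3: pinned host memory
    CHECK(cudaMalloc(&d, n * sizeof(float)));      // device memory: nothing special
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Condition 2: each chunk's copy-in, kernel, and copy-out go into its own stream.
    for (int s = 0; s < kStreams; ++s) {
        const int off = s * chunk;
        CHECK(cudaMemcpyAsync(d + off, h + off, chunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk, 2.0f);
        CHECK(cudaMemcpyAsync(h + off, d + off, chunkBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    std::printf("h[0] = %.1f, h[n-1] = %.1f\n", h[0], h[n - 1]);

    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return 0;
}
```

Whether the copies and kernels actually overlap is best checked on a timeline in a profiler such as nvvp. With one copy engine, the HtoD and DtoH transfers should still appear one after another, while kernels can run alongside them.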

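Streams themselves predate Kepler, but Kepler's Hyper-Q is what lets them run truly concurrently: it provides up to 32 hardware work queues for holding kernels, which avoids false dependencies between kernels from different streams. Which kernel gets dispatched from the work queues is decided by the Grid Management Unit, since each kernel launch is one grid.

Overlapping copy and kernel is a nice bonus that streams bring, and it also needs hardware support.

Overlapping HtoD with DtoH needs hardware support too (a second copy engine)!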

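With CUDA 7.0, pthreads can map directly onto different streams, which is great: each host thread gets its own default stream, so kernels launched from the same pthread land in the same stream and kernels from different pthreads land in different ones. One caveat: this is not on by default. It has to be enabled by compiling with nvcc --default-stream per-thread, or by defining CUDA_API_PER_THREAD_DEFAULT_STREAM before including the CUDA headers. See the link.

Side note: nvvp is handy, but a bit slow!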
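Here is a small sketch of the per-thread default stream with pthreads. It assumes the file is built with `nvcc --default-stream per-thread` (the file name in the comment is made up); without that flag all four threads would share the single legacy default stream. The kernel `busy` and the thread count are illustrative.

```cuda
// Assumed build line: nvcc --default-stream per-thread per_thread.cu -o per_thread -lpthread
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

// Illustrative kernel that just does some arithmetic per element.
__global__ void busy(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = sqrtf(static_cast<float>(i));
}

void *worker(void *) {
    const int n = 1 << 20;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return nullptr;

    // No stream argument: with --default-stream per-thread this launch goes
    // into the calling host thread's own default stream.
    busy<<<(n + 255) / 256, 256>>>(d, n);

    // Stream 0 here means "this thread's default stream" under the same flag.
    cudaStreamSynchronize(0);
    cudaFree(d);
    return nullptr;
}

int main() {
    const int kThreads = 4;  // assumed thread count
    pthread_t threads[kThreads];

    for (int t = 0; t < kThreads; ++t) {
        if (pthread_create(&threads[t], nullptr, worker, nullptr) != 0) {
            std::fprintf(stderr, "pthread_create failed\n");
            return 1;
        }
    }
    for (int t = 0; t < kThreads; ++t) pthread_join(threads[t], nullptr);

    std::printf("all threads done\n");
    return 0;
}
```

In nvvp, the kernels from the four threads should then show up in four separate streams instead of being serialized in one.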

