如何优化cuda c/c++中的数据传输

原文地址（host端就是cpu，device端就是gpu）

https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/

　　在本系列的前三篇文章中，我们为该系列的主旨打下基础：如何优化cuda c代码。在这一篇和下一篇文章中，我们开始讨论代码优化以及如何在主机和设备之间有效的传输数据。device内存和device之间的峰值带宽（例如NVIDIA Tesla C2050上的峰值带宽为144GB/s ）远高于host内存和device内存之间的峰值带宽（PCIe x16 Gen2上的8GB/s）。这种差异意味着你在host和device之间实现数据传输可能会影响你的整体应用程序性能。让我们从host-device数据传输的一些通用指南开始。

尽可能减少host和device之间传输的数据量，尽管这意味着在gpu的内核上运行没比cpu上快多少
使用page-locked（页面锁定？）内存或者pinned内存时，host和device之间可能有更高的带宽
将许多小传输批处理为一个更大的传输，性能会好的多，因为它消除了每次传输的大部分开销
host和device之间的数据传输又是会与核函数运行或其他数据传输重叠（同步进行？）

　　我们将在这篇文章中研究前三个准则，在下一篇文章中专门研究重叠数据传输。首先我想谈谈如何在不修改源代码的情况下测量数据传输所花费的时间。

使用nvprof测量数据传输时间

　　为了测量每次数据传输所花费的时间，我们可以在每次传输之前和之后记录一个cuda event，并使用cudaEventElapsedTime()，像我们上一篇文章描述的那样。但是我们可以通过使用nvprof（一个包含在cuda工具包（从cuda5开始）中的命令行cuda分析器）来获取经过的传输时间，而无需使用cuda event检测源代码。让我们用下面的示例代码来试一下，你可以在这篇文章的github存储库中找到它。（https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/optimize-data-transfers/profile.cu）

#include <stdio.h>

int main(void)
{
    const unsigned int N=1048576;
    const unsigned int bytes=N*sizeof(int);
    int *h_a=(int*)malloc(bytes);
    int *d_a;
    cudaMalloc((int**)&d_a,bytes)
    
    memset(h_a,0,bytes);
    cudaMemcpy(d_a,h_a,bytes,cudaMemcpyHostToDevice);
    cudaMemcpy(h_a,d_a,bytes,cudaMemcpyDeviceToHost);
    
    return 0;
}

　　要分析此代码，我们只需用nvcc编译它，然后以程序文件名作为参数运行nvprof（文件名profile.cu）

$ nvcc profile.cu -o profile_test
$ nvprof ./profile_test

　　当我在装有GeForce GTX 680（GK104 gpu，类似于Tesla K10）的台式机上运行时，我得到了以下输出。

$ nvprof ./a.out 
======== NVPROF is profiling a.out...
======== Command: a.out
======== Profiling result:
Time(%)     Time  Calls      Avg      Min      Max Name
  50.08 718.11us      1 718.11us 718.11us 718.11us [CUDA memcpy DtoH]
  49.92 715.94us      1 715.94us 715.94us 715.94us [CUDA memcpy HtoD]

　　nvprof测量每个cuda memcpy调用说花费的时间。它报告每次调用的平均，最短和最长时间（因为我们只运行每个副本一次，所以所有时间都相同）。nvprof非常灵活，因此请务必查看文档https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview

最小化数据传输

　　我们不应该只凭内核的gpu执行时间相对于其cpu执行时间来决定是运行gpu还是cpu版本。我们还需要考虑在PCI-e总线上移动的数据成本，尤其是我们最初将代码移植到cuda时。由于cuda的异构编程模型同时使用cpu和gpu，因此可以一次将代码移植到cuda一个kernel中，在移植的初始阶段，数据传输会占用程序全部的执行时间。奖花在数据传输上的时间和花在kernel上的执行时间分开是有必要的。正如我们演示的那样，使用分析工具不是一件难事。随着我们移植更多的代码，我们将删除中间传输并相应地减少整体执行时间。

固定host端内存（pinned host memory）

　　默认情况下，host（cpu）数据分配是可分页的。gpu无法直接从可分页host内存访问数据，所以当调用从可分页host内存到device内存的数据传输时，cuda驱动程序必须先分配一个临时的page-lock或者pinned的host array，将host端数据拷贝到pinned array中，然后将pinned array传输到device memory中，如下图所示。

　　如图所示，pinned memory用作从host端传到device端的暂存区。我们可以通过直接在pinned memory中分配host array来直接避免分页内存和pinned memory之间的传输成本。使用cudaMallocHost()或cudaHostAlloc()在cuda c中分配pinned memory，使用cudaFreeHost()释放。pinned memory分配可能会失败，因此应该在全程检查是否有错误。下面代码演示了带错误检查的pinned memory分配。

cudaError_t status = cudaMallocHost((void**)&h_aPinned, bytes);
if (status != cudaSuccess)
  printf("Error allocating pinned host memory\n");

　　使用pinned memory的数据传输与使用可分也内存的传世使用相同的cudaMemcpy()语法。我们可以用以下的bandwidthtest程序（也可以在github上获得https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/optimize-data-transfers/bandwidthtest.cu）来比较分页内存和pinned memory传输速率

#include <stdio.h>
#include <assert.h>

// Convenience function for checking CUDA runtime API results
// can be wrapped around any runtime API call. No-op in release builds.
inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG)||defined(_DEBUG)
    if(result!=cudaSuccess)
    {
        fprintf(stderr,"CUDA Runtime Error: %s\n",cudaGetErrorString(result));
        assert(result==cudaSuccess);
    }
#endif
    return result;
}

void profileCopies(float *h_a,float *h_b,float *d,unsigned int n,char *desc)
{
    printf("\n%s transfers \n",desc);
    unsigned int bytes=n*sizeof(float);

    // event for timing
    cudaEvent_t startEvent,stopEvent;
    checkCuda(cudaEventCreate(&startEvent));
    checkCuda(cudaEventCreate(&stopEvent));

    checkCuda(cudaEventRecord(startEvent,0));
    checkCuda(cudaMemcpy(d,h_a,bytes,cudaMemcpyHostToDevice));
    checkCuda(cudaEventRecord(stopEvent,0));
    checkCuda(cudaEventSynchronize(stopEvent));

    float time;
    checkCuda(cudaEventElapsedTime(&time,startEvent,stopEvent));
    printf("Host to Device bandwidth(GB/s):%f\n",bytes*1e-6/time);

    checkCuda(cudaEventRecord(startEvent,0));
    checkCuda(cudaMemcpy(h_b,d,bytes,cudaMemcpyDeviceToHost));
    checkCuda(cudaEventRecord(stopEvent,0));
    checkCuda(cudaEventSynchronize(stopEvent));

    checkCuda(cudaEventElapsedTime(&time,startEvent,stopEvent));
    printf("Device to Host bandwidth(GB/s): %f\n",bytes*1e-6/time);

    for(int i=0;i<n;++i)
    {
        if(h_a[i]!=h_b[i])
        {
            printf("*** %s transfers failed ***\n",desc);
            break;
        }
    }
    checkCuda(cudaEventDestroy(startEvent));
    checkCuda(cudaEventDestroy(stopEvent));
}

int main(void)
{
    unsigned int nElements=4*1024*1024;
    const unsigned int bytes=nElements*sizeof(float);

    // host array
    float *h_aPageable,*h_bPageable;
    float *h_aPinned,*h_bPinned;

    // device array
    float *d_a;

    // alloc and initalize
    h_aPageable=(float*)malloc(bytes);                      // host pageable
    h_bPageable=(float*)malloc(bytes);                      // host pageable
    checkCuda(cudaMallocHost((void**)&h_aPinned,bytes));    // host pinned
    checkCuda(cudaMallocHost((void**)&h_bPinned,bytes));    // host pinned
    checkCuda(cudaMalloc((void**)&d_a,bytes));              // device

    for(int i=0;i<nElements;++i)
    {
        h_aPageable[i]=i;
    }
    memcpy(h_aPinned,h_aPageable,bytes);
    memset(h_bPageable,0,bytes);
    memset(h_bPinned,0,bytes);

    // output device info and transfer size
    cudaDeviceProp prop;
    checkCuda(cudaGetDeviceProperties(&prop,0));
    printf("\nDevice: %s\n",prop.name);
    printf("Transfer size (MB): %d\n",bytes/(1024*1024));

    // perform copied and report bandwidth
    profileCopies(h_aPageable,h_bPageable,d_a,nElements,(char*)"pageable");
    profileCopies(h_aPinned,h_bPinned,d_a,nElements,(char*)"pinned");

    printf("\n");

    // clean up
    cudaFree(d_a);
    cudaFreeHost(h_aPinned);
    cudaFreeHost(h_bPinned);
    free(h_aPageable);
    free(h_bPageable);

    return 0;
}

　　数据传输率可能取决于主机系统（主板，cpu和芯片组）以及gpu类型，在配备 Intel Core i7-2620M CPU（2.7GHz，2 个 Sandy Bridge 内核，4MB L3 缓存）和 NVIDIA NVS 4200M GPU（1 Fermi SM，Compute Capability 2.1，PCI-e Gen2 x16）的笔记本电脑上，运行 BandwidthTest以下结果。很明显pinned传输速度是pageable传输的两倍多

Device: NVS 4200M
Transfer size (MB): 16

Pageable transfers
  Host to Device bandwidth (GB/s): 2.308439
  Device to Host bandwidth (GB/s): 2.316220

Pinned transfers
  Host to Device bandwidth (GB/s): 5.774224
  Device to Host bandwidth (GB/s): 5.958834

　　在我的台式 PC 上，使用更快的 Intel Core i7-3930K CPU（3.2 GHz，6 个 Sandy Bridge 内核，12MB L3 缓存）和 NVIDIA GeForce GTX 680 GPU（8 个 Kepler SM，计算能力 3.0），我们看到可pageable传输速度更快，如以下输出所示。这大概是因为更快的 CPU（和芯片组）降低了主机端内存复制成本。

Device: GeForce GTX 680
Transfer size (MB): 16

Pageable transfers
  Host to Device bandwidth (GB/s): 5.368503
  Device to Host bandwidth (GB/s): 5.627219

Pinned transfers
  Host to Device bandwidth (GB/s): 6.186581
  Device to Host bandwidth (GB/s): 6.670246

　　但是不应该过度分配固定内存，这样做会降低整体的系统性能，因为它会减少操作系统和其他程序可用的物理内存量。具体应该多少是很难提前判断的，因此与所有优化一样，测试你的应用程序以及运行的系统，以获得最佳性能参数。

小规模传输批量化

　　将多次小规模传输批处理成单个传输要更好，可以省去多次传输的重复开销。这很容易通过使用临时数组（最好是固定）并将其与要传输的数据一起打包来实现。

　　对于二维数组传输，可以使用cudaMemcpy2D()

cudaMemcpy2D(dest, dest_pitch, src, src_pitch, w, h, cudaMemcpyHostToDevice)

　　这里的第一个参数是接收端数组的指针，第二个参数是接收端数组的间距，第三个是发送端数组的指针，地四个参数是发送端数组的间距，wh对应传输数组的宽高，最后一个参数表示拷贝方向。还有cudaMemcpy3D()函数，用于传输三维数组。（数组间距pitch怎么用？）

总结

　　host和device之间的传输是 GPU 计算中数据移动最慢的环节，因此应该注意尽量减少传输。遵循本文中的指南可以帮助确保必要的转移是有效的。当你移植或编写新的 CUDA C/C++ 代码时，我建议你从现有主机指针的可分页传输开始。正如我之前提到的，随着编写更多的设备代码，你将消除一些中间传输，因此在移植早期优化传输所花费的任何努力都可能被浪费。此外，我建议你使用nvprof、命令行CUDA剖析器或一种可视化剖析工具，如NVIDIA visual剖析器(CUDA工具包中也包含)，而不是使用CUDA事件或其他计时器来测量每次传输所花费的时间。

　　这篇文章的重点是提高数据传输效率。在下一篇文章中，我们将讨论如何将数据传输与计算和其他数据传输重叠。

无情的摸鱼机器