Understanding the CUDA context

Parts of this post are adapted from:

https://chenrudan.github.io/blog/2015/07/22/cudastream.html

http://stackoverflow.com/questions/10415204/how-to-create-a-cuda-context

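Early on I noticed that the first CUDA function call in a program always takes a long time. The usual explanation is CUDA initialization.

So what does CUDA initialization actually do?

One of its jobs is to create the CUDA context: by the time most CUDA functions are called, a context must already exist.

The CUDA context is very important. It acts as a container that manages the lifetime of every object listed below, and most CUDA function calls require one. The objects are: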

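All allocated memory
Modules, which are similar to dynamic link libraries and are loaded from .cubin and .ptx files [needed when using JCuda]
CUDA streams, which manage concurrent execution
CUDA events
Texture and surface references
Local memory used inside kernels (device memory)
Internal resources for debugging, profiling and synchronization
Pinned staging buffers used for pageable memory copies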

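So a context must exist before any of these objects can be created. How is a context created?

There are two ways: implicit creation and explicit creation (the wording may not be strictly accurate, but that is the idea).

Implicit creation

The CUDA runtime, the library at the software layer, creates the context implicitly.

Since CUDA 4.0, the context created by the runtime is shared by all threads: within a process there is one context per device, and every thread can use it.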

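The CUDA runtime does not provide an API that creates a CUDA context directly. Instead it creates the context through deferred initialization, also called lazy initialization: each CUDART library function checks whether a context currently exists and, if one is needed, creates it automatically. In other words, the context is created the first time one of the objects above has to be created. Initialization can also be forced explicitly by calling cudaFree(0).

The CUDA runtime merges the notions of context and device, so working on one GPU can be regarded as working in one context. This is why the runtime offers functions such as cudaDeviceSynchronize() rather than a counterpart named after the driver API's cuCtxSynchronize(). Applications can still reach the current-context stack through the driver API; all context-related operations are implemented there as cuCtxXXXX() functions.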
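
The lazy creation described above can be observed by asking the driver API for the current context before and after the first runtime call. This is a minimal sketch, assuming a machine with at least one CUDA device and a build that links the driver library; the file name and build line are only illustrative.

```cuda
// Build (assumed command line): nvcc lazy_ctx.cu -o lazy_ctx -lcuda
#include <cstdio>
#include <cuda.h>          // driver API, used here only to inspect the current context
#include <cuda_runtime.h>  // runtime API

int main() {
    // The driver API has to be initialized before cuCtxGetCurrent may be called.
    if (cuInit(0) != CUDA_SUCCESS) {
        std::fprintf(stderr, "cuInit failed\n");
        return 1;
    }

    // No runtime function has run yet, so no context should be bound to this thread.
    CUcontext ctx = nullptr;
    cuCtxGetCurrent(&ctx);
    std::printf("before any runtime call: context is %s\n", ctx ? "set" : "null");

    // cudaFree(0) frees nothing, but it needs a context, so the runtime creates one.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFree(0) failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The context created by the runtime is now current on this thread.
    cuCtxGetCurrent(&ctx);
    std::printf("after cudaFree(0):       context is %s\n", ctx ? "set" : "null");
    return 0;
}
```

The first line should report a null context and the second a non-null one; the second query returns the context that the runtime created on demand.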

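Explicit creation

The CUDA driver API, the library at the driver layer, requires the context to be created explicitly.

A context created through the driver API is made current on the thread that creates it. One device can have several contexts, a context can be used from several threads, and a context can be handed from one thread to another.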

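In the driver API, every CPU thread must either create a context or take one over from another CPU thread. Without a context, calls fail. When exactly does that happen? If you call driver API functions to create any of the objects listed above before a context has been created, the call returns an error (typically CUDA_ERROR_INVALID_CONTEXT). Because those objects can be created through both the runtime and the driver API, this is easy to trip over, so take care.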

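Each CPU thread has a stack of current contexts. Creating a new context pushes it onto the stack, and the context on top of the stack is the thread's current context, so a thread has only one usable current context at a time. Popping a context with cuCtxPopCurrent detaches it from the thread; this "floating" context can then be attached to another CPU thread with cuCtxPushCurrent.

When a context is destroyed, all resources allocated in it are destroyed as well. Resources allocated in one context cannot be used from another context.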
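
The push/pop hand-over between threads can be sketched with the driver API as follows. This is an illustration under stated assumptions, not a reference implementation: it retains the device's primary context with cuDevicePrimaryCtxRetain instead of calling cuCtxCreate, so the context is the same one the runtime would use, and the helper name check and the buffer size are made up for the example.

```cuda
// Build (assumed command line): nvcc ctx_migrate.cu -o ctx_migrate -lcuda -Xcompiler -pthread
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <cuda.h>

// Hypothetical helper: abort with a message if a driver API call fails.
static void check(CUresult r, const char* what) {
    if (r != CUDA_SUCCESS) {
        const char* name = nullptr;
        cuGetErrorName(r, &name);
        std::fprintf(stderr, "%s failed: %s\n", what, name ? name : "unknown error");
        std::exit(1);
    }
}

int main() {
    check(cuInit(0), "cuInit");
    CUdevice dev;
    check(cuDeviceGet(&dev, 0), "cuDeviceGet");

    // No context is current yet, so this allocation is expected to fail.
    CUdeviceptr p = 0;
    CUresult r = cuMemAlloc(&p, 1024);
    const char* name = nullptr;
    cuGetErrorName(r, &name);
    std::printf("cuMemAlloc without a context returned %s\n", name ? name : "?");

    // Retaining the primary context does not make it current on this thread.
    CUcontext ctx = nullptr;
    check(cuDevicePrimaryCtxRetain(&ctx, dev), "cuDevicePrimaryCtxRetain");

    // Push: ctx becomes the current context of the main thread.
    check(cuCtxPushCurrent(ctx), "cuCtxPushCurrent");
    check(cuMemAlloc(&p, 1024), "cuMemAlloc");

    // Pop: ctx is detached from the main thread and can be handed over.
    CUcontext popped = nullptr;
    check(cuCtxPopCurrent(&popped), "cuCtxPopCurrent");

    std::thread worker([popped, p] {
        // The worker attaches the same context and uses memory allocated in it.
        check(cuCtxPushCurrent(popped), "cuCtxPushCurrent (worker)");
        check(cuMemsetD8(p, 0, 1024), "cuMemsetD8");
        check(cuCtxSynchronize(), "cuCtxSynchronize");
        check(cuMemFree(p), "cuMemFree");
        CUcontext tmp = nullptr;
        check(cuCtxPopCurrent(&tmp), "cuCtxPopCurrent (worker)");
    });
    worker.join();

    check(cuDevicePrimaryCtxRelease(dev), "cuDevicePrimaryCtxRelease");
    return 0;
}
```

The worker can touch the buffer only because it pushed the same context in which the buffer was allocated; in a different context the pointer would not be valid.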

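Notes:

1. A context created implicitly is the primary context; a context created explicitly is a standard context.

2. CUDA initialization is slow on first use, and part of that cost is probably the runtime creating the context implicitly. The cost cannot be removed, but it can be moved out of the critical path by calling cudaSetDevice() up front so that the context is created early.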

The canonical way to force runtime API context establishment is to call cudaFree(0). If you have multiple devices, call cudaSetDevice() with the ID of the device you want to establish a context on, then cudaFree(0) to establish the context.

EDIT: Note that as of CUDA 5.0, it appears that the heuristics of context establishment are slightly different and cudaSetDevice() itself establishes context on the device it is called on. So the explicit cudaFree(0) call is no longer necessary (although it won't hurt anything).

Using the runtime API: cudaDeviceSynchronize, cudaDeviceGetLimit, or anything that actually accesses the context should work. I'm quite certain you're not using the driver API, as it doesn't do that sort of lazy initialization, but for others' benefit the driver call would be cuCtxCreate.

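The English passages above can be summarized as follows:

With the runtime API, the context is created implicitly by calls such as cudaSetDevice() or cudaDeviceSynchronize(). Whether cudaSetDevice() alone is enough depends on the CUDA version, so following it with cudaFree(0) is the safe choice.

With the driver API, you must call cuCtxCreate explicitly.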
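
The warm-up recipe quoted above (cudaSetDevice() followed by cudaFree(0) for each device) can be sketched like this. It is a rough illustration, assuming at least one CUDA device; the 1 MiB allocation size and the helper name time_alloc_ms are arbitrary choices for the example, and the timings it prints will vary from machine to machine.

```cuda
// Build (assumed command line): nvcc warmup.cu -o warmup
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: milliseconds taken by one cudaMalloc/cudaFree pair.
static double time_alloc_ms() {
    auto t0 = Clock::now();
    void* p = nullptr;
    cudaMalloc(&p, 1 << 20);
    cudaFree(p);
    auto t1 = Clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        std::fprintf(stderr, "no CUDA device found\n");
        return 1;
    }

    // Pay the context-creation cost for every device up front.
    for (int d = 0; d < n; ++d) {
        auto t0 = Clock::now();
        cudaSetDevice(d);
        cudaFree(0);  // forces context establishment on device d
        auto t1 = Clock::now();
        std::printf("device %d: warm-up took %.1f ms\n", d,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    // Later calls on an already initialized device no longer include that cost.
    cudaSetDevice(0);
    std::printf("allocation after warm-up: %.3f ms\n", time_alloc_ms());
    return 0;
}
```

Running the loop once at program start keeps the initialization delay away from the first call that actually matters.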

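Why I care about this: I call CUDA from multiple threads, and if every thread has to create its own context, that adds a lot of time. If the runtime only has to create a single context, a lot of time can be saved.

One open question remains: if multiple threads share one context, are there any hidden pitfalls?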
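
As a starting point for experimenting with that question, here is a small sketch in which several host threads share the runtime's single context on device 0. It assumes one CUDA device; the kernel fill, the function worker and the sizes are invented for the example. Each thread uses its own stream and writes only to its own slice of the buffer, which avoids the most obvious hazard of sharing, namely two threads touching the same memory without synchronization.

```cuda
// Build (assumed command line): nvcc shared_ctx.cu -o shared_ctx -Xcompiler -pthread
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Writes `value` into every element of `data`.
__global__ void fill(int* data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Each worker gets its own stream and its own slice of a buffer allocated by main.
static void worker(int id, int* slice, int n) {
    cudaSetDevice(0);  // every thread ends up in device 0's shared context
    cudaStream_t s;
    cudaStreamCreate(&s);
    fill<<<(n + 255) / 256, 256, 0, s>>>(slice, n, id);
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
}

int main() {
    const int kThreads = 4, kPerThread = 1024;

    // Create the context once, before any worker starts.
    cudaSetDevice(0);
    cudaFree(0);

    int* buf = nullptr;
    if (cudaMalloc(&buf, kThreads * kPerThread * sizeof(int)) != cudaSuccess) return 1;

    std::vector<std::thread> pool;
    for (int t = 0; t < kThreads; ++t)
        pool.emplace_back(worker, t, buf + t * kPerThread, kPerThread);
    for (auto& th : pool) th.join();

    // Copy everything back and check that each slice holds its worker's id.
    std::vector<int> host(kThreads * kPerThread);
    cudaMemcpy(host.data(), buf, host.size() * sizeof(int), cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int t = 0; t < kThreads; ++t)
        for (int i = 0; i < kPerThread; ++i)
            ok = ok && (host[t * kPerThread + i] == t);
    std::printf("all slices written correctly: %s\n", ok ? "yes" : "no");

    cudaFree(buf);
    return ok ? 0 : 1;
}
```

The pointer allocated in main is valid in every worker because all threads operate in the same context; only the context-creation cost in main is paid once.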