Using deviceQuery and nvidia-smi with a GPU

1. deviceQuery is very important: its output guides the block and grid configuration and the use of the memory hierarchy that you have to decide on when programming.

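deviceQuery is actually one of the CUDA samples and has to be compiled before it can be used. The samples are under /opt/cuda/cuda70/NVIDIA_CUDA-7.0_Samples, or possibly in the local CUDA folder (I am not sure about the latter).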

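Because that directory is read-only, the sample has to be copied into your home directory. The build also uses files from NVIDIA_CUDA-7.0_Samples/common, so copy the whole NVIDIA_CUDA-7.0_Samples directory.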

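Run make (in the sample's own directory, 1_Utilities/deviceQuery, or at the top level to build every sample) and you get the deviceQuery executable.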

I recommend that the first step of any GPU programming work be to compile and run deviceQuery.

One field in the output was unclear to me, compute mode (I found the explanation in the nvidia-smi manual):

Compute mode indicates whether multiple programs are allowed to use the GPU at the same time.

Compute Mode: The compute mode flag indicates whether individual or multiple compute applications may run on the GPU.

  "Default" means multiple contexts are allowed per device.

  "Exclusive Thread" means only one context is allowed per device, usable from one thread at a time.

  "Exclusive Process" means only one context is allowed per device, usable from multiple threads at a time.

  "Prohibited" means no contexts are allowed per device (no compute apps).

  "EXCLUSIVE_PROCESS" was added in CUDA 4.0. Prior CUDA releases supported only one exclusive mode, which is equivalent to "EXCLUSIVE_THREAD" in CUDA 4.0 and beyond.

For all CUDA-capable products.
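The four modes quoted above can be turned into a small lookup, which is roughly what a program would do with the computeMode field it gets back from the runtime. This is a minimal host-only sketch, not the deviceQuery source: the function name describeComputeMode is hypothetical, and the numeric values (0 to 3) are an assumption based on my reading of the CUDA runtime's cudaComputeMode enum, so check them against driver_types.h in your toolkit.

```cpp
#include <cassert>
#include <string>

// Assumed numeric values of the CUDA runtime's cudaComputeMode enum
// (verify against driver_types.h): 0 = Default, 1 = Exclusive Thread,
// 2 = Prohibited, 3 = Exclusive Process.
std::string describeComputeMode(int mode) {
    static const char* const kText[] = {
        "Default: multiple contexts are allowed per device",
        "Exclusive Thread: one context per device, usable from one thread at a time",
        "Prohibited: no contexts are allowed per device",
        "Exclusive Process: one context per device, usable from multiple threads",
    };
    const int known = static_cast<int>(sizeof(kText) / sizeof(kText[0]));
    assert(known == 4);  // one entry per mode listed in the manual excerpt
    if (mode < 0 || mode >= known) {
        return "Unknown compute mode";
    }
    return kText[mode];
}
```

In a CUDA program the argument would come from devProp.computeMode after a call to cudaGetDeviceProperties(); anything outside the four known values is reported as unknown rather than guessed.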

2. nvidia-smi has a manual: http://developer.download.nvidia.com/compute/cuda/6_0/rel/gdk/nvidia-smi.331.38.pdf

nvidia-smi: NVIDIA System Management Interface. It is a command-line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.

GPU configuration options (such as ECC memory capability) may be enabled and disabled.

The nvidia-smi command is installed together with the driver, so it is available on any machine where the driver is installed.

nvidia-smi -i 0 -q displays all the information for a GPU (-i selects the GPU by index).

nvidia-smi -h prints the help text.

3. Using cudaGetDeviceProperties()

deviceQuery actually calls cudaGetDeviceProperties() and prints the returned information field by field.

For example, here is a program and its output:

#include <cstdio>
#include <cuda_runtime.h>

// Appends selected fields of a cudaDeviceProp to DeviceProperties.txt.
void PrintDeviceProperties(const cudaDeviceProp &devProp)
{
    FILE *deviceProperties = fopen("DeviceProperties.txt", "a+");
    if (deviceProperties == NULL) {
        perror("DeviceProperties.txt");
        return;
    }
    fprintf(deviceProperties, "Major revision number: %d\n", devProp.major);
    fprintf(deviceProperties, "Minor revision number: %d\n", devProp.minor);
    fprintf(deviceProperties, "Name: %s\n", devProp.name);
    // The memory sizes are size_t, so they need %zu; %u truncates them
    // on 64-bit hosts.
    fprintf(deviceProperties, "Total global memory: %zu\n", devProp.totalGlobalMem);
    fprintf(deviceProperties, "Total shared memory per block: %zu\n", devProp.sharedMemPerBlock);
    fprintf(deviceProperties, "Total registers per block: %d\n", devProp.regsPerBlock);
    fprintf(deviceProperties, "Warp size: %d\n", devProp.warpSize);
    fprintf(deviceProperties, "Maximum memory pitch: %zu\n", devProp.memPitch);
    fprintf(deviceProperties, "Maximum threads per block: %d\n", devProp.maxThreadsPerBlock);
    for (int i = 0; i < 3; ++i)
        fprintf(deviceProperties, "Maximum dimension %d of block: %d\n", i, devProp.maxThreadsDim[i]);
    for (int i = 0; i < 3; ++i)
        fprintf(deviceProperties, "Maximum dimension %d of grid: %d\n", i, devProp.maxGridSize[i]);
    // clockRate is reported in kHz.
    fprintf(deviceProperties, "Clock rate: %d\n", devProp.clockRate);
    fprintf(deviceProperties, "Total constant memory: %zu\n", devProp.totalConstMem);
    fprintf(deviceProperties, "Texture alignment: %zu\n", devProp.textureAlignment);
    fprintf(deviceProperties, "Concurrent copy and execution: %s\n",
            (devProp.deviceOverlap ? "Yes" : "No"));
    fprintf(deviceProperties, "Number of multiprocessors: %d\n", devProp.multiProcessorCount);
    fprintf(deviceProperties, "Kernel execution timeout: %s\n",
            (devProp.kernelExecTimeoutEnabled ? "Yes" : "No"));
    fclose(deviceProperties);
}

// Queries every visible device and appends its properties to the file.
int main()
{
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp devProp;
        if (cudaGetDeviceProperties(&devProp, dev) == cudaSuccess)
            PrintDeviceProperties(devProp);
    }
    return 0;
}
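The result is as follows. Note that the "Total global memory" figure appears to be truncated modulo 2^32, which is what happens when a 64-bit size_t is printed with a 32-bit format such as %u. A Tesla C2075 has about 6 GB of memory, so the true value is probably 1341849600 + 4294967296 = 5636816896 bytes.

Major revision number: 2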
Minor revision number: 0
Name: Tesla C2075
Total global memory: 1341849600
Total shared memory per block: 49152
Total registers per block: 32768
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1147000
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 14
Kernel execution timeout: No
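
The limits in this output are what you need to pick a block and grid size. The following is a minimal host-only sketch of that arithmetic for a one-dimensional launch; DeviceLimits, LaunchConfig and chooseLaunchConfig are hypothetical names introduced here for illustration, and the limit values in the usage note are simply copied from the Tesla C2075 output above.

```cpp
#include <cassert>

// Device limits, as reported by deviceQuery or cudaGetDeviceProperties().
struct DeviceLimits {
    int maxThreadsPerBlock;  // "Maximum threads per block"
    int maxGridDimX;         // "Maximum dimension 0 of grid"
    int warpSize;            // "Warp size"
};

struct LaunchConfig {
    int threadsPerBlock;
    int blocks;
    bool fits;  // false if one thread per element exceeds the grid limit
};

// Chooses a 1D launch configuration for n elements, one thread per element.
LaunchConfig chooseLaunchConfig(long long n, int requestedThreads,
                                const DeviceLimits &lim)
{
    assert(n > 0 && requestedThreads > 0);

    // Clamp the block size to the device limit.
    int threads = requestedThreads;
    if (threads > lim.maxThreadsPerBlock) threads = lim.maxThreadsPerBlock;

    // Round down to a whole number of warps, but keep at least one warp.
    threads = (threads / lim.warpSize) * lim.warpSize;
    if (threads == 0) threads = lim.warpSize;

    // Ceiling division: enough blocks to cover all n elements.
    long long blocks = (n + threads - 1) / threads;

    LaunchConfig cfg;
    cfg.threadsPerBlock = threads;
    cfg.fits = (blocks <= lim.maxGridDimX);
    cfg.blocks = cfg.fits ? static_cast<int>(blocks) : lim.maxGridDimX;
    return cfg;
}
```

With the limits above ({1024, 65535, 32}), one million elements at 256 threads per block need 3907 blocks, which fits. One hundred million elements would need 390625 blocks, more than the 65535 allowed in dimension 0, so fits comes back false and the kernel would have to loop over several elements per thread or use a 2D grid.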
