thrust

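1. Using the Thrust library in CUDA

Official Thrust documentation: https://docs.nvidia.com/cuda/thrust/

Thrust is an STL-like C++ template library for CUDA.

Thrust is a C++ library of parallel algorithms and data structures built on CUDA. It takes care of low-level details such as memory allocation and memory access, so engineers writing GPU code can focus on algorithm design. Its main feature is high-performance heterogeneous parallel implementations of common data structures and algorithms, such as sort, reduce, and scan.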

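If you have worked with CUDA you have probably met reduction algorithms: reductions for sums, maxima and minima, variance, standard deviation, and so on.

To keep such an algorithm fast, we often spend a lot of time tuning the reduction kernel, dealing with issues such as how work is distributed across threads, thread divergence, and bank conflicts.

With small inputs it is still possible to calmly analyze every spot that might leave room for optimization, but with large inputs it often feels as though the tuning is never quite enough. Don't panic: this is where CUDA shows its strength, because the Thrust library provides ready-made implementations of these primitives.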
(1) vector

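Before covering the functions, a note on the vector types Thrust provides. Thrust defines two vectors, one for the host and one for the device, in host_vector.h and device_vector.h respectively. Declaring them is simple:

thrust::host_vector<float> hvec;   // float stands in for any element type
thrust::device_vector<float> dvec;
dvec = hvec; // a host_vector can be assigned directly to a device_vector; this plays the role of cudaMemcpy

Thrust also defines the pointer type device_ptr. When a pointer passed to a Thrust function points to device memory, it must be wrapped in a device_ptr: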

float array[6] = { 3, 1, 2, 3, 5, 4 };
float *dev_array = 0;
cudaMalloc(&dev_array, 6 * sizeof(float));
cudaMemcpy(dev_array, array, 6 * sizeof(float), cudaMemcpyHostToDevice);
thrust::device_ptr<float> dev_ptr(dev_array);
thrust::reduce(dev_ptr, dev_ptr + 6); // dev_array points to device memory, so it cannot be passed directly and must be wrapped first

thrust::host_vector<float> hvec;
thrust::device_vector<float> dvec;
dvec = hvec;
thrust::reduce(dvec.begin(), dvec.end()); // these arguments are already device iterators, so they need not (and must not) be wrapped in device_ptr

// Both call styles also have host versions, where the pointers or iterators refer to host data
thrust::reduce(array, array + 6);
thrust::reduce(hvec.begin(), hvec.end());

// To get the "raw" pointer back out of a device_ptr, use raw_pointer_cast
float *raw_ptr = thrust::raw_pointer_cast(dev_ptr);

Simple algorithms such as for_each, transform, and copy:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/replace.h>
#include <thrust/functional.h>
#include <iostream>

int main(void)
{
    thrust::device_vector<int> X(10000);
    thrust::device_vector<int> Y(10000);
    thrust::device_vector<int> Z(10000);

    // initialize X to 0, 1, 2, 3, ...
    thrust::sequence(X.begin(), X.end());
    // compute Y = -X
    thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());
    // fill Z with twos
    thrust::fill(Z.begin(), Z.end(), 2);
    // compute Y = X mod 2 (overwrites the previous contents of Y)
    thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(), thrust::modulus<int>());
    // replace every 1 in Y with 10
    thrust::replace(Y.begin(), Y.end(), 1, 10);

    return 0;
}
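To make the effect of each step concrete, here is a minimal host-only sketch of the same pipeline written with the C++ standard library instead of Thrust, so it runs without a GPU. The helper name pipeline_on_host is hypothetical; the assumption is that each std:: call mirrors the Thrust call of the same role (std::iota standing in for thrust::sequence).

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Host-only analogue of the Thrust pipeline above (hypothetical helper, not part of Thrust).
std::vector<int> pipeline_on_host(std::size_t n)
{
    std::vector<int> X(n), Y(n), Z(n);

    std::iota(X.begin(), X.end(), 0);                                   // X = 0, 1, 2, ... (role of thrust::sequence)
    std::transform(X.begin(), X.end(), Y.begin(), std::negate<int>());  // Y = -X
    std::fill(Z.begin(), Z.end(), 2);                                   // Z = 2, 2, 2, ...
    std::transform(X.begin(), X.end(), Z.begin(), Y.begin(),
                   std::modulus<int>());                                // Y = X mod 2, overwriting -X
    std::replace(Y.begin(), Y.end(), 1, 10);                            // every 1 in Y becomes 10

    return Y;
}
```

Calling pipeline_on_host(6) yields 0, 10, 0, 10, 0, 10: the negation is overwritten by the modulus step, and the odd positions are then replaced by 10.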

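Reductions: a reduction algorithm uses a binary operation to reduce an input sequence to a single value. For example, the sum of an array is obtained by reducing it with the plus operation. Similarly, the maximum of an array is obtained by reducing it with an operator that takes two inputs and returns the larger one. The sum reduction can be written with thrust::reduce as follows:

int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());

The first two arguments define the range to reduce; the third and fourth supply the initial value and the reduction operator. In practice both are usually left out, in which case the initial value defaults to 0 and the operator to plus. The following three lines are therefore equivalent:

int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());
int sum = thrust::reduce(D.begin(), D.end(), (int) 0);
int sum = thrust::reduce(D.begin(), D.end());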
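The text mentions that a maximum can also be expressed as a reduction. The sketch below illustrates that idea on the host with std::accumulate; it is not Thrust code, and the helper names are hypothetical. With Thrust the same reduction would typically pass thrust::maximum<int>() as the operator. One difference to keep in mind: std::accumulate folds strictly left to right, whereas a parallel reduce may combine elements in any order, so the operator should be associative and commutative.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Sum as a reduction: initial value 0, operator plus (hypothetical helper).
int sum_by_reduction(const std::vector<int>& v)
{
    return std::accumulate(v.begin(), v.end(), 0, std::plus<int>());
}

// Maximum as a reduction: the operator returns the larger of its two inputs.
// The first element serves as the initial value, so v must not be empty.
int max_by_reduction(const std::vector<int>& v)
{
    assert(!v.empty());
    return std::accumulate(v.begin() + 1, v.end(), v.front(),
                           [](int a, int b) { return std::max(a, b); });
}
```

For the input 3, 1, 2, 3, 5, 4 the sum reduction gives 18 and the maximum reduction gives 5.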

Although thrust::reduce is sufficient for most reductions, Thrust provides a few additional functions for convenience (similar to the STL). For example, thrust::count returns the number of occurrences of a given value in a sequence.

#include <thrust/count.h>
#include <thrust/device_vector.h>
...
// put three 1s in a device_vector
thrust::device_vector<int> vec(5, 0);
vec[1] = 1;
vec[3] = 1;
vec[4] = 1;
// count the 1s
int result = thrust::count(vec.begin(), vec.end(), 1);
// result is three

Other reduction operations include thrust::count_if, thrust::min_element, thrust::max_element, thrust::is_sorted, and thrust::inner_product; see the documentation for details.
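Since these functions are modeled on the STL, their behavior can be previewed on the host with the standard algorithms of the same names. The following is a sketch under that assumption, not Thrust code; the Summary struct and summarize helper are hypothetical names introduced only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical container for the results of several reduction-style queries.
struct Summary {
    long odd_count;  // number of odd elements (count_if)
    int  min_value;  // smallest element (min_element)
    int  max_value;  // largest element (max_element)
    bool sorted;     // true if the range is in non-decreasing order (is_sorted)
    int  dot_self;   // inner product of the range with itself (inner_product)
};

Summary summarize(const std::vector<int>& v)
{
    assert(!v.empty());  // min_element/max_element are dereferenced below
    Summary s;
    s.odd_count = std::count_if(v.begin(), v.end(), [](int x) { return x % 2 != 0; });
    s.min_value = *std::min_element(v.begin(), v.end());
    s.max_value = *std::max_element(v.begin(), v.end());
    s.sorted    = std::is_sorted(v.begin(), v.end());
    s.dot_self  = std::inner_product(v.begin(), v.end(), v.begin(), 0);
    return s;
}
```

As in the STL, min_element and max_element return an iterator to the element rather than the value itself, which is why the sketch dereferences the result.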

The SAXPY example in the Transformations chapter of the Thrust quick start guide shows how kernel fusion reduces memory traffic for transformations. thrust::transform_reduce applies the same fusion to reductions. The following example computes the norm of a vector:

#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <cmath>
#include <iostream>
// square<T> computes the square of a number f(x) -> x*x
template <typename T>
struct square
{
    __host__ __device__
    T operator()(const T& x) const {
        return x * x;
    }
};
int main(void)
{
    // initialize host array
    float x[4] = {1.0, 2.0, 3.0, 4.0};
    // transfer to device
    thrust::device_vector<float> d_x(x, x + 4);
    // setup arguments
    square<float> unary_op;
    thrust::plus<float> binary_op;
    float init = 0;
    // compute norm: square each element, sum the squares, take the square root
    float norm = std::sqrt(thrust::transform_reduce(d_x.begin(), d_x.end(),
                                                    unary_op, init, binary_op));
    std::cout << norm << std::endl;
    return 0;
}
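What "fused" means here can be seen in a small host-only sketch: the squaring and the summation happen in a single pass, with no intermediate array of squares ever stored. This is an illustration with std::accumulate, not the Thrust implementation, and norm_on_host is a hypothetical helper name.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Host-only illustration of a fused transform + reduce (hypothetical helper).
// Each element is squared and added to the running sum in the same pass,
// so no temporary vector of squares is written to memory.
float norm_on_host(const std::vector<float>& x)
{
    float sum_of_squares = std::accumulate(
        x.begin(), x.end(), 0.0f,
        [](float acc, float v) { return acc + v * v; });
    return std::sqrt(sum_of_squares);
}
```

For the input 1, 2, 3, 4 the sum of squares is 30, so the result is the square root of 30, roughly 5.477. The unfused alternative would first write all the squares to a temporary array and then read them back for the sum, which costs an extra round trip through memory.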

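Prefix-Sums:

Parallel prefix sums, also called scan operations, are important building blocks of many parallel algorithms such as stream compaction and radix sort. The following code illustrates an inclusive scan using the default plus operator:

#include <thrust/scan.h>
int data[6] = {1, 0, 2, 2, 1, 3};
thrust::inclusive_scan(data, data + 6, data); // in-place scan
// data is now {1, 1, 3, 5, 6, 9}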

In an inclusive scan, each output element is the corresponding partial sum of the input range. For example, data[2] = data[0] + data[1] + data[2]. An exclusive scan is similar, but shifted one position to the right:

#include <thrust/scan.h>
int data[6] = {1, 0, 2, 2, 1, 3};
thrust::exclusive_scan(data, data + 6, data); // in-place scan
// data is now {0, 1, 1, 3, 5, 6}

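Now data[2] = data[0] + data[1]. As these examples show, inclusive_scan and exclusive_scan may operate in place. Thrust also provides transform_inclusive_scan and transform_exclusive_scan, which apply a unary function to the input sequence before scanning. See the documentation for a complete description of the scan variants.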
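The difference between the two scans is easiest to see when both are written out sequentially. The sketch below is a host-only illustration, not how Thrust computes a scan in parallel; the helper names and the init parameter of the exclusive version are assumptions made for this example.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive scan: out[i] = in[0] + ... + in[i] (hypothetical helper).
std::vector<int> inclusive_scan_host(const std::vector<int>& in)
{
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}

// Exclusive scan: out[i] = init + in[0] + ... + in[i-1], so out[0] == init.
std::vector<int> exclusive_scan_host(const std::vector<int>& in, int init = 0)
{
    std::vector<int> out(in.size());
    int running = init;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // write the sum of everything before position i
        running += in[i];   // then fold in the current element
    }
    return out;
}
```

With the input {1, 0, 2, 2, 1, 3} the two helpers reproduce the results quoted above: {1, 1, 3, 5, 6, 9} for the inclusive scan and {0, 1, 1, 3, 5, 6} for the exclusive scan.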

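Reference: https://www.cxyzjd.com/article/qq_43707919/112251190

2. Using the Thrust library in HIP

Usage is similar to CUDA; on AMD's ROCm platform the same API is provided by rocThrust.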