Distributed Computing Course: Supplementary Notes, Part 3

▶ Task parallelism in OpenMP: explicitly define a set of executable tasks and the dependencies between them; the runtime schedules the tasks dynamically across threads and supports deferred execution

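● Data scoping of variables: a variable that is shared in the enclosing parallel region stays shared in the task; a variable that is private in the parallel region becomes firstprivate in the task; other variables declared inside the task are private by default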
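The scoping rules above can be sketched with a small function. This is only an illustration, not code from the course: the name `task_scoping_demo` and all of its variables are hypothetical. The task relies on the default rules, so no `shared` or `firstprivate` clause is written.

```c
#include <assert.h>

/* Hypothetical helper: reports what the task observed through two out-parameters. */
void task_scoping_demo(int *seen_in_task, int *shared_sum)
{
    int seen = -1;  /* declared before the parallel region: shared */
    int sum = 0;    /* shared as well */

    #pragma omp parallel
    {
        #pragma omp single
        {
            int local = 10;  /* declared inside the parallel region: private */

            /* seen and sum stay shared inside the task; local is firstprivate,
               so the task captures the value 10 at the moment it is created. */
            #pragma omp task
            {
                int scratch = local * 2;  /* declared inside the task: private */
                seen = local;
                sum += scratch;
            }

            local = 99;  /* does not change the copy captured by the task */
            #pragma omp taskwait
        }
    }

    assert(seen == 10);  /* the task saw the snapshot, not 99 */
    *seen_in_task = seen;
    *shared_sum = sum;
}
```

Calling `task_scoping_demo(&seen, &sum)` should leave `seen == 10` and `sum == 20`, because the writes to the shared variables are visible after the parallel region while the later assignment `local = 99` never reaches the task.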

● Example code

#include <stdio.h>
#include <omp.h>

int fib(int n)
{
    int x, y;
    if (n < 2)
        return n;
#pragma omp task shared(x)  // child task that computes x
    x = fib(n - 1);
#pragma omp task shared(y)  // child task that computes y
    y = fib(n - 2);
#pragma omp taskwait        // wait for both child tasks before using x and y
    return x + y;
}

int main(void)
{
    int res, n = 30;
    double start = omp_get_wtime();  // wall-clock time; clock() counts CPU time of all threads on POSIX
#pragma omp parallel        // tasks must be created inside a parallel region
    {
#pragma omp single          // only one thread creates the root task
        res = fib(n);
    }
    printf("Fib[%d] == %d, time = %f ms\n", n, res, (omp_get_wtime() - start) * 1000.0);
    getchar();
    return 0;
}
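Creating one task per call makes the tasks above very small, so scheduling overhead can dominate. A common remedy, sketched here as an assumption rather than as part of the course code, is to stop creating tasks below a threshold and recurse serially. The names `fib_serial`, `fib_task`, `fib_parallel` and the `cutoff` parameter are hypothetical.

```c
#include <assert.h>

/* Plain recursion, used below the cutoff where tasks would be too small. */
static long fib_serial(int n)
{
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

/* Task-based recursion that falls back to fib_serial for n < cutoff. */
static long fib_task(int n, int cutoff)
{
    long x = 0, y = 0;
    if (n < 2)
        return n;
    if (n < cutoff)
        return fib_serial(n);
    #pragma omp task shared(x)
    x = fib_task(n - 1, cutoff);
    #pragma omp task shared(y)
    y = fib_task(n - 2, cutoff);
    #pragma omp taskwait
    return x + y;
}

/* Opens the parallel region and lets a single thread start the recursion. */
long fib_parallel(int n, int cutoff)
{
    assert(n >= 0);
    long res = 0;
    #pragma omp parallel
    {
        #pragma omp single
        res = fib_task(n, cutoff);
    }
    return res;
}
```

For example, `fib_parallel(30, 20)` only creates tasks for the top ten levels of the recursion. A suitable cutoff depends on the machine and has to be found by measurement.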

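▶ Dynamic threads: the runtime chooses the number of threads of a parallel region dynamically (usually off by default; the initial setting is implementation-defined)

● Turn dynamic threads on or off with a library routine: with flag == 0 the thread count is determined by the usual precedence (num_threads clause, then omp_set_num_threads, then OMP_NUM_THREADS); with flag != 0 the runtime may adjust the thread count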

void omp_set_dynamic(int flag)

● Turn dynamic threads on or off with an environment variable

export OMP_DYNAMIC=true

● Check whether dynamic threads are enabled

int omp_get_dynamic (void)
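A minimal sketch of the two routines above: with dynamic adjustment switched off, a region that requests `requested` threads normally gets exactly that many, unless a thread limit applies. The function name `team_size_with_dynamic_off` is hypothetical, and the `_OPENMP` guard is an added convenience so the file still builds without OpenMP support.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: team size observed when dynamic adjustment is off
   and `requested` threads are asked for. */
int team_size_with_dynamic_off(int requested)
{
    assert(requested >= 1);
    int team = 1;  /* value kept when compiled without OpenMP support */
#ifdef _OPENMP
    omp_set_dynamic(0);              /* flag == 0: no dynamic adjustment */
    assert(omp_get_dynamic() == 0);  /* the setting can be read back */
    #pragma omp parallel num_threads(requested)
    {
        #pragma omp single
        team = omp_get_num_threads();
    }
#endif
    return team;
}
```

Calling `omp_set_dynamic(1)` instead only permits the runtime to use fewer threads; whether it actually does so is up to the implementation.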

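▶ Nested parallelism: opening a parallel region inside another parallel region (implementation-defined default, off in most runtimes, which is why the example below enables it explicitly)

● Turn nested parallelism on or off with a library routine (deprecated since OpenMP 5.0 in favor of omp_set_max_active_levels)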

void omp_set_nested(int flag)

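● Turn nested parallelism on or off with environment variables

export OMP_NESTED=true
export OMP_NUM_THREADS=n1,n2,n3  # number of threads at each nesting level

● Check whether nested parallelism is enabled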

int omp_get_nested (void)

▶ Example code for dynamic threads and nested parallelism

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_dynamic(0);                                   // disable dynamic threads
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single                                // executed by one thread; prints 2
        printf("Outer: num_thds=%d\n", omp_get_num_threads());

        omp_set_nested(1);                                // enable nested parallelism
        #pragma omp parallel num_threads(3)               // nested parallel region with 3 threads
        {
            #pragma omp single
            printf("Inner: num_thds=%d\n", omp_get_num_threads());  // prints 3
        }
        #pragma omp barrier

        omp_set_nested(0);                                // disable nested parallelism
        #pragma omp parallel num_threads(3)               // requests 3 threads, but nesting is off
        {
            #pragma omp single
            printf("Inner: num_thds=%d\n", omp_get_num_threads());  // prints 1
        }
        #pragma omp barrier
    }

    getchar();
    return 0;
}
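Since OpenMP 5.0 the preferred control for nesting is the maximum number of active levels rather than `omp_set_nested`. The sketch below is an assumption about how the example above could be expressed that way. The function name `inner_team_size` is hypothetical, and on runtimes older than OpenMP 5.0 nesting may additionally require `omp_set_nested(1)` or `OMP_NESTED=true`.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: size of the inner team when at most `max_levels`
   nested parallel regions may be active at the same time. */
int inner_team_size(int max_levels)
{
    assert(max_levels >= 0);
    int inner = 1;  /* value kept when compiled without OpenMP support */
#ifdef _OPENMP
    omp_set_dynamic(0);
    omp_set_max_active_levels(max_levels);
    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(3)
        {
            /* Only the inner team spawned by outer thread 0 reports,
               so `inner` has a single writer. */
            #pragma omp single
            {
                if (omp_get_ancestor_thread_num(1) == 0)
                    inner = omp_get_num_threads();
            }
        }
    }
#endif
    return inner;
}
```

With `max_levels` set to 1 the inner region is normally serialized and the function returns 1; with 2 it typically returns 3 on a runtime that supports nesting.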

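▶ Thread-private global variables: make a global variable private to each thread (it remains global within that thread); the directive must appear after the declarations of the variables it lists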

#pragma omp threadprivate (list)

● Example code

#include <stdio.h>
#include <omp.h>

int a, b, i, tid;
float x;

#pragma omp threadprivate(a, x)

int main(int argc, char *argv[])
{
    omp_set_dynamic(0);       // fixed team size, so threadprivate values persist between regions
    omp_set_num_threads(4);

    printf("1st Parallel Region:\n");
    #pragma omp parallel private(b, tid)
    {
        tid = omp_get_thread_num();
        a = tid;
        b = tid;
        x = (float)tid;
        printf("Thread %d: a, b, x= %d, %d, %f\n", tid, a, b, x);
    }

    printf("\n2nd Parallel Region:\n");
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Thread %d: a, b, x= %d, %d, %f\n", tid, a, b, x);
    }

    getchar();
    return 0;
}

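● Output: a and x are threadprivate, so each thread keeps the values it wrote in the first parallel region; b is not threadprivate, so the writes to its private copies in the first region are lost and the second region reads the untouched global b (0)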

1st Parallel Region :
Thread 0 : a, b, x = 0, 0, 0.000000
Thread 1 : a, b, x = 1, 1, 1.000000
Thread 3 : a, b, x = 3, 3, 3.000000
Thread 2 : a, b, x = 2, 2, 2.000000

2nd Parallel Region :
Thread 0 : a, b, x = 0, 0, 0.000000
Thread 2 : a, b, x = 2, 0, 2.000000
Thread 3 : a, b, x = 3, 0, 3.000000
Thread 1 : a, b, x = 1, 0, 1.000000
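The example above leaves each thread with its own value. When every thread should instead start from the master thread's value, the `copyin` clause can be used. The following is a minimal sketch under that assumption; the variable `counter` and the function `copies_matching_master` are hypothetical.

```c
#include <assert.h>

static int counter = 0;             /* file-scope variable */
#pragma omp threadprivate(counter)  /* one copy per thread */

/* Hypothetical helper: returns how many threads saw the master's value
   and stores the team size in *team_size. */
int copies_matching_master(int master_value, int *team_size)
{
    int matches = 0, team = 0;
    counter = master_value;  /* sets only the master thread's copy */

    /* copyin broadcasts the master's copy to every thread on entry. */
    #pragma omp parallel num_threads(4) copyin(counter) reduction(+:matches, team)
    {
        team += 1;
        matches += (counter == master_value);
    }

    assert(team >= 1);
    *team_size = team;
    return matches;
}
```

The returned count should equal the team size. Without `copyin`, the copies of the other threads would keep whatever value they held before the region.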

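▶ OpenMP stack: for every thread except the master thread, the storage for private variables is limited by the thread stack size; exceeding it leads to undefined behavior (typically a crash)

● The OpenMP stack size is implementation-dependent: icc defaults to 4 MB; gcc / gfortran default to 2 MB

● The default stack size can be changed with an environment variable: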

export OMP_STACKSIZE=32M
export OMP_STACKSIZE=8192K
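Besides raising `OMP_STACKSIZE`, large per-thread buffers can be placed on the heap, which is not subject to the thread stack limit. The sketch below illustrates that alternative; it is an assumption added for illustration, and the function name `heap_scratch_ok` is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if every thread could allocate and use
   a private scratch buffer of n doubles on the heap. */
int heap_scratch_ok(size_t n)
{
    assert(n > 0);
    int ok = 1;
    #pragma omp parallel num_threads(4) reduction(&&:ok)
    {
        /* A local array `double scratch[n]` would live on the thread stack
           and could overflow it; heap memory is not limited by OMP_STACKSIZE. */
        double *scratch = malloc(n * sizeof *scratch);
        if (scratch == NULL) {
            ok = 0;
        } else {
            double sum = 0.0;
            for (size_t i = 0; i < n; ++i)
                scratch[i] = 1.0;
            for (size_t i = 0; i < n; ++i)
                sum += scratch[i];
            ok = (sum == (double)n);  /* each thread checks its own buffer */
            free(scratch);
        }
    }
    return ok;
}
```

With `n` set to `1 << 20` each thread uses 8 MB, which is more than the default stack sizes quoted above but unproblematic on the heap.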

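▶ Thread affinity and thread binding: thread affinity is the policy that maps threads onto physical cores on NUMA systems; thread binding explicitly fixes which physical core each thread runs on, in order to improve performance

● Thread binding has been available since OpenMP 3.1 and is better supported from OpenMP 4.5 on; a related tool is numactl (see http://www.glennklockwood.com/hpc-howtos/process-affinity.html)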

export OMP_PROC_BIND=TRUE

● With icc, thread affinity can be configured through KMP_AFFINITY (see https://software.intel.com/en-us/node/522691)

export KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]
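To check which binding policy a program actually sees after setting `OMP_PROC_BIND`, the runtime can be queried from code. This is a minimal sketch that assumes a runtime supporting OpenMP 4.5 (`omp_get_proc_bind` and `omp_get_num_places`); the function name `describe_binding` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#if defined(_OPENMP) && _OPENMP >= 201511
#include <omp.h>
#endif

/* Hypothetical helper: writes a one-line summary of the binding settings
   into buf and returns the number of places known to the runtime. */
int describe_binding(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    const char *policy = "unavailable (needs OpenMP 4.5 or newer)";
    int places = 0;
#if defined(_OPENMP) && _OPENMP >= 201511
    switch (omp_get_proc_bind()) {
    case omp_proc_bind_false:  policy = "false";  break;
    case omp_proc_bind_true:   policy = "true";   break;
    case omp_proc_bind_close:  policy = "close";  break;
    case omp_proc_bind_spread: policy = "spread"; break;
    default:                   policy = "master/primary"; break;
    }
    places = omp_get_num_places();
#endif
    snprintf(buf, len, "proc_bind=%s, places=%d", policy, places);
    return places;
}
```

Running the program once with `OMP_PROC_BIND=close` and once with `OMP_PROC_BIND=spread`, and printing the summary string each time, shows whether the environment setting was picked up.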

▶ PETSc (Portable, Extensible Toolkit for Scientific Computation): lecture notes

● Advanced Scientific Computing:

　　■ Applications (large and complex)

　　■ Algorithms (fully or semi-implicit, multilevel, nested, hierarchical, computer-architecture aware)

　　■ Parallelization (libraries, extensible solvers, composable)

● Selected slides (not reproduced here)

● Commands run in the terminal

cd petsc-3.10.2/
module add mpich
module add petsc
cd src/vec/vec/examples/tutorials/
ls -al
make ex2
srun -c 8 mpiexec -n 4 ./ex2       # request 8 cores