MAGMA

LAPACK + GPU = MAGMA

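Installing MAGMA 1.1.0 (227) with GotoBLAS2 + CUDA

Preparation:

1 Install CUDA

2 Install a CPU BLAS

3 Install LAPACK

Installation steps:

1 Follow the README to install

2 Add -lgfortran to LIB in make.inc

3 The build fails with a linker error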

gcc -O3 -DADD_ -DGPUSHMEM=130 -fPIC -Xlinker -zmuldefs -DGPUSHMEM=130 testing_zhetrd.o -o testing_zhetrd lin/liblapacktest.a -L../lib \
-lcuda -lmagma -lmagmablas -lmagma -L/opt/GotoBLAS2 -L/usr/local/cuda/lib64 -L/usr/lib64 /opt/GotoBLAS2/libgoto.a -lgoto -lpthread -lcublas -lcudart -llapack -lm -lgfortran
../lib/libmagma.a(zlatrd.o): In function `magma_zlatrd':
zlatrd.cpp:(.text+0x3be): undefined reference to `zdotc'
collect2: ld returned 1 exit status
make: *** [testing_zhetrd] Error 1

Solution: see http://icl.cs.utk.edu/magma/forum/viewtopic.php?f=2&t=278 and http://www.pavanky.com/installing-magma-with-gotoblas2/

The forum post linked above explains how to fix the issue in zlatrd.cpp and clatrd.cpp by replacing blasf77_*dotc with cblas_*dotc_sub.
Be aware that the function is used twice: first around line 256, and again around line 325. Here are the changes to be made in zlatrd.cpp (in the src directory):

cblas_zdotc_sub(i, W(0, iw), ione, A(0, i), ione, &value); // Line 256

//blasf77_zdotc(&value, &i, W(0, iw), &ione, A(0, i), &ione);

...

cblas_zdotc_sub(i_n, W(i+1, i), ione, A(i+1, i), ione, &value); // Line 326

//blasf77_zdotc(&value, &i_n, W(i+1, i), &ione, A(i+1, i), &ione);

Cause:

This problem comes from zdot not having the same interface in the different BLAS implementations. We didn't realize there would be a problem for GotoBLAS with this change. The way it is now will work for MKL. If you open file zlatrd.cpp, before calling blasf77_zdotc, there is a call to cblas_zdotc_sub that is commented. This is an alternative to calling the blasf77_zdotc but you would have to add linking to cblas (if it is not part of the GOTO BLAS). The other way is to see what is the ZDOT interface in GotoBLAS and call it the correct way. Meanwhile probably for the next release we will make all BLAS functions to use CBLAS and require linking to CBLAS to avoid problems like this.
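To make the interface difference concrete, here is a minimal sketch of what a zdotc-style routine computes when the result is handed back through an output pointer, which is the convention cblas_zdotc_sub uses. The function name ref_zdotc_sub is hypothetical and is not part of GotoBLAS2, CBLAS or MAGMA. It is a plain loop for illustration, assuming positive increments. The practical point is that a Fortran function returning a complex value may be compiled either to return it by value or to write it through a hidden first argument, depending on the compiler convention, while the _sub form always uses an explicit output pointer.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

using zcomplex = std::complex<double>;

// Hypothetical reference routine (illustration only, not a BLAS symbol).
// Computes the conjugated dot product sum_k conj(x[k]) * y[k] and writes it
// through `result`, mirroring the argument order of cblas_zdotc_sub.
// Assumption: incx and incy are positive.
inline void ref_zdotc_sub(int n, const zcomplex* x, int incx,
                          const zcomplex* y, int incy, zcomplex* result) {
    assert(n >= 0 && incx > 0 && incy > 0 && result != nullptr);
    zcomplex acc(0.0, 0.0);
    for (int k = 0; k < n; ++k) {
        const std::size_t ix = static_cast<std::size_t>(k) * incx;
        const std::size_t iy = static_cast<std::size_t>(k) * incy;
        acc += std::conj(x[ix]) * y[iy];  // conjugate the first vector
    }
    *result = acc;  // explicit output pointer, no complex return value
}
```

Calling ref_zdotc_sub(2, x, 1, y, 1, &value) fills value in the same way as the patched cblas_zdotc_sub lines do, so the caller never depends on how a complex return value is passed back.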
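4 libgoto.so cannot be found at run time

Solution: add the directory containing libgoto.so (/opt/GotoBLAS2 in the link line above) to LD_LIBRARY_PATH and export it.

Reference installation guide: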

OPTIONS

Firstly, MAGMA needs a CPU LAPACK and BLAS backend installed on your machine.
There are four options for this.

Intel’s MKL
AMD’s ACML
Netlib’s LAPACK + ATLAS
Netlib’s LAPACK + GotoBLAS2

Each of the four options can be configured by one of the files make.inc.$(LIB), where LIB is either mkl, acml, atlas or goto. I wanted to go open source all the way with this. For reasons inexplicable, I chose GotoBLAS2 over ATLAS.

GOTOBLAS2

That meant I had to build GotoBLAS2 first. It was mostly painless, except that I had gcc 4.6, which meant the compiler started complaining about -l flags with nothing mentioned to the right. It was quickly evident that a parser was broken in the pipeline. After digging through Perl code (with which I have *no* experience) for a few minutes, I had the fix. The following patch had to be made to f_check inside the root directory of GotoBLAS2.

$link =~ s/\-rpath\s+/\-rpath\@/g;
$link =~ s/\-l\ /\-l/g; # Add this new line around line 237.

MAGMA

Finally, with everything set up, I had to make a change or two to make.inc.goto.
- Set GPU_TARGET = 1 (because I use a Fermi card; leave it as 0 if you have a pre-Fermi card).
- Change lgoto to lgoto2
- Copy make.inc.goto to make.inc
Doing a make at this point halts with a linker error.
The forum post linked above explains how to fix the issue in zlatrd.cpp and clatrd.cpp by replacing blasf77_*dotc with cblas_*dotc_sub.
Be aware that the function is used twice: first around line 256, and again around line 325. Here are the changes to be made in zlatrd.cpp

As follows:

cblas_zdotc_sub(i, W(0, iw), ione, A(0, i), ione, &value); // Line 256
//blasf77_zdotc(&value, &i, W(0, iw), &ione, A(0, i), &ione);
...
...
cblas_zdotc_sub(i_n, W(i+1, i), ione, A(i+1, i), ione, &value); // Line 326
//blasf77_zdotc(&value, &i_n, W(i+1, i), &ione, A(i+1, i), &ione);

Make similar changes in clatrd.cpp, then run make. Add -j if you are in a hurry. You are good to go!

http://www.pavanky.com/installing-magma-with-gotoblas2/

2 Installing MAGMA on the Shenzhen supercomputer

./testing_sgeqrf: error while loading shared libraries: libcublas.so.4: failed to map segment from shared object: Cannot allocate memory

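It is not clear whether CUDA was installed incorrectly (the driver could not be installed properly because of permission restrictions) or whether this is a problem with the system.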

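3 Installing clMAGMA

Requires an OpenCL BLAS (available from AMD)

Requires a CPU BLAS and a CPU LAPACK (MKL was used)

Probably also requires AMD APP

The tests reported errors, so this attempt was abandoned.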

3 MAGMA testing

To use multiple GPUs: setenv MAGMA_NUM_GPUS 4 (csh syntax; in bash, export MAGMA_NUM_GPUS=4)
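As a rough illustration of how a program can pick up such a setting, the sketch below reads an environment variable and falls back to a single GPU when it is unset or invalid. This is an assumption about typical usage, not MAGMA's actual code. The helper names parse_gpu_count and gpu_count_from_env and the upper limit of 64 are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper (not a MAGMA function): turn the text of a GPU-count
// setting into an int, returning `fallback` when the text is missing,
// not a whole number, or outside the range 1..64 (64 is an arbitrary
// sanity limit chosen for this sketch).
inline int parse_gpu_count(const char* text, int fallback) {
    assert(fallback >= 1);
    if (text == nullptr || *text == '\0') return fallback;
    char* end = nullptr;
    const long value = std::strtol(text, &end, 10);
    if (*end != '\0' || value < 1 || value > 64) return fallback;
    return static_cast<int>(value);
}

// Reads MAGMA_NUM_GPUS from the environment, defaulting to one GPU.
inline int gpu_count_from_env() {
    return parse_gpu_count(std::getenv("MAGMA_NUM_GPUS"), 1);
}
```

With MAGMA_NUM_GPUS set to 4 as above, gpu_count_from_env() returns 4. With the variable unset it returns 1.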

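####In the testing directory, the tests call several versions of the same routine, such as magma_sgeqrf2_gpu and magma_sgeqrf_gpu. This is generally because they use different storage strategies.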

sgeqrf2_gpu is LAPACK consistent in terms of input and output data layout. The sgeqrf_gpu version additionally stores the triangular matrices used in the factorization. sgeqrf3_gpu stores the triangular matrices but also modifies the storage for the Householder vectors used in the factorization: 0s are put in the upper triangular parts of the panels, 1s on the diagonal, and the upper triangular parts are stored separately.
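The difference between the layouts can be sketched on the host. In the LAPACK-consistent layout a single column-major array holds R on and above the diagonal and the Householder vectors below it, with their unit diagonal left implicit. The sketch below splits such an array into an explicit V (0s above the diagonal, 1s on it) and a separately stored R, as in the description of sgeqrf3_gpu above. The names SplitQR and split_packed_qr are hypothetical, the code assumes m >= n, and it does not reproduce MAGMA's actual GPU memory layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for illustration only (not a MAGMA type).
struct SplitQR {
    std::vector<float> V;  // m x n, column-major, unit lower trapezoidal
    std::vector<float> R;  // n x n, column-major, upper triangular
};

// Splits a LAPACK-style packed QR factor (column-major, leading dimension
// lda) into explicit Householder vectors V and a separate R.
// Assumption: m >= n and lda >= m.
inline SplitQR split_packed_qr(const std::vector<float>& A,
                               int m, int n, int lda) {
    assert(m >= n && n >= 0 && lda >= m);
    assert(A.size() >= static_cast<std::size_t>(lda) * n);
    const std::size_t M = m, N = n, LDA = lda;
    SplitQR out;
    out.V.assign(M * N, 0.0f);  // zeros above the diagonal by default
    out.R.assign(N * N, 0.0f);  // zeros below the diagonal by default
    for (std::size_t j = 0; j < N; ++j) {
        for (std::size_t i = 0; i < M; ++i) {
            const float a = A[i + j * LDA];
            if (i < j) {
                out.R[i + j * N] = a;     // strictly upper part belongs to R
            } else if (i == j) {
                out.R[i + j * N] = a;     // diagonal of R
                out.V[i + j * M] = 1.0f;  // implicit unit diagonal made explicit
            } else {
                out.V[i + j * M] = a;     // below the diagonal: Householder vector
            }
        }
    }
    return out;
}
```

For a 3 x 2 packed factor, split_packed_qr(A, 3, 2, 3) yields a 3 x 2 V whose first column starts with 1 and a 2 x 2 upper triangular R.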

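####When running the tests in the testing directory, the output of both testing_sgeqrf and testing_sgeqrf_gpu reports CPU and GPU performance. The reason is that the two programs test the CPU interface and the GPU interface of sgeqrf respectively, but that does not mean either one runs only on the CPU or only on the GPU. In fact both use the CPU and the GPU. For testing_sgeqrf the input and output are stored in CPU memory, whereas for testing_sgeqrf_gpu they are stored in GPU memory.

#####testing_sgeqrf versus testing_sgeqrf_gpu: the overall performance ratio testing_sgeqrf/testing_sgeqrf_gpu is about 96%; the CPU performance ratio testing_sgeqrf/testing_sgeqrf_gpu is 1.02; the GPU performance ratio testing_sgeqrf/testing_sgeqrf_gpu is 0.96 (problem sizes 1000-20000).

*_gpu means the input and output are stored on the GPU; routines without _gpu keep them on the CPU.

4 QR factorization code (CUDA version)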