Parallel optimization, xvout

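Code-level optimizations:
1. Memory access order. Fortran, for example, stores 2D arrays in column-major order, so the innermost loop should run over the first index (walking down a column). Consecutive accesses then touch contiguous memory and the cache works more efficiently.
2. Loop unrolling. If the CPU can issue four floating-point operations per cycle, a simple floating-point loop can be rewritten with a step of 4 and four hand-written copies of the body per iteration. The code is uglier, but it runs faster.
3. Reordering operations to reduce CPU pipeline stalls. This combines well with loop unrolling for a further gain.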
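4. Cache-oriented optimizations.
    Array merging: exploit the cache line length to improve spatial locality
    Loop interchange: change the order in which nested loops access memory
    Loop fusion: increase data reuse (temporal locality)
    Blocking: work on a submatrix that fits in the cache instead of sweeping whole rows or columns, which improves temporal locality.
5. For MPI communication, merge many small messages into a few larger ones, so that network latency is paid fewer times.
6. Keep file I/O to a minimum unless it is unavoidable.
7. Consider a hybrid OpenMP+MPI design, which avoids using MPI within a single node.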
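
As a rough illustration of loop unrolling, operation reordering, and blocking, here is a minimal C sketch. The function names, the row-major layout, and the tile size TILE are assumptions made for this example and should be tuned for the real machine. Note that splitting a sum across several accumulators changes the order of the floating-point additions, so results can differ slightly in rounding from the plain loop.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed tile size: three TILE x TILE tiles of doubles take 24 KB. */
#define TILE 32

/* Loop unrolling plus reordering: four independent accumulators,
   so successive additions do not have to wait on each other. */
double sum_unrolled(const double *x, size_t n)
{
    assert(x != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)          /* leftover elements when n % 4 != 0 */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Blocking: c += a * b for n x n row-major matrices, one tile at a time.
   Inside a tile the loops run in i-k-j order, so the innermost loop
   walks contiguous memory in both b and c. */
void matmul_blocked(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = min_size(ii + TILE, n);
        for (size_t kk = 0; kk < n; kk += TILE) {
            size_t k_end = min_size(kk + TILE, n);
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t j_end = min_size(jj + TILE, n);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

matmul_blocked accumulates into c, so the caller has to zero c first to get a plain product. Whether either rewrite actually helps depends on the compiler and the hardware, so it is worth timing against the simple version.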

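Compiler and other supporting optimizations:
1. Intel compiler: optimizations targeted at the hardware architecture, such as use of the MMX/SSE instruction sets. It can also optimize loops, for example with prefetching, loop interchange, and cache blocking.
2. Use efficient math libraries (BLAS, GotoBLAS, etc.).
3. Use VTune or a similar profiling tool to find performance hotspots. VTune and the Intel Cluster Toolkit, for example, can show the ratio of time spent in computation versus communication.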
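
To make the loop interchange transformation concrete, the following C sketch writes both loop orders by hand. It only illustrates the idea and does not show what any particular compiler emits. The function names and the flat row-major layout are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Before interchange: the inner loop jumps cols elements at a time,
   so each access is likely to land on a different cache line. */
double sum_strided(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double s = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            s += m[i * cols + j];
    return s;
}

/* After interchange: the inner loop walks contiguous memory
   (row-major storage, as in C). */
double sum_contiguous(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double s = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            s += m[i * cols + j];
    return s;
}
```

Both functions visit the same elements and differ only in traversal order. In Fortran, where storage is column-major, the faster order is the opposite one.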

xvout:
The X video extension, often abbreviated as XVideo or Xv, is a video output mechanism for the X Window System. The protocol was designed by David Carver; the specification for version 2 of the protocol was written in July 1991.[1] Its main use today is to rescale video playback in the video controller hardware, in order to enlarge a given video or to watch it in full screen mode. Without XVideo, X would have to do this scaling on the main CPU. That requires a considerable amount of processing power, sometimes to the point of slowing down or degrading the video stream; the video controller is specifically designed for this kind of computation, so it can do the work much more cheaply. Similarly, the X video extension has the video controller perform color space conversions. It can also have the controller change the contrast, brightness and hue of a displayed video stream.
Original article: https://www.cnblogs.com/super119/p/2326192.html