Notes on trying out valgrind

valgrind is a full-featured code diagnostics tool. On Ubuntu it can be installed with

sudo apt-get install valgrind

The manual is available as a PDF download from the official website.

It can diagnose memory leaks:

g++ xxx.cpp
valgrind --tool=memcheck ./a.out

It reports the places where memory is leaked.

It can also diagnose cache hit rates:

g++ xxx.cpp
valgrind --tool=cachegrind ./a.out

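It reports miss rates (and hence hit rates) for the L1 data cache, the L1 instruction cache, and the last-level cache, among other figures.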

Here is an example:

#include<iostream>
using namespace std;
#include<ctime>

const size_t N = 1E3;

int main(){

        double y=0,z=0;
        clock_t tstart = clock();
        double *A = new double [N*N];
        for(size_t i=0;i<N*N;i++)A[i]=i;
        double *B = new double [N*N];
        for(size_t i=0;i<N*N;i++)B[i]=i;
        double *C = new double [N*N];
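        // C[i][j] = (sum over k of A[k][i]) * (sum over l of B[l][j]), row-major storage
        for(size_t i=0;i<N;i++)
        for(size_t j=0;j<N;j++){
                z=0;
                // walks down column j of B: consecutive accesses are N doubles apart
                for(size_t l=0;l<N;l++)
                        z += B[l*N+j];
                y=0;
                // walks down column i of A: again a stride of N doubles
                for(size_t k=0;k<N;k++){
                        y += A[k*N+i] * z;
                }
                C[i*N+j]=y;
        }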
        clock_t t1=clock();
        cout<<(double)(t1-tstart)/CLOCKS_PER_SEC<<" s"<<endl;
        delete [] A; delete [] B; delete [] C;
        return 0;
}

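In the inner loops of this code, l and k are row indices, so consecutive iterations touch elements that are N doubles apart. That gives poor memory locality, which shows up in the valgrind report as a lower L1 data cache hit rate (D1 miss rate: 11.8%).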

==2322== Cachegrind, a cache and branch-prediction profiler
==2322== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2322== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==2322== Command: ./a.out
==2322== 
--2322-- warning: L3 cache found, using its data for the LL simulation.
276.428 s
==2322== 
==2322== I   refs:      31,053,190,692
==2322== I1  misses:             1,976
==2322== LLi misses:             1,928
==2322== I1  miss rate:           0.00%
==2322== LLi miss rate:           0.00%
==2322== 
==2322== D   refs:      17,025,701,244  (15,018,537,430 rd   + 2,007,163,814 wr)
==2322== D1  misses:     2,001,266,444  ( 2,000,014,098 rd   +     1,252,346 wr)
==2322== LLd misses:       125,490,381  (   125,113,840 rd   +       376,541 wr)
==2322== D1  miss rate:           11.8% (          13.3%     +           0.1%  )
==2322== LLd miss rate:            0.7% (           0.8%     +           0.0%  )
==2322== 
==2322== LL refs:        2,001,268,420  ( 2,000,016,074 rd   +     1,252,346 wr)
==2322== LL misses:        125,492,309  (   125,115,768 rd   +       376,541 wr)
==2322== LL miss rate:             0.3% (           0.3%     +           0.0%  )

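In the code below, k and l are column indices in the inner loops, so consecutive iterations touch adjacent elements and memory locality is better: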

#include<iostream>
using namespace std;
#include<ctime>

const size_t N = 1E3;

int main(){

        double y=0,z=0;
        clock_t tstart = clock();
        double *A = new double [N*N];
        for(size_t i=0;i<N*N;i++)A[i]=i;
        double *B = new double [N*N];
        for(size_t i=0;i<N*N;i++)B[i]=i;
        double *C = new double [N*N];
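        // C[i][j] = (sum over k of A[i][k]) * (sum over l of B[j][l]), row-major storage
        for(size_t i=0;i<N;i++)
        for(size_t j=0;j<N;j++){
                z=0;
                // walks along row j of B: consecutive accesses are adjacent in memory
                for(size_t l=0;l<N;l++)
                        z += B[j*N+l];
                y=0;
                // walks along row i of A: again adjacent in memory
                for(size_t k=0;k<N;k++){
                        y += A[i*N+k] * z;
                }
                C[i*N+j]=y;
        }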
        clock_t t1=clock();
        cout<<(double)(t1-tstart)/CLOCKS_PER_SEC<<" s"<<endl;
        delete [] A; delete [] B; delete [] C;
        return 0;
}

In the cachegrind report, this shows up as a higher L1 data cache hit rate, i.e. a lower miss rate (D1 miss rate: 0.7%).

==2334== Cachegrind, a cache and branch-prediction profiler
==2334== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2334== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==2334== Command: ./a.out
==2334== 
--2334-- warning: L3 cache found, using its data for the LL simulation.
202.343 s
==2334== 
==2334== I   refs:      31,053,190,658
==2334== I1  misses:             1,974
==2334== LLi misses:             1,926
==2334== I1  miss rate:           0.00%
==2334== LLi miss rate:           0.00%
==2334== 
==2334== D   refs:      17,025,701,233  (15,018,537,423 rd   + 2,007,163,810 wr)
==2334== D1  misses:       125,517,445  (   125,140,099 rd   +       377,346 wr)
==2334== LLd misses:       125,510,970  (   125,134,429 rd   +       376,541 wr)
==2334== D1  miss rate:            0.7% (           0.8%     +           0.0%  )
==2334== LLd miss rate:            0.7% (           0.8%     +           0.0%  )
==2334== 
==2334== LL refs:          125,519,419  (   125,142,073 rd   +       377,346 wr)
==2334== LL misses:        125,512,896  (   125,136,355 rd   +       376,541 wr)
==2334== LL miss rate:             0.3% (           0.3%     +           0.0%  )

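Under valgrind, the two programs ran for 276.428 s and 202.343 s. Without valgrind, they ran for 24.0162 s and 7.99471 s. So a difference of roughly 11 percentage points in the D1 miss rate (11.8% vs 0.7%) goes along with about a 3x difference in run time. There is almost no other difference between the two reports: LL refs differ by an order of magnitude, but LL misses are about the same. This suggests that the arithmetic itself costs the CPU far less than the extra work caused by those D1 misses.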

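About caches: when the CPU needs a piece of data, it looks for it in the following places in order: L1 cache -> L2 cache -> ... -> last-level cache -> main memory. If the data is found in the L1 cache, the search stops there and the data is used; the deeper the search has to go, the higher the time cost. If the data finally has to be fetched from main memory (a cache miss), a whole "data block" is fetched at once and stored in the cache, so if the next piece of data needed lies in the same block, that cost is saved. The higher the cache hit rate, the better the program performs, which is why memory locality matters so much.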