[转载]高效使用matlab之四：一个加速matlab程序的例子

原文地址：http://www.bfcat.com/index.php/2012/11/speed-up-app/

这篇文章原文是matlab网站上的，我把它翻译过来同时自己也学习一下。原文见这里

这篇文章主要使用到了如下几种加速方法：

这篇文章原文是matlab网站上的，我把它翻译过来同时自己也学习一下。原文见这里

这篇文章主要使用到了如下几种加速方法：

预分配空间
向量化
移除重复运算

我们要加速的程序是这样的。代码首先生成一个 x1 x2为横纵坐标的2D网格. 这个程序是要循环遍历所有初始和终止点的组合。给定一组位置，程序计算一个指数，如果这个指数小于阈值gausThresh，那么这个值就用于计算 out. subs变量保存所有点的位置坐标。

最初的程序如下：

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53

% Initialize the grid and initial and final points
nx1 = 10; nx2 = 10;
x1l = 0; x1u = 100;
x2l = 0; x2u = 100;

x1 = linspace(x1l,x1u,nx1+1);
x2 = linspace(x2l,x2u,nx2+1);

limsf1 = 1:nx1+1; limsf2 = 1:nx2+1;

% Initalize other variables
t = 1;
sigmax1 = 0.5; sigmax2 = 1;
sigma = t * [sigmax1^2 0; 0 sigmax2^2];
invSig = inv(sigma);
detSig = det(sigma);

expF = [1 0; 0 1];
n = size (expF, 1);
gausThresh = 10;

small = 0; subs = []; vals = [];

% Iterate through all possible initial
% and final positions and calculate
% the values of exponent and out
% if exponent > gausThresh.

for i1 = 1:nx1+1
for i2 = 1:nx2+1
for f1 = limsf1
for f2 = limsf2

% Initial and final position
xi = [x1(i1) x2(i2)]';
xf = [x1(f1) x2(f2)]';

exponent = 0.5 * (xf - expF * xi)'...
* invSig * (xf - expF * xi);

if exponent > gausThresh
small = small + 1;
else
out = 1 / (sqrt((2 * pi)^n * detSig))...
* exp(-exponent);
subs = [subs; i1 i2 f1 f2];
vals = [vals; out];
end

end
end
end
end

下面是一个图形表示的可视化（nx1=nx2=100）. 因为数据非常稠密，所以我们只显示其中一部分。红线链接的就是指数计算结果小于阈值的部分，线的粗细反应了数值大小。

这个程序在一个T60 Lenovo dual-core laptop 上面，开启了多线程以后，运行时间如下

1
2
3
4

displayRunTimes(1)
nx1 nx2 time
50 50 296 seconds
100 100 9826 seconds

M-Lint 的建议

第一步，也是最简单的一步就是根据 M-Lint 的建议修改, 这是matlab自带的静态代码分析工具。在editor里面就可以看到这些建议内容，不过也可以自己写一个函数让建议内容更加集中的显示一下。例如

1 2	output = mlint('initial.m'); displayMlint(output);

'subs' might be growing inside a loop. Consider preallocating for speed.

'vals' might be growing inside a loop. Consider preallocating for speed.

根据建议，第一步是要给一些数组预分配空间。这样做是因为matlab使用的内存中连续的块，因此，如果在循环里不断改变数组大小，matlab就要不断寻找合适大小的内存片段并把数据移动过去。如果分配了一个比较大的空间，matlab就可以一直在这个连续空间工作。

但是，目前我们不知道 subs和vars的个数，如果要预分配，我们得分配最大可能的空间。那就是100^4，我们来试试：

1
2
3
4
5

try
zeros(1,100^4)
catch ME
end
display(ME.message)

Out of memory. Type HELP MEMORY for your options.
看来不行啊。太大了。

因此，我们只能通过分块分配空间来实现，每次分配一个可以接受的大小，并设置一个计数器，当这块空间满了的时候，再分配一个块。这样，内存移动的次数大大得到了降低。

（bfcat注：这种不知道数组大小的时候，还有一个方法就是使用cell。我没有仔细分析cell的原理，但是我觉得它像是一个链表，因此cell里面的每一个元素不需要在连续的内存空间。因此，当我们执行类似 M{end+1} = m 的时候，matlab也不需要将M 中已有的元素都拷贝一次。这样，虽然Mlint还会提示让我们为cell预先分配空间，但是没关系，不分配对速度影响也不大。当循环结束以后，执行类似 M = cat(1, M{:}) 这样的语句就可以将其变回数组了。）

预分配空间后的代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52

% Initialization
nx1 = 10;
nx2 = 10;

[x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ...
initialize(nx1,nx2);

% Initial guess for preallocation
mm = min((nx1+1)^2*(nx2+1)^2, 10^6);
subs = zeros(mm,4);
vals = zeros(mm,1);

counter = 0;

% Iterate through all possible initial
% and final positions
for i1 = 1:nx1+1
for i2 = 1:nx2+1
for f1 = limsf1
for f2 = limsf2

xi = [x1(i1) x2(i2)]'; %% Initial position
xf = [x1(f1) x2(f2)]'; %% Final position

exponent = 0.5 * (xf - expF * xi)'...
* invSig * (xf - expF * xi);

% Increase preallocation if necessary
if counter == length(vals)
subs = [subs; zeros(mm, 4)];
vals = [vals; zeros(mm, 1)];
end

if exponent > gausThresh
small = small + 1;
else
% Counter introduced
counter=counter + 1;
out = 1 / (sqrt((2 * pi)^n * detSig))...
* exp(-exponent);
subs(counter,:) = [i1 i2 f1 f2];
vals(counter) = out;
end

end
end
end
end

% Remove zero components that came from preallocation
vals = vals(vals > 0);
subs = subs(vals > 0);

1
2
3
4

displayRunTimes(2)
nx1 nx2 time
50 50 267 seconds
100 100 4228 seconds

运行速度变快了一些，但是还是不够理想。

向量化

因为matlab是基于矩阵的语言，因此，我们最好尽量用向量代替循环。尤其是多重循环嵌套的时候更要注意速度问题。对于这个代码，我们主要进行以下两种改动

向量化里面的两个循环
向量化里面的三个循环

尝试1: 向量化两个内循环

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47

% Initialization

nx1 = 10;
nx2 = 10;

[x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ...
initialize(nx1,nx2);

vals = cell(nx1+1,nx2+1); % Cell preallocation
subs = cell(nx1+1,nx2+1); % Cell preallocation

[xind,yind] = meshgrid(limsf1,limsf2);
xyindices = [xind(:)' ; yind(:)'];

[x,y] = meshgrid(x1(limsf1),x2(limsf2));
xyfinal = [x(:)' ; y(:)'];

exptotal = zeros(length(xyfinal),1);

% Loop over all possible combinations of positions
for i1 = 1:nx1+1
for i2 = 1:nx2+1

xyinitial = repmat([x1(i1);x2(i2)],1,length(xyfinal));

expa = 0.5 * (xyfinal - expF * xyinitial);
expb = invSig * (xyfinal - expF * xyinitial);
exptotal(:,1) = expa(1,:).*expb(1,:)+expa(2,:).*expb(2,:);

index = find(exptotal < gausThresh);
expreduced = exptotal(exptotal < gausThresh);

out = 1 / (sqrt((2 * pi)^n * detSig)) * exp(-(expreduced));
vals{i1,i2} = out;
subs{i1,i2} = [i1*ones(1,length(index)) ; ...
i2*ones(1,length(index)); xyindices(1,index); ...
xyindices(2,index)]' ;

end
end

% Reshape and convert output so it is in a
% simple matrix format
vals = cell2mat(vals(:));
subs = cell2mat(subs(:));

small = ((nx1+1)^2*(nx2+1)^2)-length(subs);

这个向量化效果非常明显

1
2
3
4

displayRunTimes(3)
nx1 nx2 time
50 50 1.51 seconds
100 100 19.28 seconds

这里主要使用了下面几个向量化的手段

使用 meshgrid 和 repmat 创建矩阵
使用向量和矩阵操作
尽量直接使用向量元素操作

可以参考这个页面获得更多向量化技巧：http://www.mathworks.com/help/matlab/matlab_prog/techniques-for-improving-performance.html?s_tid=doc_12b#f8-783589

尝试2: 向量化三个内循环

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45

% Initialization
nx1 = 10;
nx2 = 10;

[x1,x2,limsf1,limsf2,expF,gausThresh,small,invSig,detSig,n]= ...
initialize(nx1,nx2);

limsi1 = limsf1;
limsi2 = limsf2;

% ndgrid gives a matrix of all the possible combinations
[aind,bind,cind] = ndgrid(limsi2,limsf1,limsf2);
[a,b,c] = ndgrid(x2,x1,x2);

vals = cell(nx1+1,nx2+1); % Cell preallocation
subs = cell(nx1+1,nx2+1); % Cell preallocation

% Convert grids to single vector to use in a single loop
b = b(:); aind = aind(:); bind = bind(:); cind = cind(:);

expac = a(:)-c(:); % Calculate x2-x1

% Iterate through initial x1 positions (i1)
for i1 = limsi1

exbx1= b-x1(i1);
expaux = invSig(2)*exbx1.*expac;
exponent = 0.5*(invSig(1)*exbx1.*exbx1+expaux);

index = find(exponent < gausThresh);
expreduced = exponent(exponent < gausThresh);

vals{i1} = 1 / (sqrt((2 * pi)^n * detSig))...
.*exp(-expreduced);

subs{i1} = [i1*ones(1,length(index));
aind(index)' ; bind(index)';...
cind(index)']';

end

vals = cell2mat(vals(:));
subs = cell2mat(subs(:));

small = ((nx1+1)^2*(nx2+1)^2)-length(subs);

现在运行时间更短了：

1
2
3
4

displayRunTimes(4)
nx1 nx2 time
50 50 0.658 seconds
100 100 8.77 seconds

最后，简化一些计算，让其只算一次

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70

nx1 = 100; nx2 = 100;
x1l = 0; x1u = 100;
x2l = 0; x2u = 100;

x1 = linspace(x1l,x1u,nx1+1);
x2 = linspace(x2l,x2u,nx2+1);

limsi1 = 1:nx1+1;
limsi2 = 1:nx2+1;

limsf1 = 1:nx1+1;
limsf2 = 1:nx2+1;

t = 1;

sigmax1 = 0.5;
sigmax2 = 1;

sigma = t * [sigmax1^2 sigmax2^2];
detSig = sigma(1)*sigma(2);
invSig = [1/sigma(1) 1/sigma(2)];

gausThresh = 10;
n=3;
const=1 / (sqrt((2 * pi)^n * detSig));

% ndgrid gives a matrix of all the possible combinations
% of position, except limsi1 which we iterate over

[aind,bind,cind] = ndgrid(limsi2,limsf1,limsf2);
[a,b,c] = ndgrid(x2,x1,x2);

vals = cell(nx1+1,nx2+1); % Cell preallocation
subs = cell(nx1+1,nx2+1); % Cell preallocation

% Convert grids to single vector to
% use in a single for-loop
b = b(:);
aind = aind(:);
bind = bind(:);
cind = cind(:);

expac= a(:)-c(:);
expaux = invSig(2)*expac.*expac;

% Iterate through initial x1 positions

for i1 = limsi1

expbx1= b-x1(i1);
exponent = 0.5*(invSig(1)*expbx1.*expbx1+expaux);

% Find indices where exponent < gausThresh
index = find(exponent < gausThresh);

% Find and keep values where exp < gausThresh

expreduced = exponent(exponent < gausThresh);

vals{i1} = const.*exp(-expreduced);

subs{i1} = [i1*ones(1,length(index));
aind(index)' ; bind(index)';...
cind(index)']';
end

vals = cell2mat(vals(:));
subs = cell2mat(subs(:));

small = ((nx1+1)^2*(nx2+1)^2)-length(subs);

最终的运行时间

1
2
3
4

displayRunTimes(5)
nx1 nx2 time
50 50 0.568 seconds
100 100 8.36 seconds

预分配空间
向量化
移除重复运算

最初的程序如下：

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53

这个程序在一个T60 Lenovo dual-core laptop 上面，开启了多线程以后，运行时间如下

1
2
3
4

displayRunTimes(1)
nx1 nx2 time
50 50 296 seconds
100 100 9826 seconds

M-Lint 的建议

1 2	output = mlint('initial.m'); displayMlint(output);

'subs' might be growing inside a loop. Consider preallocating for speed.

'vals' might be growing inside a loop. Consider preallocating for speed.

但是，目前我们不知道 subs和vars的个数，如果要预分配，我们得分配最大可能的空间。那就是100^4，我们来试试：

1
2
3
4
5

try
zeros(1,100^4)
catch ME
end
display(ME.message)

Out of memory. Type HELP MEMORY for your options.
看来不行啊。太大了。

预分配空间后的代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52

1
2
3
4

displayRunTimes(2)
nx1 nx2 time
50 50 267 seconds
100 100 4228 seconds

运行速度变快了一些，但是还是不够理想。

向量化

向量化里面的两个循环
向量化里面的三个循环

尝试1: 向量化两个内循环

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47

这个向量化效果非常明显

1
2
3
4

displayRunTimes(3)
nx1 nx2 time
50 50 1.51 seconds
100 100 19.28 seconds

这里主要使用了下面几个向量化的手段

使用 meshgrid 和 repmat 创建矩阵
使用向量和矩阵操作
尽量直接使用向量元素操作

可以参考这个页面获得更多向量化技巧：http://www.mathworks.com/help/matlab/matlab_prog/techniques-for-improving-performance.html?s_tid=doc_12b#f8-783589

尝试2: 向量化三个内循环

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45

现在运行时间更短了：

1
2
3
4

displayRunTimes(4)
nx1 nx2 time
50 50 0.658 seconds
100 100 8.77 seconds

最后，简化一些计算，让其只算一次

最终的运行时间

1
2
3
4

displayRunTimes(5)
nx1 nx2 time
50 50 0.568 seconds
100 100 8.36 seconds

本文引用地址：http://blog.sciencenet.cn/blog-791749-636039.html 转载请注明来自科学网博客，并请注明作者姓名。