Line Search and Quasi-Newton Methods

Gradient Descent

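Parameter estimation for many machine learning models relies on optimization algorithms, and gradient descent is one of the simplest and most widely used of them. Gradient descent [3], also known as steepest descent, can be used to find a local minimum of a function. The idea is that a function decreases fastest in the direction opposite to its gradient, so moving a sufficiently small distance along the negative gradient leads to a new point at which the function value is guaranteed not to increase, as shown in Figure 1.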

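The idea of gradient descent can be stated mathematically as follows:

$$b = a - \alpha \nabla f(a) \Rightarrow f(a) \geq f(b) \tag{1}$$

where $\alpha > 0$ is a sufficiently small step size. Applying this update repeatedly from a starting point $x_0$ gives

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k), \quad 0 \leq k \leq n \tag{2}$$

$$f(x_0) \geq f(x_1) \geq f(x_2) \geq \dots \geq f(x_n) \tag{3}$$

More generally, a descent direction $d_k$ is one along which a sufficiently small step $\alpha > 0$ decreases the function value:

$$f(x_k + \alpha d_k) < f(x_k) \tag{4}$$

Such a direction can be written in the general form

$$d_k = -B_k \nabla f(x_k) \tag{5}$$

which is a descent direction whenever $B_k$ is positive definite; $B_k = I$ gives gradient descent.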
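The update (2) can be sketched in a few lines. This is a minimal illustration, assuming plain NumPy arrays rather than the row-vector matrices used in the listings of this post; the name `gradient_descent`, the fixed step size and the test objective are hypothetical choices made only for the example. The recorded function values illustrate the chain of inequalities in (3).

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha=0.1, n_iter=100):
    """Iterate x_{k+1} = x_k - alpha * grad(x_k) with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    history = [f(x)]                      # f(x_0), f(x_1), ...
    for _ in range(n_iter):
        x = x - alpha * grad(x)
        history.append(f(x))
    return x, history

# Hypothetical test objective: f(x) = 0.5 * ||x - c||^2, whose gradient is x - c
c = np.array([1.0, -2.0])
f = lambda x: 0.5 * np.sum((x - c) ** 2)
grad = lambda x: x - c

x_min, history = gradient_descent(f, grad, np.zeros(2))
print(x_min)                              # close to c
```

A fixed step size only works here because the objective is well conditioned; the line search methods below choose the step adaptively.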

Line Search

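Given a search direction $d_k$, line search looks for a step size $\alpha$ along that direction:

$$\alpha = \arg\min_{\alpha \geq 0} h(\alpha) = \arg\min_{\alpha \geq 0} f(x_k + \alpha d_k) \tag{6}$$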

Bisection Search

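Bisection line search [2] finds a root of a function, here a root of the directional derivative $h'(\alpha) = f'(x_k + \alpha d_k)^T d_k$. The idea is simple: repeatedly split the current interval in half and keep the half that must contain a point where $h'(\alpha) = 0$, until the interval is short enough. Starting from an interval of length $\hat{\alpha}$, the interval length after $n$ halvings is

$$L = \left(\frac{1}{2}\right)^{n} \hat{\alpha} \tag{7}$$

so for a tolerance $\epsilon$ the number of iterations $k$ that are needed is bounded by

$$L \leq \epsilon \Rightarrow k \leq \left\lceil \log_2\left(\frac{\hat{\alpha}}{\epsilon}\right) \right\rceil \tag{8}$$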
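As a quick check of the bound (8), the sketch below compares the closed-form count with the number of halvings actually needed to shrink the interval below the tolerance. The function names are hypothetical and the values of $\hat{\alpha}$ and $\epsilon$ are arbitrary examples.

```python
import math

def bisection_iterations(alpha_hat, eps):
    """Iteration bound from (8): ceil(log2(alpha_hat / eps))."""
    return math.ceil(math.log2(alpha_hat / eps))

def halvings_needed(alpha_hat, eps):
    """Count halvings until the interval length (7) is at most eps."""
    n, length = 0, alpha_hat
    while length > eps:
        length /= 2
        n += 1
    return n

print(bisection_iterations(1.0, 1e-6))    # 20
print(halvings_needed(1.0, 1e-6))         # 20
```

With an initial interval of length 1 and the tolerance `eps=1e-6` used in the listing below, about 20 iterations are enough.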

import numpy as np

def bisection(dfun,theta,args,d,low,high,maxiter=1e4):
    """
    Find a root of h'(a)=f'(theta+a*d)^T d in the interval [low,high].
    dfun: computes the gradient of the function f(x)
    theta: parameters of the model
    args: other variables needed to compute the value of dfun
    d: search direction
    [low,high]: the interval which contains the root
    maxiter: the maximum number of iterations
    """
    eps=1e-6
    # directional derivatives at both endpoints (vectors are row vectors)
    val_low=np.sum(dfun(theta+low*d,args)*d.T)
    val_high=np.sum(dfun(theta+high*d,args)*d.T)
    if val_low*val_high>0: # same sign: no sign change inside the interval
        raise Exception('Invalid interval!')
    mid=(low+high)/2
    iter_num=1
    while iter_num<maxiter:
        mid=(low+high)/2
        val_mid=np.sum(dfun(theta+mid*d,args)*d.T)
        if abs(val_mid)<eps or abs(high-low)<eps:
            return mid
        elif val_mid*val_low>0: # root lies in the right half
            low=mid
        else: # root lies in the left half
            high=mid
        iter_num+=1
    return mid # iteration limit reached: return the last midpoint

Backtracking

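Backtracking line search [1] uses the Armijo condition to find the largest acceptable step size along the search direction. The basic idea is to first try a fairly large step-size estimate along the search direction and then shrink it iteratively until the function value $f(x_k + \alpha d_k)$ has decreased sufficiently relative to $f(x_k)$, that is, until the Armijo condition holds:

$$f(x_k + \alpha d_k) \leq f(x_k) + c_1 \alpha f'(x_k)^T d_k \tag{9}$$

Such a step always exists. Write $h(\alpha) = f(x_k + \alpha d_k)$, so that $h'(0) = f'(x_k)^T d_k$. Because $d_k$ is a descent direction and $0 < c_1 < 1$,

$$h'(0) < c_1 h'(0) < 0 \tag{10}$$

By the definition of the derivative,

$$h'(0) = \lim_{\alpha \to 0} \frac{h(\alpha) - h(0)}{\alpha} = \lim_{\alpha \to 0} \frac{f(x_k + \alpha d_k) - f(x_k)}{\alpha} < c_1 h'(0) \tag{11}$$

so for all sufficiently small $\alpha > 0$

$$\frac{f(x_k + \alpha d_k) - f(x_k)}{\alpha} < c_1 f'(x_k)^T d_k \tag{12}$$

which is the Armijo condition (9) after multiplying both sides by $\alpha$.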

def ArmijoBacktrack(fun,dfun,theta,args,d,stepsize=1,tau=0.5,c1=1e-3):
    """
    Find an acceptable step size by backtracking under the Armijo rule.
    fun: computes the value of the objective function
    dfun: computes the gradient of the objective function
    theta: a vector of parameters of the model
    args: other variables needed by fun and dfun
    d: search direction
    stepsize: initial step size
    tau: shrink rate of the step size
    c1: sufficient decrease parameter
    """
    slope=np.sum(dfun(theta,args)*d.T) # h'(0)=f'(x)^T d
    obj_old=fun(theta,args)
    theta_new=theta+stepsize*d
    obj_new=fun(theta_new,args)
    while obj_new>obj_old+c1*stepsize*slope: # Armijo condition violated
        stepsize*=tau
        theta_new=theta+stepsize*d
        obj_new=fun(theta_new,args)
    return stepsize
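The backtracking loop can be traced on a tiny problem. The sketch below is a self-contained variant that assumes plain 1-D NumPy arrays and no extra `args`; the name `backtracking` and the test objective $f(x) = x^T x$ are hypothetical.

```python
import numpy as np

def backtracking(f, grad, x, d, alpha=1.0, tau=0.5, c1=1e-3):
    """Shrink alpha by tau until the Armijo condition (9) holds."""
    slope = grad(x) @ d                   # h'(0) = f'(x)^T d
    f_old = f(x)
    while f(x + alpha * d) > f_old + c1 * alpha * slope:
        alpha *= tau
    return alpha

f = lambda x: x @ x                       # hypothetical objective
grad = lambda x: 2 * x
x = np.array([1.0, 1.0])
d = -grad(x)                              # steepest-descent direction

alpha = backtracking(f, grad, x, d)
print(alpha)                              # 0.5
```

The first trial step 1 overshoots to the point (-1, -1), where the function value equals the starting value, so the Armijo test fails; one halving lands exactly on the minimizer and the step 0.5 is accepted.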

Interpolation

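The convergence speed of backtracking line search under the Armijo condition is not guaranteed, especially when many reductions are needed before the step falls into the interval that satisfies the condition. If we instead fit the function with a polynomial by interpolation [12,6,5,9], using the function values and derivatives that are already available, and estimate the minimizer from that polynomial, a suitable step size can be chosen much more efficiently. Suppose we only have $h(0)$, $h'(0)$ and the value $h(\alpha_0)$ at a first trial step $\alpha_0$. These three values determine the quadratic approximation

$$h_q(\alpha) = \left(\frac{h(\alpha_0) - h(0) - \alpha_0 h'(0)}{\alpha_0^2}\right) \alpha^2 + h'(0) \alpha + h(0) \tag{13}$$

whose minimizer is the next trial step

$$\alpha_1 = \frac{h'(0) \alpha_0^2}{2 \left[ h(0) + h'(0) \alpha_0 - h(\alpha_0) \right]} \tag{14}$$

Given the values at two trial steps $\alpha_{i-1}$ and $\alpha_i$ together with $h(0)$ and $h'(0)$, a cubic can be fitted instead:

$$h_c(\alpha) = a \alpha^3 + b \alpha^2 + h'(0) \alpha + h(0) \tag{15}$$

$$\begin{bmatrix} a \\ b \end{bmatrix} = \frac{1}{\alpha_{i-1}^2 \alpha_i^2 (\alpha_i - \alpha_{i-1})} \begin{bmatrix} \alpha_{i-1}^2 & -\alpha_i^2 \\ -\alpha_{i-1}^3 & \alpha_i^3 \end{bmatrix} \begin{bmatrix} h(\alpha_i) - h(0) - h'(0) \alpha_i \\ h(\alpha_{i-1}) - h(0) - h'(0) \alpha_{i-1} \end{bmatrix} \tag{16}$$

and its minimizer is

$$\alpha_{i+1} = \frac{-b + \sqrt{b^2 - 3 a h'(0)}}{3a} \tag{17}$$

When the derivatives at the trial steps are also available, the cubic Hermite polynomial can be used:

$$\begin{aligned} H_3(\alpha) ={}& \left[1 + 2\frac{\alpha_i - \alpha}{\alpha_i - \alpha_{i-1}}\right] \left[\frac{\alpha - \alpha_{i-1}}{\alpha_i - \alpha_{i-1}}\right]^2 h(\alpha_i) + \left[1 + 2\frac{\alpha - \alpha_{i-1}}{\alpha_i - \alpha_{i-1}}\right] \left[\frac{\alpha_i - \alpha}{\alpha_i - \alpha_{i-1}}\right]^2 h(\alpha_{i-1}) \\ &+ (\alpha - \alpha_i) \left[\frac{\alpha - \alpha_{i-1}}{\alpha_i - \alpha_{i-1}}\right]^2 h'(\alpha_i) + (\alpha - \alpha_{i-1}) \left[\frac{\alpha_i - \alpha}{\alpha_i - \alpha_{i-1}}\right]^2 h'(\alpha_{i-1}) \end{aligned} \tag{18}$$

Its minimizer is

$$\alpha_{i+1} = \alpha_i - (\alpha_i - \alpha_{i-1}) \left[\frac{h'(\alpha_i) + d_2 - d_1}{h'(\alpha_i) - h'(\alpha_{i-1}) + 2 d_2}\right] \tag{19}$$

where

$$d_1 = h'(\alpha_i) + h'(\alpha_{i-1}) - 3 \left[\frac{h(\alpha_i) - h(\alpha_{i-1})}{\alpha_i - \alpha_{i-1}}\right] \tag{20}$$

$$d_2 = \operatorname{sign}(\alpha_i - \alpha_{i-1}) \sqrt{d_1^2 - h'(\alpha_{i-1}) h'(\alpha_i)} \tag{21}$$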
import numpy as np

def quadraticInterpolation(a,h,h0,g0):
    """
    Approximate h(a) with a quadratic function and return its stationary point, see (14).
    a: current step size
    h: function value at the step size, h(a)=f(x_k+a*d)
    h0: h(0)=f(x_k)
    g0: h'(0)=f'(x_k)^T d
    """
    numerator=g0*a**2
    denominator=2*(g0*a+h0-h)
    if abs(denominator)<1e-12: # degenerate quadratic (a is almost 0 or there is no curvature)
        return a
    return numerator/denominator

def cubicInterpolation(a0,h0,a1,h1,h,g):
    """
    Approximate h(x) with a cubic function and return its stationary point, see (15)-(17).
    This version computes h'(x) as rarely as possible, which suits the case where
    derivatives are more expensive to compute than function values.
    a0 and a1: step sizes of the previous two iterations
    h0: h(a0)
    h1: h(a1)
    h: h(0)=f(x)
    g: h'(0)
    """
    mat=np.array([[a0**2,-a1**2],[-a0**3,a1**3]])
    vec=np.array([h1-h-g*a1,h0-h-g*a0])
    a,b=mat.dot(vec)/(a0**2*a1**2*(a1-a0)) # coefficients of the cubic, see (16)
    if abs(a)<1e-12: # a=0: the cubic degenerates to a quadratic
        return -g/(2*b)
    return (-b+np.sqrt(b**2-3*a*g))/(3*a)

def cubicInterpolationHermite(a0,h0,g0,a1,h1,g1):
    """
    Approximate h(a) with a cubic Hermite polynomial and return its stationary point, see (19)-(21).
    This version computes h(a) as rarely as possible, which suits the case where
    derivatives are cheaper to compute than function values.
    a0 and a1: step sizes of the previous two iterations
    h0: h(a0)
    g0: h'(a0)
    h1: h(a1)
    g1: h'(a1)
    """
    d1=g0+g1-3*(h1-h0)/(a1-a0)
    d2=np.sign(a1-a0)*np.sqrt(d1**2-g0*g1)
    res=a1-(a1-a0)*(g1+d2-d1)/(g1-g0+2*d2) # numerator is h'(a1)+d2-d1, as in (19)
    return res
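The interpolation formulas can be checked on functions whose minimizer is known: a quadratic is recovered exactly by the quadratic step (14), and a cubic is recovered exactly by the Hermite step (19)-(21). The sketch below is self-contained; the function names `quad_step` and `hermite_step` and the two test functions are hypothetical.

```python
import numpy as np

def quad_step(a0, h_a0, h0, g0):
    """Minimizer of the quadratic through h(0), h'(0) and h(a0), see (14)."""
    return g0 * a0**2 / (2.0 * (h0 + g0 * a0 - h_a0))

def hermite_step(a0, h0, g0, a1, h1, g1):
    """Minimizer of the cubic Hermite interpolant, see (19)-(21)."""
    d1 = g0 + g1 - 3.0 * (h1 - h0) / (a1 - a0)
    d2 = np.sign(a1 - a0) * np.sqrt(d1**2 - g0 * g1)
    return a1 - (a1 - a0) * (g1 + d2 - d1) / (g1 - g0 + 2.0 * d2)

# Quadratic h(a) = (a - 1)^2: minimizer 1, h(0) = 1, h'(0) = -2
h = lambda a: (a - 1.0) ** 2
alpha_q = quad_step(3.0, h(3.0), h(0.0), -2.0)

# Cubic p(a) = a^3 - 3a: local minimizer 1
p = lambda a: a**3 - 3.0 * a
dp = lambda a: 3.0 * a**2 - 3.0
alpha_c = hermite_step(0.0, p(0.0), dp(0.0), 2.0, p(2.0), dp(2.0))

print(alpha_q, alpha_c)                   # both equal 1.0
```

For the cubic, $d_1 = 3$ and $d_2 = 6$, so the step is $2 - 2 \cdot 12 / 24 = 1$. Replacing $d_1$ by $d_2$ in the numerator would give $2 - 2 \cdot 9 / 24 = 1.25$ instead, which is not the minimizer.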

The line search algorithm based on the Armijo condition is described in [4]. The corresponding Python code is as follows:

def ArmijoLineSearch(fun,dfun,theta,args,d,a0=1,c1=1e-3,a_min=1e-7,max_iter=1e5):
    """
    Line search under the Armijo condition with quadratic and cubic interpolation.
    fun: objective function
    dfun: computes the gradient of fun
    theta: a vector of parameters of the model
    args: other variables needed by fun and dfun
    d: search direction
    a0: initial step size
    c1: constant used in the Armijo condition
    a_min: minimum step size
    max_iter: maximum number of iterations
    """
    eps=1e-6
    c1=min(c1,0.5) # c1 should be <= 0.5
    a_pre=h_pre=g_pre=0
    a_cur=a0
    f_val=fun(theta,args) # h(0)=f(x)
    g_val=np.sum(dfun(theta,args)*d.T) # h'(0)=f'(x)^T d
    k=0
    while a_cur>a_min and k<max_iter:
        h_cur=fun(theta+a_cur*d,args)
        if h_cur<=f_val+c1*a_cur*g_val: # Armijo condition holds
            return a_cur
        # the gradient is only needed when the step is rejected
        g_cur=np.sum(dfun(theta+a_cur*d,args)*d.T)
        if not k: # k=0: use quadratic interpolation
            a_new=quadraticInterpolation(a_cur,h_cur,f_val,g_val)
        else: # k>0: use cubic Hermite interpolation
            a_new=cubicInterpolationHermite(a_pre,h_pre,g_pre,a_cur,h_cur,g_cur)
        # safeguard: reject a step that is not finite, not positive, or barely changed
        if not np.isfinite(a_new) or a_new<eps or abs(a_new-a_cur)<eps:
            a_new=a_cur/2
        a_pre=a_cur
        a_cur=a_new
        h_pre=h_cur
        g_pre=g_cur
        k+=1
    return a_min # failed search

Wolfe Search

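As mentioned earlier, a step size chosen by the Armijo condition alone (without the backtracking strategy) may be too small. To rule out such tiny steps we add a curvature condition (see Figure 5):

$$h'(\alpha) = f'(x_k + \alpha d_k)^T d_k \geq c_2 f'(x_k)^T d_k \tag{22}$$

Together, the Armijo condition and the curvature condition form the Wolfe conditions, with $0 < c_1 < c_2 < 1$:

$$\begin{cases} f(x_k + \alpha d_k) \leq f(x_k) + c_1 \alpha f'(x_k)^T d_k \\ f'(x_k + \alpha d_k)^T d_k \geq c_2 f'(x_k)^T d_k \end{cases} \tag{23}$$

Bounding the absolute value of the directional derivative instead gives the strong Wolfe conditions:

$$\begin{cases} f(x_k + \alpha d_k) \leq f(x_k) + c_1 \alpha f'(x_k)^T d_k \\ \left| f'(x_k + \alpha d_k)^T d_k \right| \leq c_2 \left| f'(x_k)^T d_k \right| \end{cases} \tag{24}$$

A step size satisfying the Wolfe conditions exists whenever $f$ is bounded below along $d_k$. Let $\alpha'$ be the smallest positive step at which the Armijo condition holds with equality:

$$f(x_k + \alpha' d_k) = f(x_k) + \alpha' c_1 f'(x_k)^T d_k \tag{25}$$

By the mean value theorem there is an $\alpha'' \in (0, \alpha')$ such that

$$f(x_k + \alpha' d_k) - f(x_k) = \alpha' f'(x_k + \alpha'' d_k)^T d_k \tag{26}$$

Combining (25) and (26), and using $c_1 < c_2$ and $f'(x_k)^T d_k < 0$, we get

$$f'(x_k + \alpha'' d_k)^T d_k = c_1 f'(x_k)^T d_k > c_2 f'(x_k)^T d_k \tag{27}$$

so $\alpha''$ satisfies the curvature condition, and because $\alpha'' < \alpha'$ it satisfies the Armijo condition as well.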
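To make the role of each condition concrete, the sketch below evaluates the Armijo, curvature and strong curvature tests for three step sizes on a one-dimensional quadratic. The helper name `wolfe_conditions` and the test problem are hypothetical, and plain NumPy arrays are assumed.

```python
import numpy as np

def wolfe_conditions(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Return (armijo, curvature, strong_curvature) for the step size alpha."""
    slope0 = grad(x) @ d                          # h'(0)
    slope = grad(x + alpha * d) @ d               # h'(alpha)
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * slope0
    curvature = slope >= c2 * slope0              # second line of (23)
    strong = abs(slope) <= c2 * abs(slope0)       # second line of (24)
    return bool(armijo), bool(curvature), bool(strong)

f = lambda x: x @ x                               # hypothetical objective
grad = lambda x: 2 * x
x = np.array([1.0])
d = -grad(x)                                      # h(alpha) = (1 - 2*alpha)^2

print(wolfe_conditions(f, grad, x, d, 0.5))       # (True, True, True)
print(wolfe_conditions(f, grad, x, d, 1e-3))      # (True, False, False)
print(wolfe_conditions(f, grad, x, d, 1.0))       # (False, True, False)
```

The tiny step `1e-3` passes the Armijo test but is rejected by the curvature condition, which is exactly the case the curvature condition is meant to exclude; the step `1.0` is too long and fails the Armijo test.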

In Algorithm 5,

This is easy to understand with the help of Figure 7, in which I have marked with red and green points

def WolfeLineSearch(fun,dfun,theta,args,d,a0=1,c1=1e-4,c2=0.9,a_min=1e-7,max_iter=1e5):
    """
    Find a step size satisfying the (strong) Wolfe conditions.
    fun: objective function
    dfun: computes the gradient of fun
    theta: a vector of parameters of the model
    args: other variables needed by fun and dfun
    d: search direction
    a0: initial step size
    c1: constant used in the Armijo condition
    c2: constant used in the curvature condition
    a_min: minimum step size
    max_iter: maximum number of iterations
    """
    eps=1e-16
    c1=min(c1,0.5)
    a_pre=0
    a_cur=a0
    f_val=fun(theta,args) # h(0)=f(x)
    g_val=np.sum(dfun(theta,args)*d.T) # h'(0)=f'(x)^T d
    h_pre=f_val
    k=0
    while k<max_iter and abs(a_cur-a_pre)>=eps:
        h_cur=fun(theta+a_cur*d,args) # f(x+a*d)
        # Armijo condition violated, or the function value stopped decreasing
        if h_cur>f_val+c1*a_cur*g_val or (h_cur>=h_pre and k>0):
            return zoom(fun,dfun,theta,args,d,a_pre,a_cur,c1,c2)
        g_cur=np.sum(dfun(theta+a_cur*d,args)*d.T)
        if abs(g_cur)<=-c2*g_val: # strong Wolfe curvature condition holds
            return a_cur
        if g_cur>=0: # slope changed sign: a minimizer lies in [a_pre,a_cur]
            return zoom(fun,dfun,theta,args,d,a_pre,a_cur,c1,c2)
        a_new=quadraticInterpolation(a_cur,h_cur,f_val,g_val)
        a_pre=a_cur
        a_cur=a_new
        h_pre=h_cur
        k+=1
    return a_min

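The zoom function is described in Algorithm 6. It takes a search interval $[\alpha_{low}, \alpha_{high}]$ that contains a step size satisfying the Wolfe conditions.

The Python code for the zoom function is as follows: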

def zoom(fun,dfun,theta,args,d,a_low,a_high,c1=1e-3,c2=0.9,max_iter=1e4):
    """
    Shrink the interval by bisection to find a step size satisfying the Wolfe conditions.
    fun: objective function
    dfun: computes the gradient of fun
    theta: a vector of parameters of the model
    args: other variables needed by fun and dfun
    d: search direction
    [a_low,a_high]: interval containing a step size satisfying the Wolfe conditions
    c1: constant used in the Armijo condition
    c2: constant used in the curvature condition
    max_iter: maximum number of iterations
    """
    if a_low>a_high:
        print('low:%f,high:%f'%(a_low,a_high))
        raise Exception('Invalid interval of stepsize in zoom procedure')
    eps=1e-16
    h=fun(theta,args) # h(0)=f(x)
    g=np.sum(dfun(theta,args)*d.T) # h'(0)=f'(x)^T d
    k=0
    h_low=fun(theta+a_low*d,args)
    if h_low>h+c1*a_low*g:
        raise Exception('Left endpoint violates Armijo condition in zoom procedure')
    while k<max_iter and abs(a_high-a_low)>=eps:
        a_new=(a_low+a_high)/2
        h_new=fun(theta+a_new*d,args)
        if h_new>h+c1*a_new*g or h_new>h_low: # midpoint too far: keep the left half
            a_high=a_new
        else:
            g_new=np.sum(dfun(theta+a_new*d,args)*d.T)
            if abs(g_new)<=-c2*g: # strong Wolfe curvature condition holds
                return a_new
            if g_new*(a_high-a_low)>=0: # slope is non-negative: minimizer is to the left
                a_high=a_new
            else: # slope is negative: minimizer is to the right
                a_low=a_new
                h_low=h_new
        k+=1
    return a_low # a_low always satisfies the Armijo condition
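For comparison, here is a much simpler bracketing scheme that finds a step satisfying the weak Wolfe conditions (23) by bisection and doubling. It is not the zoom procedure above and does not enforce the strong condition (24); it is a swapped-in technique shown only as a minimal sketch. The name `weak_wolfe_bisection`, the use of plain NumPy arrays and the Rosenbrock test function are assumptions made for the example.

```python
import numpy as np

def weak_wolfe_bisection(f, grad, x, d, c1=1e-4, c2=0.9, max_iter=100):
    """Bracket a weak-Wolfe step in [lo, hi], halving or doubling the trial step t."""
    h0 = f(x)
    g0 = grad(x) @ d                          # h'(0), negative for a descent direction
    lo, hi, t = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        if f(x + t * d) > h0 + c1 * t * g0:   # Armijo fails: step too long
            hi = t
            t = 0.5 * (lo + hi)
        elif grad(x + t * d) @ d < c2 * g0:   # curvature fails: step too short
            lo = t
            t = 2.0 * lo if np.isinf(hi) else 0.5 * (lo + hi)
        else:
            return t                          # both conditions in (23) hold
    return t                                  # iteration limit reached

# Rosenbrock function and its gradient (hypothetical test problem)
f = lambda x: 100.0 * (x[1] - x[0]**2) ** 2 + (1.0 - x[0]) ** 2
grad = lambda x: np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                           200.0 * (x[1] - x[0]**2)])
x = np.array([-1.2, 1.0])
d = -grad(x)
t = weak_wolfe_bisection(f, grad, x, d)
```

The returned step can be verified directly against both inequalities in (23), which is a useful sanity check for any line search implementation.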

Newton's Method

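Newton's method [8] finds a root of a function iteratively. The basic idea is to start from an initial point and, at the current point $x_k$, repeatedly replace the function by a local approximation whose solution gives the next point. For minimization, $f$ is approximated by its second-order Taylor expansion

$$f(x_k + \Delta x) \approx f(x_k) + f'(x_k)^T \Delta x + \frac{1}{2} \Delta x^T B_k \Delta x \tag{28}$$

where $B_k$ is the Hessian matrix of $f$ at $x_k$. Differentiating the approximation gives the gradient at $x_{k+1} = x_k + \Delta x$:

$$f'(x_{k+1}) = f'(x_k) + B_k (x_{k+1} - x_k) \tag{29}$$

Setting $f'(x_{k+1}) = 0$ yields the Newton iteration

$$x_{k+1} = x_k - B_k^{-1} f'(x_k) \tag{30}$$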
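The iteration (30) can be sketched with an explicit Hessian. This is a minimal illustration without any line search or safeguard; the name `newton_minimize` and the test function $f(x) = e^{x_1} - x_1 + x_2^2$ (minimizer at the origin) are hypothetical. The linear system is solved with `np.linalg.solve` rather than by forming $B_k^{-1}$.

```python
import numpy as np

def newton_minimize(grad, hess, x0, n_iter=10):
    """Iterate x_{k+1} = x_k - B_k^{-1} f'(x_k), with B_k the exact Hessian."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# f(x) = exp(x1) - x1 + x2^2
grad = lambda x: np.array([np.exp(x[0]) - 1.0, 2.0 * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 0.0], [0.0, 2.0]])

x_star = newton_minimize(grad, hess, [1.0, 1.0])
print(x_star)                             # close to [0, 0]
```

The quadratic coordinate $x_2$ is solved in a single step, while $x_1$ converges quadratically once it is close to the minimizer.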

Quasi-Newton Method

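Quasi-Newton methods [11] can be used to find local optima of a function, that is, stationary points where the derivative is zero. When Newton's method is applied to an optimization problem, it assumes that the function can be approximated by a quadratic and then uses the first and second derivatives to find a local optimum. Quasi-Newton methods do not need to compute the Hessian matrix exactly. Instead, they use the quasi-Newton condition below, applied to two successive gradient vectors, to obtain an approximation matrix $B_{k+1}$:

$$f'(x_{k+1}) - f'(x_k) \approx B_{k+1} (x_{k+1} - x_k) \tag{31}$$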

import numpy as np
from numpy import matlib

def BFGS(fun,dfun,theta,args,H=None,mode=0,eps=1e-12,max_iter=1e4):
    """
    Find the minimum of the objective function f(x).
    fun: objective function f(x)
    dfun: computes the gradient of f(x)
    theta: start vector of parameters of the model (a row vector)
    args: parameters needed by fun and dfun
    H: initial inverse Hessian approximation
    mode: index of the line search algorithm
    """
    x_pre=x_cur=theta
    g=dfun(x_cur,args)
    if H is None: # initialize H as an identity matrix
        H=matlib.eye(theta.size)
    k=0
    while k<max_iter and np.sum(np.abs(g))>eps:
        d=-g*H # g is a row vector (np.matrix), so this is a matrix product
        # LineSearch is not defined in this post; presumably it dispatches
        # on mode to one of the line search functions above
        step=LineSearch(fun,dfun,x_pre,args,d,1,mode)
        s=step*d
        x_cur=x_pre+s
        g_new=dfun(x_cur,args)
        y=g_new-g
        ys=np.sum(y*s.T)
        if abs(ys)<eps:
            return x_cur
        # BFGS update of the inverse Hessian approximation
        change=(ys+np.sum(y*H*y.T))*(s.T*s)/(ys**2)-(H*y.T*s+s.T*y*H)/ys
        H=H+change # do not modify the caller's matrix in place
        g=g_new
        x_pre=x_cur
        k+=1
    return x_cur
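The inverse update used in the `change` line can be checked against the quasi-Newton condition (31). Written for the inverse approximation $H_{k+1} = B_{k+1}^{-1}$, the condition reads $H_{k+1} y_k = s_k$ with $s_k = x_{k+1} - x_k$ and $y_k = f'(x_{k+1}) - f'(x_k)$. The sketch below assumes plain 1-D NumPy arrays instead of row-vector matrices; the name `bfgs_inverse_update` and the random test data are hypothetical.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """BFGS update of a symmetric inverse Hessian approximation H."""
    ys = y @ s
    Hy = H @ y
    return (H + (ys + y @ Hy) * np.outer(s, s) / ys**2
              - (np.outer(Hy, s) + np.outer(s, Hy)) / ys)

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)               # symmetric positive definite "Hessian"
s = rng.standard_normal(n)
y = A @ s                                 # gradient difference of a quadratic, so y^T s > 0

H_new = bfgs_inverse_update(np.eye(n), s, y)
print(np.allclose(H_new @ y, s))          # True: the secant condition holds
```

The update also keeps the matrix symmetric, and it stays positive definite as long as $y^T s > 0$, which is why the listing above stops when `ys` is close to zero.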

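Next we look at how to construct the L-BFGS algorithm [10,13]. Suppose we are at iteration $k$ of the optimization and keep the $m$ most recent pairs $s_i = x_{i+1} - x_i$ and $y_i = f'(x_{i+1}) - f'(x_i)$. With $\rho_i = 1 / (y_i^T s_i)$ and $V_i = I - \rho_i y_i s_i^T$, the BFGS update of the inverse Hessian approximation is $H_k = V_{k-1}^T H_{k-1} V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^T$. Expanding $H_k g_k$ recursively, with $H_k^0$ in place of $H_{k-m}$, gives

$$\begin{aligned} H_k g_k &= V_{k-1}^T H_{k-1} V_{k-1} g_k + s_{k-1} \rho_{k-1} s_{k-1}^T g_k \\ &= V_{k-1}^T \left( V_{k-2}^T H_{k-2} V_{k-2} + s_{k-2} \rho_{k-2} s_{k-2}^T \right) V_{k-1} g_k + s_{k-1} \rho_{k-1} s_{k-1}^T g_k \\ &= (V_{k-1}^T \cdots V_{k-m}^T) H_k^0 (V_{k-m} \cdots V_{k-1}) g_k + \sum_{j=1}^{m} (V_{k-1}^T \cdots V_{k-j+1}^T) s_{k-j} \rho_{k-j} s_{k-j}^T (V_{k-j+1} \cdots V_{k-1}) g_k \end{aligned} \tag{32}$$

where the products are empty for $j = 1$. Define, with $q_0 = g_k$,

$$q_i = (V_{k-i} \cdots V_{k-2} V_{k-1}) g_k \tag{33}$$

$$a_i = \rho_{k-i} s_{k-i}^T q_{i-1} \tag{34}$$

so that

$$q_i = V_{k-i} q_{i-1} = q_{i-1} - \rho_{k-i} y_{k-i} s_{k-i}^T q_{i-1} = q_{i-1} - a_i y_{k-i} \tag{35}$$

Then $H_k g_k$ can be evaluated from the inside out. With $P_{m+1} = H_k^0 q_m$,

$$H_k g_k = P_1 = V_{k-1}^T P_2 + s_{k-1} a_1 \tag{36}$$

$$P_2 = V_{k-2}^T P_3 + s_{k-2} a_2 \tag{37}$$

and in general

$$P_i = V_{k-i}^T P_{i+1} + s_{k-i} a_i = P_{i+1} + s_{k-i} \left( a_i - \rho_{k-i} y_{k-i}^T P_{i+1} \right) \tag{38}$$

Algorithm 9 requires an initial matrix $H_k^0$. A common choice is the scaled identity $H_k^0 = \gamma_k I$ with

$$\gamma_k = \frac{y_{k-1}^T s_{k-1}}{y_{k-1}^T y_{k-1}} \tag{39}$$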
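The recursions (33)-(38) can be sketched as a two-loop procedure and compared with the matrix obtained by applying the BFGS inverse update explicitly to $H_k^0 = \gamma_k I$. This is a minimal check, assuming plain 1-D NumPy arrays and history lists ordered from oldest to newest; the names `two_loop` and `explicit_H` and the random test data are hypothetical.

```python
import numpy as np

def two_loop(g, s_list, y_list, gamma):
    """Return H_k g via the two-loop recursion; lists are ordered oldest to newest."""
    q = g.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []                                    # a_1, a_2, ... (newest pair first)
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)                          # (34)
        alphas.append(a)
        q = q - a * y                              # (35)
    p = gamma * q                                  # P_{m+1} = H_k^0 q_m
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * (y @ p)
        p = p + s * (a - beta)                     # (38)
    return p

def explicit_H(s_list, y_list, gamma):
    """Apply H <- V^T H V + rho s s^T for every stored pair, starting from gamma * I."""
    n = s_list[0].size
    H = gamma * np.eye(n)
    for s, y in zip(s_list, y_list):
        rho = 1.0 / (y @ s)
        V = np.eye(n) - rho * np.outer(y, s)
        H = V.T @ H @ V + rho * np.outer(s, s)
    return H

rng = np.random.default_rng(0)
n, m = 5, 3
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                        # SPD matrix, so that y^T s > 0
s_list = [rng.standard_normal(n) for _ in range(m)]
y_list = [A @ s for s in s_list]
gamma = (y_list[-1] @ s_list[-1]) / (y_list[-1] @ y_list[-1])   # (39)
g = rng.standard_normal(n)

print(np.allclose(two_loop(g, s_list, y_list, gamma),
                  explicit_H(s_list, y_list, gamma) @ g))       # True
```

The two-loop version never forms an $n \times n$ matrix: it needs only $O(mn)$ memory and time per iteration, which is the point of L-BFGS.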

import copy
import numpy as np

def LBFGS(fun,dfun,theta,args,mode=0,eps=1e-12,max_iter=1e4):
    """
    Find the minimum of the objective function f(x) with L-BFGS.
    fun: objective function f(x)
    dfun: computes the gradient of f(x)
    theta: start vector of parameters of the model
    args: parameters needed by fun and dfun
    mode: index of the line search algorithm
    """
    x_pre=x_cur=theta
    s_arr=[]
    y_arr=[]
    Hscale=1
    k=0
    while k<max_iter:
        g=dfun(x_cur,args)
        d=LBFGSSearchDirection(y_arr,s_arr,Hscale,-g)
        # LineSearch is not defined in this post; presumably it dispatches
        # on mode to one of the line search functions above
        step=LineSearch(fun,dfun,x_pre,args,d,1,mode)
        s=step*d
        x_cur=x_pre+s
        y=dfun(x_cur,args)-g
        if np.sum(np.abs(s))<eps: # the step is negligible: converged
            return x_cur
        x_pre=x_cur
        k+=1
        y_arr,s_arr,Hscale=LBFGSUpdate(y,s,y_arr,s_arr)
    return x_cur


def LBFGSSearchDirection(y_arr,s_arr,Hscale,g):
    """
    Estimate the search direction with the L-BFGS two-loop recursion.
    y_arr: list of stored y_i=f'(x_{i+1})-f'(x_i), oldest first
    s_arr: list of stored s_i=x_{i+1}-x_i, oldest first
    Hscale: a scale used to initialize the inverse of the Hessian matrix
    g: a row vector representing -f'(x_k)
    """
    histNum=len(s_arr) # number of update pairs stored
    if not histNum:
        return g
    a_arr=[0 for i in range(histNum)]
    rho=[0 for i in range(histNum)]
    q=g.copy() # do not modify the caller's vector in place
    for i in range(1,histNum+1): # first loop: newest pair to oldest pair
        s=s_arr[histNum-i]
        y=y_arr[histNum-i]
        rho[histNum-i]=1/np.sum(s*y.T)
        a_arr[i-1]=rho[histNum-i]*np.sum(s*q.T)
        q=q-a_arr[i-1]*y
    P=Hscale*q
    for i in range(histNum,0,-1): # second loop: oldest pair to newest pair
        y=y_arr[histNum-i]
        s=s_arr[histNum-i]
        beta=rho[histNum-i]*np.sum(y*P.T)
        P=P+s*(a_arr[i-1]-beta)
    return P


def LBFGSUpdate(y,s,oldy,olds,m=1e2):
    """
    Refresh the stored update pairs.
    y: f'(x_{k+1})-f'(x_k)
    s: x_{k+1}-x_k
    oldy: [y0,y1,...], which is a list
    olds: [s0,s1,...], which is a list
    m: number of pairs to store (default: 100)
    """
    eps=1e-12
    ys=np.sum(y*s.T)
    yy=np.sum(y*y.T)
    if yy==0 or ys/yy<eps: # skip the update and keep the scale of the last stored pair
        if not oldy:
            return oldy,olds,1
        y_last,s_last=oldy[-1],olds[-1]
        return oldy,olds,np.sum(y_last*s_last.T)/np.sum(y_last*y_last.T)
    Hscale=ys/yy # scale used to initialize H_{k-m}, see (39)
    cur_m=len(oldy)
    if cur_m>=m: # history is full: drop the oldest pair
        oldy.pop(0)
        olds.pop(0)
    oldy.append(copy.deepcopy(y))
    olds.append(copy.deepcopy(s))
    return oldy,olds,Hscale

References

[1] Backtracking line search. http://en.wikipedia.org/wiki/Backtracking_line_search.

[2] Bisection method. http://en.wikipedia.org/wiki/Bisection_method.

[3] Gradient descent. http://en.wikipedia.org/wiki/Gradient_descent.

[4] Limited-memory BFGS. http://en.wikipedia.org/wiki/Limited-memory_BFGS.

[5] Line search methods. http://pages.cs.wisc.edu/~ferris/cs730/chap3.pdf.

[6] Line search methods:step length selection. http://terminus.sdsu.edu/SDSU/Math693a_f2013/Lectures/06/lecture.pdf.

[7] Math 408a line search methods. https://www.math.washington.edu/~burke/crs/408/lectures/L7-line-search.pdf.

[8] Newton’s method. http://en.wikipedia.org/wiki/Newton%27s_method.

[9] Nonlinear programming algorithms. http://www.math.bme.hu/~bog/GlobOpt/Chapter5.pdf.

[10] Overview of quasi-Newton optimization methods. https://homes.cs.washington.edu/~galen/files/quasi-newton-notes.pdf.

[11] Quasi-Newton method. http://en.wikipedia.org/wiki/Quasi-Newton_method.

[12] Unconstrained minimization. http://www.ing.unitn.it/~bertolaz/2-teaching/2011-2012/AA-2011-2012-OPTIM/lezioni/slides-mND.pdf.

[13] Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.

Author: JeromeWang
Email: yunfeiwang@hust.edu.cn
Source: http://www.cnblogs.com/jeromeblog/