Notes on FedAvg in Federated Learning

Notation

| Symbol | Meaning |
| --- | --- |
| \(F(w)\) | global objective function |
| \(w\) | parameters to be optimized |
| \(F_k(w)\) | objective function of the \(k\)-th client |
| \(p_k\) | weight of the \(k\)-th client in the global objective |
| \(E\) | number of local updates per communication round |
| \(T\) | total number of iterations, so the number of communication rounds is \(T/E\) |
| \(\eta_t\) | learning rate at step \(t\) |
| \(L\) | each \(F_k\) is \(L\)-smooth, i.e. \(\nabla^2 F_k(w)\preceq L I\) |
| \(\mu\) | each \(F_k\) is \(\mu\)-strongly convex, i.e. \(\nabla^2 F_k(w)\succeq \mu I\) |
| \(\sigma_k\) | \(E\Vert\nabla F_k(w_t^k,\xi_t^k)-\nabla F_k(w_t^k)\Vert^2\leq\sigma_k^2\) |
| \(G\) | \(E\Vert\nabla F_k(w_t^k,\xi_t^k)\Vert^2\leq G^2\) |
| \(F^*\) | optimal value of the global objective |
| \(F_k^*\) | optimal value obtained by optimizing the \(k\)-th client alone |
| \(\Gamma\) | \(F^*-\sum_k p_k F_k^*\), a measure of heterogeneity |
| \(\kappa\) | \(\frac{L}{\mu}\), which can roughly be viewed as a condition number |
| \(w^*\) | optimal parameters |
| \(\xi_t^k\) | sample drawn by client \(k\) at step \(t\) for its stochastic gradient step |

Assumptions

  1. Each \(F_k\) is \(L\)-smooth.
  2. Each \(F_k\) is \(\mu\)-strongly convex.
  3. The variance of the stochastic gradient computed by client \(k\) is bounded by \(\sigma_k^2\).
  4. The expected squared norm of every stochastic gradient is bounded: \(E\Vert\nabla F_k(w_t^k,\xi_t^k)\Vert^2\leq G^2\).

Convergence Proof under Full Participation

Introduce two auxiliary sequences, \(v_t^k\) and \(w_t^k\):

\[\begin{align*} v_{t+1}^k &= w_t^k - \eta_t\nabla F_k(w_t^k,\xi_t^k)\\ w_{t+1}^k &= \begin{cases} v_{t+1}^k & \text{for } t+1\notin I_E\\ \sum_{j=1}^{N} p_j v_{t+1}^j & \text{for } t+1\in I_E \end{cases} \end{align*}\]

Define

\[\begin{align*} \bar v_t &= \sum\nolimits_k p_k v_t^k\\ \bar w_t &= \sum\nolimits_k p_k w_t^k\\ \bar g_t &= \sum\nolimits_k p_k \nabla F_k(w_t^k)\\ g_t &= \sum\nolimits_k p_k \nabla F_k(w_t^k,\xi_t^k) \end{align*}\]

Updated parameters \(w\) are exchanged only at rounds \(t\in I_E\) (the communication rounds); the auxiliary variable \(v\) represents the parameters at rounds where no exchange takes place. Under full participation, \(\bar v_t = \bar w_t\) for all \(t\), and moreover \(\bar v_{t+1} = \bar w_t - \eta_t g_t\).
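To make the update rule concrete, below is a minimal NumPy sketch of full-participation FedAvg following the two displays above. The quadratic local objectives, the client/dimension sizes, and the step-size schedule are illustrative assumptions of mine, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: N clients, each with a quadratic local objective
# F_k(w) = (1 / n_k) * sum_i 0.5 * (a_i . w - b_i)^2, so the stochastic
# gradient on one sampled row plays the role of xi_t^k in the notes.
N, d, n_k = 4, 5, 50
A = [rng.normal(size=(n_k, d)) for _ in range(N)]
b = [rng.normal(size=n_k) for _ in range(N)]
p = np.full(N, 1.0 / N)          # p_k: client weights, here uniform

E = 5                            # local steps between communications
T = 200                          # total iterations; T / E communication rounds


def eta(t):
    """Decaying learning rate eta_t."""
    return 0.1 / (1 + 0.01 * t)


w = [np.zeros(d) for _ in range(N)]   # w_t^k, identical at t = 0

for t in range(T):
    # Local SGD step on every client:
    # v_{t+1}^k = w_t^k - eta_t * grad F_k(w_t^k, xi_t^k)
    v = []
    for k in range(N):
        i = rng.integers(n_k)                        # sample xi_t^k
        g = (A[k][i] @ w[k] - b[k][i]) * A[k][i]     # stochastic gradient
        v.append(w[k] - eta(t) * g)
    # If t+1 is a communication round (t+1 in I_E), aggregate with weights p_k;
    # otherwise every client keeps its local iterate v.
    if (t + 1) % E == 0:
        w_bar = sum(p[k] * v[k] for k in range(N))
        w = [w_bar.copy() for _ in range(N)]
    else:
        w = v

w_bar = sum(p[k] * w[k] for k in range(N))
print("final averaged parameters:", w_bar)
```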

Personal understanding:

To prove convergence we need to show that the parameters converge. Since the parameters \(\bar w_t\) are produced by gradient descent, we need to show that

\[ E\Vert \bar w_{t+1} - w^*\Vert \leq l\left(\Vert\bar w_{t} - w^*\Vert\right) \]

that is, the distance from the current iterate to the optimum \(w^*\) is smaller than the distance from the previous iterate to the optimum, and the function \(l\) can be applied recursively. In other words, the upper bound on the distance between the current iterate and \(w^*\) shrinks as the iterations proceed.

The paper works with \(\bar v - w^*\) rather than \(\bar w - w^*\), because \(\bar v\) corresponds to all clients; under partial participation \(\bar w\) is biased.

\[\begin{align*} E\Vert \bar v_{t+1} - w^*\Vert^2 &= E\Vert \bar w_t - \eta_t g_t - w^* - \eta_t\bar g_t + \eta_t\bar g_t\Vert^2\\ &= E\left(\Vert\bar w_t - \eta_t\bar g_t - w^*\Vert^2 + 2\eta_t\langle\bar w_t - \eta_t\bar g_t - w^*,\ \bar g_t - g_t\rangle + \eta_t^2\Vert\bar g_t - g_t\Vert^2\right) \tag{1}\label{eq:1} \end{align*}\]

The reason for adding and subtracting \(\eta_t\bar g_t\) above is to use \(E(\eta_t g_t - \eta_t\bar g_t)=0\) to split \((\ref{eq:1})\): the middle cross term vanishes in expectation.

We continue by computing \(E\Vert\bar w_t - \eta_t\bar g_t - w^*\Vert^2\).

\[\begin{align*} \Vert\bar w_t - \eta_t\bar g_t - w^*\Vert^2 &= \Vert\bar w_t - w^*\Vert^2 - 2\eta_t\langle\bar w_t - w^*,\ \bar g_t\rangle + \eta_t^2\Vert\bar g_t\Vert^2 \tag{2}\label{eq:2} \end{align*}\]

By \(L\)-smoothness [1],

\[\begin{align*} \Vert\nabla F_k(w_t^k)\Vert^2 \leq 2L\left(F_k(w_t^k) - F_k^*\right) \tag{3}\label{eq:3} \end{align*}\]
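For reference, \((\ref{eq:3})\) is the standard consequence of \(L\)-smoothness obtained by plugging the point \(w-\frac{1}{L}\nabla F_k(w)\) into the descent lemma:

\[\begin{align*} F_k^* \leq F_k\!\left(w-\tfrac{1}{L}\nabla F_k(w)\right) \leq F_k(w) - \tfrac{1}{L}\Vert\nabla F_k(w)\Vert^2 + \tfrac{L}{2}\cdot\tfrac{1}{L^2}\Vert\nabla F_k(w)\Vert^2 = F_k(w) - \tfrac{1}{2L}\Vert\nabla F_k(w)\Vert^2, \end{align*}\]

and rearranging gives \(\Vert\nabla F_k(w)\Vert^2\leq 2L\left(F_k(w)-F_k^*\right)\).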

Since the squared norm is a convex function, Jensen's inequality combined with \((\ref{eq:3})\) gives

\[\begin{align*} \eta_t^2\Vert\bar g_t\Vert^2 &\leq \eta_t^2\sum\nolimits_k p_k\Vert\nabla F_k(w_t^k)\Vert^2\\ &\leq 2L\eta_t^2\sum\nolimits_k p_k\left(F_k(w_t^k) - F_k^*\right) \end{align*}\]

Expanding the cross term \(-2\eta_t\langle\bar w_t - w^*,\ \bar g_t\rangle\):

\[\begin{align*} -2\eta_t\langle\bar w_t - w^*,\ \bar g_t\rangle &= -2\eta_t\sum_k p_k\langle\bar w_t - w^*,\ \nabla F_k(w_t^k)\rangle\\ &= -2\eta_t\sum_k p_k\langle\bar w_t - w_t^k,\ \nabla F_k(w_t^k)\rangle - 2\eta_t\sum_k p_k\langle w_t^k - w^*,\ \nabla F_k(w_t^k)\rangle \end{align*}\]

By the Cauchy-Schwarz inequality together with the AM-GM inequality,

\[\begin{align*} -2\langle\bar w_t - w_t^k,\ \nabla F_k(w_t^k)\rangle \leq \frac{1}{\eta_t}\Vert\bar w_t - w_t^k\Vert^2 + \eta_t\Vert\nabla F_k(w_t^k)\Vert^2 \end{align*}\]
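For completeness, this is the usual combination of Cauchy-Schwarz with AM-GM: for any vectors \(a,b\) and any \(\eta_t>0\),

\[\begin{align*} -2\langle a, b\rangle \leq 2\Vert a\Vert\,\Vert b\Vert = 2\cdot\frac{\Vert a\Vert}{\sqrt{\eta_t}}\cdot\sqrt{\eta_t}\,\Vert b\Vert \leq \frac{1}{\eta_t}\Vert a\Vert^2 + \eta_t\Vert b\Vert^2, \end{align*}\]

applied with \(a=\bar w_t-w_t^k\) and \(b=\nabla F_k(w_t^k)\).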

By \(\mu\)-strong convexity,

\[\begin{align*} -\langle w_t^k - w^*,\ \nabla F_k(w_t^k)\rangle \leq -\left(F_k(w_t^k) - F_k(w^*)\right) - \frac{\mu}{2}\Vert w_t^k - w^*\Vert^2 \end{align*}\]
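This is just the strong-convexity lower bound \(F_k(y)\geq F_k(x)+\langle\nabla F_k(x),\,y-x\rangle+\frac{\mu}{2}\Vert y-x\Vert^2\) evaluated at \(x=w_t^k\), \(y=w^*\) and rearranged:

\[\begin{align*} F_k(w^*) \geq F_k(w_t^k) - \langle w_t^k-w^*,\ \nabla F_k(w_t^k)\rangle + \frac{\mu}{2}\Vert w_t^k-w^*\Vert^2 \ \Longrightarrow\ -\langle w_t^k-w^*,\ \nabla F_k(w_t^k)\rangle \leq -\left(F_k(w_t^k)-F_k(w^*)\right) - \frac{\mu}{2}\Vert w_t^k-w^*\Vert^2. \end{align*}\]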

Therefore \((\ref{eq:2})\) can be written as

\[\begin{align*} \Vert\bar w_t - \eta_t\bar g_t - w^*\Vert^2 &= \Vert\bar w_t - w^*\Vert^2 - 2\eta_t\langle\bar w_t - w^*,\ \bar g_t\rangle + \eta_t^2\Vert\bar g_t\Vert^2\\ &\leq \Vert\bar w_t - w^*\Vert^2 + 2L\eta_t^2\sum_k p_k\left(F_k(w_t^k) - F_k^*\right)\\ &\quad + \eta_t\sum_k p_k\left(\frac{1}{\eta_t}\Vert\bar w_t - w_t^k\Vert^2 + \eta_t\Vert\nabla F_k(w_t^k)\Vert^2\right)\\ &\quad - 2\eta_t\sum_k p_k\left(F_k(w_t^k) - F_k(w^*) + \frac{\mu}{2}\Vert w_t^k - w^*\Vert^2\right)\\ &\leq (1-\mu\eta_t)\Vert\bar w_t - w^*\Vert^2 + \sum_k p_k\Vert\bar w_t - w_t^k\Vert^2\\ &\quad + 4L\eta_t^2\sum_k p_k\left(F_k(w_t^k) - F_k^*\right) - 2\eta_t\sum_k p_k\left(F_k(w_t^k) - F_k(w^*)\right) \end{align*}\]

where the last step applies \((\ref{eq:3})\) to \(\Vert\nabla F_k(w_t^k)\Vert^2\) and uses the convexity of \(\Vert\cdot\Vert^2\), i.e. \(\sum_k p_k\Vert w_t^k - w^*\Vert^2\geq\Vert\bar w_t - w^*\Vert^2\), to obtain the \((1-\mu\eta_t)\) factor.

Now take the last two terms, \(4L\eta_t^2\sum_k p_k(F_k(w_t^k)-F_k^*) - 2\eta_t\sum_k p_k(F_k(w_t^k)-F_k(w^*))\), and define \(\gamma_t = 2\eta_t(1-2L\eta_t)\). Assume \(\eta_t\) is non-increasing in \(t\) (i.e. the learning rate decays) and \(\eta_t\leq\frac{1}{4L}\); then \(\eta_t\leq\gamma_t\leq 2\eta_t\). Rearranging gives

\[\begin{align*} &4L\eta_t^2\sum_k p_k\left(F_k(w_t^k)-F_k^*\right) - 2\eta_t\sum_k p_k\left(F_k(w_t^k)-F_k(w^*)\right)\\ &= -2\eta_t(1-2L\eta_t)\sum_k p_k\left(F_k(w_t^k)-F_k^*\right) + 2\eta_t\sum_k p_k\left(F_k(w^*)-F_k^*\right)\\ &= -\gamma_t\sum_k p_k\left(F_k(w_t^k)-F^*\right) + (2\eta_t-\gamma_t)\sum_k p_k\left(F^*-F_k^*\right)\\ &= -\gamma_t\sum_k p_k\left(F_k(w_t^k)-F^*\right) + 4L\eta_t^2\Gamma \end{align*}\]

In the step that introduces \(F^*\), the term \(F_k(w_t^k)-F_k^*\) is split into \(F_k(w_t^k)-F^*+F^*-F_k^*\). To bound the remaining sum, note that

\[\begin{align*} \sum_k p_k\left(F_k(w_t^k)-F^*\right) &= \sum_k p_k\left(F_k(w_t^k)-F_k(\bar w_t)\right) + \sum_k p_k\left(F_k(\bar w_t)-F^*\right)\\ &\geq \sum_k p_k\langle\nabla F_k(\bar w_t),\ w_t^k-\bar w_t\rangle + \left(F(\bar w_t)-F^*\right)\\ &\geq -\frac{1}{2}\sum_k p_k\left[\eta_t\Vert\nabla F_k(\bar w_t)\Vert^2 + \frac{1}{\eta_t}\Vert w_t^k - \bar w_t\Vert^2\right] + \left(F(\bar w_t)-F^*\right)\\ &\geq -\sum_k p_k\left[\eta_t L\left(F_k(\bar w_t) - F_k^*\right) + \frac{1}{2\eta_t}\Vert w_t^k - \bar w_t\Vert^2\right] + \left(F(\bar w_t)-F^*\right) \end{align*}\]

Putting the above together,

\[\begin{align*} &4L\eta_t^2\sum_k p_k\left(F_k(w_t^k)-F_k^*\right) - 2\eta_t\sum_k p_k\left(F_k(w_t^k)-F_k(w^*)\right)\\ &\leq \gamma_t\sum_k p_k\left[\eta_t L\left(F_k(\bar w_t)-F_k^*\right) + \frac{1}{2\eta_t}\Vert w_t^k - \bar w_t\Vert^2\right] - \gamma_t\left(F(\bar w_t)-F^*\right) + 4L\eta_t^2\Gamma\\ &= \gamma_t(\eta_t L - 1)\sum_k p_k\left(F_k(\bar w_t)-F^*\right) + (4L\eta_t^2 + \gamma_t\eta_t L)\Gamma + \frac{\gamma_t}{2\eta_t}\sum_k p_k\Vert w_t^k - \bar w_t\Vert^2\\ &\leq 6L\eta_t^2\Gamma + \sum_k p_k\Vert w_t^k - \bar w_t\Vert^2 \end{align*}\]

The last inequality holds because \(\eta_t L - 1\leq -\frac{3}{4}\) and \(\sum_k p_k\left(F_k(\bar w_t)-F^*\right) = F(\bar w_t)-F^*\geq 0\), so the first term can be dropped, while \(\gamma_t\leq 2\eta_t\) gives \(4L\eta_t^2+\gamma_t\eta_t L\leq 6L\eta_t^2\) and \(\frac{\gamma_t}{2\eta_t}\leq 1\).

Therefore

\[ \Vert\bar w_t - \eta_t\bar g_t - w^*\Vert^2 \leq (1-\mu\eta_t)\Vert\bar w_t - w^*\Vert^2 + 6L\eta_t^2\Gamma + 2\sum_k p_k\Vert w_t^k - \bar w_t\Vert^2 \]

(the divergence term \(\sum_k p_k\Vert w_t^k-\bar w_t\Vert^2\) appears twice, once in the bound on \((\ref{eq:2})\) and once from the display above, which is what produces the \(8(E-1)^2G^2\) in \(B\) below).

Adding the bounded-gradient-variance assumption, i.e.

\[\begin{align*} E\Vert g_t - \bar g_t\Vert^2 &= E\Big\Vert\sum_k p_k\left(\nabla F_k(w_t^k,\xi_t^k) - \nabla F_k(w_t^k)\right)\Big\Vert^2\\ &\leq \sum_k p_k^2\sigma_k^2 \end{align*}\]

and the bound on how far each client drifts from the average within one communication interval, where \(t_0\leq t\) is the most recent communication round, so \(t-t_0\leq E-1\), and the decaying learning rate is assumed to satisfy \(\eta_{t_0}\leq 2\eta_t\):

\[\begin{align*} E\sum_k p_k\Vert\bar w_t - w_t^k\Vert^2 &= E\sum_k p_k\Vert w_t^k - \bar w_{t_0} - (\bar w_t - \bar w_{t_0})\Vert^2\\ &\leq E\sum_k p_k\Vert w_t^k - \bar w_{t_0}\Vert^2\\ &\leq \sum_k p_k\, E\sum_{i=t_0}^{t-1}(E-1)\,\eta_i^2\Vert\nabla F_k(w_i^k,\xi_i^k)\Vert^2\\ &\leq (E-1)^2\eta_{t_0}^2 G^2 \leq 4\eta_t^2(E-1)^2G^2 \end{align*}\]

Finally, letting \(\Delta_t = E\Vert\bar w_t - w^*\Vert^2\), we obtain

\[\begin{align*} \Delta_{t+1}\leq(1-\mu\eta_t)\Delta_t + \eta_t^2 B \end{align*}\]

where \(B=\sum_k p_k^2\sigma_k^2 + 6L\Gamma + 8(E-1)^2G^2\).
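For reference, one standard way to turn this recursion into a rate (with \(\beta\) and \(\gamma\) as assumed parameters of a diminishing step size, not fixed by the notes above) is to choose \(\eta_t=\frac{\beta}{t+\gamma}\) with \(\beta>\frac{1}{\mu}\) and \(\gamma>0\) large enough that \(\eta_1\leq\frac{1}{4L}\); a simple induction on the recursion then yields

\[\begin{align*} \Delta_t \leq \frac{v}{\gamma+t}, \qquad v=\max\left\{\frac{\beta^2 B}{\beta\mu-1},\ (\gamma+1)\Delta_1\right\}, \end{align*}\]

so \(\Delta_t=O(1/t)\), and by \(L\)-smoothness \(E[F(\bar w_t)]-F^*\leq\frac{L}{2}\Delta_t=O(1/t)\).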

References

  1. Francis Bach, Statistical machine learning and convex optimization
Original post: https://www.cnblogs.com/DemonHunter/p/12984659.html