Basic Primal-Dual Optimization Methods


This is a continuation of the second half of MATH 173B at UCSD (at the time of writing, Optimization Methods in Data Science II), which covered Dual Ascent, the Augmented Lagrangian Method of Multipliers, and ADMM. The original lecture notes didn’t cover the convergence theorems and proofs at the depth that I wanted, so I thought that I’d write something to extend them. Proofs of Danskin’s Theorem and Finsler’s Lemma, which are used in the convergence proofs, are also included at the end to round things out.

Dual Ascent:

For simplicity we consider an equality constrained problem of the following form.

$$\begin{gather*} \min f(x)\\ \text{s.t. }Ax=b \end{gather*}$$

For all algorithms shown here, one can swap in a different set of constraints by replacing the constraint in each part of the algorithm. As a refresher, we can define the dual problem using the Lagrangian $L(x,y)=f(x)+y^T(Ax-b)$ and the Lagrange dual $g(y)=\min_xL(x,y)$ below.

$$\max_y g(y)$$

Since this removes the need to explicitly handle the constraint, we could consider running Gradient Ascent on the dual problem rather than running Gradient Descent on the primal, which allows us to derive the update rule for Dual Ascent.

$$y^{(t+1)}=y^{(t)}+\alpha_t\nabla g(y^{(t)})$$

This leads to the following algorithm for the problem we considered originally, since $\nabla g(y)=Ax^*(y)-b$, where $x^*(y)$ is the optimal $x$ given some $y$.

$$\begin{gather*} x^{(t+1)}=\text{arg}\min_xL(x,y^{(t)})\\ y^{(t+1)}=y^{(t)}+\alpha_t(Ax^{(t+1)}-b) \end{gather*}$$
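As a concrete sketch, here is Dual Ascent on a small equality-constrained quadratic program (all problem data below is made up for illustration). For $f(x)=\frac{1}{2}x^TPx+q^Tx$, the $x$-update has the closed form given by $Px+q+A^Ty=0$.

```python
import numpy as np

# Dual Ascent on a made-up strongly convex QP: min (1/2) x^T P x + q^T x  s.t.  Ax = b.
P = np.diag([2.0, 3.0, 4.0, 5.0, 6.0])   # f is m-strongly convex with m = 2
q = np.array([1.0, -1.0, 2.0, 0.0, 1.0])
A = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

m_sc = 2.0
alpha = 0.9 * 2 * m_sc / np.linalg.norm(A, 2) ** 2   # step size inside (0, 2m/||A||^2)

y = np.zeros(2)
for _ in range(500):
    x = np.linalg.solve(P, -(q + A.T @ y))   # x-update: argmin_x L(x, y)
    y = y + alpha * (A @ x - b)              # ascent step on g

# reference solution from the KKT system for comparison
kkt = np.block([[P, A.T], [A, np.zeros((2, 2))]])
sol = np.linalg.solve(kkt, np.concatenate([-q, b]))
print(np.linalg.norm(A @ x - b))             # primal residual, close to 0
```

The iterates can be checked against the solution of the KKT system $Px+A^Ty=-q$, $Ax=b$.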

One of the main benefits of this method, which will matter in later sections, is that it can be parallelized in many real-world applications. Consider the optimization of an $f(w)$ of the following form, which is common when optimizing over a dataset (with $w$ split into many $x_i$ terms to allow for parallelization).

$$f(w)=\sum^n_{i=1}f_i(w)=\sum^n_{i=1}f_i(x_i)$$

Due to the structure of the Lagrangian, we can separate it over the components of $x$, where $a_i$ denotes the $i$-th column of $A$.

$$\begin{gather*} L(x,y)=\left[\sum^n_{i=1}f_i(x_i)\right]+[y^T(Ax-b)]=\sum^n_{i=1}L_i(x_i,y)\\ L_i(x_i,y)=f_i(x_i)+y^Ta_ix_i-\frac{y^Tb}{n} \end{gather*}$$

Due to this, we can separate the optimization step over $x$ within the algorithm itself, letting us parallelize it.

$$\begin{gather*} x^{(t+1)}_i=\text{arg}\min_{x_i}L_i(x_i,y^{(t)})\\ y^{(t+1)}=y^{(t)}+\alpha_t(Ax^{(t+1)}-b) \end{gather*}$$
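A sketch of the separable case (again with made-up data): with $f_i(x_i)=\frac{p_i}{2}x_i^2+q_ix_i$, each argmin over $L_i$ is an independent scalar problem, so the $x$-update could be distributed; the single vectorized line below stands in for $n$ parallel workers.

```python
import numpy as np

# Separable dual ascent: f(x) = sum_i (p_i/2) x_i^2 + q_i x_i (made-up data),
# so each coordinate's argmin of L_i depends only on x_i.
p = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
q = np.array([1.0, -1.0, 2.0, 0.0, 1.0])
A = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
alpha = 0.5

y = np.zeros(2)
for _ in range(500):
    x = -(q + A.T @ y) / p      # coordinate-wise argmin of L_i(x_i, y)
    y = y + alpha * (A @ x - b)

print(np.linalg.norm(A @ x - b))   # primal residual, close to 0
```

The per-coordinate updates together recover exactly the joint minimizer of the full Lagrangian, which is the point of the separation.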

Convergence:

Given that $f(x)$ is $m$-strongly convex and $L$-smooth and $0<\alpha<\frac{2m}{\|A\|^2}$, we know that $Ax^{(t)}-b\rightarrow 0$, $x^{(t)}\rightarrow x^*$, and $y^{(t)}\rightarrow y^*$.

The proof follows by showing that Dual Ascent makes constant improvement on $g(y)$ and the fact that $x^*=\text{arg}\min_xL(x,y^*)$.

We can first use the assumption that $f$ is strongly convex to prove that $g(y)$ is smooth. This can be done quickly with known properties or the connection of $g(y)$ to the convex conjugate, but for completeness we can derive it from scratch here through algebraic manipulation of two points $x_1=x(y_1)$ and $x_2=x(y_2)$, where $x(y)=\text{arg}\min_xL(x,y)$. Since $x_1$ and $x_2$ minimize the Lagrangian, we know that they satisfy the optimality conditions below.

$$\nabla f(x_1)+A^Ty_1=0\quad\nabla f(x_2)+A^Ty_2=0$$

We can subtract these to derive a statement involving both points.

$$\nabla f(x_1)-\nabla f(x_2)=A^T(y_2-y_1)$$

Since $f$ is $m$-strongly convex, the following holds from an equivalent condition for strongly convex functions, which we can then simplify using the previous statement.

$$\begin{gather*} m\|x_1-x_2\|^2\leq(x_1-x_2)^T(\nabla f(x_1)-\nabla f(x_2))\\ m\|x_1-x_2\|^2\leq (x_1-x_2)^TA^T(y_2-y_1) \end{gather*}$$

We can then simplify using the Cauchy-Schwarz Inequality and divide by $m\|x_1-x_2\|$ to get the definition of $x(y)$ being Lipschitz.

$$\begin{gather*} m\|x_1-x_2\|^2\leq\|A(x_1-x_2)\|\|y_2-y_1\|\leq\|A\|\|x_1-x_2\|\|y_2-y_1\|\\ \|x_1-x_2\|\leq\frac{\|A\|}{m}\|y_1-y_2\| \end{gather*}$$
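This Lipschitz bound can be spot-checked numerically on a strongly convex quadratic, where $x(y)$ solves $Px+q+A^Ty=0$ (all data below is made up):

```python
import numpy as np

# Check ||x(y1) - x(y2)|| <= (||A||/m) ||y1 - y2|| for the quadratic
# f(x) = (1/2) x^T P x + q^T x, which is m-strongly convex with m = 2.
rng = np.random.default_rng(1)
P = np.diag([2.0, 3.0, 4.0, 5.0, 6.0])
q = rng.standard_normal(5)
A = rng.standard_normal((2, 5))
m_sc = 2.0

def x_of_y(y):
    # minimizer of L(x, y): P x + q + A^T y = 0
    return np.linalg.solve(P, -(q + A.T @ y))

lipschitz = np.linalg.norm(A, 2) / m_sc
ok = True
for _ in range(100):
    y1, y2 = rng.standard_normal(2), rng.standard_normal(2)
    lhs = np.linalg.norm(x_of_y(y1) - x_of_y(y2))
    ok &= lhs <= lipschitz * np.linalg.norm(y1 - y2) + 1e-12
print(ok)
```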

By Danskin’s Theorem we know that $\nabla g(y)=Ax(y)-b$, and since composing a Lipschitz function with a linear map keeps it Lipschitz, $\nabla g$ is Lipschitz with constant $\|A\|\cdot\frac{\|A\|}{m}=\frac{\|A\|^2}{m}$, so $g(y)$ is $L$-smooth with $L=\frac{\|A\|^2}{m}$. This means we can use the Descent Lemma on $g(y)$ between $y^{(t+1)}$ and $y^{(t)}$, and then use $y^{(t+1)}-y^{(t)}=\alpha\nabla g$ to simplify.

$$\begin{gather*} g(y^{(t+1)})\geq g(y^{(t)})+\nabla g(y^{(t)})^T(y^{(t+1)}-y^{(t)})-\frac{L}{2}\|y^{(t+1)}-y^{(t)}\|^2\\ g(y^{(t+1)})-g(y^{(t)})\geq\alpha\|\nabla g\|^2-\frac{L\alpha^2}{2}\|\nabla g\|^2=\alpha\left(1-\frac{\alpha L}{2}\right)\|\nabla g\|^2 \end{gather*}$$

This shows that $g(y^{(t+1)})>g(y^{(t)})$ if we have $\alpha(1-\frac{\alpha L}{2})>0$, which simplifies to $\alpha<\frac{2}{L}$; since the dual’s smoothness constant is $L=\frac{\|A\|^2}{m}$, this is exactly the assumed step-size condition. With this step-size we know that as long as $\|\nabla g(y^{(t)})\|\neq 0$, we are making constant progress towards optimality. This also proves $g(y^{(t)})\rightarrow g(y^*)$, which will be used later.

Since $g(y)$ is bounded from above by the primal optimal value $p^*$, the sum of the increases must also be bounded, since it forms a telescoping sum, which is detailed below. Since we are still running a standard gradient method, the dual objective values are non-decreasing, so the only way for the sum to stay finite is if $\|\nabla g(y^{(t)})\|\rightarrow 0$ (since $g(y^{(t+1)})-g(y^{(t)})\geq\alpha(1-\frac{\alpha L}{2})\|\nabla g(y^{(t)})\|^2$), which proves that primal feasibility is reached.

$$\sum^\infty_{t=0}\left(g(y^{(t+1)})-g(y^{(t)})\right)\leq p^*-g(y^{(0)})<\infty$$

To prove that $y^{(t)}\rightarrow y^*$, we can first prove that the dual problem is strongly concave given the assumptions made. Since $f$ is $L$-smooth, we know that $\nabla^2f(x)\preceq LI$ and that $(\nabla^2f)^{-1}\succeq\frac{1}{L}I$ from the inversion property of eigenvalues. Since $A$ is full rank, we know the smallest eigenvalue of $AA^T$ is greater than zero, letting the following prove that $g(y)$ is strongly concave.

$$\begin{gather*} \nabla^2 g(y)=-A(\nabla^2f(x(y)))^{-1}A^T\preceq -\frac{1}{L}AA^T\\ \nabla^2g(y)\preceq-\frac{\sigma_\text{min}(AA^T)}{L}I \end{gather*}$$

Since $g(y)$ is strongly concave, we know that it satisfies the PL-Condition, which we can use to write the following inequality; along with $g(y^{(t)})\rightarrow g(y^*)$, this proves that $y^{(t)}\rightarrow y^*$.

$$g(y^*)-g(y^{(t)})\geq\frac{\sigma_\text{min}(AA^T)}{2L}\|y^{(t)}-y^*\|^2$$

Finally, since both primal feasibility and dual optimality are reached, we know that $x^{(t)}\rightarrow x^*$ from the fact that $x^*=\text{arg}\min_xL(x,y^*)$ and the fact that $f$ is strongly convex, leading to one minimum.
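For a quadratic $f$ with Hessian $P$, the dual Hessian is exactly $-AP^{-1}A^T$; the check below (made-up data) confirms its eigenvalues are at most $-\sigma_\text{min}(AA^T)/L$, with $L$ taken as the largest eigenvalue of $P$:

```python
import numpy as np

# Strong concavity of the dual for a quadratic primal (made-up data).
P = np.diag([2.0, 3.0, 4.0, 5.0, 6.0])
A = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 1.0]])

dual_hess = -A @ np.linalg.solve(P, A.T)        # = -A P^{-1} A^T
L_smooth = np.linalg.eigvalsh(P).max()          # smoothness constant of f
sigma_min = np.linalg.eigvalsh(A @ A.T).min()   # > 0 since A has full row rank

# every eigenvalue of the dual Hessian should be <= -sigma_min / L
print(np.linalg.eigvalsh(dual_hess).max() <= -sigma_min / L_smooth + 1e-12)
```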

Augmented Lagrangian Method of Multipliers:

The main problem with Dual Ascent is that it has some strict requirements to guarantee convergence. Problems that are not strictly convex, or problems where the function can take the value $+\infty$, will not work, as the update for $x$ becomes a problem if the Lagrangian does not have a unique minimum. The Augmented Lagrangian $L_\rho(x,y)$ tries to fix this issue by adding a penalty term to the Lagrangian, which allows for more stability under weaker assumptions.

$$L_\rho(x,y)=f(x)+y^T(Ax-b)+\frac{\rho}{2}\|Ax-b\|^2_2$$

The Augmented Lagrangian Method of Multipliers, shortened to ALM, then performs the exact same calculations as Dual Ascent with this new Augmented Lagrangian.

$$\begin{gather*} x^{(t+1)}=\text{arg}\min_{x}L_\rho(x,y^{(t)})\\ y^{(t+1)}=y^{(t)}+\rho(Ax^{(t+1)}-b) \end{gather*}$$
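A minimal ALM sketch on the same style of made-up QP; the penalty only changes the $x$-update’s linear system, which becomes $(P+\rho A^TA)x=-(q+A^Ty)+\rho A^Tb$, and $\rho$ doubles as the dual step size.

```python
import numpy as np

# ALM on a made-up QP: min (1/2) x^T P x + q^T x  s.t.  Ax = b.
P = np.diag([2.0, 3.0, 4.0, 5.0, 6.0])
q = np.array([1.0, -1.0, 2.0, 0.0, 1.0])
A = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
rho = 10.0

y = np.zeros(2)
for _ in range(100):
    # x-update: argmin_x L_rho(x, y) solves (P + rho A^T A) x = -(q + A^T y) + rho A^T b
    x = np.linalg.solve(P + rho * A.T @ A, -(q + A.T @ y) + rho * A.T @ b)
    y = y + rho * (A @ x - b)

print(np.linalg.norm(A @ x - b))   # primal residual, close to 0
```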

Convergence:

The convergence theorem for ALM relies on converging to a local minimum, contrasting with the other two algorithms, which focus on global minima. Assume $f(x)$ and the constraint function are twice continuously differentiable, $x^*$ is a local minimum satisfying the second-order sufficient conditions for the Lagrangian, and $A$ has full rank. With these we know that there exists a threshold $\hat{\rho}$ such that for all $\rho>\hat{\rho}$ we have $x^{(t)}\rightarrow x^*$ and $y^{(t)}\rightarrow y^*$, and the convergence of $y^{(t)}$ is linear with a ratio proportional to $\frac{1}{\rho}$. Much of the proof only cares about the existence of such a $\rho$, but in practical applications the $\rho$ that works best is still reasonable.

The proof follows by first ensuring there is some $\rho$ that can make $L_\rho$ convex, using it to guarantee that the algorithm’s iterations are well-defined, and then proving the linear convergence rate by relating iterates through a contraction.

By definition of $x^*$ satisfying the second-order sufficient conditions, we know that $\nabla^2L(x^*,y^*)$ is positive definite on the null space of $A$. By Finsler’s Lemma, we know that for large enough $\rho$ the augmented matrix below (which forms the Hessian of the Augmented Lagrangian) is positive definite, which means the Augmented Lagrangian is convex around $x^*$ for a large enough $\rho$.

$$\nabla^2_{xx}L_\rho(x^*,y^*)=\nabla^2_{xx}L(x^*,y^*)+\rho A^TA$$

Since this Hessian is positive definite, we know that $\nabla^2_{xx}L_\rho(x^*,y^*)$ is non-singular. Along with the fact that $G(x^*,y^*)=\nabla f(x^*)+A^Ty^*=0$ by the KKT conditions, we can use the Implicit Function Theorem to define a unique function $x(y)$ (the corresponding minimizing $x$ for any given $y$) such that $\nabla_xL_\rho(x(y),y)=0$ in a neighborhood around $y^*$. The theorem gives us the derivative of this function, and since the size of the neighborhood depends on $\rho$ (as we are able to overcome non-convexity), we are able to make general statements about the problem as a whole using it.

$$\nabla x(y)=-[\nabla^2_{xx}L_\rho(x(y),y)]^{-1}A^T$$

Now that we have a derivative of $x(y)$, we can start showing that the update steps of the algorithm converge. To start, we can define a surrogate $T(y)$ for the updates as follows.

$$\begin{gather*} y^{(t+1)}=y^{(t)}+\rho(Ax(y^{(t)})-b)\\ T(y)=y+\rho(Ax(y)-b) \end{gather*}$$

We can then use a first-order Taylor expansion of $T(y^{(t)})$ to find an approximation of the above statement, which can be simplified since $y^{(t+1)}=T(y^{(t)})$ and by definition $y^*=T(y^*)$.

$$\begin{gather*} T(y^{(t)})\approx T(y^*)+\nabla T(y^*)(y^{(t)}-y^*)\\ y^{(t+1)}-y^*\approx \nabla T(y^*)(y^{(t)}-y^*) \end{gather*}$$

This is where we can use the previously found derivative of $x(y)$ to define the gradient $\nabla T(y^*)$. Once the previous terms are plugged in, we can simplify further using the Woodbury Matrix Identity.

$$\begin{gather*} \nabla T(y)=\frac{\partial}{\partial y}[y+\rho(Ax(y)-b)]=I+\rho A\nabla x(y)\\ \nabla T(y^*)=I-\rho A[\nabla^2_{xx}L+\rho A^TA]^{-1}A^T=(I+\rho A(\nabla^2_{xx}L)^{-1}A^T)^{-1} \end{gather*}$$

As ρ\rho gets larger, the impact of the added identity matrix becomes negligible in comparison to the other terms, so we can remove it for now.

$$\begin{gather*} \nabla T(y^*)=(I+\rho M)^{-1}\approx(\rho M)^{-1}=\frac{1}{\rho}M^{-1}\\ M=A(\nabla^2_{xx}L)^{-1}A^T \end{gather*}$$

This means that the norm of the gradient is bounded by $C/\rho$ for some constant $C$, so we know that $T(y)$ forms a contraction for large enough $\rho$.

$$\begin{gather*} \|\nabla T(y^*)\|\leq\frac{C}{\rho}\\ \|y^{(t+1)}-y^*\|\lesssim\frac{C}{\rho}\|y^{(t)}-y^*\| \end{gather*}$$

This means that the error is shrunk by a factor proportional to $1/\rho$ at each step, so the convergence to $y^*$ is linear for a large enough $\rho$. Since this means that $y^{(t)}\rightarrow y^*$, and we know that $x(y)$ is continuous, we know that $x^{(t)}\rightarrow x^*$ as well, thus concluding the proof.
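On a quadratic problem the Taylor expansion is exact, so the $1/\rho$ scaling of the contraction ratio can be observed directly (made-up data; `contraction` is a helper defined only for this check):

```python
import numpy as np

# Empirical check that ALM's dual error contracts by a factor on the order
# of 1/rho per iteration on a quadratic toy problem.
P = np.diag([2.0, 3.0, 4.0, 5.0, 6.0])
q = np.array([1.0, -1.0, 2.0, 0.0, 1.0])
A = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

# exact multiplier from the KKT system
kkt = np.block([[P, A.T], [A, np.zeros((2, 2))]])
y_star = np.linalg.solve(kkt, np.concatenate([-q, b]))[5:]

def contraction(rho, steps=5):
    # worst observed per-step ratio ||y_new - y*|| / ||y - y*||
    y = np.zeros(2)
    ratios = []
    for _ in range(steps):
        x = np.linalg.solve(P + rho * A.T @ A, -(q + A.T @ y) + rho * A.T @ b)
        y_new = y + rho * (A @ x - b)
        ratios.append(np.linalg.norm(y_new - y_star) / np.linalg.norm(y - y_star))
        y = y_new
    return max(ratios)

print(contraction(10.0), contraction(100.0))  # second ratio roughly 10x smaller
```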

Alternating Direction Method of Multipliers:

Although ALM allows us to get much more stable and faster convergence results in comparison to Dual Ascent, by using ALM we miss out on the ability to parallelize the primal update. The Alternating Direction Method of Multipliers, shortened ADMM, is an algorithm that allows for convergence under weak assumptions in the same way as ALM and also is able to achieve parallelization in the same way as Dual Ascent, getting the best of both worlds.

ADMM considers a problem over two decision variables $x$ and $z$; to simplify, we will consider a problem of the form below.

$$\begin{gather*} \min f(x)+g(z)\\ \text{s.t. }Ax+Bz=c \end{gather*}$$

The method then performs ALM on this problem, updating $x$, $z$, and $y$ at each iteration, with one key distinction in how the updates on $x$ and $z$ are made. Instead of updating both jointly like ALM would with $(x^{(t+1)},z^{(t+1)})=\text{arg}\min_{x,z}L_\rho(x,z,y^{(t)})$, ADMM uses an alternating update, first updating $x$ and then $z$.

$$\begin{gather*} x^{(t+1)}=\text{arg}\min_xL_\rho(x,z^{(t)},y^{(t)})\\ z^{(t+1)}=\text{arg}\min_zL_\rho(x^{(t+1)},z,y^{(t)})\\ y^{(t+1)}=y^{(t)}+\rho(Ax^{(t+1)}+Bz^{(t+1)}-c) \end{gather*}$$
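A minimal ADMM sketch with quadratic $f$ and $g$ (all data made up), where both argmins reduce to small linear solves:

```python
import numpy as np

# ADMM on: f(x) = (1/2)x^T P x + q^T x, g(z) = (1/2)z^T Q z + r^T z, s.t. Ax + Bz = c.
P, Q = np.diag([1.0, 2.0]), np.diag([3.0, 1.0])
q, r = np.array([1.0, -2.0]), np.array([0.5, 1.0])
A, B = np.eye(2), np.eye(2)
c = np.array([1.0, 1.0])
rho = 1.0

x = z = y = np.zeros(2)
for _ in range(200):
    # x-update: (P + rho A^T A) x = -q - A^T y - rho A^T (Bz - c)
    x = np.linalg.solve(P + rho * A.T @ A, -q - A.T @ y - rho * A.T @ (B @ z - c))
    # z-update: (Q + rho B^T B) z = -r - B^T y - rho B^T (Ax - c)
    z = np.linalg.solve(Q + rho * B.T @ B, -r - B.T @ y - rho * B.T @ (A @ x - c))
    y = y + rho * (A @ x + B @ z - c)

# reference optimum from the stacked KKT system
kkt = np.block([[P, np.zeros((2, 2)), A.T],
                [np.zeros((2, 2)), Q, B.T],
                [A, B, np.zeros((2, 2))]])
sol = np.linalg.solve(kkt, np.concatenate([-q, -r, c]))
print(np.linalg.norm(A @ x + B @ z - c))   # primal residual, close to 0
```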

This separation of variables allows for many use cases that the first two algorithms are not capable of. Parallelization can be achieved in a number of different ways, but one of the most common is for problems of the following form with a consensus variable $z$.

$$\begin{gather*} \min\sum^N_{i=1}f_i(x_i)\\ \text{s.t. }x_i-z=0,\text{ for all }i=1,\dots,N \end{gather*}$$

Under this we can separate the primal update the same way we did with Dual Ascent.

$$\begin{gather*} x^{(t+1)}_i=\text{arg}\min_{x_i}\left(f_i(x_i)+y^{(t)T}_i(x_i-z^{(t)})+\frac{\rho}{2}\|x_i-z^{(t)}\|^2\right)\\ z^{(t+1)}=\frac{1}{N}\sum^N_{i=1}\left(x^{(t+1)}_i+\frac{1}{\rho}y^{(t)}_i\right)\\ y^{(t+1)}_i=y^{(t)}_i+\rho(x^{(t+1)}_i-z^{(t+1)}) \end{gather*}$$
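A sketch of consensus ADMM for the toy objective $f_i(x_i)=\frac{1}{2}(x_i-a_i)^2$ (made-up data), whose optimum is $z^*=\text{mean}(a)$; each $x_i$-update is independent and could run on its own worker.

```python
import numpy as np

# Consensus ADMM for min sum_i (1/2)(x_i - a_i)^2  s.t.  x_i = z for all i.
a = np.array([1.0, 4.0, -2.0, 3.0])
rho = 1.0
N = len(a)

x = np.zeros(N)
y = np.zeros(N)
z = 0.0
for _ in range(100):
    # x_i-update: (x_i - a_i) + y_i + rho (x_i - z) = 0, solved per coordinate
    x = (a - y + rho * z) / (1 + rho)
    z = np.mean(x + y / rho)
    y = y + rho * (x - z)

print(z)   # close to mean(a) = 1.5
```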

Convergence:

If $f(x)$ and $g(z)$ are convex (so the updates are well-behaved) and Slater’s Condition holds (for the existence of a proper solution), we know that $Ax^{(t)}+Bz^{(t)}-c\rightarrow0$, $f(x^{(t)})+g(z^{(t)})\rightarrow p^*$, and $y^{(t)}\rightarrow y^*$.

The proof follows by deriving inequalities relating iterates to the optimum and to the next iterate, combining them, and then using the result to derive the convergence relationships.

To derive an inequality about global relationships, we can use the optimality conditions that a given optimum $(x^*,z^*,y^*)$ would need to satisfy. Since $y$ interacts with $x$ and $z$ separately, we can separate the derivative optimality conditions, where $\partial$ is used to denote the subdifferential of the corresponding function.

$$\begin{gather*} Ax^*+Bz^*-c=0\\ -A^Ty^*\in\partial f(x^*)\text{ and }-B^Ty^*\in\partial g(z^*) \end{gather*}$$

Since we know that convex functions satisfy $f(x)\geq f(x^*)+\left<g,x-x^*\right>$ for all subgradients $g$ at $x^*$, by the optimality conditions we know that the following hold as well.

$$\begin{gather*} f(x)-f(x^*)+\left<A^Ty^*,x-x^*\right>\geq0\\ g(z)-g(z^*)+\left<B^Ty^*,z-z^*\right>\geq0 \end{gather*}$$

We can then combine these along with the primal feasibility condition to get the first inequality we want to derive.

$$(f(x)+g(z))-(f(x^*)+g(z^*))+\left<y^*,Ax+Bz-c\right>\geq0$$

To derive our second set of inequalities, we can instead use the optimality conditions of the ADMM updates themselves.

$$\begin{gather*} 0\in\partial f(x^{(t+1)})+A^Ty^{(t)}+\rho A^T(Ax^{(t+1)}+Bz^{(t)}-c)\\ 0\in\partial g(z^{(t+1)})+B^Ty^{(t)}+\rho B^T(Ax^{(t+1)}+Bz^{(t+1)}-c) \end{gather*}$$

We can substitute the dual update $y^{(t+1)}=y^{(t)}+\rho(Ax^{(t+1)}+Bz^{(t+1)}-c)$ to simplify both.

$$\begin{gather*} 0\in\partial f(x^{(t+1)})+A^Ty^{(t+1)}-\rho A^TB(z^{(t+1)}-z^{(t)})\\ 0\in\partial g(z^{(t+1)})+B^Ty^{(t+1)} \end{gather*}$$

These both give us subderivatives that we can plug into the same property of convex functions that we did for the first inequality to derive the two new inequalities.

$$\begin{gather*} f(x)-f(x^{(t+1)})+\left<A^Ty^{(t+1)}-\rho A^TB(z^{(t+1)}-z^{(t)}),x-x^{(t+1)}\right>\geq0\\ g(z)-g(z^{(t+1)})+\left<y^{(t+1)},B(z-z^{(t+1)})\right>\geq0 \end{gather*}$$

Now that we have all three, we can combine them, evaluating the first at $(x^{(t+1)},z^{(t+1)})$ and the latter two at $x=x^*$ and $z=z^*$, to derive the following bound, where $r^{(t+1)}=Ax^{(t+1)}+Bz^{(t+1)}-c$ is the primal residual.

$$\left<y^{(t+1)}-y^*,r^{(t+1)}\right>+\rho\left<B(z^{(t+1)}-z^{(t)}),r^{(t+1)}+B(z^{(t+1)}-z^*)\right>\leq 0$$

We can apply the identity $2\left<a-b,a-c\right>=\|a-b\|^2+\|a-c\|^2-\|b-c\|^2$ to both terms, separate the step $t$ and step $t+1$ terms, and multiply by $2$ to get the following inequality, written using the Lyapunov function $V_t$. It is further simplified using the fact that $y^{(t+1)}-y^{(t)}=\rho r^{(t+1)}$.

$$\begin{gather*} V_t-V_{t+1}\geq\frac{1}{\rho}\|y^{(t+1)}-y^{(t)}\|^2+\rho\|B(z^{(t+1)}-z^{(t)})\|^2\\ V_t-V_{t+1}\geq\rho\|r^{(t+1)}\|^2+\rho\|B(z^{(t+1)}-z^{(t)})\|^2\\ V_t=\frac{1}{\rho}\|y^{(t)}-y^*\|^2+\rho\|B(z^{(t)}-z^*)\|^2 \end{gather*}$$

Since the right-hand side of the inequality is always non-negative, we know that $V_t\geq V_{t+1}$. As well, the definition of $V_t$ means that it forms a non-negative, monotonically non-increasing sequence, so the sequence must converge to some limit $V_\infty$. If we examine the following sum as $N\rightarrow\infty$, we see that it telescopes, which leads to the entire sum being bounded by the initial error.

$$\sum^N_{t=0}\left(\rho\|r^{(t+1)}\|^2+\rho\|B(z^{(t+1)}-z^{(t)})\|^2\right)=\sum^N_{t=0}(V_t-V_{t+1})=V_0-V_{N+1}\leq V_0$$

Since this applies as $N\rightarrow\infty$, the terms in the sum must themselves go to zero for the sum to be finite. This means that both $\|r^{(t)}\|\rightarrow 0$ and $\|B(z^{(t+1)}-z^{(t)})\|\rightarrow 0$, which proves primal feasibility. Since $B(z^{(t+1)}-z^{(t)})\rightarrow 0$, the condition from the $x$ update converges to $0\in\partial f(x^{(t+1)})+A^Ty^{(t+1)}$, which is the optimality condition for $x$, so $x^{(t)}\rightarrow x^*$. Since $y^{(t+1)}=y^{(t)}+\rho r^{(t+1)}$ and $\|r^{(t)}\|\rightarrow 0$, we know that $y^{(t+1)}-y^{(t)}\rightarrow 0$, so $y^{(t)}\rightarrow y^*$ since we know that feasibility is reached. From $\|B(z^{(t+1)}-z^{(t)})\|\rightarrow 0$ we know that $z^{(t+1)}-z^{(t)}\rightarrow 0$, and since the $z$ iteration is defined so that the optimality condition for $z$ is always satisfied with the current $y$, we know that $z^{(t)}\rightarrow z^*$.

Since Strong Duality holds from Slater’s Condition, we know that the KKT conditions are sufficient for optimality, so proving that these points satisfy the optimality conditions we laid out originally is enough to prove that they are converging to their optimal points, which concludes the proof.
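The Lyapunov argument can also be observed numerically by tracking $V_t$ along ADMM iterations on a toy quadratic problem (made-up data); the sequence should be non-increasing:

```python
import numpy as np

# Track V_t = (1/rho)||y - y*||^2 + rho ||B(z - z*)||^2 along ADMM iterations.
P, Q = np.diag([1.0, 2.0]), np.diag([3.0, 1.0])
q, r = np.array([1.0, -2.0]), np.array([0.5, 1.0])
A, B = np.eye(2), np.eye(2)
c = np.array([1.0, 1.0])
rho = 1.0

# reference optimum from the stacked KKT system
kkt = np.block([[P, np.zeros((2, 2)), A.T],
                [np.zeros((2, 2)), Q, B.T],
                [A, B, np.zeros((2, 2))]])
sol = np.linalg.solve(kkt, np.concatenate([-q, -r, c]))
z_star, y_star = sol[2:4], sol[4:6]

x = z = y = np.zeros(2)
V = []
for _ in range(20):
    V.append(np.linalg.norm(y - y_star) ** 2 / rho
             + rho * np.linalg.norm(B @ (z - z_star)) ** 2)
    x = np.linalg.solve(P + rho * A.T @ A, -q - A.T @ y - rho * A.T @ (B @ z - c))
    z = np.linalg.solve(Q + rho * B.T @ B, -r - B.T @ y - rho * B.T @ (A @ x - c))
    y = y + rho * (A @ x + B @ z - c)

print(all(V[t] >= V[t + 1] for t in range(len(V) - 1)))
```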

Appendix:

This includes proofs for all supporting theorems except for the Implicit Function Theorem, because the proof of that theorem relies on a chain of theorems that would trail a little off course for this.

Danskin’s Theorem:

Consider the function $\phi(x)$ below, defined for some continuous function $f(x,z)$, where $x\in\mathbb{R}^n$ and $z$ is a parameter from a compact set $Z\subset\mathbb{R}^m$ (keep in mind that $z$ being from a compact set is only so that a stable maximizer $z^*$ exists, so it can be replaced by other conditions).

$$\phi(x)=\max_{z\in Z}f(x,z)$$

If $f$ is convex in $x$ for every $z\in Z$ and $f$ has continuous partial derivatives with respect to $x$, then the directional derivative of $\phi(x)$ can be defined as below, where $Z_0(x)$ is the set of maximizing points for a given $x$.

$$\begin{gather*} \phi^\prime(x;y)=\max_{z\in Z_0(x)}\left<\nabla_xf(x,z),y\right>\\ Z_0(x)=\{z\in Z\mid f(x,z)=\phi(x)\} \end{gather*}$$
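A tiny numerical check of the theorem with $f(x,z)=zx$ over the compact (finite) set $Z=\{-1,1\}$, so $\phi(x)=|x|$; at $x=0$, where $\phi$ is not differentiable, the one-sided difference quotient still matches Danskin's formula:

```python
import numpy as np

# Danskin check: phi(x) = max_{z in {-1,1}} z*x = |x|; at x = 0 the maximizer
# set is Z0 = {-1, 1}, so phi'(0; y) = max_{z in Z0} <grad_x f, y> = |y|.
Z = np.array([-1.0, 1.0])

def phi(x):
    return np.max(Z * x)

t = 1e-8
ok = True
for y in (2.0, -3.0, 0.5):
    numeric = (phi(t * y) - phi(0.0)) / t   # one-sided difference quotient
    danskin = np.max(Z * y)                 # grad_x f(x, z) = z here
    ok &= abs(numeric - danskin) < 1e-6
print(ok)
```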

The proof follows by proving that the maximum of the individual gradients is both an upper and lower bound on the directional derivative and then combining them to prove equality.

Proving the lower bound is a simple manipulation of the definition of each function. By definition of the maximum, for any $z^*\in Z_0(x)$ the function $\phi$ forms an upper bound for $f(\cdot,z^*)$. We can use this to form an inequality with a perturbation away from $x$ in direction $y$ and then subtract $\phi(x)=f(x,z^*)$ from both sides to get it into a more workable form.

$$\begin{gather*} \phi(x+ty)\geq f(x+ty,z^*)\\ \phi(x+ty)-\phi(x)\geq f(x+ty,z^*)-f(x,z^*) \end{gather*}$$

We can then divide both sides by $t>0$ and take the limit as $t\rightarrow0^+$ to turn both sides of the inequality into statements about the derivatives of each function.

$$\begin{gather*} \liminf_{t\rightarrow0^+}\frac{\phi(x+ty)-\phi(x)}{t}\geq \lim_{t\rightarrow0^+}\frac{f(x+ty,z^*)-f(x,z^*)}{t}\\ \liminf_{t\rightarrow0^+}\frac{\phi(x+ty)-\phi(x)}{t}\geq \left<\nabla_xf(x,z^*),y\right> \end{gather*}$$

Since this applies for all $z^*\in Z_0(x)$, we know it applies for the maximizer among them, completing the lower bound section of the proof.

$$\liminf_{t\rightarrow0^+}\frac{\phi(x+ty)-\phi(x)}{t}\geq\max_{z\in Z_0(x)}\left<\nabla_xf(x,z),y\right>$$

The proof for the upper bound follows a similar manipulation of definitions, with some use of the fact that $Z$ is a compact set. By definition of the maximum, we know that the following apply for some $t>0$ and $z_t\in Z_0(x+ty)$.

$$\phi(x+ty)=f(x+ty,z_t)\quad \phi(x)\geq f(x,z_t)$$

We can subtract these two terms to derive an inequality.

$$\phi(x+ty)-\phi(x)\leq f(x+ty,z_t)-f(x,z_t)$$

We can use the Mean Value Theorem on the right-hand side to know that there must be some $\theta_t\in(0,1)$ that satisfies the following, which we can plug into the inequality and divide by $t$ to get into the familiar pre-limit derivative form.

$$\begin{gather*} f(x+ty,z_t)-f(x,z_t)=t\left<\nabla_xf(x+\theta_tty,z_t),y\right>\\ \frac{\phi(x+ty)-\phi(x)}{t}\leq\left<\nabla_xf(x+\theta_tty,z_t),y\right> \end{gather*}$$

Since $z_t$ changes as we take the limit of $t$, we need to make sure that $z_t$ has a convergent point as $t\rightarrow0^+$. Consider a sequence of shrinking step sizes $t_k\rightarrow 0^+$, where each is guaranteed by the Extreme Value Theorem to have a corresponding maximizer $z_{t_k}\in Z$. Since $Z$ is a compact set, by the Heine-Borel Theorem we know that the set is both closed and bounded. By the Bolzano-Weierstrass Theorem, we know that $\{z_{t_k}\}$ must contain a convergent subsequence $\{z_{t_{k_j}}\}$ converging to some $z_0$, which satisfies $z_0\in Z$ since $Z$ is closed.

For $z_0$ to be meaningful to the bound we need to prove that $z_0\in Z_0(x)$. For any arbitrary $z\in Z$, the fact that $z_{t_{k_j}}$ is the maximizer at step $t_{k_j}$ means the following is satisfied.

$$f(x+t_{k_j}y,z_{t_{k_j}})\geq f(x+t_{k_j}y,z)$$

We can then take the limit as $j\rightarrow\infty$, which leads to $f(x,z_0)\geq f(x,z)$; along with the fact that $f$ is continuous, this leads to $z_0$ being a maximizer at $x$, so $z_0\in Z_0(x)$. This lets us take the limit to derive an inequality that we can use to derive the desired bound.

$$\begin{gather*} \frac{\phi(x+t_{k_j}y)-\phi(x)}{t_{k_j}}\leq\left<\nabla_xf(x+\theta_{k_j}t_{k_j}y,z_{t_{k_j}}),y\right>\\ \limsup_{j\rightarrow\infty}\frac{\phi(x+t_{k_j}y)-\phi(x)}{t_{k_j}}\leq\left<\nabla_xf(x,z_0),y\right> \end{gather*}$$

As this applies for every convergent subsequence, we know that the limit superior of the sequence itself must be bounded by the maximum among the individual bounds, completing the upper bound section of the proof.

$$\limsup_{t\rightarrow0^+}\frac{\phi(x+ty)-\phi(x)}{t}\leq\max_{z\in Z_0(x)}\left<\nabla_xf(x,z),y\right>$$

Since the $\limsup$ of the difference quotient is upper bounded by the same term that the $\liminf$ is lower bounded by, the $\limsup$ and $\liminf$ must be equal, proving that the limit exists and has the following value, which proves the theorem.

$$\lim_{t\rightarrow 0^+}\frac{\phi(x+ty)-\phi(x)}{t}=\phi^\prime(x;y)=\max_{z\in Z_0(x)}\left<\nabla_xf(x,z),y\right>$$

Finsler’s Lemma:

Given a symmetric matrix $H\in\mathbb{R}^{n\times n}$ and a matrix $A\in\mathbb{R}^{m\times n}$, the following statements are equivalent.

  1. $d^THd>0$ for all $d\in\mathbb{R}^n\backslash\{0\}$ such that $Ad=0$
  2. There exists a scalar $\hat{\rho}\in\mathbb{R}$ such that for all $\rho>\hat{\rho}$, we know $H+\rho A^TA\succ0$

This tells us that there exists a threshold $\hat{\rho}$ past which an augmented form of $H$ is positive definite, making its utility to the convergence proof for ALM clear.
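A quick numerical illustration (made-up matrices): $H$ is indefinite but positive definite on the null space of $A$, and the augmented matrix becomes positive definite once $\rho$ passes the threshold (here $\hat{\rho}=1$):

```python
import numpy as np

# H is indefinite, but d^T H d > 0 on the null space of A = [1, 0],
# which is span((0, 1)); Finsler's Lemma says H + rho A^T A is eventually PD.
H = np.array([[-1.0, 0.0],
              [0.0, 1.0]])
A = np.array([[1.0, 0.0]])

def min_eig(rho):
    return np.linalg.eigvalsh(H + rho * A.T @ A).min()

print(min_eig(0.5), min_eig(2.0))   # negative below the threshold, positive above
```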

Since the theorem is one of equivalence, we need to prove both that $(1)\implies(2)$ and $(2)\implies(1)$, where the first is proven through contradiction and the second through definition.

To prove $(1)\implies(2)$, we suppose that statement $(1)$ holds but statement $(2)$ is false and then reach a contradiction. If $(2)$ is false, then for every $\rho_k=k$, $k=1,2,\dots$, there exists a vector $d_k\in\mathbb{R}^n$ such that the following holds. We can enforce $\|d_k\|=1$ since only the direction of the vector matters for the inequality.

$$d^T_k(H+kA^TA)d_k=d^T_kHd_k+k\|Ad_k\|^2\leq 0$$

Since the sequence $\{d_k\}$ is a subset of the unit sphere, which is a compact set, we can use the Bolzano-Weierstrass Theorem to know there exists a convergent subsequence $\{d_{k_j}\}$ that converges to some unit vector $d^*$. Since $\|d_k\|=1$ and $H$ is a fixed matrix, we know that $d^T_kHd_k$ is always bounded. This means that as $k\rightarrow\infty$, for $d^T_kHd_k+k\|Ad_k\|^2\leq 0$ to stay true, we need $\|Ad_k\|^2\rightarrow 0$, which means the limit below must hold.

$$\lim_{k\rightarrow\infty}\|Ad_k\|^2=\|Ad^*\|^2=0\implies Ad^*=0$$

Since $d^*$ is in the null space of $A$ and we know that the following holds, we reach a contradiction with $(1)$, thus proving that $(1)\implies(2)$.

$$\begin{gather*} \lim_{j\rightarrow\infty}(d^T_{k_j}Hd_{k_j}+k_j\|Ad_{k_j}\|^2)\leq 0\\ {d^*}^THd^*\leq 0 \end{gather*}$$

To prove $(2)\implies(1)$, we assume that $(2)$ holds, which means that for some $\rho>\hat{\rho}$ we know that for any non-zero vector $d$ the following holds.

$$d^T(H+\rho A^TA)d>0$$

We can distribute and simplify to make the inequality workable with what we need to prove $(1)$.

$$\begin{gather*} d^THd+d^T(\rho A^TA)d>0\\ d^THd+\rho(Ad)^T(Ad)>0\\ d^THd+\rho\|Ad\|^2>0 \end{gather*}$$

If the chosen $d$ satisfies $Ad=0$, then this simplifies into the statement of $(1)$, thus proving that $(2)\implies(1)$.

$$\begin{gather*} d^THd+\rho(0)>0\\ d^THd>0 \end{gather*}$$

Since we have proven both $(1)\implies(2)$ and $(2)\implies(1)$, we know that they are equivalent statements, thus proving the theorem.