This is going to be a continuation of the second half of MATH 173B at UCSD (which at the time of writing was Optimization Methods in Data Science II), which covered Dual Ascent, the Augmented Lagrangian Method of Multipliers, and ADMM. The original lecture notes didn’t cover the convergence theorems and proofs at a depth that I wanted, so I thought that I’d write something to extend them. The proof of Danskin’s Theorem and Finsler’s Lemma, which are used in the convergence proofs, are also included at the end to round things out.
Dual Ascent:
For simplicity we consider an equality constrained problem of the following form.
$$\min_x f(x) \quad \text{s.t.} \quad Ax = b$$
For all algorithms shown here, one can simply replace the constraints in each of the parts of the algorithm to use another set of constraints. As a refresher, we can define the dual problem for this using the Lagrangian $L(x, y) = f(x) + y^T(Ax - b)$ and the Lagrange dual $g(y) = \min_x L(x, y)$ below.
$$\max_y g(y)$$
Since this removes the need to explicitly handle the constraint, we could consider running Gradient Ascent on the dual problem rather than running Gradient Descent on the primal, which allows us to derive the update rule for Dual Ascent.
$$y^{(t+1)} = y^{(t)} + \alpha_t \nabla g(y^{(t)})$$
This leads to the following algorithm for the problem we considered originally since ∇g(y)=Ax∗(y)−b, where x∗(y) is the optimal x given some y.
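To make the iteration concrete, here is a minimal sketch of Dual Ascent for an equality-constrained quadratic (the quadratic objective, step size, and iteration count are illustrative choices, not from the original notes); the x-update has a closed form since the Lagrangian is quadratic in x.

```python
import numpy as np

# Dual Ascent sketch for a strongly convex quadratic:
#   min 0.5 x^T P x + q^T x   s.t.   Ax = b
# (P positive definite so the x-update has a closed form).
def dual_ascent(P, q, A, b, alpha, num_iters=500):
    y = np.zeros(A.shape[0])
    for _ in range(num_iters):
        # x-update: x(y) = argmin_x L(x, y), solved from the
        # optimality condition P x + q + A^T y = 0
        x = np.linalg.solve(P, -(q + A.T @ y))
        # y-update: gradient ascent on g, since grad g(y) = A x(y) - b
        y = y + alpha * (A @ x - b)
    return x, y

# small example: minimize ||x||^2 subject to x1 + x2 = 1
P = 2 * np.eye(2)
q = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x, y = dual_ascent(P, q, A, b, alpha=0.5)
print(x)  # approaches [0.5, 0.5]
```

Here $m = 2$ and $\|A\|_2^2 = 2$, so any step size below $2m/\|A\|_2^2 = 2$ converges.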
One of the main benefits of this method, to be taken into consideration in later sections, is that it can be parallelized in many real-world applications. Consider the optimization of a function $f(w)$ of the following form, one that is commonly found in applications of optimization over a dataset (with $w$ split into many $x_i$ terms to allow for parallelization).
$$f(w) = \sum_{i=1}^{n} f_i(w) = \sum_{i=1}^{n} f_i(x_i)$$
Due to the structure of the Lagrangian we can separate the Lagrangian over the variables of x.
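Because the Lagrangian separates over the blocks of $x$, the x-update splits into independent subproblems that could each run on their own worker. A minimal single-process sketch, where the per-block quadratics $f_i(x_i) = \frac{1}{2} p_i x_i^2$ and the single coupling constraint $\sum_i x_i = b$ are illustrative assumptions:

```python
import numpy as np

# Separable Dual Ascent sketch: f(x) = sum_i 0.5 * p_i * x_i^2 with one
# coupling constraint sum_i x_i = b. Each block's x-update is independent,
# so in a real system the blocks could be solved in parallel.
p = np.array([1.0, 2.0, 4.0])  # per-block curvatures (illustrative)
b = 1.0
alpha = 0.2
y = 0.0  # single dual variable for the single constraint
for _ in range(2000):
    # block i minimizes 0.5 * p_i * x_i^2 + y * x_i on its own:
    # p_i * x_i + y = 0  =>  x_i = -y / p_i
    x = -y / p
    # the dual update only needs the aggregated residual
    y = y + alpha * (x.sum() - b)
```

Only the scalar residual $\sum_i x_i - b$ has to be gathered between workers; each block's solve never sees the other blocks.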
Given that $f(x)$ is $m$-strongly convex and $L$-smooth, $A$ has full row rank, and $0 < \alpha < \frac{2m}{\|A\|_2^2}$, we know that $Ax^{(t)} - b \to 0$, $x^{(t)} \to x^*$, and $y^{(t)} \to y^*$.
The proof follows by showing that Dual Ascent makes constant improvements on $g(y)$ and the fact that $x^* = \arg\min_x L(x, y^*)$.
We can first use the assumption that $f$ is strongly convex to prove that $g(y)$ is smooth. This can be trivially done with known properties or even the connection of $g(y)$ to the convex conjugate, but for completeness we can derive it from scratch here through algebraic manipulation of two points $x_1 = x(y_1)$ and $x_2 = x(y_2)$, where $x(y) = \arg\min_x L(x, y)$. Since $x_1$ and $x_2$ minimize the Lagrangian, we know that they satisfy the optimality conditions below.
$$\nabla f(x_1) + A^T y_1 = 0, \qquad \nabla f(x_2) + A^T y_2 = 0$$
We can subtract these to derive a statement involving both points.
$$\nabla f(x_1) - \nabla f(x_2) = A^T(y_2 - y_1)$$
Since $f$ is $m$-strongly convex, the monotonicity condition $\langle \nabla f(x_1) - \nabla f(x_2), x_1 - x_2 \rangle \ge m\|x_1 - x_2\|^2$ holds; substituting the previous statement and applying Cauchy-Schwarz gives a Lipschitz bound on $x(y)$.
$$\langle A^T(y_2 - y_1), x_1 - x_2 \rangle \ge m\|x_1 - x_2\|^2 \implies \|x_1 - x_2\| \le \frac{\|A\|_2}{m}\|y_1 - y_2\|$$
By Danskin’s Theorem we know that $\nabla g(y) = Ax(y) - b$, and since $x(y)$ is Lipschitz with constant $\frac{\|A\|_2}{m}$, composing with $A$ shows that $\nabla g$ is Lipschitz with constant $L_g = \frac{\|A\|_2^2}{m}$, so $g(y)$ is $L_g$-smooth. This means we can use the Descent Lemma on $g(y)$ between $y^{(t+1)}$ and $y^{(t)}$, and then use $y^{(t+1)} - y^{(t)} = \alpha \nabla g(y^{(t)})$ to simplify.
$$g(y^{(t+1)}) \ge g(y^{(t)}) + \langle \nabla g(y^{(t)}), y^{(t+1)} - y^{(t)} \rangle - \frac{L_g}{2}\|y^{(t+1)} - y^{(t)}\|^2 = g(y^{(t)}) + \alpha\left(1 - \frac{\alpha L_g}{2}\right)\|\nabla g(y^{(t)})\|^2$$
This shows that $g(y^{(t+1)}) > g(y^{(t)})$ whenever $\alpha\left(1 - \frac{\alpha L_g}{2}\right) > 0$ (writing $L_g = \frac{\|A\|_2^2}{m}$ for the smoothness constant of $g$), which simplifies to $\alpha < \frac{2}{L_g} = \frac{2m}{\|A\|_2^2}$. With this step-size we know that as long as $\|\nabla g(y^{(t)})\| \neq 0$, we are making constant progress on reaching optimality. The dual values are therefore monotonically increasing, which will be used later to prove $g(y^{(t)}) \to g(y^*)$.
Since $g(y)$ is bounded from above by the primal optimal value $p^*$, the sum of the increases must also be bounded since it forms a telescoping sum, which is detailed below. Since each step increases the dual objective by at least a constant multiple of $\|\nabla g(y^{(t)})\|^2$, the only way for the sum to stay finite is if $\|\nabla g(y^{(t)})\| \to 0$ (recall $\nabla g(y) = Ax(y) - b$), which proves that primal feasibility is reached.
$$\sum_{t=0}^{\infty} \left( g(y^{(t+1)}) - g(y^{(t)}) \right) \le p^* - g(y^{(0)}) < \infty$$
To prove that $y^{(t)} \to y^*$, we can first prove that the dual problem is strongly concave given the assumptions made. Since $f$ is $L$-smooth, we know that $\nabla^2 f(x) \preceq LI$ and that $(\nabla^2 f)^{-1} \succeq \frac{1}{L}I$ from the inversion property of eigenvalues. Since $A$ has full row rank, we know the smallest eigenvalue of $AA^T$ is greater than zero, letting the following prove that $g(y)$ is strongly concave.
$$\nabla^2 g(y) = -A\left(\nabla^2 f(x(y))\right)^{-1} A^T \preceq -\frac{\sigma_{\min}(AA^T)}{L} I$$
Since $g(y)$ is strongly concave with parameter $\mu = \frac{\sigma_{\min}(AA^T)}{L}$ (with $L$ the smoothness constant of $f$), it satisfies the PL condition $g(y^*) - g(y^{(t)}) \le \frac{1}{2\mu}\|\nabla g(y^{(t)})\|^2$, so $\|\nabla g(y^{(t)})\| \to 0$ gives $g(y^{(t)}) \to g(y^*)$. Strong concavity also gives the quadratic-growth inequality below, which then proves $y^{(t)} \to y^*$.
$$g(y^*) - g(y^{(t)}) \ge \frac{\sigma_{\min}(AA^T)}{2L}\|y^{(t)} - y^*\|^2$$
Finally, since we know that both primal feasibility and dual optimality are reached, we know that $x^{(t)} \to x^*$ from the fact that $x^* = \arg\min_x L(x, y^*)$, that $f$ is strongly convex so this minimizer is unique, and that $x(y)$ is continuous, so $x^{(t)} = x(y^{(t)}) \to x(y^*) = x^*$.
Augmented Lagrangian Method of Multipliers:
The main problem with Dual Ascent is that it has some strict requirements to guarantee convergence. Problems that are not strictly convex, or problems where the function can take values of $+\infty$, will not work, as the update for $x$ is ill-defined if the Lagrangian does not have a unique minimum. The Augmented Lagrangian $L_\rho(x, y)$ tries to fix this issue by adding a penalty term to the Lagrangian, which allows for more stability under weaker assumptions.
$$L_\rho(x, y) = f(x) + y^T(Ax - b) + \frac{\rho}{2}\|Ax - b\|_2^2$$
The Augmented Lagrangian Method of Multipliers, shortened ALM, then performs the exact same calculations as Dual Ascent with this new Augmented Lagrangian, using $\rho$ as the dual step size.
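A minimal sketch of the ALM iteration for an equality-constrained quadratic (the quadratic data and the choice $\rho = 10$ are illustrative): the x-update now solves the $\rho$-regularized system, and the y-update uses $\rho$ as its step size.

```python
import numpy as np

# ALM sketch for min 0.5 x^T P x + q^T x  s.t.  Ax = b (illustrative data).
def alm(P, q, A, b, rho, num_iters=100):
    y = np.zeros(A.shape[0])
    for _ in range(num_iters):
        # x-update: minimize the Augmented Lagrangian in x by solving
        # (P + rho A^T A) x = -q - A^T y + rho A^T b
        x = np.linalg.solve(P + rho * A.T @ A, -q - A.T @ y + rho * A.T @ b)
        # y-update: dual step with step size rho
        y = y + rho * (A @ x - b)
    return x, y

# minimize ||x||^2 subject to x1 + x2 = 1
P = 2 * np.eye(2)
q = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x, y = alm(P, q, A, b, rho=10.0)
```

Note how the penalty term couples the coordinates of $x$ through $\rho A^T A$, which is exactly what blocks the parallelization that Dual Ascent enjoyed.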
The convergence theorem for ALM guarantees convergence to a local minimum, in contrast with the other two algorithms, which target global minima. Assume $f(x)$ and the constraint function are twice continuously differentiable, $x^*$ is a local minimum satisfying the second-order sufficient conditions for the Lagrangian, and $A$ has full rank. Then there exists a threshold $\hat{\rho}$ such that for all $\rho > \hat{\rho}$ we know $x^{(t)} \to x^*$, $y^{(t)} \to y^*$, and the convergence of $y^{(t)}$ is linear with a ratio proportional to $\frac{1}{\rho}$. A lot of the proof solely cares about the existence of such a $\rho$, but in practical applications the $\rho$ that works best is still of reasonable size.
The proof follows by first ensuring there is some ρ that can make Lρ convex, using it to guarantee that the algorithm’s iterations are well-defined, and then proving the linear convergence rate by relating iterates with a contraction.
By definition of $x^*$ satisfying the second-order sufficient conditions, we know that $\nabla^2 L(x^*, y^*)$ is positive definite on the null space of $A$. By Finsler’s Lemma, we know that for large enough $\rho$ the augmented matrix below (which forms the Hessian of the Augmented Lagrangian) is positive definite, which means the Augmented Lagrangian is convex in a neighborhood of $x^*$ for a large enough $\rho$.
$$\nabla^2_{xx} L_\rho(x^*, y^*) = \nabla^2_{xx} L(x^*, y^*) + \rho A^T A$$
Since this Hessian is positive definite, we know that $\nabla^2_{xx} L_\rho(x^*, y^*)$ is non-singular. Along with the fact that $\nabla_x L_\rho(x^*, y^*) = \nabla f(x^*) + A^T y^* = 0$ by the KKT conditions (the penalty term’s gradient vanishes at the feasible $x^*$), we can use the Implicit Function Theorem to define a unique function $x(y)$ (the corresponding minimizing $x$ for any given $y$) such that $\nabla_x L_\rho(x(y), y) = 0$ in a neighborhood around $y^*$. The theorem gives us the derivative of this function, and since the size of the neighborhood depends on $\rho$ (as we are able to overcome non-convexity), we are able to make general statements about the problem as a whole using it.
$$\nabla x(y) = -\left[\nabla^2_{xx} L_\rho(x(y), y)\right]^{-1} A^T$$
Now that we have a derivative of x(y), we can start showing that the update steps of the algorithm converge. To start we can define a surrogate T(y) for the updates as follows.
$$y^{(t+1)} = y^{(t)} + \rho\left(Ax(y^{(t)}) - b\right), \qquad T(y) = y + \rho\left(Ax(y) - b\right)$$
We can then use a first-order Taylor expansion of $T(y^{(t)})$ around $y^*$ to find an approximation of the above statement, which can be simplified since $y^{(t+1)} = T(y^{(t)})$ and, by definition, $y^* = T(y^*)$.
$$y^{(t+1)} - y^* = T(y^{(t)}) - T(y^*) \approx \nabla T(y^*)\left(y^{(t)} - y^*\right)$$
This is where we can use the previously found derivative of $x(y)$ to compute the gradient $\nabla T(y^*)$. Once the previous terms are plugged in, we can simplify further using the Woodbury Matrix Identity.
$$\nabla T(y^*) = I + \rho A \nabla x(y^*) = I - \rho A \left[\nabla^2_{xx} L_\rho(x^*, y^*)\right]^{-1} A^T$$
As ρ gets larger, the impact of the added identity matrix becomes negligible in comparison to the other terms, so we can remove it for now.
$$\nabla T(y^*) = (I + \rho M)^{-1} \approx (\rho M)^{-1} = \frac{1}{\rho} M^{-1}, \qquad M = A\left(\nabla^2_{xx} L(x^*, y^*)\right)^{-1} A^T$$
This means that the norm of the gradient is bounded by $\frac{C}{\rho}$ for some constant $C$ (depending on $A$ and the Hessian of the Lagrangian), so we know that $T(y)$ forms a contraction around $y^*$ for large enough $\rho$.
$$\|\nabla T(y^*)\| \le \frac{C}{\rho}, \qquad \|y^{(t+1)} - y^*\| \approx \|\nabla T(y^*)(y^{(t)} - y^*)\| \le \frac{C}{\rho}\|y^{(t)} - y^*\|$$
This means that the error is multiplied by a factor on the order of $\frac{1}{\rho}$ at each step, so the convergence to $y^*$ is linear for any $\rho$ large enough to make that ratio less than one. Since this means that $y^{(t)} \to y^*$, and we know that $x(y)$ is continuous, we know that $x^{(t)} \to x^*$ as well, thus concluding the proof.
Alternating Direction Method of Multipliers:
Although ALM allows us to get much more stable and faster convergence results in comparison to Dual Ascent, by using ALM we miss out on the ability to parallelize the primal update. The Alternating Direction Method of Multipliers, shortened ADMM, is an algorithm that allows for convergence under weak assumptions in the same way as ALM and also is able to achieve parallelization in the same way as Dual Ascent, getting the best of both worlds.
ADMM considers a problem over two decision variables x and z, where we will consider a problem of the form below to simplify.
$$\min_{x, z} f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c$$
The method then performs ALM on this problem, updating $x$, $z$, and $y$ at each iteration, with one key distinction in how the updates on $x$ and $z$ are made. Instead of updating both jointly like ALM would with $(x^{(t+1)}, z^{(t+1)}) = \arg\min_{x,z} L_\rho(x, z, y^{(t)})$, ADMM uses an alternating update, first updating $x$ and then $z$.
$$x^{(t+1)} = \arg\min_x L_\rho(x, z^{(t)}, y^{(t)}), \qquad z^{(t+1)} = \arg\min_z L_\rho(x^{(t+1)}, z, y^{(t)}), \qquad y^{(t+1)} = y^{(t)} + \rho\left(Ax^{(t+1)} + Bz^{(t+1)} - c\right)$$
This separation of variables actually allows for many use cases that the first two algorithms are not capable of. Parallelization can be achieved in a number of different ways, but one of the most common is for problems in the following form with a consensus variable z.
$$\min \sum_{i=1}^{N} f_i(x_i) \quad \text{s.t.} \quad x_i - z = 0 \text{ for all } i = 1, \dots, N$$
Under this we can separate the primal update the same way we did with Dual Ascent.
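A minimal numeric sketch of consensus ADMM with scalar blocks (the choices $f_i(x_i) = \frac{1}{2} p_i (x_i - a_i)^2$ and $\rho = 1$ are illustrative assumptions): the x-update is embarrassingly parallel across blocks, while the z-update is just an averaging step.

```python
import numpy as np

# Consensus ADMM sketch: min sum_i 0.5 * p_i * (x_i - a_i)^2  s.t.  x_i = z.
p = np.array([1.0, 2.0, 4.0])   # per-block curvatures (illustrative)
a = np.array([3.0, -1.0, 2.0])  # per-block targets (illustrative)
rho = 1.0
x = np.zeros(3)
z = 0.0
y = np.zeros(3)
for _ in range(500):
    # x-update: each block solves its own 1-D problem (parallelizable),
    # from p_i (x_i - a_i) + y_i + rho (x_i - z) = 0
    x = (p * a - y + rho * z) / (p + rho)
    # z-update: gather step, averaging the blocks back into consensus
    z = np.mean(x + y / rho)
    # y-update: dual ascent on the residual x_i - z
    y = y + rho * (x - z)
print(z)  # approaches the weighted average sum(p * a) / sum(p)
```

Only $x_i + y_i/\rho$ has to be communicated to form the average, so each block keeps its own data local.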
If f(x) and g(z) are convex (updates will be well-behaved) and Slater’s Condition holds (for the existence of a proper solution), we know that Ax(t)+Bz(t)−c→0, f(x(t))+g(z(t))→p∗, and y(t)→y∗.
The proof follows by deriving inequalities that relate the iterates both to the optimal point and to the next iterate, combining them into a single decrease inequality, and then reading off the convergence claims from it.
To derive an inequality about global relationships, we can use the optimality conditions that a given optimal (x∗,z∗,y∗) would need to satisfy. Since y interacts with x and z separately, we can separate the derivative optimality conditions, where ∂ is used to denote the subdifferential of the corresponding function.
$$Ax^* + Bz^* - c = 0, \qquad -A^T y^* \in \partial f(x^*), \qquad -B^T y^* \in \partial g(z^*)$$
Since we know that convex functions satisfy $f(x) \ge f(x^*) + \langle s, x - x^* \rangle$ for any subgradient $s \in \partial f(x^*)$, by the optimality conditions we know that the following hold as well.
$$f(x) - f(x^*) + \langle A^T y^*, x - x^* \rangle \ge 0, \qquad g(z) - g(z^*) + \langle B^T y^*, z - z^* \rangle \ge 0$$
We can then combine these along with the primal feasibility condition to get the first inequality we want to derive.
$$\left(f(x) + g(z)\right) - \left(f(x^*) + g(z^*)\right) + \langle y^*, Ax + Bz - c \rangle \ge 0$$
To derive our second set of inequalities, we can instead use the optimality conditions of the ADMM updates themselves. Writing $r^{(t+1)} = Ax^{(t+1)} + Bz^{(t+1)} - c$ for the primal residual and using $y^{(t+1)} = y^{(t)} + \rho r^{(t+1)}$, the $x$-update and $z$-update satisfy the following.
$$-A^T\left(y^{(t+1)} + \rho B(z^{(t)} - z^{(t+1)})\right) \in \partial f(x^{(t+1)}), \qquad -B^T y^{(t+1)} \in \partial g(z^{(t+1)})$$
These both give us subgradients that we can plug into the same property of convex functions that we used for the first inequality to derive the two new inequalities.
We can apply the identity $2\langle a - b, a - c \rangle = \|a - b\|^2 + \|a - c\|^2 - \|b - c\|^2$ to both terms and simplify, separating the step-$t$ and step-$(t+1)$ terms, multiplying by 2, and using the fact that $y^{(t+1)} - y^{(t)} = \rho r^{(t+1)}$, where $r^{(t)} = Ax^{(t)} + Bz^{(t)} - c$ is the primal residual. The result can be written compactly with the Lyapunov function $V^{(t)} = \frac{1}{\rho}\|y^{(t)} - y^*\|^2 + \rho\|B(z^{(t)} - z^*)\|^2$.
$$V^{(t)} - V^{(t+1)} \ge \rho\|r^{(t+1)}\|^2 + \rho\|B(z^{(t+1)} - z^{(t)})\|^2$$
Since the right-hand side of the inequality is always non-negative, we know that $V^{(t)} \ge V^{(t+1)}$. As well, each $V^{(t)}$ is non-negative by definition, so the sequence is monotonically non-increasing and must converge to some limit $V^{(\infty)}$. If we examine the following sum as $N \to \infty$, we see that it telescopes, leading to the entire sum being bounded by the initial error.
$$\rho \sum_{t=0}^{N} \left( \|r^{(t+1)}\|^2 + \|B(z^{(t+1)} - z^{(t)})\|^2 \right) \le V^{(0)} - V^{(N+1)} \le V^{(0)}$$
Since this holds as $N \to \infty$, the terms of the sum must themselves go to zero for the sum to be finite. This means both $\|r^{(t)}\| \to 0$ and $\|B(z^{(t+1)} - z^{(t)})\| \to 0$, the first of which proves primal feasibility. Since $B(z^{(t+1)} - z^{(t)}) \to 0$, the optimality condition of the $x$-update converges to $0 \in \partial f(x^{(t+1)}) + A^T y^{(t+1)}$, which is the optimality condition for $x$, so $x^{(t)} \to x^*$. Since $y^{(t+1)} = y^{(t)} + \rho r^{(t+1)}$ and $\|r^{(t)}\| \to 0$, we know that $y^{(t+1)} - y^{(t)} \to 0$, so $y^{(t)} \to y^*$ since we know that feasibility is reached. Finally, the $z$-update is defined so that its optimality condition is always satisfied with the current $y$, so $z^{(t)} \to z^*$.
Since Strong Duality holds from Slater’s Condition, we know that the KKT conditions are sufficient for optimality, so proving that these points satisfy the optimality conditions we laid out originally is enough to prove that they are converging to their optimal points, which concludes the proof.
Appendix:
This appendix includes proofs for all supporting theorems except for the Implicit Function Theorem, because the proof for that theorem relies on a chain of theorems that would trail a little off course here.
Danskin’s Theorem:
Consider the function $\phi(x)$ below, defined for some continuous function $f(x, z)$, where $x \in \mathbb{R}^n$ and $z$ is a parameter from a compact set $Z \subset \mathbb{R}^m$ (keep in mind that $z$ coming from a compact set is only needed so that a maximizer exists, so it can be replaced by other conditions).
$$\phi(x) = \max_{z \in Z} f(x, z)$$
If $f$ is convex in $x$ for every $z \in Z$ and $f$ has continuous partial derivatives with respect to $x$, then the directional derivative of $\phi(x)$ in a direction $y$ is given below, where $Z_0(x)$ denotes the set of maximizing points for a given $x$.
$$\phi'(x; y) = \max_{z \in Z_0(x)} \langle \nabla_x f(x, z), y \rangle, \qquad Z_0(x) = \left\{ \bar{z} \in Z : f(x, \bar{z}) = \max_{z \in Z} f(x, z) \right\}$$
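As a quick numeric sanity check of the formula, here is a sketch with a finite (hence compact) parameter set; the particular $f$, point, and direction are illustrative choices.

```python
import numpy as np

# Numeric check of Danskin's formula with a finite parameter set Z
# (a finite set is compact, so the theorem applies): f(x, z) = -(x - z)^2.
zs = np.array([-1.0, 0.0, 1.0])

def f(x, z):
    return -(x - z) ** 2

def phi(x):
    return max(f(x, z) for z in zs)

x0, y_dir, t = 0.5, 1.0, 1e-6
# one-sided difference quotient approximating the directional derivative
numeric = (phi(x0 + t * y_dir) - phi(x0)) / t
# Danskin: maximize <grad_x f(x0, z), y_dir> over the maximizing set Z_0(x0)
maximizers = [z for z in zs if np.isclose(f(x0, z), phi(x0))]
danskin = max(-2.0 * (x0 - z) * y_dir for z in maximizers)
```

At $x_0 = 0.5$ both $z = 0$ and $z = 1$ are maximizers, so $\phi$ has a kink there and the directional derivative picks out the larger slope, $1$.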
The proof follows by proving that the maximum of the individual gradients is both an upper and lower bound on the directional derivative and then combining them to prove equality.
Proving the lower bound is a simple manipulation of the definition of each function. By definition of the maximum function, we know that for any $z^* \in Z_0(x)$ the maximum function forms an upper bound for the original function. We can use this to form an inequality with a perturbation away from $x$ in direction $y$ and then subtract $\phi(x) = f(x, z^*)$ from both sides to get it into a more workable form.
$$\phi(x + ty) - \phi(x) \ge f(x + ty, z^*) - f(x, z^*)$$
We can then divide both sides by $t > 0$ and take the limit as $t \to 0^+$ to turn both sides of the inequality into statements about directional derivatives; since this holds for every $z^* \in Z_0(x)$, we can take the maximum over them.
$$\liminf_{t \to 0^+} \frac{\phi(x + ty) - \phi(x)}{t} \ge \max_{z \in Z_0(x)} \langle \nabla_x f(x, z), y \rangle$$
The proof for the upper bound follows a similar case of manipulation of definitions with some use of the fact that $Z$ is a closed set. By definition of the maximum function, we know that the following apply for any $t > 0$, where $z_t \in Z_0(x + ty)$ is a maximizer at the perturbed point.
$$\phi(x + ty) = f(x + ty, z_t), \qquad \phi(x) \ge f(x, z_t)$$
We can subtract these two terms to derive an inequality.
$$\phi(x + ty) - \phi(x) \le f(x + ty, z_t) - f(x, z_t)$$
We can use the Mean Value Theorem on the right-hand side to know that there must be some $\theta_t \in (0, 1)$ that satisfies the following, which we can plug into the inequality and divide by $t$ to get into the familiar pre-limit derivative form.
$$f(x + ty, z_t) - f(x, z_t) = t\langle \nabla_x f(x + \theta_t ty, z_t), y \rangle \implies \frac{\phi(x + ty) - \phi(x)}{t} \le \langle \nabla_x f(x + \theta_t ty, z_t), y \rangle$$
Since $z_t$ changes as we take the limit of $t$, we need to make sure that $z_t$ has a limit point as $t \to 0^+$. Consider a sequence of shrinking step sizes $t_k \to 0^+$, where each is guaranteed by the Extreme Value Theorem to have a corresponding maximizer $z_{t_k} \in Z$. Since $Z$ is a compact set, by the Heine-Borel Theorem we know that the set is both closed and bounded. By the Bolzano-Weierstrass Theorem, we know that $\{z_{t_k}\}$ must contain a convergent subsequence $\{z_{t_{k_j}}\}$ that converges to some $z_0$, which we know satisfies $z_0 \in Z$ since $Z$ is closed.
For $z_0$ to be meaningful to the bound we need to prove that $z_0 \in Z_0(x)$. For any arbitrary $z \in Z$, the fact that $z_{t_{k_j}}$ is the maximizer for step $t_{k_j}$ means the following is satisfied.
$$f(x + t_{k_j} y, z_{t_{k_j}}) \ge f(x + t_{k_j} y, z)$$
We can then take the limit as $j \to \infty$, which (using the continuity of $f$) leads to $f(x, z_0) \ge f(x, z)$ for every $z \in Z$, so $z_0$ is a maximizer at $x$ and $z_0 \in Z_0(x)$. This lets us take the limit along the subsequence to derive the desired bound.
$$\lim_{j \to \infty} \frac{\phi(x + t_{k_j} y) - \phi(x)}{t_{k_j}} \le \langle \nabla_x f(x, z_0), y \rangle \le \max_{z \in Z_0(x)} \langle \nabla_x f(x, z), y \rangle$$
As this applies for every convergent subsequence, we know that the limit superior of the sequence itself must be bounded by the maximum among the individual bounds, completing the upper bound section of the proof.
Since the limsup of the sequence is upper bounded by the same term that the liminf is lower bounded by, the limsup and liminf are equal, proving that the limit exists and takes the following value, which proves the theorem.
$$\phi'(x; y) = \lim_{t \to 0^+} \frac{\phi(x + ty) - \phi(x)}{t} = \max_{z \in Z_0(x)} \langle \nabla_x f(x, z), y \rangle$$
Finsler’s Lemma:
Given a symmetric matrix $H \in \mathbb{R}^{n \times n}$ and a matrix $A \in \mathbb{R}^{m \times n}$, we know the following statements are equivalent.
(1) $d^T H d > 0$ for all $d \in \mathbb{R}^n \setminus \{0\}$ such that $Ad = 0$
(2) There exists a scalar $\hat{\rho} \in \mathbb{R}$ such that for all $\rho > \hat{\rho}$, we know $H + \rho A^T A \succ 0$
This allows us to know that there exists a threshold ρ^ such that an augmented form of H is positive definite everywhere, making the utility to the convergence proof for ALM clear.
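A small numeric illustration (the particular $H$ and $A$ are illustrative choices): $H$ is indefinite but positive definite on the null space of $A$, and the smallest eigenvalue of $H + \rho A^T A$ crosses zero at a finite threshold.

```python
import numpy as np

# Finsler's Lemma illustration: H is indefinite but satisfies d^T H d > 0
# on the null space of A, so H + rho A^T A is positive definite for all
# rho above a finite threshold (rho_hat = 1 for this data).
H = np.diag([1.0, -1.0])    # indefinite
A = np.array([[0.0, 1.0]])  # null space is the first axis, where H > 0

def min_eig(rho):
    return np.linalg.eigvalsh(H + rho * A.T @ A).min()

print(min_eig(0.5))  # negative: rho below the threshold
print(min_eig(2.0))  # positive: rho above the threshold
```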
Since the theorem is one of equivalence, we need to prove both that (1)⟹(2) and (2)⟹(1), where the first is proven through contradiction and the second is proven through definition.
To prove (1)⟹(2), we suppose that statement (1) holds but statement (2) is false and then reach a contradiction. If (2) is false then for every ρk=k, k=1,2,…, there exists a vector dk∈Rn such that the following holds. We can enforce ∥dk∥=1 since only the direction of the vector matters for the inequality.
$$d_k^T (H + k A^T A) d_k = d_k^T H d_k + k\|A d_k\|^2 \le 0$$
Since the sequence $\{d_k\}$ is a subset of the unit sphere, which is a compact set, we can use the Bolzano-Weierstrass Theorem to know there exists a convergent subsequence $\{d_{k_j}\}$ that converges to some unit vector $d^*$. Since $\|d_k\| = 1$ and $H$ is a fixed matrix, we know that $d_k^T H d_k$ is always bounded. This means as $k \to \infty$, for $d_k^T H d_k + k\|A d_k\|^2 \le 0$ to stay true, we need $\|A d_k\|^2 \to 0$, which means that the limit below needs to hold.
$$\lim_{j \to \infty} \|A d_{k_j}\|^2 = \|A d^*\|^2 = 0 \implies A d^* = 0$$
Since d∗ is in the null space of A and we know that the following holds, we reach a contradiction with (1), thus proving that (1)⟹(2).
$$\lim_{j \to \infty} \left( d_{k_j}^T H d_{k_j} + k_j \|A d_{k_j}\|^2 \right) \le 0 \implies d^{*T} H d^* \le 0$$
To prove (2)⟹(1), we assume that (2) holds, which means that for any $\rho > \hat{\rho}$ and any non-zero vector $d$ the following holds.
$$d^T (H + \rho A^T A) d > 0$$
We can distribute and simplify to make the inequality workable with what we need to prove (1).
$$d^T H d + d^T(\rho A^T A) d > 0 \implies d^T H d + \rho (Ad)^T(Ad) > 0 \implies d^T H d + \rho \|Ad\|^2 > 0$$
If the chosen d satisfies Ad=0, then this simplifies into the definition of (1), thus proving that (2)⟹(1).
$$d^T H d + \rho(0) > 0 \implies d^T H d > 0$$
Since we have proven both (1)⟹(2) and (2)⟹(1), we know that they are equivalent statements, thus proving the theorem.