This is another quick little review like the last ones. It covers the general problem setup and the background of Proximal Point Methods, then Proximal Gradient and Proximal Newton and some of their convergence properties. It also covers Accelerated Proximal Gradient and Regularized Proximal Quasi-Newton, along with the convergence results from their papers, including some explanations that I personally needed.
Problem Setup:
A composite optimization problem is an optimization problem over a function with two parts, a differentiable f(x) and a potentially non-differentiable g(x).
$$\min_x \; f(x) + g(x)$$
It is often assumed that $f(x)$ will have some highly complex structure while $g(x)$ is a simple function with structure that can be taken advantage of. Consider the function $f(x) + \|x\|_1$, where the L1 regularization promotes sparsity in the weights by pushing those close to zero to exactly zero, which is very helpful, although the non-differentiability of $\|x\|_1$ can cause headaches.
Although there are many methods of non-differentiable optimization out there, like Subgradient Descent (which for future reference would have a convergence rate of $O(1/\epsilon^2)$ to reach an accuracy of $\epsilon$), they lose a lot of information by not taking advantage of the fact that $f(x)$ is differentiable, which is where methods built specifically for composite problems come into play.
Proximal Point Methods:
The main piece of important background before getting into the algorithms themselves is the concept of a Proximal Point Method (PPM). Standard gradient descent uses the gradient to decide how far to step based on how much the function is expected to decrease. Proximal point methods do the same, but through a minimization subproblem, since gradient information is not available. They make use of the proximal operator $\mathrm{prox}_f(\cdot)$, which balances minimizing the function value against how far the step itself travels.
$$\mathrm{prox}_f(v) = \arg\min_x \left\{ f(x) + \frac{1}{2}\|x - v\|^2 \right\}$$
A proximal point method simply takes this operator with a stepsize, defined below as η, and uses it for each iteration.
$$x_{k+1} = \arg\min_z \left\{ f(z) + \frac{1}{2\eta}\|z - x_k\|^2 \right\} = \mathrm{prox}_{\eta f}(x_k)$$
Although this general form is almost impossible to solve at each iteration, there is often some structure in $f(\cdot)$ that we can take advantage of to solve the subproblem more efficiently. All of the composite optimization methods covered here will also make solving the subproblem much simpler.
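As a concrete illustration (my own sketch, not from any of the papers), here is a proximal point iteration on the non-differentiable $f(x) = |x|$, whose proximal operator has the closed-form soft-thresholding solution:

```python
import numpy as np

def prox_abs(v, eta):
    # Closed form of prox_{eta*f}(v) for f(x) = |x| (soft-thresholding):
    # argmin_x |x| + (1/(2*eta)) * (x - v)^2
    return np.sign(v) * max(abs(v) - eta, 0.0)

def proximal_point(x0, eta=0.5, iters=20):
    # PPM iteration: x_{k+1} = prox_{eta*f}(x_k)
    x = x0
    for _ in range(iters):
        x = prox_abs(x, eta)
    return x

print(proximal_point(5.0))  # -> 0.0, the minimizer of |x|
```

Each iteration moves at most $\eta$ towards the minimizer, so the method converges without ever needing a (sub)gradient of $|x|$ directly.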
Interpretations:
There are two interpretations of the algorithm that I want to cover to hopefully help with the intuition behind it if it didn’t make sense originally. The more intuitive of the two relates PPMs to the Moreau Envelope $M_{\eta f}(\cdot)$, which provides a way to create a smoothed version of any lower semi-continuous convex function. It can be defined as below with some parameter $\eta > 0$.
$$M_{\eta f}(v) = \inf_x \left( f(x) + \frac{1}{2\eta}\|x - v\|_2^2 \right)$$
The Moreau Envelope keeps the original function’s minimal value and has the same minimizers, so it’s extremely helpful for non-smooth optimization. The main benefit of the envelope comes from the fact that $M_{\eta f}$ is always differentiable even if $f$ isn’t.
$$\nabla M_{\eta f}(x_k) = \frac{1}{\eta}\left(x_k - \mathrm{prox}_{\eta f}(x_k)\right)$$
By isolating the proximal operator, one can see that a PPM is simply performing gradient descent on the Moreau Envelope of the function.
$$x_{k+1} = x_k - \eta \nabla M_{\eta f}(x_k)$$
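To sanity-check the gradient formula, we can compare it against a finite-difference derivative of the envelope for $f(x) = |x|$, whose prox is soft-thresholding (a quick numerical sketch of my own):

```python
import numpy as np

def prox_abs(v, eta):
    # prox_{eta*f}(v) for f(x) = |x| (soft-thresholding)
    return np.sign(v) * max(abs(v) - eta, 0.0)

def moreau(v, eta):
    # M_{eta f}(v) = min_x |x| + (x - v)^2 / (2*eta), attained at the prox
    x = prox_abs(v, eta)
    return abs(x) + (x - v) ** 2 / (2 * eta)

def grad_moreau(v, eta):
    # The envelope-gradient formula from the text: (v - prox_{eta f}(v)) / eta
    return (v - prox_abs(v, eta)) / eta

eta, v, h = 0.5, 2.0, 1e-6
fd = (moreau(v + h, eta) - moreau(v - h, eta)) / (2 * h)  # finite difference
print(abs(fd - grad_moreau(v, eta)) < 1e-5)  # -> True
```

For $f(x) = |x|$ the envelope is the Huber function, so the agreement here is exactly the classical smoothing picture.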
Another equally important interpretation is that PPMs perform implicit subgradient updates, i.e. they update based on the subgradient $g_{k+1} \in \partial f(x_{k+1})$ of the next iterate.
$$x_{k+1} = x_k - \eta g_{k+1}$$
This can be intuitively understood since the PPM iteration looks for the exact properties of the next point rather than relying solely on the current one. It’s derived from the optimality condition of the proximal operator, $0 \in \partial f(x_{k+1}) + \frac{1}{\eta}(x_{k+1} - x_k)$, after some basic manipulation.
These both go to show that PPM has a much more solid foundation for convergence than a simple subgradient descent would, which is why it is so commonly used.
Proximal Gradient Method:
Since we assume that $f(x)$ is a hard function to optimize over, to simplify the proximal calculations we can consider using a quadratic approximation of $f(x)$ constructed at each iterate $x_k$, where $L$ defines the curvature of the approximation.
$$Q_L(x, x_k) = f(x_k) + \langle x - x_k, \nabla f(x_k)\rangle + \frac{L}{2}\|x - x_k\|^2 + g(x)$$
Through minimizing this surrogate we can derive Proximal Gradient. Since f(xk) is a constant with respect to x it does not affect the location of the minimum so we can remove it from the subproblem.
$$\arg\min_x \left\{ \langle x - x_k, \nabla f(x_k)\rangle + \frac{L}{2}\|x - x_k\|^2 + g(x) \right\}$$
Using the same constant logic as before, we can add in the term $\frac{1}{2L}\|\nabla f(x_k)\|^2$ to the subproblem without changing the optimal $x$. We can factor out $\frac{L}{2}$ from all of the terms excluding $g(x)$ to simplify, which shows that the added term lets us collect the $f(x)$ terms into a single squared norm.
This final simplified form gives us the proximal gradient update rule.
$$x_{k+1} = \arg\min_x \left\{ \frac{L}{2}\left\|x - \left(x_k - \frac{1}{L}\nabla f(x_k)\right)\right\|^2 + g(x) \right\}$$
Since this same quadratic approximation is what gradient descent does under the hood, this simplifies down to using the proximal operator on a standard gradient descent step of f. This leads to an algorithm that can optimize over f much quicker than the previous subgradient descent while still being feasible at each iteration.
$$z_k = x_k - \frac{1}{L}\nabla f(x_k), \qquad x_{k+1} = \mathrm{prox}_{\frac{1}{L}g}(z_k)$$
Since $g$ is assumed to have a simple structure, there is more often than not a simple closed-form solution to the proximal operator step. Consider the simple case of $g(x)$ being an indicator function for some convex set $C$.
$$g(x) = \mathbb{1}_C(x) = \begin{cases} 0 & \text{if } x \in C \\ \infty & \text{if } x \notin C \end{cases}$$
This definition of g makes the proximal operator become a projection onto C, causing the algorithm to turn into projected gradient descent.
$$\mathrm{prox}_{\frac{1}{L}\mathbb{1}_C}(z) = \Pi_C(z) = \arg\min_{x \in C} \|x - z\|^2$$
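A minimal sketch of this special case (my own example, with an assumed least-squares $f$ and $C$ taken as the nonnegative orthant, where the projection is just clipping):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
L = np.linalg.norm(A.T @ A, 2)  # smoothness constant of f(x) = 0.5*||Ax - b||^2

x = np.zeros(5)
for _ in range(500):
    grad = A.T @ (A @ x - b)         # gradient step on the smooth part f
    x = np.maximum(x - grad / L, 0)  # prox = projection onto C = {x : x >= 0}

print(np.all(x >= 0))  # -> True: every iterate stays feasible
```

Because the prox is evaluated exactly at every iteration, the iterates are always feasible, unlike a penalty-based approach.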
We can also return to our original example of $g(x) = \|x\|_1$, which once simplified has a proximal operator with a very simple closed form, applied elementwise.
$$\left[\mathrm{prox}_{\lambda\|\cdot\|_1}(z)\right]_i = \mathrm{sign}(z_i)\max(|z_i| - \lambda,\, 0)$$
This specific form of the algorithm is called the Iterative Soft-Thresholding Algorithm, shortened to ISTA, and is going to be of slight importance later for naming conventions.
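Putting the pieces together, here is a small ISTA sketch for the lasso objective $\frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$ (my own minimal implementation, on synthetic data):

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t*||.||_1, applied elementwise
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, b, lam, iters=1000):
    # minimize 0.5*||Ax - b||^2 + lam*||x||_1
    L = np.linalg.norm(A.T @ A, 2)      # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - A.T @ (A @ x - b) / L   # gradient step on the smooth part
        x = soft_threshold(z, lam / L)  # prox step on the L1 part
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 10))
x_true = np.zeros(10)
x_true[:2] = [3.0, -2.0]
b = A @ x_true                          # noiseless sparse recovery problem
x_hat = ista(A, b, lam=0.1)
print(np.abs(x_hat[2:]).max() < 0.05)   # -> True: zero coefficients stay near zero
```

The soft-thresholding prox is what zeroes out the small coefficients exactly, which is the sparsity-promoting behavior mentioned in the problem setup.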
Convergence:
If $f$ is an $L$-smooth convex function, $g$ is a lower semi-continuous convex function, and a stepsize of $\frac{1}{L}$ is used, then we know that the method has a convergence rate of $O(1/\epsilon)$ to reach an error $\epsilon$, which is detailed below.
$$F(x_k) - F(x^*) \le \frac{L\|x_0 - x^*\|^2}{2k}$$
In order to prove this rate, we first need to prove that the iterates satisfy $F(x_{k+1}) \le F(x_k)$, which is a property that is guaranteed when $f$ is $L$-smooth and a stepsize of $\frac{1}{L}$ is used. The proof is a simple one that relies on the properties of the surrogate $Q_L(x, x_k)$. By definition we already know that $Q_L(x_k, x_k) = F(x_k)$, and since $x_{k+1}$ is derived through a minimization we have $Q_L(x_{k+1}, x_k) \le Q_L(x_k, x_k)$. Since $f$ is $L$-smooth, the descent lemma holds, which after adding $g(x_{k+1})$ to both sides proves $F(x_{k+1}) \le Q_L(x_{k+1}, x_k)$. Putting these together we can arrive at the desired result.
$$F(x_{k+1}) \le Q_L(x_{k+1}, x_k) \le Q_L(x_k, x_k) = F(x_k)$$
Now we can return to the original convergence theorem. Since $f$ is $L$-smooth, we can apply the descent lemma to the function to provide a bound. Adding $g(x_{k+1})$ to both sides gives us an inequality in terms of $F$ and our surrogate $Q_L$.
Since $x_{k+1}$ is the result of a minimization subproblem, which we know is a convex problem since $g$ and the quadratic approximation are both convex, we can use the standard optimality conditions to derive some properties of $x_{k+1}$. We can specifically derive the following, which tells us that there is some subgradient of $g$ at $x_{k+1}$ equal to $-\nabla f(x_k) - L(x_{k+1} - x_k)$.
$$0 \in \nabla f(x_k) + L(x_{k+1} - x_k) + \partial g(x_{k+1})$$
We can then use the convexity of f and g to derive the following bounds for some arbitrary x using the tangent line property, where we replace the subgradient in the inequality with our above defined subgradient.
We can then set $x = x^*$, which defines a recursive relationship between $x_k$ and $x_{k+1}$, meaning we can also bound the total sum of both sides over the $k$ iterations.
Now we can finally use the fact that $F(x_k)$ is non-increasing under this setting, so that $F(x_k) - F(x^*)$ is the smallest term in the sum. This means that $k(F(x_k) - F(x^*))$ is less than the entire sum, letting us derive the stated bound, which completes the proof.
Accelerated Proximal Gradient:
One of the methods of speeding up the convergence of gradient descent with momentum is Nesterov’s Acceleration, where we evaluate the gradient at the point that the momentum term is heading towards, rather than at the current iterate. The same principle can be used to derive Accelerated Proximal Gradient, which is what the paper on the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) does. First we can rewrite the original proximal gradient method by defining $p_L(y) = \arg\min\{Q_L(x, y) : x \in \mathbb{R}^n\}$, which uses a stepsize of $\frac{1}{L}$.
The accelerated proximal gradient method introduces a momentum sequence $y_k$ driven by its own scalars $t_k$. For the later convergence analysis, $t_{k+1}$ is chosen to satisfy $t_{k+1}^2 - t_{k+1} = t_k^2$, so that $t_{k+1}^2 - t_{k+1} \le t_k^2$ holds. Using this along with our previously defined $p_L(\cdot)$ gives the update rule for FISTA.
$$x_k = p_L(y_k), \qquad t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}, \qquad y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$$
While this may feel unintuitive, the same intuition can be used for this as Nesterov acceleration itself. Rather than solely relying on the current information, using yk allows the method to look ahead before taking a step, allowing it to correct its trajectory faster. These properties are also shown in the improved convergence guarantees.
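A minimal FISTA sketch (my own, reusing the soft-thresholding prox for an assumed lasso objective) to make the update rule concrete:

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t*||.||_1, applied elementwise
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista(A, b, lam, iters=200):
    # Accelerated proximal gradient for 0.5*||Ax - b||^2 + lam*||x||_1
    L = np.linalg.norm(A.T @ A, 2)
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(iters):
        x_next = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)  # x_k = p_L(y_k)
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2    # t_{k+1}^2 - t_{k+1} = t_k^2
        y = x_next + ((t - 1) / t_next) * (x_next - x)  # momentum look-ahead
        x, t = x_next, t_next
    return x

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 10))
x_true = np.zeros(10)
x_true[0] = 1.0
b = A @ x_true
print(np.round(fista(A, b, lam=0.1)[0], 1))  # -> 1.0
```

Note that the gradient and prox are evaluated at the look-ahead point $y$, not at $x$; that single change is the whole difference from ISTA.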
Convergence:
If $f$ is an $L$-smooth convex function and $g$ is a proper, lower semi-continuous convex function, then we know that the method has a convergence rate of $O(1/\sqrt{\epsilon})$ to reach an error $\epsilon$.
$$F(x_k) - F(x^*) \le \frac{2L\|x_0 - x^*\|^2}{(k+1)^2}$$
To start the proof, we need to reintroduce three lemmas from the original proximal gradient definition. The first is the standard descent lemma coming from f being L-smooth.
$$f(x) \le f(y) + \langle x - y, \nabla f(y)\rangle + \frac{L}{2}\|x - y\|^2$$
The second comes from the optimality conditions of $p_L(y)$ and the fact that it’s a convex problem. We know that $z = p_L(y)$ if and only if there exists $\gamma(y) \in \partial g(z)$ such that the following holds.
$$\nabla f(y) + L(z - y) + \gamma(y) = 0$$
The third comes from careful analysis of the convexity of both $f$ and $g$. For any $x$, if we have a point $y$ that satisfies $F(p_L(y)) \le Q_L(p_L(y), y)$, then we have the following.
$$F(x) - F(p_L(y)) \ge \frac{L}{2}\|p_L(y) - y\|^2 + L\langle y - x,\, p_L(y) - y\rangle$$
To prove the third lemma, we can first look at the inequalities from the definition of convexity for both functions and then combine them to get a bound on F(x).
After turning the product into $L\langle x - p_L(y) + y - y,\, y - p_L(y)\rangle$ by adding $y - y$, we can simplify into our original statement, proving the lemma.
$$F(x) - F(p_L(y)) \ge \frac{L}{2}\|p_L(y) - y\|^2 + L\langle y - x,\, p_L(y) - y\rangle$$
Now that we have the prep work laid out, we can start on the convergence proof itself. The proof follows by first establishing a recursive relationship between steps and then using it to bound the suboptimality. We start by applying the third lemma at $x = x_k$ and $x = x^*$ to derive two bounds, where we use $y = y_{k+1}$ and $L = L_{k+1}$ for both.
To simplify the notation, we define $v_k = F(x_k) - F(x^*)$ and use the fact that $F(x_k) - F(x_{k+1}) = (F(x_k) - F(x^*)) - (F(x_{k+1}) - F(x^*))$ to put both bounds in terms of the same quantities.
To make this a meaningful bound, we need to make the second norm on the right-hand side match the form of the first. From the defined update for $y_{k+1}$ we know that $t_{k+1}y_{k+1} = t_{k+1}x_k + (t_k - 1)(x_k - x_{k-1})$, so we can simplify the second norm into the needed form.
Now we can define $u_k = t_k x_k - (t_k - 1)x_{k-1} - x^*$ to finally get the bound into a more workable form.
$$\frac{2}{L_{k+1}}\left(t_k^2 v_k - t_{k+1}^2 v_{k+1}\right) \ge \|u_{k+1}\|^2 - \|u_k\|^2$$
Using the fact that $L_{k+1} \ge L_k$ (either from a static stepsize or from the specific backtracking line search method defined in the paper), we can split the left-hand side into two terms, one dealing only with iteration $k$ and the other dealing only with iteration $k+1$. This acts as our main recursive relationship for the rest of the proof.
$$\frac{2t_k^2}{L_k}v_k - \frac{2t_{k+1}^2}{L_{k+1}}v_{k+1} \ge \|u_{k+1}\|^2 - \|u_k\|^2$$
To simplify the notation of the analysis in the next steps, we define $a_k = \frac{2t_k^2}{L_k}v_k$ and $b_k = \|u_k\|^2$ to rewrite the relationship as follows.
$$a_k - a_{k+1} \ge b_{k+1} - b_k$$
Since we know that $a_k, b_k \ge 0$ and that the sequence $a_k + b_k$ is monotonically non-increasing (from the fact that we can rearrange and see that $a_k + b_k \ge a_{k+1} + b_{k+1}$), we know that if $a_1 + b_1 \le c$ for some constant $c$, then for all $k$ we have $a_k \le c$. If we define $c = \|x_0 - x^*\|^2$ and use the fact that we initialize $t_1 = 1$, we get an inequality that would bound the entire sequence $a_k$ if proven correct.
$$\frac{2}{L_1}v_1 + \|x_1 - x^*\|^2 \le \|y_1 - x^*\|^2$$
To start the proof of this inequality, we apply the third lemma with $x = x^*$, $y = y_1$, and $L = L_1$, and then substitute $p_{L_1}(y_1) = x_1$ from the method’s definition.
Using the same norm identity that we used for the recursive relation proof lets us simplify in the same way, proving that the base case we needed holds.
From this, we know that we have $a_k \le c$ for all $k$, which after isolating $v_k$ and using the fact that we define $t_k$ in a way that makes $t_k \ge \frac{k+1}{2}$ hold can be used to simplify the denominator, proving the convergence theorem.
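The claim $t_k \ge \frac{k+1}{2}$ is easy to check numerically against the $t_{k+1} = \frac{1+\sqrt{1+4t_k^2}}{2}$ recursion (a quick sketch of my own):

```python
# Verify t_k >= (k + 1) / 2 for the FISTA stepsize recursion with t_1 = 1.
t, ok = 1.0, True
for k in range(1, 200):
    ok = ok and t >= (k + 1) / 2
    t = (1 + (1 + 4 * t * t) ** 0.5) / 2  # t_{k+1}
print(ok)  # -> True
```

The inductive reason is that $\sqrt{1 + 4t^2} \ge 2t$, so each step grows $t$ by at least $\frac{1}{2}$.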
Proximal Newton:
In the same way that we can apply the ideas behind gradient descent and Nesterov acceleration to PPMs, we can also apply Newton’s Method. The same general principle of using a quadratic approximation of $f$ is used in Proximal Newton, but the Hessian of $f$ is used to improve the approximation.
We can then simplify this subproblem into the iteration rule for the method, where we define $H_k = \nabla^2 f(x_k)$ and $v = x_k - H_k^{-1}\nabla f(x_k)$ for notational simplicity.
As will be covered later, Proximal Newton has the best convergence we have seen yet once within a certain range of the optimum, the range where we know the quadratic approximation is highly accurate to the actual landscape of the function. The biggest issue with the algorithm is that it still suffers from the same instability that Newton’s Method itself suffers from. Far from the optimum, even in strongly convex settings, the function need not look like a quadratic, so although the quadratic approximation is highly accurate locally, it often makes incorrect assumptions about the surrounding landscape.
The aggressiveness of the approximation makes the general direction of each iteration quite accurate, but the scale of each step is often inaccurate and is the main problem when it comes to the algorithm diverging. This is why the use of a backtracking line search stepsize tk is so common, and will be required by the theorems following.
$$x_{k+1} = x_k + t_k\left(\mathrm{prox}_g^{H_k}\!\left(x_k - H_k^{-1}\nabla f(x_k)\right) - x_k\right)$$
As a quick recap, a backtracking line search is built on the idea of satisfying the Armijo Condition, an inequality that ensures that the actual decrease of the step taken is at least $\alpha \in (0, 0.5)$ times the predicted decrease.
$$F(x_k + t d_k) \le F(x_k) + \alpha t \Delta_k$$
To ensure that each step satisfies this, we start with a full step $t_k = 1$. If the Armijo condition holds, then we simply take the step. If not, we shrink the stepsize $t_k \leftarrow \beta t_k$ by some $\beta \in (0, 1)$ and repeat until it does satisfy the inequality.
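As a sketch (my own, with hypothetical parameter choices), the backtracking loop looks like:

```python
def backtracking(F, x, d, delta, alpha=0.25, beta=0.5, max_halvings=50):
    # Shrink t until the Armijo condition F(x + t*d) <= F(x) + alpha*t*delta
    # holds, where delta is the predicted decrease (negative for descent).
    t, Fx = 1.0, F(x)
    for _ in range(max_halvings):
        if F(x + t * d) <= Fx + alpha * t * delta:
            break
        t *= beta
    return t

# 1-D check on F(x) = x^4 at x = 1 with descent direction d = -1,
# predicted decrease delta = F'(1) * d = -4.
print(backtracking(lambda x: x ** 4, 1.0, -1.0, -4.0))  # -> 1.0
```

Here the full step already satisfies the condition, so no shrinking happens; a steeper predicted decrease would force one or more halvings.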
Global Convergence:
If $f$ is $m$-strongly convex and $L$-smooth, $g$ is convex, and $t_k$ is chosen using a backtracking line search, then we know that the method has a convergence rate of $O(\log(1/\epsilon))$. This can be detailed below with some constant $\gamma \in (0, 1)$ which scales with $\frac{m}{L}$ and is penalized by the search parameters for the algorithm behind $t_k$.
$$F(x_{k+1}) - F(x^*) \le (1 - \gamma)(F(x_k) - F(x^*))$$
To make the relationship between Proximal Newton and the previous methods clearer, we can see the convergence rates of proximal gradient and accelerated proximal gradient under the same conditions as this theorem in terms of the condition number $\kappa = \frac{L}{m}$, which conveys how stretched the function is, so functions with a higher $\kappa$ are harder to optimize. Since strong convexity is added to $f$, Proximal Gradient’s convergence rate becomes $O(\kappa\log(1/\epsilon))$ and the acceleration gets it to $O(\sqrt{\kappa}\log(1/\epsilon))$. Proximal Newton’s convergence rate doesn’t rely on the condition number since it uses the Hessian of the function, removing any guesswork that the other two make about the curvature of the function.
The proof follows by first proving that the direction of the iteration is a strict descent direction, and then using the line search to derive a recursive relationship between the function values of iterates.
We can first derive some properties of the direction $d_k = \hat{x}_{k+1} - x_k$ by using the optimality conditions of the subproblem, since $f$ and $g$ are convex. We define $v \in \partial g(x_k + d_k)$ and $H_k = \nabla^2 f(x_k)$ for simplicity.
We can prove that $d_k$ is a descent direction by looking at the directional derivative of the problem along $d_k$, where we need to use the change in $g$ rather than its derivative due to its non-differentiability.
$$\Delta_k = \nabla f(x_k)^T d_k + g(x_k + d_k) - g(x_k)$$
Since $g$ is convex, we know that $g(x_k) \ge g(x_k + d_k) - v^T d_k$, so along with the properties from optimality above we can simplify.
$$\Delta_k \le \nabla f(x_k)^T d_k + v^T d_k = (\nabla f(x_k) + v)^T d_k = -d_k^T H_k d_k$$
Since $f$ is $m$-strongly convex, we know that $H_k \succeq mI$ with $m > 0$. This means we can bound the right-hand side, proving that $d_k$ is a strict descent direction, as $\Delta_k = 0$ only if $d_k = 0$.
$$\Delta_k \le -m\|d_k\|^2 \le 0$$
Next, we can use the backtracking line search to show that the algorithm leads to a sufficient decrease in F(x). Since we know that the Armijo Condition is satisfied, we know that the following applies.
$$F(x_k + t_k d_k) \le F(x_k) + \alpha t_k \Delta_k$$
To make this descent meaningful, we need to lower bound tk to make sure that there is progress being made, which we do through manipulation of the function properties. Since f is L-smooth, we can use the descent lemma to know the first bound, and since g is convex and t∈(0,1], we can derive the second bound.
This gives us an upper bound parabola with terms we can work with. To show that the Armijo Condition can be satisfied, we can require that the right-hand side of the Armijo equation is below the parabola we defined here.
Since $\alpha \in (0, 0.5)$, we know that $t \le \frac{m}{L}$ will satisfy the Armijo Condition. This leads to either $t_k = 1$ being accepted, or a lower bound of $\frac{\beta m}{L}$ if we do need to scale by $\beta$, so $t_k \ge \min(1, \frac{\beta m}{L})$. Since we know that $m \le L$ by definition of the properties and $0 < \beta < 1$, this simplifies to the following.
$$t_k \ge \frac{\beta m}{L}$$
Now that we’ve done the proper setup, we can start working on the final bound on suboptimality. To start we use the fact that $f(x)$ is strongly convex and that $g(x)$ is convex, where again $v \in \partial g(x_k + d_k)$.
We can then combine these bounds to derive a bound on the global minimum $F(x^*)$. We can add $\nabla f(x_k)^T d_k + g(x_k)$ to both sides to simplify, in turn introducing our previous $\Delta_k$.
Using the Peter-Paul inequality, which states that for any two vectors $a$ and $b$ and any constant $\gamma > 0$ we have $a^T b \le \frac{1}{2\gamma}\|a\|^2 + \frac{\gamma}{2}\|b\|^2$, we can simplify again. Specifically we use the inequality with $a = H_k d_k$, $b = x^* - x_k$, and $\gamma = m$ to cancel out the $\|x^* - x_k\|^2$ term.
$$F(x^*) \ge F(x_k) + \Delta_k - \frac{1}{2m}\|H_k d_k\|^2 + d_k^T H_k d_k$$
Since $f$ is $L$-smooth, we know that $H_k \preceq LI$, so $\|H_k d_k\|^2 \le L\,(d_k^T H_k d_k)$, which we can use along with $d_k^T H_k d_k \le -\Delta_k$ to get the following.
Finally, we can rearrange to isolate $\Delta_k$ to derive a bound on it, which we can substitute into the Armijo Condition to derive our final inequality. We can define $\gamma = \frac{2\alpha\beta m^2}{L^2}$, which satisfies $0 < \gamma < 1$, proving the theorem.
$$F(x_{k+1}) \le F(x_k) - \alpha\left(\frac{\beta m}{L}\right)\left[\frac{2m}{L}\left(F(x_k) - F(x^*)\right)\right]$$
Local Convergence:
The real strength of the method can be seen once the iterate gets close enough to the optimal $x^*$, when the quadratic approximation becomes highly accurate. If $f$ is $m$-strongly convex, $L$-smooth, and has an $M$-Lipschitz Hessian, $g$ is convex, and $\|x_k - x^*\| < \epsilon$ where $\epsilon \approx \frac{m}{M}$, then we know the method has a convergence rate of $O(\log\log(1/\epsilon))$, which is detailed below.
$$\|x_{k+1} - x^*\| \le \frac{M}{2m}\|x_k - x^*\|^2$$
Once the iterate gets close enough to x∗, we also know that the backtracking line search will always return tk=1, leading to the steps not being damped.
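With $g = 0$ the method reduces to plain Newton’s method, so the quadratic rate is easy to see numerically; here is a small sketch of my own on $f(x) = e^x - x$ (minimized at $x^* = 0$), where each error is roughly the square of the previous one:

```python
import math

grad = lambda x: math.exp(x) - 1.0  # f(x) = e^x - x, so f'(x) = e^x - 1
hess = lambda x: math.exp(x)        # f''(x) = e^x

x, errors = 0.5, []
for _ in range(4):
    x = x - grad(x) / hess(x)  # full Newton step (t_k = 1 near the optimum)
    errors.append(abs(x))

# Quadratic convergence: each error is bounded by the previous error squared.
print(all(e_next <= e * e for e, e_next in zip(errors, errors[1:])))  # -> True
```

In four steps the error drops from about $10^{-1}$ to about $10^{-10}$, which is the "doubling of correct digits" behavior the $O(\log\log(1/\epsilon))$ rate describes.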
The proof follows by first proving that the error between the actual function and the approximation is bounded, and using it to prove the recursive relationship we want.
We start by finding two subgradients $v^* \in \partial g(x^*)$ and $v_{k+1} \in \partial g(x_{k+1})$ from the optimality conditions of the problem as a whole and of the subproblem for each iteration.
$$-v^* = \nabla f(x^*), \qquad -v_{k+1} = \nabla f(x_k) + \nabla^2 f(x_k)(x_{k+1} - x_k)$$
We can use these two to define an inequality using the monotone gradient property of the convex g, which we can combine with the strong convexity of f to define an inequality with all of the terms we need.
The term $\nabla f(x_{k+1}) + v_{k+1}$ represents the error between the expected gradient at the next iterate and the real gradient, which we can bound using the $M$-Lipschitz continuity of the Hessian. This is where the use of the Hessian becomes important, as it makes this error much smaller than in typical gradient descent methods.
To write the difference $\nabla f(x_{k+1}) - \nabla f(x_k)$ in terms that can be combined with the Hessian term, we can use the Fundamental Theorem of Calculus to express it as a path integral over $p(\tau) = x_k + \tau(x_{k+1} - x_k)$, since we know that $\frac{dp}{d\tau} = x_{k+1} - x_k$.
Since the Hessian is $M$-Lipschitz, we can take the norm of both sides and bound the terms inside the bracket. This leads to an integral with a constant times $\tau$ as the integrand, so we can simplify using the fact that $C\int_0^1 \tau\, d\tau = \frac{C}{2}$.
To finish the proof, we substitute this into our original bound to derive the desired result.
$$\|x_{k+1} - x^*\| \le \frac{M}{2m}\|x_k - x^*\|^2$$
Regularized Proximal Quasi-Newton:
One of the main problems with the default Proximal Newton setup is that in modern applications, the added function evaluations for a backtracking line search are extremely computationally expensive. This is especially annoying in Newton’s method, since we only need the line search because the local model can overfit and choose a bad step. Regularized Proximal Quasi-Newton aims to fix this by updating and re-solving the proximal subproblem itself, rather than trying to force the first direction that was obtained to work.
To follow the syntax of the paper, we redefine the composite optimization problem as one over a continuously differentiable function $f$ and a convex function $\varphi$, with $\psi = f + \varphi$. This means we redefine the original Proximal Newton step as a minimization subproblem over a quadratic model $q_k$, and then perform $x_{k+1} = x_k + d_k$.
$$d_k = \arg\min_d q_k(d), \qquad q_k(d) = f(x_k) + \nabla f(x_k)^T d + \frac{1}{2}d^T B_k d + \varphi(x_k + d)$$
Instead of using this approximation, the regularized proximal quasi-newton method uses a regularized approximation q^k with the regularization controlled by a parameter μk>0.
$$\hat{q}_k(d) = q_k(d) + \frac{\mu_k}{2}\|d\|^2$$
To help define optimality for the problem, the paper uses the residual of a proximal gradient step at the current iterate, $r(x_k)$, as a gauge, since $d_k = 0$ is a sufficient but not necessary condition for stationarity when $B_k + \mu_k I$ isn’t positive definite (which is not required for anything here).
$$r(x) = \mathrm{prox}_\varphi(x - \nabla f(x)) - x$$
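For intuition, with $\varphi = \lambda\|\cdot\|_1$ the residual is computable in closed form via soft-thresholding; here is a small sketch of my own, checking that $r$ vanishes at the known minimizer of $\frac{1}{2}(x-1)^2 + \lambda|x|$:

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t*||.||_1, applied elementwise
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def residual(x, grad_f, lam):
    # r(x) = prox_phi(x - grad_f(x)) - x, with phi = lam*||.||_1;
    # r(x) = 0 exactly at stationary points of f + phi.
    return soft_threshold(x - grad_f(x), lam) - x

grad_f = lambda x: x - 1.0   # f(x) = 0.5*(x - 1)^2
lam = 0.5                    # minimizer of f + lam*|x| is x* = 0.5
print(residual(np.array([0.5]), grad_f, lam))  # -> [0.]
```

Unlike $\|d_k\|$, this quantity is a reliable stationarity measure regardless of whether the model matrix is positive definite.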
To help quantify the quality of the step $d_k$, the paper also introduces two terms for each iteration: the predicted reduction $\mathrm{pred}_k$ (which uses the nonregularized model as a better gauge of how much confidence we should have in the step) and the actual reduction $\mathrm{ared}_k$, where the ratio $\rho_k = \mathrm{ared}_k/\mathrm{pred}_k$ gives a good idea of how accurate the approximation was.
$$\mathrm{pred}_k = \psi(x_k) - q_k(d_k), \qquad \mathrm{ared}_k = \psi(x_k) - \psi(x_k + d_k)$$
The method starts by initializing parameters $x_0 \in \mathbb{R}^n$, $\mu_0 > 0$, $p_{\min} \in (0, \frac{1}{2})$ (a tolerance parameter), $c_1 \in (0, \frac{1}{2})$ (lower threshold for $\rho_k$), $c_2 \in (c_1, 1)$ (upper threshold for $\rho_k$), $\sigma_1 \in (0, 1)$ (regularization shrink parameter), $\sigma_2 > 1$ (regularization growth parameter), and $k = 0$. Then the method can be broken into three steps.
First, we use a suitable termination criterion to check the current xk and terminate accordingly.
Second, we choose $B_k \in \mathbb{R}^{n \times n}$ and find a solution $d_k$ to $\min_d \hat{q}_k(d)$. If the problem has no solution, or if $\mathrm{pred}_k \le p_{\min}\|d_k\| \cdot \|r(x_k)\|$ (checking that the predicted drop is a sufficient decrease, since $B_k$ is not necessarily positive definite), we update $x_{k+1} = x_k$ and $\mu_{k+1} = \sigma_2\mu_k$ and move on to the next iteration.
Third, if the previous $d_k$ passed, we set $\rho_k = \mathrm{ared}_k/\mathrm{pred}_k$ and update based on how successful the step was: unsuccessful if the iteration was skipped or $\rho_k \le c_1$ (set $x_{k+1} = x_k$ and $\mu_{k+1} = \sigma_2\mu_k$), successful if $c_1 < \rho_k \le c_2$ (set $x_{k+1} = x_k + d_k$ and keep $\mu_{k+1} = \mu_k$), and highly successful if $\rho_k > c_2$ (set $x_{k+1} = x_k + d_k$ and shrink $\mu_{k+1} = \sigma_1\mu_k$).
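The acceptance logic in the third step can be sketched as follows (my own paraphrase with hypothetical constants; `sigma1` shrinks and `sigma2` grows the regularization, mirroring a trust-region update):

```python
def accept_step(pred, ared, mu, c1=0.25, c2=0.75, sigma1=0.5, sigma2=2.0):
    # Compare actual vs predicted reduction and adjust the regularization mu.
    rho = ared / pred
    if rho <= c1:                 # unsuccessful: reject step, grow mu
        return False, mu * sigma2
    if rho <= c2:                 # successful: accept step, keep mu
        return True, mu
    return True, mu * sigma1      # highly successful: accept step, shrink mu

print(accept_step(pred=1.0, ared=0.9, mu=1.0))  # -> (True, 0.5)
```

A small $\mu_k$ means the method trusts the quasi-Newton model almost fully, while a large $\mu_k$ pushes the step towards a short (proximal-gradient-like) direction, which is how the method avoids a per-step line search.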
A bunch of preliminaries are mentioned in the paper before going into the convergence proofs, so I’ll include the four that were novel to me personally here since they are also important for later steps of the proof.
As a refresher for the first two, we define a scaled proximal operator with some matrix $H$ and a function $\varphi$ as follows, where in the following theorems we assume that $H \in S^n_{++}$ (symmetric positive definite) and that $\varphi$ is convex.
$$\mathrm{prox}_\varphi^H(v) = \arg\min_x \left\{ \varphi(x) + \frac{1}{2}(v - x)^T H (v - x) \right\}$$
First, we can prove that $p = \mathrm{prox}_\varphi^H(x)$ if and only if $p \in x - H^{-1}\partial\varphi(p)$. We start by defining the objective function $J(z)$ of the operator.
$$J(z) = \varphi(z) + \frac{1}{2}(z - x)^T H (z - x)$$
Since $\varphi$ is convex and $H$ is positive definite, we know $J(z)$ is strictly convex, thus we know that a point $p$ is the global minimizer of $J(z)$ if and only if $0 \in \partial J(p)$. We can then use the sum rule for subdifferentials to simplify this statement into what we need.
$$\partial J(p) = \partial\varphi(p) + H(p - x)$$
$$0 \in \partial\varphi(p) + H(p - x)$$
$$H(x - p) \in \partial\varphi(p)$$
Since H is positive definite, it is invertible, which allows us to derive what we originally stated, concluding the proof.
$$x - p \in H^{-1}\partial\varphi(p)$$
$$p \in x - H^{-1}\partial\varphi(p)$$
Second, we can prove that the proximal operator is firmly nonexpansive with respect to the norm induced by $H$, i.e. for every $x, y \in \mathbb{R}^n$ we have the following.
$$\left\|\mathrm{prox}_\varphi^H(x) - \mathrm{prox}_\varphi^H(y)\right\|_H^2 \le \left\langle x - y,\; \mathrm{prox}_\varphi^H(x) - \mathrm{prox}_\varphi^H(y)\right\rangle_H$$
For simplicity, we can define $p = \mathrm{prox}_\varphi^H(x)$ and $q = \mathrm{prox}_\varphi^H(y)$. From the optimality conditions we showed previously we know the following.
$$H(x - p) \in \partial\varphi(p), \qquad H(y - q) \in \partial\varphi(q)$$
Since $\varphi$ is a convex function, we know its subdifferential operator is monotone: for any $p, q$ and their corresponding subgradients $u, v$ we have $\langle u - v, p - q\rangle \ge 0$. Applying this to our points gives the following, which rearranges into the firm nonexpansiveness claim.
$$\left\langle H(x - p) - H(y - q),\; p - q\right\rangle \ge 0$$
For the third, we have to prove that if $d_k = 0$ then $x_k$ is a stationary point of $\psi$, with the converse being true if $B_k + \mu_k I$ is positive definite. Keep in mind that the algorithm itself does not require $B_k + \mu_k I$ to be positive definite anywhere, which is why the more reliable $r(x_k)$ stopping criterion is used instead; we just need this for some assumptions made later.
To prove that $d_k = 0$ implies $x_k$ is a stationary point, we only need the definition of $d_k$ and Fermat’s Rule, which with $d_k = 0$ trivially simplifies into the statement of stationarity of $x_k$.
$$0 \in \nabla f(x_k) + (B_k + \mu_k I)d_k + \partial\varphi(x_k + d_k)$$
$$0 \in \nabla f(x_k) + \partial\varphi(x_k)$$
To prove that $x_k$ being a stationary point leads to $d_k = 0$ if $B_k + \mu_k I$ is positive definite, we first get the fact that $-\nabla f(x_k) \in \partial\varphi(x_k)$ from stationarity. We can then use this subgradient in the definition of $\varphi$ being convex to know $\varphi(x_k + d) \ge \varphi(x_k) - \nabla f(x_k)^T d$. Using this we can analyze the objective for our iteration subproblem.
$$\hat{q}_k(0) = f(x_k) + \varphi(x_k) \le f(x_k) + \nabla f(x_k)^T d + \varphi(x_k + d)$$
We can add the term $\frac{1}{2}d^T(B_k + \mu_k I)d$ to the right-hand side, since by $B_k + \mu_k I \succ 0$ we know the term is nonnegative. This makes the right-hand side exactly $\hat{q}_k(d)$, leading to the statement that $d_k = 0$ is the global minimizer, which is guaranteed to be the only solution by the positive definiteness of $B_k + \mu_k I$ making the subproblem strictly convex, thus proving the statement.
The proof borrows a result from a work that will be mentioned again later, Block Coordinate Descent. From a lemma in that paper we get a bound relating the residuals weighted by two matrices $H$ and $\tilde{H}$, where $Q = H^{-1/2}\tilde{H}H^{-1/2}$. The proof for the lemma follows from the optimality conditions of the residuals, Fermat’s Rule, and some careful algebraic manipulation.
Convergence:
If $\{B_k\}$ is a bounded sequence of symmetric matrices and $\psi$ is bounded from below (along with the requirements from the problem definition), then any sequence $\{x_k\}$ generated by the method satisfies $\liminf_{k\to\infty}\|r(x_k)\| = 0$. This ensures that at least one subsequence of the iterates converges towards a stationary point, although this doesn’t require that the method itself converges (with that being proven in a later section).
The proof follows through a long series of lemma work to prove through contradiction that there will be an infinite number of successful steps if the algorithm were to never reach termination, and then use that to prove the theorem itself through another contradiction.
To start the prep work for the first contradiction, we establish an inequality that relates $r(x_k)$ and $d_k$ if we assume that $\{x_k\}$ converges to a nonstationary point $\bar{x}$ of $\psi$. We also assume that $\{B_k\}$ is a bounded sequence of symmetric matrices and that $\mu_k \to \infty$.
Since $\{B_k\}$ is bounded, we know that there is some constant $C$ such that $\lambda_{\min}(B_k) \ge -C$, which combined with the fact that $\mu_k \to \infty$ means we can guarantee $\lambda_{\min}(B_k + \mu_k I) \ge \mu_k - C > 0$ for large enough $k$, meaning $B_k + \mu_k I$ is positive definite for large enough $k$. This lets us use the lemma we established previously about the weighted residuals with $H = B_k + \mu_k I$ and $\tilde{H} = I$.
Since $\|r(x)\| = 0$ if and only if $x$ is a stationary point, we know that $\|r_{B_k + \mu_k I}(x_k)\| \neq 0$, so we can divide it from both sides, which along with dividing by $\mu_k$ and taking $k \to \infty$ lets us derive the bound we need for future steps.
Second, we show that under high regularization $d_k$ will always converge to $0$, under the assumptions that $\{B_k\}$ and $\{x_k\}$ are bounded sequences and $\mu_k \to \infty$.
Since $\{B_k\}$ being bounded and $\mu_k \to \infty$ lead to $B_k + \mu_k I$ being positive definite from previous steps, we know that the subproblem turns into a strictly convex problem for sufficiently large $k$, meaning that $d_k$ is well defined in such cases. As well, since each iteration is only accepted if it’s successful or highly successful (which for the rest of the proof will be lumped together as successful), we know that the sequence can’t worsen the objective, so $\{\psi(x_k)\}$ is monotonically non-increasing. From these two we have the following from the definition of $d_k$.
$$\psi(x_0) \ge \psi(x_k) = \hat{q}_k(0) \ge \hat{q}_k(d_k)$$
We can then use the convexity of $\varphi$ to know that $\varphi(x_k + d_k) \ge \varphi(x_k) + u_k^T d_k$ (for some subgradient $u_k \in \partial\varphi(x_k)$) after expanding the defined $\hat{q}_k$ and $\psi$ to get the bound in more separable terms.
Since $\{x_k\}$ and $\{B_k\}$ are bounded, from continuity we know that $\{f(x_k)\}$, $\{\psi(x_k)\}$, and $\{\nabla f(x_k)\}$ are bounded. As well, since the subdifferential maps bounded sets to bounded sets, we know that $\{u_k\}$ is bounded as well. Since this entire inequality is bounded by the fixed value $\psi(x_0)$, the right-hand side needs to stay bounded as $\mu_k \to \infty$. This means that $\frac{1}{2}d_k^T(B_k + \mu_k I)d_k$ needs to be bounded, so $d_k \to 0$.
Third, we use this statement on dk to prove that the directional derivative along the normalized step is strictly negative, which is going to be a key property for the later contradiction.
We assume that $\{B_k\}$ is a bounded sequence of symmetric matrices, $\mu_k \to \infty$, and $\bar{x}$ is a nonstationary point of $\psi$. We also define $\bar{d}_k = r_{B_k + \mu_k I}(\bar{x})$ and let $s$ be an accumulation point of the sequence $\{\bar{d}_k/\|\bar{d}_k\|\}$ (which we know has accumulation points since the sequence is by definition bounded). Due to Fermat’s Rule we have the following from the iteration subproblem for some $u_k \in \partial\varphi(\bar{x} + \bar{d}_k)$.
$$0 = \nabla f(\bar{x}) + (B_k + \mu_k I)\bar{d}_k + u_k$$
Since $(B_k + \mu_k I)\bar{d}_k$ is exactly equal to $-(\nabla f(\bar{x}) + u_k)$, which cannot be $0$ due to the nonstationarity of the convergent point, we know that even though $\bar{d}_k \to 0$ it is never exactly zero, so the sequence $\{\bar{d}_k/\|\bar{d}_k\|\}$ is well-defined and $s$ is a real point.
Since the sequence of normalized vectors and the sequence of subgradients are both bounded, we can use the Bolzano-Weierstrass Theorem to form a subsequence $K$ that satisfies the following. Since the subdifferential is closed we also know that $\bar{u} \in \partial\varphi(\bar{x})$, which means we have $\nabla f(\bar{x}) + \bar{u} \neq 0$ from nonstationarity.
$$\frac{\bar{d}_k}{\|\bar{d}_k\|} \to_K s, \qquad u_k \to_K \bar{u}$$
From the previous proofs on Proximal Newton, we know that for some regularized proximal step $d$ we have $\psi'(x; d) \le -d^T H d$, where $H$ is the Hessian of the quadratic model. Since we have $H = B_k + \mu_k I$, we can substitute the value in and then use the eigenvalues to simplify.
As $\varphi$ is a convex function, we know that $\psi'(\bar{x}; d)$ is sublinear in $d$, so it satisfies positive homogeneity (the first line of the following), which we can use to bound the derivative in terms of our normalized sequence.
Since we have $k \to_K \infty$, we can use our previous analysis of the convergent subsequence to prove that the directional derivative is strictly negative. Since $\nabla f(\bar{x}) + \bar{u}$ is nonzero, we know its norm has to be positive, and since we have $\mu_k \to \infty$, the multiplicative term including it converges to $1$.
$$\psi'(\bar{x}; s) = \lim_{K \ni k \to \infty} \psi'\!\left(\bar{x}; \frac{\bar{d}_k}{\|\bar{d}_k\|}\right) \le -\|\nabla f(\bar{x}) + \bar{u}\| < 0$$
Finally for the prep work, we can use this and our previous results to show that the algorithm will perform infinitely many successful steps if it goes on forever.
We assume for contradiction that there exists some k0∈N such that all steps k≥k0 are unsuccessful. This implies that xk=xk0 for all k≥k0, means that we have ∥r(xk)∥=0 since we haven’t reached the stopping criterion yet (so xk0 is nonstationary), and also μk→∞ by the algorithm definition. Since {Bk} is a bounded sequence, this means that the matrices converge to being positive definite and dk is well-defined. Also from previous lemmas is the fact that dk=0 and that we have the following (from the first lemma since we have pmin<0.5 and dk=rBk+μkI(xk) from the regularized matrix being positive definite).
∥r(x^k)∥ / (μ_k∥d^k∥) < 1/(2p_min)
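A quick numeric check of this bound on a hypothetical one-dimensional instance (soft-thresholding gives both the residual and the regularized step in closed form; all constants are illustrative): the ratio tends to 1 as μ grows, comfortably below 1/(2p_min) = 2 here.

```python
import numpy as np

def soft(v, t):
    # soft-thresholding: prox of t*|.|
    return np.sign(v) * max(abs(v) - t, 0.0)

# f(x) = 0.5*(x - 3)^2 with an l1 term lam*|x| and a scalar model B
grad = lambda x: x - 3.0
lam, B, x, p_min = 0.5, 1.0, 1.0, 0.25
r = soft(x - grad(x), lam) - x              # residual r(x), independent of mu
for mu in [2.0, 10.0, 100.0, 1000.0]:
    H = B + mu
    d = soft(x - grad(x) / H, lam / H) - x  # regularized step
    ratio = abs(r) / (mu * abs(d))
    assert ratio < 1.0 / (2.0 * p_min)      # i.e. below 1/(2*p_min) = 2
```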
From the optimality of the subproblem we have q̂_k(d^k) ≤ q̂_k(0), which we can use to bound pred_k and show that the failure cannot come from the second acceptance test of the algorithm, meaning the steps must fail the first.
We can then divide by ∥d^k∥ to not only introduce our previous normalized sequence, but also bring some terms into difference-quotient form, for which we define t_k = ∥d^k∥ to simplify the notation.
From previous results, we can choose a subsequence K such that d^k/∥d^k∥ → s, which along with the fact that φ is locally Lipschitz continuous (this is what lets the left-hand term turn into ψ′(x^{k_0}; s)) gives the following limit inequality; the term on the right-hand side converges in a similar way since d^k → 0 and {B_k} is bounded.
ψ′(x^{k_0}; s) ≥ c_1·ψ′(x^{k_0}; s)
Since c_1 ∈ (0, 1), the only way for this to hold is ψ′(x^{k_0}; s) ≥ 0, which contradicts our previous proof that ψ′(x^{k_0}; s) < 0. This means the algorithm cannot get stuck at a nonstationary point.
With this, we can finally prove the original convergence theorem through another proof by contradiction. We assume for contradiction that liminf_{k→∞} ∥r(x^k)∥ > 0, along with the assumption that ψ is bounded from below. We also use the previous theorem to define the (infinite) set of successful iterations S; we can restrict attention to S because the properties proved below hold trivially for unsuccessful steps, where the iterate does not change.
Our assumption means that there exist some k_0 ∈ ℕ and ϵ > 0 such that ∥r(x^k)∥ ≥ ϵ for all k ≥ k_0. By the definition of successful steps, we get the following for all k ∈ S.
ψ(x^k) − ψ(x^{k+1}) = ared_k ≥ c_1·pred_k ≥ c_1·p_min·∥d^k∥·∥r(x^k)∥ ≥ c_1·p_min·ϵ·∥d^k∥
We can then sum over this recursive relationship to create a telescoping sum; since ψ is bounded from below, the telescoping sum is bounded. Isolating the sum over ∥d^k∥ gives the following.
∑_{k∈S, k≥k_0} ∥d^k∥ ≤ (ψ(x^{k_0}) − inf ψ) / (c_1·p_min·ϵ) < ∞
This means that {x^k} is a Cauchy sequence, and therefore converges to some x̄, which by our assumption is not a stationary point of ψ, since ∥r(x̄)∥ = lim_{k→∞} ∥r(x^k)∥ ≥ ϵ > 0.
Since there are infinitely many successful steps, for the sum to stay bounded we need ∥d^k∥ →_S 0. We can then use Fermat's rule on the subproblem to see that assuming {μ_k} is bounded leads to a contradiction.
0 = ∇f(x^k) + (B_k + μ_kI)d^k + u^k  ⟹  ∇f(x^k) + u^k →_S 0
This would give 0 = ∇f(x̄) + ū, which contradicts the nonstationarity of x̄, so we need {μ_k} →_S ∞. That also means {μ_k} → ∞ overall, since μ_k cannot decrease during unsuccessful steps, and that the algorithm performs infinitely many unsuccessful steps, since μ_k can't grow during successful iterations.
From previous steps we know that the following holds for sufficiently large k.
pred_k ≥ p_min·∥d^k∥·∥r(x^k)∥ ≥ p_min·ϵ·∥d^k∥
By the Mean Value Theorem, for every k there exists some ξ_k on the line segment between x^k and x^k + d^k such that f(x^k + d^k) − f(x^k) = ∇f(ξ_k)^T d^k. By the convergence of {x^k} to x̄, and since {d^k} → 0, the sequence {ξ_k} must also converge to x̄, which lets us compare ared_k and pred_k in the limit k → ∞.
This means that {ρ_k} → 1, so eventually all steps are successful, contradicting the fact that there are infinitely many unsuccessful steps. The assumption therefore fails, meaning either the algorithm stops at some k with ∥r(x^k)∥ = 0 or liminf_{k→∞} ∥r(x^k)∥ = 0, which concludes the proof.
Global Strong Convergence:
If {B_k} is a bounded sequence of symmetric matrices, ψ is bounded from below, and ∇f is uniformly continuous on a set X with {x^k} ⊂ X (along with the requirements from the problem definition), then any sequence {x^k} generated by the method satisfies lim_{k→∞} ∥r(x^k)∥ = 0. This ensures that every convergent subsequence mentioned in the global weak convergence theorem converges to a stationary point, although different subsequences can have different limits (convergence to a single stationary point is handled in a later section).
The proof follows by assuming for contradiction that the residual sequence does not converge to 0, then using the uniform continuity of ∇f to show that the residual cannot change drastically over the shrinking step sizes.
Since ∥r(x^k)∥ does not converge to 0 while liminf_{k→∞} ∥r(x^k)∥ = 0 (from the weak convergence theorem), there exist δ > 0 and a subsequence K with ∥r(x^k)∥ ≥ 2δ for k ∈ K, and for each such k an index ℓ(k) > k such that ∥r(x^l)∥ ≥ δ for all k ≤ l < ℓ(k) and ∥r(x^{ℓ(k)})∥ < δ. If an iteration k ≤ l < ℓ(k) is successful, we get the following (which also holds trivially for unsuccessful iterations, so we can sum over all iterations between k and ℓ(k)).
ψ(x^l) − ψ(x^{l+1}) ≥ c_1·p_min·∥d^l∥·∥r(x^l)∥ ≥ c_1·p_min·δ·∥x^{l+1} − x^l∥
By assumption ψ is bounded from below, and from the algorithm definition {ψ(x^k)} is monotonically decreasing, so the sequence converges. This implies that {ψ(x^k) − ψ(x^{ℓ(k)})} →_K 0, which in turn gives {∥x^{ℓ(k)} − x^k∥} →_K 0.
The uniform continuity of ∇f and of the proximal operator means that r(·) is uniformly continuous, so we also get {∥r(x^{ℓ(k)}) − r(x^k)∥} →_K 0, which leads to the following contradiction with our definition of ℓ(k), concluding the proof.
∥r(x^k) − r(x^{ℓ(k)})∥ ≥ ∥r(x^k)∥ − ∥r(x^{ℓ(k)})∥ ≥ 2δ − δ = δ
Convergence Under an Error Bound Condition:
The paper rounds out the convergence analysis by using an error bound strategy similar to the one used in Block Coordinate Descent. We assume that ψ is bounded from below, that the set of stationary points X* of ψ is nonempty, and that two properties hold which ensure the optimization landscape is regular enough for stable convergence: a local error bound for stationarity and a separation property between stationary points. First, for any ξ ≥ min_x ψ(x), there exist scalars τ > 0 and ϵ > 0 such that dist(x, X*) ≤ τ∥r(x)∥ whenever ψ(x) ≤ ξ and ∥r(x)∥ ≤ ϵ (this ensures that the residual is a good measure of the distance to stationary points). Second, there exists a scalar δ > 0 such that ∥x − y∥ ≥ δ whenever x ∈ X*, y ∈ X*, and ψ(x) ≠ ψ(y) (this ensures that stationary points with different function values are properly separated).
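As a sanity check of the first assumption, here is a hypothetical one-dimensional instance where the error bound holds with τ = 1: for ψ(x) = ½(x − a)² + λ|x|, the unique stationary point is the soft-threshold of a, and the residual is exactly the distance to it.

```python
import numpy as np

def soft(v, t):
    # soft-thresholding: prox of t*|.|
    return np.sign(v) * max(abs(v) - t, 0.0)

# psi(x) = 0.5*(x - a)^2 + lam*|x|; grad f(x) = x - a, so x - grad f(x) = a
a, lam = 2.0, 0.5
x_star = soft(a, lam)                  # the unique stationary point
for x in [-1.0, 0.0, 0.7, 3.0]:
    r = soft(x - (x - a), lam) - x     # residual r(x)
    # dist(x, X*) = |x - x_star| coincides with |r(x)|: tau = 1 works
    assert abs(abs(r) - abs(x - x_star)) < 1e-12
```

The second assumption is vacuous here, since X* is a single point.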
If f has a Lipschitz continuous gradient and MI ⪰ H_k ⪰ mI for some M ≥ m > 0 (along with the assumptions above), then the sequence {x^k} converges to some x̄. A byproduct of the proof is that the algorithm attains Q-linear convergence of the objective values, meaning it satisfies the following for some stationary value ψ̄ and some θ ∈ (0, 1).
ψ(x^{k+1}) − ψ̄ ≤ θ·(ψ(x^k) − ψ̄)
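To get a feel for this Q-linear decrease of the objective gap, here is a toy sketch in Python. It is not the paper's method, just the same accept/reject skeleton with a deliberately crude scalar model B_k = c·I on an ℓ1-regularized least-squares problem, so the subproblem has a closed-form soft-thresholding solution; all names and constants are illustrative.

```python
import numpy as np

def soft(v, t):
    # soft-thresholding: prox of t * ||.||_1 (componentwise)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def solve(A, b, lam, c=1.0, c1=0.25, iters=1000):
    # minimize psi(x) = 0.5*||Ax - b||^2 + lam*||x||_1
    psi = lambda z: 0.5 * np.sum((A @ z - b) ** 2) + lam * np.sum(np.abs(z))
    x, mu = np.zeros(A.shape[1]), 1.0
    vals = [psi(x)]
    for _ in range(iters):
        g = A.T @ (A @ x - b)                  # gradient of the smooth part
        if np.linalg.norm(soft(x - g, lam) - x) < 1e-5:
            break                              # residual r(x) small: stop
        H = c + mu                             # B_k + mu_k*I with B_k = c*I
        d = soft(x - g / H, lam / H) - x       # closed-form subproblem step
        # predicted decrease of the model (without the regularization term)
        pred = -(g @ d + 0.5 * c * (d @ d)
                 + lam * (np.sum(np.abs(x + d)) - np.sum(np.abs(x))))
        if psi(x) - psi(x + d) >= c1 * pred:   # successful step: accept
            x = x + d
            vals.append(psi(x))
            mu = max(1e-8, mu / 2.0)
        else:                                  # unsuccessful: increase mu_k
            mu *= 4.0
    return x, np.array(vals)

rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
x, vals = solve(A, b, lam=0.1)
gaps = vals - vals[-1]             # psi(x^k) - psi_bar, approximated numerically
ratios = gaps[1:-1] / gaps[:-2]    # consecutive gap ratios, bounded below 1 here
```

On this instance the consecutive gap ratios (ψ(x^{k+1}) − ψ̄)/(ψ(x^k) − ψ̄) stay below 1, which is the behaviour the theorem formalizes.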
The proof follows a very similar structure to the one in Block Coordinate Descent's paper, after some initial prep work. We start by assuming the sequence {H_k} is uniformly bounded and positive definite, i.e. for some 0 < m ≤ M we have mI ⪯ H_k ⪯ MI. Since H_k ⪰ mI, we know that d^{kT}H_kd^k ≥ m∥d^k∥², which applies analogously to H_k + μ_kI, giving d^{kT}(H_k + μ_kI)d^k ≥ (m + μ_k)∥d^k∥².
Using these two facts allows us to bound pred_k. We first use the convexity of φ to get φ(x^k) − φ(x^k + d^k) ≥ −(u^k)^T d^k for u^k ∈ ∂φ(x^k + d^k), and then the optimality condition of the subproblem, −(∇f(x^k) + u^k) = (H_k + μ_kI)d^k, which together give the following.
pred_k ≥ (m/2 + μ_k)·∥d^k∥²
Using the lemma we established about the residual gives us two more inequalities, which we will use later together with λ_max(H_k + μ_kI) ≤ M + μ_k and λ_min(H_k + μ_kI) ≥ m + μ_k.
Next, we have to prove that {μ_k} is a bounded sequence. We first show that if f is L-smooth and at some iterate x^k we have μ_k ≥ μ̄ := max{L − m, 0}, then ared_k > c_1·pred_k, i.e. the step is successful.
By the Lipschitz continuity of ∇f, we have the following descent bound, which simplifies further since μ_k ≥ μ̄ means μ_k ≥ L − m and therefore H_k + μ_kI ⪰ mI + (L − m)I = LI.
f(x^k + d^k) ≤ f(x^k) + ∇f(x^k)^T d^k + (L/2)·∥d^k∥²
By definition this leads to −ared_k ≤ −pred_k + (μ_k/2)·∥d^k∥², which combined with the first bound from the previous steps gives ared_k ≥ ((μ_k + m)/(2μ_k + m))·pred_k; using (μ_k + m)/(2μ_k + m) > 1/2 together with c_1 ≤ 1/2 concludes the proof of the lemma.
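A numeric sanity check of this lemma on a hypothetical scalar instance (L-smooth quadratic f, scalar model H_k = m, illustrative constants): once μ_k ≥ L − m, the actual reduction is at least half the predicted one.

```python
import numpy as np

def soft(v, t):
    # soft-thresholding: prox of t*|.|
    return np.sign(v) * max(abs(v) - t, 0.0)

L, m, lam, a, x = 4.0, 1.0, 0.3, 2.0, 0.5
psi = lambda z: 0.5 * L * (z - a) ** 2 + lam * abs(z)
g = L * (x - a)                        # gradient of f(z) = 0.5*L*(z - a)^2
for mu in [L - m, L - m + 1.0, 10.0]:
    H = m + mu                         # H_k + mu_k with scalar H_k = m
    d = soft(x - g / H, lam / H) - x   # regularized subproblem step
    pred = -(g * d + 0.5 * m * d * d + lam * (abs(x + d) - abs(x)))
    ared = psi(x) - psi(x + d)
    assert ared >= 0.5 * pred > 0      # so ared_k > c1*pred_k for any c1 <= 1/2
```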
Although this ensures that a step is accepted once μ_k is large enough, we still need to rule out μ_k being driven upward forever, so we use the previous result to prove by contradiction that {μ_k} is bounded whenever f is L-smooth and MI ⪰ H_k ⪰ mI.
We assume that {μ_k} is unbounded, meaning there is a subsequence K with {μ_k} →_K ∞, which implies there are infinitely many unsuccessful steps; without loss of generality, we can assume all steps k ∈ K are unsuccessful. By the previous lemma, this is only possible if pred_k < p_min·∥d^k∥·∥r(x^k)∥, which along with the first bound from the first lemma gives the following.
(m/2 + μ_k)·∥d^k∥ < p_min·∥r(x^k)∥
We can then combine this with the second bound from the first lemma to get the following for all k ∈ K.
(1 + 1/(m + μ_k))·(M + μ_k)/μ_k > (m + 2μ_k)/(2·p_min·μ_k)
If we take the limit over K, the left-hand side tends to 1 while the right-hand side tends to 1/p_min, giving 1 ≥ 1/p_min; since p_min ∈ (0, 1/2) implies 1/p_min > 2, this is a contradiction, completing the proof of the boundedness of {μ_k}.
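Plugging illustrative numbers into the two sides of the inequality (m, M, p_min arbitrary) shows the limiting behaviour that drives the contradiction:

```python
p_min, m, M = 0.25, 1.0, 10.0                             # illustrative constants
lhs = lambda mu: (1.0 + 1.0 / (m + mu)) * (M + mu) / mu   # tends to 1
rhs = lambda mu: (m + 2.0 * mu) / (2.0 * p_min * mu)      # tends to 1/p_min
assert abs(lhs(1e8) - 1.0) < 1e-5
assert abs(rhs(1e8) - 1.0 / p_min) < 1e-5
assert lhs(1e8) < rhs(1e8)    # the strict ">" cannot survive for large mu
```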
Finally, we can complete the proof of the original theorem using these lemmas and a couple of results from the previous proofs: ∥r(x^k)∥ → 0 and d^k → 0 under the assumptions of the previous convergence theorems, and {ψ(x^k)} is monotonically decreasing by the algorithm definition. Since {ψ(x^k)} is monotonically decreasing and bounded from below, it must converge to some finite ψ̄, which will be important in the later steps.
Since ∥r(x^k)∥ → 0, we have ∥r(x^k)∥ ≤ ϵ for all sufficiently large k, so we can apply the first error bound assumption and pick some closest stationary point x̄^k ∈ X* satisfying the following.
dist(x^k, X*) = ∥x^k − x̄^k∥ ≤ τ·∥r(x^k)∥
Since ∥r(x^k)∥ → 0, this simplifies to ∥x^k − x̄^k∥ → 0. We can then use it in a decomposed bound on the difference between consecutive projected points.
∥x̄^{k+1} − x̄^k∥ ≤ ∥x̄^{k+1} − x^{k+1}∥ + ∥x^{k+1} − x^k∥ + ∥x^k − x̄^k∥
From the convergence statement above, the first and third terms go to 0, and the second term ∥x^{k+1} − x^k∥ = ∥d^k∥ also goes to 0, so we have ∥x̄^{k+1} − x̄^k∥ → 0.
From the second error bound assumption, any two stationary points with different objective values must be at least δ apart; since consecutive x̄^k get arbitrarily close, for large k they must all share the same objective value ψ(x̄^k) = ψ̄.
We can then find a suboptimality bound between the current iterate's objective value and the limiting value ψ̄. We first use the convexity of φ and the fact that −∇f(x̄^k) ∈ ∂φ(x̄^k) to get the following.
φ(x^{k+1}) − φ(x̄^k) ≥ −∇f(x̄^k)^T(x^{k+1} − x̄^k)
Plugging this inequality into the objective gap we want to bound, together with the descent lemma from the L-smoothness of f, lets us simplify.
From the second inequality of the first lemma of this proof, we know that ∥r(x^k)∥ ≤ C_r∥d^k∥ with C_r = ((m + 1)/m)·(M + μ_k), which we can use to bound the previous optimality gap, defining C_L = (L/2)·(1 + τC_r)² for simplicity.
ψ(x^{k+1}) − ψ̄ ≤ C_L·∥d^k∥²
To make this bound meaningful, we need to replace the ∥d^k∥ term. For any successful step we know ared_k ≥ c_1·pred_k holds, which combined with pred_k ≥ (m/2)·∥d^k∥² from the previous lemma gives the following (which also holds trivially for unsuccessful steps, where the iterate does not move).
ψ(x^k) − ψ(x^{k+1}) ≥ (c_1·m/2)·∥x^{k+1} − x^k∥²
We can plug this into the previous suboptimality gap bound to get the desired convergence result by defining K = 2C_L/(c_1·m). Setting θ = K/(1 + K), we are guaranteed θ ∈ (0, 1) since K > 0, proving the convergence result shown above.
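For completeness, the rearrangement that produces θ (reconstructed from the bounds stated above) only uses ψ(x^k) − ψ(x^{k+1}) = (ψ(x^k) − ψ̄) − (ψ(x^{k+1}) − ψ̄):
ψ(x^{k+1}) − ψ̄ ≤ C_L·∥d^k∥² ≤ K·(ψ(x^k) − ψ(x^{k+1})) = K·[(ψ(x^k) − ψ̄) − (ψ(x^{k+1}) − ψ̄)]
(1 + K)·(ψ(x^{k+1}) − ψ̄) ≤ K·(ψ(x^k) − ψ̄)  ⟹  ψ(x^{k+1}) − ψ̄ ≤ (K/(1 + K))·(ψ(x^k) − ψ̄)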
Finally, to prove convergence to a single point x̄, we use the bound on ∥d^k∥². Since the decrease in a single step cannot exceed the remaining gap to the limit value, i.e. ψ(x^k) − ψ(x^{k+1}) ≤ ψ(x^k) − ψ̄, we can simplify to the following.
∥d^k∥ ≤ √( (2/(c_1·m))·(ψ(x^k) − ψ̄) )
Since the term inside the square root shrinks at a rate of θ, we know that ∥d^k∥ shrinks at a rate of √θ, and since 0 < θ < 1 we have √θ < 1 as well. This means the sum of all ∥d^k∥ is dominated by a geometric series with a common ratio less than 1, which converges to a finite value.
∑_{k=0}^∞ ∥x^{k+1} − x^k∥ = ∑_{k=0}^∞ ∥d^k∥ < ∞
This means that {x^k} is a Cauchy sequence, which must converge to a unique limit point, completing the proof.