Composite Optimization


This is another quick little review like the last ones. It’s going to cover the general problem setup and the background of Proximal Point Methods, then cover Proximal Gradient and Proximal Newton and some of their convergence properties. It’s also going to cover Accelerated Proximal Gradient and Regularized Proximal Quasi-Newton, along with the convergence information from their papers, including some explanations that I personally needed.

Problem Setup:

A composite optimization problem is an optimization problem over a function with two parts, a differentiable $f(x)$ and a potentially non-differentiable $g(x)$.

$$\min f(x)+g(x)$$

It is often assumed that $f(x)$ will have some highly complex structure while $g(x)$ is a simple function with a structure that can be taken advantage of. Consider the function $f(x)+\|x\|_1$, where the $L_1$ regularization promotes sparsity in the weights by pushing those close to zero to exactly zero, which is very helpful, although the non-differentiability of $\|x\|_1$ can cause headaches.

Although there are many methods of non-differentiable optimization out there, like Subgradient Descent (which for future reference would have a convergence rate of $O(1/\epsilon^2)$ to reach an accuracy of $\epsilon$), they lose a lot of information by not taking advantage of the fact that $f(x)$ is differentiable, which is where specific composite optimization methods come into play.

Proximal Point Methods:

The main piece of important background before getting into the algorithms themselves is the concept of a Proximal Point Method. Standard gradient descent uses the gradient to gauge how far a step should be based on how much the function is supposed to decrease by. Proximal point methods do the same, but through a minimization subproblem, since gradient information is not available. They make use of the proximal operator $\text{prox}_f(\cdot)$, which balances minimizing the function value against how far the step itself is.

$$\text{prox}_f(v)=\text{arg}\min_{x}\left(f(x)+\frac{1}{2}\|x-v\|^2\right)$$

A proximal point method simply takes this operator with a stepsize, defined below as $\eta$, and uses it for each iteration.

$$x_{k+1}=\text{arg}\min_z\left\{f(z)+\frac{1}{2\eta}\|z-x_k\|^2\right\}=\text{prox}_{\eta f}(x_k)$$

Although this general form is almost impossible to solve exactly at each iteration, there is often some sort of structure in $f(\cdot)$ that we can take advantage of to solve the subproblem more efficiently. All of the composite optimization methods covered here will also make the act of solving the subproblem much simpler.
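As a concrete (if inefficient) illustration, here is a minimal numerical sketch of the proximal point iteration where each subproblem is handed to a generic inner solver. This is only a toy sketch assuming SciPy is available and that $f$ is cheap to evaluate; the names and the quadratic objective are placeholders.

```python
# Minimal proximal point sketch: each iteration solves the prox subproblem
# (inexactly) with a generic solver rather than a closed form.
import numpy as np
from scipy.optimize import minimize

def proximal_point(f, x0, eta=1.0, iters=50):
    """Run x_{k+1} = argmin_z f(z) + (1/(2*eta)) * ||z - x_k||^2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        sub = lambda z, xk=x: f(z) + np.sum((z - xk) ** 2) / (2 * eta)
        x = minimize(sub, x).x  # inner solve of the prox subproblem
    return x

# Toy usage: a smooth but poorly conditioned quadratic.
f = lambda z: 0.5 * (100 * z[0] ** 2 + z[1] ** 2)
print(proximal_point(f, np.array([1.0, 1.0])))
```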

Interpretations:

There are two interpretations of the algorithm that I want to cover to hopefully help with the intuition behind it if it didn’t make sense originally. The more intuitive of the two relates PPMs with the Moreau Envelope $M_{\eta f}(\cdot)$, which provides a way to create a smoothed version of any lower semi-continuous convex function. It can be defined below with some parameter $\eta>0$.

$$M_{\eta f}(v)=\inf_{x}\left(f(x)+\frac{1}{2\eta}\|x-v\|^2_2\right)$$

The Moreau Envelope keeps the original function’s minimal value and has the same minimizers, so it’s extremely helpful for non-smooth optimization. The main benefit of the envelope comes from the fact that $M_{\eta f}$ is always differentiable even if $f$ isn’t.

$$\nabla M_{\eta f}(x_k)=\frac{1}{\eta}\left(x_k-\text{prox}_{\eta f}(x_k)\right)$$

By isolating the proximal operator, one can see that a PPM is simply performing gradient descent on the Moreau Envelope of the function.

$$x_{k+1}=x_k-\eta\nabla M_{\eta f}(x_k)$$
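As a quick sanity check of this gradient formula, here is a small sketch (assuming the one-dimensional case $f(x)=|x|$, whose prox is soft-thresholding) that compares a finite-difference gradient of the envelope against $\frac{1}{\eta}(x-\text{prox}_{\eta f}(x))$. The variable names are ad hoc.

```python
# Numerically verify grad M_{eta f}(x) = (x - prox_{eta f}(x)) / eta for f = |.|.
import numpy as np

eta = 0.5
prox = lambda v: np.sign(v) * max(abs(v) - eta, 0.0)                # prox of eta*|x|
envelope = lambda v: abs(prox(v)) + (prox(v) - v) ** 2 / (2 * eta)  # M_{eta f}(v)

x, h = 1.3, 1e-6
finite_diff = (envelope(x + h) - envelope(x - h)) / (2 * h)
formula = (x - prox(x)) / eta
print(finite_diff, formula)  # both should be ~1.0 since |x| > eta here
```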

Another equally important interpretation is that PPMs are performing implicit subgradient updates, so updating based on the subgradient $g_{k+1}\in\partial f(x_{k+1})$ of the next iterate.

$$x_{k+1}=x_k-\rho g_{k+1}$$

This can be intuitively understood since the PPM iteration looks for the exact properties of the next point rather than solely relying on the current. It’s derived from the optimality condition of the proximal operator $0\in\partial f(x_{k+1})+\frac{1}{\rho}(x_{k+1}-x_k)$ after some basic manipulation.

$$\begin{gather*} \frac{x_k-x_{k+1}}{\rho}\in\partial f(x_{k+1})\\ g_{k+1}=\frac{x_k-x_{k+1}}{\rho}\\ x_{k+1}=x_k-\rho g_{k+1} \end{gather*}$$

These both go to show that PPM has a much more solid foundation for convergence than a simple subgradient descent would, which is why it is so commonly used.

Proximal Gradient Method:

Since we assume that $f(x)$ is a hard function to optimize over, to simplify the proximal calculations we can consider using a quadratic approximation of $f(x)$ constructed at each iterate $x_k$, where $L$ defines the curvature of the approximation.

$$Q_L(x,x_k)=f(x_k)+\left<x-x_k,\nabla f(x_k)\right>+\frac{L}{2}\|x-x_k\|^2+g(x)$$

Through minimizing this surrogate we can derive Proximal Gradient. Since $f(x_k)$ is a constant with respect to $x$ it does not affect the location of the minimum, so we can remove it from the subproblem.

$$\text{arg}\min_x\left\{\left<x-x_k,\nabla f(x_k)\right>+\frac{L}{2}\|x-x_k\|^2+g(x)\right\}$$

Using the same constant logic as before, we can add the term $\frac{1}{2L}\|\nabla f(x_k)\|^2$ to the subproblem without changing the optimal $x$. We can factor out $\frac{L}{2}$ from all of the terms excluding $g(x)$ to simplify, which shows that the term we added lets us collect the $f$-related terms into a single squared term.

$$\begin{gather*} \text{arg}\min_x\left\{\frac{L}{2}\left(\|x-x_k\|^2+2\left<x-x_k,\frac{1}{L}\nabla f(x_k)\right>+\left\|\frac{1}{L}\nabla f(x_k)\right\|^2\right)+g(x)\right\}\\ \text{arg}\min_x\left\{\frac{L}{2}\left\|(x-x_k)+\frac{1}{L}\nabla f(x_k)\right\|^2+g(x)\right\} \end{gather*}$$

This final simplified form gives us the proximal gradient update rule.

$$x_{k+1}=\text{arg}\min_x\left\{\frac{L}{2}\left\|x-\left(x_k-\frac{1}{L}\nabla f(x_k)\right)\right\|^2+g(x)\right\}$$

Since this same quadratic approximation is what gradient descent does under the hood, this simplifies down to using the proximal operator on a standard gradient descent step of $f$. This leads to an algorithm that can optimize over $f$ much quicker than the previous subgradient descent while still being feasible at each iteration.

$$\begin{gather*} z_k=x_k-\frac{1}{L}\nabla f(x_k)\\ x_{k+1}=\text{prox}_{\frac{1}{L}g}(z_k) \end{gather*}$$

Since $g$ is assumed to have a simple structure, there is more often than not a simple closed-form solution to the proximal operator step. Consider the simple case of $g(x)$ being an indicator function for some convex set $C$.

$$g(x)=\mathbb{1}_{C}(x)=\begin{cases}0&\text{if }x\in C\\\infty&\text{if }x\notin C\end{cases}$$

This definition of $g$ makes the proximal operator become a projection onto $C$, causing the algorithm to turn into projected gradient descent.

$$\text{prox}_{\frac{1}{L}\mathbb{1}_C}(z)=\Pi_C(z)=\text{arg}\min_{x\in C}\|x-z\|^2$$
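As a small illustration of this special case, here is a projected gradient sketch assuming $C$ is a box $[\text{lo},\text{hi}]^n$, so the projection has a closed form (a clip); the objective and names are placeholders.

```python
# Projected gradient descent: gradient step on f, then project onto the box C.
import numpy as np

def projected_gradient(grad_f, x0, L, lo, hi, iters=200):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        z = x - grad_f(x) / L       # gradient step on f with stepsize 1/L
        x = np.clip(z, lo, hi)      # prox of the indicator = projection onto C
    return x

# Toy usage: minimize 0.5*||x - c||^2 over the box [0, 1]^2 with c outside the box.
c = np.array([2.0, -0.5])
print(projected_gradient(lambda x: x - c, np.zeros(2), L=1.0, lo=0.0, hi=1.0))
# expected result: the projection of c onto the box, i.e. [1. 0.]
```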

We can also return back to our original example of $g(x)=\|x\|_1$, which once simplified has a proximal operator with a very simple closed form.

$$\begin{gather*} g(x)=\|x\|_1=\sum|x_i|\\ \text{prox}_{\frac{1}{L}\|\cdot\|_1}(z)_i=\text{sgn}(z_i)\max\left(|z_i|-\frac{1}{L},0\right) \end{gather*}$$

This specific form of the algorithm is called the Iterative Soft-Thresholding Algorithm, shortened to ISTA, and is going to be of slight importance later for naming conventions.
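Here is a minimal ISTA sketch for a lasso-type problem, assuming $f(x)=\frac{1}{2}\|Ax-b\|^2$ and $g(x)=\lambda\|x\|_1$ (the extra weight $\lambda$ just scales the threshold); the data and iteration count are placeholders.

```python
# ISTA: gradient step on the smooth least-squares term, then soft-thresholding.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, b, lam, iters=500):
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - A.T @ (A @ x - b) / L      # gradient step on f
        x = soft_threshold(z, lam / L)     # prox of (1/L)*lam*||.||_1
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
print(ista(A, b, lam=1.0))
```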

Convergence:

If $f$ is an $L$-smooth convex function, $g$ is a lower-semicontinuous convex function, and a stepsize $\frac{1}{L}$ is used, then we know that the method has a convergence rate of $O(1/\epsilon)$ to reach an error $\epsilon$, which is detailed below.

$$F(x_k)-F(x^*)\leq\frac{L\|x_0-x^*\|^2}{2k}$$

In order to prove this rate, we first need to prove that the iterates satisfy $F(x_{k+1})\leq F(x_k)$, which is a property that is guaranteed when $f$ is $L$-smooth and a stepsize $\frac{1}{L}$ is used. The proof is a simple one that relies on the properties of the surrogate $Q_L(x,x_k)$. By definition we already know that $Q_L(x_k,x_k)=F(x_k)$, and since $x_{k+1}$ is derived through a minimization we have $Q_L(x_{k+1},x_k)\leq Q_L(x_k,x_k)$. Since $f$ is $L$-smooth, we know that the descent lemma holds, which if $g(x_{k+1})$ is added to both sides proves $F(x_{k+1})\leq Q_L(x_{k+1},x_k)$. Putting these together we can arrive at the desired result.

$$F(x_{k+1})\leq Q_L(x_{k+1},x_k)\leq Q_L(x_k,x_k)=F(x_k)$$

Now we can return to the original convergence theorem. Since $f$ is $L$-smooth, we can apply the descent lemma to the function to provide a bound. Adding $g(x_{k+1})$ to both sides gives us an inequality in terms of $F$ and our surrogate $Q_L$.

$$\begin{gather*} f(x_{k+1})\leq f(x_k)+\left<\nabla f(x_k),x_{k+1}-x_k\right>+\frac{L}{2}\|x_{k+1}-x_k\|^2\\ F(x_{k+1})\leq Q_L(x_{k+1},x_k) \end{gather*}$$

Since $x_{k+1}$ is the result of a minimization subproblem, which we know is a convex problem since $g$ and the quadratic approximation are both convex, we can use the standard first-order optimality conditions to derive some properties of $x_{k+1}$. We can specifically derive the following, which tells us that there is some subgradient of $g$ at $x_{k+1}$ that is equal to $-\nabla f(x_k)-L(x_{k+1}-x_k)$.

$$0\in\nabla f(x_k)+L(x_{k+1}-x_k)+\partial g(x_{k+1})$$

We can then use the convexity of $f$ and $g$ to derive the following bounds for some arbitrary $x$ using the tangent line property, where we replace the subgradient in the inequality with our above defined subgradient.

$$\begin{gather*} g(x)\geq g(x_{k+1})+\left<-\nabla f(x_k)-L(x_{k+1}-x_k),x-x_{k+1}\right>\\ f(x)\geq f(x_k)+\left<\nabla f(x_k),x-x_k\right> \end{gather*}$$

Using these two we can redefine the bound $F(x_{k+1})\leq Q_L(x_{k+1},x_k)$ into one over an arbitrary $x$ and simplify by expanding the inner products.

$$\begin{gather*} F(x_{k+1})\leq F(x)+\left<L(x_k-x_{k+1}),x_{k+1}-x\right>+\frac{L}{2}\|x_{k+1}-x_k\|^2\\ F(x_{k+1})\leq F(x)+\frac{L}{2}(\|x_k-x\|^2-\|x_{k+1}-x\|^2) \end{gather*}$$

We can then set $x=x^*$, which gives a recursive relationship between $x_k$ and $x_{k+1}$, meaning we can also bound the total sum over the $k$ iterations.

$$\begin{gather*} F(x_{k+1})-F(x^*)\leq\frac{L}{2}(\|x_k-x^*\|^2-\|x_{k+1}-x^*\|^2)\\ \sum^{k-1}_{i=0}(F(x_{i+1})-F(x^*))\leq\frac{L}{2}\sum^{k-1}_{i=0}(\|x_i-x^*\|^2-\|x_{i+1}-x^*\|^2) \end{gather*}$$

Since the right-hand side forms a telescoping sum, we can collapse it down to the initial and final terms and simplify.

$$\sum^{k-1}_{i=0}(F(x_{i+1})-F(x^*))\leq\frac{L}{2}(\|x_0-x^*\|^2-\|x_k-x^*\|^2)\leq\frac{L}{2}\|x_0-x^*\|^2$$

Now we can finally use the fact that $F(x_k)$ is non-increasing under this setting, so that we know $F(x_k)-F(x^*)$ is the smallest term in the sum. This means that $k(F(x_k)-F(x^*))$ is at most the entire sum, so we can derive the bound below, which completes the proof.

$$k(F(x_k)-F(x^*))\leq\sum^{k-1}_{i=0}(F(x_{i+1})-F(x^*))\leq\frac{L}{2}\|x_0-x^*\|^2$$

Accelerated Proximal Gradient:

One of the methods of speeding up the convergence of gradient descent with momentum is Nesterov’s Acceleration, where we evaluate the gradient at the point that the momentum term is heading towards, rather than at the current iterate. The same principle can be used to derive Accelerated Proximal Gradient, which is what the paper on the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) does. First we can rewrite the original proximal gradient method by defining $p_L(y)=\text{argmin}\{Q_L(x,y):x\in\mathbb{R}^n\}$, which uses a stepsize $\frac{1}{L}$.

$$\begin{gather*} p_L(y)=\text{argmin}_x\left\{g(x)+\frac{L}{2}\left\|x-\left(y-\frac{1}{L}\nabla f(y)\right)\right\|^2\right\}\\ x_k=p_L(x_{k-1}) \end{gather*}$$

The accelerated proximal gradient method introduces a momentum point $y_k$ which is given its own stepsize sequence $t_k$. For the later convergence analysis, $t_{k+1}$ is chosen so that $t^2_{k+1}-t_{k+1}=t^2_k$ holds. Using this along with our previously defined $p_L(\cdot)$ gives the update rule for FISTA.

$$\begin{gather*} x_k=p_L(y_k)\\ t_{k+1}=\frac{1+\sqrt{1+4t^2_k}}{2}\\ y_{k+1}=x_k+\left(\frac{t_k-1}{t_{k+1}}\right)(x_k-x_{k-1}) \end{gather*}$$

While this may feel unintuitive, the same intuition can be used for this as Nesterov acceleration itself. Rather than solely relying on the current information, using $y_k$ allows the method to look ahead before taking a step, allowing it to correct its trajectory faster. These properties are also shown in the improved convergence guarantees.
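Below is a FISTA sketch for the same lasso setup assumed in the ISTA example above; the only new pieces are the momentum sequence $t_k$ and the look-ahead point $y_k$.

```python
# FISTA: proximal gradient step evaluated at the look-ahead point y, plus momentum.
import numpy as np

def fista(A, b, lam, iters=500):
    L = np.linalg.norm(A, 2) ** 2
    soft = lambda z, s: np.sign(z) * np.maximum(np.abs(z) - s, 0.0)
    x = y = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(iters):
        x_new = soft(y - A.T @ (A @ y - b) / L, lam / L)   # x_k = p_L(y_k)
        t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)        # momentum / look-ahead update
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
print(fista(A, b, lam=1.0))
```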

Convergence:

If $f$ is an $L$-smooth convex function and $g$ is a proper, lower-semicontinuous convex function, then we know that the method has a convergence rate of $O(1/\sqrt{\epsilon})$ to reach an error $\epsilon$.

$$F(x_k)-F(x^*)\leq\frac{2L\|x_0-x^*\|^2}{(k+1)^2}$$

To start the proof, we need to reintroduce three lemmas from the original proximal gradient definition. The first is the standard descent lemma coming from $f$ being $L$-smooth.

$$f(x)\leq f(y)+\left<x-y,\nabla f(y)\right>+\frac{L}{2}\|x-y\|^2$$

The second comes from the optimality conditions of $p_L(y)$ and the fact that it’s a convex problem. We know that $z=p_L(y)$ if and only if there exists $\gamma(y)\in\partial g(z)$ such that the following holds.

$$\nabla f(y)+L(z-y)+\gamma(y)=0$$

The third comes from careful analysis of the convexity of both $f$ and $g$. For any $x$, if we have a point $y$ that satisfies $F(p_L(y))\leq Q_L(p_L(y),y)$, then we have the following.

$$F(x)-F(p_L(y))\geq\frac{L}{2}\|p_L(y)-y\|^2+L\left<y-x,p_L(y)-y\right>$$

To prove the third lemma, we can first look at the inequalities from the definition of convexity for both functions and then combine them to get a bound on $F(x)$.

$$\begin{gather*} f(x)\geq f(y)+\left<x-y,\nabla f(y)\right>\\ g(x)\geq g(p_L(y))+\left<x-p_L(y),\gamma(y)\right>\\ F(x)\geq f(y)+\left<x-y,\nabla f(y)\right>+g(p_L(y))+\left<x-p_L(y),\gamma(y)\right> \end{gather*}$$

From the original lemma requirement that $F(p_L(y))\leq Q_L(p_L(y),y)$ we can simplify.

$$F(x)-F(p_L(y))\geq-\frac{L}{2}\|p_L(y)-y\|^2+\left<x-p_L(y),\nabla f(y)+\gamma(y)\right>$$

From the second lemma we know that $\nabla f(y)+\gamma(y)=L(y-p_L(y))$, which we use to simplify the rightmost product.

$$F(x)-F(p_L(y))\geq-\frac{L}{2}\|p_L(y)-y\|^2+L\left<x-p_L(y),y-p_L(y)\right>$$

After turning the product into $L\left<x-p_L(y)+y-y,y-p_L(y)\right>$ by adding $y-y$, we can simplify into our original statement, proving the lemma.

$$F(x)-F(p_L(y))\geq\frac{L}{2}\|p_L(y)-y\|^2+L\left<y-x,p_L(y)-y\right>$$

Now that we have the prep work laid out, we can start on the convergence proof itself. The proof follows by first establishing a recursive relationship between steps and then using it to bound the suboptimality. We start by applying the third lemma at $x=x_k$ and $x=x^*$ to derive two bounds, where we use $y=y_{k+1}$ and $L=L_{k+1}$ for both.

$$\begin{gather*} \frac{2}{L_{k+1}}(F(x_k)-F(x_{k+1}))\geq\|x_{k+1}-y_{k+1}\|^2+2\left<x_{k+1}-y_{k+1},y_{k+1}-x_k\right>\\ \frac{2}{L_{k+1}}(F(x^*)-F(x_{k+1}))\geq\|x_{k+1}-y_{k+1}\|^2+2\left<x_{k+1}-y_{k+1},y_{k+1}-x^*\right> \end{gather*}$$

To simplify the notation, we can define $v_k=F(x_k)-F(x^*)$ and use the fact that $F(x_k)-F(x_{k+1})=(F(x_k)-F(x^*))-(F(x_{k+1})-F(x^*))$ to make both bounds be over the same terms.

$$\begin{gather*} \frac{2}{L_{k+1}}(v_k-v_{k+1})\geq\|x_{k+1}-y_{k+1}\|^2+2\left<x_{k+1}-y_{k+1},y_{k+1}-x_k\right>\\ -\frac{2}{L_{k+1}}(v_{k+1})\geq\|x_{k+1}-y_{k+1}\|^2+2\left<x_{k+1}-y_{k+1},y_{k+1}-x^*\right> \end{gather*}$$

Then we multiply the first by $t_{k+1}-1$ before adding the inequalities together so that we can derive a relationship weighted by $t_{k+1}$.

$$\frac{2}{L_{k+1}}((t_{k+1}-1)v_k-t_{k+1}v_{k+1})\geq t_{k+1}\|x_{k+1}-y_{k+1}\|^2+2\left<x_{k+1}-y_{k+1},t_{k+1}y_{k+1}-(t_{k+1}-1)x_k-x^*\right>$$

To simplify, we first multiply through by $t_{k+1}$ and use the fact that we defined $t_k$ such that $t^2_k=t^2_{k+1}-t_{k+1}$, and then use the identity $\|b-a\|^2+2\left<b-a,a-c\right>=\|b-c\|^2-\|a-c\|^2$.

$$\begin{gather*} \frac{2}{L_{k+1}}(t^2_kv_k-t^2_{k+1}v_{k+1})\geq\|t_{k+1}(x_{k+1}-y_{k+1})\|^2+2t_{k+1}\left<x_{k+1}-y_{k+1},t_{k+1}y_{k+1}-(t_{k+1}-1)x_k-x^*\right>\\ \frac{2}{L_{k+1}}(t^2_kv_k-t^2_{k+1}v_{k+1})\geq\|t_{k+1}x_{k+1}-(t_{k+1}-1)x_k-x^*\|^2-\|t_{k+1}y_{k+1}-(t_{k+1}-1)x_k-x^*\|^2 \end{gather*}$$

To make this a meaningful bound, we need to make the second norm on the right-hand side match the form of the first. From the defined update for $y_{k+1}$ we know that $t_{k+1}y_{k+1}=t_{k+1}x_k+(t_k-1)(x_k-x_{k-1})$, so we can simplify the second norm into the following.

$$\|[t_{k+1}x_k+(t_k-1)(x_k-x_{k-1})]-t_{k+1}x_k+x_k-x^*\|^2=\|t_kx_k-(t_k-1)x_{k-1}-x^*\|^2$$

Now we can define $u_k=t_kx_k-(t_k-1)x_{k-1}-x^*$ to finally get the bound into a more workable form.

$$\frac{2}{L_{k+1}}(t^2_kv_k-t^2_{k+1}v_{k+1})\geq\|u_{k+1}\|^2-\|u_k\|^2$$

Using the fact that we know that $L_{k+1}\geq L_k$ (either from a static step size or from the specific backtracking line search method defined in the paper), we can rewrite the left-hand side with two terms, one dealing only with iteration $k$ and the other dealing only with iteration $k+1$. This acts as our main recursive relationship for the rest of the proof.

$$\frac{2}{L_k}t^2_kv_k-\frac{2}{L_{k+1}}t^2_{k+1}v_{k+1}\geq\|u_{k+1}\|^2-\|u_k\|^2$$

To simplify the notation of the analysis in the next steps, we define $a_k=\frac{2}{L_k}t^2_kv_k$ and $b_k=\|u_k\|^2$ to rewrite the relationship as follows.

$$a_k-a_{k+1}\geq b_{k+1}-b_k$$

Since we know that $a_k,b_k\geq 0$ and that the sequence $a_k+b_k$ is monotonically non-increasing (from the fact that we can rearrange and see that $a_k+b_k\geq a_{k+1}+b_{k+1}$), we know that if $a_1+b_1\leq c$ for some constant $c$, then for all $k$ we have $a_k\leq c$. If we define $c=\|x_0-x^*\|^2$ and use the fact that we initialize $t_1=1$ and $y_1=x_0$, we get an inequality that would bound the entire sequence $a_k$ if proven correct.

$$\frac{2}{L_1}v_1+\|x_1-x^*\|^2\leq\|y_1-x^*\|^2$$

To start the proof of this inequality, we apply the third lemma with $x=x^*$, $y=y_1$, and $L=L_1$, and then swap in $p_{L_1}(y_1)=x_1$ from the method’s definition.

$$\begin{gather*} F(x^*)-F(p_{L_1}(y_1))\geq\frac{L_1}{2}\|p_{L_1}(y_1)-y_1\|^2+L_1\left<y_1-x^*,p_{L_1}(y_1)-y_1\right>\\ F(x^*)-F(x_1)\geq\frac{L_1}{2}\|x_1-y_1\|^2+L_1\left<y_1-x^*,x_1-y_1\right> \end{gather*}$$

Using the same norm identity that we used for the recursive relation proof lets us simplify in the same way, proving that the base case we needed holds.

$$\begin{gather*} F(x^*)-F(x_1)\geq\frac{L_1}{2}(\|x_1-x^*\|^2-\|y_1-x^*\|^2)\\ -v_1\geq \frac{L_1}{2}(\|x_1-x^*\|^2-\|y_1-x^*\|^2)\\ \frac{2}{L_1}v_1+\|x_1-x^*\|^2\leq\|y_1-x^*\|^2 \end{gather*}$$

From this, we know that we have $a_k\leq c$ for all $k$, which after isolating $v_k$ and using the fact that we define $t_k$ in a way that makes $t_k\geq\frac{k+1}{2}$ hold can be used to simplify the denominator, proving the convergence theorem.

$$\begin{gather*} \frac{2}{L_k}t^2_kv_k\leq\|x_0-x^*\|^2\\ v_k\leq\frac{L_k\|x_0-x^*\|^2}{2t^2_k}\\ v_k\leq\frac{2L_k\|x_0-x^*\|^2}{(k+1)^2} \end{gather*}$$

Proximal Newton Method:

In the same way that we can apply the ideas behind gradient descent and Nesterov acceleration to PPMs, we can also apply Newton’s Method. The same general principle of using a quadratic approximation of $f$ is used in Proximal Newton, but the Hessian of $f$ is used to improve the approximation.

$$\begin{gather*} \hat{f}_k(x)=f(x_k)+\nabla f(x_k)^T(x-x_k)+\frac{1}{2}(x-x_k)^T\nabla^2f(x_k)(x-x_k)\\ \min_x\left(\hat{f}_k(x)+g(x)\right) \end{gather*}$$

We can then simplify this subproblem into the iteration rule for the method, where we define $H_k=\nabla^2f(x_k)$ and $v=x_k-H_k^{-1}\nabla f(x_k)$ for notational simplicity.

$$\begin{gather*} x_{k+1}=\text{prox}^{H_k}_g(x_k-H^{-1}_k\nabla f(x_k))\\ x_{k+1}=\text{arg}\min_x\left(g(x)+\frac{1}{2}\|x-v\|^2_{H_k}\right) \end{gather*}$$

To derive this update rule, we can first use the variable $\Delta x=x-x_k$ to make the subproblem simpler to work with.

$$\min_{\Delta x}\left\{\nabla f(x_k)^T\Delta x+\frac{1}{2}\Delta x^TH_k\Delta x+g(x_k+\Delta x)\right\}$$

The first two terms of the minimization are the first two terms of an expanded square missing the constant third term.

$$\frac{1}{2}\|\Delta x+H^{-1}_k\nabla f(x_k)\|^2_{H_k}=\frac{1}{2}\Delta x^TH_k\Delta x+\nabla f(x_k)^T\Delta x+\text{constant}$$

This means we can simplify the iteration to reveal the scaled proximal operator used in the algorithm.

$$x_{k+1}=\text{arg}\min_x\left\{\frac{1}{2}\|x-(x_k-H^{-1}_k\nabla f(x_k))\|^2_{H_k}+g(x)\right\}$$
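Unlike the plain proximal gradient case, this scaled proximal operator usually has no closed form, so it is solved approximately. Below is a hedged sketch of a single (undamped) step, assuming $g(x)=\lambda\|x\|_1$ and solving the scaled subproblem with a few inner proximal gradient iterations on the quadratic model; the inner-solver choice and all names are just illustrative.

```python
# One proximal Newton step: solve the scaled prox subproblem with inner ISTA iterations.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_newton_step(grad, H, x, lam, inner_iters=100):
    """Approximately solve argmin_y grad^T(y-x) + 0.5*(y-x)^T H (y-x) + lam*||y||_1."""
    Lq = np.linalg.norm(H, 2)             # Lipschitz constant of the model's gradient
    y = x.copy()
    for _ in range(inner_iters):
        model_grad = grad + H @ (y - x)   # gradient of the quadratic model at y
        y = soft_threshold(y - model_grad / Lq, lam / Lq)
    return y

# Toy usage with a hand-picked Hessian and gradient.
H = np.array([[2.0, 0.3], [0.3, 1.0]])
grad = np.array([1.0, -2.0])
print(prox_newton_step(grad, H, x=np.zeros(2), lam=0.1))
```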

As will be covered later, Proximal Newton has the best convergence we have seen yet once within a certain range of the optimum, which is the range where we know that the quadratic approximation is highly accurate to the actual landscape of the function. The biggest issue with the algorithm is that it still suffers from the same instability that Newton’s Method itself suffers from. When far from the optimum, even in strongly convex settings, the function does not need to look anything like a quadratic, so although the quadratic approximation is highly accurate locally, it often makes incorrect assumptions about the surrounding landscape.

The aggressiveness of the approximation makes the general direction of each iteration quite accurate, but the scale of each step is often inaccurate, which is the main problem when it comes to the algorithm diverging. This is why the use of a backtracking line search stepsize $t_k$ is so common, and it will be required by the following theorems.

$$x_{k+1}=x_k+t_k\left(\text{prox}^{H_k}_g(x_k-H^{-1}_k\nabla f(x_k))-x_k\right)$$

As a quick recap, a backtracking line search is built on the idea of satisfying the Armijo Condition, an inequality that ensures that the actual decrease from a step $td_k$ along the search direction $d_k$ is at least $\alpha\in(0,0.5)$ times the predicted decrease $\Delta_k$ (both are defined precisely in the proof below).

$$F(x_k+td_k)\leq F(x_k)+\alpha t\Delta_k$$

To ensure that each step satisfies this, we start with a full step $t_k=1$. If the Armijo condition holds, then we simply take the step. If not, we shrink the stepsize $t_k\leftarrow \beta t_k$ by some $\beta\in(0,1)$ and repeat until it does satisfy the inequality.
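A minimal backtracking sketch for the composite objective $F=f+g$, assuming the direction $d$ and the decrease measure $\Delta_k$ (here `delta`) are computed elsewhere; the parameter defaults are arbitrary.

```python
# Backtracking line search: shrink t until the Armijo condition holds.
def backtracking(F, x, d, delta, alpha=0.25, beta=0.5, max_shrinks=30):
    """Return a stepsize t with F(x + t*d) <= F(x) + alpha * t * delta (delta < 0)."""
    t, Fx = 1.0, F(x)
    for _ in range(max_shrinks):
        if F(x + t * d) <= Fx + alpha * t * delta:
            break
        t *= beta
    return t
```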

Global Convergence:

If $f$ is $m$-strongly convex and $L$-smooth, $g$ is convex, and $t_k$ is chosen using a backtracking line search, then we know that the method has a convergence rate of $O(\log(1/\epsilon))$. This can be detailed below with some constant $\gamma\in(0,1)$ which is proportional to $\frac{m^2}{L^2}$ and penalized by the search parameters for the algorithm behind $t_k$.

$$F(x_{k+1})-F(x^*)\leq(1-\gamma)(F(x_k)-F(x^*))$$

To make the relationship between Proximal Newton and the previous methods clearer, we can look at the convergence rates of proximal gradient and accelerated proximal gradient under the same conditions as this theorem in terms of the condition number $\kappa=\frac{L}{m}$, which conveys how stretched the function is, so functions with a higher $\kappa$ are harder to optimize. Since strong convexity is added to $f$, proximal gradient’s convergence rate becomes $O(\kappa\log(1/\epsilon))$ and the acceleration gets it to $O(\sqrt{\kappa}\log(1/\epsilon))$. Proximal Newton’s convergence rate doesn’t rely on the condition number since it uses the Hessian of the function, removing any guesswork that the other two make about the curvature of the function.

The proof follows by first proving that the direction of the iteration is a strict descent direction, and then using the line search to derive a recursive relationship between the function values of iterates.

We can first define some properties of the direction $d_k=\hat{x}_{k+1}-x_k$ by using the optimality conditions of the subproblem since $f$ and $g$ are convex. We define $v\in\partial g(x_k+d_k)$ and $H_k=\nabla^2 f(x_k)$ for simplicity.

$$\begin{gather*} 0\in\nabla f(x_k)+H_k(\hat{x}_{k+1}-x_k)+\partial g(\hat{x}_{k+1})\\ \nabla f(x_k)+H_kd_k+v=0 \end{gather*}$$

We can prove that $d_k$ is a descent direction by looking at the directional derivative of the problem towards $d_k$, where we need to use the change in $g$ rather than its derivative due to its non-differentiability.

$$\Delta_k=\nabla f(x_k)^Td_k+g(x_k+d_k)-g(x_k)$$

Since $g$ is convex, we know that $g(x_k)\geq g(x_k+d_k)-v^Td_k$, so along with the properties from optimality above we can simplify.

$$\Delta_k\leq\nabla f(x_k)^Td_k+v^Td_k=(\nabla f(x_k)+v)^Td_k=-d^T_kH_kd_k$$

Since $f$ is $m$-strongly convex, we know that $H_k\succeq mI$ with $m>0$. This means we can bound the right-hand side, proving that $d_k$ is a strict descent direction as $\Delta_k=0$ only if $\|d_k\|^2=0$.

$$\Delta_k\leq -m\|d_k\|^2\leq 0$$

Next, we can use the backtracking line search to show that the algorithm leads to a sufficient decrease in $F(x)$. Since we know that the Armijo Condition is satisfied, we know that the following applies.

$$F(x_k+t_kd_k)\leq F(x_k)+\alpha t_k\Delta_k$$

To make this descent meaningful, we need to lower bound $t_k$ to make sure that there is progress being made, which we do through manipulation of the function properties. Since $f$ is $L$-smooth, we can use the descent lemma to get the first bound, and since $g$ is convex and $t\in(0,1]$, we can derive the second bound.

$$\begin{gather*} f(x_k+td_k)\leq f(x_k)+t\nabla f(x_k)^Td_k+\frac{L}{2}t^2\|d_k\|^2\\ g(x_k+td_k)=g((1-t)x_k+t(x_k+d_k))\leq(1-t)g(x_k)+tg(x_k+d_k)\\ g(x_k+td_k)\leq g(x_k)+t(g(x_k+d_k)-g(x_k)) \end{gather*}$$

Combining these two lets us derive a bound on the entire objective.

$$F(x_k+td_k)\leq f(x_k)+t\nabla f(x_k)^Td_k+\frac{L}{2}t^2\|d_k\|^2+g(x_k)+t(g(x_k+d_k)-g(x_k))$$

Grouping the first-order terms into the previously defined $\Delta_k$, along with the fact that we know $\|d_k\|^2\leq-\frac{1}{m}\Delta_k$ from previous steps, lets us simplify.

$$\begin{gather*} F(x_k+td_k)\leq F(x_k)+t\Delta_k+\frac{L}{2}t^2\|d_k\|^2\\ F(x_k+td_k)\leq F(x_k)+t\Delta_k+\frac{L}{2}t^2\left(-\frac{1}{m}\Delta_k\right)\\ F(x_k+td_k)\leq F(x_k)+t\Delta_k\left(1-\frac{Lt}{2m}\right) \end{gather*}$$

This gives us an upper-bound parabola with terms we can work with. To show when the Armijo Condition is satisfied, we can require that this parabola lies below the right-hand side of the Armijo inequality.

$$\begin{gather*} F(x_k)+t\Delta_k\left(1-\frac{Lt}{2m}\right)\leq F(x_k)+\alpha t\Delta_k\\ t\Delta_k\left(1-\frac{Lt}{2m}\right)\leq\alpha t\Delta_k\\ 1-\frac{Lt}{2m}\geq\alpha\\ t\leq \frac{2m(1-\alpha)}{L} \end{gather*}$$

Since $\alpha\in(0,0.5)$, we know that any $t\leq \frac{m}{L}$ will satisfy the Armijo Condition. This leads to either $t_k=1$ being accepted, or a lower bound of $\beta\frac{m}{L}$ if we do need to scale by $\beta$, so $t_k\geq\min\left(1,\beta\frac{m}{L}\right)$. Since we know that $m\leq L$ by definition of the properties and $0<\beta<1$, this simplifies to the following.

$$t_k\geq\beta\frac{m}{L}$$

Now that we’ve done the proper setup, we can start working on making the final bound on suboptimality. To start we use the fact that $f(x)$ is strongly convex and that $g(x)$ is convex, where again $v\in\partial g(x_k+d_k)$.

$$\begin{gather*} f(x^*)\geq f(x_k)+\nabla f(x_k)^T(x^*-x_k)+\frac{m}{2}\|x^*-x_k\|^2\\ g(x^*)\geq g(x_k+d_k)+v^T(x^*-(x_k+d_k)) \end{gather*}$$

We can then combine these bounds to derive a bound on the global minimum $F(x^*)$. We can add and subtract $\nabla f(x_k)^Td_k+g(x_k)$ on the right-hand side to simplify, in turn introducing our previous $\Delta_k$.

$$\begin{gather*} F(x^*)\geq f(x_k)+\nabla f(x_k)^T(x^*-x_k)+\frac{m}{2}\|x^*-x_k\|^2+g(x_k+d_k)+v^T(x^*-x_k-d_k)\\ F(x^*)\geq F(x_k)+\Delta_k+(\nabla f(x_k)+v)^T(x^*-x_k-d_k)+\frac{m}{2}\|x^*-x_k\|^2 \end{gather*}$$

We can then substitute in $\nabla f(x_k)+v=-H_kd_k$ from our optimality condition analysis and simplify further.

$$F(x^*)\geq F(x_k)+\Delta_k-d^T_kH_k(x^*-x_k)+d^T_kH_kd_k+\frac{m}{2}\|x^*-x_k\|^2$$

Using the Peter-Paul inequality, which states that for any two vectors $a$ and $b$ and any constant $\gamma>0$ we know $a^Tb\leq\frac{1}{2\gamma}\|a\|^2+\frac{\gamma}{2}\|b\|^2$, we can simplify again. Specifically we use the inequality with $a=H_kd_k$, $b=x^*-x_k$, and $\gamma=m$ to remove the $\|x^*-x_k\|^2$ term.

$$F(x^*)\geq F(x_k)+\Delta_k-\frac{1}{2m}\|H_kd_k\|^2+d^T_kH_kd_k$$

Since $f$ is $L$-smooth, we know that $H_k\preceq LI$, so $\|H_kd_k\|^2\leq L(d^T_kH_kd_k)$, which we can use along with $d^T_kH_kd_k\leq-\Delta_k$ to get the following.

$$\begin{gather*} F(x_k)-F(x^*)\leq -\Delta_k+\frac{L}{2m}(d^T_kH_kd_k)-d^T_kH_kd_k\\ F(x_k)-F(x^*)\leq-\frac{L}{2m}\Delta_k \end{gather*}$$

Finally, we can rearrange to isolate $\Delta_k$ to derive a bound on it, which we can substitute into the Armijo Condition to derive our final inequality. We can define $\gamma=\alpha\beta\frac{2m^2}{L^2}$, which satisfies $0<\gamma<1$, proving the theorem.

$$F(x_{k+1})\leq F(x_k)-\alpha\left(\beta\frac{m}{L}\right)\left[\frac{2m}{L}(F(x_k)-F(x^*))\right]$$

Local Convergence:

The real strength of the method can be seen once the iterate gets close enough to the optimal $x^*$, when the quadratic approximation becomes highly accurate. If $f$ is $m$-strongly convex, $L$-smooth, and has an $M$-Lipschitz Hessian, $g$ is convex, and $\|x_k-x^*\|<\epsilon$ where $\epsilon\approx\frac{m}{M}$, then we know the method has a convergence rate of $O(\log\log(1/\epsilon))$, which is detailed below.

$$\|x_{k+1}-x^*\|\leq\frac{M}{2m}\|x_k-x^*\|^2$$

Once the iterate gets close enough to $x^*$, we also know that the backtracking line search will always return $t_k=1$, leading to the steps not being damped.
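To get a rough feel for what this quadratic rate means, suppose purely for illustration that $\frac{M}{2m}=1$ and $\|x_0-x^*\|=10^{-1}$; then the bound gives

$$\|x_1-x^*\|\leq 10^{-2},\quad\|x_2-x^*\|\leq 10^{-4},\quad\|x_3-x^*\|\leq 10^{-8},$$

so the number of correct digits roughly doubles each iteration, which is where the $O(\log\log(1/\epsilon))$ iteration count comes from.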

The proof follows by first proving that the error between the actual function and the approximation is bounded, and using it to prove the recursive relationship we want.

We start by finding two subgradients $v^*\in\partial g(x^*)$ and $v_{k+1}\in\partial g(x_{k+1})$ from the optimality conditions of the problem as a whole and of the subproblem for each iteration.

$$\begin{gather*} -v^*=\nabla f(x^*)\\ -v_{k+1}=\nabla f(x_k)+\nabla^2f(x_k)(x_{k+1}-x_k) \end{gather*}$$

We can use these two to define an inequality using the monotonicity of the subdifferential of the convex $g$, which we can combine with the strong convexity of $f$ to define an inequality with all of the terms we need.

$$\begin{gather*} (v_{k+1}-v^*)^T(x_{k+1}-x^*)\geq 0\\ (\nabla f(x_{k+1})-\nabla f(x^*))^T(x_{k+1}-x^*)\geq m\|x_{k+1}-x^*\|^2\\ (\nabla f(x_{k+1})+v_{k+1}-(\nabla f(x^*)+v^*))^T(x_{k+1}-x^*)\geq m\|x_{k+1}-x^*\|^2 \end{gather*}$$

This can be simplified using $\nabla f(x^*)+v^*=0$, and then by applying Cauchy-Schwarz and dividing by $m\|x_{k+1}-x^*\|$.

$$\begin{gather*} (\nabla f(x_{k+1})+v_{k+1})^T(x_{k+1}-x^*)\geq m\|x_{k+1}-x^*\|^2\\ \|x_{k+1}-x^*\|\leq\frac{1}{m}\|\nabla f(x_{k+1})+v_{k+1}\| \end{gather*}$$

The term $\nabla f(x_{k+1})+v_{k+1}$ represents the error between the expected gradient at the next iterate and the real gradient, which we can bound using the $M$-Lipschitz continuity of the Hessian. This is where the use of the Hessian becomes important, as it makes this error much smaller than in typical gradient descent methods.

$$\nabla f(x_{k+1})+v_{k+1}=\nabla f(x_{k+1})-[\nabla f(x_k)+\nabla^2f(x_k)(x_{k+1}-x_k)]$$

To write the difference $\nabla f(x_{k+1})-\nabla f(x_k)$ in terms that can be combined with the Hessian term, we can use the Fundamental Theorem of Calculus to express it as a path integral over $p(\tau)=x_k+\tau(x_{k+1}-x_k)$, since we know that $\frac{dp}{d\tau}=x_{k+1}-x_k$.

$$\begin{gather*} \nabla f(x_{k+1})-\nabla f(x_k)=\int^1_0\frac{d}{d\tau}\nabla f(p(\tau))d\tau\\ \nabla f(x_{k+1})-\nabla f(x_k)=\int^1_0\nabla^2f(x_k+\tau(x_{k+1}-x_k))(x_{k+1}-x_k)d\tau \end{gather*}$$

Substituting this into the original statement allows us to simplify.

$$\nabla f(x_{k+1})+v_{k+1}=\int^1_0[\nabla^2f(x_k+\tau(x_{k+1}-x_k))-\nabla^2f(x_k)](x_{k+1}-x_k)d\tau$$

Since the Hessian is $M$-Lipschitz, we can take the norm of both sides and bound the term inside the brackets. This leads to an integral whose integrand is linear in $\tau$, so we can simplify using $C\int^1_0\tau d\tau=\frac{C}{2}$.

$$\begin{gather*} \|\nabla f(x_{k+1})+v_{k+1}\|\leq\int^1_0M\tau\|x_{k+1}-x_k\|^2d\tau\\ \|\nabla f(x_{k+1})+v_{k+1}\|\leq\frac{M}{2}\|x_{k+1}-x_k\|^2 \end{gather*}$$

To finish the proof, we substitute this into our original bound to derive the desired result.

$$\|x_{k+1}-x^*\|\leq\frac{M}{2m}\|x_{k+1}-x_k\|^2$$

Regularized Proximal Quasi-Newton:

One of the main problems with the default Proximal Newton setup is that in modern applications, the added function evaluations required by a backtracking line search are extremely computationally expensive. This is especially annoying for Newton-type methods, since the line search is only needed because the local quadratic approximation can be overly confident and produce a bad step. Regularized Proximal Quasi-Newton aims to fix this by updating and re-solving the proximal subproblem itself, rather than trying to work with the first direction that was obtained.

To follow the notation of the paper, we redefine the composite optimization problem as one over a continuously differentiable function $f$ and a convex function $\varphi$. This means we redefine the original Proximal Newton step as a minimization subproblem over an approximation $q_k$ and then perform $x^{k+1}=x^k+d^k$.

$$\begin{gather*} \min_x\psi(x)=f(x)+\varphi(x)\\ \min_dq_k(d)\quad\text{with}\quad q_k(d)=f(x^k)+\nabla f(x^k)^Td+\frac{1}{2}d^TB_kd+\varphi(x^k+d) \end{gather*}$$

Instead of using this approximation directly, the Regularized Proximal Quasi-Newton method uses a regularized approximation $\hat{q}_k$ with the regularization controlled by a parameter $\mu_k>0$.

$$\hat{q}_k(d)=q_k(d)+\frac{1}{2}\mu_k\|d\|^2$$

To help define optimality for the problem, the paper uses the residual of a proximal gradient step at the current iterate, $r(x^k)$, as a gauge, since $\|d^k\|=0$ is a sufficient but not necessary property for stationarity if $B_k+\mu_k I$ isn’t positive definite (which is not required for anything here).

$$r(x)=\text{prox}_\varphi(x-\nabla f(x))-x$$

As well, in order to help quantify the quality of the step $d^k$, the paper also introduces two terms for each iteration: the predicted reduction $\text{pred}_k$ (which uses the nonregularized model as a better gauge of how much confidence we should have in the step) and the actual reduction $\text{ared}_k$, where the ratio $\rho_k=\text{ared}_k/\text{pred}_k$ gives a good idea of how accurate the approximation was.

$$\begin{gather*} \text{pred}_k=\psi(x^k)-q_k(d^k)\\ \text{ared}_k=\psi(x^k)-\psi(x^k+d^k) \end{gather*}$$

The method starts by initializing parameters $x^0\in\mathbb{R}^n$, $\mu_0>0$, $p_\text{min}\in(0,\frac{1}{2})$ (a tolerance parameter), $c_1\in(0,\frac{1}{2})$ (lower threshold for $\rho_k$), $c_2\in(c_1,1)$ (upper threshold for $\rho_k$), $\sigma_1\in(0,1)$ (regularization shrink parameter), $\sigma_2>1$ (regularization growth parameter), and $k=0$. Then the method can be broken into three steps.

First, we use a suitable termination criterion to check the current $x^k$ and terminate accordingly.

Second, we choose $B_k\in\mathbb{R}^{n\times n}$ and find a solution $d^k$ to $\min_d\hat{q}_k(d)$. If the problem has no solution or if $\text{pred}_k\leq p_\text{min}\|d^k\|\cdot\|r(x^k)\|$ (checking that the predicted drop is a sufficient decrease, since $B_k$ is not necessarily positive definite), we update $x^{k+1}=x^k$ and $\mu_{k+1}=\sigma_2\mu_k$ and move on to the next iteration.

Third, if the previous $d^k$ passed, we set $\rho_k=\text{ared}_k/\text{pred}_k$ and perform the following updates based on how successful the step was. We break down the updates on $\mu_{k+1}$ into unsuccessful if the iteration was skipped or $\rho_k\leq c_1$, successful if $c_1<\rho_k\leq c_2$, and highly successful if $\rho_k>c_2$.

$$\begin{gather*} x^{k+1}=\begin{cases}x^k&\text{if }\rho_k\leq c_1\\x^k+d^k&\text{otherwise}\end{cases}\\ \mu_{k+1}=\begin{cases}\sigma_2\mu_k&\text{if }\rho_k\leq c_1\\\mu_k&\text{if }c_1<\rho_k\leq c_2\\\sigma_1\mu_k&\text{otherwise}\end{cases} \end{gather*}$$
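To make the three steps concrete, here is a compact sketch of the loop, assuming $\varphi(x)=\lambda\|x\|_1$ and the simplification $B_k=b\cdot I$ (a fixed scalar curvature estimate) so that the regularized subproblem has a closed-form soft-thresholding solution; the real method allows general quasi-Newton matrices $B_k$, and all parameter values and names below are placeholders.

```python
# Regularized proximal quasi-Newton sketch for phi(x) = lam*||x||_1 with B_k = b*I.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def rpqn(f, grad_f, x0, lam, b=1.0, mu=1.0, p_min=1e-4, c1=0.25, c2=0.75,
         sigma1=0.5, sigma2=2.0, iters=100):
    psi = lambda x: f(x) + lam * np.sum(np.abs(x))
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        r = soft_threshold(x - g, lam) - x                  # residual r(x^k), used to terminate
        if np.linalg.norm(r) < 1e-8:
            break
        H = b + mu                                          # B_k + mu_k*I collapses to a scalar
        d = soft_threshold(x - g / H, lam / H) - x          # minimizer of the regularized model
        pred = -(g @ d + 0.5 * b * (d @ d)
                 + lam * (np.sum(np.abs(x + d)) - np.sum(np.abs(x))))  # psi(x^k) - q_k(d^k)
        if pred <= p_min * np.linalg.norm(d) * np.linalg.norm(r):
            mu *= sigma2                                    # step skipped: model not trusted
            continue
        rho = (psi(x) - psi(x + d)) / pred                  # ared_k / pred_k
        if rho > c1:
            x = x + d                                       # successful or highly successful step
        mu = sigma2 * mu if rho <= c1 else (mu if rho <= c2 else sigma1 * mu)
    return x

# Toy usage: a small strongly convex quadratic plus the l1 term.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - c @ x
grad_f = lambda x: A @ x - c
print(rpqn(f, grad_f, np.zeros(2), lam=0.1))
```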

Convergence Preliminaries:

A bunch of preliminaries are mentioned in the paper before going into the convergence proofs, so I’ll include the four that were novel to me personally here since they are also important for later steps of the proof.

As a refresher for the first two, we define a scaled proximal operator with some matrix $H$ and a function $\varphi$ as follows, where in the following theorems we assume that $H\in\mathbb{S}^n_{++}$ (symmetric positive definite) and that $\varphi$ is convex.

$$\text{prox}^H_{\varphi}(v)=\text{arg}\min_x\left\{\varphi(x)+\frac{1}{2}(x-v)^TH(x-v)\right\}$$

First, we can prove that $p=\text{prox}^H_\varphi(x)$ if and only if $p\in x-H^{-1}\partial\varphi(p)$. We start by defining the objective function $J(z)$ of the operator.

$$J(z)=\varphi(z)+\frac{1}{2}(z-x)^TH(z-x)$$

Since $\varphi$ is convex and $H$ is positive definite, we know $J(z)$ is strictly convex, thus we know that a point $p$ is the global minimizer of $J(z)$ if and only if $0\in\partial J(p)$. We can then use the sum rule for subdifferentials to simplify this statement into what we need.

$$\begin{gather*} \partial J(p)=\partial\varphi(p)+H(p-x)\\ 0\in\partial\varphi(p)+H(p-x)\\ H(x-p)\in\partial\varphi(p) \end{gather*}$$

Since $H$ is positive definite, it is invertible, which allows us to derive what we originally stated, concluding the proof.

$$\begin{gather*} x-p\in H^{-1}\partial\varphi(p)\\ p\in x-H^{-1}\partial\varphi(p) \end{gather*}$$

Second, we can prove that the proximal operator is firmly nonexpansive with respect to the norm induced by $H$, i.e. for every $x,y\in\mathbb{R}^n$ we have the following.

$$\|\text{prox}^H_\varphi(x)-\text{prox}^H_\varphi(y)\|^2_H\leq\left<\text{prox}^H_\varphi(x)-\text{prox}^H_\varphi(y),x-y\right>_H$$

For simplicity, we can define $p=\text{prox}^H_{\varphi}(x)$ and $q=\text{prox}^H_\varphi(y)$. From the optimality conditions we showed previously we know the following.

$$\begin{gather*} H(x-p)\in\partial\varphi(p)\\ H(y-q)\in\partial\varphi(q) \end{gather*}$$

Since $\varphi$ is a convex function, we know its subdifferential operator is monotone, thus for any $p,q$ and their corresponding subgradients $u,v$ we have $\left<u-v,p-q\right>\geq 0$, thus we know the following for our points.

$$\begin{gather*} \left<H(x-p)-H(y-q),p-q\right>\geq 0\\ \left<H((x-p)-(y-q)),p-q\right>=\left<(x-y)-(p-q),p-q\right>_H\geq0 \end{gather*}$$

We can distribute the inner product and simplify, which allows us to derive what we originally stated, concluding the proof.

$$\begin{gather*} \left<x-y,p-q\right>_H-\left<p-q,p-q\right>_H\geq 0\\ \left<x-y,p-q\right>_H\geq\|p-q\|^2_H\\ \left<\text{prox}^H_\varphi(x)-\text{prox}^H_\varphi(y),x-y\right>_H\geq \|\text{prox}^H_\varphi(x)-\text{prox}^H_\varphi(y)\|^2_H \end{gather*}$$

For the third, we have to prove that if $d^k=0$ then $x^k$ is a stationary point of $\psi$, with the converse of the statement being true if $B_k+\mu_kI$ is positive definite. Keep in mind that the algorithm itself does not require $B_k+\mu_kI$ to be positive definite anywhere, which is why the more reliable $r(x^k)$ stopping criterion is used instead; we just need this result for some assumptions made later.

To prove that $d^k=0$ leads to $x^k$ being a stationary point, we only need to use the definition of $d^k$ and Fermat’s Rule, which trivially simplifies into the statement for the stationarity of $x^k$.

$$\begin{gather*} 0\in\nabla f(x^k)+(B_k+\mu_kI)d^k+\partial\varphi(x^k+d^k)\\ 0\in\nabla f(x^k)+\partial\varphi(x^k) \end{gather*}$$

To prove that $x^k$ being a stationary point leads to $d^k=0$ if $B_k+\mu_kI$ is positive definite, we first get the fact that $-\nabla f(x^k)\in\partial\varphi(x^k)$ from stationarity. We can then use this subgradient in the definition of $\varphi$ being convex to know $\varphi(x^k+d)\geq\varphi(x^k)-\nabla f(x^k)^Td$. Using this we can analyze the objective for our iteration subproblem.

$$\hat{q}_k(0)=f(x^k)+\varphi(x^k)\leq f(x^k)+\nabla f(x^k)^Td+\varphi(x^k+d)$$

We can add the term $\frac{1}{2}d^T(B_k+\mu_kI)d$ to the right-hand side, since by $B_k+\mu_kI\succ 0$ we know that the term is nonnegative. This leads to the statement that $d^k=0$ is a global minimizer, which is guaranteed to be the only solution by $B_k+\mu_kI$‘s positive definiteness causing the subproblem to become strictly convex, thus proving the statement.

$$\hat{q}_k(0)\leq f(x^k)+\nabla f(x^k)^Td+\frac{1}{2}d^T(B_k+\mu_kI)d+\varphi(x^k+d)=\hat{q}_k(d)$$

For the fourth, we need to derive an upper bound for the general residual $r_H(x)$, where the one we discussed previously was defined as $r(x)=r_I(x)$.

$$\begin{gather*} r_H(x)=\text{arg}\min_d\left\{\nabla f(x)^Td+\frac{1}{2}d^THd+\varphi(x+d)\right\}\\ r_H(x)=\text{prox}^H_\varphi(x-H^{-1}\nabla f(x))-x \end{gather*}$$

For some $x$ and symmetric positive definite matrices $H$ and $\tilde{H}$, we know the following.

$$\|r_{\tilde{H}}(x)\|\leq\left(1+\frac{\lambda_\text{max}(\tilde{H})}{\lambda_\text{min}(H)}\right)\cdot\frac{\lambda_\text{max}(H)}{\lambda_\text{min}(\tilde{H})}\cdot\|r_H(x)\|$$

The proof borrows a result from a work that will be mentioned again later, Block Coordinate Descent. From a lemma in that paper we know the following, where $Q=H^{-1/2}\tilde{H}H^{-1/2}$. The proof for the lemma follows from the optimality conditions of the residuals, Fermat’s Rule, and some careful algebraic manipulation.

$$\|r_{\tilde{H}}(x)\|\leq\frac{1+\lambda_\text{max}(Q)+\sqrt{1-2\lambda_\text{min}(Q)+\lambda_\text{max}(Q)^2}}{2}\cdot\frac{\lambda_\text{max}(H)}{\lambda_\text{min}(\tilde{H})}\cdot\|r_H(x)\|$$

We can then derive two bounds for the terms within the bound derived above.

$$\begin{gather*} 1-2\lambda_{\min}(Q)+\lambda_{\max}(Q)^2\leq 1+\lambda_{\max}(Q)^2\leq (1+\lambda_{\max}(Q))^2\\ \lambda_{\max}(Q)\leq \lambda_{\max}(\tilde{H})/\lambda_{\min}(H) \end{gather*}$$

The second of these bounds can be derived through the Rayleigh quotient characterization of the eigenvalues of $Q$, shown below. Plugging both bounds into the lemma and simplifying then gives us the desired result.

$$\begin{gather*} \lambda_{\max}(Q)=\max_{x\neq 0}\frac{x^TH^{-1/2}\tilde{H}H^{-1/2}x}{x^Tx}=\max_{z\neq 0}\frac{z^T\tilde{H}z}{z^THz}\\ \lambda_{\max}(Q)=\max_{z\neq 0}\left(\frac{z^T\tilde{H}z}{z^Tz}\cdot\frac{z^Tz}{z^THz}\right)\leq\left(\max_{z\neq 0}\frac{z^T\tilde{H}z}{z^Tz}\right)\cdot\left(\max_{z\neq 0}\frac{z^Tz}{z^THz}\right)\\ \lambda_{\max}(Q)\leq\lambda_{\max}(\tilde{H})\cdot\frac{1}{\min_{z\neq 0}\frac{z^THz}{z^Tz}}=\lambda_{\max}(\tilde{H})\cdot\frac{1}{\lambda_{\min}(H)} \end{gather*}$$

Global Weak Convergence:

If $\{B_k\}$ is a bounded sequence of symmetric matrices and $\psi$ is bounded from below (along with the requirements from the problem definition), then any sequence $\{x^k\}$ generated by the method satisfies $\liminf_{k\rightarrow\infty}\|r(x^k)\|=0$. This ensures that the residual goes to zero along at least one subsequence, although this doesn’t require that the method itself converges (with that being proven in a later section).

The proof follows through a long series of lemma work to prove through contradiction that there will be an infinite number of successful steps if the algorithm were to never reach termination, and then uses that to prove the theorem itself through another contradiction.

To start the prep work for the first contradiction, we establish an inequality that relates $r(x^k)$ and $d^k$ if we assume that $\{x^k\}$ converges to a nonstationary point $\bar{x}$ of $\psi$. We also assume that $\{B_k\}$ is a bounded sequence of symmetric matrices and that $\mu_k\rightarrow\infty$.

Since $\{B_k\}$ is bounded, we know that there is some constant $C$ such that $\lambda_\text{min}(B_k)\geq -C$, which combined with the fact that we know $\mu_k\rightarrow\infty$ means that we can guarantee $\lambda_\text{min}(B_k+\mu_kI)\geq -C+\mu_k>0$ for large enough $k$, meaning we can guarantee that $B_k+\mu_kI$ is positive definite for large enough $k$. This lets us use the lemma we established previously about the weighted residuals with $H=B_k+\mu_kI$ and $\tilde{H}=I$.

$$\|r_{\tilde{H}}(x)\|\leq\left(1+\frac{\lambda_\text{max}(\tilde{H})}{\lambda_\text{min}(H)}\right)\cdot\frac{\lambda_\text{max}(H)}{\lambda_\text{min}(\tilde{H})}\cdot\|r_H(x)\|$$

Since $\|r(x)\|=0$ if and only if $x$ is a stationary point, we know that $\|r_{B_k+\mu_kI}(x^k)\|\neq 0$, so we can divide it out of both sides, which along with dividing by $\mu_k$ and taking $k\rightarrow\infty$ lets us derive the bound we need for future steps.

$$\begin{gather*} \frac{\|r(x^k)\|}{\|r_{B_k+\mu_kI}(x^k)\|}\leq\left(1+\frac{1}{\lambda_\text{min}(B_k)+\mu_k}\right)\cdot(\lambda_\text{max}(B_k)+\mu_k)\\ \limsup_{k\rightarrow\infty}\frac{\|r(x^k)\|}{\|r_{B_k+\mu_kI}(x^k)\|\cdot\mu_k}\leq 1 \end{gather*}$$

Second, we show that under high regularization $d^k$ will always converge to $0$, under the assumptions that $\{B_k\}$ and $\{x^k\}$ are bounded sequences and $\mu_k\rightarrow\infty$.

Since $\{B_k\}$ being bounded and $\mu_k\rightarrow\infty$ lead to $B_k+\mu_kI$ being positive definite from previous steps, we know that the subproblem becomes strictly convex for sufficiently large $k$, meaning that $d^k$ is well defined in such cases. As well, since each iteration is only accepted if it’s successful or highly successful (which for the rest of the proof will be lumped into being successful), we know that the sequence can’t worsen in the objective, so $\{\psi(x^k)\}$ is monotonically non-increasing. From these two we know that we have the following from the definition of $d^k$.

$$\psi(x^0)\geq\psi(x^k)=\hat{q}_k(0)\geq\hat{q}_k(d^k)$$

We can then use the convexity of $\varphi$ to know that $\varphi(x^k+d^k)\geq\varphi(x^k)+(u^k)^Td^k$ (for some subgradient $u^k\in\partial\varphi(x^k)$) after expanding the defined $\hat{q}_k$ and $\psi$ to get the bound in more separable terms.

$$\begin{gather*} \hat{q}_k(d^k)=f(x^k)+\nabla f(x^k)^Td^k+\frac{1}{2}(d^k)^T(B_k+\mu_kI)d^k+\varphi(x^k+d^k)\\ \hat{q}_k(d^k)\geq f(x^k)+\nabla f(x^k)^Td^k+\frac{1}{2}(d^k)^T(B_k+\mu_kI)d^k+\varphi(x^k)+(u^k)^Td^k \end{gather*}$$

Since $\{x^k\}$ and $\{B_k\}$ are bounded, from continuity we know that $\{f(x^k)\}$, $\{\psi(x^k)\}$, and $\{\nabla f(x^k)\}$ are bounded. As well, since the subdifferential maps bounded sets to bounded sets, we know that $\{u^k\}$ is bounded as well. Since the entire inequality is bounded above by the fixed value $\psi(x^0)$, the right-hand side needs to stay bounded as $\mu_k\rightarrow\infty$. This means that $\frac{1}{2}(d^k)^T(B_k+\mu_kI)d^k$ needs to stay bounded, so $d^k\rightarrow 0$.

Third, we use this statement on $d^k$ to prove that the directional derivative along the normalized step is strictly negative, which is going to be a key property for the later contradiction.

We assume that $\{B_k\}$ is a bounded sequence of symmetric matrices, $\mu_k\rightarrow\infty$, and $\bar{x}$ is a nonstationary point of $\psi$. We also define $\bar{d}^k=r_{B_k+\mu_kI}(\bar{x})$ and let $s$ be an accumulation point of the sequence $\{\bar{d}^k/\|\bar{d}^k\|\}$ (which we know has accumulation points since the sequence is by definition bounded). Due to Fermat’s Rule we know that we have the following from the iteration subproblem for some $u^k\in\partial\varphi(\bar{x}+\bar{d}^k)$.

$$0=\nabla f(\bar{x})+(B_k+\mu_kI)\bar{d}^k+u^k$$

Since $(B_k+\mu_kI)\bar{d}^k$ is exactly equal to $-(\nabla f(\bar{x})+u^k)$, we know that neither term can be $0$ due to the nonstationarity of the convergent point. This allows us to know that even though $\bar{d}^k\rightarrow 0$, it’s still not zero, so the sequence $\{\bar{d}^k/\|\bar{d}^k\|\}$ is well-defined, and $s$ is a real point.

Since the sequence of normalized vectors and the sequence of subgradients are both bounded, we can use the Bolzano-Weierstrass Theorem to form a subsequence $K$ that satisfies the following. Since the subdifferential is closed we also know that $\bar{u}\in\partial\varphi(\bar{x})$, which means we have $\nabla f(\bar{x})+\bar{u}\neq 0$ from nonstationarity.

$$\frac{\bar{d}^k}{\|\bar{d}^k\|}\rightarrow_Ks\quad u^k\rightarrow_K\bar{u}$$

From the previous proofs on Proximal Newton, we know that for some regularized proximal step $d$ we have $\psi^\prime(x;d)\leq-d^THd$, where $H$ is the Hessian of the quadratic model. Since we have $H=B_k+\mu_kI$, we can substitute the value in and then use the eigenvalues to simplify.

$$\psi^\prime(\bar{x},\bar{d}^k)\leq-(\bar{d}^k)^T(B_k+\mu_kI)\bar{d}^k\leq-(\lambda_\text{min}(B_k)+\mu_k)\|\bar{d}^k\|^2$$

From the previous optimality analysis we know $\|\nabla f(\bar{x})+u^k\|=\|(B_k+\mu_kI)\bar{d}^k\|\leq(\|B_k\|+\mu_k)\|\bar{d}^k\|$, which we can use to further simplify.

$$\psi^\prime(\bar{x},\bar{d}^k)\leq-\|\nabla f(\bar{x})+u^k\|\cdot\frac{\lambda_\text{min}(B_k)+\mu_k}{\|B_k\|+\mu_k}\cdot\|\bar{d}^k\|$$

As we have $\varphi$ as a convex function, we know that $\psi^\prime(\bar{x};d)$ is sublinear, so we know it satisfies positive homogeneity (the first line of the following), which we can use to bound the derivative along our normalized sequence.

$$\begin{gather*} \psi^\prime(x;\alpha d)=\alpha\psi^\prime(x;d)\\ \psi^\prime\left(\bar{x};\frac{\bar{d}^k}{\|\bar{d}^k\|}\right)\leq-\|\nabla f(\bar{x})+u^k\|\cdot\frac{\lambda_\text{min}(B_k)+\mu_k}{\|B_k\|+\mu_k} \end{gather*}$$

Since we have $k\rightarrow_K\infty$, we can use our previous analysis of the convergent subsequence to prove that the directional derivative is strictly negative. Since $\nabla f(\bar{x})+\bar{u}$ is nonzero, we know its norm has to be positive, and since we have $\mu_k\rightarrow\infty$, the multiplicative term including it converges to $1$.

$$\psi^\prime(\bar{x},s)=\lim_{K\ni k\rightarrow\infty}\psi^\prime\left(\bar{x},\frac{\bar{d}^k}{\|\bar{d}^k\|}\right)\leq-\|\nabla f(\bar{x})+\bar{u}\|<0$$

Finally for the prep work, we can use this and our previous results to show that the algorithm will perform infinitely many successful steps if it goes on forever.

We assume for contradiction that there exists some $k_0\in\mathbb{N}$ such that all steps $k\geq k_0$ are unsuccessful. This implies that $x^k=x^{k_0}$ for all $k\geq k_0$, that we have $\|r(x^k)\|\neq 0$ since we haven’t reached the stopping criterion yet (so $x^{k_0}$ is nonstationary), and also that $\mu_k\rightarrow\infty$ by the algorithm definition. Since $\{B_k\}$ is a bounded sequence, this means that $B_k+\mu_kI$ is eventually positive definite and $d^k$ is well-defined. Also from the previous lemmas we have the fact that $d^k\neq 0$ and that we have the following (from the first lemma, since we have $p_\text{min}<0.5$ and $d^k=r_{B_k+\mu_kI}(x^k)$ from the regularized matrix being positive definite).

$$\frac{\|r(x^k)\|}{\|d^k\|\mu_k}<\frac{1}{2p_\text{min}}$$

From the optimality of the subproblem we have $\hat{q}_k(d^k)\leq\hat{q}_k(0)$, which we can use to bound $\text{pred}_k$ to prove that the failure point is not from the second step of the algorithm, meaning we need to fail in the third.

$$\begin{gather*} \text{pred}_k=\psi(x^k)-q_k(d^k)=\psi(x^k)-\hat{q}_k(d^k)+\frac{\mu_k}{2}\|d^k\|^2\\ \psi(x^k)-\hat{q}_k(d^k)+\frac{\mu_k}{2}\|d^k\|^2\geq \psi(x^k)-\hat{q}_k(0)+\frac{\mu_k}{2}\|d^k\|^2=\frac{\mu_k}{2}\|d^k\|^2\\ \frac{\mu_k}{2}\|d^k\|^2>p_\text{min}\|r(x^k)\|\cdot\|d^k\| \end{gather*}$$

This means we need aredkc1predk\text{ared}_k\leq c_1\text{pred}_k for the steps to be unsuccessful.

ψ(xk0+dk)ψ(xk0)c1(f(xk0)Tdk+φ(xk0+dk)φ(xk0)+12(dk)TBkdk)\psi(x^{k_0}+d^k)-\psi(x^{k_0})\geq c_1(\nabla f(x^{k_0})^Td^k+\varphi(x^{k_0}+d^k)-\varphi(x^{k_0})+\frac{1}{2}(d^k)^TB_kd^k)

We can then divide by dk\|d^k\| to introduce both the normalized direction sequence from before and difference quotients that will become directional derivatives, defining tk=dkt_k=\|d^k\| for notational simplicity.

ψ(xk0+tkdkdk)ψ(xk0)tkc1(f(xk0)Tdkdk+φ(xk0+tkdkdk)φ(xk0)tk+12(dk)TdkBkdk)\frac{\psi(x^{k_0}+t_k\frac{d^k}{\|d^k\|})-\psi(x^{k_0})}{t_k}\geq c_1\left(\nabla f(x^{k_0})^T\frac{d^k}{\|d^k\|}+\frac{\varphi(x^{k_0}+t_k\frac{d^k}{\|d^k\|})-\varphi(x^{k_0})}{t_k}+\frac{1}{2}\frac{(d^k)^T}{\|d^k\|}B_kd^k\right)

From previous results, we can choose a subsequence KK such that dk/dksd^k/\|d^k\|\rightarrow s; since t_k=\|d^k\|\rightarrow 0 and φ\varphi is locally Lipschitz continuous, the difference quotients on the left and inside the parentheses converge to the directional derivatives ψ(xk0;s)\psi^\prime(x^{k_0};s) and \varphi^\prime(x^{k_0};s), giving the following. The quadratic term on the right vanishes since dk0d^k\rightarrow 0 and {Bk}\{B_k\} is bounded.

ψ(xk0;s)c1(f(xk0)Ts+φ(xk0;s))ψ(xk0;s)c1ψ(xk0;s)\begin{gather*} \psi^\prime(x^{k_0};s)\geq c_1(\nabla f(x^{k_0})^Ts+\varphi^\prime(x^{k_0};s))\\ \psi^\prime(x^{k_0};s)\geq c_1\psi^\prime(x^{k_0};s) \end{gather*}

Since c1(0,1)c_1\in(0,1), rearranging gives (1-c_1)\psi^\prime(x^{k_0};s)\geq 0, i.e. ψ(xk0;s)0\psi^\prime(x^{k_0};s)\geq 0. This contradicts our earlier result that ψ(xk0;s)<0\psi^\prime(x^{k_0};s)<0, so the algorithm cannot get stuck at a nonstationary point.

With this, we can finally prove the main convergence theorem through another proof by contradiction. We assume for contradiction that lim infkr(xk)>0\liminf_{k\rightarrow\infty}\|r(x^k)\|>0, along with the assumption that ψ\psi is bounded from below. We also use the previous result, which guarantees infinitely many successful iterations, to work with the set S\mathcal{S} of successful steps (we can restrict attention to S\mathcal{S} because unsuccessful steps do not move the iterate, so the corresponding terms are zero).

Our assumption means that there exists some k0Nk_0\in\mathbb{N} and ϵ>0\epsilon>0 such that r(xk)ϵ\|r(x^k)\|\geq\epsilon for all kk0k\geq k_0. By definition of successful steps, we get the following for all kSk\in\mathcal{S}.

ψ(xk)ψ(xk+1)c1predkpminc1dkr(xk)pminc1ϵdk\psi(x^k)-\psi(x^{k+1})\geq c_1\text{pred}_k\geq p_\text{min}c_1\|d^k\|\cdot\|r(x^k)\|\geq p_\text{min}c_1\epsilon\|d^k\|

We can then sum this inequality, which telescopes over all iterations since unsuccessful steps leave both the objective and the iterate unchanged. Since ψ\psi is bounded from below, the telescoping sum is finite, and isolating the sum over dk\|d^k\| gives the following.

>k=0[ψ(xk)ψ(xk+1)]=kS[ψ(xk)ψ(xk+dk)]pminc1ϵkSdk>kSdk=kSxk+1xk=k=0xk+1xk\begin{gather*} \infty>\sum^\infty_{k=0}[\psi(x^k)-\psi(x^{k+1})]=\sum_{k\in\mathcal{S}}[\psi(x^k)-\psi(x^k+d^k)]\geq p_\text{min}c_1\epsilon\sum_{k\in\mathcal{S}}\|d^k\|\\ \infty>\sum_{k\in\mathcal{S}}\|d^k\|=\sum_{k\in\mathcal{S}}\|x^{k+1}-x^k\|=\sum^\infty_{k=0}\|x^{k+1}-x^k\| \end{gather*}

This means that {xk}\{x^k\} is a Cauchy sequence, and therefore convergent to some xˉ\bar{x}, which by our assumption is not a stationary point of ψ\psi, since r(xˉ)=limkr(xk)\|r(\bar{x})\|=\lim_{k\rightarrow\infty}\|r(x^k)\|, which is at least ϵ>0\epsilon>0.

Since there are infinitely many successful steps, the bounded sum forces dkS0\|d^k\|\rightarrow_\mathcal{S}0. We can then use Fermat's Rule on the subproblem to see that assuming {μk}\{\mu_k\} is bounded leads to a contradiction.

0=\nabla f(x^k)+(B_k+\mu_kI)d^k+u^k\quad\Longrightarrow\quad\nabla f(x^k)+u^k=-(B_k+\mu_kI)d^k\rightarrow_\mathcal{S}0

This means that we would have 0=f(xˉ)+uˉ0=\nabla f(\bar{x})+\bar{u}, which contradicts the nonstationarity of xˉ\bar{x}, so we need {μk}S\{\mu^k\}_\mathcal{S}\rightarrow\infty. That also means that {μk}\{\mu_k\}\rightarrow\infty, since μk\mu_k cannot decrease during unsuccessful steps, and that the algorithm performs infinitely many unsuccessful steps, since μk\mu_k can't grow during successful iterations.

From previous steps we know that the following holds for sufficiently large kk.

predkpmindkr(xk)pminϵdk\text{pred}_k\geq p_\text{min}\|d^k\|\cdot\|r(x^k)\|\geq p_\text{min}\epsilon\|d^k\|

By the Mean Value Theorem, for every kk there exists some ξk\xi^k on the straight line between xkx^k and xk+dkx^k+d^k such that f(xk+dk)f(xk)=f(ξk)Tdkf(x^k+d^k)-f(x^k)=\nabla f(\xi^k)^Td^k. By the convergence of {xk}\{x^k\} to xˉ\bar{x}, and since {dk}0\{d^k\}\rightarrow 0, the sequence {ξk}\{\xi^k\} must also converge to xˉ\bar{x}, which means we can obtain the following for kk\rightarrow\infty.

ρk1=ψ(xk)ψ(xk+dk)ψ(xk)qk(dk)1=ψ(xk+dk)qk(dk)ψ(xk)qk(dk)ρk11pminϵf(xk+dk)f(xk)f(xk)Tdk+12(dk)TBkdkdkρk11pminϵf(ξk)dkf(xk)Tdkdk+12pminϵ(dk)TBkdkdk0\begin{gather*} |\rho_k-1|=\left|\frac{\psi(x^k)-\psi(x^k+d^k)}{\psi(x^k)-q_k(d^k)}-1\right|=\left|\frac{\psi(x^k+d^k)-q_k(d^k)}{\psi(x^k)-q_k(d^k)}\right|\\ |\rho_k-1|\leq\frac{1}{p_\text{min}\epsilon}\frac{|f(x^k+d^k)-f(x^k)-\nabla f(x^k)^Td^k|+\frac{1}{2}|(d^k)^TB_kd^k|}{\|d^k\|}\\ |\rho_k-1|\leq \frac{1}{p_\text{min}\epsilon}\frac{|\nabla f(\xi^k)d^k-\nabla f(x^k)^Td^k|}{\|d^k\|}+\frac{1}{2p_\text{min}\epsilon}\left|(d^k)^TB_k\frac{d^k}{\|d^k\|}\right|\rightarrow 0 \end{gather*}
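
Both remaining terms in the last line vanish: the first because \xi^k and x^k converge to the same point xˉ\bar{x} and f\nabla f is continuous, the second because {Bk}\{B_k\} is bounded while dk0d^k\rightarrow 0:

\frac{|\nabla f(\xi^k)^Td^k-\nabla f(x^k)^Td^k|}{\|d^k\|}\leq\|\nabla f(\xi^k)-\nabla f(x^k)\|\rightarrow0\qquad\left|(d^k)^TB_k\frac{d^k}{\|d^k\|}\right|\leq\|B_k\|\cdot\|d^k\|\rightarrow0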

This means that {ρk}1\{\rho_k\}\rightarrow 1, so eventually every step would be successful, contradicting the fact that there are infinitely many unsuccessful steps. Our assumption must therefore be false, so either the algorithm terminates with r(xk)=0\|r(x^k)\|=0 after finitely many iterations or \liminf_{k\rightarrow\infty}\|r(x^k)\|=0, which concludes the proof.

Global Strong Convergence:

If {Bk}\{B_k\} is a bounded sequence of symmetric matrices, ψ\psi is bounded from below, and f\nabla f is uniformly continuous on a set XX with {xk}X\{x^k\}\subset X (along with the requirements from the problem definition), then any sequence {xk}\{x^k\} generated by the method satisfies limkr(xk)=0\lim_{k\rightarrow\infty}\|r(x^k)\|=0. This ensures that every convergent subsequence mentioned in the global weak convergence theorem converges to a stationary point, although different subsequences can converge to different stationary points (convergence to a single stationary point is handled in a later section).

The proof follows by assuming for contradiction that the residual sequence does not converge to 00: then there exist some δ>0\delta>0 and a subsequence KK with \|r(x^k)\|\geq2\delta for all kKk\in K. The uniform continuity of f\nabla f is then used to show that the residual cannot change this drastically over the shrinking steps.

Since the weak convergence theorem guarantees that the residual drops below δ\delta again after every kKk\in K, there is an index (k)>k\ell(k)> k such that r(xl)δ\|r(x^l)\|\geq\delta for all kl<(k)k\leq l<\ell (k) and r(x(k))<δ\|r(x^{\ell(k)})\|<\delta. If an iteration kl<(k)k\leq l<\ell(k) is successful, we get the following (which also holds trivially for unsuccessful iterations, so we can sum over all iterations in this range).

ψ(xl)ψ(xl+1)c1predlc1pminr(xl)dlc1pminδxl+1xlc1pminδx(k)xkc1pminδl=k(k)1xl+1xll=k(k)1ψ(xl)ψ(xl+1)=ψ(xk)ψ(x(k))\begin{gather*} \psi(x^l)-\psi(x^{l+1})\geq c_1\text{pred}_l\geq c_1p_\text{min}\|r(x^l)\|\cdot\|d^l\|\geq c_1p_\text{min}\delta\|x^{l+1}-x^l\|\\ c_1p_\text{min}\delta\|x^{\ell(k)}-x^k\|\leq c_1p_\text{min}\delta\sum^{\ell(k)-1}_{l=k}\|x^{l+1}-x^l\|\leq\sum^{\ell(k)-1}_{l=k}\psi(x^l)-\psi(x^{l+1})=\psi(x^k)-\psi(x^{\ell(k)}) \end{gather*}

By assumption ψ\psi is bounded from below, and from the algorithm definition {ψ(xk)}\{\psi(x^k)\} is monotonically decreasing, so the sequence is convergent. This implies that {ψ(xk)ψ(x(k))}K0\{\psi(x^k)-\psi(x^{\ell(k)})\}\rightarrow_K0, so we get {x(k)xk}K0\{\|x^{\ell(k)}-x^k\|\}\rightarrow_K0.

The uniform continuity of f\nabla f together with the nonexpansiveness of the proximal operator means that r()r(\cdot) is uniformly continuous, so we also get {r(x(k))r(xk)}K0\{\|r(x^{\ell(k)})-r(x^k)\|\}\rightarrow_K0. This contradicts the following lower bound, which holds for every kKk\in K by the definitions of KK and (k)\ell(k), concluding the proof.

r(xk)r(x(k))r(xk)r(x(k))2δδδ\|r(x^k)-r(x^{\ell(k)})\|\geq\|r(x^k)\|-\|r(x^{\ell(k)})\|\geq 2\delta-\delta\geq \delta

Convergence Under an Error Bound Condition:

The paper rounds out the convergence analysis of the algorithm with an error bound strategy similar to the one used in Block Coordinate Descent. We assume that ψ\psi is bounded from below, that the set of stationary points X\mathcal{X}^* of ψ\psi is nonempty, and that two properties are satisfied to ensure the optimization landscape is regular enough to provide stable convergence: a local error bound for stationarity and a separation property between stationary points. First, for any ξminxψ(x)\xi\geq\min_x\psi(x), there exist scalars τ>0\tau>0 and ϵ>0\epsilon>0 such that dist(x,X)τr(x)\text{dist}(x,\mathcal{X}^*)\leq\tau\|r(x)\| whenever ψ(x)ξ,r(x)ϵ\psi(x)\leq\xi,\|r(x)\|\leq\epsilon (this ensures that the residual is a good measure of the distance to the stationary set). Second, there exists a scalar δ>0\delta>0 such that xyδ\|x-y\|\geq\delta whenever xXx\in\mathcal{X}^*, yXy\in\mathcal{X^*}, and ψ(x)ψ(y)\psi(x)\neq\psi(y) (this ensures that stationary points with different objective values are properly separated).

If ff has a Lipschitz continuous gradient and MIHkmIMI\succeq H_k\succeq mI for some Mm>0M\geq m>0 (along with the above being true), then the sequence {xk}\{x^k\} converges to some xˉ\bar{x}. A byproduct of the proof is that the algorithm converges Q-linearly in the objective values, meaning the following holds for the limiting stationary value ψˉ\bar{\psi} and some θ(0,1)\theta\in(0,1).

ψ(xk+1)ψˉθ(ψ(xk)ψˉ)\psi(x^{k+1})-\bar{\psi}\leq\theta(\psi(x^k)-\bar{\psi})
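
Unrolling this recursion from any index k_0 at which it begins to hold gives the familiar geometric decay of the objective gap:

\psi(x^k)-\bar{\psi}\leq\theta^{k-k_0}\left(\psi(x^{k_0})-\bar{\psi}\right)\quad\text{for all }k\geq k_0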

The proof follows a very similar structure to that in Block Coordinate Descent’s paper after some initial prep work. We start by assuming the sequence {Hk}\{H_k\} is uniformly bounded and positive definite, i.e. for some 0<mM0<m\leq M we have mIHkMImI\preceq H_k\preceq MI. Since HkmIH_k\succeq mI, we know that dkTHkdkmdk2d^T_kH_kd_k\geq m\|d_k\|^2, which can also be applied to Hk+μkIH_k+\mu_kI giving dkT(Hk+μkI)dk(m+μk)dk2d^T_k(H_k+\mu_kI)d_k\geq (m+\mu_k)\|d_k\|^2.

Using these two facts allows us to bound predk\text{pred}_k. We first use the convexity of φ\varphi and u^k\in\partial\varphi(x^k+d^k) to get the subgradient inequality \varphi(x^k+d^k)-\varphi(x^k)\leq(u^k)^Td^k, and then the optimality conditions of the subproblem to get (f(xk)+uk)=(Hk+μkI)dk-(\nabla f(x^k)+u^k)=(H_k+\mu_kI)d^k.

predk=(f(xk)Tdk+φ(xk+dk)φ(xk))12(dk)THkdkpredk(dk)T(Hk+μkI)dk12(dk)THkdk12(m+2μk)dk2\begin{gather*} \text{pred}_k=-(\nabla f(x^k)^Td^k+\varphi(x^k+d^k)-\varphi(x^k))-\frac{1}{2}(d^k)^TH_kd^k\\ \text{pred}_k\geq (d^k)^T(H_k+\mu_kI)d^k-\frac{1}{2}(d^k)^TH_kd^k\geq\frac{1}{2}(m+2\mu_k)\|d^k\|^2 \end{gather*}
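
Spelling out how the two facts combine: substituting -\nabla f(x^k)^Td^k=(d^k)^T(H_k+\mu_kI)d^k+(u^k)^Td^k into the definition and dropping the bracketed term, which is nonnegative by the subgradient inequality, gives

\text{pred}_k=(d^k)^T(H_k+\mu_kI)d^k+\left[(u^k)^Td^k-\left(\varphi(x^k+d^k)-\varphi(x^k)\right)\right]-\frac{1}{2}(d^k)^TH_kd^k\geq\frac{1}{2}(d^k)^TH_kd^k+\mu_k\|d^k\|^2

and the last expression is at least \frac{1}{2}(m+2\mu_k)\|d^k\|^2 because H_k\succeq mI.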

Using the lemma we established about the residual gives us two more inequalities, which we will use later, together with λmax(Hk+μkI)M+μk\lambda_\text{max}(H_k+\mu_kI)\leq M+\mu_k and λmin(Hk+μkI)m+μk\lambda_\text{min}(H_k+\mu_kI)\geq m+\mu_k.

r(xk)dk(1+1m+μk)(M+μk)m+1m(M+μk)dkr(xk)1+M+μkm+μk1+Mm\begin{gather*} \frac{\|r(x^k)\|}{\|d^k\|}\leq\left(1+\frac{1}{m+\mu_k}\right)(M+\mu_k)\leq\frac{m+1}{m}(M+\mu_k)\\ \frac{\|d^k\|}{r(x^k)}\leq\frac{1+M+\mu_k}{m+\mu_k}\leq\frac{1+M}{m} \end{gather*}

Next, we have to prove that {μk}\{\mu_k\} is a bounded sequence. We first prove that if ff is LL-smooth and at some iterate xkx^k we have μkμˉ=max{Lm,0}\mu_k\geq\bar{\mu}=\max\{L-m,0\}, then aredk>c1predk\text{ared}_k>c_1\text{pred}_k.

By the Lipschitz continuity of f\nabla f, we have the following, which is further simplified since we assume that μkμˉ\mu_k\geq\bar{\mu}, meaning μkLm\mu_k\geq L-m and that we have Hk+μkImI+(Lm)I=LIH_k+\mu_kI\geq mI+(L-m)I=LI.

f(xk+dk)f(xk)f(xk)Tdk+12Ldk2f(xk)Tdk+12(dk)T(Hk+μkI)dkf(x^k+d^k)-f(x^k)\leq\nabla f(x^k)^Td^k+\frac{1}{2}L\|d^k\|^2\leq\nabla f(x^k)^Td^k+\frac{1}{2}(d^k)^T(H_k+\mu_kI)d^k

We can then add φ(xk+dk)φ(xk)\varphi(x^k+d^k)-\varphi(x^k) to both sides to get a more workable form of the inequality.

ψ(xk+dk)ψ(xk)f(xk)Tdk+φ(xk+dk)φ(xk)+12(dk)T(Hk+μkI)dk\psi(x^k+d^k)-\psi(x^k)\leq\nabla f(x^k)^Td^k+\varphi(x^k+d^k)-\varphi(x^k)+\frac{1}{2}(d^k)^T(H_k+\mu_kI)d^k

By definition this leads to aredkpredk+μk2dk2-\text{ared}_k\leq-\text{pred}_k+\frac{\mu_k}{2}\|d^k\|^2, which, when combined with the first bound from the previous steps, gives the following by using \frac{\mu_k+m}{2\mu_k+m}>\frac{1}{2} and c_1\leq\frac{1}{2}, concluding the proof of the lemma.

aredkpredkμk2dk2predkμk+m2μk+m>12predkc1predk\text{ared}_k\geq\text{pred}_k-\frac{\mu_k}{2}\|d^k\|^2\geq\text{pred}_k\cdot\frac{\mu_k+m}{2\mu_k+m}>\frac{1}{2}\text{pred}_k\geq c_1\text{pred}_k
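
The middle inequality above comes from substituting \|d^k\|^2\leq\frac{2}{m+2\mu_k}\text{pred}_k, which is a rearrangement of the first bound \text{pred}_k\geq\frac{1}{2}(m+2\mu_k)\|d^k\|^2:

\text{pred}_k-\frac{\mu_k}{2}\|d^k\|^2\geq\left(1-\frac{\mu_k}{m+2\mu_k}\right)\text{pred}_k=\frac{\mu_k+m}{2\mu_k+m}\text{pred}_k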

Although this ensures that the acceptance test passes once μk\mu_k is large enough, a step can still be declared unsuccessful through the other exit (the check \text{pred}_k\geq p_\text{min}\|d^k\|\cdot\|r(x^k)\| failing), so we use the previous result to prove by contradiction that {μk}\{\mu_k\} is bounded if ff is LL-smooth and MIHkmIMI\succeq H_k\succeq mI.

We assume that {μk}\{\mu_k\} is unbounded, meaning that there is a subsequence KK such that {μk}K\{\mu_k\}_K\rightarrow\infty, which implies there are infinitely many unsuccessful steps. Without loss of generality, we can assume that all steps kKk\in K are unsuccessful. From the previous lemma, this is only possible if we have predk<pmindkr(xk)\text{pred}_k<p_\text{min}\|d^k\|\cdot\|r(x^k)\|, which along with the first bound from the first lemma gives the following.

m+2μk2dk<pminr(xk)r(xk)μkdk>m+2μk2pminμk\begin{gather*} \frac{m+2\mu_k}{2}\|d^k\|<p_\text{min}\|r(x^k)\|\\ \frac{\|r(x^k)\|}{\mu_k\|d^k\|}>\frac{m+2\mu_k}{2p_\text{min}\mu_k} \end{gather*}

We can then combine this with the second bound from the first lemma to get the following for all kKk\in K.

(1+1m+μk)M+μkμk>m+2μk2pminμk\left(1+\frac{1}{m+\mu_k}\right)\frac{M+\mu_k}{\mu_k}>\frac{m+2\mu_k}{2p_\text{min}\mu_k}
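
Both sides have explicit limits along KK since \mu_k\rightarrow_K\infty:

\lim_{K\ni k\rightarrow\infty}\left(1+\frac{1}{m+\mu_k}\right)\frac{M+\mu_k}{\mu_k}=1\qquad\lim_{K\ni k\rightarrow\infty}\frac{m+2\mu_k}{2p_\text{min}\mu_k}=\frac{1}{p_\text{min}}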

Taking the limit over KK on both sides (as computed above) gives 1\geq 1/p_\text{min}, which is impossible since pmin(0,12)p_\text{min}\in(0,\frac{1}{2}) forces 1/p_\text{min}>2. This contradiction completes the proof of the boundedness of {μk}\{\mu_k\}.

Finally, we can complete the proof of the original theorem by using these lemmas and a couple of lemmas from previous proofs (the fact that r(xk)0\|r(x^k)\|\rightarrow 0 and dk0d^k\rightarrow 0 under the assumptions from the previous convergence theorems and that {ψ(xk)}\{\psi(x^k)\} is a monotonically decreasing sequence from the algorithm definition). Since {ψ(xk)}\{\psi(x^k)\} is monotonically decreasing and bounded below, we know it must converge to some finite ψˉ\bar{\psi} which will be important for future steps.

Since r(xk)0\|r(x^k)\|\rightarrow 0, for all sufficiently large kk we have r(xk)ϵ\|r(x^k)\|\leq\epsilon, so we can apply the first error bound assumption (with \xi=\psi(x^0), which is valid since the objective values are decreasing) to define some xˉkX\bar{x}^k\in\mathcal{X}^*, a nearest stationary point to x^k, that satisfies the following.

dist(xk,X)=xkxˉkτr(xk)\text{dist}(x^k,\mathcal{X}^*)=\|x^k-\bar{x}^k\|\leq\tau\|r(x^k)\|

Since r(xk)0\|r(x^k)\|\rightarrow 0, this simplifies to xkxˉk0\|x^k-\bar{x}^k\|\rightarrow 0. We can then use this to get a decomposed bound on the distance between consecutive projected points xˉk\bar{x}^k.

xˉk+1xˉk=xˉk+1xk+1+xk+1xk+xkxˉkxˉk+1xˉkxˉk+1xk+1+xk+1xk+xkxˉk\begin{gather*} \|\bar{x}^{k+1}-\bar{x}^k\|=\|\bar{x}^{k+1}-x^{k+1}+x^{k+1}-x^k+x^k-\bar{x}^k\|\\\\ \|\bar{x}^{k+1}-\bar{x}^k\|\leq\|\bar{x}^{k+1}-x^{k+1}\|+\|x^{k+1}-x^k\|+\|x^k-\bar{x}^k\| \end{gather*}

From the above convergence statement we know that the first and third term go to 00, and the second term xk+1xk=dk\|x^{k+1}-x^k\|=\|d^k\| also goes to 00, so we have xˉk+1xˉk0\|\bar{x}^{k+1}-\bar{x}^k\|\rightarrow 0.

From the second error bound assumption, we know that any two stationary points in X\mathcal{X}^* with different objective values must be at least δ\delta apart. Since xˉk+1xˉk0\|\bar{x}^{k+1}-\bar{x}^k\|\rightarrow 0, consecutive xˉk\bar{x}^k eventually get closer than δ\delta, so their objective values must eventually agree on a single value, which can be identified with the limit ψˉ\bar{\psi} from before since xkxˉk0\|x^k-\bar{x}^k\|\rightarrow 0, giving ψ(xˉk)=ψˉ\psi(\bar{x}^k)=\bar{\psi}.

We can then find the suboptimality bound between the current iterate’s objective value and our convergent ψˉ\bar{\psi}. We first use the convexity of φ\varphi and the fact that we have f(xˉk)φ(xˉk)-\nabla f(\bar{x}^k)\in\partial\varphi(\bar{x}^k) to get the following.

φ(xk+1)φ(xˉk)f(xˉk)T(xk+1xˉk)\varphi(x^{k+1})-\varphi(\bar{x}^k)\geq-\nabla f(\bar{x}^k)^T(x^{k+1}-\bar{x}^k)

Plugging this inequality into the objective gap, together with the descent lemma from the LL-smoothness of ff, lets us simplify as follows.

ψ(xk+1)ψˉ=f(xk+1)f(xˉk)+φ(xk+1)φ(xˉk)ψ(xk+1)ψˉf(xk+1)f(xˉk)f(xˉk)T(xk+1xˉk)ψ(xk+1)ψˉL2xk+1xˉk2\begin{gather*} \psi(x^{k+1})-\bar{\psi}=f(x^{k+1})-f(\bar{x}^k)+\varphi(x^{k+1})-\varphi(\bar{x}^k)\\ \psi(x^{k+1})-\bar{\psi}\leq f(x^{k+1})-f(\bar{x}^k)-\nabla f(\bar{x}^k)^T(x^{k+1}-\bar{x}^k)\\ \psi(x^{k+1})-\bar{\psi}\leq\frac{L}{2}\|x^{k+1}-\bar{x}^k\|^2 \end{gather*}
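
For reference, the descent lemma invoked in the last step is the standard quadratic upper bound implied by the LL-Lipschitz continuity of f\nabla f:

f(x^{k+1})\leq f(\bar{x}^k)+\nabla f(\bar{x}^k)^T(x^{k+1}-\bar{x}^k)+\frac{L}{2}\|x^{k+1}-\bar{x}^k\|^2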

To bound xk+1xˉk\|x^{k+1}-\bar{x}^k\|, we can use the triangle inequality, our definition of dkd^k, and the error bound xkxˉkτr(xk)\|x^k-\bar{x}^k\|\leq \tau\|r(x^k)\|.

xk+1xˉkxk+1xk+xkxˉkxk+1xˉkdk+xkxˉkxk+1xˉkdk+τr(xk)\begin{gather*} \|x^{k+1}-\bar{x}^k\|\leq\|x^{k+1}-x^k\|+\|x^k-\bar{x}^k\|\\ \|x^{k+1}-\bar{x}^k\|\leq\|d^k\|+\|x^k-\bar{x}^k\|\\ \|x^{k+1}-\bar{x}^k\|\leq\|d^k\|+\tau\|r(x^k)\| \end{gather*}

From the second inequality of the first lemma of this proof, we know that \|r(x^k)\|\leq C_r\|d^k\| with Cr=m+1m(M+μk)C_r=\frac{m+1}{m}(M+\mu_k), which is uniformly bounded since {μk}\{\mu_k\} is bounded; defining CL=L2(1+τCr)2C_L=\frac{L}{2}(1+\tau C_r)^2 for simplicity, we can bound the previous optimality gap.

xk+1xˉkdk+τCrdk=(1+τCr)dkψ(xk+1)ψˉL2(1+τCr)2dk2=CLdk2\begin{gather*} \|x^{k+1}-\bar{x}^k\|\leq\|d^k\|+\tau C_r\|d^k\|=(1+\tau C_r)\|d^k\|\\ \psi(x^{k+1})-\bar{\psi}\leq\frac{L}{2}(1+\tau C_r)^2\|d^k\|^2=C_L\|d^k\|^2 \end{gather*}

To make this bound meaningful, we need to replace the dkd^k term. For any successful step we know aredkc1predk\text{ared}_k\geq c_1\text{pred}_k holds, which combined with predkm2dk2\text{pred}_k\geq\frac{m}{2}\|d^k\|^2 from the previous lemma gives the following (which also holds trivially for unsuccessful steps).

ψ(xk)ψ(xk+1)c1m2dk2dk22c1m(ψ(xk)ψ(xk+1))\begin{gather*} \psi(x^k)-\psi(x^{k+1})\geq\frac{c_1m}{2}\|d^k\|^2\\ \|d^k\|^2\leq\frac{2}{c_1m}(\psi(x^k)-\psi(x^{k+1})) \end{gather*}

We can plug this into the previous suboptimality gap bound we had to get the desired convergence result by defining K=2CLc1mK=\frac{2C_L}{c_1m}. If we define θ=K1+K\theta=\frac{K}{1+K}, we are guaranteed θ(0,1)\theta\in(0,1) since K>0K>0, proving the convergence result shown.

ψ(xk+1)ψˉCL[2c1m(ψ(xk)ψ(xk+1))](1+K)(ψ(xk+1)ψˉ)K(ψ(xk)ψˉ)ψ(xk+1)ψˉK1+K(ψ(xk)ψˉ)\begin{gather*} \psi(x^{k+1})-\bar{\psi}\leq C_L\left[\frac{2}{c_1m}(\psi(x^k)-\psi(x^{k+1}))\right]\\ (1+K)(\psi(x^{k+1})-\bar{\psi})\leq K(\psi(x^k)-\bar{\psi})\\ \psi(x^{k+1})-\bar{\psi}\leq\frac{K}{1+K}(\psi(x^k)-\bar{\psi}) \end{gather*}

Finally, to prove convergence to a single point xˉ\bar{x}, we use the bound on dk2\|d^k\|^2. Since \psi(x^{k+1})\geq\bar{\psi}, the decrease \psi(x^k)-\psi(x^{k+1}) is at most the remaining gap \psi(x^k)-\bar{\psi}, so we can simplify.

xk+1xk=dk2c1m(ψ(xk)ψ(xk+1))dk2c1mψ(xk)ψˉ\begin{gather*} \|x^{k+1}-x^{k}\|=\|d^k\|\leq\sqrt{\frac{2}{c_1m}(\psi(x^k)-\psi(x^{k+1}))}\\ \|d^k\|\leq\sqrt{\frac{2}{c_1m}}\sqrt{\psi(x^k)-\bar{\psi}} \end{gather*}

Since the gap inside the square root shrinks at rate θ\theta, the bound on dk\|d^k\| shrinks at rate θ\sqrt{\theta}, and since 0<θ<10<\theta<1 we also have θ<1\sqrt{\theta}<1. The sum of all dkd^k is therefore dominated by a geometric series with common ratio less than 11, which converges to a finite value.

k=0xk+1xk=k=0dk<\sum^\infty_{k=0}\|x^{k+1}-x^k\|=\sum^\infty_{k=0}\|d^k\|<\infty
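
Concretely, if k_0 is an index from which the Q-linear rate holds, the geometric decay of the gap gives the explicit bound

\sum_{k\geq k_0}\|d^k\|\leq\sqrt{\frac{2}{c_1m}}\sqrt{\psi(x^{k_0})-\bar{\psi}}\sum_{j\geq0}\left(\sqrt{\theta}\right)^{j}=\frac{\sqrt{\frac{2}{c_1m}\left(\psi(x^{k_0})-\bar{\psi}\right)}}{1-\sqrt{\theta}}<\infty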

This means that {xk}\{x^k\} is a Cauchy sequence, which must converge to a single limit point xˉ\bar{x}, completing the proof.