Frank-Wolfe and Convergence


This is the third of the quick summaries for the lab review. Going to cover Projected Gradient Descent for some background, Frank-Wolfe for some modern techniques, and their convergence on well-behaved functions. Also going to cover some new-ish convergence proofs for Frank-Wolfe on non-convex settings.

Prerequisites:

I’m going to assume that properties like convexity, L-smoothness, and standard Gradient Descent terminology are known, however I am going to prove one property of L-smooth functions. Any L-smooth function is upper bounded by some quadratic (shown below for any x,yRnx,y\in\mathbb{R}^n).

f(y)f(x)+<f(x),yx>+L2yx2f(y)\leq f(x)+\left<\nabla f(x),y-x\right>+\frac{L}{2}\|y-x\|^2

For the proof, we can first define the distance f(y)f(x)f(y)-f(x) with the line integral of the segment of the function between the two points, which is derived from some basic calculus properties.

g(1)=g(0)+01g(t)dtϕ(t)=x+t(yx)g(t)=f(ϕ(t))g=f(ϕ(t))ϕ(t)ϕ(t)=yxg(t)=<f(x+t(yx)),yx>f(y)=f(x)+01<f(x+t(yx)),yx>dt\begin{gather*} g(1)=g(0)+\int^1_0g^\prime(t)dt\quad \phi(t)=x+t(y-x)\quad g(t)=f(\phi(t))\\ g^\prime=\nabla f(\phi(t))\cdot\phi^\prime(t)\quad \phi^\prime(t)=y-x\\ g^\prime(t)=\left<\nabla f(x+t(y-x)),y-x\right>\\ f(y)=f(x)+\int^1_0\left<\nabla f(x+t(y-x)),y-x\right>dt \end{gather*}

We can then add and subtract <f(x),yx>\left<\nabla f(x),y-x\right> inside the integral to get it into a form that works well with the L-smooth property.

f(y)=f(x)+<f(x),yx>+01<f(x+t(yx))f(x),yx>dtf(y)=f(x)+\left<\nabla f(x),y-x\right>+\int^1_0\left<\nabla f(x+t(y-x))-\nabla f(x),y-x\right>dt

After this we can now use the Cauchy-Schwarz Inequality and the L-smoothness property to upper bound the integral.

01<f(x+t(yx))f(x),yx>dt ⁣01f(x+t(yx))f(x)yxdtf(x+t(yx))f(x)Lx+t(yx)x=Ltyx01(Ltyx)yxdt=Lyx201tdt=L2yx2\begin{gather*} \int^1_0\left<\nabla f(x+t(y-x))-\nabla f(x),y-x\right>dt\leq\dots\\\dots\int^1_0\|\nabla f(x+t(y-x))-\nabla f(x)\|\cdot\|y-x\|dt\\ \|\nabla f(x+t(y-x))-\nabla f(x)\|\leq L\|x+t(y-x)-x\|=Lt\|y-x\|\\ \int^1_0(Lt\|y-x\|)\cdot\|y-x\|dt=L\|y-x\|^2\int^1_0tdt=\frac{L}{2}\|y-x\|^2 \end{gather*}

Using this final bound on the first definition gives the desired property.

01<f(x+t(yx))f(x),yx>dtL2yx2f(y)f(x)+<f(x),yx>+L2yx2\begin{gather*} \int^1_0\left<\nabla f(x+t(y-x))-\nabla f(x),y-x\right>dt\leq\frac{L}{2}\|y-x\|^2\\ f(y)\leq f(x)+\left<\nabla f(x),y-x\right>+\frac{L}{2}\|y-x\|^2 \end{gather*}

Projected Gradient Descent:

Consider a constrained optimization problem over a set Ω\Omega defined below.

minxΩf(x)\min_{x\in\Omega}f(x)

One of the most basic ways to solve this problem is Projected Gradient Descent, which extends Gradient Descent by simply adding a projection step after the gradient step. The projection ΠΩ(x)\Pi_\Omega(x) returns the point within Ω\Omega that is closest to xx.

ΠΩ(x)=arg minyΩxyx(t+1)=ΠΩ(x(t)μf(x(t)))\begin{gather*} \Pi_\Omega(x)=\text{arg }\min_{y\in\Omega}\|x-y\|\\ x^{(t+1)}=\Pi_\Omega(x^{(t)}-\mu\nabla f(x^{(t)})) \end{gather*}

Under settings where both f(x)f(x) and Ω\Omega are convex, this method takes good advantage of the benefits that gradient descent gives.

PGD Non-Increasing Iterates:

To properly evaluate the convergence properties of the method, we first need to prove the guarantee that under an L-Smooth function, a convex Ω\Omega, and a step-size μ=1L\mu=\frac{1}{L}, then we know that the iterates x(t)x^{(t)} are non-increasing.

f(x(t))f(x(t1))f(x^{(t)})\leq f(x^{(t-1)})

For the proof, we can first see that the algorithm can be redefined as a minimization subproblem for each iterate by using the definition of the projection. Convexity being required for the set not only guarantees that the solution to this subproblem is unique, but also is a general requirement for this property to be meaningful, since it’s required to guarantee that the algorithm will reach the true minimum at all. We scale the term by L2\frac{L}{2} for simplicity later since it does not change the final minimum.

x(t+1)=argminzΩ{L2z(x(t)1Lf(x(t)))2}x^{(t+1)}=\text{arg}\min_{z\in\Omega}\left\{\frac{L}{2}\left\|z-\left(x^{(t)}-\frac{1}{L}\nabla f(x^{(t)})\right)\right\|^2\right\}

Using the quadratic upper bounding property from the fact that ff is L-smooth, we can define a surrogate function Qx(z)Q_x(z) below for use later. This definition intuitively means that f(x(t+1))Qx(t)(x(t+1))f(x^{(t+1)})\leq Q_{x^{(t)}}(x^{(t+1)}) and f(x(t))=Qx(t)(x(t))f(x^{(t)})=Q_{x^{(t)}}(x^{(t)}).

Qx(z)=f(x)+<f(x),zx>+L2zx2Q_x(z)=f(x)+\left<\nabla f(x),z-x\right>+\frac{L}{2}\|z-x\|^2

Expanding the square norm of the original minimization problem and isolating terms with zz (which are the only ones that impact the minimization) gives <f(x),zx>+L2zx2\left<\nabla f(x),z-x\right>+\frac{L}{2}\|z-x\|^2, which is equivalent to our surrogate Qx(z)Q_x(z) in the minimization. This means we can redefine the subproblem with the surrogate.

x(t+1)=argminxΩQx(t)(z)x^{(t+1)}=\text{arg}\min_{x\in\Omega}Q_{x^{(t)}}(z)

By definition, we know that f(x(t+1))Qx(t)(x(t+1))f(x^{(t+1)})\leq Q_{x^{(t)}}(x^{(t+1)}) and f(x(t))=Qx(t)(x(t))f(x^{(t)})=Q_{x^{(t)}}(x^{(t)}). As well, since x(t+1)x^{(t+1)} is the minimizer of the subproblem over all Ω\Omega and x(t)x^{(t)} is within Ω\Omega, we know Qx(t)(x(t+1))Qx(t)(x(t))Q_{x^{(t)}}(x^{(t+1)})\leq Q_{x^{(t)}}(x^{(t)}). Combining these together gives our desired result.

f(x(t+1))Qx(t)(x(t+1))Qx(t)(x(t))=f(x(t))f(x^{(t+1)})\leq Q_{x^{(t)}}(x^{(t+1)})\leq Q_{x^{(t)}}(x^{(t)})=f(x^{(t)})

PGD Convex Convergence:

Now that we’ve proved that iterates never get worse, we can derive what convergence rate PGD has under for well-behaved problems. If ff is an L-Smooth Convex Function over a Convex Set Ω\Omega and μ=12\mu=\frac{1}{2}, we can prove that PGD has a convergence rate of O(1/k)\mathcal{O}(1/k).

f(x(k))f(x)Lx(0)x22kf(x^{(k)})-f(x^*)\leq\frac{L\|x^{(0)}-x^*\|^2}{2k}

The proof is a standard one in optimization, simply using the different properties to derive some inequalities, combining them, and then using the derived properties to prove some global behavior.

From L-smoothness and convexity we know that the following hold by definition.

f(x(t+1))f(x(t))+<f(x(t)),x(t+1)x(t)>+L2x(t+1)x(t)2f(x)f(x(t))+<f(x(t)),xx(t)>\begin{gather*} f(x^{(t+1)})\leq f(x^{(t)})+\left<\nabla f(x^{(t)}),x^{(t+1)}-x^{(t)}\right>+\|\frac{L}{2}x^{(t+1)}-x^{(t)}\|^2\\ f(x^*)\geq f(x^{(t)})+\left<\nabla f(x^{(t)}),x^*-x^{(t)}\right> \end{gather*}

We can rearrange the convexity inequality to replkace f(x(t))f(x^{(t)}) in the L-smoothness inequality and the remove like terms to get the following.

f(x(t+1))f(x)+<f(x(t)),x(t+1)x>+L2x(t+1)x(t)2f(x^{(t+1)})\leq f(x^*)+\left<\nabla f(x^{(t)}),x^{(t+1)}-x^*\right>+\frac{L}{2}\|x^{(t+1)}-x^{(t)}\|^2

Since Ω\Omega is a convex set, we can use the a property of projections on convex sets that guarantees that for all yy <pz,yp>0\left<p-z,y-p\right>\geq 0 with any zz and its projection pp. We can set z=x(t)1Lf(x(t))z=x^{(t)}-\frac{1}{L}\nabla f(x^{(t)}) and p=x(t+1)p=x^{(t+1)} to get the inequality in terms of the algorithm, and then rearrange and set y=xy=x^* to add some more meaning.

<x(t+1)x(t)+1Lf(x(t)),yx(t+1)><f(x(t),x(t+1)x)>L<x(t)x(t+1),x(t+1)x>\begin{gather*} \left<x^{(t+1)}-x^{(t)}+\frac{1}{L}\nabla f(x^{(t)}),y-x^{(t+1)}\right>\\ \left<\nabla f(x^{(t)},x^{(t+1)}-x^*)\right>\leq L\left<x^{(t)}-x^{(t+1)},x^{(t+1)}-x^*\right> \end{gather*}

Using this we can replace the middle term of the convexity and L-smoothness inequality with this new projection inequality.

f(x(t+1))f(x)L2(2<x(t)x(t+1),x(t+1)x>+x(t+1)x(t)2)f(x^{(t+1)})-f(x^*)\leq \frac{L}{2}\left(2\left<x^{(t)}-x^{(t+1)},x^{(t+1)}-x^*\right>+\|x^{(t+1)}-x^{(t)}\|^2\right)

Using the property of squared norms that a+b2=a2+2<a,b>+b2\|a+b\|^2=\|a\|^2+2\left<a,b\right>+\|b\|^2 and rearranging to 2<a,b>+a2=a+b2b22\left<a,b\right>+\|a\|^2=\|a+b\|^2-\|b\|^2 we can replace the right-hand side of the inequality with another simpler term. Setting a=x(t)x(t+1)a=x^{(t)}-x^{(t+1)} and b=x(t+1)xb=x^{(t+1)}-x^* gives us the following.

f(x(t+1))f(x)L2(x(t)x2x(t+1)x2)f(x^{(t+1)})-f(x^*)\leq\frac{L}{2}\left(\|x^{(t)}-x^*\|^2-\|x^{(t+1)}-x^*\|^2\right)

We can see that adding the inequality of f(x(t))f(x)f(x^{(t)})-f(x^*) and that of f(x(t+1))f(x^{(t+1)}) would result in the terms x(t+1)x2\|x^{(t+1)}-x^*\|^2 in both cancelling out. Summing over kk iterations gives us a telescoping sum since this continues, giving us the following.

t=0k1f(x(t+1))f(x)L2(x(0)x2x(k)x2)\sum^{k-1}_{t=0}f(x^{(t+1)})-f(x^*)\leq\frac{L}{2}\left(\|x^{(0)}-x^*\|^2-\|x^{(k)}-x^*\|^2\right)

Since we know that all of the iterates are non-increasing, we can simplify the sum on the left-hand side to be only in terms of our final iterate, giving us our desired result.

t=0k+1(f(x(t+1))f(x))t=0k1(f(x(k))f(x))=k(f(x(k))f(x))k(f(x(k))f(x))L2x(0)x2\begin{gather*} \sum^{k+1}_{t=0}(f(x^{(t+1)})-f(x^*))\geq\sum^{k-1}_{t=0}(f(x^{(k)})-f(x^*))=k(f(x^{(k)})-f(x^*))\\ k(f(x^{(k)})-f(x^*))\leq\frac{L}{2}\|x^{(0)}-x^*\|^2 \end{gather*}

Frank-Wolfe:

One problem with Projected Gradient Descent is that for more complex settings the projection step itself can be very expensive since it requires solving a quadratic optimization problem. Frank-Wolfe, also known as Conditional Gradient, is an optimization method over the same type of constrained problem that replaces this quadratic subproblem with a linear subproblem at each iteration.

Frank-Wolfe works off of the idea of linearizing and minimizing, where a linear approximation of the function is first created and then minimized over. The most basic form of this is the Linear Minimization Oracle, which chooses an s(k)Ωs^{(k)}\in\Omega that minimizes over the current iterate’s gradient’s direction. The iterate is then updated by moving towards the direction of the chosen s(k)s^{(k)} with step-size γ(k)\gamma^{(k)}.

s(k)=argminsΩsTf(x(k))x(k+1)=x(k)+γ(k)(s(k)x(k))\begin{gather*} s^{(k)}=\text{arg}\min_{s\in\Omega}s^T\nabla f(x^{(k)})\\ x^{(k+1)}=x^{(k)}+\gamma^{(k)}(s^{(k)}-x^{(k)}) \end{gather*}

By definition we can see that x(k+1)=x(k)+γ(k)s(k)γ(k)x(k)=(1γ(k))x(k)+γ(k)s(k)x^{(k+1)}=x^{(k)}+\gamma^{(k)} s^{(k)}-\gamma^{(k)}x^{(k)}=(1-\gamma^{(k)})x^{(k)}+\gamma^{(k)}s^{(k)}, which means that the new iterate is derived from a convex combination of two points within Ω\Omega, thus the new iterate is also in Ω\Omega as long as Ω\Omega is a convex set. This presents a projection-free method of constrained optimization that is much quicker for each iteration when compared to projected gradient descent in complex problems.

Frank-Wolfe Convex Convergence:

Along with being faster per iteration, we can also prove that the method has the same convergence rate under well-behaved functions as projected gradient descent. On a L-Smooth Convex function over a Compact Convex Set Ω\Omega, Frank-Wolfe with the LMO also has a convergence rate of O(1/k)O(1/k) if the stepsize is chosen as γ(k)=2k+2\gamma^{(k)}=\frac{2}{k+2}. The convergence rate is bounded by the Curvature Constant CfC_f, which conveys how much ff deviates from the linear approximation made for each iteration.

f(x(k))f(x)2Cfk+2Cf=supx,sΩ,γ[0,1],y=x+γ(sx)2γ2(f(y)f(x)<yx,f(x)>)\begin{gather*} f(x^{(k)})-f(x^*)\leq\frac{2C_f}{k+2}\\ C_f=\sup_{x,s\in\Omega,\gamma\in[0,1],y=x+\gamma(s-x)}\frac{2}{\gamma^2}(f(y)-f(x)-\left<y-x,\nabla f(x)\right>) \end{gather*}

The proof follows a very similar structure as before, where all of the properties are used to bound certain quantities, which are then used to derive a recurrence relation which allows us to derive our desired result.

We can first prove that CfC_f is finite with the L-smoothness property so that we can make statements using it. We can use the quadratic lower bounding property to show that the following holds with y=x+γ(sx)y=x+\gamma(s-x).

f(x+γ(sx))f(x)<f(x),γ(sx)>L2γ(sx)2f(y)f(x)<f(x),yx>Lγ22sx2\begin{gather*} f(x+\gamma(s-x))-f(x)-\left<\nabla f(x),\gamma(s-x)\right>\leq\frac{L}{2}\|\gamma(s-x)\|^2\\ f(y)-f(x)-\left<\nabla f(x),y-x\right>\leq\frac{L\gamma^2}{2}\|s-x\|^2 \end{gather*}

We can then substitute this inequality into our definition of CfC_f to upper bound it.

Cfsupx,sΩ2γ2(Lγ22sx2)Cfsupx,sΩLsx2\begin{gather*} C_f\leq\sup_{x,s\in\Omega}\frac{2}{\gamma^2}\left(\frac{L\gamma^2}{2}\|s-x\|^2\right)\\ C_f\leq\sup_{x,s\in\Omega}L\|s-x\|^2 \end{gather*}

Since Ω\Omega is compact, we know that sx2\|s-x\|^2 is finite, so we know that CfC_f is finite.

We can then rearrange the definition of CfC_f as an upper bound, using our previously defined y=x+γ(sx)y=x+\gamma(s-x), and rewrite it in terms of our iterates.

Cf2γ2(f(y)f(x)<yx,f(x)>)f(x+γ(sx))f(x)+γ<sx,f(x)>+γ22Cff(x(k+1))f(x(k))+γ(k)<s(k)x(k),f(x(k))>+γ(k)22Cf\begin{gather*} C_f\geq \frac{2}{\gamma^2}(f(y)-f(x)-\left<y-x,\nabla f(x)\right>)\\ f(x+\gamma(s-x))\leq f(x)+\gamma\left<s-x,\nabla f(x)\right>+\frac{\gamma^2}{2}C_f\\ f(x^{(k+1)})\leq f(x^{(k)})+\gamma^{(k)}\left<s^{(k)}-x^{(k)},\nabla f(x^{(k)})\right>+\frac{\gamma^{(k)2}}{2}C_f \end{gather*}

For this to be meaningful, we need a relationship between s(k)s^{(k)} and the optimal point xx^*. Since s(k)s^{(k)} is defined as the minimum of <s,f(x(k))>\left<s,\nabla f(x^{(k)})\right>, we know that this is the upper bound of any point, including xx^*.

<s(k),f(x(k))><x,f(x(k))><s(k)x(k),f(x(k))><xx(k),f(x(k))>\begin{gather*} \left<s^{(k)},\nabla f(x^{(k)})\right>\leq\left<x^*,\nabla f(x^{(k)})\right>\\ \left<s^{(k)}-x^{(k)},\nabla f(x^{(k)})\right>\leq\left<x^*-x^{(k)},\nabla f(x^{(k)})\right> \end{gather*}

Since ff is convex we can apply the definition to the above inequality, with h(k)=f(x(k)f(x))h^{(k)}=f(x^{(k)}-f(x^*)) being used to defin the optimality gap for simplicity.

<xx(k),f(x(k))>f(x)f(x(k))=h(k)<s(k)x(k),f(x(k))>h(k)\begin{gather*} \left<x^*-x^{(k)},\nabla f(x^{(k)})\right>\leq f(x^*)-f(x^{(k)})=-h^{(k)}\\ \left<s^{(k)}-x^{(k)},\nabla f(x^{(k)})\right>\leq -h^{(k)} \end{gather*}

We can now use this to upper bound our previous inequality, deriving the following.

f(x(k+1))f(x(k))γ(k)h(k)+γ(k)22Cfh(k+1)h(k)(1γ(k))+γ(k)22Cf\begin{gather*} f(x^{(k+1)})\leq f(x^{(k)})-\gamma^{(k)}h^{(k)}+\frac{\gamma^{(k)2}}{2}C_f\\ h^{(k+1)}\leq h^{(k)}(1-\gamma^{(k)})+\frac{\gamma^{(k)2}}{2}C_f \end{gather*}

Now that we have a recurrence relation, we can use induction on it to prove h(k)2Cfk+2h^{(k)}\leq\frac{2C_f}{k+2}, deriving our desired result.

h(k+1)2Cfk+2(12k+2)+2Cf(k+2)2h(k+1)2Cfk+2(kk+2)+2Cf(k+2)2=2Cf(k(k+2)2+1(k+2)2)h(k+1)2Cfk+1(k+2)2=2Cf(k+1)(k+3)(k+2)2(k+3)h(k+1)2Cfk+3\begin{gather*} h^{(k+1)}\leq\frac{2C_f}{k+2}\left(1-\frac{2}{k+2}\right)+\frac{2C_f}{(k+2)^2}\\ h^{(k+1)}\leq\frac{2C_f}{k+2}\left(\frac{k}{k+2}\right)+\frac{2C_f}{(k+2)^2}=2C_f\left(\frac{k}{(k+2)^2}+\frac{1}{(k+2)^2}\right)\\ h^{(k+1)}\leq 2C_f\frac{k+1}{(k+2)^2}=2C_f\frac{(k+1)(k+3)}{(k+2)^2(k+3)}\\ h^{(k+1)}\leq\frac{2C_f}{k+3} \end{gather*}

Frank-Wolfe Non-Convex Convergence:

A convergence rate for Frank-Wolfe has also been derived in non-convex settings. Since we’re now dealing with non-convex functions, we need to describe convergence in terms other than our original optimality gap. Instead, we use the Frank-Wolfe gap gtg_t. This describes how much the objective can be decreased by moving towards an element in the feasible set, which allows gt=0g_t=0 to act as an analogous term to f(x)=0\|\nabla f(x)\|=0 in our setting.

gt=maxsΩ<sx(t),f(x(t))>g_t=\max_{s\in\Omega}\left<s-x^{(t)},-\nabla f(x^{(t)})\right>

Since we’re in a non-convex setting, we don’t have the guarantee that the iterates always improve, so we need to denote the convergence in terms of the minimal Frank-Wolfe gap g~t=min0kt\tilde{g}_t=\min_{0\leq k\leq t} after tt iterations. The minimal Frank-Wolfe gap on an L-Smooth, potentially non-convex, ff over a colmpact convex set Ω\Omega after tt iterations can be bounded by O(1t)O(\frac{1}{\sqrt{t}}) as shown below.

g~tmax{2h0,C}t+1t0h0=f(x(0))minxMf(x)\begin{gather*} \tilde{g}_t\leq\frac{\max\{2h_0,C\}}{\sqrt{t+1}}\quad t\geq 0\\ h_0=f(x^{(0)})-\min_{x\in\mathcal{M}}f(x) \end{gather*}

h0h_0 denotes the initial suboptimality and CC is some constant that satisfies CCfC\geq C_f that is determined by our stepsize. If a line-search algorithm is used to find the optimal step-size, then C=CfC=C_f. If an adaptive stepsize is chosen min{gtC,1}\min\{\frac{g_t}{C},1\} (will be derived later), then CCfC\geq C_f.

The proof follows the same general form of using the properties of ff to derive a recursive bound and then using it to prove global convergence properties. We can first derive a more practical form of the L-Smooth descent lemma using the definition of CfC_f, which we know is finite from ff being L-Smooth and Ω\Omega being compact.

2γ2[f(y)f(x)<f(x),yx>]Cff(y)f(x)+<f(x),yx>+γ22Cf\begin{gather*} \frac{2}{\gamma^2}[f(y)-f(x)-\left<\nabla f(x),y-x\right>]\leq C_f\\ f(y)\leq f(x)+\left<\nabla f(x),y-x\right>+\frac{\gamma^2}{2}C_f \end{gather*}

We can then define xγ=x(t)+γdtx_\gamma=x^{(t)}+\gamma d_t (the potential next step without a specific γ\gamma) and dt=s(t)x(t)d_t=s^{(t)}-x^{(t)} (the direction of the step) to get the bound in terms of our algorithm. We can also further simplify by using the definition of gtg_t as an upper bound of the middle term and replacing CfC_f with our previously defined CC.

f(xγ)f(x(t))+γ<f(x(t)),dt>+γ22Cfγ[0,1]f(xγ)f(x(t))γgt+γ22C\begin{gather*} f(x_\gamma)\leq f(x^{(t)})+\gamma\left<\nabla f(x^{(t)}),d_t\right>+\frac{\gamma^2}{2}C_f\quad\forall\gamma\in[0,1]\\ f(x_\gamma)\leq f(x^{(t)})-\gamma g_t+\frac{\gamma^2}{2}C \end{gather*}

To minimize this bound on f(xγ)f(x_\gamma), we can choose a stepsize γ=min{gtC,1}\gamma^*=\min\{\frac{g_t}{C},1\}. This breaks the decrease of the objective at each step into a piecewise based on whether gtCg_t\leq C, leading to a stepsize of gtC\frac{g_t}{C}, or gt>Cg_t>C, leading to a stepsize of 11. We can also replace the piecewise with an indicator function as shown below.

f(x(t+1))f(x(t)){gt22cif gtCgtC2if gt>Cf(x(t+1))f(x(t))min{gt22C,gtC21{gt>C}}\begin{gather*} f(x^{(t+1)})\leq f(x^{(t)})-\begin{cases}\frac{g^2_t}{2c}&\text{if }g_t\leq C\\g_t-\frac{C}{2}&\text{if }g_t>C\end{cases}\\ f(x^{(t+1)})\leq f(x^{(t)})-\min\left\{\frac{g_t^2}{2C},g_t-\frac{C}{2}\mathbb{1}_{\{g_t>C\}}\right\} \end{gather*}

In cases where gt>Cg_t>C, we have min{gt22C,gtC2}\min\left\{\frac{g_t^2}{2C},g_t-\frac{C}{2}\right\}. We can define the difference between the two as a function f(gt)=gt22C(gtC2)=gt22Cgt+C22C=(gtC)22Cf(g_t)=\frac{g^2_t}{2C}-(g_t-\frac{C}{2})=\frac{g^2_t-2Cg_t+C^2}{2C}=\frac{(g_t-C)^2}{2C}. Since we know that (gt2C)20(g_t^2-C)^2\geq 0, we know that the difference is positive so gt22CgtC2\frac{g_t^2}{2C}\geq g_t-\frac{C}{2}, thus the correct min is chosen. In cases where gtCg_t\leq C, we have min{gt22C,gt}\min\left\{\frac{g_t^2}{2C},g_t\right\}. We can first divide by gtg_t, which we know is positive, and get min{g22C,1}\min\{\frac{g_2}{2C},1\}. Since gtCg_t\leq C, we know gt2C12\frac{g_t}{2C}\leq \frac{1}{2}, thus the correct min is chosen.

Since this forms a recursive relationship, we can somver over tt iterations since it forms a telescoping sum. We can then simplify by replacing the sum over decreaes with the minimal gap g~t\tilde{g}_t since we know that each term is always positive, so replacing the subtracted terms with a smaller subtracted term keeps the \leq bound.

f(x(t+1))f(x(0))k=0tmin{gk22C,gkC21{gk>C}}f(x(t+1))f(x(0))(t+1)min{g~t22C,g~tC21{g~t>C}}\begin{gather*} f(x^{(t+1)})\leq f(x^{(0)})-\sum^t_{k=0}\min\left\{\frac{g_k^2}{2C},g_k-\frac{C}{2}\mathbb{1}_{\{g_k>C\}}\right\}\\ f(x^{(t+1)})\leq f(x^{(0)})-(t+1)\min\left\{\frac{\tilde{g}_t^2}{2C},\tilde{g}_t-\frac{C}{2}\mathbb{1}_{\{\tilde{g}_t>C\}}\right\} \end{gather*}

This means we can bound g~t\tilde{g}_t with these, specifically using the fact that f(x(0))f(x(t+1))f(x(0))f(x)=h0f(x^{(0)})-f(x^{(t+1)})\leq f(x^{(0)})-f(x^*)=h_0.

g~t{h0t+1+C2g~t>C2h0Ct+1g~tC\tilde{g}_t\leq\begin{cases}\frac{h_0}{t+1}+\frac{C}{2}&\tilde{g}_t>C\\\sqrt{\frac{2h_0C}{t+1}}&\tilde{g}_t\leq C\end{cases}

Since we’re trying to bound g~t\tilde{g}_t, we would like to derive a new set of conditions for the piecewise to get rid of the circular logic. While an equivalent condition is hard to find, we can find a set of conditions that guarantees that the bounds hold. By simple algebraic manipulation, we can know that g~t>C\tilde{g}_t>C implies t+12h0Ct+1\leq \frac{2h_0}{C}. If g~tC\tilde{g}_t\leq C, so any g~t\tilde{g}_t has to be cateogized in the first case. If we have some g~tC\tilde{g}_t\leq C and t+12h0Ct+1\leq\frac{2h_0}{C}, we can show that the first bound also holds.

g~tCC2+C2=h02h0C+C2h0t+1+C2\tilde{g}_t\leq C\leq \frac{C}{2}+\frac{C}{2}= \frac{h_0}{\frac{2h_0}{C}}+\frac{C}{2}\leq\frac{h_0}{t+1}+\frac{C}{2}

Thus we can rewrite the conditions based on this.

g~t{h0t+1+C2if t+12h0C2h0Ct+1otherwise\tilde{g}_t\leq\begin{cases}\frac{h_0}{t+1}+\frac{C}{2}&\text{if }t+1\leq\frac{2h_0}{C}\\\sqrt{\frac{2h_0C}{t+1}}&\text{otherwise}\end{cases}

Since we have two different convergence rates based on these conditions, we need to get them into a similar form to provide an overall convergence rate. We can first manipulate the first bound to get it into a workable form.

h0t+1+C2=h0t+1(1t+1+C2h0t+1)h0t+1+C2h0t+1(1t+1+C2h0)\begin{gather*} \frac{h_0}{t+1}+\frac{C}{2}=\frac{h_0}{\sqrt{t+1}}\left(\frac{1}{\sqrt{t+1}}+\frac{C}{2h_0}\sqrt{t+1}\right)\\ \frac{h_0}{t+1}+\frac{C}{2}\leq\frac{h_0}{\sqrt{t+1}}\left(\frac{1}{\sqrt{t+1}}+\sqrt{\frac{C}{2h_0}}\right) \end{gather*}

In order to work with this, we need to prove that h0>C2h_0>\frac{C}{2} if t+12h0Ct+1\leq\frac{2h_0}{C}. Consider the case where the condition holds and h0C2h_0\leq\frac{C}{2}. This would lead to t+12h0C2CC2=1t+1\leq \frac{2h_0}{C}\leq \frac{2}{C}\cdot\frac{C}{2}=1, which leads to a contradiction since we know that t0t\geq 0.

Using this fact, we can know that C21h0<1\frac{C}{2}\cdot\frac{1}{h_0}<1, so we can substitute and simplify, deriving our desired result.

h0t+1+C2h0t+1(1t+1+1)2h0t+1\frac{h_0}{t+1}+\frac{C}{2}\leq\frac{h_0}{\sqrt{t+1}}\left(\frac{1}{\sqrt{t+1}}+1\right)\leq\frac{2h_0}{\sqrt{t+1}}

This leaves us with the following two bounds.

g~t{2h0t+1if t+12h0C2h0Ct+1otherwise\tilde{g}_t\leq\begin{cases}\frac{2h_0}{\sqrt{t+1}}&\text{if }t+1\leq\frac{2h_0}{C}\\\frac{\sqrt{2h_0C}}{\sqrt{t+1}}&\text{otherwise}\end{cases}

To further simplify, we can show that both bounds can have their numerators be replaced with max{2h0,C}\max\{2h_0,C\}. In the first case, since we know that h0>C2h_0>\frac{C}{2}, we know that 2h0>C2h_0>C, which derives our desired numerator 2h0=max{2h0,C}2h_0=\max\{2h_0,C\}. For the second, we know that since both h0h_0 and CC are nonnegative we have 2h0Cmax{2h0,C}\sqrt{2h_0C}\leq\max\{2h_0,C\} (given some arbitrary a,b0a,b\geq 0 and that aba\geq b, we know that aba2=a\sqrt{ab}\leq\sqrt{a^2}=a). Replacing the numerators then combines the conditions and derives our final result, proving the theorem.

g~tmax{2h0,C}t+1\tilde{g}_t\leq\frac{\max \{2h_0,C\} }{\sqrt{t+1}}