This is the third of the quick summaries for the lab review. It covers Projected Gradient Descent for background, Frank-Wolfe as a more modern technique, and their convergence on well-behaved functions, along with some newer convergence proofs for Frank-Wolfe in non-convex settings.
Prerequisites:
I’m going to assume that properties like convexity, L-smoothness, and standard Gradient Descent terminology are known; however, I am going to prove one property of L-smooth functions: any L-smooth function is upper bounded by a quadratic, as shown below for any x, y ∈ ℝⁿ.
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)∥y − x∥²
For the proof, we can first express the difference f(y) − f(x) as a line integral of the gradient along the segment between the two points, which follows from the fundamental theorem of calculus: f(y) − f(x) = ∫₀¹ ⟨∇f(x + t(y − x)), y − x⟩ dt. Adding and subtracting ⟨∇f(x), y − x⟩ inside the integral and bounding the remainder with Cauchy–Schwarz and the Lipschitz property of ∇f then gives the quadratic term.
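As a sanity check, the bound can also be verified numerically. A minimal sketch, assuming f(x) = ½xᵀAx (so ∇f(x) = Ax and L is the largest eigenvalue of A); all names and the test setup are illustrative, not from the text:

```python
import numpy as np

# Numerically check f(y) <= f(x) + <grad f(x), y - x> + (L/2)||y - x||^2
# for f(x) = 0.5 x^T A x, which is L-smooth with L = lambda_max(A).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T                      # symmetric positive semi-definite
L = np.linalg.eigvalsh(A).max()  # smoothness constant

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    upper = f(x) + grad(x) @ (y - x) + (L / 2) * (y - x) @ (y - x)
    assert f(y) <= upper + 1e-9  # quadratic upper bound holds
```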
Consider a constrained optimization problem over a set Ω defined below.
min_{x∈Ω} f(x)
One of the most basic ways to solve this problem is Projected Gradient Descent, which extends Gradient Descent by simply adding a projection step after the gradient step. The projection ΠΩ(x) returns the point within Ω that is closest to x.
Under settings where both f(x) and Ω are convex, this method retains most of the guarantees that unconstrained gradient descent provides.
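As a concrete sketch, here is PGD assuming Ω is the Euclidean unit ball, chosen because its projection has a simple closed form; the objective and all names below are illustrative, not from the text:

```python
import numpy as np

def project_ball(x, r=1.0):
    """Projection onto the Euclidean ball of radius r: rescale if outside."""
    n = np.linalg.norm(x)
    return x if n <= r else (r / n) * x

def pgd(grad, x0, mu, steps=100, r=1.0):
    """Projected Gradient Descent: gradient step, then projection."""
    x = x0
    for _ in range(steps):
        x = project_ball(x - mu * grad(x), r)
    return x

# Minimize f(x) = ||x - c||^2 (L-smooth with L = 2) over the unit ball.
# Since c lies outside the ball, the optimum is c scaled onto the boundary.
c = np.array([2.0, 0.0])
x = pgd(lambda x: 2 * (x - c), np.zeros(2), mu=0.5)  # mu = 1/L
```

The projection is the only part that changes between feasible sets; for more complex Ω it becomes its own optimization problem, which is the cost Frank-Wolfe later avoids.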
PGD Non-Increasing Iterates:
To properly evaluate the convergence properties of the method, we first need to prove the guarantee that for an L-smooth function, a convex Ω, and a step-size μ = 1/L, the objective values f(x(t)) are non-increasing.
f(x(t))≤f(x(t−1))
For the proof, we can first see that each iterate of the algorithm can be redefined as a minimization subproblem by using the definition of the projection. Requiring the set to be convex not only guarantees that the solution to this subproblem is unique, but is also a general requirement for this property to be meaningful, since it is needed to guarantee that the algorithm can reach the true minimum at all. We scale the objective by L/2 for simplicity later, since a positive scaling does not change the minimizer.
x(t+1) = argmin_{z∈Ω} { (L/2)∥z − (x(t) − (1/L)∇f(x(t)))∥² }
Using the quadratic upper bounding property that follows from f being L-smooth, we can define a surrogate function Qx(z) below for use later. This definition intuitively means that f(x(t+1)) ≤ Qx(t)(x(t+1)) and f(x(t)) = Qx(t)(x(t)).
Qx(z) = f(x) + ⟨∇f(x), z − x⟩ + (L/2)∥z − x∥²
Expanding the squared norm in the original minimization problem and keeping only the terms involving z (which are the only ones that affect the minimizer) gives ⟨∇f(x), z − x⟩ + (L/2)∥z − x∥², which matches our surrogate Qx(z) up to constants independent of z. This means we can redefine the subproblem with the surrogate.
x(t+1) = argmin_{z∈Ω} Qx(t)(z)
By definition, we know that f(x(t+1))≤Qx(t)(x(t+1)) and f(x(t))=Qx(t)(x(t)). As well, since x(t+1) is the minimizer of the subproblem over all Ω and x(t) is within Ω, we know Qx(t)(x(t+1))≤Qx(t)(x(t)). Combining these together gives our desired result.
f(x(t+1))≤Qx(t)(x(t+1))≤Qx(t)(x(t))=f(x(t))
PGD Convex Convergence:
Now that we’ve proved that iterates never get worse, we can derive the convergence rate PGD achieves on well-behaved problems. If f is an L-smooth convex function over a convex set Ω and μ = 1/L, we can prove that PGD has a convergence rate of O(1/k).
f(x(k)) − f(x∗) ≤ (L/2k)∥x(0) − x∗∥²
The proof is a standard one in optimization, simply using the different properties to derive some inequalities, combining them, and then using the derived properties to prove some global behavior.
From L-smoothness and convexity we know that the following hold by definition.
f(x(t+1)) ≤ f(x(t)) + ⟨∇f(x(t)), x(t+1) − x(t)⟩ + (L/2)∥x(t+1) − x(t)∥²
f(x∗) ≥ f(x(t)) + ⟨∇f(x(t)), x∗ − x(t)⟩
Since Ω is a convex set, we can use a property of projections onto convex sets that guarantees ⟨p − z, y − p⟩ ≥ 0 for all y ∈ Ω, where p is the projection of any point z. We can set z = x(t) − (1/L)∇f(x(t)) and p = x(t+1) to get the inequality in terms of the algorithm, and then rearrange and set y = x∗ to add some more meaning.
Using the property of squared norms that ∥a+b∥2=∥a∥2+2⟨a,b⟩+∥b∥2 and rearranging to 2⟨a,b⟩+∥a∥2=∥a+b∥2−∥b∥2 we can replace the right-hand side of the inequality with another simpler term. Setting a=x(t)−x(t+1) and b=x(t+1)−x∗ gives us the following.
f(x(t+1)) − f(x∗) ≤ (L/2)(∥x(t) − x∗∥² − ∥x(t+1) − x∗∥²)
We can see that adding the inequality for f(x(t)) − f(x∗) to the one for f(x(t+1)) − f(x∗) causes the ∥x(t) − x∗∥² terms appearing in both to cancel. This continues at every step, so summing over k iterations gives a telescoping sum:
Σ_{t=1..k} (f(x(t)) − f(x∗)) ≤ (L/2)(∥x(0) − x∗∥² − ∥x(k) − x∗∥²) ≤ (L/2)∥x(0) − x∗∥²
Since we know that the objective values are non-increasing, every term in the sum is at least the final one, so the left-hand side is at least k(f(x(k)) − f(x∗)); dividing by k gives our desired result.
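The O(1/k) guarantee can be observed numerically. A small sketch, assuming a quadratic objective over the unit ball with μ = 1/L; the whole setup is illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)           # positive definite Hessian
L = np.linalg.eigvalsh(A).max()   # smoothness constant

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def project(x):
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

x_star = np.zeros(4)              # global minimizer of f, feasible, so optimal
x0 = project(rng.standard_normal(4))
x = x0.copy()
for k in range(1, 201):
    x = project(x - grad(x) / L)  # PGD step with mu = 1/L
    bound = L * (x0 - x_star) @ (x0 - x_star) / (2 * k)
    assert f(x) - f(x_star) <= bound + 1e-9  # the O(1/k) guarantee
```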
One problem with Projected Gradient Descent is that for more complex settings the projection step itself can be very expensive since it requires solving a quadratic optimization problem. Frank-Wolfe, also known as Conditional Gradient, is an optimization method over the same type of constrained problem that replaces this quadratic subproblem with a linear subproblem at each iteration.
Frank-Wolfe works off of the idea of linearizing and minimizing, where a linear approximation of the function is first created and then minimized over. The most basic form of this is the Linear Minimization Oracle (LMO), which chooses an s(k) ∈ Ω that minimizes the linearization of f at the current iterate: s(k) = argmin_{s∈Ω} ⟨s, ∇f(x(k))⟩. The iterate is then updated by moving towards the chosen s(k) with step-size γ(k): x(k+1) = x(k) + γ(k)(s(k) − x(k)).
By definition we can see that x(k+1) = x(k) + γ(k)s(k) − γ(k)x(k) = (1 − γ(k))x(k) + γ(k)s(k), which means the new iterate is a convex combination of two points within Ω, so the new iterate is also in Ω as long as Ω is a convex set. This gives a projection-free method of constrained optimization whose iterations are much cheaper than projected gradient descent's on complex problems.
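A minimal Frank-Wolfe sketch, assuming Ω is the unit ℓ1 ball, where the LMO has a closed form (the minimizing point is the signed vertex ±eᵢ at the largest-magnitude gradient coordinate); all names and the example objective are illustrative:

```python
import numpy as np

def lmo_l1(g, r=1.0):
    """LMO over the l1 ball: argmin_{||s||_1 <= r} <s, g> is a signed vertex."""
    i = np.argmax(np.abs(g))
    s = np.zeros_like(g)
    s[i] = -r * np.sign(g[i])
    return s

def frank_wolfe(grad, x0, steps=200):
    x = x0
    for k in range(steps):
        s = lmo_l1(grad(x))
        gamma = 2.0 / (k + 2)             # standard stepsize schedule
        x = (1 - gamma) * x + gamma * s   # convex combination stays in Omega
    return x

# Minimize f(x) = ||x - c||^2 over the unit l1 ball with c = (2, 0);
# the constrained optimum is the vertex (1, 0).
c = np.array([2.0, 0.0])
x = frank_wolfe(lambda x: 2 * (x - c), np.zeros(2))
```

Note that no projection is ever computed: each step only solves a linear problem over Ω, which for polytopes like the ℓ1 ball is just a vertex lookup.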
Frank-Wolfe Convex Convergence:
Along with being faster per iteration, we can also prove that the method achieves the same convergence rate on well-behaved functions as projected gradient descent. On an L-smooth convex function over a compact convex set Ω, Frank-Wolfe with the LMO also has a convergence rate of O(1/k) if the stepsize is chosen as γ(k) = 2/(k+2). The rate is stated in terms of the curvature constant Cf, which conveys how much f deviates from the linear approximation made at each iteration:
Cf = sup_{x,s∈Ω, γ∈[0,1], y=x+γ(s−x)} (2/γ²)(f(y) − f(x) − ⟨∇f(x), y − x⟩)
f(x(k)) − f(x∗) ≤ 2Cf/(k+2)
The proof follows a very similar structure as before, where all of the properties are used to bound certain quantities, which are then used to derive a recurrence relation which allows us to derive our desired result.
We can first prove that Cf is finite using the L-smoothness property, so that we can make statements using it. Applying the quadratic upper bounding property with y = x + γ(s − x) shows that the deviation from the linearization is at most (γ²L/2)∥s − x∥², so Cf ≤ L·diam(Ω)², which is finite since Ω is compact.
For this to be meaningful, we need a relationship between s(k) and the optimal point x∗. Since s(k) is defined as the minimizer of ⟨s, ∇f(x(k))⟩ over Ω, we know ⟨s(k), ∇f(x(k))⟩ is a lower bound for this value at any point, including x∗. Combining this with convexity yields the recurrence h(k+1) ≤ (1 − γ(k))h(k) + (γ(k))²Cf/2 for the suboptimality h(k) = f(x(k)) − f(x∗), and induction with γ(k) = 2/(k+2) gives the O(1/k) rate.
A convergence rate for Frank-Wolfe has also been derived in non-convex settings. Since we’re now dealing with non-convex functions, we need to describe convergence in terms other than our original optimality gap. Instead, we use the Frank-Wolfe gap gt, which describes how much the objective can be decreased by moving towards an element of the feasible set; this lets gt = 0 act as an analogue of ∥∇f(x)∥ = 0 in our setting.
gt = max_{s∈Ω} ⟨s − x(t), −∇f(x(t))⟩
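A sketch of computing this gap, assuming Ω is the unit ℓ1 ball so the maximization is attained at a signed vertex (the same computation as an LMO); the objective and names are illustrative:

```python
import numpy as np

def fw_gap_l1(x, g, r=1.0):
    """Frank-Wolfe gap: max over s in the l1 ball of <s - x, -g>."""
    i = np.argmax(np.abs(g))
    s = np.zeros_like(g)
    s[i] = -r * np.sign(g[i])    # the maximizer is a signed vertex
    return (s - x) @ (-g)

# For f(x) = ||x - c||^2 with c = (2, 0), the point x = (1, 0) is the
# constrained optimum on the unit l1 ball, so its gap is exactly 0,
# while a non-stationary point like the origin has a positive gap.
c = np.array([2.0, 0.0])
x = np.array([1.0, 0.0])
gap_opt = fw_gap_l1(x, 2 * (x - c))
gap_origin = fw_gap_l1(np.zeros(2), 2 * (np.zeros(2) - c))
```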
Since we’re in a non-convex setting, we don’t have the guarantee that the iterates always improve, so we state convergence in terms of the minimal Frank-Wolfe gap g̃t = min_{0≤k≤t} gk after t iterations. The minimal Frank-Wolfe gap on an L-smooth, potentially non-convex, f over a compact convex set Ω after t iterations can be bounded by O(1/√t) as shown below.
g̃t ≤ max{2h0, C}/√(t+1) for t ≥ 0, where h0 = f(x(0)) − min_{x∈Ω} f(x)
h0 denotes the initial suboptimality and C is some constant satisfying C ≥ Cf that is determined by our stepsize. If a line-search algorithm is used to find the optimal step-size, then C = Cf. If the adaptive stepsize γt = min{gt/C, 1} is chosen (derived later), then any C ≥ Cf works.
The proof follows the same general form of using the properties of f to derive a recursive bound and then using it to prove global convergence properties. We can first derive a more practical form of the L-smooth descent lemma using the definition of Cf, which we know is finite from f being L-smooth and Ω being compact:
f(x + γ(s − x)) ≤ f(x) + γ⟨∇f(x), s − x⟩ + (γ²/2)Cf
We can then define dt = s(t) − x(t) (the direction of the step) and xγ = x(t) + γdt (the potential next step for a given γ) to get the bound in terms of our algorithm. We can further simplify by using the definition of gt, which gives ⟨∇f(x(t)), dt⟩ = −gt, and by replacing Cf with our previously defined C:
f(xγ) ≤ f(x(t)) − γgt + (γ²/2)C
To minimize this bound on f(xγ) over γ ∈ [0, 1], we can choose the stepsize γ∗ = min{gt/C, 1}. This breaks the decrease of the objective at each step into a piecewise expression based on whether gt ≤ C, leading to a stepsize of gt/C, or gt > C, leading to a stepsize of 1. We can also replace the piecewise with an indicator function:
f(x(t)) − f(x(t+1)) ≥ min{gt²/(2C), gt − (C/2)·1[gt > C]}
In cases where gt > C, we have min{gt²/(2C), gt − C/2}. We can define the difference between the two terms as d(gt) = gt²/(2C) − (gt − C/2) = (gt² − 2Cgt + C²)/(2C) = (gt − C)²/(2C). Since (gt − C)² ≥ 0, the difference is nonnegative, so gt²/(2C) ≥ gt − C/2 and the correct branch is chosen by the min. In cases where gt ≤ C, we have min{gt²/(2C), gt}. We can first divide by gt, which we know is positive, and get min{gt/(2C), 1}. Since gt ≤ C, we know gt/(2C) ≤ 1/2 < 1, so again the correct branch is chosen.
Since this forms a recursive relationship, we can sum over t iterations since it forms a telescoping sum. We can then simplify by replacing each gap in the sum with the minimal gap g̃t: since every term is nonnegative and the decrease expression is increasing in the gap, replacing the subtracted terms with smaller ones keeps the ≤ bound.
f(x(0)) − f(x(t+1)) ≥ (t+1)·min{g̃t²/(2C), g̃t − (C/2)·1[g̃t > C]}
This means we can bound g~t with these, specifically using the fact that f(x(0))−f(x(t+1))≤f(x(0))−f(x∗)=h0.
g̃t ≤ h0/(t+1) + C/2 (if g̃t > C)
g̃t ≤ √(2h0C)/√(t+1) (if g̃t ≤ C)
Since we’re trying to bound g̃t, we would like to derive a new set of conditions for the piecewise to get rid of the circular logic. While an exactly equivalent condition is hard to find, we can find a set of conditions that guarantees the bounds hold. By simple algebraic manipulation, we know that g̃t > C implies t + 1 ≤ 2h0/C; conversely, if t + 1 > 2h0/C then g̃t ≤ C, so any such g̃t has to be categorized in the second case. If we instead have some g̃t ≤ C with t + 1 ≤ 2h0/C, we can show that the first bound also holds, since then g̃t ≤ C = C/2 + C/2 ≤ h0/(t+1) + C/2.
Since we have two different convergence rates based on these conditions, we need to get them into a similar form to provide an overall convergence rate. We can first manipulate the first bound to get it into a workable form.
In order to work with this, we need to prove that h0 > C/2 if t + 1 ≤ 2h0/C. Consider the case where the condition holds and h0 ≤ C/2. This would lead to t + 1 ≤ 2h0/C ≤ (2/C)·(C/2) = 1, which leads to a contradiction since we know that t ≥ 0.
Using this fact, we know that C/(2h0) < 1, and since t + 1 ≤ 2h0/C also gives C/2 ≤ h0/(t+1), we can substitute into the first bound and simplify: g̃t ≤ h0/(t+1) + C/2 ≤ 2h0/(t+1) ≤ 2h0/√(t+1), deriving our desired result.
To further simplify, we can show that both bounds can have their numerators replaced with max{2h0, C}. In the first case, since we know that h0 > C/2, we know that 2h0 > C, which gives our desired numerator 2h0 = max{2h0, C}. For the second, we know that since both h0 and C are nonnegative we have √(2h0·C) ≤ max{2h0, C} (given some arbitrary a, b ≥ 0 with a ≥ b, we know √(ab) ≤ √(a²) = a). Replacing the numerators then combines the conditions and derives our final result, proving the theorem.