For convex problems, does gradient in Stochastic Gradient Descent (SGD) always point at the global extreme value?
Given a convex cost function, using SGD for optimization, we will have a gradient (vector) at a certain point during the optimization process.
My question is, given a point on the convex surface, does the gradient only point in the direction in which the function increases/decreases fastest, or does the gradient always point at the optimal/extreme point of the cost function?
The former is a local concept, the latter is a global concept.
SGD can eventually converge to the extreme value of the cost function. I'm wondering about the difference between the direction of the gradient at an arbitrary point on the convex surface and the direction pointing at the global extreme value.
The gradient's direction should be the direction in which the function increases/decreases fastest at that point, right?
No, because it's stochastic gradient descent, not gradient descent. The whole point of SGD is that you throw away some of the gradient information in return for increased computational efficiency -- but obviously in throwing away some of the gradient information you're no longer going to have the direction of the original gradient. This is setting aside the issue of whether or not the regular gradient points in the direction of optimal descent, but the point is, even if regular gradient descent did, there's no reason to expect stochastic gradient descent to do so.
– Chill2Macht
Sep 18 '18 at 16:50
@Tyler, why is your question specifically about stochastic gradient descent? Do you imagine something different in comparison to standard gradient descent?
– Martijn Weterings
Sep 18 '18 at 20:22
The gradient will always point toward the optimum in the sense that the angle between the gradient and the vector to the optimum is less than $\frac{\pi}{2}$, and walking in the direction of the gradient an infinitesimal amount will get you closer to the optimum.
– Solomonoff's Secret
Sep 18 '18 at 20:23
If the gradient pointed directly at a global minimizer, convex optimization would become super easy, because we could then just do a one-dimensional line search to find a global minimizer. This is too much to hope for.
– littleO
Sep 18 '18 at 20:25
6 Answers
They say an image is worth more than a thousand words. In the following example (courtesy of MS Paint, a handy tool for amateur and professional statisticians alike) you can see a convex function surface and a point where the direction of the steepest descent clearly differs from the direction towards the optimum.
On a serious note: There are far superior answers in this thread that also deserve an upvote.
And today's counter-example is ... an avocado!
– JDL
Sep 18 '18 at 13:28
You see that while cutting an avocado, you should cut in the steepest descent direction to avoid the seed and a possible injury.
– Jan Kukacka
Sep 18 '18 at 14:16
An intuitive way to see this is to imagine that the path of descent is a curved path. See for instance the examples below.
As an analogy: imagine I blindfold you and put you somewhere on a mountain with the task of walking back to the extreme (lowest) point. On the hill, if you only have local information, then you do not know in which direction the bottom of the lake lies.
The angle may exceed $\pi/2$. In the image below this is emphasized by drawing an arrow of the direction of descent for a particular point, where the final solution is behind the line perpendicular to the direction of descent.
In the convex problem this is not possible. You could relate this to the isolines for the cost function all having curvature in the same direction when the problem is convex.
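To make the convex case concrete, here is a minimal numeric sketch (my addition, not part of the original answer), using the convex quadratic $f(x)=x_1^2 + 25x_2^2$ that a later answer analyzes: the angle between the steepest-descent direction and the direction to the minimizer always stays below $\pi/2$, yet is rarely zero.

```python
import numpy as np

# Convex quadratic f(x) = x1^2 + 25*x2^2, minimizer at the origin.
def grad(x):
    return np.array([2.0 * x[0], 50.0 * x[1]])

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.uniform(-1.0, 1.0, size=2)
    to_opt = -x          # direction from x toward the minimizer
    descent = -grad(x)   # steepest-descent direction
    cos = to_opt @ descent / (np.linalg.norm(to_opt) * np.linalg.norm(descent))
    print(f"x = {x.round(2)}, angle = {np.degrees(np.arccos(cos)):5.1f} deg")
```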
Below is another view for four data points. Each of the four images shows the surface for a different single point. In each step a different point is chosen, and the gradient is computed along that point only. As a result there are only four directions along which a step can be made, but the step sizes decrease as we get closer to the solution.
The above images are for 4 datapoints generated by the function:
$$y_i = e^{-0.4x_i}-e^{-0.8 x_i} + \epsilon_i$$
x = 0 2 4 6
y = 0.006 0.249 0.153 0.098
which results in:

- a non-convex optimization problem when we minimize the (non-linear) cost function $$ S(a,b) = \sum_{i=1}^{4} \left( y_i - (e^{-ax_i}-e^{-b x_i}) \right)^2$$ $$\nabla S(a,b) = \begin{bmatrix} \sum_{i=1}^{4} 2 x_i e^{-a x_i}\left( y_i - (e^{-ax_i}-e^{-b x_i}) \right) \\ \sum_{i=1}^{4} -2 x_i e^{-b x_i}\left( y_i - (e^{-ax_i}-e^{-b x_i}) \right) \end{bmatrix}$$
- a convex optimization problem (like any linear least squares problem) when we minimize $$ S(a,b) = \sum_{i=1}^{4} \left( y_i - (a e^{-0.4 x_i}- b e^{-0.8 x_i}) \right)^2$$ $$\nabla S(a,b) = \begin{bmatrix} \sum_{i=1}^{4} -2 e^{-0.4x_i}\left( y_i - (a e^{-0.4x_i}- b e^{-0.8 x_i}) \right) \\ \sum_{i=1}^{4} 2 e^{-0.8x_i}\left( y_i - (a e^{-0.4x_i}- b e^{-0.8 x_i}) \right) \end{bmatrix}$$
- a convex optimization problem (but not with a single minimum) when we minimize for some specific $i$ $$ S(a,b) = \left( y_i - (a e^{-0.4 x_i}- b e^{-0.8 x_i}) \right)^2,$$ which has gradient $$\nabla S(a,b) = \begin{bmatrix} -2 e^{-0.4x_i}\left( y_i - (a e^{-0.4x_i}- b e^{-0.8 x_i}) \right) \\ 2 e^{-0.8x_i}\left( y_i - (a e^{-0.4x_i}- b e^{-0.8 x_i}) \right) \end{bmatrix};$$ this has multiple minima (there are multiple $a$ and $b$ for which $S = 0$).
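A small sketch of the last, per-point case (my addition; the variable names are mine): at any fixed $(a,b)$, each data point yields its own gradient direction, so a stochastic step can only move along one of four possible lines, as described above.

```python
import numpy as np

# Four data points from the answer.
x = np.array([0.0, 2.0, 4.0, 6.0])
y = np.array([0.006, 0.249, 0.153, 0.098])

def grad_i(a, b, i):
    # Gradient of the single-point cost S_i(a, b) from the last case above.
    r = y[i] - (a * np.exp(-0.4 * x[i]) - b * np.exp(-0.8 * x[i]))
    return np.array([-2.0 * np.exp(-0.4 * x[i]) * r,
                      2.0 * np.exp(-0.8 * x[i]) * r])

a, b = 1.0, 1.0  # an arbitrary point in parameter space
for i in range(4):
    g = grad_i(a, b, i)
    print(f"point {i}: step direction {(-g / np.linalg.norm(g)).round(3)}")
```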
Steepest descent can be inefficient even if the objective function is strongly convex.
Ordinary gradient descent
I mean "inefficient" in the sense that steepest descent can take steps that oscillate wildly away from the optimum, even if the function is strongly convex or even quadratic.
Consider $f(x)=x_1^2 + 25x_2^2$. This is convex because it is a quadratic with positive coefficients. By inspection, we can see that it has a global minimum at $x=[0,0]^\top$. It has gradient
$$
\nabla f(x)=
\begin{bmatrix}
2x_1 \\
50x_2
\end{bmatrix}
$$
With a learning rate of $\alpha=0.035$ and initial guess $x^{(0)}=[0.5, 0.5]^\top$, we have the gradient update
$$
x^{(1)} = x^{(0)} - \alpha \nabla f\left(x^{(0)}\right),
$$
which exhibits wildly oscillating progress towards the minimum.
Indeed, the angle $\theta$ formed between $(x^{(i)}, x^*)$ and $(x^{(i)}, x^{(i+1)})$ only gradually decays to 0. What this means is that the direction of the update is sometimes wrong -- at most, it is wrong by almost 68 degrees -- even though the algorithm is converging and working correctly.
Each step is wildly oscillating because the function is much steeper in the $x_2$ direction than the $x_1$ direction. Because of this fact, we can infer that the gradient is not always, or even usually, pointing toward the minimum. This is a general property of gradient descent when the eigenvalues of the Hessian $\nabla^2 f(x)$ are on dissimilar scales. Progress is slow in directions corresponding to the eigenvectors with the smallest corresponding eigenvalues, and fastest in the directions with the largest eigenvalues. It is this property, in combination with the choice of learning rate, that determines how quickly gradient descent progresses.
The direct path to the minimum would be to move "diagonally" instead of in this fashion, which is strongly dominated by vertical oscillations. However, gradient descent only has information about local steepness, so it "doesn't know" that this strategy would be more efficient, and it is subject to the vagaries of the Hessian having eigenvalues on different scales.
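A short sketch (my addition, mirroring the setup above) runs this update and prints the angle between each step and the direction to the optimum; the iterates converge even though individual steps point well away from the minimum.

```python
import numpy as np

# Gradient descent on f(x) = x1^2 + 25*x2^2 (minimum at the origin),
# with the learning rate and initial guess used above.
def grad(x):
    return np.array([2.0 * x[0], 50.0 * x[1]])

alpha = 0.035
x = np.array([0.5, 0.5])
for i in range(8):
    step = -alpha * grad(x)
    to_opt = -x  # direction toward the minimizer
    cos = step @ to_opt / (np.linalg.norm(step) * np.linalg.norm(to_opt))
    print(f"iter {i}: x = {x.round(4)}, angle = {np.degrees(np.arccos(cos)):.1f} deg")
    x = x + step
```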
Stochastic gradient descent
SGD has the same properties, with the exception that the updates are noisy, implying that the contour surface looks different from one iteration to the next, and therefore the gradients are also different. This implies that the angle between the direction of the gradient step and the optimum will also have noise - just imagine the same plots with some jitter.
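To mimic this, the same sketch can be rerun with additive noise on the gradient (the noise model here is my assumption, purely for illustration): the step directions now jitter from iteration to iteration, and the iterates hover in a noise ball around the optimum instead of settling exactly on it.

```python
import numpy as np

def grad(x):
    return np.array([2.0 * x[0], 50.0 * x[1]])

rng = np.random.default_rng(1)
alpha = 0.035
x = np.array([0.5, 0.5])
for _ in range(200):
    g = grad(x) + rng.normal(scale=1.0, size=2)  # noisy gradient estimate
    x = x - alpha * g
print("final iterate:", x.round(3))  # near, but not exactly at, the optimum
```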
More information:
Can we apply analyticity of a neural network to improve upon gradient descent?
Why are second-order derivatives useful in convex optimization?
How can change in cost function be positive?
This answer borrows this example and figure from Neural Networks Design (2nd Ed.) Chapter 9 by Martin T. Hagan, Howard B. Demuth, Mark Hudson Beale, Orlando De Jesús.
The locally steepest direction is not the same as the direction toward the global optimum. If it were, then your gradient direction wouldn't change: if you were always heading towards the optimum, your direction vector would always point at the optimum. But that's not the case. If it were the case, why bother calculating your gradient at every iteration?
The other answers highlight some annoying rate-of-convergence issues for GD/SGD, but your comment "SGD can eventually converge..." isn't always correct (ignoring pedantic usage remarks about the word "can" since it seems you meant "will").
One nice trick for finding counter-examples with SGD is to notice that if every data point is the same, your cost function is deterministic. Imagine the extremely pathological example where we have one data point $$(x_0,y_0)=(1,0)$$ and we have a model for how our system should work based on a single parameter $\alpha$: $$f(x,\alpha)=\sqrt{\alpha^2-\alpha x}.$$
With MSE as our cost function, this simplifies to $$(f(x_0,\alpha)-y_0)^2=\alpha^2-\alpha,$$ a convex function. Suppose we choose our learning rate $\beta$ poorly, so that our update rule is as follows: $$\alpha_{n+1}=\alpha_n-\beta(2\alpha_n-1)=\alpha_n-(2\alpha_n-1)=1-\alpha_n.$$ Now, our cost function has a minimum at $\alpha=\frac{1}{2}$, but if we start literally anywhere other than $p=\frac{1}{2}$ then SGD will simply bounce between the starting point $p$ and $1-p$ and never converge.
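A tiny sketch of this update rule (my addition) makes the cycling visible:

```python
# Update alpha_{n+1} = alpha_n - beta*(2*alpha_n - 1) with beta = 1,
# i.e. alpha_{n+1} = 1 - alpha_n; the iterates never approach 0.5.
alpha = 0.2  # any start other than the minimizer 0.5
for n in range(6):
    alpha = alpha - 1.0 * (2.0 * alpha - 1.0)
    print(f"alpha_{n + 1} = {alpha:.1f}")
# prints 0.8, 0.2, 0.8, 0.2, 0.8, 0.2
```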
I'm not sure if convexity is enough to break some worse behavior that exists for general SGD, but if you allow functions even as complex as cubics for your cost function then SGD can bounce around on a dense subset of the domain and never converge anywhere or approach any cycle.
SGD can also approach/obtain cycles of any finite length, diverge toward $\infty$, oscillate toward $\pm\infty$ (excuse the notation), and have tons of other pathological behavior.
One interesting thing about the whole situation is that there exist uncountably many functions (like SGD) which take arbitrary convex functions as inputs and then output an update rule which always quickly converges to the global minimum (if one exists). Even though conceptually there exist loads of them, our best attempts at convex optimization all have pathological counterexamples. Somehow the idea of a simple/intuitive/performant update rule runs counter to the idea of a provably correct update rule.
+1 for this observation. But this $\beta=1$ is a bit of a bad choice and would also be bad in the case of regular gradient descent. It is a good comment, but it doesn't really relate to the issue of whether the steepest descent path points towards the solution or not; it relates instead to the issue of too-large step sizes that may lead to divergent updating.
– Martijn Weterings
Sep 19 '18 at 12:25
Note that the SGD convergence proof assumes a decreasing step size...
– Jan Kukacka
Sep 19 '18 at 14:04
@MartijnWeterings Good observation. I guess my example does actually point in the correct direction. Should I update it with a 2D example that never points in the right direction and diverges?
– Hans Musgrave
Sep 19 '18 at 15:35
@MartijnWeterings Agreed, $\beta=1$ is a bad choice. For any $\beta>0$ though, there exists a pathological cost function for which that $\beta$ fails. One of the easiest ones stems from $f(x,\alpha)=\sqrt{\frac{\alpha^2-\alpha x}{\beta}}.$
– Hans Musgrave
Sep 19 '18 at 15:37
@JanKukacka That's a common modification to SGD that suffers from a similar flaw. Instead of the cost function being a parabola, you choose $f$ so that the cost function is a symmetric convex function rising rapidly enough in both directions from the minimum to counteract the cooling rate of $\beta$. The SGD convergence proofs I've seen are only with probability 1 and rely on such badly chosen cost functions existing with probability 0 under typical measures on the space of cost functions.
– Hans Musgrave
Sep 19 '18 at 15:46
Maybe the answers to this question need a quick update. It seems that SGD converges to a global minimum also in the non-convex case (the convex case is just a special case of that):
SGD Converges To Global Minimum In Deep Learning via Star-Convex Path, Anonymous authors, Paper under double-blind review at ICLR 2019
https://openreview.net/pdf?id=BylIciRcYQ
The authors establish the convergence of SGD to a global minimum for nonconvex optimization problems that are commonly encountered in neural network training. The argument exploits the following two important properties: 1) the training loss can achieve zero value (approximately); 2) SGD follows a star-convex path. In such a context, although SGD has long been considered a randomized algorithm, the paper reveals that it converges in an intrinsically deterministic manner to a global minimum.
This should be taken with a grain of salt though. The paper is still under review.
The notion of a star-convex path gives a hint about where the gradient points at each iteration.
Have you ever walked straight downhill from a mountain ridge, only to find yourself in a valley that continues downhill in a different direction? The challenge is to imagine such a situation with convex topography: think of a knife edge where the ridge is steepest at the top.
– whuber♦
Sep 18 '18 at 14:48