Section 32.1 Gradient descent in several variables
In order to minimize a function \(f\colon \mathbb R^n \to \mathbb R\text{,}\) we can start with an initial vector \(\mathbf a\) and compute \(\mathbf x = \mathbf a - \beta \nabla f(\mathbf a)\text{,}\) then replace \(\mathbf a\) with \(\mathbf x\) and repeat until convergence is achieved (or the limit on the number of steps is reached). This is the \(n\)-dimensional version of the gradient descent method of Section 31.3. We already saw that the choice of \(\beta\) is difficult even in one dimension. It gets worse in several dimensions.
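For concreteness, here is a minimal Matlab sketch of this iteration; the test function, the fixed step size, and the stopping tolerance below are chosen only for illustration.
f  = @(x) x(1)^2 + 3*x(2)^2;        % a well-scaled test function (illustration only)
fg = @(x) [2*x(1); 6*x(2)];         % its gradient
a = [2; -1];                        % initial vector
beta = 0.1;                         % fixed step size (an assumed value)
for k = 1:1000
    x = a - beta*fg(a);             % one gradient descent step
    if norm(x - a) < 1e-6           % stop when consecutive points are very close
        break
    end
    a = x;
end
disp(x')                            % close to the minimum at (0, 0)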
Consider the function
\begin{equation}
f(x_1, x_2) = x_1^2 + 10^6 x_2^2\tag{32.1.1}
\end{equation}
with the minimum at \((0, 0)\text{.}\) Its gradient is \(\nabla f = \langle 2x_1, 2\cdot 10^6 x_2\rangle\text{.}\) So, the process described above is
\begin{equation*}
x_1 \leftarrow (1 - 2\beta)\,x_1, \qquad x_2 \leftarrow (1 - 2\cdot 10^6 \beta)\,x_2 .
\end{equation*}
There is no good value of \(\beta\) to use here. If \(\beta > 10^{-6}\text{,}\) the second coordinate grows exponentially. If \(\beta \lt 10^{-6}\text{,}\) for example \(\beta = 10^{-6}/2\) (which is optimal for the second coordinate), it will take a million steps just to reduce the first coordinate, say, from \(1\) to \(0.37\text{.}\)
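To check this arithmetic numerically, one can iterate the update formulas directly; the starting point \((1, 1)\) and the loop count below are chosen only for illustration.
beta = 0.5e-6;                      % the step size that is optimal for the second coordinate
x = [1; 1];
for k = 1:1e6
    x = x - beta*[2*x(1); 2e6*x(2)];    % one gradient descent step
end
disp(x(1))                          % about 0.37: a million steps for a modest reduction
disp(x(2))                          % essentially zero (up to roundoff) after the first step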
The underlying issue is that the direction \(-\nabla f\) does not really point toward the minimum of \(f\) in this example. The steepest descent direction points mostly sideways because the function changes at very different rates in different variables. This is one of the main reasons why optimization in several variables is hard, and why it is sensitive to the scaling of the variables. We would not have such difficulties with \(x_1^2 + 3 x_2^2\text{.}\)
One way to improve the situation is to run a one-variable minimization algorithm, a line search, at each step to find the minimum of \(f\) on the line of steepest descent. This means minimizing the function \(h(t) = f(\mathbf a + t \mathbf g)\) where \(\mathbf g = -\nabla f(\mathbf a)\text{.}\) If we find the exact minimum on the line each time, then the minimization process for this function converges to the minimum quickly. But for slightly more complicated functions the difficulty emerges again. Consider Rosenbrock's function
\begin{equation}
f(x_1, x_2) = (x_1 - 1)^2 + 100\,(x_1^2 - x_2)^2\tag{32.1.2}
\end{equation}
Its graph has a curved “valley” along the parabola \(x_2 = x_1^2\text{.}\) One-dimensional minimization along the line of steepest descent leads to each consecutive direction being perpendicular to the previous one (why?). So the algorithm ends up zig-zagging in tiny steps in a deep narrow valley, making very little progress toward the minimum. And since each step is its own one-dimensional minimization problem, the entire process may take too long.
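As a hint for the question above: if \(t^*\) is the exact minimum of \(h(t) = f(\mathbf a + t\mathbf g)\text{,}\) then
\begin{equation*}
0 = h'(t^*) = \nabla f(\mathbf a + t^* \mathbf g)\cdot \mathbf g,
\end{equation*}
so the gradient at the new point, and hence the next direction of steepest descent, is perpendicular to the current direction \(\mathbf g\text{.}\)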
In order to try the method of the previous paragraph in practice, we need a line-search subroutine such as Matlab's command fminbnd for one-dimensional minimization on an interval. Recall from Section 31.2 that this command uses a combination of parabolic interpolation and golden section methods to find a local minimum: for example, fminbnd(@(t) t*cos(t), 0, 12) returns \(3.4256\) even though the absolute minimum of \(t\cos t\) on the interval \([0, 12]\) is at \(t \approx 9.5293\text{.}\) But it is fast, and for our current purpose (a line search at each step of gradient descent) being fast is more important than finding the absolute minimum.
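A quick check of this behavior (the variable names below are only for illustration; the second call brackets the absolute minimum):
h = @(t) t.*cos(t);
t_local = fminbnd(h, 0, 12);        % about 3.4256, a local minimum
t_global = fminbnd(h, 8, 11);       % about 9.5293, found once the bracket excludes the other dip
fprintf('h(%.4f) = %.4f,  h(%.4f) = %.4f\n', t_local, h(t_local), t_global, h(t_global))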
Example 32.1.1. Gradient descent with line search.
Minimize Rosenbrock's function (32.1.2) using line search in the direction of steepest descent, \(-\nabla f\text{,}\) at each step. Use a random starting point randn(2, 1). Plot the path of the search.
The gradient of Rosenbrock's function is
\begin{equation*}
\nabla f = \langle 2(x_1 - 1) + 400\,x_1 (x_1^2 - x_2),\ -200\,(x_1^2 - x_2)\rangle .
\end{equation*}
The code uses a matrix path to record the path taken by the search.
f = @(x) (x(1)-1).^2 + 100*(x(1).^2 - x(2)).^2;
fg = @(x) [2*(x(1)-1) + 400*x(1)*(x(1).^2 - x(2)); -200*(x(1).^2 - x(2))];
x = randn(2, 1);
max_tries = 10000;
path = zeros(2, max_tries);

for k = 1:max_tries
    path(:, k) = x;                          % record the current point
    g = -fg(x);                              % direction of descent
    t_min = fminbnd(@(t) f(x + t*g), 0, 1);  % line search along the descent direction
    dx = t_min*g;
    x = x + dx;
    if norm(dx) < 1e-6                       % stop when the step is very small
        break
    end
end

plot(path(1, 1:k), path(2, 1:k), '-+')
if k < max_tries
    fprintf('Found x = (%g, %g) with f(x) = %g after %d steps\n', x, f(x), k);
else
    disp('Failed to converge')
end
Whether the code converges or fails depends on the initial point. Often it fails despite being on the right track, because of too many tiny zigzagging steps.
Note that there is a reason to have a less restrictive stopping condition for the step size norm(dx) when the goal is minimization (compared to root-finding). A smooth function changes slowly near a point of minimum (like a parabola near its vertex), so an error of size \(10^{-6}\) in the location of the minimum could mean an error of only about \(10^{-12}\) in the minimal value.
There are several “weak line search” methods designed to determine a step size that makes the function noticeably smaller without finding its exact minimum. They can speed up the process but do not completely resolve the zigzagging issue, which is the root of the difficulty in Example 32.1.1. The directions seem to be a problem, not just the step sizes.
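As one common example of a weak line search (a sketch only, not the method used in Example 32.1.1), a backtracking search with a sufficient-decrease test could replace the call to fminbnd. It assumes f, fg, and x are defined as in the code above, and the constants 1e-4 and 1/2 are conventional choices, not prescribed by the text.
g = -fg(x);                              % direction of descent
t = 1;                                   % start with a full step
fx = f(x);
while f(x + t*g) > fx - 1e-4*t*(g'*g)    % sufficient-decrease (Armijo) test
    t = t/2;                             % halve the step until the test passes
    if t < 1e-12
        break                            % give up on finding a usable step
    end
end
x = x + t*g;                             % accept the (weakly) decreasing step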