Section 32.1 Gradient descent in several variables
In order to minimize a function \(f\colon \mathbb R^n \to \mathbb R\text{,}\) we can start with an initial vector \(\mathbf a\) and compute \(\mathbf x = \mathbf a - \beta \nabla f(\mathbf a)\text{,}\) then replace \(\mathbf a\) with \(\mathbf x\) and repeat until convergence is achieved (or the limit on the number of steps is reached). This is the \(n\)-dimensional version of the gradient descent method of Section 31.3. We already saw that the choice of \(\beta\) is difficult even in one dimension. It gets worse in several dimensions.
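For concreteness, here is a minimal Matlab sketch of this iteration; the test function, the fixed step size, and the stopping tolerance below are chosen only for illustration.
f  = @(x) x(1)^2 + 3*x(2)^2;        % a well-scaled test function (illustration only)
fg = @(x) [2*x(1); 6*x(2)];         % its gradient
a = [2; -1];                        % initial vector
beta = 0.1;                         % fixed step size (an assumed value)
for k = 1:1000
    x = a - beta*fg(a);             % one gradient descent step
    if norm(x - a) < 1e-6           % stop when consecutive points are very close
        break
    end
    a = x;
end
disp(x')                            % close to the minimum at (0, 0)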
Consider the function
\begin{equation}
f(x_1, x_2) = x_1^2 + 10^6 x_2^2\tag{32.1.1}
\end{equation}
with the minimum at \((0, 0)\text{.}\) Its gradient is \(\nabla f = \langle 2x_1, 2\cdot 10^6 x_2\rangle\text{.}\) So, the process described above is
\begin{equation*}
x_1 \leftarrow (1 - 2\beta)\,x_1, \qquad x_2 \leftarrow (1 - 2\cdot 10^6 \beta)\,x_2 .
\end{equation*}
There is no good value of \(\beta\) to use here. If \(\beta > 10^{-6}\text{,}\) the second coordinate grows exponentially. If \(\beta \lt 10^{-6}\text{,}\) for example \(\beta = 10^{-6}/2\) (which is optimal for the second coordinate), it will take a million steps just to reduce the first coordinate, say, from \(1\) to \(0.37\text{.}\)
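To check this arithmetic numerically, one can iterate the update formulas directly; the starting point \((1, 1)\) and the loop count below are chosen only for illustration.
beta = 0.5e-6;                      % the step size that is optimal for the second coordinate
x = [1; 1];
for k = 1:1e6
    x = x - beta*[2*x(1); 2e6*x(2)];    % one gradient descent step
end
disp(x(1))                          % about 0.37: a million steps for a modest reduction
disp(x(2))                          % essentially zero (up to roundoff) after the first step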
The underlying issue is that the direction \(-\nabla f\) does not really point toward the minimum of \(f\) in this example. The steepest descent direction points mostly sideways because the function changes at very different rates in different variables. This is one of the main reasons why optimization in several variables is hard, and why it is sensitive to the scaling of the variables. We would not have such difficulties with \(x_1^2 + 3 x_2^2\text{.}\)
One way to improve the situation is to run a one-variable minimization algorithm, a line search, at each step to find the minimum of \(f\) on the line of steepest descent. This means minimizing the function \(h(t) = f(\mathbf a + t \mathbf g)\) where \(\mathbf g = -\nabla f(\mathbf a)\text{.}\) If we find the exact minimum on the line each time, then the minimization process for this function converges to the minimum quickly. But for slightly more complicated functions the difficulty emerges again. Consider Rosenbrock's function
\begin{equation}
f(x_1, x_2) = (x_1 - 1)^2 + 100\,(x_1^2 - x_2)^2\tag{32.1.2}
\end{equation}
Its graph has a curved “valley” along the parabola \(x_2 = x_1^2\text{.}\) One-dimensional minimization along the line of steepest descent leads to each consecutive direction being perpendicular to the previous one (why?). So the algorithm ends up zig-zagging in tiny steps in a deep narrow valley, making very little progress toward the minimum. And since each step is its own one-dimensional minimization problem, the entire process may take too long.
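As a hint for the question above: if \(t^*\) is the exact minimum of \(h(t) = f(\mathbf a + t\mathbf g)\text{,}\) then
\begin{equation*}
0 = h'(t^*) = \nabla f(\mathbf a + t^* \mathbf g)\cdot \mathbf g,
\end{equation*}
so the gradient at the new point, and hence the next direction of steepest descent, is perpendicular to the current direction \(\mathbf g\text{.}\)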
In order to try the method of the previous paragraph in practice, we need a line-search subroutine such as Matlab's command fminbnd for one-dimensional minimization on an interval. Recall from Section 31.2 that this command uses a combination of parabolic interpolation and golden section methods to find a local minimum: for example, fminbnd(@(t) t*cos(t), 0, 12) returns \(3.4256\) even though the absolute minimum of \(t\cos t\) on the interval \([0, 12]\) is at \(t \approx 9.5293\text{.}\) But it is fast, and for our current purpose (a line search at each step of gradient descent) being fast is more important than finding the absolute minimum.
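A quick check of this behavior (the variable names below are only for illustration; the second call brackets the absolute minimum):
h = @(t) t.*cos(t);
t_local = fminbnd(h, 0, 12);        % about 3.4256, a local minimum
t_global = fminbnd(h, 8, 11);       % about 9.5293, found once the bracket excludes the other dip
fprintf('h(%.4f) = %.4f,  h(%.4f) = %.4f\n', t_local, h(t_local), t_global, h(t_global))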
Example 32.1.1. Gradient descent with line search.
Minimize Rosenbrock's function (32.1.2) using line search in the direction of steepest descent, \(-\nabla f\text{,}\) at each step. Use a random starting point randn(2, 1). Plot the path of the search.
The gradient of Rosenbrock's function is
\begin{equation*}
\nabla f = \langle 2(x_1 - 1) + 400\,x_1 (x_1^2 - x_2),\ -200\,(x_1^2 - x_2)\rangle .
\end{equation*}
The code uses a matrix path to record the path taken by the search.
f = @(x) (x(1)-1).^2 + 100*(x(1).^2 - x(2)).^2;
fg = @(x) [2*(x(1)-1) + 400*x(1)*(x(1).^2 - x(2)); -200*(x(1).^2 - x(2))];
x = randn(2, 1);
max_tries = 10000;
path = zeros(2, max_tries);

for k = 1:max_tries
    path(:, k) = x;                          % record the current point
    g = -fg(x);                              % direction of descent
    t_min = fminbnd(@(t) f(x + t*g), 0, 1);  % line search along the descent direction
    dx = t_min*g;
    x = x + dx;
    if norm(dx) < 1e-6                       % stop when the step is very small
        break
    end
end

plot(path(1, 1:k), path(2, 1:k), '-+')
if k < max_tries
    fprintf('Found x = (%g, %g) with f(x) = %g after %d steps\n', x, f(x), k);
else
    disp('Failed to converge')
end
Whether the code converges or fails depends on the initial point. Often it fails despite being on the right track, because of too many tiny zigzagging steps.
Note that there is a reason to have a less restrictive stopping condition for the step size norm(dx) when the goal is minimization (compared to root-finding). A smooth function changes slowly near a point of minimum (like a parabola near its vertex), so an error of size \(10^{-6}\) in the location of the minimum could mean an error of only about \(10^{-12}\) in the minimal value.
There are several “weak line search” methods designed to determine a step size that makes the function noticeably smaller without finding its exact minimum. They can speed up the process but do not completely resolve the zigzagging issue, which is the root of the difficulty in Example 32.1.1. The directions seem to be a problem, not just the step sizes.
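As one common example of a weak line search (a sketch only, not the method used in Example 32.1.1), a backtracking search with a sufficient-decrease test could replace the call to fminbnd. It assumes f, fg, and x are defined as in the code above, and the constants 1e-4 and 1/2 are conventional choices, not prescribed by the text.
g = -fg(x);                              % direction of descent
t = 1;                                   % start with a full step
fx = f(x);
while f(x + t*g) > fx - 1e-4*t*(g'*g)    % sufficient-decrease (Armijo) test
    t = t/2;                             % halve the step until the test passes
    if t < 1e-12
        break                            % give up on finding a usable step
    end
end
x = x + t*g;                             % accept the (weakly) decreasing step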