
Section 31.3 Gradient descent in one dimension

The method of gradient descent uses only the first derivative of function \(f\) to determine the direction in which to look for its minimum. As a motivating example, suppose we started Newton's method at \(a = 1\) and found \(f'(1) = 3\) and \(f''(1) = -2\text{.}\) According to Newton's method our next point should be \(1 - \frac{3}{-2} = 2.5\text{.}\) But since the function is increasing near \(1\text{,}\) should not the search move to the left instead of to the right?

In its simplest, one-dimensional form, gradient descent amounts to repeatedly computing \(x = a - \beta f'(a)\text{,}\) where the parameter \(\beta > 0\) may be a fixed number or may be adjusted as the iteration proceeds. The idea is to take a small step in the direction in which the function \(f\) decreases.

Example 31.3.1.

Minimize the function \(f(x) = x^4 - 3 x^2 + x + 1\) from Example 30.3.1 using gradient descent with \(\beta = 0.1\) and initial point \(0\text{.}\)

Solution
f = @(x) x.^4 - 3*x.^2 + x + 1;    % function to minimize
fp = @(x) 4*x.^3 - 6*x + 1;        % its derivative

beta = 0.1;
a = 0;                             % initial point
max_tries = 10000;
for k = 1:max_tries
    x = a - beta*fp(a);            % gradient descent step
    if abs(x-a) < 100*eps(a)       % stop when the step becomes negligible
        break
    end
    a = x;
end

if k < max_tries
    fprintf('Found x = %g with f(x) = %g after %d steps\n', x, f(x), k);
else
    disp('Failed to converge')
end

The code takes more steps than Newton's method in Example 30.3.1, but it actually minimizes the function.

If no formula for \(f'\) is available, the methods of numerical differentiation (Chapter 12) can be used to implement gradient descent, for example the central difference approximation \(f'(a) \approx (f(a+h)-f(a-h))/(2h)\) with some small \(h\text{.}\)
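
For instance, here is a sketch of the loop from Example 31.3.1 with the derivative replaced by a central difference. The name fp_num, the step size h = 1e-6, and the looser stopping tolerance are illustrative choices, not values prescribed above; the approximate derivative limits the attainable accuracy, so the tolerance cannot be as tight as before.

f = @(x) x.^4 - 3*x.^2 + x + 1;
h = 1e-6;                                % step for the central difference
fp_num = @(x) (f(x+h) - f(x-h))/(2*h);   % numerical approximation of f'

beta = 0.1;
a = 0;
max_tries = 10000;
for k = 1:max_tries
    x = a - beta*fp_num(a);              % gradient descent step
    if abs(x-a) < 1e-8                   % looser tolerance: f' is only approximate
        break
    end
    a = x;
end

if k < max_tries
    fprintf('Found x = %g with f(x) = %g after %d steps\n', x, f(x), k);
else
    disp('Failed to converge')
end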

Unfortunately, gradient descent with a fixed \(\beta\) is not a reliable method. Simply multiplying both f and fp by 2 in Example 31.3.1 results in failure to converge (why? add disp(x) to the loop to see). We could make \(\beta\) smaller and restore convergence, but this is manual fine-tuning. The underlying issue is scaling: multiplying \(f\) by a positive constant does not change the location of its minima, yet it changes the behavior of gradient descent when the same value of \(\beta\) is used. (Newton's method is invariant under such scaling because the ratio \(f'/f''\) is invariant.)
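
To see the failure concretely, here is a minimal sketch along these lines: multiplying f by 2 doubles its derivative, and printing the iterates shows them oscillating instead of settling down.

fp2 = @(x) 2*(4*x.^3 - 6*x + 1);   % derivative of f multiplied by 2

beta = 0.1;
a = 0;
for k = 1:30
    x = a - beta*fp2(a);
    disp(x)                        % the iterates keep oscillating
    a = x;
end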

Example 31.3.2.

Determine for what values of \(\beta\) gradient descent converges for the function \(f(x) = Mx^2\) where \(M > 0\text{.}\)

Solution

Here \(f'(x) = 2Mx\text{,}\) hence \(a - \beta f'(a) = (1-2M\beta)a\text{.}\) The iteration converges to \(0\) if and only if \(|1-2M\beta| \lt 1\text{,}\) which means we need \(0 \lt \beta \lt 1/M\text{.}\) The optimal value \(\beta = 1/(2M)\) reaches the minimum in a single step. Note that \(f''(x) = 2M\text{,}\) so the optimal \(\beta\) is exactly \(1/f''\text{.}\)
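
A quick numerical check of this threshold; the value \(M = 5\) and the three test values of \(\beta\) are chosen only for illustration.

M = 5;                        % convergence requires 0 < beta < 1/M = 0.2
for beta = [0.1 0.19 0.21]    % optimal, barely convergent, divergent
    a = 1;
    for k = 1:100
        a = a - beta*(2*M*a); % gradient step with f'(a) = 2*M*a
    end
    fprintf('beta = %.2f: a = %g after 100 steps\n', beta, a)
end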

The message of Example 31.3.2 is that we should try to adjust \(\beta\) to be of the size \(1/f''\text{,}\) even if we do not have a formula for \(f''\text{.}\) One way to do this is to start with a guess like \(\beta=0.1\) and then update \(\beta\) after every step using \(\beta = |x-a|/|f'(x)-f'(a)|\text{,}\) which makes \(\beta\) approximately the reciprocal of \(|f''|\text{.}\) (A similar idea was used in Broyden's method in Chapter 11.) To implement this idea, we insert the line

beta = abs(x-a)/abs(fp(x)-fp(a));

into the loop in Example 31.3.1 (where?). Now if the function is multiplied by 2 or even 2000, the method continues to converge.
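
One possible placement is sketched below: the update must come after the new value x is computed and before a is overwritten. Here the function is scaled by 2000 to illustrate the claim.

f = @(x) 2000*(x.^4 - 3*x.^2 + x + 1);   % scaled by 2000
fp = @(x) 2000*(4*x.^3 - 6*x + 1);

beta = 0.1;
a = 0;
max_tries = 10000;
for k = 1:max_tries
    x = a - beta*fp(a);
    if abs(x-a) < 100*eps(a)
        break
    end
    beta = abs(x-a)/abs(fp(x)-fp(a));    % adjust beta to roughly 1/|f''|
    a = x;
end

if k < max_tries
    fprintf('Found x = %g with f(x) = %g after %d steps\n', x, f(x), k);
else
    disp('Failed to converge')
end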