
Section 34.1 Penalty method

A typical problem of constrained optimization is to find the minimum of a function \(f\) on some region \(D\) of \(\mathbb R^n\text{.}\) One approach is to add a penalty for being outside of \(D\text{,}\) for example let

\begin{equation} g(x) = \begin{cases} f(x) & x\in D \\ f(x) + M & x\notin D \end{cases} \label{eq-constant-penalty}\tag{34.1.1} \end{equation}

where \(M\) is a large number, say \(10^6\text{.}\) Then try to minimize \(g\text{,}\) with the expectation that the minimum will be in \(D\text{.}\) Since \(g=f\) within \(D\text{,}\) such a minimum will also be a minimum of \(f\) on \(D\text{.}\)

The constant penalty (34.1.1) has several disadvantages. It makes \(g\) discontinuous, and it does not eliminate local minima outside of \(D\text{.}\) Since minimization methods often converge to a local minimum, we may end up outside of \(D\) even though there are much smaller values within \(D\text{.}\)
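Both disadvantages can be seen in a small one-dimensional experiment. The following sketch (with a hypothetical example function chosen for illustration) minimizes \(f(x) = (x-2)^2\) on \(D = [-1, 1]\) using the constant penalty (34.1.1); the unconstrained minimum of \(f\) at \(x = 2\) remains a local minimum of \(g\) outside of \(D\text{.}\)

```matlab
% Constant penalty (34.1.1) in one dimension: f(x) = (x - 2)^2 on D = [-1, 1].
% The constrained minimum is at x = 1, but g keeps a local minimum at x = 2.
f = @(x) (x - 2).^2;
M = 1e6;
g = @(x) f(x) + M*(abs(x) > 1);   % add M outside of D

% Started near the spurious local minimum, fminsearch stays outside of D.
xm = fminsearch(g, 1.8);
fprintf('Found x = %g, g(x) = %g \n', xm, g(xm));
```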

A way to correct both of the above issues is to add to \(f\) a continuous penalty function \(P\) that is equal to \(0\) inside \(D\) and is positive outside of \(D\text{,}\) and then minimize \(f+P\text{.}\) One way to construct \(P\) is to describe the region \(D\) by way of an inequality \(c\le 0\text{;}\) for example, \(x^2 + y^2 - 1 \le 0\) describes the disk of radius \(1\text{.}\) There are multiple ways to use such an inequality to form a penalty function. For example, one can use the linear penalty

\begin{equation} P_1(x) = M\max(0, c)\label{eq-linear-penalty}\tag{34.1.2} \end{equation}

(with a large constant \(M\)) or the quadratic penalty

\begin{equation} P_2(x) = M\max(0, c)^2\label{eq-quadratic-penalty}\tag{34.1.3} \end{equation}

The linear penalty is continuous but not differentiable on the boundary of \(D\text{.}\) The quadratic penalty is differentiable, which is preferable for methods that rely on the gradient. Indeed, the gradient of (34.1.3) is

\begin{equation} \nabla P_2(x) = 2 M \max(0, c) \nabla c\label{eq-quadratic-penalty-grad}\tag{34.1.4} \end{equation}

On the other hand, the fact that \(P_2\) starts off with zero slope makes it more likely that the minimum of \(f + P_2\) will be slightly outside of the region \(D\text{.}\) So there is a tradeoff between reaching the minimum and enforcing the constraint.
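The gradient formula (34.1.4) can be checked numerically. The sketch below (a minimal sanity check, with \(M = 1\) and a test point chosen for illustration) compares the formula against a central finite difference for \(c(x, y) = x^2 + y^2 - 1\text{.}\)

```matlab
% Compare the gradient formula (34.1.4) with a finite difference.
M  = 1;
c  = @(x) x(1)^2 + x(2)^2 - 1;
P2 = @(x) M*max(0, c(x))^2;          % quadratic penalty (34.1.3)
cg = @(x) [2*x(1); 2*x(2)];          % gradient of c
P2g = @(x) 2*M*max(0, c(x))*cg(x);   % formula (34.1.4)

x = [1.2; -0.3];                     % a point outside of the disk
h = 1e-6;
fd = [(P2(x + [h; 0]) - P2(x - [h; 0]))/(2*h);
      (P2(x + [0; h]) - P2(x - [0; h]))/(2*h)];
disp([P2g(x), fd])                   % the two columns should agree
```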

Let us try both methods with a simple function: minimize \(f(x, y) = (x + 7y)^3\) on the disk \(x^2 + y^2\le 1\text{.}\) We could use our own implementation of the Nelder-Mead method from Chapter 33, but fminsearch already implements it, so it is used below.

Minimize \(f(x, y) = (x + 7y)^3 \) on the disk \(x^2 + y^2\le 1\) by using fminsearch with a penalty term.

Solution
f = @(x) (x(1) + 7*x(2)).^3;      % objective function
c = @(x) x(1).^2 + x(2).^2 - 1;   % constraint function: c <= 0 on the unit disk
M = 1e6;                          % penalty constant

x0 = randn(2, 1);                 % random starting point
xm1 = fminsearch(@(x) f(x) + M*max(c(x), 0), x0);     % linear penalty (34.1.2)
xm2 = fminsearch(@(x) f(x) + M*max(c(x), 0).^2, x0);  % quadratic penalty (34.1.3)

fprintf('Linear penalty: found x = (%g, %g), |x| = %g, f(x) = %g \n', xm1, norm(xm1), f(xm1));
fprintf('Quadratic penalty: found x = (%g, %g), |x| = %g, f(x) = %g \n', xm2, norm(xm2), f(xm2));

With the linear penalty, the minimization process sometimes fails to converge within the allowed number of function evaluations; but when it does converge, the point found stays in the unit disk. With the quadratic penalty, convergence is more reliable, but the norm of the minimum point is usually slightly greater than \(1\text{.}\)

In practice, the quadratic penalty is usually preferable, and the issue of the minimum being slightly out of bounds is dealt with by increasing \(M\) and using the previously found point of minimum as a new starting point. This is done below.

Minimize \(f(x, y) = (x + 7y)^3 \) on the disk \(x^2 + y^2\le 1\) by iteratively using fminsearch with a quadratic penalty term.

Solution
f = @(x) (x(1) + 7*x(2)).^3;      % objective function
c = @(x) x(1).^2 + x(2).^2 - 1;   % constraint function: c <= 0 on the unit disk
x0 = randn(2, 1);                 % random starting point

for M = [1e3, 1e6, 1e9, 1e12]     % increase M, restarting from the previous minimum
    x0 = fminsearch(@(x) f(x) + M*max(c(x), 0).^2, x0);
end

fprintf('Found x = (%g, %g), |x| = %g, f(x) = %g \n', x0, norm(x0), f(x0));

Note that the penalty method works just as well with equality constraints of the form \(c = 0\text{.}\) In this case one adds \(Mc^2\) instead of \(M\max(0, c)^2\) to the objective function. If there are multiple constraints, the corresponding penalty terms are all added together.
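As a sketch of the equality-constrained case, the following applies the same iterative scheme to a hypothetical example chosen for illustration: minimize \(f(x, y) = x + y\) on the circle \(x^2 + y^2 = 1\) (whose exact minimum is at \((-1/\sqrt 2, -1/\sqrt 2)\)).

```matlab
% Equality constraint via penalty: minimize x + y subject to x^2 + y^2 = 1.
f = @(x) x(1) + x(2);             % objective function
c = @(x) x(1).^2 + x(2).^2 - 1;   % equality constraint: c = 0
x0 = randn(2, 1);                 % random starting point

for M = [1e3, 1e6, 1e9, 1e12]     % increase M, restarting from the previous minimum
    x0 = fminsearch(@(x) f(x) + M*c(x).^2, x0);   % penalty M*c^2, no max needed
end

fprintf('Found x = (%g, %g), |x| = %g \n', x0, norm(x0));
```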