
Section 27.2 Overfitting, training and testing

Recall Example 6.5.1, in which we found a best-fitting parabola for given data. The code is extended below to allow an arbitrary degree \(d\) of the polynomial and to add the computation of \(R^2\) according to (27.1.5). Note the use of column vectors below.

Improve Example 6.5.1 to work with polynomials of any degree \(d\text{,}\) and add a computation of \(R^2\) to it.

Solution
x = (0:14)';  
y = [9 8 5 5 3 1 2 5 4 7 8 8 7 7 8]';  % data
d = 2;                     % degree of polynomial
X = x.^(0:d);              % matrix of linear system for parameters
beta = X\y;                % optimal parameters
f = @(x) (x.^(0:d))*beta;  % best-fitting function f

t = linspace(min(x), max(x), 1000)';
plot(t, f(t), 'b', x, y, 'r*')

total = norm(y - mean(y))^2;     % total sum of squares
residual = norm(y - f(x))^2;     % residual sum of squares
fprintf('R^2 for degree %d is %g\n', d, 1 - residual/total);

If we increase \(d\) toward \(7\text{,}\) the curve gets closer to the points, but it also becomes more wiggly. Most importantly, the prediction it makes for \(x=10\text{,}\) that is \(f(10)\text{,}\) becomes less realistic. This is overfitting: the situation in which our model \(f\) mostly memorizes the data instead of learning from it. (Food for thought from xkcd)
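To see this effect numerically, one can rerun the fit for several degrees and print both \(R^2\) and \(f(10)\text{.}\) This is a minimal sketch reusing the data above, with the plotting omitted.

x = (0:14)';
y = [9 8 5 5 3 1 2 5 4 7 8 8 7 7 8]';
total = norm(y - mean(y))^2;       % total sum of squares, the same for every degree
for d = 2:7
    X = x.^(0:d);                  % matrix of linear system for this degree
    beta = X\y;                    % optimal parameters
    f = @(x) (x.^(0:d))*beta;      % best-fitting polynomial of degree d
    residual = norm(y - f(x))^2;
    fprintf('degree %d: R^2 = %g, f(10) = %g\n', d, 1 - residual/total, f(10))
end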

A simple but effective way to combat overfitting is to split the data into two parts: one (the training set) is used to compute the coefficients \(\beta_k\text{,}\) and the other (the testing set) is used to evaluate the goodness of fit. The split may be done randomly, perhaps in a 50:50 or 60:40 proportion; a randomized split is sketched after the two options below. Which of the following 50:50 splits looks better?

x_train = x(1:8); 
x_test = x(9:end);

(and similarly for y) or

x_train = x(1:2:end); 
x_test = x(2:2:end);
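For comparison, a randomized 50:50 split could look as follows. This is a minimal sketch reusing the vectors x and y from the example above; randperm produces a random permutation of the indices.

n = numel(x);                  % number of data points
idx = randperm(n);             % random permutation of 1..n
half = floor(n/2);
x_train = x(idx(1:half));      % randomly chosen training points
y_train = y(idx(1:half));
x_test  = x(idx(half+1:end));  % remaining points for testing
y_test  = y(idx(half+1:end));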

Use the training-testing split in Example 27.2.1.

Solution
x = (0:14)';  
y = [9 8 5 5 3 1 2 5 4 7 8 8 7 7 8]';
x_train = x(1:2:end);
y_train = y(1:2:end);
x_test = x(2:2:end);
y_test = y(2:2:end);

d = 2;                      % degree of polynomial

X = x_train.^(0:d);         % matrix of linear system, built from training data only
beta = X\y_train;           % optimal parameters fitted on the training set
f = @(x) (x.^(0:d))*beta;   % best-fitting function f

t = linspace(min(x), max(x), 1000)';
plot(t, f(t), 'b', x, y, 'r*')

total = norm(y_test - mean(y_test))^2;   % total sum of squares on the testing set
residual = norm(y_test - f(x_test))^2;   % residual sum of squares on the testing set
fprintf('R^2 for degree %d is %g\n', d, 1 - residual/total);

An additional detail: when one reports the model's performance (not in this class, but perhaps for a publication), the reported value of \(R^2\) or another estimate of goodness of fit should be computed on a separate validation set which was used neither for training nor for testing. Using the testing set to report \(R^2\) introduces a subtle source of bias: if many models are tested and one of them wins, it is more likely than not that the winning model “had better luck” with the specific data set used for testing, which would lead us to overestimate its goodness of fit. (Here is an example of testing too many models and not validating.)
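To make the three roles concrete, here is a minimal sketch of a random three-way split; the 60:20:20 proportion is an illustrative assumption, not something prescribed above.

n = numel(x);
idx = randperm(n);                         % shuffle the indices
n_train = floor(0.6*n);                    % assumed 60:20:20 proportion
n_test  = floor(0.2*n);
train = idx(1:n_train);                    % used to compute the coefficients beta
test  = idx(n_train+1 : n_train+n_test);   % used to compare candidate models
valid = idx(n_train+n_test+1 : end);       % held out; used only for the reported R^2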