Section 27.1 Overview of least squares
Least squares fitting is about more than just fitting a curve to some points, as in Example 6.5.1. In its general form we have a vector \(\mathbf y\) of \(n\) observations and \(p\) vectors of explanatory variables, say \(\mathbf v_1, \dots, \mathbf v_p\text{.}\) Our goal is to approximate \(\mathbf y\) by a linear combination of the vectors \(\mathbf v_k\text{,}\) say \(\sum_{k=1}^p \beta_k \mathbf v_k\text{.}\) What are these vectors in the polynomial model?
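For a polynomial model of degree \(d\text{,}\) the explanatory vectors are the powers \(1, x, x^2, \dots, x^d\) evaluated at the sample points \(x_1, \dots, x_n\text{.}\) A minimal Matlab sketch, with made-up data, of assembling these vectors as the columns of a matrix for a cubic model:

% Made-up sample data for illustration
x = (0:0.5:3)';                             % sample points (column vector)
y = [1.1; 1.8; 2.9; 4.2; 6.1; 8.3; 11.2];   % observations at those points

% Explanatory vectors for a cubic model: 1, x, x.^2, x.^3 at the sample points
X = [x.^0, x, x.^2, x.^3];                  % n-by-4 matrix with v_1, ..., v_4 as columns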
If \(\mathbf z = \sum_{k=1}^p \beta_k \mathbf v_k\) is the best prediction, then the difference \(\mathbf z - \mathbf y\) is orthogonal to each \(\mathbf v_k\text{.}\) That is,
\begin{equation}
\mathbf z \cdot \mathbf v_k = \mathbf y \cdot \mathbf v_k \tag{27.1.1}
\end{equation}
for each \(k\text{.}\) These are \(p\) linear equations for the \(p\) parameters. Let us express them in matrix form. Let \(X\) be the \(n\times p\) matrix with the vectors \(\mathbf v_1, \dots, \mathbf v_p\) as columns. Then \(X^T\mathbf y\) is the column vector with entries \(\mathbf y\cdot \mathbf v_k\text{,}\) matching the right hand side of (27.1.1). The left hand side of (27.1.1) involves \(\mathbf z = X\beta\text{,}\) where \(\beta\) is the vector of coefficients we are looking for. With this, we recognize that \(X^TX \beta\) is the vector whose entries match the left hand side. In conclusion, the matrix form of (27.1.1) is
\begin{equation}
X^T X \beta = X^T \mathbf y \tag{27.1.2}
\end{equation}
which is known as the normal equations associated with the overdetermined system \(X\beta = \mathbf y\text{.}\) Note that \(X^TX\) is invertible if and only if the explanatory vectors are linearly independent (which they should be; otherwise some of them are redundant). However, in practice we do not obtain \(\beta\) by solving the system (27.1.2): it is more efficient for Matlab to find it directly from the overdetermined system \(X\beta = \mathbf y\text{.}\) The command beta = X\y does this using a QR decomposition (covered in an Applied Linear Algebra course).
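As an illustration, reusing the made-up X and y from the sketch above, the two approaches give the same coefficients, and the residual is orthogonal to every column of \(X\) as required by (27.1.1):

beta = X\y                    % least squares solution, computed via QR decomposition
beta_ne = (X'*X)\(X'*y)       % same coefficients obtained from the normal equations (27.1.2)
z = X*beta;                   % best prediction z, a linear combination of the columns of X
X'*(y - z)                    % entries are numerically zero: residual is orthogonal to each v_k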
The goodness of fit can be measured in several ways. One indicator is the norm of the residual vector \(\mathbf y-\mathbf z\text{,}\) or better yet, the squared norm
\begin{equation}
\|\mathbf y - \mathbf z\|^2 = \sum_{k=1}^n (y_k - z_k)^2 \tag{27.1.3}
\end{equation}
However, this quantity has the units of \(y_k\) squared, and a unit-free quantity is preferable. A popular approach is to compare the residual sum of squares (27.1.3) to the total sum of squares
\begin{equation}
\sum_{k=1}^n (y_k - \bar y)^2, \qquad \text{where } \bar y = \frac{1}{n}\sum_{k=1}^n y_k. \tag{27.1.4}
\end{equation}
We want our model to predict \(y_k\) better than the baseline guess \(\bar y\) (which does not use any explanatory variables). So, if (27.1.3) is much smaller than (27.1.4), the fit is good. The goodness of fit is then measured by the coefficient of determination
\[
R^2 = 1 - \frac{\sum_{k=1}^n (y_k - z_k)^2}{\sum_{k=1}^n (y_k - \bar y)^2}
\]
which shows how much of the variability in the observations is predicted by the model. The value does not exceed 1, with \(R^2 = 1\) indicating an exact fit and values close to 0 indicating a rather useless model. The quantity \(R^2\) may even be negative if the model does a worse job than the constant predictor \(\bar y\text{.}\) The notation \(R^2\) comes from the fact that for linear regression \(y=ax+b\) (without a testing-training split), it is the square of the correlation coefficient. But in general it is not the square of anything.
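Continuing the sketch above (with the made-up X, y, and the coefficients beta found earlier), \(R^2\) can be computed directly from its definition:

z = X*beta;                    % model predictions
RSS = sum((y - z).^2);         % residual sum of squares (27.1.3)
TSS = sum((y - mean(y)).^2);   % total sum of squares (27.1.4)
R2 = 1 - RSS/TSS               % coefficient of determination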