Section 36.1 Classification and logistic regression

We focus on the case of two categories, represented by the binary outcome variable \(z\) attaining two possible values, \(0\) and \(1\text{.}\) The model may involve one or more explanatory variables \(x_k\text{,}\) and the goal is to predict the outcome based on these variables. Some examples:

  • Explanatory variables: temperature, humidity, atmospheric pressure. Outcome: rain (1) or no rain (0).
  • Explanatory variables: patient's weight, height, age, activity level. Outcome: has diabetes (1) or not (0).

In examples such as these, it is understood that the prediction may be far from certain. One way to think of such a model is that it assigns a probability to the outcome \(1\text{,}\) based on the available data. This leads to the idea of a model being a function that takes values between \(0\) and \(1\text{.}\) The standard logistic function (Section 29.2) has this property:

\begin{equation*} \sigma(x) = \frac{1}{1+\exp(-x)} \end{equation*}
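
For instance, a quick check of this property in Matlab (the name sigma is ours):

sigma = @(x) 1./(1+exp(-x));   % standard logistic function, with values in (0,1)
sigma([-5 0 5])                % approximately 0.0067, 0.5000, 0.9933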

In order to include several explanatory variables \(x_1, \dots, x_m\) in this model, we use the composite function

\begin{equation} h_{\mathbf c}(\mathbf x) = \sigma\left(\sum_{k=1}^m c_k x_k\right) = \frac{1}{1+\exp\left(-\sum_{k=1}^m c_k x_k\right)}\label{eq-multivariate-logistic}\tag{36.1.1} \end{equation}

where the coefficients \(c_1, \dots, c_m\) are to be determined by training the model on some data set with known outcomes.
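
In Matlab, the composite function (36.1.1) could be written as follows (a sketch; the names h, c, x are ours, with c and x vectors of the same length):

h = @(c,x) 1./(1+exp(-dot(c,x)));   % the model (36.1.1)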

In Chapter 29 our training method was to minimize the sum of squares

\begin{equation*} S(\mathbf c) = \sum_{k=1}^n (z_k - h_{\mathbf c}(\mathbf x_k))^2 \end{equation*}
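
As a sketch of that earlier approach, assuming the data points are stored as rows of a matrix X and the observed outcomes form a column vector z (illustrative names, not from the original):

S = @(c) sum((z - 1./(1+exp(-X*c))).^2);    % sum of squared residuals
c_ls = fminsearch(S, zeros(size(X,2), 1));  % minimize, starting from zero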

However, in the context of probability there is a more natural method: maximizing the likelihood function. Think of \(h_{\mathbf c}(\mathbf x_k)\) as the probability that \(z_k=1\text{,}\) and therefore \(1 - h_{\mathbf c}(\mathbf x_k)\) is the probability that \(z_k=0\text{.}\) Assuming the independence of outcomes, the probability of observing the data that was actually observed is

\begin{equation} L(\mathbf c) = \prod_{z_k = 1} h_{\mathbf c}(\mathbf x_k) \cdot \prod_{z_k = 0} (1-h_{\mathbf c}(\mathbf x_k))\label{eq-likelihood-function}\tag{36.1.2} \end{equation}

Since the outcomes \(z_k\) already occurred, this expression is not really the probability of a random event; it is the likelihood of the parameters being \(\mathbf c\text{.}\) We want to maximize this likelihood. Since the product may be very small when there are many observations, it is easier to maximize the log-likelihood, the logarithm of (36.1.2):

\begin{equation} \log L(\mathbf c) = \sum_{z_k = 1} \log h_{\mathbf c}(\mathbf x_k) + \sum_{z_k = 0} \log (1-h_{\mathbf c}(\mathbf x_k))\label{eq-log-likelihood-function}\tag{36.1.3} \end{equation}
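
Since each \(z_k\) is either \(0\) or \(1\text{,}\) (36.1.3) can also be written as the single sum \(\sum_k \left(z_k \log h_{\mathbf c}(\mathbf x_k) + (1-z_k)\log(1-h_{\mathbf c}(\mathbf x_k))\right)\text{,}\) which is convenient in code. A sketch, with the same illustrative X and z as above:

p = @(c) 1./(1+exp(-X*c));                           % predicted probabilities
LogL = @(c) sum(z.*log(p(c)) + (1-z).*log(1-p(c)));  % log-likelihood (36.1.3)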

Once we find \(\mathbf c\) which maximizes \(\log L\text{,}\) the model can be used to make predictions about unobserved outcomes on the basis of the explanatory variables \(\mathbf x\text{.}\) If \(h_{\mathbf c}(\mathbf x)\) is close to \(1\text{,}\) we expect the outcome \(z=1\text{;}\) if it is close to \(0\text{,}\) we expect the outcome \(z=0\text{.}\) The quantity \(h_{\mathbf c}(\mathbf x)\) may be interpreted as the probability of \(z=1\) predicted by the model.

Unless the explanatory variables \(\mathbf x\) are centered (have zero mean by design), the sum in the exponential of (36.1.1) should include a constant term, as shown in the following example.

The following data, taken from Matlab's built-in patients data sample (reference), records the ages and systolic pressures of 10 smokers (\(z=1\)) and 12 nonsmokers (\(z=0\)). Train a logistic regression model on this data.

age1 = [38 33 39 48 32 27 44 28 30 45];
sys1 = [124 130 130 130 124 123 128 129 127 134];
age0 = [43 38 40 49 46 40 28 31 45 42 25 36];
sys0 = [109 125 117 122 121 115 115 118 114 115 127 114];

Then test the model on a separate data set, generating predictions for the people based on their age and systolic pressure:

age = [38 45 30 48 48 25 44 49 45 48];
sys = [138 124 130 123 129 128 124 119 136 114];

Finally, compare the model's prediction with actual smoker status in the test dataset.

actual = [1 0 0 0 0 1 1 0 1 0];
Solution

We set up the log-likelihood function LogL, maximize it, and use the optimal parameters cc to predict the status of the 10 people who were not part of the training set.

% Log-likelihood (36.1.3), with a constant term c(3) added to the exponential
LogL = @(c) sum(log(1./(1+exp(-c(1)*age1-c(2)*sys1-c(3))))) ...
    + sum(log(1 - 1./(1+exp(-c(1)*age0-c(2)*sys0-c(3)))));
cc = fminsearch(@(c) -LogL(c), [0; 0; 0]);   % maximize LogL by minimizing -LogL

age = [38 45 30 48 48 25 44 49 45 48];
sys = [138 124 130 123 129 128 124 119 136 114];
prediction = 1./(1+exp(-cc(1)*age-cc(2)*sys-cc(3)));   % predicted probability of z=1
actual = [1 0 0 0 0 1 1 0 1 0];
disp([prediction' actual']);   % compare predictions with actual status side by side

Excluding the cases where the prediction is close to \(0.5\) (which should be treated as a “don't know” answer), the model got 6 out of 8 right.
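
One way to tabulate this outcome (the width of the “don't know” band around \(0.5\) is our choice here, not part of the original computation, and may be adjusted):

decided = abs(prediction - 0.5) > 0.1;    % exclude near-0.5 predictions
correct = (prediction > 0.5) == actual;   % predicted class vs actual status
fprintf('%d out of %d right\n', sum(correct & decided), sum(decided));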