Section 36.1 Classification and logistic regression
We focus on the case of two categories, represented by a binary outcome variable \(z\) that takes one of two values, 0 or 1. The model may involve one or more explanatory variables \(x_k\), and the goal is to predict the outcome from these variables. Some examples:
- Explanatory variables: temperature, humidity, atmospheric pressure. Outcome: rain (1) or no rain (0).
- Explanatory variables: patient's weight, height, age, activity level. Outcome: has diabetes (1) or not (0).
Example 36.1.1. Smoking, age and blood pressure.
The following data, taken from Matlab's built-in patients data sample (reference), records the ages and systolic pressures of 10 smokers (\(z=1\)) and 12 nonsmokers (\(z=0\)). Train a logistic regression model on this data.
age1 = [38 33 39 48 32 27 44 28 30 45];
sys1 = [124 130 130 130 124 123 128 129 127 134];
age0 = [43 38 40 49 46 40 28 31 45 42 25 36];
sys0 = [109 125 117 122 121 115 115 118 114 115 127 114];
Then test the model on a separate data set, generating predictions for the people based on their age and systolic pressure:
age = [38 45 30 48 48 25 44 49 45 48];
sys = [138 124 130 123 129 128 124 119 136 114];
Finally, compare the model's predictions with the actual smoker status in the test data set.
actual = [1 0 0 0 0 1 1 0 1 0];
We set up the log-likelihood function LogL, maximize it, and use the optimal parameters cc to predict the status of the 10 people who were not a part of the training set.
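For reference, the function being maximized can be written out as follows. The notation is an assumption made here for illustration: \(a_i\) and \(s_i\) stand for the age and systolic pressure of person \(i\), and \(p_i\) for the modeled probability that person \(i\) is a smoker, with parameters \(c_1, c_2, c_3\) matching c(1), c(2), c(3) in the code.

\[
p_i = \frac{1}{1+\exp(-c_1 a_i - c_2 s_i - c_3)}, \qquad
\log L(c) = \sum_{i:\, z_i = 1} \log p_i + \sum_{i:\, z_i = 0} \log (1 - p_i)
\]

The first sum runs over the 10 smokers and the second over the 12 nonsmokers. Since fminsearch minimizes, the code passes it the negative of this function.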
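% Log-likelihood: smokers contribute log(p), nonsmokers contribute log(1-p)
LogL = @(c) sum(log(1./(1+exp(-c(1)*age1-c(2)*sys1-c(3))))) + sum(log(1 - 1./(1+exp(-c(1)*age0-c(2)*sys0-c(3)))));
% fminsearch minimizes, so pass the negative log-likelihood
cc = fminsearch(@(c) -LogL(c), [0; 0; 0]);
% Test data
age = [38 45 30 48 48 25 44 49 45 48];
sys = [138 124 130 123 129 128 124 119 136 114];
% Predicted probability of being a smoker
prediction = 1./(1+exp(-cc(1)*age-cc(2)*sys-cc(3)));
actual = [1 0 0 0 0 1 1 0 1 0];
% Show predictions next to the actual status
disp([prediction' actual']);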
Excluding the cases where the prediction is a number close to \(0.5\) (which should be considered a "don't know" answer), the model got 6 out of 8 right.
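The counting described above can also be done in code. The following is a minimal sketch rather than part of the original example: it refits the same model and treats a prediction as "don't know" when it lies within an assumed margin of \(0.5\). The value margin = 0.1 and the names p, margin, decided, label and correct are illustrative choices, and the resulting counts depend on the margin chosen.

```matlab
% Training data from Example 36.1.1
age1 = [38 33 39 48 32 27 44 28 30 45];
sys1 = [124 130 130 130 124 123 128 129 127 134];
age0 = [43 38 40 49 46 40 28 31 45 42 25 36];
sys0 = [109 125 117 122 121 115 115 118 114 115 127 114];

% Test data and known smoker status
age = [38 45 30 48 48 25 44 49 45 48];
sys = [138 124 130 123 129 128 124 119 136 114];
actual = [1 0 0 0 0 1 1 0 1 0];

% Fit the model as in the example; p is the logistic probability of z = 1
p = @(c, a, s) 1./(1 + exp(-c(1)*a - c(2)*s - c(3)));
LogL = @(c) sum(log(p(c, age1, sys1))) + sum(log(1 - p(c, age0, sys0)));
cc = fminsearch(@(c) -LogL(c), [0; 0; 0]);
prediction = p(cc, age, sys);

margin = 0.1;                                % assumed width of the "don't know" band
decided = abs(prediction - 0.5) >= margin;   % keep only the confident predictions
label = prediction > 0.5;                    % predicted class: 1 = smoker
correct = sum(label(decided) == actual(decided));
fprintf('%d decided, %d correct\n', sum(decided), correct);
```

A wider margin leaves fewer decided cases but makes each of them more trustworthy. A narrower margin forces the model to commit on borderline cases.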