The following data, taken from Matlab’s built-in patients data sample (reference), records the ages and systolic blood pressures of 10 smokers (\(z=1\)) and 12 nonsmokers (\(z=0\)). Train a logistic regression model on this data.
age1 = [38 33 39 48 32 27 44 28 30 45];
sys1 = [124 130 130 130 124 123 128 129 127 134];
age0 = [43 38 40 49 46 40 28 31 45 42 25 36];
sys0 = [109 125 117 122 121 115 115 118 114 115 127 114];
Then test the model on a separate data set, generating a prediction for each person based on their age and systolic pressure:
age = [38 45 30 48 48 25 44 49 45 48];
sys = [138 124 130 123 129 128 124 119 136 114];
Finally, compare the model’s predictions with the actual smoker status in the test data set.
actual = [1 0 0 0 0 1 1 0 1 0];
Solution.
We set up the log-likelihood function LogL, maximize it, and use the optimal parameters cc to predict the status of the 10 people who were not a part of the training set.
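Written out as a formula, the model and the quantity being maximized can be sketched as follows. The symbols \(a_i\) and \(s_i\) (age and systolic pressure of person \(i\)) and \(p_i\) (the modeled probability that person \(i\) is a smoker) are notation introduced here for illustration; the parameters \(c_1, c_2, c_3\) correspond to c(1), c(2), c(3) in the code.
\[
p_i = \frac{1}{1+\exp\left(-c_1 a_i - c_2 s_i - c_3\right)}
\]
\[
\log L(c) = \sum_{i:\, z_i=1} \log p_i + \sum_{i:\, z_i=0} \log\left(1-p_i\right)
\]
The first sum runs over the smokers and the second over the nonsmokers. Since fminsearch minimizes its argument, the code passes it the negated log-likelihood.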
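% Log-likelihood: smokers contribute log(p), nonsmokers log(1-p)
LogL = @(c) sum(log(1./(1+exp(-c(1)*age1-c(2)*sys1-c(3))))) ...
    + sum(log(1 - 1./(1+exp(-c(1)*age0-c(2)*sys0-c(3)))));
% Maximize LogL by minimizing its negative
cc = fminsearch(@(c) -LogL(c), [0; 0; 0]);
% Test data
age = [38 45 30 48 48 25 44 49 45 48];
sys = [138 124 130 123 129 128 124 119 136 114];
% Predicted probability of being a smoker
prediction = 1./(1+exp(-cc(1)*age-cc(2)*sys-cc(3)));
actual = [1 0 0 0 0 1 1 0 1 0];
% Show predictions next to actual status
disp([prediction' actual']);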
Excluding the cases where prediction is a number close to \(0.5\) (which should be considered a “don’t know” answer), the model got 6 out of 8 right.
