The basic approach is to use the following regression model, employing the notation from Definition 3 of Method of Least Squares for Multiple Regression:

ln Odds(E) = β0 + β1x1 + ⋯ + βkxk + ε
where the odds function is as given in the following definition.
Definition 1: Odds(E) is the odds that event E occurs, namely

Odds(E) = P(E) / P(E′) = P(E) / (1 – P(E))

For any p with 0 ≤ p < 1 (i.e. p is a probability value), we can define the odds function as

Odds(p) = p / (1 – p)
Observation: For our purposes, the odds function has the advantage of transforming the probability function, which takes values from 0 to 1, into an equivalent function with values between 0 and ∞. When we take the natural log of the odds function, we get a range of values from –∞ to ∞.
Definition 2: The logit function is the log of the odds function, namely logit(E) = ln Odds(E), or

logit(E) = ln [P(E) / (1 – P(E))]
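The two definitions above translate directly into code; here is a minimal Python sketch:

```python
import math

def odds(p):
    """Odds of an event with probability p (0 <= p < 1)."""
    return p / (1 - p)

def logit(p):
    """Log of the odds; defined for 0 < p < 1."""
    return math.log(odds(p))

# odds maps probabilities onto (0, inf); logit maps them onto (-inf, inf)
```

For example, odds(0.75) is 3 (i.e. 3-to-1 odds) and logit(0.5) is 0.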
Definition 3: Based on the logistic model as described above, we have

logit(π) = ln [π / (1 – π)] = β0 + β1x1 + ⋯ + βkxk

where π = P(E). It now follows that (see Exponentials and Logs):

π = e^(β0+β1x1+⋯+βkxk) / (1 + e^(β0+β1x1+⋯+βkxk)) = 1 / (1 + e^–(β0+β1x1+⋯+βkxk))
Here we switch to the model based on the observed sample (and so the π parameter is replaced by its sample estimate p, the βj coefficients are replaced by the sample estimates bj and the error term ε is dropped). For our purposes we take the event E to be that the dependent variable y has value 1. If y takes only the values 0 or 1, we can think of E as success and the complement E′ of E as failure, just as for the trials in a binomial distribution.
Just as for the regression model studied in Regression and Multiple Regression, a sample consists of n data elements of the form (yi, xi1, xi2, …, xik), but for logistic regression each yi only takes the value 0 or 1. Now let Ei = the event that yi = 1 and pi = P(Ei). Just as the regression line studied previously provides a way to predict the value of the dependent variable y from the values of the independent variables x1, …, xk, for logistic regression we have

pi = e^(b0+b1xi1+⋯+bkxik) / (1 + e^(b0+b1xi1+⋯+bkxik))
Note too that since the yi have a proportion distribution, by Property 2 of Proportion Distribution, var(yi) = pi (1 – pi).
Observation: In the case where k = 1, we have

p = e^(b0+b1x) / (1 + e^(b0+b1x)) = 1 / (1 + e^–(b0+b1x))
Such a curve has sigmoid shape:
Figure 1 – Sigmoid curve for p
The values of b0 and b1 determine the location, direction, and spread of the curve. The curve is symmetric about the point where x = –b0/b1; in fact, p = 0.5 for this value of x.
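For illustration, a short Python sketch of the k = 1 curve (with arbitrary coefficient values of my own choosing) confirms the symmetry claim:

```python
import math

def p(x, b0, b1):
    """p = 1 / (1 + e^-(b0 + b1*x)), the sigmoid curve for a single x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

b0, b1 = -4.0, 2.0      # arbitrary illustrative values
center = -b0 / b1       # x = 2.0, the point of symmetry, where p = 0.5
assert abs(p(center, b0, b1) - 0.5) < 1e-12
for d in (0.5, 1.0, 3.0):
    # points equidistant from the center have probabilities summing to 1
    assert abs(p(center + d, b0, b1) + p(center - d, b0, b1) - 1.0) < 1e-12
```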
Observation: Logistic regression is used instead of ordinary multiple regression because the assumptions required for ordinary regression are not met. In particular
- The assumption of the linear regression model that the values of y are normally distributed cannot be met since y only takes the values 0 and 1.
- The assumption of the linear regression model that the variance of y is constant across values of x (homogeneity of variances) also cannot be met with a binary variable. Since the variance is p(1–p), when 50 percent of the sample consists of 1s the variance is (.5)(.5) = .25, its maximum value. As we move to more extreme values the variance decreases: when p = .10 or .90, the variance is (.1)(.9) = .09, and as p approaches 0 or 1, the variance approaches 0.
- Using the linear regression model, the predicted values can become greater than one or less than zero if you move far enough along the x-axis. Such values are theoretically inadmissible for probabilities.
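The variance point in the second bullet is easy to check numerically; a minimal sketch with arbitrary probability values:

```python
def var(p):
    """Variance of a 0/1 (Bernoulli) variable with P(y = 1) = p."""
    return p * (1 - p)

# Variance peaks at p = .5 and shrinks toward 0 at the extremes
for p, expected in [(0.5, 0.25), (0.3, 0.21), (0.1, 0.09)]:
    assert abs(var(p) - expected) < 1e-12
```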
For the logistic model, the least squares approach to calculating the values of the coefficients bi cannot be used; instead, maximum likelihood techniques, as described below, are employed to find these values.
Definition 4: The odds ratio between two data elements x1 and x2 in the sample is defined as follows:

OR = Odds(x1) / Odds(x2)

Using the notation px = P(x), the log odds ratio of the estimates is defined as

ln OR = ln Odds(x1) – ln Odds(x2) = logit(px1) – logit(px2)
Observation: In the case where k = 1,

ln OR = logit(px1) – logit(px2) = (b0 + b1x1) – (b0 + b1x2) = b1(x1 – x2)

Furthermore, for any value of d,

Odds(x + d) / Odds(x) = e^(b1d)

Note too that when x is a dichotomous variable taking the values 0 and 1,

OR = Odds(1) / Odds(0) = e^b1
E.g. when x = 0 for male and x = 1 for female, e^b1 represents the odds ratio of females to males. If, for example, b1 = 2 and we are measuring the probability of getting cancer under certain conditions, then e^b1 = e^2 ≈ 7.4, which would mean that the odds of females getting cancer would be 7.4 times the odds for males under the same conditions.
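These relationships are easy to verify numerically. A small Python sketch (with arbitrary, illustrative coefficients) checks that the odds ratio over a step of size d depends only on b1 and d:

```python
import math

def p(x, b0, b1):
    """Logistic probability for the one-variable model."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def odds(x, b0, b1):
    q = p(x, b0, b1)
    return q / (1 - q)

b0, b1 = 0.5, 2.0   # arbitrary illustrative coefficients
for x in (0.0, 1.3, -2.0):
    for d in (1.0, 0.25):
        # Odds(x + d) / Odds(x) = e^(b1*d), independent of x and b0
        assert abs(odds(x + d, b0, b1) / odds(x, b0, b1) - math.exp(b1 * d)) < 1e-9

# For a 0/1 variable (a step of d = 1) the odds ratio is e^b1; here e^2 ≈ 7.39
```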
Observation: The model we will use is based on the binomial distribution, namely the probability that the sample data occurs as it does is given by

P = ∏ pi^yi (1 – pi)^(1–yi)

where the product ∏ is taken over i = 1, …, n.
Taking the natural log of both sides and simplifying we get the following definition.
Definition 5: The log-likelihood statistic is defined as follows:

LL = Σ [yi ln pi + (1 – yi) ln (1 – pi)]

where the sum Σ is taken over i = 1, …, n, the yi are the observed values and the pi are the corresponding theoretical values.
Observation: Our objective is to find the maximum value of LL assuming that the pi are as in Definition 3. This will enable us to find the values of the bi coefficients. It might be helpful to review Maximum Likelihood Function to better understand the rest of this topic.
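Definition 5 translates directly into code. A minimal Python sketch (with made-up y and p values) shows that LL rewards theoretical probabilities that match the observed outcomes:

```python
import math

def log_likelihood(ys, ps):
    """LL = sum over i of yi*ln(pi) + (1 - yi)*ln(1 - pi)."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for y, p in zip(ys, ps))

ys = [1, 0, 1]                               # made-up observed values
close = log_likelihood(ys, [0.9, 0.1, 0.8])  # pi close to the yi
far = log_likelihood(ys, [0.5, 0.5, 0.5])    # uninformative pi
assert close > far                           # LL is larger for the better fit
```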
Example 1: A sample of 760 people who received doses of radiation between 0 and 1000 rems was taken following a recent nuclear accident. Of these, 302 died, as shown in the table in Figure 2. Each row in the table represents the midpoint of an interval of 100 rems (i.e. 0–100, 100–200, etc.).
Figure 2 – Data for Example 1 plus probability and odds
Let Ei = the event that a person in the ith interval survived. The table also shows the probability P(Ei) and odds Odds(Ei) of survival for a person in each interval. Note that P(Ei) = the percentage of people in interval i who survived and

Odds(Ei) = P(Ei) / (1 – P(Ei))
In Figure 3 we plot the values of P(Ei) vs. i and ln Odds(Ei) vs. i. We see that the second of these plots is reasonably linear.
Given that there is only one independent variable (namely x = # of rems), we can use the following model:

p = e^(a+bx) / (1 + e^(a+bx))
Here we use coefficients a and b instead of b0 and b1 just to keep the notation simple.
We show two different methods for finding the values of the coefficients a and b: the first uses Excel’s Solver tool and the second uses Newton’s method. Before proceeding it might be worthwhile to click on Goal Seeking and Solver to review how to use Excel’s Solver tool, and on Newton’s Method to review how to apply Newton’s method. We will use both methods to maximize the value of the log-likelihood statistic as defined in Definition 5.
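Before turning to the Excel tools, the maximum likelihood idea can be sketched in code. The following Python sketch (my own illustration, using made-up 0/1 data rather than the Example 1 table) applies Newton’s method to maximize LL for the one-variable model p = e^(a+bx) / (1 + e^(a+bx)); the gradient and Hessian of LL have the closed forms noted in the comments.

```python
import math

def predict(x, a, b):
    """p = e^(a + bx) / (1 + e^(a + bx)) for a single predictor x."""
    return 1 / (1 + math.exp(-(a + b * x)))

def newton_logistic(xs, ys, iters=25):
    """Maximize LL = sum yi*ln(pi) + (1 - yi)*ln(1 - pi) by Newton's method.

    Gradient:  dLL/da = sum (yi - pi),  dLL/db = sum xi*(yi - pi).
    The negative Hessian uses the weights wi = pi*(1 - pi).
    """
    a = b = 0.0
    for _ in range(iters):
        ps = [predict(x, a, b) for x in xs]
        ga = sum(y - p for y, p in zip(ys, ps))
        gb = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        ws = [p * (1 - p) for p in ps]
        h11 = sum(ws)
        h12 = sum(w * x for w, x in zip(ws, xs))
        h22 = sum(w * x * x for w, x in zip(ws, xs))
        det = h11 * h22 - h12 * h12
        # Newton step: add (negative Hessian)^-1 times the gradient
        a += (h22 * ga - h12 * gb) / det
        b += (-h12 * ga + h11 * gb) / det
    return a, b

# Made-up data: larger x makes y = 1 more likely
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 1, 1, 1]
a, b = newton_logistic(xs, ys)
```

At the maximum, the gradient is essentially zero, so Σ(yi – pi) ≈ 0 and Σ xi(yi – pi) ≈ 0; these are the stationarity conditions that both the Solver and Newton’s method approaches below aim to satisfy.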
Sample Size: The recommended minimum sample size for logistic regression is given by 10k/q where k = the number of independent variables and q = the smaller of the percentage of cases with y = 0 or y = 1, with a minimum of 100.
For Example 1, k = 1 and q = 302/760 = .397, and so 10k/q = 25.17. Thus a minimum sample of size 100 is recommended.
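The rule above is simple enough to encode; a short Python sketch (the function name is my own):

```python
def min_sample_size(k, q):
    """Recommended minimum n for logistic regression: 10k/q, but at least 100."""
    return max(10 * k / q, 100)

# Example 1: k = 1, q = 302/760, so the floor of 100 applies
assert min_sample_size(1, 302 / 760) == 100
```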