Suppose there are r + 1 possible outcomes for the dependent variable, 0, 1, …, r, with r > 1. Pick one of the outcomes as the reference outcome and conduct r pairwise logistic regressions between this outcome and each of the other outcomes. For our purposes we will assume that 0 is the reference outcome. The binary logistic regression model for the outcome h, with h ≠ 0, is defined by
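As in the binary case, bhj denotes the coefficient of the jth independent variable in the model for outcome h:

$$\ln\frac{p_{ih}}{p_{i0}} = b_{h0} + b_{h1}x_{i1} + \cdots + b_{hk}x_{ik}$$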
Here pih is the probability that the ith sample has outcome h. Taking the exponential of both sides of the above equation yields the equivalent expression
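Namely, collecting the linear term into a single sum whose j = 0 term supplies the intercept:

$$\frac{p_{ih}}{p_{i0}} = e^{\sum_{j=0}^{k} b_{hj}x_{ij}}$$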
where we define xi0 = 1 (in order to keep our notation simple). Now define the quantities zih as shown below.
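Presumably zih abbreviates the exponential term for outcome h:

$$z_{ih} = e^{\sum_{j=0}^{k} b_{hj}x_{ij}} \qquad h = 1, \ldots, r$$

Since the r + 1 probabilities for sample i sum to 1, the probabilities can be recovered from the zih:

$$p_{i0} = \frac{1}{1 + \sum_{l=1}^{r} z_{il}} \qquad\qquad p_{ih} = \frac{z_{ih}}{1 + \sum_{l=1}^{r} z_{il}} \quad (h = 1, \ldots, r)$$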
Whereas the model used in the binary case with only two outcomes is based on a binomial distribution, when there are more than two outcomes the model we use is based on the multinomial distribution. Thus, the probability that the sample data occurs as it does is given by
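For ungrouped data, where each sample has exactly one observed outcome, this probability is

$$L = \prod_{i=1}^{n}\prod_{h=0}^{r} p_{ih}^{\,y_{ih}}$$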
where the yih are the observed values (yih = 1 if the ith sample has outcome h and yih = 0 otherwise) while the pih are the corresponding theoretical values.
Taking the natural log of both sides and simplifying, we get the following definition.
Definition 1: The log-likelihood statistic for multinomial logistic regression is defined as follows:
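Taking the natural log of the likelihood L above:

$$LL = \ln L = \sum_{i=1}^{n}\sum_{h=0}^{r} y_{ih}\ln p_{ih}$$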
Observation: The multinomial counterparts to Properties 1 and 2 of Finding Logistic Regression Coefficients using Newton’s Method are as follows.
Property 1: For each h > 0, let Bh = [bhj] be the (k+1) × 1 column vector of binary logistic regression coefficients of the outcome h compared to the reference outcome 0 and let B be the r(k+1) × 1 column vector consisting of the elements in B1, …, Br arranged in a column.
Also let X be the n × (k+1) design matrix (as described in Definition 3 of Least Squares for Multiple Regression). For outcomes h and l, let Vhl be the n × n diagonal matrix whose main diagonal contains elements of the form
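These weights parallel the binary case, where the diagonal of V contains pi(1 − pi); here the ith diagonal element of Vhl is

$$p_{ih}(1-p_{ih}) \;\text{ if } h = l \qquad\qquad -\,p_{ih}\,p_{il} \;\text{ if } h \neq l$$

i.e. $p_{ih}(\delta_{hl} - p_{il})$, where $\delta_{hl} = 1$ when $h = l$ and 0 otherwise.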
and let Chl = XTVhlX. Now define the nr × nr block matrix V = [Vhl], together with the matrices C and S shown below.
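A reconstruction consistent with the binary case and with the use of S in Property 3 (the assumption here being that S denotes the inverse of the block matrix C):

$$V = \begin{bmatrix} V_{11} & \cdots & V_{1r} \\ \vdots & \ddots & \vdots \\ V_{r1} & \cdots & V_{rr} \end{bmatrix} \qquad C = \begin{bmatrix} C_{11} & \cdots & C_{1r} \\ \vdots & \ddots & \vdots \\ C_{r1} & \cdots & C_{rr} \end{bmatrix} \qquad S = C^{-1}$$

Since each Chl is (k+1) × (k+1), the matrices C and S are r(k+1) × r(k+1).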
Property 2: The maximum of the log-likelihood statistic occurs when for all h = 1, …, r and j = 0, 1, …, k the following r(k+1) equations hold
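Setting the partial derivative of LL with respect to each bhj equal to zero yields

$$\sum_{i=1}^{n} y_{ih}\,x_{ij} = \sum_{i=1}^{n} p_{ih}\,x_{ij}$$

equivalently, $\sum_{i=1}^{n}(y_{ih} - p_{ih})x_{ij} = 0$.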
Observation: Let Y = [yih] be the n × r matrix of observed outcomes of the dependent variable and let P = [pih] be the n × r matrix of the model’s predicted values for the outcomes (excluding the reference outcome). Let X be the n × (k+1) design matrix. Then the matrix equation shown below, where the right side of the equation is the (k+1) × r zero matrix, is equivalent to the equations in Property 2.
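In symbols:

$$X^T(Y - P) = O$$

The entry of XT(Y − P) in row j and column h is $\sum_{i=1}^{n}(y_{ih}-p_{ih})x_{ij}$, which is precisely the (h, j) equation of Property 2.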
Property 3: Let B, X, Y, P, and S be defined as in Properties 1 and 2, and let B(0) be an initial guess of B. For each m define the iteration shown below; then for sufficiently large m, B(m+1) ≈ B(m), and so B(m) is a good approximation of the coefficient vector B.
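By analogy with the binary case (Newton’s method applied to the equations of Property 2), the update presumably takes the form

$$B^{(m+1)} = B^{(m)} + S^{(m)}\,U^{(m)}$$

where S(m) is S evaluated at the probabilities determined by B(m), and U(m) (our notation for the score vector) is the r(k+1) × 1 column vector obtained by stacking the columns of XT(Y − P(m)) in the same order in which B1, …, Br are stacked in B.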
Observation: Here we can take as the initial guess for B the r(k+1) × 1 zero matrix.
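To make the algorithm concrete, the following is a minimal NumPy sketch of the iteration in Properties 1–3 for ungrouped data. The function name, the 0/1 indicator coding of Y, and the convergence tolerance are illustrative choices rather than anything prescribed above.

```python
import numpy as np

def multinomial_newton(X, Y, tol=1e-8, max_iter=50):
    """Newton's method for multinomial logistic regression (ungrouped data).

    X : (n, k+1) design matrix whose first column is all 1s (x_i0 = 1).
    Y : (n, r) indicator matrix; Y[i, h-1] = 1 if sample i has outcome h,
        so a row of all zeros marks the reference outcome 0.
    Returns an (r, k+1) array whose h-th row is the coefficient vector Bh.
    """
    n, k1 = X.shape
    r = Y.shape[1]
    B = np.zeros((r, k1))                    # initial guess B(0) = 0

    for _ in range(max_iter):
        Z = np.exp(X @ B.T)                  # z_ih, shape (n, r)
        P = Z / (1.0 + Z.sum(axis=1, keepdims=True))   # p_ih for h = 1..r

        # Score vector: the columns of X^T (Y - P), stacked to match B
        U = (X.T @ (Y - P)).T.reshape(-1)    # length r(k+1)

        # Blocks C_hl = X^T V_hl X, with V_hl diagonal p_ih (delta_hl - p_il)
        C = np.zeros((r * k1, r * k1))
        for h in range(r):
            for l in range(r):
                v = P[:, h] * ((h == l) - P[:, l])
                C[h*k1:(h+1)*k1, l*k1:(l+1)*k1] = X.T @ (v[:, None] * X)

        step = np.linalg.solve(C, U)         # applies S = C^{-1} to the score
        B += step.reshape(r, k1)
        if np.max(np.abs(step)) < tol:       # B(m+1) ≈ B(m): converged
            break
    return B
```

Solving the linear system C · step = U is numerically preferable to forming S = C−1 explicitly, but the two give the same update.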
Observation: If we group the data as we did in Example 1 of Basic Concepts of Logistic Regression (i.e. summary data), then Property 1 takes the form shown below, where n = the number of groups (instead of the sample size) and, for each i, ni = the number of observations in group i.
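Presumably the grouped analogue replaces each sample’s probability with the expected count ni pih; for instance, the maximum-likelihood conditions become

$$\sum_{i=1}^{n} y_{ih}\,x_{ij} = \sum_{i=1}^{n} n_i\,p_{ih}\,x_{ij} \qquad h = 1, \ldots, r;\; j = 0, 1, \ldots, k$$

where yih is now the number of observations in group i with outcome h.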
Property 2 also holds where Y = [yih] is the n × r matrix of summarized observed outcomes (counts) of the dependent variable, X is the corresponding n × (k+1) design matrix, P = [pih] is the n × r matrix of predicted values, and Vhl is the n × n diagonal matrix whose main diagonal contains elements of the form
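These are the ungrouped weights scaled by the group sizes; the ith diagonal element of Vhl is

$$n_i\,p_{ih}(1-p_{ih}) \;\text{ if } h = l \qquad\qquad -\,n_i\,p_{ih}\,p_{il} \;\text{ if } h \neq l$$

i.e. $n_i\,p_{ih}(\delta_{hl} - p_{il})$.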
Thus, the element in the jth row and mth column of Chl is
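Expanding Chl = XTVhlX entrywise:

$$[C_{hl}]_{jm} = \sum_{i=1}^{n} n_i\,p_{ih}(\delta_{hl} - p_{il})\,x_{ij}\,x_{im}$$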
In this case, the expressions for L and LL become
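Presumably these are the grouped multinomial forms

$$L = \prod_{i=1}^{n}\left[\binom{n_i}{y_{i0},\ldots,y_{ir}}\prod_{h=0}^{r} p_{ih}^{\,y_{ih}}\right] \qquad\qquad LL = \sum_{i=1}^{n}\sum_{h=0}^{r} y_{ih}\ln p_{ih}$$

where the LL shown omits the constant term arising from the multinomial coefficients (see below).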
The values of LL and R2 as well as the chi-square test for significance are calculated exactly as for binary logistic regression (see Testing the Fit of the Logistic Regression Model).
For LL, we need to add the constant term shown below to the above formula.
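Presumably this constant is the log of the multinomial coefficients, which does not depend on the fitted coefficients:

$$\sum_{i=1}^{n} \ln\binom{n_i}{y_{i0},\,y_{i1},\,\ldots,\,y_{ir}} = \sum_{i=1}^{n}\left(\ln n_i! - \sum_{h=0}^{r} \ln y_{ih}!\right)$$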
Note, however, that in calculating the different versions of R2, the constant term is not included in LL and LL0.