Basic Concepts of Multinomial Logistic Regression

Suppose there are r + 1 possible outcomes for the dependent variable, 0, 1, …, r, with r > 1. Pick one of the outcomes as the reference outcome and conduct r pairwise logistic regressions between this outcome and each of the other outcomes. For our purposes we will assume that 0 is the reference outcome. The binary logistic regression model for the outcome h, with h ≠ 0, is defined by

$$\ln \frac{p_{ih}}{p_{i0}} = b_{h0} + \sum_{j=1}^{k} b_{hj} x_{ij}$$

Here pih is the probability that the ith sample has outcome h. Taking the exponential of both sides of the above equation yields the equivalent expression

$$\frac{p_{ih}}{p_{i0}} = \exp\left(\sum_{j=0}^{k} b_{hj} x_{ij}\right)$$

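where we define xi0 = 1 (in order to keep our notation simple). Now for each i the probabilities pi0, pi1, …, pir sum to 1, and so

$$p_{i0} = \frac{1}{1 + \sum_{l=1}^{r} \exp\left(\sum_{j=0}^{k} b_{lj} x_{ij}\right)}$$

$$p_{ih} = \frac{\exp\left(\sum_{j=0}^{k} b_{hj} x_{ij}\right)}{1 + \sum_{l=1}^{r} \exp\left(\sum_{j=0}^{k} b_{lj} x_{ij}\right)} \qquad (h = 1, \ldots, r)$$
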
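As a rough illustration of how the outcome probabilities follow from the coefficients, here is a minimal Python sketch. The function name `outcome_probabilities` and the coefficient values are hypothetical, chosen only for the example; the reference outcome 0 is treated as having a linear predictor of zero.

```python
import math

def outcome_probabilities(x, coeffs):
    """Return [p_0, p_1, ..., p_r] for one sample.

    x      -- values x_1..x_k of the independent variables (x_0 = 1 is prepended)
    coeffs -- r lists [b_h0, b_h1, ..., b_hk], one per outcome h = 1..r
    """
    xi = [1.0] + list(x)
    # exp of the linear predictor for each non-reference outcome
    e = [math.exp(sum(b * v for b, v in zip(bh, xi))) for bh in coeffs]
    denom = 1.0 + sum(e)
    # the reference outcome comes first, then outcomes 1..r
    return [1.0 / denom] + [eh / denom for eh in e]

# Hypothetical coefficients: r = 2 non-reference outcomes, k = 1 variable
probs = outcome_probabilities([2.0], [[0.5, -0.25], [-1.0, 0.75]])
print(probs, sum(probs))
```

The ratio of any p_h to p_0 recovers the exponential of the corresponding linear predictor, and the r + 1 probabilities sum to 1.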
Whereas the model used in the binary case with only two outcomes is based on the binomial distribution, when there are more than two outcomes the model is based on the multinomial distribution. Thus, the probability that the sample data occurs as it does is given by

$$L = \prod_{i=1}^{n} \prod_{h=0}^{r} p_{ih}^{y_{ih}}$$

where the yih are the observed values while the pih are the corresponding theoretical values.

Taking the natural log of both sides and simplifying we get the following definition.

Definition 1: The log-likelihood statistic for multinomial logistic regression is defined as follows:

$$LL = \sum_{i=1}^{n} \sum_{h=0}^{r} y_{ih} \ln p_{ih}$$

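The log-likelihood can be computed directly from the observed indicators and the predicted probabilities. The following Python sketch assumes raw (ungrouped) data, where each row of `Y` contains a single 1 marking the observed outcome; the function name and the numbers are hypothetical.

```python
import math

def log_likelihood(Y, P):
    """Sum of y_ih * ln(p_ih) over all samples i and all outcomes h = 0..r."""
    return sum(
        y * math.log(p)
        for y_row, p_row in zip(Y, P)
        for y, p in zip(y_row, p_row)
        if y > 0  # terms with y_ih = 0 contribute nothing
    )

# Three samples, three outcomes (column 0 is the reference outcome)
Y = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
P = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
ll = log_likelihood(Y, P)
print(ll, math.exp(ll))
```

Exponentiating the result gives back the product of the probabilities assigned to the observed outcomes, here 0.5 × 0.6 × 0.8.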
Observation: The multinomial counterparts to Properties 1 and 2 of Finding Logistic Regression Coefficients using Newton’s Method are as follows.

Property 1: For each h > 0, let Bh = [bhj] be the (k+1) × 1 column vector of binary logistic regression coefficients of the outcome h compared to the reference outcome 0 and let B be the r(k+1) × 1 column vector consisting of the elements in B1, …, Br arranged in a column.

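Also let X be the n × (k+1) design matrix (as described in Definition 3 of Least Squares for Multiple Regression). For outcomes h and l let Vhl be the n × n diagonal matrix whose main diagonal contains elements of the form

$$v_{ii} = \begin{cases} p_{ih}\,(1 - p_{ih}) & \text{if } h = l \\ -\,p_{ih}\, p_{il} & \text{if } h \neq l \end{cases}$$
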
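and let Chl = X^T Vhl X, which is a (k+1) × (k+1) matrix. Now define the r(k+1) × r(k+1) matrices

$$C = \begin{bmatrix} C_{11} & \cdots & C_{1r} \\ \vdots & \ddots & \vdots \\ C_{r1} & \cdots & C_{rr} \end{bmatrix}$$

and S = C^{-1}. Then S is the covariance matrix for B.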
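To make the block structure of C concrete, here is a minimal NumPy sketch, assuming the diagonal elements of Vhl are pih(1 − pih) when h = l and −pih·pil otherwise. The function name `covariance_matrix` and the data are hypothetical.

```python
import numpy as np

def covariance_matrix(X, P):
    """X: n x (k+1) design matrix whose first column is all ones.
    P: n x r predicted probabilities for outcomes 1..r (reference excluded).
    Returns (C, S) where S is the inverse of C."""
    n, q = X.shape
    r = P.shape[1]
    C = np.zeros((r * q, r * q))
    for h in range(r):
        for l in range(r):
            if h == l:
                v = P[:, h] * (1.0 - P[:, h])
            else:
                v = -P[:, h] * P[:, l]
            # X^T V_hl X, computed without forming the n x n diagonal matrix
            C[h * q:(h + 1) * q, l * q:(l + 1) * q] = X.T @ (v[:, None] * X)
    return C, np.linalg.inv(C)

# Hypothetical data: n = 4 samples, k = 1 variable, r = 2 non-reference outcomes
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
P = np.array([[0.2, 0.3], [0.3, 0.3], [0.4, 0.2], [0.5, 0.1]])
C, S = covariance_matrix(X, P)
print(C.shape)
print(np.sqrt(np.diag(S)))  # standard errors of the r(k+1) coefficients
```

The square roots of the diagonal of S are the standard errors of the coefficients, listed in the same order as the elements of B.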

Property 2: The maximum of the log-likelihood statistic occurs when for all h = 1, …, r and j = 0, 1, …, k the following r(k+1) equations hold

$$\sum_{i=1}^{n} (y_{ih} - p_{ih})\, x_{ij} = 0$$

Observation: Let Y = [yih] be the n × r matrix of observed outcomes of the dependent variable and let P = [pih] be the n × r matrix of the model’s predicted values for the outcomes (excluding the reference outcome). Let X be the n × (k+1) design matrix. Then the matrix equation

$$X^T (Y - P) = 0$$

where the right side of the equation is the (k+1) × r zero matrix, is equivalent to the equations in Property 2.
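A small numerical check of the matrix equation, as a sketch only: for an intercept-only model (k = 0) the equations are satisfied when each predicted probability equals the observed proportion of that outcome. The data and the function name `score_matrix` are hypothetical.

```python
import numpy as np

def score_matrix(X, Y, P):
    """Return the (k+1) x r matrix X^T (Y - P); entry (j, h) is the left side
    of the equation for coefficient b_hj."""
    return X.T @ (Y - P)

# Five samples with outcomes 0, 1, 1, 2, 2; columns of Y are outcomes 1 and 2
X = np.ones((5, 1))                        # intercept only, so k = 0
Y = np.array([[0, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
P = np.tile(Y.mean(axis=0), (5, 1))        # observed proportions 0.4 and 0.4
G = score_matrix(X, Y, P)
print(G)
```

Here G is the 1 × 2 zero matrix up to rounding, so these probabilities maximize the log-likelihood for the intercept-only model.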

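Property 3: Let B, X, Y, P and S be defined as in Properties 1 and 2, let B(0) be an initial guess of B, and for each m define the following iteration

$$B^{(m+1)} = B^{(m)} + S^{(m)}\, \mathrm{vec}\!\left(X^T \left(Y - P^{(m)}\right)\right)$$

where S(m) and P(m) are calculated from B(m), and vec stacks the columns of the (k+1) × r matrix into an r(k+1) × 1 column vector, in the same order as the elements of B. Then for sufficiently large m, B(m+1) ≈ B(m), and so B(m) is a good approximation of the coefficient vector B.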

Observation: Here we can take as the initial guess for B the r(k+1) × 1 zero matrix.
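The iteration can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the Real Statistics implementation: the function names and the small data set are hypothetical, the coefficients are held as a (k+1) × r matrix whose columns are B1, …, Br, and the linear system C·step = gradient is solved directly instead of forming S = C^{-1}.

```python
import numpy as np

def probabilities(X, B):
    """B is (k+1) x r; returns the n x r matrix P for outcomes 1..r."""
    E = np.exp(X @ B)
    return E / (1.0 + E.sum(axis=1, keepdims=True))

def information_matrix(X, P):
    """Block matrix C whose (h, l) block is X^T V_hl X."""
    q, r = X.shape[1], P.shape[1]
    C = np.zeros((r * q, r * q))
    for h in range(r):
        for l in range(r):
            v = P[:, h] * (float(h == l) - P[:, l])
            C[h * q:(h + 1) * q, l * q:(l + 1) * q] = X.T @ (v[:, None] * X)
    return C

def fit_newton(X, Y, max_iter=50, tol=1e-10):
    q, r = X.shape[1], Y.shape[1]
    B = np.zeros((q, r))                              # zero initial guess
    for _ in range(max_iter):
        P = probabilities(X, B)
        grad = (X.T @ (Y - P)).flatten(order="F")     # stack columns B_1..B_r
        step = np.linalg.solve(information_matrix(X, P), grad)  # equals S @ grad
        B = B + step.reshape((q, r), order="F")
        if np.max(np.abs(step)) < tol:                # B(m+1) is close to B(m)
            break
    return B

# Hypothetical raw data: one independent variable, outcomes coded 0, 1, 2
x = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=float)
outcome = np.array([0, 0, 1, 0, 1, 2, 0, 1, 2, 1, 2, 2])
X = np.column_stack([np.ones_like(x), x])
Y = np.column_stack([(outcome == h).astype(float) for h in (1, 2)])
B = fit_newton(X, Y)
P = probabilities(X, B)
print(B)
print(X.T @ (Y - P))   # close to the zero matrix at convergence
```

Plain Newton steps like these can fail to converge when the outcomes are perfectly separated by the independent variables, in which case C becomes nearly singular.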

Observation: If we group the data as we did in Example 1 of Basic Concepts of Logistic Regression (i.e. summary data), then Property 1 takes the form


where n = the number of groups (instead of the sample size) and, for each i, ni = the number of observations in group i.

Property 2 also holds where Y = [yih] is the n × r matrix of summarized observed outcomes of the dependent variable, X is the corresponding n × (k+1) design matrix, P = [pih] is the n × r matrix of predicted values and Vhl is the n × n diagonal matrix whose main diagonal contains elements of the form

$$v_{ii} = \begin{cases} n_i\, p_{ih}\,(1 - p_{ih}) & \text{if } h = l \\ -\,n_i\, p_{ih}\, p_{il} & \text{if } h \neq l \end{cases}$$

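Thus, the element in the jth row and mth column of Chl is

$$\begin{cases} \displaystyle\sum_{i=1}^{n} n_i\, p_{ih}\,(1 - p_{ih})\, x_{ij}\, x_{im} & \text{if } h = l \\[2ex] \displaystyle -\sum_{i=1}^{n} n_i\, p_{ih}\, p_{il}\, x_{ij}\, x_{im} & \text{if } h \neq l \end{cases}$$
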
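In this case, the expressions for L and LL become

$$L = \prod_{i=1}^{n} \prod_{h=0}^{r} p_{ih}^{y_{ih}}$$

$$LL = \sum_{i=1}^{n} \sum_{h=0}^{r} y_{ih} \ln p_{ih}$$
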
The values of LL and R2 as well as the chi-square test for significance are calculated exactly as for binary logistic regression (see Testing the Fit of the Logistic Regression Model).


As for LL, to the above formula we need to add the constant term

$$\sum_{i=1}^{n} \left( \ln n_i! - \sum_{h=0}^{r} \ln y_{ih}! \right)$$

Note, however, that in calculating the different versions of R2, the constant term is not included in LL and LL0.
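For summary data, the log-likelihood with and without the constant term can be sketched as follows. This assumes that the constant is the log of the multinomial coefficient for each group, i.e. ln ni! minus the sum of the ln yih!, and it uses `math.lgamma(m + 1)` for ln m!. The function name and the counts are hypothetical.

```python
import math

def log_likelihood_summary(Y, P, include_constant=False):
    """Y[i][h] = number of observations in group i with outcome h (h = 0..r).
    P[i][h] = predicted probability of outcome h for group i."""
    ll = sum(
        y * math.log(p)
        for y_row, p_row in zip(Y, P)
        for y, p in zip(y_row, p_row)
        if y > 0
    )
    if include_constant:
        # assumed constant: ln(n_i!) - sum_h ln(y_ih!) for each group i
        ll += sum(
            math.lgamma(sum(y_row) + 1) - sum(math.lgamma(y + 1) for y in y_row)
            for y_row in Y
        )
    return ll

# Two groups of four observations each, three outcomes
Y = [[2, 1, 1], [0, 3, 1]]
P = [[0.5, 0.25, 0.25], [0.1, 0.6, 0.3]]
ll_kernel = log_likelihood_summary(Y, P)
ll_full = log_likelihood_summary(Y, P, include_constant=True)
print(ll_kernel, ll_full, ll_full - ll_kernel)
```

The difference between the two values depends only on the observed counts, not on the fitted probabilities, which is why it cancels in any comparison of two models fitted to the same summary table.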

14 Responses to Basic Concepts of Multinomial Logistic Regression

  1. Eki says:

    Dear Charles,

    What should I do if the variance-covariance matrix is a singular matrix?
    Is there any solution for this problem?

    • Charles says:

      Dear Eki,
      There are approaches for handling the case where the variance-covariance matrix is not invertible, but these go beyond the scope of this website. You can find some of them by googling.

  2. Thomas says:

    Dear Charles,
    From the literature, what would you suggest as a rule to define the minimum sample size (1) for the binomial logistic regression, (2) for the multinomial logistic regression? E.g. a rule based on the number of independent variables, the observed proportions related to each possible outcome of the dependent variable. Should such a threshold be defined by considering the possible outcomes separately (e.g. the minimum observed proportion across the outcomes), or considering all rows (combinations of outcomes) of the summary table. Thanks.

  3. Thomas says:

    Dear Charles,

    Many thanks for this very useful material. I’d like to know if, even if probably similar to the binomial case, you could add a section on the comparison of regression models. In particular, I’d be also interested to know if LL0 is supposed to remain identical from one model to the other (I think it however depends on the way the summary table is designed, due to non linearity in the LL0 formula), and if the degrees of freedom can also be simply subtracted.
    Many thanks in advance,


    • Charles says:

      What sort of comparison are you looking for? When you use one model rather than another?
      The LL0 values won’t be identical from model to model. Generally, they will be identical only when the summary data are identical.

      • Thomas says:

        Dear Charles,
        Thanks for your prompt answer. I’m thinking of nested models, exactly as illustrated in the binomial case, with a chi-square test based on log-likelihoods and a substitution of LL0 by the LL1 of the reference model. Is this valid for the multinomial case, provided we keep the summary table identical for all models? Once the final model is selected, I’ll try to define a classification matrix based on RS capabilities.

  4. Jackie says:

    All I want to figure out is how do get the population and sample for a multinomial logistic regress. I have four generational cohorts and five soft skill categories that I will be testing.


    • Charles says:

      Please explain what you mean by “how do [I] get the population and sample for a multinomial logistic regress”

  5. Colin says:


    When h ≠ j the element of the V matrix is vii = (-1)*ni*Pih*Pil, but it seems that in the Excel workbook you forgot the -1 term. Why?

