Testing the Fit of the Logistic Regression Model

Unfortunately, for larger values of the coefficient b, the standard error becomes inflated, which deflates the Wald statistic and so increases the probability that b is viewed as not making a significant contribution to the model even when it does (i.e. a type II error).

To overcome this problem, it is better to base the test on the log-likelihood statistic, since

$$\chi^2 = 2(LL_1 - LL_0) \sim \chi^2(df)$$

where df = k1 – k0, LL1 is the log-likelihood of the full model (with k1 coefficients) and LL0 is the log-likelihood of a model with fewer coefficients (in particular the model with only the intercept b0 and no other coefficients). This is equivalent to

$$\chi^2 = -2\ln\frac{L_0}{L_1} \sim \chi^2(df)$$
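
As an illustrative sketch (this is not how the Excel tool referenced below computes it), the test can be carried out in a few lines of Python; the log-likelihood values ll_full and ll_reduced are hypothetical placeholders:

```python
from scipy.stats import chi2

def likelihood_ratio_test(ll_full, ll_reduced, df):
    """Likelihood-ratio test: 2*(LL1 - LL0) ~ chi-square with df degrees of freedom."""
    stat = 2.0 * (ll_full - ll_reduced)
    p_value = chi2.sf(stat, df)  # upper-tail probability, like Excel's CHIDIST
    return stat, p_value

# Hypothetical log-likelihoods for a full model and an intercept-only model
stat, p = likelihood_ratio_test(ll_full=-120.3, ll_reduced=-135.8, df=1)
print(f"chi-square = {stat:.3f}, p-value = {p:.4g}")
```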

Observation: For ordinary regression the coefficient of determination

$$R^2 = \frac{SS_{Reg}}{SS_{Tot}} = 1 - \frac{SS_{Res}}{SS_{Tot}}$$

Thus R2 measures the percentage of variance explained by the regression model. We need a similar statistic for logistic regression. We define the following three pseudo-R2 statistics for logistic regression.

Definition 1: The log-linear ratio R2 (aka McFadden’s R2) is defined as follows:

$$R_L^2 = \frac{LL_0 - LL_1}{LL_0} = 1 - \frac{LL_1}{LL_0}$$

where, as above, LL1 is the log-likelihood of the full model and LL0 is the log-likelihood of a model with fewer coefficients (in particular the intercept-only model).

Cox and Snell’s R2 is defined as

$$R_{CS}^2 = 1 - \left(\frac{L_0}{L_1}\right)^{2/n}$$

where n = the sample size.

Nagelkerke’s R2 is defined as

$$R_N^2 = \frac{R_{CS}^2}{1 - L_0^{2/n}}$$

Observation: Since $R_{CS}^2$ cannot achieve a value of 1, Nagelkerke’s R2 was developed to have properties more similar to the R2 statistic used in ordinary regression.
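
To make the three definitions concrete, here is a minimal Python sketch (my own illustration, not part of the original text) that computes all three pseudo-R2 statistics from the two log-likelihoods and the sample size:

```python
import math

def pseudo_r2(ll_full, ll_reduced, n):
    """McFadden, Cox and Snell, and Nagelkerke pseudo-R2 statistics.

    ll_full    -- log-likelihood LL1 of the full model
    ll_reduced -- log-likelihood LL0 of the intercept-only model
    n          -- sample size
    """
    mcfadden = 1.0 - ll_full / ll_reduced  # R2_L = 1 - LL1/LL0
    cox_snell = 1.0 - math.exp(2.0 * (ll_reduced - ll_full) / n)  # 1 - (L0/L1)^(2/n)
    # Rescale Cox-Snell by its maximum attainable value 1 - L0^(2/n) so that 1 is reachable
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_reduced / n))
    return mcfadden, cox_snell, nagelkerke
```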

Observation: The initial value L0 of the likelihood function L, i.e. for the model that includes only the intercept b0, is given by

$$L_0 = \left(\frac{n_0}{n}\right)^{n_0}\left(\frac{n_1}{n}\right)^{n_1}$$

where n0 = number of observations with value 0, n1 = number of observations with value 1 and n = n0 + n1.
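
For example, under this formula the intercept-only log-likelihood LL0 = ln L0 can be computed directly from the two class counts; a small sketch (the counts below are hypothetical):

```python
import math

def intercept_only_ll(n0, n1):
    """LL0 = n0*ln(n0/n) + n1*ln(n1/n), i.e. the log of L0 above."""
    n = n0 + n1
    return n0 * math.log(n0 / n) + n1 * math.log(n1 / n)

print(intercept_only_ll(400, 360))  # hypothetical counts with n = 760
```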

As described above, the likelihood-ratio test statistic equals:

$$\chi^2 = -2\ln\frac{L_0}{L_1} = 2(LL_1 - LL_0)$$

where L1 is the maximized value of the likelihood function for the full model and L0 is the maximized value of the likelihood function for the reduced model. The test statistic has a chi-square distribution with df = k1 – k0, i.e. the number of parameters in the full model minus the number of parameters in the reduced model.

Example 1: Determine whether there is a significant difference in survival rate between the different values of rem in Example 1 of Basic Concepts of Logistic Regression. Also calculate the various pseudo-R2 statistics.

We are essentially comparing the logistic regression model with coefficient b to the model without coefficient b. We begin by calculating L1 (for the full model with b) and L0 (for the reduced model without b).

[Figures: worksheet calculations of L1 (full model) and L0 (intercept-only model)]

Here L1 is found in cell M16 or T6 of Figure 6 of Finding Logistic Coefficients using Solver.

We now use the following test:

$$\chi^2 = 2(LL_1 - LL_0) \sim \chi^2(df)$$

[Figure: test calculation, yielding χ2 = 280.246]

where df = 1. Since p-value = CHIDIST(280.246,1) = 6.7E-63 < .05 = α, we conclude that differences in rems yield a significant difference in survival.
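
As a quick cross-check outside Excel, scipy’s chi2.sf plays the role of CHIDIST here (a sketch, assuming the statistic 280.246 from above):

```python
from scipy.stats import chi2

print(chi2.sf(280.246, 1))  # ≈ 6.7e-63, matching CHIDIST(280.246, 1)
```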

The pseudo-R2 statistics are as follows:

[Figures: McFadden’s, Cox and Snell’s, and Nagelkerke’s pseudo-R2 values for Example 1]

All these values are reported by the Logistic Regression data analysis tool (see range S5:T16 of Figure 6 of Finding Logistic Coefficients using Solver).

7 Responses to Testing the Fit of the Logistic Regression Model

  1. Amy says:

    Given your Figure 6 output, are the following statements a correct interpretation?

    The results of the likelihood ratio test suggest there was a statistically significant relationship between the input variable and the outcome variable at the 0.05 level of significance (chi sq (1, N=760) = 280.2421, p = 6.65E-63).

    The odds ratio of the input was .9928 (= exp(-0.00722)) with a 95% confidence interval = (.9917, .9939). This indicated that for every unit …. increase/decrease in the input variable the odds of the output variable increased/decreased by a factor of 0.9928.

    My understanding of your data set is weak so I’m not sure how to interpret that.

    My data is pretest score and the output is pass/fail for the class. The logistic regression ran nicely and my model is significant.

  2. shri says:

    Hi Charles,

    Is there any post where the binary logistic regression output has been interpreted? That is, what does the output mean and what conclusions or actions can be derived from it?

    Shri

  3. Wytek Szymanski says:

    Hi Charles,

    The R-squared in linear regression is defined like so:
    R2 = (var(Y) – var(err))/var(Y) = 1 – var(err)/var(Y)
    where var(err) is derived from the differences between Y and Yhat.

    Why can’t we apply this definition to logistic regression where Y is the observed probability and Yhat is the estimated probability?

    • Charles says:

      Wytek,
      Sorry, but I have not tried to evaluate this version of R-square for logistic regression. From what I can see no one uses it. Instead they use pseudo-R-square statistics, some of which are described on my website.
      Charles

  4. Mike says:

    Great site – very helpful.

    One typo:
    CHITEST(280.246,1) = 6.7E-63 => CHIDIST(280.246,1) = 6.7E-63
