Comparing Logistic Regression Models

Example 1: Repeat the study from Example 3 of Finding Logistic Regression Coefficients using Newton’s Method based on the summary data shown in Figure 1.

Figure 1 – Data for Example 1

Using the Logistic Regression supplemental data analysis tool, selecting the Newton Method option, we obtain the output displayed in Figure 2.

Figure 2 – Base model for Example 1

We know from the above analysis that the presence of Temp and Water makes a significant difference (over the initial model where only the intercept is used), but do we need both of these independent variables? The value 1 = exp(0) doesn’t lie in the 95% confidence interval for Temp, but it does lie in the 95% confidence interval for Water. We conclude that Temp makes a significant contribution to the model, but Water doesn’t. Since this analysis relies on the Wald statistic, which is not completely reliable, we would prefer to use an approach similar to that used in Testing Fit of the Logistic Regression Model.
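The confidence interval check described above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this sketch, and the coefficients and standard errors are hypothetical placeholders, not the values shown in Figure 2.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(coef, std_err, level=0.95):
    """Wald confidence interval for exp(coef), i.e. for the odds ratio."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # roughly 1.96 when level = 0.95
    return math.exp(coef - z * std_err), math.exp(coef + z * std_err)

def wald_significant(coef, std_err, level=0.95):
    """True when 1 = exp(0) lies outside the confidence interval."""
    lower, upper = odds_ratio_ci(coef, std_err, level)
    return not (lower <= 1.0 <= upper)

# Hypothetical (coefficient, standard error) pairs, NOT the values in Figure 2.
for name, coef, se in [("Temp", 0.40, 0.10), ("Water", 0.05, 0.08)]:
    lower, upper = odds_ratio_ci(coef, se)
    print(name, round(lower, 3), round(upper, 3), wald_significant(coef, se))
```

With these made-up inputs the first variable is flagged as significant and the second is not, which mirrors the pattern described for Temp and Water.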

Example 2: Do the Temp and Water variables make a significant difference in the model of Example 1?

We first create summary tables for the Temp-only and Water-only models and then use the Logistic Regression data analysis tool (with Newton’s Method option) to build the two models. Also see below for a simpler approach for creating the Temp-only summary table.

The summary table for the Temp model is shown in range B28:D34 of Figure 3. The values in columns C and D can be calculated from the summary table of the base model (as shown in Figure 2) using SUMIF. For example, the number of samples where Temp = 20 and the reptile was born Male (cell C29) is given by the formula

=SUMIF($A$4:$A$15,$B29,C$4:C$15)

By filling right (Ctrl-R) and down (Ctrl-D), you can copy this formula into the other cells in the range C29:D34. You now use the Logistic Regression tool to obtain the output shown in Figure 3.
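For readers who want to see the SUMIF step outside Excel, here is a minimal Python sketch of the same aggregation. The rows are hypothetical (they are not the data of Figure 1), the function name is made up, and the second count column is assumed to be Female.

```python
from collections import defaultdict

def collapse_by_temp(rows):
    """Sum the Male and Female counts over all rows sharing the same Temp."""
    totals = defaultdict(lambda: [0, 0])
    for temp, _water, male, female in rows:
        totals[temp][0] += male
        totals[temp][1] += female
    return [(temp, m, f) for temp, (m, f) in sorted(totals.items())]

# Hypothetical rows (Temp, Water, Male, Female); not the data of Figure 1.
base = [
    (20, 0, 3, 7), (20, 1, 4, 6),
    (25, 0, 6, 4), (25, 1, 5, 5),
]
print(collapse_by_temp(base))  # [(20, 7, 13), (25, 11, 9)]
```

Each output row plays the role of one row of the Temp-only summary table: the Water column is dropped and the counts are added up per Temp value.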

Figure 3 – Output for Temp-only model

We observe that the Temp variable makes a significant contribution (cell U35) over the constant-only model. Here we are comparing LL1 (Temp model) with LL0 (constant-only model).
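As a side note, LL0 for a constant-only model does not require an iterative fit: the fitted probability is simply the overall proportion of successes. The sketch below assumes this standard result; the function name and the totals are hypothetical and are not taken from the figures.

```python
import math

def constant_only_ll(successes, failures):
    """Maximized log-likelihood of the intercept-only logistic model."""
    n = successes + failures
    ll = 0.0
    # A zero count contributes nothing (the limit of k*ln(k/n) as k -> 0).
    if successes > 0:
        ll += successes * math.log(successes / n)
    if failures > 0:
        ll += failures * math.log(failures / n)
    return ll

# Hypothetical totals; not the counts behind Figure 3.
print(constant_only_ll(18, 22))
```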

We can also compare the Temp model with the base model (Temp + Water) by copying the range T28:U35 to another location in the worksheet, and then using the LL1 value from the base model as LL1 and the LL1 value from the Temp model as LL0. We also need to change df to 1, since the difference between the df of the two models is 2 – 1 = 1. This is shown in Figure 4.

Figure 4 – Comparing the Temp and base models

We see that there is not a significant difference between the models (cell X44). This confirms the conclusion that we reached previously that the Water variable is not making a significant contribution, and in fact it can be dropped.
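The comparison of two nested models can be sketched as follows: the test statistic is 2(LL1 – LL0), where LL1 comes from the larger model and LL0 from the smaller one, and df is the difference in the number of independent variables. The function names are illustrative, the log-likelihood values are hypothetical (not those of Figure 4), and the chi-square tail probability is computed with a standard-library recurrence rather than with Excel's own function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution, integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))  # exact value for df = 1
    else:
        a, q = 1.0, math.exp(-y)             # exact value for df = 2
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

def likelihood_ratio_test(ll_reduced, ll_full, df_reduced, df_full):
    """Return (chi-square statistic, df, p-value) for two nested models."""
    stat = 2.0 * (ll_full - ll_reduced)
    df = df_full - df_reduced
    return stat, df, chi2_sf(stat, df)

# Hypothetical log-likelihoods; not the values shown in Figure 4.
stat, df, p = likelihood_ratio_test(ll_reduced=-25.0, ll_full=-24.4,
                                    df_reduced=1, df_full=2)
print(stat, df, p)
```

A p-value above the chosen alpha (e.g. 0.05) would indicate that the extra variable in the larger model does not make a significant contribution.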

We create the Water-only model in a similar way to obtain the output shown in Figure 5.

Figure 5 – Output for Water-only model

This time we see that there is no significant difference between the Water model and the constant model. If we repeated the analysis of Figure 4, we would see that there is a significant difference between the Water model and the base model.

Finally, we can look at further refinements of the model, such as the full interaction model, where we include the interaction between Temp and Water. We show this analysis in Figure 6.
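As a rough sketch of how the input for such a model might be prepared, the code below adds an interaction column to summary data. It assumes the interaction term is the product Temp × Water, which is the usual convention but is not stated explicitly here; the function name and the rows are hypothetical.

```python
def add_interaction(rows):
    """Append Temp*Water as an extra predictor column to each summary row."""
    return [(temp, water, temp * water, male, female)
            for temp, water, male, female in rows]

# Hypothetical rows (Temp, Water, Male, Female); not the data of Figure 1.
rows = [(20, 0, 3, 7), (20, 1, 4, 6), (25, 1, 5, 5)]
for row in add_interaction(rows):
    print(row)
```

Under this assumption the interaction model has three independent variables instead of two, so a comparison with the base model would use df = 3 – 2 = 1.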

Figure 6 – Logistic regression – Interaction model

If we compare this model with the base model using the approach described above (as in Figure 4), we get the output shown in Figure 7.

Figure 7 – Comparing the interaction and base models

This shows that there is a significant difference between the full interaction model and the base model, with the interaction model providing a better fit.

Observation: As mentioned above, there is a simpler way to create the Temp-only and Water-only summary data tables. To create the Temp-only table, press Ctrl-m, select the Logistic Regression data analysis tool, and then enter the following information into the dialog box that appears.

Figure 8 – Creating reduced models

Here we have entered the Water independent variable into the List of variables to exclude field. This produces the output in Figure 3.

Observation: The List of variables to exclude field can be used to create a reduced model whenever the Input Format is set to Summary data and the Headings included with data field is checked. The variables to exclude are entered into this field separated by commas.

For example, if we have a summary data table with Nationality, Age, Education, Gender and Occupation as independent variables and want to create a reduced model with only Nationality, Education and Occupation, we would simply enter Age, Gender into the List of variables to exclude field.
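To make the effect of an exclusion list concrete, here is a hedged Python sketch of one plausible way to reduce a summary table: drop the named columns and add up the counts of rows that become identical. This is not the tool's actual implementation; the function name, the count column names (Success, Failure), and the rows are all hypothetical.

```python
def exclude_variables(headers, rows, exclude, n_counts=2):
    """Drop the named predictor columns and sum counts over matching rows.

    headers : column names; the last n_counts columns hold the counts
    rows    : summary-data rows in the same column order
    exclude : comma-separated names, e.g. "Age, Gender"
    """
    drop = {name.strip() for name in exclude.split(",") if name.strip()}
    predictors = headers[:-n_counts]
    unknown = drop - set(predictors)
    if unknown:
        raise ValueError(f"not predictor columns: {sorted(unknown)}")
    keep = [i for i, name in enumerate(predictors) if name not in drop]
    totals = {}  # insertion-ordered: key = remaining predictor values
    for row in rows:
        key = tuple(row[i] for i in keep)
        counts = totals.setdefault(key, [0] * n_counts)
        for j, value in enumerate(row[-n_counts:]):
            counts[j] += value
    new_headers = [predictors[i] for i in keep] + list(headers[-n_counts:])
    new_rows = [list(key) + counts for key, counts in totals.items()]
    return new_headers, new_rows

# Hypothetical summary data; the count column names are assumptions.
headers = ["Nationality", "Age", "Education", "Gender", "Success", "Failure"]
rows = [
    ["US", 30, "BA", "F", 4, 1],
    ["US", 40, "BA", "M", 2, 3],
    ["UK", 30, "MA", "F", 1, 2],
]
print(exclude_variables(headers, rows, "Age, Gender"))
```

Here the two US/BA rows collapse into a single row once Age and Gender are removed, in the same way that the SUMIF formula collapses rows with the same Temp value.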

10 Responses to Comparing Logistic Regression Models

1. Mark says:

Hello,

But the sentence in the middle about the degree of freedom is missing the relevant term (“df”?).
“Also we need to change to 1 since the difference between the of the two models is 2 – 1 = 1. This is shown in Figure 4.”

Also, is there any page explaining how to determine the degree of freedom when we compare the two models like this case? If so, I appreciate it if you refer me to that page.

• Charles says:

Mark,

Thanks for catching the omission of the term df. In fact, the terms LL0 and LL1 are also omitted a few times. The webpage has now been corrected.

The referenced page explains that to obtain the degrees of freedom for the test you use the difference in df from the two models. Admittedly this is not so clear. I will be revising this part of the website shortly and I will make sure that the explanation is clearer.

Charles

2. Jorge M Lecona says:

Charles, your description of the concepts of the logistic regression have been a great help for me in my thesis.

I am currently dealing with a problem that I can’t understand and have not seen referenced either online or in the literature.

For my statistic test of the null hypothesis with the likelihood ratio test I compare a model using the estimators and a model without them.

The values I get for this test, -2[L(0) - L(B)], are negative, which leads me to believe that I cannot reject the null hypothesis. But this is very weird in my opinion, as the shares in the model are predicted quite accurately.

Do you have any idea what the problem could be, or can you provide some insight into what negative values of the likelihood ratio mean?

Jorge

• Charles says:

Jorge,
I don’t think L(B) can be more negative than L(0), and so it is likely that you have either (1) made a mistake in calculating L(B) or L(0) or (2) reversed the roles of L(0) and L(B) in some way. Most likely it is (2), in which case the real value of -2[L(0) - L(B)] is the absolute value of the value you obtained.
Charles

3. Denis patry says:

Greetings M. Zaiontz

Your work is providing quite a body of knowledge (learning, how-to and interpretation) for the scientific and business communities from all horizons.

Considering that your work is a very valuable asset, I hope that precaution has been taken to assure perenety of the GEM (selfish n’est-ce pas?).

Best regards
Denis

• Charles says:

Sorry Denis, but I don’t understand what you mean by “perenety of the GEM”.
Charles

4. Phillip says:

Hello Dr. Zaiontz,

I am a bit confused about the part of the instructions where you copy/paste the LL0 and LL1 variables to compare models.

I am trying to decide if a pilot program was effective or, in other words, if increasing “pass rates” at a particular “test station” were an effect of time or the effect of the program.

To give you more background on the data: I have multiple “test stations” and for each of them a person takes a test and “passes” or “fails”. At one particular station, a 1-month-long program was implemented to try to increase pass rates. The problem is that the proportion of passes significantly increased between the month of the test and the month prior for both the control stations and the treatment station.

How do I decide if the increase in the pass rate for the treatment station is significantly greater than the increase for the control stations?

(I hope I have articulated my problem clearly and I appreciate your help.)

Thank you,
Phillip

• Charles says:

Phillip,
Sorry, but I don’t understand the scenario that you are describing.
The usual situation being described on the referenced webpage is that the LL0 model contains a subset of the variables in the LL1 model. I am not sure this is true for the situation you are describing.
Charles

• Phillip says:

Thank you for responding. I was unsure which LL0 and LL1 values were copied/pasted and where, but I think I understand after reading your comment and the webpage.

I have 2 samples: one is a control and one is treatment. Each member of a sample took a test on a date and either passed or failed. For both samples, it appears the ratio of passes to fails each day increased with time, but what I want to know is if the treatment sample saw a greater increase of the pass-fail ratio over time.
Thank you,
Phillip