We begin by investigating the saturated model, which includes every main effect and every interaction term. We do this by reexamining Example 2 of Independence Testing using a log-linear approach.
Example 1: Create a saturated log-linear model for the data in Example 2 of Independence Testing.
The data for the 150 patients is again summarized in the contingency table in Figure 1.
Define the following coding of the categorical variables:
t1 = 1 if therapy 1 and = -1 if therapy 2
t2 = 1 if cured and = -1 if not cured
Based on this coding the data can be expressed as in Figure 2.
The log-linear model takes the form:
ln y = β0 + β1 t1 + β2 t2 + β3 t1 t2 + ε
Here all the variables are included, together with their interaction term. This is called the saturated model. We now estimate the population coefficients β0, β1, β2, β3. As usual, we use the sample data to find estimates b0, b1, b2, b3 of these coefficients, where
It then follows (using the data in Figure 2) that:
Adding the first two equations and dividing by 2 we get
Adding the first and third equations and dividing by 2 we get
Adding the first and last and dividing by 2 we get
Thus the model is
which is equivalent to
which in turn is equivalent to
Using , the log-linear model takes the form (dropping the error term):
In Figure 3 we provide the contingency table for the logs of the original data in range S13:T14, but this time instead of calculating the marginal totals, we calculate the marginal averages.
Thus, for example, the marginal average for the Cured row (cell U13) contains the formula =AVERAGE(S13:T13) and the marginal average for the Therapy 1 column (cell S15) contains the formula =AVERAGE(S13:S14).
Note that b0 is the grand mean (cell U15), b1 is the mean for Cured (cell U13) minus the grand mean, b2 is the mean for Therapy 1 (cell S15) minus the grand mean, and b3 is the Cured × Therapy 1 value (cell S13) minus the mean for Cured, minus the mean for Therapy 1, plus the grand mean.
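The marginal-average relationships just described can be sketched in a few lines of Python. This is only an illustration: the Cured counts (31 and 57) are quoted in the text, while the Not cured counts (11 and 51) are an assumption standing in for Figure 1, chosen so that the total is 150. The function name saturated_coefficients is hypothetical.

```python
import math

# Observed counts: rows = (Cured, Not cured), columns = (Therapy 1, Therapy 2).
# The Cured row (31, 57) appears in the text; the Not cured row (11, 51) is an
# ASSUMPTION standing in for Figure 1 and should be replaced with the real data.
counts = [[31, 57],
          [11, 51]]


def saturated_coefficients(table):
    """Return (b0, b1, b2, b3) from the marginal averages of the log counts."""
    logs = [[math.log(v) for v in row] for row in table]
    grand = sum(sum(row) for row in logs) / 4       # grand mean of the logs
    cured_mean = sum(logs[0]) / 2                   # Cured row average
    therapy1_mean = (logs[0][0] + logs[1][0]) / 2   # Therapy 1 column average
    b0 = grand
    b1 = cured_mean - grand
    b2 = therapy1_mean - grand
    b3 = logs[0][0] - cured_mean - therapy1_mean + grand
    return b0, b1, b2, b3


b0, b1, b2, b3 = saturated_coefficients(counts)
print(f"b0={b0:.6f} b1={b1:.6f} b2={b2:.6f} b3={b3:.6f}")
# Summing all four coefficients recovers the log of the Cured x Therapy 1 cell.
print(math.exp(b0 + b1 + b2 + b3))
```

With these inputs, b0 + b1 equals the Cured row average of the logs, which is the value 3.738519 used in the next paragraph.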
We now map the log values back into the original contingency table (range R5:U8) by using the exponential function. Thus the marginal average for the Cured row in the original contingency table (cell U6) is EXP(U13) = EXP(3.738519) = 42.0357. This is not the arithmetic mean of 31 and 57, which is 44; it is their geometric mean. We could therefore place the formula =GEOMEAN(S6:T6) in cell U6 and get the same value of 42.0357. The same relationship holds for the other marginal averages.
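As a quick check of this relationship outside Excel, the following Python sketch computes the geometric mean of the two Cured counts as the exponential of the average of their logs. The helper name geometric_mean is introduced here only for illustration.

```python
import math


def geometric_mean(values):
    """Exponential of the arithmetic mean of the logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


cured = [31, 57]                          # Cured row counts quoted in the text
arithmetic = sum(cured) / len(cured)      # ordinary average of the counts
geometric = geometric_mean(cured)         # average taken on the log scale

print(f"arithmetic mean = {arithmetic}")
print(f"geometric mean  = {geometric:.4f}")
```

For two values the geometric mean is simply the square root of their product, so the result can also be compared with math.sqrt(31 * 57).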
Observation: The saturated model is an exact fit for the data (i.e. the error terms are zero), and simply provides a new way of looking at the observed data.
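The exact fit can be verified numerically. The sketch below fits the saturated model under the ±1 coding and then reconstructs every cell count from the coefficients. It rests on two assumptions: the Not cured counts (11 and 51) stand in for Figure 1, and the helper names fit_saturated and predict are hypothetical. Because the four columns of the ±1 design (constant, t1, t2, t1·t2) are mutually orthogonal, each coefficient is just the signed average of the log counts, so no matrix library is needed.

```python
import math

# (t1, t2, count) with t1 = 1 for therapy 1, -1 for therapy 2
# and t2 = 1 for cured, -1 for not cured.
# Counts 31 and 57 are from the text; 11 and 51 are ASSUMED (Figure 1 stand-in).
cells = [(1, 1, 31), (-1, 1, 57), (1, -1, 11), (-1, -1, 51)]


def fit_saturated(data):
    """Return (intercept, therapy, cured, interaction) coefficients."""
    n = len(data)
    design = [(1, t1, t2, t1 * t2) for t1, t2, _ in data]
    logs = [math.log(count) for _, _, count in data]
    # Orthogonal +/-1 columns: coefficient k is (column k . logs) / n.
    return tuple(sum(row[k] * y for row, y in zip(design, logs)) / n
                 for k in range(4))


def predict(coef, t1, t2):
    """Fitted count for one cell, mapped back from the log scale."""
    b_const, b_therapy, b_cured, b_inter = coef
    return math.exp(b_const + b_therapy * t1 + b_cured * t2 + b_inter * t1 * t2)


coef = fit_saturated(cells)
for t1, t2, observed in cells:
    print(t1, t2, observed, round(predict(coef, t1, t2), 6))
```

Four coefficients are fitted to four cells, so the fitted counts coincide with the observed ones and nothing is left over for an error term.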
The values of the coefficients depend on the coding used for the dummy variables. E.g., if we use the coding
t1 = 0 if therapy 1 and = 1 if therapy 2
t2 = 0 if not cured and = 1 if cured
then the log-linear regression model becomes: