We begin by investigating the saturated model, which accounts for all the possible variables. We do this by reexamining Example 2 of Independence Testing using a log-linear approach.

**Example 1**: Create a saturated log-linear model for the data in Example 2 of Independence Testing.

The data for the 150 patients is again summarized in the contingency table in Figure 1.

Define the following coding of the categorical variables:

*t*_{1} = 1 if therapy 1 and = -1 if therapy 2

*t*_{2} = 1 if cured and = -1 if not cured

Based on this coding the data can be expressed as in Figure 2.

The log-linear model takes the form:

ln *y* = *β*_{0} + *β*_{1}*t*_{1} + *β*_{2}*t*_{2} + *β*_{3}*t*_{1}*t*_{2} + *ε*

Here all the variables are included, including the interaction term. This is called the **saturated model**. We now find the values of the population coefficients *β*_{0}, *β*_{1}, *β*_{2}, *β*_{3}. As usual, we use the sample data to find the estimates *b*_{0}, *b*_{1}, *b*_{2}, *b*_{3} of these coefficients, where

ln *y* = *b*_{0} + *b*_{1}*t*_{1} + *b*_{2}*t*_{2} + *b*_{3}*t*_{1}*t*_{2} + *e*

It then follows (using the data in Figure 2) that:

Adding all four equations and dividing by 4 we get

Adding the first two equations and dividing by 2 we get

Adding the first and third equations and dividing by 2 we get

Adding the first and last and dividing by 2 we get
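The add-and-divide steps above can be sketched numerically. The snippet below is a minimal illustration, not the worksheet's own method. It assumes the four cell counts are 31 and 57 cured and 11 and 51 not cured for therapies 1 and 2 respectively (inferred from values quoted elsewhere on this page, since Figure 1 is not reproduced here), and it labels each estimate by the coded column it multiplies rather than by coefficient index. Because the ±1 columns are mutually orthogonal, each estimate is the signed sum of the log counts divided by 4.

```python
import math

# Assumed cell counts, keyed by (therapy, cured). These are inferred from
# values quoted on this page (31, 57, 51 and a total of 150 patients).
counts = {
    (1, True): 31,
    (2, True): 57,
    (1, False): 11,
    (2, False): 51,
}

# Build one row per cell: [1, t1, t2, t1*t2, ln(count)] using the +1/-1 coding.
rows = []
for (therapy, cured), n in counts.items():
    t1 = 1 if therapy == 1 else -1   # therapy 1 -> +1, therapy 2 -> -1
    t2 = 1 if cured else -1          # cured -> +1, not cured -> -1
    rows.append((1, t1, t2, t1 * t2, math.log(n)))

# The four coded columns are orthogonal with squared length 4, so each
# estimate is the signed sum of the log counts divided by 4.
coef = [sum(r[j] * r[4] for r in rows) / 4 for j in range(4)]

labels = ["intercept", "t1 (therapy)", "t2 (cured)", "t1*t2"]
for label, value in zip(labels, coef):
    print(f"{label:>13}: {value:.4f}")

# Exponentiating the fitted logs should return the observed counts.
fitted = [math.exp(sum(c * x for c, x in zip(coef, r[:4]))) for r in rows]
```

The final line recomputes each cell from the four estimates; for a saturated model these fitted values coincide with the observed counts.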

Thus the model is

which is equivalent to

which in turn is equivalent to

Using , the log-linear model takes the form (dropping the error term):

In Figure 3 we provide the contingency table for the logs of the original data in range S13:T14, but this time instead of calculating the marginal totals, we calculate the marginal averages.

Thus, for example, the cell holding the marginal average for the Cured row (U13) contains the formula =AVERAGE(S13:T13), and the cell holding the marginal average for the Therapy 1 column (S15) contains the formula =AVERAGE(S13:S14).

Note that *b*_{0} = the grand mean (cell U15), *b*_{1} = the mean for Cured (cell U13) minus the grand mean, *b*_{2} = the mean for Therapy 1 (cell S15) minus the grand mean and *b*_{3} = Cured × Therapy 1 (cell S13) minus the mean for Cured minus the mean for Therapy 1 plus the grand mean.
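The relationship between the coefficients and the marginal averages can be checked with a short script. This is a sketch under the assumption that the four cell counts are 31, 57, 11 and 51 (inferred from values quoted on this page, not read from the figures); the names `log_counts`, `mean`, `cured_mean` and `therapy1_mean` are illustrative only.

```python
import math

# Logs of the assumed counts, laid out like the table of logs:
# rows are outcomes, columns are therapies.
log_counts = {
    "cured": {"therapy1": math.log(31), "therapy2": math.log(57)},
    "not_cured": {"therapy1": math.log(11), "therapy2": math.log(51)},
}

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

# Marginal averages of the logs.
grand_mean = mean(v for row in log_counts.values() for v in row.values())
cured_mean = mean(log_counts["cured"].values())
therapy1_mean = mean(row["therapy1"] for row in log_counts.values())

# Coefficients expressed through the averages, as described in the note above.
b0 = grand_mean
b1 = cured_mean - grand_mean
b2 = therapy1_mean - grand_mean
b3 = log_counts["cured"]["therapy1"] - cured_mean - therapy1_mean + grand_mean

print(f"b0={b0:.4f} b1={b1:.4f} b2={b2:.4f} b3={b3:.4f}")
```

Here `cured_mean` plays the role of cell U13, `therapy1_mean` of cell S15 and `grand_mean` of cell U15.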

We now map the log values back into the original contingency table (range R5:U8) by using the exponential function. Thus the marginal average for the Cured row in the original contingency table (cell U6) is =EXP(U13), i.e. EXP(3.738519) = 42.0357. Note that 42.0357 is not the arithmetic mean of 31 and 57; it is their geometric mean, since exponentiating the average of the logs yields the geometric mean of the original values. Thus we could also put the formula =GEOMEAN(S6:T6) in cell U6 and get the same value of 42.0357. The same relationship holds for the other marginal averages.
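The link between the average of the logs and the geometric mean can be verified directly. The following sketch uses only the two Cured counts quoted above (31 and 57) and Python's standard library; `statistics.geometric_mean` requires Python 3.8 or later.

```python
import math
from statistics import geometric_mean

cured = [31, 57]  # Cured counts for Therapy 1 and Therapy 2

# Average of the logs (the analogue of cell U13).
log_mean = sum(math.log(n) for n in cured) / len(cured)

# Mapping the log average back with exp gives the geometric mean.
back_transformed = math.exp(log_mean)
geo_mean = geometric_mean(cured)
arith_mean = sum(cured) / len(cured)

print(f"mean of logs      : {log_mean:.6f}")
print(f"exp(mean of logs) : {back_transformed:.4f}")
print(f"geometric mean    : {geo_mean:.4f}")
print(f"arithmetic mean   : {arith_mean:.4f}")
```

The second and third printed values agree, while the arithmetic mean (44) differs.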

**Observation**: The saturated model is an exact fit for the data (i.e. the error terms are zero), and simply provides a new way of looking at the observed data.

The exact values of the coefficients depend on the coding of the dummy variables used. E.g., if we use the coding

*t*_{1} = 0 if therapy 1 and = 1 if therapy 2

*t*_{2}= 0 if not cured and = 1 if cured

then the log-linear regression model becomes:
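Under a 0/1 coding, each coefficient is a difference of log counts relative to the baseline cell (*t*_{1} = *t*_{2} = 0). The sketch below works this out for the coding just stated, again assuming the cell counts 31, 57, 11 and 51 inferred from values quoted on this page; the dictionary `log_n` and its key layout are illustrative, not part of the worksheet.

```python
import math

# Assumed cell counts keyed by (therapy, cured), inferred from this page.
counts = {(1, True): 31, (2, True): 57, (1, False): 11, (2, False): 51}
log_n = {cell: math.log(n) for cell, n in counts.items()}

# Coding: t1 = 0 for therapy 1, 1 for therapy 2; t2 = 0 for not cured, 1 for cured.
b0 = log_n[(1, False)]                  # baseline cell: t1 = 0, t2 = 0
b1 = log_n[(2, False)] - b0             # effect of switching t1 from 0 to 1
b2 = log_n[(1, True)] - b0              # effect of switching t2 from 0 to 1
b3 = log_n[(2, True)] - b0 - b1 - b2    # interaction: what remains when t1 = t2 = 1

print(f"ln y = {b0:.3f} + {b1:.3f} t1 + {b2:.3f} t2 + {b3:.3f} t1*t2")

# The saturated model reproduces every cell exactly.
for (therapy, cured), n in counts.items():
    t1, t2 = int(therapy == 2), int(cured)
    fitted = math.exp(b0 + b1 * t1 + b2 * t2 + b3 * t1 * t2)
    assert abs(fitted - n) < 1e-9
```

Swapping which therapy is coded 0 changes the baseline cell, and with it the intercept and the signs of some coefficients, while the fitted counts stay the same.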

Thank you for your website – it is most helpful! I am having trouble understanding the log-linear regression model with the alternate coding at the bottom of the page. I would think that b0 = 3.932 (ln 51, when t1 = t2 = 0), which gave me the following model: ln y = 3.932 – 1.534 t1 + 0.111 t2 + 0.925 t1t2.

Thanks for your comment. The coding I actually used is

t1 = 0 if therapy 1 and = 1 if therapy 2 (instead of 1 for therapy 1 and 0 for therapy 2 which is how it was stated)

t2 = 0 if not cured and = 1 if cured

That probably accounts for the difference. I have now corrected the webpage.

Charles