In addition to the Logistic Regression data analysis tool, the Real Statistics Resource Pack provides the following supplemental functions for binary logistic regression.
Real Statistics Function: The following are array functions where R1 contains data in either raw or summary form.
LogitCoeff(R1, lab, raw, head, alpha, iter) – calculates the logistic regression coefficients for data in raw or summary form. Includes the standard errors, Wald statistic, p-value and 1 – α confidence interval. If head = TRUE then R1 contains column headings.
LogitCoeff2(R1, R2, lab, head alpha, iter) – calculates the logistic regression coefficients as for LogitCoeff. R1 contains the data for the independent variables and R2 contains the data for the dependent variable. If R2 has one column then data is in raw format, while if R2 has two columns then data is in summary format. R1 and R2 can contain non-numeric data; these are treated as missing data and are not included in the analysis.
LogitTest(R1, lab, raw, iter) – calculates LL of the full and reduced models, the chi-square statistic and the p-value for the data in range R1 (without headings)
LogitRSquare(R1, lab, raw, iter) – calculates LL of the full and reduced models for the data in range R1 (without headings) and three versions of R2 (McFadden, Cox and Snell, Nagelkerke) as well as AIC and BIC
Here the parameters lab, raw, alpha, head and iter are optional.
When lab = True then the output includes row and/or column headings and when lab = False (the default) only the data is outputted.
When raw = True then the data in R1 is in raw form and when raw = FALSE (the default) then the data in R1 is in summary form. The parameter alpha is used to calculate a confidence interval and takes a value between 0 and 1 with a default value of .05. The parameter iter determines the number of iterations used in the Newton method for calculating the logistic regression coefficients; the default value is 20. The default value of head is False.
The Real Statistics Resource Pack also provides the following functions:
LogitPred(R0, R1, raw, iter) – outputs the probability of success for the values of the independent variables contained in the range R0 (in the form of either a row or column vector) based on the logistic regression model calculated from the data in R1 (without headings)
LogitPredC(R0, R2) – outputs the probability of success for the values of the independent variables contained in the range R0 (in the form of either a row or column vector) based on the logistic regression coefficients contained in R2 (in the form of a column vector)
LogitSummary(R1, head) – array function which takes the raw data in range R1 and outputs an equivalent array in summary form. If head = TRUE then R1 contains column headings as well as the output.
LogitSelect(R1, s, head) – array function which takes the summary data in range R1 and outputs an array in summary form based on s. If head = TRUE then R1 includes column headings as well as the output. The string s is a comma delimited list of independent variables in R1 and/or interactions between such variables. E.g. if s = “2,3,2*3” then the data for the independent variables in columns 2 and 3 of R1 plus the interaction between these variables are output.
LogitReduce(R1, s) – array function that takes summary data in R1 including column headings and fills the highlighted range with the data in R1 omitting the columns described by the string s, where s is a comma delimited list of column headings in R1.
LogitCoeffs(R1, iter) – array function which outputs an n+3 × 1 range, where n = # of columns in R1, consisting of the n+1 logit coefficients for R1, an empty cell and the # iterations actually made; iter = maximum # of iterations (default 20).
Observation: LogitCoeffs is a simpler version of the LogitCoeff function with the following added feature. If the algorithm fails before the number of iterations in iter, then the coefficients obtained to that point as well as the # of successful iterations are reported.
Observation: Figure 1 shows the use of some of the supplemental functions described above for Example 1 of Comparing Logistic Regression Models where the model data is in summary form. The output should agree with the output obtained using the Logistic Regression data analysis tool as shown in Figure 2 of Comparing Logistic Regression Models.
Figure 1 – Use of Logistic Regression functions
Note that output shown in range G18:G20 and the output shown in range G12:M14 is produced using the same formula, =LogitCoeff(A4:D15). The only difference is the shape of the output range (i.e. the highlighted range).
To find the predicted probability that a reptile in the study will be male when the temperature is 21 and there is no water, we use the formula =LogitPred(I19:J19,A4:D15) in cell K19. We see that this probability is 89.9%. If we use absolute addressing for the second argument then we can highlight the range K19:K20 and press Ctrl-D to get the predicted probability when the temperature is 21.5 and there is no water, namely 83.8%.
The function LogitRSquare produces two values not provided by the data analysis tool, namely AIC (Akaike’s Information Criterion) and BIC (Baysian Information Criterion). We have already seen AIC for linear regression (see Significant Testing of Multiple Regression Variables). These statistics are calculated using the following formulas:
AIC = -2LL + 2(k+1) BIC = -2LL + (k+1) ln(N)
where N = the total number of observations. Both of these statistics are used to compare logistic regression models (as described in Comparing Logistic Regression Models), and can be used even when the models are not nested. The lower their value the better the model. Both favor models with fewer parameters (i.e. fewer independent variables), although BIC favors fewer parameters even more than AIC.
Observation: The function LogitSelect can be used to create the reduced and interaction models shown in Comparing Logistic Regression Models. E.g. referring to Figure 2 and 3 of that webpage, LogitSelect(A3:D15,”1″,TRUE) can be used to create the data in range B28:D34. The formula LogitSelect(A3:D15,”1,2,1*2″, TRUE) can be used to create the data in range B70:F82 of Figure 6 of that webpage.
Observation: We can also use these supplemental functions when the data is in raw format. Referring to Figure 4 and Example 2 of Finding Logistic Regression Coefficients using Newton’s Method we could obtain the results shown in Figure 2.
Figure 2 – Use of Logistic Regression functions with raw input data
Observation: When working with data in raw format, both the Logistic Regression data analysis tool and the LogitSummary function are limited to 65,500 rows of data. When you have a range with more than 65,500 rows, you need to use the following supplemental array function:
LogitMatches(R1, R2) – array function which outputs the total number of successes and failures for each row in R2 based on the raw data in R1.
R1 is an m × n range in logistic regression raw data format without column headings and R2 is a k × n–1 range. The output of the function is a k × 2 range where the first column contains counts of all the matches between R1 and R2 where the last column of R1 contains a 1 (for success) and the second column contains counts of all the matches between R1 and R2 where the last column of R1 contains a 0 (for failure).
We show how to use this function for the raw data in range A3:C52 for Example 2 of Finding Logistic Regression Coefficients using Newton’s Method (the first 15 rows of which are repeated on the left side of Figure 3 below).
Figure 3 – Obtaining a summary using LogitMatches
We first highlight range A3:B52, press Ctrl-C, click on cell E3 and then press Ctrl-V. This makes a copy of the first two columns of the data and places it in range E3:F52. We next select Data > Data Tools|Remove Duplicates. This will remove all the duplicates in the data, leaving the range E3:F9 in Figure 3. Note that this procedure will work even if there are more than 65,500 rows of data even after the duplicates are removed.
Now highlight the range G4:H9 and enter the array formula
After pressing Ctrl-Shift-Enter the result will be as displayed in Figure 3. It turns out that the output from the LogitMatches function is also limited to 65,500 rows, but you don’t need to handle all the rows at once.
If, for example, the range A3:C200003 holds the original data and the range E3:F100003 holds the data from the first two columns with all duplicates eliminated then you can highlight the range G4:H50003 and enter the array formula
Then highlight the range G50004:H100003 and enter the array formula
Even though the original data range contains 200,000 rows and the data range after duplicates are eliminated contains 100,000 rows, by splitting the 100,000 rows in half and using LogitMatches separately on each half we are able to generate the equivalent of the output from the LogitSummary function without violating the 65,500 row limit.
Once you have completed this operation you can use the Logistic Regression data analysis tool, using the summarized output obtained. The Logistic Regression data analysis tool can handle summarized data with more than 65,500 rows.