Real Statistics Functions for Logistic Regression

In addition to the Logistic Regression data analysis tool, the Real Statistics Resource Pack provides the following supplemental functions for binary logistic regression.

Real Statistics Function:  The following are array functions where R1 contains data in either raw or summary form.

LogitCoeff(R1, lab, raw, head, alpha, iter) – calculates the logistic regression coefficients for data in raw or summary form. Includes the standard errors, Wald statistic, p-value and 1 – α confidence interval. If head = TRUE then R1 contains column headings.

LogitCoeff2(R1, R2, lab, head alpha, iter) – calculates the logistic regression coefficients as for LogitCoeff. R1 contains the data for the independent variables and R2 contains the data for the dependent variable. If R2 has one column then data is in raw format, while if R2 has two columns then data is in summary format. R1 and R2 can contain non-numeric data; these are treated as missing data and are not included in the analysis.

LogitTest(R1, lab, raw, iter) – calculates LL of the full and reduced models, the chi-square statistic and the p-value for the data in range R1 (without headings)

LogitRSquare(R1, lab, raw, iter) – calculates LL of the full and reduced models for the data in range R1 (without headings) and three versions of R2 (McFadden, Cox and Snell, Nagelkerke) as well as AIC and BIC

Here the parameters lab, raw, alpha, head and iter are optional.

When lab = True then the output includes row and/or column headings and when lab = False (the default) only the data is outputted.

When raw = True then the data in R1 is in raw form and when raw = FALSE (the default) then the data in R1 is in summary form. The parameter alpha is used to calculate a confidence interval and takes a value between 0 and 1 with a default value of .05. The parameter iter determines the number of iterations used in the Newton method for calculating the logistic regression coefficients; the default value is 20. The default value of head is False.

The Real Statistics Resource Pack also provides the following functions:

LogitPred(R0, R1, raw, iter) – outputs the probability of success for the values of the independent variables contained in the range R0 (in the form of either a row or column vector) based on the logistic regression model calculated from the data in R1 (without headings)

LogitPredC(R0, R2) – outputs the probability of success for the values of the independent variables contained in the range R0 (in the form of either a row or column vector) based on the logistic regression coefficients contained in R2 (in the form of a column vector)

LogitSummary(R1, head) – array function which takes the raw data in range R1 and outputs an equivalent array in summary form. If head = TRUE then R1 contains column headings as well as the output.

LogitSelect(R1, s, head) – array function which takes the summary data in range R1 and outputs an array in summary form based on s. If head = TRUE then R1 includes column headings as well as the output. The string s is a comma delimited list of independent variables in R1 and/or interactions between such variables. E.g. if s = “2,3,2*3” then the data for the independent variables in columns 2 and 3 of R1 plus the interaction between these variables are output.

LogitReduce(R1, s) – array function that takes summary data in R1 including column headings and fills the highlighted range with the data in R1 omitting the columns described by the string s, where s is a comma delimited list of column headings in R1.

LogitCoeffs(R1, iter) – array function which outputs an n+3 × 1 range, where n =  # of columns in R1, consisting of the n+1 logit coefficients for R1, an empty cell and the # iterations actually made;  iter = maximum # of iterations (default 20).

Observation: LogitCoeffs is a simpler version of the LogitCoeff function with the following added feature. If the algorithm fails before the number of iterations in iter, then the coefficients obtained to that point as well as the # of successful iterations are reported.

Observation:  Figure 1 shows the use of some of the supplemental functions described above for Example 1 of Comparing Logistic Regression Models where the model data is in summary form. The output should agree with the output obtained using the Logistic Regression data analysis tool as shown in Figure 2 of Comparing Logistic Regression Models.

Logistic regression functions summary

Figure 1 – Use of Logistic Regression functions

Note that output shown in range G18:G20 and the output shown in range G12:M14 is produced using the same formula, =LogitCoeff(A4:D15). The only difference is the shape of the output range (i.e. the highlighted range).

To find the predicted probability that a reptile in the study will be male when the temperature is 21 and there is no water, we use the formula =LogitPred(I19:J19,A4:D15) in cell K19. We see that this probability is 89.9%. If we use absolute addressing for the second argument then we can highlight the range K19:K20 and press Ctrl-D to get the predicted probability when the temperature is 21.5 and there is no water, namely 83.8%.

The function LogitRSquare produces two values not provided by the data analysis tool, namely AIC (Akaike’s Information Criterion) and BIC (Baysian Information Criterion). We have already seen AIC for linear regression (see Significant Testing of Multiple Regression Variables). These statistics are calculated using the following formulas:

AIC = -2LL + 2(k+1)                  BIC = -2LL + (k+1) ln(N)

where N = the total number of observations. Both of these statistics are used to compare logistic regression models (as described in Comparing Logistic Regression Models), and can be used even when the models are not nested. The lower their value the better the model. Both favor models with fewer parameters (i.e. fewer independent variables), although BIC favors fewer parameters even more than AIC.

Observation: The function LogitSelect can be used to create the reduced and interaction models shown in Comparing Logistic Regression Models. E.g. referring to Figure 2 and 3 of that webpage, LogitSelect(A3:D15,”1″,TRUE) can be used to create the data in range B28:D34. The formula LogitSelect(A3:D15,”1,2,1*2″, TRUE) can be used to create the data in range B70:F82 of Figure 6 of that webpage.

Observation: We can also use these supplemental functions when the data is in raw format. Referring to Figure 4 and Example 2 of Finding Logistic Regression Coefficients using Newton’s Method we could obtain the results shown in Figure 2.

Logistic regression raw functions

Figure 2 – Use of Logistic Regression functions with raw input data

Observation: When working with data in raw format, both the Logistic Regression data analysis tool and the LogitSummary function are limited to 65,500 rows of data. When you have a range with more than 65,500 rows, you need to use the following supplemental array function:

LogitMatches(R1, R2) – array function which outputs the total number of successes and failures for each row in R2 based on the raw data in R1.

R1 is an m × n range in logistic regression raw data format without column headings and R2 is a k × n–1 range. The output of the function is a k × 2 range where the first column contains counts of all the matches between R1 and R2 where the last column of R1 contains a 1 (for success) and the second column contains counts of all the matches between R1 and R2 where the last column of R1 contains a 0 (for failure).

We show how to use this function for the raw data in range A3:C52 for Example 2 of Finding Logistic Regression Coefficients using Newton’s Method (the first 15 rows of which are repeated on the left side of Figure 3 below).

Logistic regression summary procedure

Figure 3 – Obtaining a summary using LogitMatches

We first highlight range A3:B52, press Ctrl-C, click on cell E3 and then press Ctrl-V. This makes a copy of the first two columns of the data and places it in range E3:F52. We next select Data > Data Tools|Remove Duplicates. This will remove all the duplicates in the data, leaving the range E3:F9 in Figure 3. Note that this procedure will work even if there are more than 65,500 rows of data even after the duplicates are removed.

Now highlight the range G4:H9 and enter the array formula

=LogitMatches(A4:C52,E4:F9)

After pressing Ctrl-Shift-Enter the result will be as displayed in Figure 3. It turns out that the output from the LogitMatches function is also limited to 65,500 rows, but you don’t need to handle all the rows at once.

If, for example, the range A3:C200003 holds the original data and the range E3:F100003 holds the data from the first two columns with all duplicates eliminated then you can highlight the range G4:H50003 and enter the array formula

=LogitMatches(A4:C200003,E4:F50003)

Then highlight the range G50004:H100003 and enter the array formula

=LogitMatches(A4:C200003,E50004:H100003)

Even though the original data range contains 200,000 rows and the data range after duplicates are eliminated contains 100,000 rows, by splitting the 100,000 rows in half and using LogitMatches separately on each half we are able to generate the equivalent of the output from the LogitSummary function without violating the 65,500 row limit.

Once you have completed this operation you can use the Logistic Regression data analysis tool, using the summarized output obtained. The Logistic Regression data analysis tool can handle summarized data with more than 65,500 rows.

25 Responses to Real Statistics Functions for Logistic Regression

  1. Matt says:

    Hi Charles,

    Is there a limit to the number of columns that the logitmatches function can handle? I have around 409 rows but 16 different columns for the independent variables. I tried using the logitmatches function but got an error of #VALUE!. I tried splitting up the data to have less rows as you mentioned but it still didn’t work.

    Could you help me out with this?

    Thanks,
    Matt

    • Charles says:

      Matt,
      409 rows and 16 columns is not very high and so you don’t even need to use the logitmatches function.
      If you send me an Excel file with your data and results, I will try to figure out what is going on.
      You can find my email address at Contact Us.
      Charles

  2. Thanks for a very useful set of functions and tools.
    Given an expected probability it is possible to calculate an “x” value from the output of the logistic tool as on the logit 3 example worksheet in the Examples workbook part 2. How can one calculate the confidence interval round that “x” value?

    • Charles says:

      Since x is the independent variable, I don’t know what you mean by a confidence interval for the x values.
      Charles

      • Over the Hill now says:

        Thanks for your very rapid reply!
        If a probability is required (say 95%) one can calculate the x value that would give that probability from the formulae at the left of the logit 3 example sheet by using Solver to set p to the required value by adjusting the x value. It is the confidence interval of that x that I hope to be able to calculate. “What x value (irradiation say) can I accept to expect a survival proportion of 95% and with that x value what is the confidence interval of the survival?”

        • Charles says:

          Julian,

          This is an interesting question. First we need to define what such a confidence interval would be.

          Let me discuss a similar problem by referring to Example 1 of Confidence and Prediction Interval for Linear Regression (see Figure 2). The predicted value of x = 20 is y = 73.16 with a prediction interval of (55.36, 90.95). So if I ask for the prediction interval of x when y = 73.16, I need to figure out what this means. One approach is to find the x values that correspond the y values 55.36 and 90.95. This would result in a “prediction interval” of (-8.3, 48.3) for x = 20. I calculated these values by trial and error, using the formula =RegPred(P4,A4:A18,B4:B18,TRUE) where P4 contains the trial value of x and looking for outputs of 55.36 and 90.95.

          If this approach makes sense, a similar thing can be done for logistic regression.

          Charles

  3. Francesco says:

    Hi Charles,
    First off, congratulations for this website, it s explicative and very helpful.
    I just have a problem which I think it s easy to solve, but bc of time constraints I need to cut it short and ask you for your help.
    I have 16 patients total, 5 with disease and 11 w/o disease. In each of these patients we measured a variable in different location (n=14). I wanted to know if with these premises , I can find a cut-off value which can discriminate disease/no disease. FYI I used Levene(equal variances) and Kruskall Wallis (small sample) and found out that the variable (mean value) is higher in disease and the difference is stat significative (p<.05). I was thinking about binomial logistic regression? (X=disease/no disease and Y mean variable value per patient?) I am a bit confused. Thank you so much, I hope I was clear.

    • Charles says:

      Francesco,
      I don’t completely understand the scenario, but based on my understanding of the situation binary logistic regression could be the correct way to go.
      Charles

  4. Wytek Szymanski says:

    Hi Charles,

    I want to predict the probability of a Default given the credit card balance. My raw data consists of 400 records where the credit card balances range distinctly from $0 to $2000. Another column is binary indicating the Default.

    I am having some problems with the LogitSummary function. I have also used Binary Logistic Regression (from Ctrl-m). The output is the same, almost 400 records.

    It seems to me that the credit card balance data should be binned into ranges of some width of say $100. Excel’s Frequency function could provide the start of it. Knowing the number of accounts for any range and the number of defaults within any range should produce the observed probability for any range as well.

    Unlike linear regression where raw data works just fine, in the case of logistic regression a lot of work on the raw data seems to be necessary to produce a required vector of observed probabilities.

    • Charles says:

      It could be interesting to try both approaches: one using the raw data and the other using ranges of data and see which model best fits the data.
      Charles

      • Wytek Szymanski says:

        Using raw data makes no sense because the observed probabilities are either 0% or 100%, either a default or no default.

        My goal is to recommend credit card limits. Looking at my data I can see that defaults are much more frequent with higher balances. I can also see that defaults are much more frequent if the card holder is young, a student, unemployed, or has a low income.

        For example, what credit limit should I recommend for a 20 year old student with a low income so that the probability of a default stays below 5%?

        I am hoping to solve this problem using logistic regression. I am surfing the web for literature to help me. Would you know where I should look?

        • Charles says:

          Wytek,
          Of course for each raw data element the probability is 0% or 100%; that is normal. The beauty of logistic regression is that this raw data is summarized; the summarized data isn’t 0% or 100%. Logistic regression may be the right way to go for this problem. It all depends on the nature of your data. Even linear regression might help, although based on the description of your data, this sounds like a likely fit for logistic regression.
          Charles

          • Wytek Szymanski says:

            In logistic regression the so-called dependent variable is a vector of observed probabilities, numbers between 0 and 1. In my case it is a vector of default rates. How would you compute the default rate for the following:

            There are 100 individuals with credit card balances between $1000 and $2000. 50 out of 100 of these individuals are in default. 40 out of 100 of these individuals are students. half of these students are in default.

            What would be the default rate of a student with a credit card balance between $1000 and $2000.

            This is basically my struggle.

  5. Matthew says:

    Charles,
    Thank you so much for all the information and resources on this site. As amazing as it is I am still in need of some help. I have a logistic model that works well, but the next step is have the model run in the background and map it against real data to confirm the continued accuracy. Sample is very dynamic so it must be updated regularly changing the sample portions and coefficients. Do you have a recommendation/function that calculate p-score and immediately add the new data to the sample while bumping off the oldest observation? Example, calculate p-score with individual observation, p-score=.85 threshold is .5 observation receives a 1. Observation data added to sample, and oldest recoded prediction of 1 is removed. Any help would be greatly appreciated.

  6. siti says:

    Hi Charles,

    I have two questions.

    1. Where can we get the “likelihood ratio” from the Logistic Regression Excel output as in the above figures?

    2. Which R-square [R-Sq (L); R-Sq (CS); R-Sq (N) ] represents the “goodness of fit” for the model?

    Thank you in advance for your response.

    • Charles says:

      1. The LL value can be obtained from the LogitTest or LogitRSquare functions as shown in Figure 1.
      2. They all represent a sort of goodness of fit value, but the situation is not as clear as for linear regression.
      Charles

      • Siti says:

        Thank you so much for your response!
        One thing for sure, your website is one of my favorites now. Keep up sharing 🙂

  7. Shruti Gupta says:

    Whenever I am trying to do logistic regression, its showing “compile error hidden message”. What possible mistake am I doing?

    • Charles says:

      To try to answer your question, please answer the following questions:

      1. When do you see this message: (a) when you press Ctrl-m (or select Real Statistics from the Add-ins ribbon), (b) when you choose Regression or Logistic Regression from the initial dialog box or (c) when you press OK on the Logistic Regression dialog box?

      2. What version of the operating system are you using (Windows XP, Vista, 7, 8, 8.1, Mac OS)? What version of Excel are you using?

      3. What value do you get when you enter the formula =VER() in any cell in a spreadsheet?

      4. Are you able to use some other Real Statistics data analysis tools without getting a similar message?

      Charles

  8. Martin Coetzee says:

    Hi,

    Is it at all possible to call a full logistic regression Direct from Vba?
    Only item that may differ between iterations will be the input range.

    Thanks in advance

  9. shosho says:

    Hi Charles,
    I tried to use the LogitMatches and I followed all the steps stated in this page for this function but the the result was displayed as #NAME?
    for all summery cells ?

    Please reply to me ASAP 🙁

    Many Thanks,

    • Charles says:

      Shosho,
      This is surprising. The LogitMatches function was introduced in Release 2.13 in May 2014; so as long as you are using this release or a later one, you should at least have access to the function. What do you see when you enter =VER() in any cell?
      Charles

Leave a Reply

Your email address will not be published. Required fields are marked *