Logistic Regression

When the dependent variable is categorical it is often possible to show that the relationship between the dependent variable and the independent variables can be represented by using a logistic regression model. Using such a model the value of the dependent variable can be predicted from the values of the independent variables.

We review here binary logistic regression models where the dependent variable only takes one of two values. In Multinomial and Ordinal Logistic Regression we look at multinomial and ordinal logistic regression models where the dependent variable can take 2 or more values.
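
As a minimal sketch of what such a model computes, the Python snippet below evaluates a binary logistic model for a single observation. The function name `predict_probability` and the coefficient values are made up for illustration; they are not taken from any example on this site.

```python
import math

def predict_probability(coefficients, intercept, x):
    """Return P(y = 1 | x) under a binary logistic regression model."""
    # Linear predictor: intercept plus the weighted sum of the independent variables.
    linear = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    # The logistic (sigmoid) function maps the linear predictor into (0, 1).
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients for two independent variables.
p = predict_probability([0.8, -0.5], intercept=-1.0, x=[2.0, 1.0])
# A common (but not mandatory) rule: predict 1 when the probability is at least 0.5.
predicted_class = 1 if p >= 0.5 else 0
```

The 0.5 cutoff is only one possible choice; a different threshold may suit data where one outcome is rare.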


65 Responses to Logistic Regression

  1. Amit Kumar Gupta says:

    I am getting different coefficients/estimates from the Excel add-in, Spark MLlib, and even R for the same data set.

    Is this expected behavior? If yes, can you please provide an explanation?

    I have searched and found the information below; it would be great if you could shed some light on this.
    R’s glm is returning a maximum likelihood estimate of the model while Spark’s LogisticRegressionWithLBFGS is returning a regularized model estimate.
    Please refer to following URL –

    • Charles says:

      I have tested the Real Statistics logistic regression model against some other sources and found that they match. There are many techniques available and so I cannot guarantee that the result will match with all of them.

  2. Shri says:

    Hi Charles,

    I am trying to build a model to predict turnover in our organization based on:
    1. Tenure
    2. Age
    3. Training attended
    4. Salary range
    5. Gender

    Any suggestions as to how I should go about this, especially given that I am trying it out in Excel?

    Also, if your e-book is out, I would love to have it.


    • Charles says:

      It sounds like some sort of regression model, but I can’t say exactly without a more complete description of the scenario.
      I am working on the book now and hope to have it out in a few weeks.

      • Shri says:

        Hi Charles,

        Thanks for your response.

        To give you an example I have list of all the employees (i.e. Population) their tenure in the organization in years (Independent – Continuous variable) and if they are still active or resigned (Dichotomous – Dependent variable). I have other data points (Independent variables) for same set of employees i.e. age, salary range etc.

        I wanted to build a model based on the above data set to predict which employees would resign.

        To start with I was planning to run a logistic regression for Tenure and active and resigned data set to predict the chances of turnover based on tenure. Similarly I would run it with other variables.

        Is this the right method?
        Do I need to find the correlation between each independent variable and the dependent variable (Resignation or Active)? If yes, how should I do this given that the dependent variable is dichotomous?



        • Charles says:

          I would simply run a logistic regression and not worry about the correlations.

          • shri says:

            Hi Charles,


            I have downloaded the Real Statistics add-in. Does it not have logistic regression in it? I saw a YouTube video that shows logistic regression. Have I downloaded an old version?


          • Charles says:

            You will find logistic regression under the Regression option.

          • Shri says:

            Hi Charles,

            I also thought logistic regression should be under Regression, but unfortunately I did not find it.

            Kind regards

  3. Gary says:

    Mr. Zaiontz,
    Thanks for the great site!
    I ran a binary logistic regression of Y on each of three different numerical variables A, B, C respectively. I am having an issue of separation of variables, meaning that after certain values Ao, Bo, Co of A, B, C respectively (different values for each, of course) all responses are successes (I guess this forces the slope to diverge to minus infinity for the slope of the curve to accommodate the abrupt change of 1 to 0). Then I increased the success levels to three: high, medium and low. But now I have lack-of-fit issues. How does one interpret lack-of-fit issues with a logistic regression? I know that a lack of fit in a simple linear regression means that the data is not linear, but what does it mean for a logistic regression? Does it mean the (log of) the data is not distributed like an S-curve Exp(L)/(1+Exp(L))?

  4. Krin says:

    Hi Charles

    Is there a limit to the number of rows and columns of data (apart from the Excel-specific ones) one can use to do a binary logistic regression using your pack?

  5. Robert says:

    Hi Charles,

    I’m trying to do logistic regression with some categorical/nominal inputs. I’m worried about multicollinearity problems from turning them into dummy variables. I was wondering if you could tell me whether I would have this problem or if it should be okay. I have a binary output (0 or 1) and my input is a set of 2 dummy variables representing 4 different scenarios: a control (0,0), a treatment a (1,0), a treatment b (0,1), or treatments a+b (1,1). I read your blog and watched the YouTube video and ran the regression (I also did a chi-squared test and it seems that there is a correlation to be found), but I’m not sure whether the results are great, especially because most of my output variable observations are 0 and only 1-3% are 1.

    I’m looking at precision/recall as well but I want to know if I’m working with the model properly. Would these 2 variables give me any problems as far as you can tell? I’m just curious about whether it is appropriate to use logistic regression in this way. And what if I encoded my variables differently? For example, if I had 4 dummy variables that were 0/1 for each of the 4 scenarios (meaning for every data point, exactly one of the four was set to 1)? Would that cause a multicollinearity problem? Also, is there some other concern I may be missing here?



    • Charles says:

      When most of the output is 0 and few are 1, you will certainly need a larger sample than if the data were more balanced.
      Regarding multicollinearity, I would need to see your data before I can really comment further.

  6. Shalaw says:

    Hello Dr. Charles Zaiontz

    Dear Prof. I would like to have your comment or suggestion on my situation.
    I have collected the data: there are 300 non-injury cases and only 17 injury cases. Four categorical variables are significant according to the chi-square test, so I then used multiple logistic regression on the significant variables. Three of them are significant again. Does this make sense? I would like to know whether I can use multiple logistic regression, given that only 17 of the 317 respondents were injured.
    I used SPSS to analyze the data.

    If I cannot run it, what should I do? Is there any way to solve it?

    I appreciate all your help and support; it’s been a great encouragement to me

    • Charles says:

      Since I only have very limited information about the analysis that you have done, I will limit my response to the issue that only 17 of 317 respondents had an injury. I don’t see any reason a priori why you couldn’t use logistic regression. One caution is that the power of the test may suffer a lot from such an unbalanced model. E.g. if you are conducting a two sample t test with effect size .5 and alpha .05, then for two samples of size 300 and 17 the power of the test would be 52%, while if the two samples have size 158 and 159 then the power of the test would be 99%. Thus even though the total sample sizes are the same, the power of the more balanced test is much higher.
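
The power figures quoted in the reply above can be roughly reproduced with a short Python sketch. It assumes a two-sided test and uses a normal approximation to the noncentral t distribution, so it is not necessarily the calculation Charles used; the function name `approx_power_two_sample` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power_two_sample(effect_size, n1, n2, alpha=0.05):
    """Approximate power of a two-sided, two-sample t test (normal approximation)."""
    z = NormalDist()
    # Noncentrality parameter for a standardized effect size with group sizes n1, n2.
    delta = effect_size * sqrt(n1 * n2 / (n1 + n2))
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Probability of rejecting in either tail when the true effect equals effect_size.
    return z.cdf(delta - z_crit) + z.cdf(-delta - z_crit)

power_unbalanced = approx_power_two_sample(0.5, 300, 17)
power_balanced = approx_power_two_sample(0.5, 158, 159)
print(f"300 vs 17:  {power_unbalanced:.2f}")
print(f"158 vs 159: {power_balanced:.2f}")
```

Under this approximation the two results should come out close to the 52% and 99% mentioned above, which illustrates how much power depends on the size of the smaller group.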

  7. Prachi says:

    Hi Charles,
    Do you have an e-book that explains the above topics in detail?
    If yes, please share the link.

  8. Jonathan Andro Tan says:

    Hi Charles,

    How do I interpret the Chi-sq and p-value in binomial logistic regression, and likewise R-sq and Hosmer?


    • Charles says:

      Hi Jonathan,
      This is explained on the appropriate logistic regression webpages on the website. Please look at these explanations. If you are still having problems, please ask me a more specific question so that I can try to help you.

  9. Amonpun says:

    Dear Charles,
    I am using Excel 2016 but I couldn’t use your tool pack. There was an error saying it is incompatible with the version or architecture of this application. Could you please give me a suggestion?

    • Charles says:

      The usual reason is that you need to make sure that Excel’s Solver is operational before you install the Real Statistics Resource Pack. This is described in the installation instructions (on the same webpage from which you downloaded the Real Statistics software).
      To see whether Solver is operational, press Alt-TI and see whether Solver appears on the list with a check mark next to it. If there is no check mark, you need to add it.

  10. Andrew Gizbert says:

    I am trying to estimate the learning curve equation for SW developers. I have 25 developers output over their first 18 months of work. Their output does follow a Sigmoid curve.

    My goal is to use the 25 sets of data to build an estimate with confidence intervals for a new developer (ie what might be their output in month 3 or 6 etc) – if they follow past historical patterns.

    Output is normalized as “estimated delivered hours per work effort”.

    What is your recommendation for handling this data? I think simply averaging output by month for the 25 developers will mask the variability that I am trying to capture.

    • Charles says:

      I can’t think of another approach, but perhaps someone else in the community has an idea.

  11. Steven says:

    Hello Mr. Charles,
    Thanks for your introduction to logistic regression. I followed your site webpage by webpage, and these webpages helped me a lot. But I have a question about an example of logistic regression that I saw on Wikipedia.


    The example states,
    A group of 20 students spend between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability that the student will pass the exam? The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0).

    Hours 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50

    Pass 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1

    I attempted to use the method from your Example 1, but there is a difference. In your Example 1, each category (rem) typically has more than one person, so most values of P(E) are strictly between 0 and 1. But in this example from Wikipedia, almost every P(E) is 1 or 0, and ln(P(E)) is negative infinity or positive infinity.

    My question is: how do I solve this problem?

    Thanks in advance


    • Charles says:

      You need to take the transpose of the data as the input to the Real Statistics Logistic Regression data analysis tool. If you do, you will get the same answers as those you found on Wikipedia.

      • Yolanda says:

        I also have a problem with the example on Wikipedia. I can’t get the same intercept and slope. What am I doing wrong? Do I need to convert the response variable from binary, and if so, how?

        • Charles says:


          Is this the data for the example that you are referring to?
          Hours 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50
          Pass 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1

          If so, the response variable (Pass) is clearly binary.

          If you send me an Excel file with the analysis that you are trying to perform I will try to figure out what is going wrong.
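
For readers who want to check the hours/pass data quoted in this thread outside Excel, here is a minimal Python sketch that fits the model on the raw 0/1 observations by Newton's method. Working on the raw data avoids the infinite log-odds problem raised above, since no per-group proportions are needed. This is an illustrative implementation, not the algorithm used by the Real Statistics tool, and the function names are made up for this sketch.

```python
import math

hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def score(b0, b1, xs, ys):
    """Gradient of the log-likelihood with respect to (b0, b1)."""
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    return g0, g1

def fit_logistic(xs, ys, iterations=25):
    """Fit intercept b0 and slope b1 by Newton's method on the raw 0/1 data."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0, g1 = score(b0, b1, xs, ys)
        # Entries of the 2 x 2 information matrix (negative Hessian).
        h00 = h01 = h11 = 0.0
        for x in xs:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system H * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

intercept, slope = fit_logistic(hours, passed)
print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
```

The printed intercept and slope can then be compared with the values published alongside the Wikipedia example.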


  12. Shreya says:

    Hello Charles,
    I have 6 independent variables in my analysis and one dependent variable. All are binary (both independent and dependent). What type of analysis should I use to determine the equation involving all of them?

  13. Good morning, Doctor, and apologies for writing in Spanish. If you have a binary dependent variable and a binary independent variable, can logistic regression be applied? And if you have a binary dependent variable and several binary independent variables, can logistic regression also be applied?
    Please excuse my translation.

    • Charles says:

      If your dependent variable is binary and some but not all of your independent variables are binary, then you can try to apply logistic regression.
      If all your variables are binary then you should use log-linear regression and not logistic regression. The simple case reduces to the chi-square test of independence. Namely, if your dependent variable is binary and you also have one independent variable which is binary, then essentially you have a 2 x 2 contingency table which you can address via chi-square or related techniques.
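
To illustrate the 2 x 2 case mentioned in the reply above, here is a minimal Python sketch of the Pearson chi-square test of independence. The counts are hypothetical, no continuity correction is applied, and the function name `chi_square_2x2` is made up for this example.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    # Shortcut formula for a 2 x 2 table (no continuity correction).
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts: rows are x = 0 / x = 1, columns are y = 0 / y = 1.
statistic, p_value = chi_square_2x2(30, 10, 15, 25)
print(f"chi-square = {statistic:.3f}, p-value = {p_value:.5f}")
```

A small p-value here would suggest that the two binary variables are not independent.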

  14. Nozomi says:

    Dear Charles;
    Thank you so much for sharing such an excellent tool.
    It’s very useful for people all over the world.

    Could you add the tool for probit regression and tobit regression
    next time you upload the new version?

    I think these two models are widely used in the field of social science.
    Of course we can often use logit regression instead of using probit model,
    but sometimes it is not appropriate.

    Again, I very much appreciate that you made this wonderful website.
    Thank you.

    • Charles says:

      Thanks for your comment. I already have probit and tobit on my potential enhancements list. The next release will focus on time series analysis, but I will consider adding probit and/or tobit regression to one of the following releases.

  15. Mike Bronson says:

    Dr. Zaiontz,
    Thanks for your very useful application!
    Would you be kind enough to point out how a user could go about reporting the testing power from a logistic multiple regression?
    As background, I observed 544 trees for whether they held cones or not as the dependent variable (i.e., 0 or 1), and for their length and age both continuous variables. Your Real Stat report showed that among the independent variables, ln length had a significant association with cone presence while ln age and interaction did not. For ln age in particular, the report shows the following: coeff b = -0.107, s.e.= 0.291, Wald stat = 0.135, p = 0.713, exp(b) = .899.
    How could I use Real Stat to find the power of the test whether the log normal age’s coefficient b differs from zero?
    Thanks and Best Regards!

  16. samiullah says:

    Dear Charles,

    Please send me study materials on the theoretical background of logistic regression.

    • Charles says:

      The theoretical background of logistic regression is provided on the Real Statistics website. If you need additional detail, please look at the Bibliography.

  17. Niraj says:


    Thanks for your informational website. I am very new to regression analysis. I am learning from your website and YouTube videos. I have downloaded your Excel plug-in and am working on logistic regression. I am running into some challenges that I thought you could help with…

    I have created my training data set and when I run the logistic regression, sometimes I get all garbage values, like below:

    p-Pred Suc-Pred Fail-Pred LL % Correct HL Stat

    Re-running the same data set again, sometimes it is working. Can you help me with the reason why I should be getting these errors?



    • Charles says:

      I can’t think of any reason why one time the procedure works and the next time you get garbage, except that perhaps one time you include the column headings option and the next time you don’t (which makes the program think there is invalid data).
      I will take a look at the Excel file you emailed me and try to figure out what is going on.

  18. Michael H says:

    Hi Charles,

    I am trying to use a logistic regression to forecast a percentage. The aim is to forecast the turnout at different polling locations in my country. My independent variables are a combination of numerical and categorical data (Month, Day of Week, most recent participation percentage, log of advertising money spent).

    I know it is possible to make forecasts for this data having done so in other statistical software. I’d love to use your package to do it though since most of my work is done in excel and I have had great success with some of your other tools (Thanks!).

    My problem is that when I try to run a logistic regression with participation percentage as the dependent variable, I am told I need to have either 0 or 1 as the dependent variable. Is there any way around this so I can see the coefficients of the independent variables and make forecasts?

    Thanks for your help.

    • Charles says:

      Hi Michael,
      Logistic regression is used to make forecasts where there is binary outcome. It can be extended to a small number of categorical outcomes, but I have not seen it used to output percentages. You can use other regression techniques to forecast percentages, but as far as I am aware not logistic regression.

  19. Paola says:

    Hi Charles,
    before asking my question I wanted to thank you for this website. It has been extremely useful for a research I am doing.
    I have found the employment of the logistic regression easy, however, I am struggling with a further extension of the model to qualitative/categorical variables. I need to consider dichotomous and polytomous explanatory variables, however, I don’t know how to code them. The real problem is with dichotomous variables because I normalise my data taking logs before regressing, this means that I will have Log(1)= 0 and Log(1)= #Value!. How can I include these variables without affecting the accuracy of the whole model?

    • Charles says:

      Presumably, you mean Log(0)= #NUM!. This is a common problem. One approach is to use a Log(x+a) transformation instead of a Log(x) transformation, choosing the constant a so that x+a is always positive.

  20. Richard Tucker says:

    I very much appreciate your making Real Statistics available – and with such clarity! I am using the Logistic Regression module but am unable to obtain results if I enter more than 5 independent variables. Please let me know where I am going wrong. Many thanks.

    • Charles says:

      The problem probably has more to do with your data rather than the number of independent variables. If you send me an Excel file with your data I will try to figure out what is going wrong. See Contact Us for my email.

  21. David says:

    Hi Charles,
    Thanks for the site – very much insightful.

    Question here specific to the log regression function. How does one whittle down the number of variables for input into the model? Is this done as part of the pre-processing or is there an input parameter in any of the menus?

    Please advise.

    • David says:

      Another thing please Charles,
      Applying the model results in plenty of #VALUE errors on the summary page. I have checked for possible formatting issues and eliminated nulls. What is the reason for this?
      Could I please send you the workbook?

    • Charles says:

      I have not provided any means for automatically whittling down the number of variables. I find that these automatic approaches can be sound mathematically, but they don’t take into account knowledge of the actual domain, especially since they usually don’t handle interactions between the variables, quadratic and higher powers of the variables, etc.

  22. Aurelio says:

    Dear Charles,

    Thanks for the useful information on this website. I am, however, having a few problems with a logistic regression I am running to test the relationship between a specific type of financial report (let’s call it type-a, where accounting information prevails, and type-b, where non-accounting info prevails) and the type of rating it gives (good or bad). My hypothesis is that there is no significant relation between type-b and bad ratings. Ratings are always scaled from 1 to 21. However, I have divided the ratings classification into two classes, so as to have class A (good) and class B (bad). So, I have collected a set of data with 50 reports in which I have recoded types of report as 1 (accounting, type-a) and 0 (non-accounting, type-b) and classes as 1 (good, A) and 0 (bad, B). Even though, from eyeballing the data, there is a very weak relation between type-b and bad ratings (only 3 times out of 50 they coincide… 0 – 0), and although the logit regression on the binary variables gives me a coefficient for the grade equal to 0.1245, the p-value is very high (0.98944). I cannot explain why this happened, since the process of data gathering and research was very rigid. Could it be because I only ran the logit regression on a set of dummy variables (the type is the independent variable and the grade is the dependent variable)? What can be the problem?
    Thanks in advance!

    • Charles says:

      If you send me an Excel spreadsheet with your data, I will try to figure out whether there is a problem or help explain what is going on. You can find my email address at Contact Us.

  23. delante moore says:


    Is there a limit to the number of independent variables? I have a dataset with 45 independent variables that I am trying to analyze. If this is above the limit, can you suggest an alternative?

    • Charles says:

      You should be able to run the logistic regression with 45 independent variables. With such a large number of variables, you will also need a reasonably large sample (at least 45 just to get the model to run, much more to achieve reasonable power).

  24. Margaret says:

    Dear Charles,

    Thanks for this excellent explanation.
    I followed your instructions and mostly it worked well. However, when I tried to test a categorical independent variable (1:using multipurpose solution as cleaner; 2: using H2O2 as cleaner; total 41 inputs) and I did make sure the last column in the Input Range contained the 0 or 1 dichotomous values of the dependent variable (microbial contamination of the contact lens system), the outcome cells revealed #VALUE!. What did it mean?
    Thank you for your help.

    • Charles says:

      The usual explanation is that the logistic regression model did not converge to a solution. If you send me your worksheet I will check it out.

  25. Jonathan says:


    Thank you for this excellent explanation.

    I am building a dataset with three continuous independent variables (binned into values of 1 through 5 corresponding with standard deviation ranges above and below the mean) that I am testing to a dichotomous categorical dependent variable.

    My first attempt to use your data tool gave me cells with all significant categories from p-Pred and rightward containing only a #NAME output. There is a formula there, but it isn’t picking up data. It happened with both the Solver and Newton approach. I only had 30 inputs (and should have 100, under your minimum formula) so that may be the issue. But, I wanted to make sure there wasn’t something else going on before I kept adding data.

    I assume, by the way, that the input box assumes the last column on the right of the data set is the dichotomous output and that all columns to the left of that column in the selected range are the inputs. In other words, the columns must be contiguous and arranged in this fashion.

    Thank you.

    • Charles says:


      Using 30 inputs should not cause this problem.

      If you have chosen the Raw Data option, then the last column in the Input Range contains the 0 or 1 dichotomous values of the dependent variable. This column needs to be included in the data range. If you have chosen the Summary Data option, however, then the last two columns are associated with the dependent variable. The first of these contains the total number of successes for the corresponding independent variables and the second of these columns contains the number of failures for the corresponding independent variables (these totals won’t necessarily be 0 or 1).

      If you have done all of this correctly, then please make sure that you are using the latest release of the Real Statistics software. You can check this by using the worksheet formula =VER(). You should get the value 12.0 or 12.1 (if you are using the Windows version of the software). I made a few changes a number of releases ago which could be the cause of the problem that you have identified.

      If none of this resolves the problem, I would be happy to look at the worksheet you are using and see if I can resolve the problem. Just email it to me.
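
The difference between the two input layouts described in the reply above can be sketched in Python. The snippet collapses raw rows (independent variables followed by a 0/1 outcome) into summary rows whose last two entries are the success and failure counts. The data and the function name `raw_to_summary` are hypothetical; this only illustrates the layouts as described and is not code from the Real Statistics software.

```python
def raw_to_summary(rows):
    """Collapse raw rows (x1, ..., xk, y) with y in {0, 1} into summary rows
    (x1, ..., xk, successes, failures), one per distinct combination of x values."""
    counts = {}
    for *xs, y in rows:
        key = tuple(xs)
        successes, failures = counts.get(key, (0, 0))
        if y == 1:
            successes += 1
        else:
            failures += 1
        counts[key] = (successes, failures)
    # Dictionaries keep insertion order, so rows appear in first-seen order.
    return [key + totals for key, totals in counts.items()]

# Hypothetical raw data: one binned independent variable, then the 0/1 outcome.
raw = [(1, 0), (1, 0), (1, 1), (2, 1), (2, 0), (3, 1), (3, 1)]
summary = raw_to_summary(raw)
print(summary)
```

Each summary row then carries totals rather than a single 0 or 1, which matches the note above that these totals won’t necessarily be 0 or 1.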

