When the dependent variable is categorical it is often possible to show that the relationship between the dependent variable and the independent variables can be represented by using a logistic regression model. Using such a model the value of the dependent variable can be predicted from the values of the independent variables.

We review here **binary logistic regression** models where the dependent variable only takes one of two values. In Multinomial and Ordinal Logistic Regression we look at **multinomial and ordinal logistic regression** models where the dependent variable can take more than two values.

Topics:

- Basic Concepts
- Finding Coefficients using Excel’s Solver
- Significance Testing of Logistic Regression Coefficients
- Testing Fit of the Logistic Regression Model
- Finding Coefficients using Newton’s Method
- Handling Categorical Coding
- Comparing Logistic Regression Models
- Hosmer-Lemeshow Test
- Classification Table
- ROC Curve
- Real Statistics Functions

I am getting different coefficient estimates from the Excel add-in, Spark MLlib, and even R for the same data set.

Is this expected behavior? If so, could you please explain why?

I searched and found the information below; it would be great if you could shed some light on it.

R’s glm returns a maximum likelihood estimate of the model, while Spark’s LogisticRegressionWithLBFGS returns a regularized model estimate.

Please refer to the following URL:

http://datascience.stackexchange.com/questions/5710/why-does-logistic-regression-in-spark-and-r-return-different-models-for-the-same?newreg=084d82d6809040afa2d3aacb36a9128f&newreg=aa58ae673fd24216af6b2972f34604e8

Amit,

I have tested the Real Statistics logistic regression model against some other sources and found that they match. There are many techniques available and so I cannot guarantee that the result will match with all of them.

Charles
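One way to see how a penalized estimate can differ from a maximum likelihood estimate is to fit both on the same small data set. The following is a minimal Python sketch, not the code used by R, Spark, or Real Statistics; the data, the `fit_logistic` helper, and the penalty weight `lam` are all made up for illustration.

```python
import math

def fit_logistic(xs, ys, lam=0.0, iters=50):
    """Fit p = 1 / (1 + exp(-(b0 + b1*x))) by Newton's method.

    lam is a hypothetical L2 penalty weight applied to the slope b1 only;
    lam = 0 gives the ordinary maximum likelihood estimate.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Gradient (g) and Hessian (h) of the penalized negative log-likelihood
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += p - y
            g1 += (p - y) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        g1 += lam * b1
        h11 += lam
        # Newton step: solve the 2x2 system H * step = g
        det = h00 * h11 - h01 * h01
        b0 -= (h11 * g0 - h01 * g1) / det
        b1 -= (h00 * g1 - h01 * g0) / det
    return b0, b1

# Illustrative data only (not separable, so the MLE exists)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

mle = fit_logistic(xs, ys)                  # unpenalized fit
penalized = fit_logistic(xs, ys, lam=1.0)   # slope shrunk toward zero
print("maximum likelihood:", mle)
print("penalized:", penalized)
```

The penalized slope comes out smaller in magnitude than the unpenalized one, so two packages can report different coefficients for the same data if one of them applies a penalty by default.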

Hi Charles,

I am trying to build a model to predict turnover in our organization based on:

1. Tenure

2. Age

3. Training attended

4. Salary range

5. Gender

Do you have any suggestions as to how I should go about this, especially given that I am trying to do it in Excel?

Also, if your e-book is out, I would love to have it.

Thanks

Shri

Shri,

It sounds like some sort of regression model, but I can’t say exactly without a more complete description of the scenario.

I am working on the book now and hope to have it out in a few weeks.

Charles

Hi Charles,

Thanks for your response.

To give you an example, I have a list of all the employees (i.e. the population), their tenure in the organization in years (independent, continuous variable), and whether they are still active or have resigned (dichotomous, dependent variable). I have other data points (independent variables) for the same set of employees, i.e. age, salary range, etc.

I want to build a model based on the above data set to predict which employees will resign.

To start with, I was planning to run a logistic regression of the active/resigned status on tenure to predict the chance of turnover based on tenure. I would then do the same with each of the other variables.

Is this the right method?

or

Do I need to find the correlation between each independent variable and the dependent variable (resigned or active)? If so, how should I do this given that the dependent variable is dichotomous?

Regards

Shri

Shri,

I would simply run a logistic regression and not worry about the correlations.

Charles

Hi Charles,

Thanks.

I have downloaded the Real Statistics add-in. Does it not include logistic regression? I saw a YouTube video that shows logistic regression. Have I downloaded an old version?

Shri

Shri,

You will find logistic regression under the Regression option.

Charles

Hi Charles,

I also thought logistic regression should be under Regression, but unfortunately I did not find it there.

Kind regards

Shri

Mr. Zaiontz,

Thanks for the great site!

I ran a binary logistic regression of Y on each of three different numerical variables A, B, and C. I am having an issue with separation: beyond certain values Ao, Bo, Co of A, B, C respectively (different values for each, of course), all responses are successes. (I guess this forces the slope to diverge to minus infinity so that the curve can accommodate the abrupt change from 1 to 0.) I then increased the number of success levels to three: high, medium, and low. But now I have lack-of-fit issues. How does one interpret lack of fit in a logistic regression? I know that lack of fit in a simple linear regression means that the data is not linear, but what does it mean for a logistic regression? Does it mean that (the log of) the data does not follow an S-curve of the form ExpL/(1+ExpL)?

Gary,

Generally, this is true. Just as for linear regression, there can be other problems as well.

Charles

Hi Charles

Is there a limit to the number of rows and columns of data (apart from the Excel-specific ones) one can use to do a binary logistic regression using your pack?

Hi Krin,

The limit is 65,500 rows and columns. There is a way to exceed the rows limit as explained on the webpage:

http://www.real-statistics.com/logistic-regression/real-statistics-functions-logistic-regression/

Charles

Hi Charles,

I’m trying to do logistic regression with some categorical/nominal inputs. I’m worried about multicollinearity problems from turning them into dummy variables. I was wondering if you could tell me whether I would have this problem or if it should be okay. I have a binary output (0 or 1) and my input is a set of 2 dummy variables representing 4 different scenarios: a control (0,0), a treatment a (1,0), a treatment b (0,1), or treatments a+b (1,1). I read your blog, watched the YouTube video, and ran the regression (I also did a chi-squared test and it seems that there is a correlation to be found), but I’m not sure whether the results are great, especially because most of my output variable observations are 0 and only 1-3% are 1.

I’m looking at precision/recall as well, but I want to know if I’m working with the model properly. Would these 2 variables give me any problems as far as you can tell? I’m just curious whether it is appropriate to use logistic regression in this way. And what if I encoded my variables differently? For example, what if I had 4 dummy variables that were 0/1 for each of the 4 scenarios (meaning that for every data point, exactly one of the four was set to 1)? Would that cause a multicollinearity problem? Also, is there some other concern I may be missing here?

Thanks,

Robert

Robert,

When most of the output is 0 and few are 1, you will certainly need a larger sample than if the data were more balanced.

Regarding multicollinearity, I would need to see your data before I can really comment further.

Charles
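On the alternative coding raised in the question, one general point does not depend on the particular data: one indicator column per scenario, used together with an intercept, is redundant because the four indicators always sum to the constant column. A small Python sketch with made-up rows follows; the names `LEVELS`, `TWO_FLAGS`, and the sample `rows` are illustrative assumptions, not the actual data.

```python
# Hypothetical rows: one scenario label per observation
rows = ["control", "a", "b", "a+b", "a", "control"]

LEVELS = ["control", "a", "b", "a+b"]
TWO_FLAGS = {"control": (0, 0), "a": (1, 0), "b": (0, 1), "a+b": (1, 1)}

# Coding 1: four indicators, exactly one of which is 1 in each row
one_hot = [[1 if r == level else 0 for level in LEVELS] for r in rows]

# Coding 2: the two flags described in the question
two_flags = [list(TWO_FLAGS[r]) for r in rows]

# The four indicators add up to 1 in every row, i.e. to the intercept column,
# so an intercept plus all four indicators is perfectly collinear.
one_hot_sums = [sum(row) for row in one_hot]
print(one_hot_sums)

# The two flags add up to 0, 1, or 2 depending on the row, so they do not
# reproduce the intercept column in this way.
two_flag_sums = [sum(row) for row in two_flags]
print(two_flag_sums)
```

Dropping any one of the four indicators removes this particular redundancy; whether other collinearity problems exist still depends on the data.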

Hello Dr. Charles Zaiontz

Dear Prof. I would like to have your comment or suggestion on my situation.

I have collected the data: there are 300 non-injury cases and only 17 injury cases. Four categorical variables are significant according to the chi-square test. I then ran a multiple logistic regression on the significant variables, and three of them are significant again. Does this make sense? I would like to know whether I can use multiple logistic regression when only 17 of the 317 respondents were injured.

I used SPSS to analyze the data.

If I cannot run it, what should I do? Is there any way to solve this?

I appreciate all your help and support; it’s been a great encouragement to me

Shalaw,

Since I only have very limited information about the analysis that you have done, I will limit my response to the issue that only 17 of 317 respondents had an injury. I don’t see any reason a priori why you couldn’t use logistic regression. One caution is that the power of the test may suffer a lot from such an unbalanced model. E.g. if you are conducting a two sample t test with effect size .5 and alpha .05, then for two samples of size 300 and 17 the power of the test would be 52%, while if the two samples have size 158 and 159 then the power of the test would be 99%. Thus even though the total sample sizes are the same, the power of the more balanced test is much higher.

Charles
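The power figures in the reply above can be roughly reproduced with a normal approximation to the two-sample t test. This is only a sketch: `approx_power` is a name chosen here, and an exact calculation based on the noncentral t distribution would give slightly different values.

```python
from statistics import NormalDist

def approx_power(n1, n2, d=0.5, alpha=0.05):
    """Normal approximation to the power of a two-sided two-sample t test
    with effect size d and group sizes n1 and n2."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Noncentrality: effect size times the square root of n1*n2/(n1+n2)
    delta = d * (n1 * n2 / (n1 + n2)) ** 0.5
    # Probability of rejecting in either tail
    return z.cdf(delta - z_crit) + z.cdf(-delta - z_crit)

unbalanced = approx_power(300, 17)
balanced = approx_power(158, 159)
print(round(unbalanced, 2))   # about 0.52
print(round(balanced, 2))     # about 0.99
```

The total sample size is 317 in both calls; only the split between the two groups changes.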

Thank you very much dear Dr. Charles. I much appreciate your comment and discussion.

Hi Charles,

Do you have an e-book in which the above topics are explained in detail?

If yes, please share the link.

Thanks,

Prachi

Prachi,

An ebook is coming soon.

Charles

Hi Charles,

How do I interpret the chi-square statistic and p-value in binomial logistic regression? I have the same question about R-square and the Hosmer-Lemeshow test.

Thanks.

Hi Jonathan,

This is explained on the appropriate logistic regression webpages on the website. Please look at these explanations. If you are still having problems, please ask me a more specific question so that I can try to help you.

Charles

Dear Charles,

I am using Excel 2016, but I couldn’t use your tool pack. There was an error saying that it is incompatible with the version or architecture of this application. Could you please give me a suggestion?

Amonpun,

The usual reason is that you need to make sure that Excel’s Solver is operational before you install the Real Statistics Resource Pack. This is described in the installation instructions (on the same webpage from which you downloaded the Real Statistics software).

To see whether Solver is operational, press Alt-TI and see whether Solver appears on the list with a check mark next to it. If there is no check mark, you need to add it.

Charles

I am trying to estimate the learning curve equation for SW developers. I have 25 developers output over their first 18 months of work. Their output does follow a Sigmoid curve.

My goal is to use the 25 sets of data to build an estimate with confidence intervals for a new developer (i.e. what their output might be in month 3, month 6, etc.), assuming they follow past historical patterns.

Output is normalized as “estimated delivered hours per work effort”.

What is your recommendation for handling this data? I think simply averaging output by month for the 25 developers will mask the variability that I am trying to capture.

Andrew,

I can’t think of another approach, but perhaps someone else in the community has an idea.

Charles

Hello Mr. Charles,

Thanks for your introduction to logistic regression. I have been following your site webpage by webpage, and these webpages have helped me a lot. But I have a question about an example of logistic regression that I saw on Wikipedia.

https://en.wikipedia.org/wiki/Logistic_regression

The example states,

A group of 20 students spend between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability that the student will pass the exam? The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0).

Hours 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50

Pass 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1

I attempted to use the method from your Example 1, but there is a difference. In your Example 1, each category (rem) typically contains more than one person, so most values of P(E) are strictly between 0 and 1. But in this example from Wikipedia, almost every P(E) is 1 or 0, and so ln(P(E)) is negative or positive infinity.

My question is how to solve this problem.

Thanks in advance

Steven

Steven,

You need to take the transpose of the data as the input to the Real Statistics Logistic Regression data analysis tool. If you do, you will get the same answers as those you found on Wikipedia.

Charles
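As an independent illustration of why rows with individual 0/1 outcomes are not a problem: maximum likelihood fitting works with the likelihood of each observation directly and never takes the logarithm of an observed proportion. The following minimal Python sketch applies Newton's method to the 20 observations quoted above. It is not the Real Statistics implementation, and `newton_logit` is a name chosen here.

```python
import math

hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def newton_logit(xs, ys, iters=25):
    """Maximum likelihood fit of p = 1 / (1 + exp(-(b0 + b1*x)))."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Gradient (g) and Hessian (h) of the negative log-likelihood
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += p - y
            g1 += (p - y) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system H * step = g
        det = h00 * h11 - h01 * h01
        b0 -= (h11 * g0 - h01 * g1) / det
        b1 -= (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = newton_logit(hours, passed)
print("intercept:", b0)
print("slope:", b1)
```

The intercept and slope should land close to the values reported in the Wikipedia article.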

I also have a problem with the example on Wikipedia. I can’t get the same intercept and slope. What am I doing wrong? Do I need to convert the response variable from binary and, if so, how?

Thanks

Yolanda,

Is this the data for the example that you are referring to?

Hours 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50

Pass 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1

If so, the response variable (Pass) is clearly binary.

If you send me an Excel file with the analysis that you are trying to perform I will try to figure out what is going wrong.

Charles

Hello Charles,

I have 6 independent variables and one dependent variable in my analysis. All of them (both independent and dependent) are binary. What type of analysis should I use to determine the equation involving all of them?

Shreya,

You can use logistic regression.

Charles

Good day, Doctor. I apologize for writing in Spanish. If there is one binary dependent variable and one binary independent variable, can I apply logistic regression? And if there is a binary dependent variable and several binary independent variables, can logistic regression also be applied?

Please excuse my translation.

Gerardo,

If your dependent variable is binary and some but not all of your independent variables are binary, then you can try to apply logistic regression.

If all your variables are binary then you should use log-linear regression and not logistic regression. The simple case reduces to the chi-square test of independence: if your dependent variable is binary and you also have one independent variable which is binary, then essentially you have a 2 x 2 contingency table, which you can address via chi-square or related techniques.

Charles
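For the 2 x 2 case mentioned above, the chi-square test of independence can be sketched in a few lines of Python. The counts below are invented for illustration, `chi_square_2x2` is a name chosen here, and no continuity correction is applied.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: rows = independent variable (0/1), columns = outcome (0/1)
chi2, p_value = chi_square_2x2(20, 10, 10, 20)
print(round(chi2, 3), round(p_value, 4))
```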

Dear Charles;

Thank you so much for sharing such an excellent tool.

It’s very useful for people all over the world.

Could you add tools for probit regression and tobit regression the next time you upload a new version?

I think these two models are widely used in the field of social science.

Of course, we can often use logit regression instead of a probit model, but sometimes this is not appropriate.

Again, I very much appreciate that you have made such a wonderful website.

Thank you.

Nozomi,

Thanks for your comment. I already have probit and tobit on my potential enhancements list. The next release will focus on time series analysis, but I will consider adding probit and/or tobit regression to one of the following releases.

Charles

Dr. Zaiontz,

Thanks for your very useful application!

Would you be kind enough to point out how a user could go about reporting the testing power from a logistic multiple regression?

As background, I observed 544 trees for whether they held cones or not as the dependent variable (i.e., 0 or 1), and for their length and age both continuous variables. Your Real Stat report showed that among the independent variables, ln length had a significant association with cone presence while ln age and interaction did not. For ln age in particular, the report shows the following: coeff b = -0.107, s.e.= 0.291, Wald stat = 0.135, p = 0.713, exp(b) = .899.

How could I use Real Stat to find the power of the test of whether the coefficient b of ln age differs from zero?

Thanks and Best Regards!

Mike,

Testing whether a coefficient is significant is described on the webpage

Significance of Logistic Regression Coefficients

This is provided on the report generated from the Logistic Regression data analysis tool, as described on the webpage

Finding Logistic Regression Coefficients

By the way, this is not considered to be testing for “power”, but is instead considered to be testing for “significance”. Power is something else.

Charles
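For reference, the significance figures quoted in the question are related to each other as follows. This short Python check assumes the Wald statistic is the squared ratio of the coefficient to its standard error, referred to a chi-square distribution with 1 degree of freedom.

```python
import math

b, se = -0.107, 0.291   # coefficient and standard error quoted in the question

wald = (b / se) ** 2                        # Wald statistic
p_value = math.erfc(math.sqrt(wald / 2))    # upper tail of chi-square with 1 df
odds_ratio = math.exp(b)                    # exp(b)

print(round(wald, 3), round(p_value, 3), round(odds_ratio, 3))
```

To three decimals this reproduces the reported Wald statistic, p-value, and exp(b). It is a significance calculation, not a power calculation.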

Dear Charles,

Please send me study materials on the theoretical background of logistic regression.

The theoretical background of logistic regression is provided on the Real Statistics website. If you need additional detail, please look at the Bibliography.

Charles

Charles,

Thanks for your informative website. I am very new to regression analysis. I am learning from your website and YouTube videos. I have downloaded your Excel plug-in and am working on logistic regression. I am running into some challenges that I thought you could help with…

I have created my training data set, and when I run the logistic regression I sometimes get all garbage values, like those below:

p-Pred Suc-Pred Fail-Pred LL % Correct HL Stat

#VALUE! #VALUE! #VALUE! #VALUE! #VALUE! #VALUE!

Re-running the same data set again, sometimes it is working. Can you help me with the reason why I should be getting these errors?

Thanks,

-Niraj

Niraj,

I can’t think of any reason why one time the procedure works and the next time you get garbage, except that perhaps one time you include the column headings option and the next time you don’t (which makes the program think there is invalid data).

I will take a look at the Excel file you emailed me and try to figure out what is going on.

Charles

Hi Charles,

I am trying to use a logistic regression to forecast a percentage. The aim is to forecast the turnout at different polling locations in my country. My independent variables are a combination of numerical and categorical data (Month, Day of Week, most recent participation percentage, log of advertising money spent).

I know it is possible to make forecasts for this data, having done so in other statistical software. I’d love to use your package to do it, though, since most of my work is done in Excel and I have had great success with some of your other tools (Thanks!).

My problem is that when I try to run a logistic regression with participation percentage as the dependent variable, I am told I need to have either 0 or 1 as the dependent variable. Is there any way around this so I can see the coefficients of the independent variables and make forecasts?

Thanks for your help.

Hi Michael,

Logistic regression is used to make forecasts where there is binary outcome. It can be extended to a small number of categorical outcomes, but I have not seen it used to output percentages. You can use other regression techniques to forecast percentages, but as far as I am aware not logistic regression.

Charles

Hi Charles,

Before asking my question, I want to thank you for this website. It has been extremely useful for research I am doing.

I have found the employment of the logistic regression easy, however, I am struggling with a further extension of the model to qualitative/categorical variables. I need to consider dichotomous and polytomous explanatory variables, however, I don’t know how to code them. The real problem is with dichotomous variables because I normalise my data taking logs before regressing, this means that I will have Log(1)= 0 and Log(1)= #Value!. How can I include these variables without affecting the accuracy of the whole model?

Paola,

Presumably, you mean Log(0)= #NUM!. This is a common problem. One approach is to use a Log(x+a) transformation instead of a Log(x) transformation, choosing the constant a so that x+a is always positive.

Charles
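A minimal Python sketch of the Log(x+a) transformation described above. The helper name `shifted_log` and the default rule for choosing `a` (shift so that the smallest value maps to ln(1) = 0) are assumptions made for illustration; any constant that keeps x+a positive would do.

```python
import math

def shifted_log(values, a=None):
    """Apply ln(x + a) to each value.

    If a is not given, choose it so that min(x) + a = 1 (an illustrative
    default, not a recommendation from the reply above).
    """
    if a is None:
        a = 1 - min(values)
    if min(values) + a <= 0:
        raise ValueError("a must make every x + a positive")
    return [math.log(x + a) for x in values]

# A 0/1 variable: here a = 1, so 0 maps to ln(1) = 0 and 1 maps to ln(2)
transformed = shifted_log([0, 1, 0, 1])
print(transformed)
```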

I very much appreciate your making Real Statistics available – and with such clarity! I am using the Logistic Regression module but am unable to obtain results if I enter more than 5 independent variables. Please let me know where I am going wrong. Many thanks.

Richard,

The problem probably has more to do with your data rather than the number of independent variables. If you send me an Excel file with your data I will try to figure out what is going wrong. See Contact Us for my email.

Charles

Hi Charles,

Thanks for the site; it is very insightful.

I have a question specific to the logistic regression function. How does one whittle down the number of variables for input into the model? Is this done as part of the pre-processing, or is there an input parameter in one of the menus?

Please advise.

David

Another thing please Charles,

Applying the model results in plenty of #VALUE errors on the summary page. I have checked for possible formatting issues and eliminated nulls. What is the reason for this?

Could I please send you the workbook?

Please send the workbook to my email address, which is listed on the Contact Us webpage.

Charles

Hi, I sorted this one out: I still had nulls in the data set.

Thanks

David,

I have not provided any means for automatically whittling down the number of variables. I find that these automatic approaches can be sound mathematically, but they don’t take into account knowledge of the actual problem domain, especially since they usually don’t handle interactions between the variables, quadratic and higher powers of the variables, etc.

Charles

Dear Charles,

Thanks for the useful information on this website. I am, however, having a few problems with a logistic regression that I am running to test the relationship between the type of a financial report (type-a, where accounting information prevails, and type-b, where non-accounting information prevails) and the type of rating it gives (good or bad). My hypothesis is that there is no significant relation between type-b and bad ratings. Ratings are always scaled from 1 to 21. However, I have divided the ratings into two classes: class A (good) and class B (bad). I have collected a data set of 50 reports in which I have recoded the type of report as 1 (accounting, type-a) or 0 (non-accounting, type-b) and the class as 1 (good, A) or 0 (bad, B). From eyeballing the data, there is a very weak relation between type-b and bad ratings (they coincide, as 0 and 0, only 3 times out of 50). The logit regression on the binary variables gives me a coefficient for the grade equal to 0.1245, but the p-value is very high (0.98944). I cannot explain why this happened, since the process of data gathering and research was very rigorous. Could it be because I only ran the logit regression on a set of dummy variables (the type is the independent variable and the grade is the dependent variable)? What could be the problem?

Thanks in advance!

If you send me an Excel spreadsheet with your data, I will try to figure out whether there is a problem or help explain what is going on. You can find my email address at Contact Us.

Charles

Charles,

Is there a limit to the number of independent variables? I have a dataset with 45 independent variables that I am trying to analyze. If this is above the limit, can you suggest an alternative?

You should be able to run the logistic regression with 45 independent variables. With such a large number of variables, you will also need a reasonably large sample (at least 45 just to get the model to run, much more to achieve reasonable power).

Charles

My sample size is 66,000. Could that be what is causing the problem?

You can use the Logistic Regression data analysis tool even with 66,000 elements, but with more than 65,500 elements you need to uncheck the Show summary in output option. This is described on the webpage Finding Logistic Regression Coefficients using Newton’s Method.

Charles

Thanks, you are amazing.

Here is another dumb question: I have 3 categories with 5 dummy variables each. What is the best way to set this up so that my output produces a result with an intercept plus coefficients for the 15 dummy variables?

Can I send you the worksheet? I cannot figure out why I can’t get the program to produce a solution.

Dear Charles,

Thanks for this excellent explanation.

I followed your instructions, and mostly they worked well. However, when I tried to test a categorical independent variable (1: using multipurpose solution as cleaner; 2: using H2O2 as cleaner; 41 inputs in total), the output cells showed #VALUE!, even though I made sure that the last column in the Input Range contained the 0 or 1 dichotomous values of the dependent variable (microbial contamination of the contact lens system). What does this mean?

Thank you for your help.

Sincerely,

Margaret

Margaret,

The usual explanation is that the logistic regression model did not converge to a solution. If you send me your worksheet I will check it out.

Charles

Charles,

Thank you for this excellent explanation.

I am building a dataset with three continuous independent variables (binned into values of 1 through 5 corresponding with standard deviation ranges above and below the mean) that I am testing to a dichotomous categorical dependent variable.

My first attempt to use your data tool gave me cells with all significant categories from p-Pred and rightward containing only a #NAME output. There is a formula there, but it isn’t picking up data. It happened with both the Solver and Newton approach. I only had 30 inputs (and should have 100, under your minimum formula) so that may be the issue. But, I wanted to make sure there wasn’t something else going on before I kept adding data.

I assume, by the way, that the input box assumes the last column on the right of the data set is the dichotomous output and that all columns to the left of that column in the selected range are the inputs. In other words, the columns must be contiguous and arranged in this fashion.

Thank you.

Jonathan,

Using 30 inputs should not cause this problem.

If you have chosen the Raw Data option, then the last column in the Input Range contains the 0 or 1 dichotomous values of the dependent variable. This column needs to be included in the data range. If you have chosen the Summary Data option, however, then the last two columns are associated with the dependent variable. The first of these contains the total number of successes for the corresponding independent variables and the second of these columns contains the number of failures for the corresponding independent variables (these totals won’t necessarily be 0 or 1).
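To make the difference between the two layouts concrete, here is a small Python sketch that converts raw rows into the summary layout described above (one row per distinct combination of independent values, followed by the number of successes and the number of failures). The function name `raw_to_summary` and the sample rows are made up for illustration; this is not part of the Real Statistics software.

```python
from collections import defaultdict

def raw_to_summary(rows):
    """Convert raw rows (x1, ..., xk, y) with y in {0, 1} into summary rows
    (x1, ..., xk, successes, failures)."""
    counts = defaultdict(lambda: [0, 0])   # key -> [successes, failures]
    for *xs, y in rows:
        if y == 1:
            counts[tuple(xs)][0] += 1
        elif y == 0:
            counts[tuple(xs)][1] += 1
        else:
            raise ValueError("dependent variable must be 0 or 1")
    return [key + (s, f) for key, (s, f) in sorted(counts.items())]

# Hypothetical raw data: one independent variable, then the 0/1 outcome
raw = [(1, 0), (1, 1), (1, 1), (2, 0), (2, 0), (3, 1)]
summary = raw_to_summary(raw)
print(summary)   # [(1, 2, 1), (2, 0, 2), (3, 1, 0)]
```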

If you have done all of this correctly, then please make sure that you are using the latest release of the Real Statistics software. You can check this by using the worksheet formula =VER(). You should get the value 12.0 or 12.1 (if you are using the Windows version of the software). I made a few changes a number of releases ago which could be the cause of the problem that you have identified.

If none of this resolves the problem, I would be happy to look at the worksheet you are using and see if I can resolve the problem. Just email it to me.

Charles