
Hosmer and Lemeshow Applied Logistic Regression PDF 300: An Easily Accessible Resource for LR Model Theory



None of our variables had missing values. The likelihood ratio chi-square of 41.46 with a p-value of 0.0001 tells us that our model as a whole fits significantly better than an empty model (i.e., a model with no predictors).

In the table we see the coefficients, their standard errors, the z-statistic, associated p-values, and the 95% confidence interval of the coefficients. Both gre and gpa are statistically significant, as are the three indicator variables for rank. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.


Now we can say that for a one unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.23. For more information on interpreting odds ratios see our FAQ page How do I interpret odds ratios in logistic regression?
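To make that interpretation concrete, here is a minimal sketch of fitting a model like the admissions example above and converting coefficients to odds ratios. It uses Python's statsmodels rather than the original Stata session, and the file name binary.csv and column names are assumptions for illustration.

```python
# Minimal sketch (assumptions: binary.csv with columns admit, gre, gpa, rank).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("binary.csv")

# C(rank) expands the categorical rank predictor into indicator variables.
fit = smf.logit("admit ~ gre + gpa + C(rank)", data=df).fit()
print(fit.summary())   # coefficients, z-statistics, p-values, 95% CIs

# Coefficients are changes in log odds; exponentiate to get odds ratios
# (e.g., exp of the gpa coefficient gives the factor of 2.23 quoted above).
print(np.exp(fit.params))
```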








  • See our page on non-independence within clusters.

  • See also Stata help for logit

  • Annotated output for the logistic command

  • Interpreting logistic regression in all its forms (in Adobe .pdf form), (from Stata STB53, Courtesy of, and Copyright, Stata Corporation)

  • Textbook examples: Applied Logistic Regression (Second Edition) by David Hosmer and Stanley Lemeshow

  • Beyond Binary Logistic Regression with Stata, with movies

  • Visualizing Main Effects and Interactions for Binary Logit Models in Stata, with movies

  • Stat Books for Loan, Logistic Regression and Limited Dependent Variables

References

  • Hosmer, D., & Lemeshow, S. (2000). Applied Logistic Regression (Second Edition). New York: John Wiley & Sons, Inc.

  • Long, J. Scott, & Freese, Jeremy (2006). Regression Models for Categorical Dependent Variables Using Stata (Second Edition). College Station, TX: Stata Press.

  • Long, J. Scott (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.



The researcher performs a logistic regression, where "success" is a grade of A in the memory test, and the explanatory (x) variable is dose of caffeine. The logistic regression indicates that caffeine dose is significantly associated with the probability of an A grade.


There are many possible reasons that a model may give poor predictions. In this example, the plot of the logistic regression suggests that the probability of an A score does not change monotonically with caffeine dose, as the model assumes. Instead, it increases (from 0 to 100 mg) and then decreases. The current model, P(success) vs caffeine, therefore appears inadequate. A better model might be P(success) vs caffeine + caffeine^2: adding the quadratic caffeine^2 term allows for the increasing and then decreasing relationship of grade to caffeine dose. The logistic model including the caffeine^2 term indicates that the quadratic term is significant (p = 0.003) while the linear caffeine term is not (p = 0.21).
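A short sketch of how such a quadratic term could be added, again using Python's statsmodels; the file name caffeine.csv and the column names grade_A (1 = A grade) and caffeine (dose in mg) are hypothetical.

```python
# Sketch: comparing a linear-only model with one that adds a quadratic term.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("caffeine.csv")   # hypothetical: grade_A, caffeine

# Linear-only model: assumes P(A) changes monotonically with dose.
m1 = smf.logit("grade_A ~ caffeine", data=df).fit()

# Quadratic model: I(caffeine**2) adds the squared-dose term, which lets
# the fitted probability rise and then fall across the dose range.
m2 = smf.logit("grade_A ~ caffeine + I(caffeine**2)", data=df).fit()
print(m2.summary())   # inspect p-values for the linear and quadratic terms
```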


For the caffeine example, the observed number of A grades and non-A grades are known. The expected number (from the logistic model) can be calculated using the equation from the logistic regression. These are shown in the table below.


The logistic regression model for the caffeine data for 170 volunteers indicates that caffeine dose is significantly associated with an A grade.


Compute p(success) for each subject using the coefficients from the logistic regression. Subjects with the same values for the explanatory variables will have the same estimated probability of success. The table below shows the p(success), the expected proportion of volunteers with an A grade, as predicted by the logistic model.
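A sketch of this calculation, continuing the hypothetical caffeine example above: each subject's p(success) comes from the fitted model, and summing those probabilities within a dose group gives the expected number of A grades for that group.

```python
# Sketch: per-subject p(success) and observed vs. expected A grades per dose.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("caffeine.csv")   # hypothetical: grade_A, caffeine
fit = smf.logit("grade_A ~ caffeine + I(caffeine**2)", data=df).fit()

# Subjects with the same explanatory values get the same p(success).
df["p_success"] = fit.predict(df)

table = df.groupby("caffeine").agg(
    n=("grade_A", "size"),
    observed_A=("grade_A", "sum"),
    expected_A=("p_success", "sum"),   # expected count = sum of probabilities
)
print(table)
```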


216 subjects with and 280 without suspected liver disease were studied. Fatty liver (FL) was diagnosed by ultrasonography, and alcohol intake was assessed using a 7-day diary. Bootstrapped stepwise logistic regression was used to identify potential predictors of FL among 13 variables of interest [gender, age, ethanol intake, alanine transaminase (ALT), aspartate transaminase (AST), gamma-glutamyl-transferase (GGT), body mass index (BMI), waist circumference, sum of 4 skinfolds, glucose, insulin, triglycerides, and cholesterol]. Potential predictors were entered into stepwise logistic regression models with the aim of obtaining the simplest and most accurate algorithm for the prediction of FL.
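The selection-frequency idea can be sketched as follows. The study used p-value based stepwise selection in Stata; as a stand-in, this sketch uses scikit-learn's score-based forward selection (a different entry criterion), counting how often each predictor is chosen across bootstrap resamples. The file name, the outcome column fl, and the choice of 5 features to select are all assumptions.

```python
# Sketch of bootstrap selection frequencies (not the paper's exact procedure).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector

df = pd.read_csv("fatty_liver.csv")      # hypothetical: 13 numeric predictors + fl
X, y = df.drop(columns="fl"), df["fl"]

rng = np.random.default_rng(0)
counts = pd.Series(0, index=X.columns)
for _ in range(1000):                    # 1000 bootstrap samples
    idx = rng.integers(0, len(df), len(df))   # resample subjects with replacement
    sel = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=5, direction="forward",
    ).fit(X.iloc[idx], y.iloc[idx])
    counts[sel.get_support()] += 1       # tally which predictors were selected

print(counts.sort_values(ascending=False))    # selection frequency out of 1000
```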


Continuous variables are given as medians and interquartile ranges (IQR) because of skewed distributions. Comparisons of continuous variables between subjects with and without FL were performed with the Mann-Whitney test and those of nominal variables with Fisher's exact test. To identify candidate predictors of FL, we performed a stepwise logistic regression analysis on 1000 bootstrap samples of 496 subjects (probability to enter = 0.05 and probability to remove = 0.1) [21]. All variables besides gender were evaluated as continuous predictors. Linearity of logits was ascertained using the Box-Tidwell procedure [22]. To obtain a linear logit, we transformed age using the coefficient suggested by the Box-Tidwell procedure [(age/10)^4.9255] and ALT, AST, GGT, insulin and triglycerides using natural logarithms (loge). The logits of the other predictors (BMI, waist circumference, glucose, cholesterol, ethanol and the sum of 4 skinfolds) were linear.
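A minimal sketch of applying those transformations before modeling; column names are hypothetical, and the exponent 4.9255 is the Box-Tidwell coefficient reported in the text.

```python
# Sketch: transforming predictors so that each one's logit is linear.
import numpy as np
import pandas as pd

df = pd.read_csv("fatty_liver.csv")   # hypothetical file with the 13 predictors

df["age_bt"] = (df["age"] / 10) ** 4.9255   # Box-Tidwell power transformation
for col in ["alt", "ast", "ggt", "insulin", "triglycerides"]:
    df[f"log_{col}"] = np.log(df[col])      # natural-log transforms

# BMI, waist circumference, glucose, cholesterol, ethanol and the sum of
# 4 skinfolds were already linear in the logit, so they stay untransformed.
```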


Candidate predictors identified at bootstrap analysis were evaluated using three stepwise logistic models before obtaining a final prediction model (probability to enter = 0.01 and probability to remove = 0.02; these more stringent levels were used to protect against type I errors). The goodness of fit of the models was evaluated using the Hosmer-Lemeshow statistic and their accuracy was assessed by calculating the non-parametric area (AUC) under the receiver-operating characteristic (ROC) curve with 95% confidence intervals (95%CI) [23, 24]. The standard errors of the regression coefficients of the final model were calculated using 1000 bootstrap samples of 496 subjects. The probabilities obtained from the final model were multiplied by 100 to obtain the fatty liver index (FLI). The sensitivity (SN), specificity (SP), positive likelihood ratio (LR+) and negative likelihood ratio (LR-) of 10-value intervals of FLI were calculated [23]. Statistical analysis was performed using STATA 9.2 (StataCorp, College Station, Texas, USA).
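The fit and accuracy checks can be sketched as below, assuming a fitted statsmodels Logit result named fit as in the earlier sketches; the decile grouping mirrors the usual Hosmer-Lemeshow construction, and scipy/scikit-learn stand in for the Stata commands.

```python
# Sketch: Hosmer-Lemeshow fit check, AUC, and scaling probabilities to an index.
import numpy as np
import pandas as pd
from scipy.stats import chi2
from sklearn.metrics import roc_auc_score

p = fit.predict()            # fitted probability of fatty liver per subject
y = fit.model.endog          # observed 0/1 outcome

# Hosmer-Lemeshow: group subjects by deciles of predicted risk and compare
# observed with expected counts in each group via a chi-square statistic.
groups = pd.qcut(p, 10, labels=False, duplicates="drop")
observed = pd.Series(y).groupby(groups).sum()
expected = pd.Series(p).groupby(groups).sum()
n = pd.Series(y).groupby(groups).size()
hl = (((observed - expected) ** 2) / (expected * (1 - expected / n))).sum()
print("H-L p-value:", chi2.sf(hl, df=len(observed) - 2))

print("AUC:", roc_auc_score(y, p))   # non-parametric area under the ROC curve

fli = 100 * p                        # fatty liver index: probability times 100
```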


Figure: Selection of candidate predictors at bootstrapped stepwise logistic regression. Bars indicate the number of times out of 1000 that the variables were selected for inclusion in 3 models. Model 1 is the starting model, Model 2 removes insulin, and Model 3 removes skinfolds. Data are sorted using Model 3. Abbreviations: * = transformed using natural logarithm; ** = transformed using the Box-Tidwell procedure (see text for details); other abbreviations as in Table 1.


In this chapter, we will only learn about binary logistic regression, in which the dependent variable can only have two levels (for example, good or bad, 1 or 0, functional or non-functional, admit or not admit). In other words, the dependent variable must be a dummy variable.


In logistic regression, we also create a regression equation that tells us the relationship between the dependent variable and each independent variable. We then use this equation to calculate predicted probabilities that each observation (data point) falls into one category of the dependent variable rather than another. The goal is to accurately predict which category each observation is in. There are multiple possible goodness-of-fit tests for logistic regression, which examine how well (A) our predictions about the category of the dependent variable into which each observation falls match (B) the category into which each observation actually falls.
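One simple version of this check is a classification table that compares the predicted category (probability at or above 0.5) with the actual category; a minimal sketch, reusing the hypothetical admissions data from the first sketch.

```python
# Sketch: predicted probabilities and a classification table as a fit check.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("binary.csv")   # hypothetical: admit, gre, gpa, rank
fit = smf.logit("admit ~ gre + gpa + C(rank)", data=df).fit()

p = fit.predict(df)                        # predicted probability of admit = 1
df["predicted"] = (p >= 0.5).astype(int)   # classify using a 0.5 cutoff

print(pd.crosstab(df["admit"], df["predicted"]))   # actual vs. predicted
print("correctly classified:", (df["admit"] == df["predicted"]).mean())
```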


In logistic regression, each estimated regression coefficient also tells us the relationship between one independent variable and the dependent variable. However, the coefficient tells us the multiplicative relationship between the independent and dependent variable. We will typically prefer to look at the coefficient in its odds ratio form, which we will learn about later in this chapter. The logistic regression predicts that for every one unit increase in the independent variable, the odds of an observation being in the 1 category (rather than the 0 category) of the dependent variable are multiplied by the coefficient in its odds ratio form. Note that we use the same magic words as always to interpret our results, even for logistic regression.
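For example, if a coefficient in its odds ratio form is 1.5, then a one unit increase in that independent variable multiplies the predicted odds of being in the 1 category by 1.5, and a two unit increase multiplies them by 1.5 × 1.5 = 2.25.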


Inference works the same way for linear and logistic regressions. You should look at the confidence intervals and p-values for each estimated coefficient. The meaning of the p-value is the same in both cases. It is the result of a hypothesis test regarding the relationship between the dependent variable and independent variable.
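Continuing the hedged sketch above, both quantities come directly from the fitted statsmodels result object.

```python
# Sketch: confidence intervals and p-values from the fitted model ('fit' above).
import numpy as np

print(fit.conf_int())          # 95% confidence interval for each coefficient
print(fit.pvalues)             # p-value for each coefficient's hypothesis test
print(np.exp(fit.conf_int()))  # the same intervals in odds-ratio form
```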


Of course, as is the case with linear regression, all of the assumptions of logistic regression need to be met before you can consider your results to be trustworthy. We will look at some of the key logistic regression assumptions in this chapter.


The videos and resources in this section are optional to watch or read. The videos contain some introductory material that demonstrates how logistic regression works and how the computer fits a regression model using maximum likelihood estimation. The additional readings explain some of the more technical details and theory related to logistic regression, more than you are required to know in order to use logistic regression effectively and responsibly.

