The logistic regression coefficient β associated with a predictor X is the expected change in the log odds of having the outcome per unit change in X. So increasing the predictor by 1 unit (or going from one level to the next) multiplies the odds of having the outcome by e^β.
Here’s an example:
Suppose we want to study the effect of smoking on the 10-year risk of heart disease. The table below shows the summary of a logistic regression that models the presence of heart disease using smoking as a predictor:
The question is: How to interpret the coefficient of smoking: β = 0.38?
First notice that this coefficient is statistically significant (associated with a p-value < 0.05), so our model suggests that smoking does in fact influence the 10-year risk of heart disease. And because it is a positive number, we can say that smoking increases the risk of heart disease.
But by how much?
1. If smoking is a binary variable (0: non-smoker, 1: smoker):
Then: e^β = e^0.38 = 1.46 will be the odds ratio that associates smoking with the risk of heart disease.
This means that:
The smoking group has 1.46 times the odds of having heart disease compared to the non-smoking group.
Alternatively we can say that:
The smoking group has 46% (1.46 – 1 = 0.46) higher odds of having heart disease than the non-smoking group.
And if heart disease is a rare outcome, then the odds ratio becomes a good approximation of the relative risk. In this case we can say that:
Smoking multiplies the probability of having heart disease by 1.46 compared to non-smokers.
Alternatively we can say that:
There is a 46% greater relative risk of having heart disease in the smoking group compared to the non-smoking group.
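These numbers are easy to verify; a quick check in Python (using only the coefficient from the table above):

```python
import math

beta = 0.38  # logistic regression coefficient for smoking

odds_ratio = math.exp(beta)                       # e^0.38
pct_increase = (odds_ratio - 1) * 100             # percent increase in odds

print(round(odds_ratio, 2))    # 1.46
print(round(pct_increase))     # 46
```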
Note for negative coefficients:
If β = –0.38, then e^β = e^–0.38 = 0.68 and the interpretation becomes: smoking is associated with a 32% (1 – 0.68 = 0.32) reduction in the relative risk of heart disease.
How to interpret the standard error?
The standard error is a measure of uncertainty of the logistic regression coefficient. It is useful for calculating the p-value and the confidence interval for the corresponding coefficient.
From the table above, we have: SE = 0.17.
We can calculate the 95% confidence interval for the odds ratio using the following formula (2 is a convenient approximation of the critical value 1.96):
95% Confidence Interval = exp(β ± 2 × SE) = exp(0.38 ± 2 × 0.17) = [1.04, 2.05]
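This interval can be reproduced in a couple of lines, plugging in β = 0.38 and SE = 0.17:

```python
import math

beta, se = 0.38, 0.17  # coefficient and standard error for smoking

# exponentiate the endpoints of beta ± 2*SE to get the interval for the odds ratio
lower = math.exp(beta - 2 * se)   # exp(0.04)
upper = math.exp(beta + 2 * se)   # exp(0.72)

print(round(lower, 2), round(upper, 2))  # 1.04 2.05
```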
So we can say that:
We are 95% confident that smokers have between 4% and 105% (1.04 – 1 = 0.04 and 2.05 – 1 = 1.05) higher odds of having heart disease than non-smokers.
Or, more loosely we say that:
Based on our data, we can expect an increase of between 4% and 105% in the odds of heart disease for smokers compared to non-smokers.
How to interpret the intercept?
The intercept is β₀ = –1.93 and it should be interpreted assuming a value of 0 for all the predictors in the model.
The intercept has an easy interpretation in terms of probability (instead of odds) if we calculate the inverse logit using the following formula:
e^β₀ ÷ (1 + e^β₀) = e^–1.93 ÷ (1 + e^–1.93) = 0.13, so:
The probability that a non-smoker will have heart disease in the next 10 years is 0.13.
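The inverse logit calculation above can be sketched as follows, using the intercept from the table:

```python
import math

intercept = -1.93  # intercept of the fitted model

# inverse logit: converts log odds back to a probability
prob = math.exp(intercept) / (1 + math.exp(intercept))
# equivalently: prob = 1 / (1 + math.exp(-intercept))

print(round(prob, 2))  # 0.13
```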
Without even calculating this probability, if we only look at the sign of the coefficient, we know that:
- If the intercept has a negative sign: then the probability of having the outcome will be < 0.5.
- If the intercept has a positive sign: then the probability of having the outcome will be > 0.5.
- If the intercept is equal to zero: then the probability of having the outcome will be exactly 0.5.
For more information on how to interpret the intercept in various cases, see my other article: Interpret the Logistic Regression Intercept.
2. If smoking is a numerical variable (lifetime usage of tobacco in kilograms)
Then: e^β (= e^0.38 = 1.46) tells us how much the odds of the outcome (heart disease) will change for each 1-unit change in the predictor (smoking).
An increase of 1 kg in lifetime tobacco usage multiplies the odds of heart disease by 1.46.
An increase of 1 kg in lifetime tobacco usage is associated with an increase of 46% in the odds of heart disease.
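Because the model is linear on the log-odds scale, a k-unit increase multiplies the odds by e^(k × β), not by k × e^β. A quick sketch (the kg values are arbitrary):

```python
import math

beta = 0.38  # coefficient per 1 kg of lifetime tobacco usage

# odds multiplier for increases of 1, 2, and 5 kg: e^(beta * k)
odds_multiplier = {kg: round(math.exp(beta * kg), 2) for kg in (1, 2, 5)}

print(odds_multiplier)  # {1: 1.46, 2: 2.14, 5: 6.69}
```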
Interpreting the coefficient of a standardized variable
A standardized variable is a variable rescaled to have a mean of 0 and a standard deviation of 1. This is done by subtracting the mean and dividing by the standard deviation for each value of the variable.
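As a minimal sketch of that rescaling (the tobacco-usage values below are hypothetical):

```python
import statistics

# hypothetical lifetime tobacco usage in kg for 5 participants
values = [0.0, 1.5, 3.0, 4.5, 6.0]

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation

# subtract the mean and divide by the standard deviation
standardized = [(x - mean) / sd for x in values]
```

The standardized values now have a mean of 0 and a standard deviation of 1, whatever the original scale was.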
Standardization yields comparable regression coefficients, unless the variables in the model have different standard deviations or follow different distributions (for more information, I recommend 2 of my articles: standardized versus unstandardized regression coefficients and how to assess variable importance in linear and logistic regression).
Anyway, standardization is useful when you have more than 1 predictor in your model, each measured on a different scale, and your goal is to compare the effect of each on the outcome.
After standardization, the predictor Xi that has the largest coefficient is the one that has the most important effect on the outcome Y.
However, the standardized coefficient does not have an intuitive interpretation on its own. So in our example above, if smoking was a standardized variable, the interpretation becomes:
An increase of 1 standard deviation in smoking is associated with a 46% (e^β = 1.46) increase in the odds of heart disease.
3. If smoking is an ordinal variable (0: non-smoker, 1: light smoker, 2: moderate smoker, 3: heavy smoker)
Sometimes it makes sense to divide smoking into several ordered categories. This categorization allows the 10-year risk of heart disease to change from one category to the next, and forces it to stay constant within each category instead of fluctuating with every small change in the smoking habit.
In this case the coefficient β = 0.38 will also be used to calculate e^β (= e^0.38 = 1.46), which can be interpreted as follows:
Going up from one level of smoking to the next multiplies the odds of heart disease by 1.46.
Alternatively, we can say that:
Going up from one level of smoking to the next is associated with an increase of 46% in the odds of heart disease.
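Note that coding the levels as 0, 1, 2, 3 assumes the same odds ratio between any two consecutive levels, so relative to non-smokers the odds multiply as 1.46^k. A sketch of that assumption:

```python
import math

beta = 0.38  # same odds ratio assumed between consecutive levels
levels = ["non-smoker", "light", "moderate", "heavy"]

# odds of heart disease relative to non-smokers: e^(beta * k)
relative_odds = {name: round(math.exp(beta * k), 2)
                 for k, name in enumerate(levels)}

print(relative_odds)
# {'non-smoker': 1.0, 'light': 1.46, 'moderate': 2.14, 'heavy': 3.13}
```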
About statistical significance and p-values:
If you include 20 predictors in the model, then on average 1 of them will have a statistically significant p-value (p < 0.05) just by chance, even if none of them truly affects the outcome.
So be wary of:
- including/excluding variables from your logistic regression model based just on p-values.
- labeling effects as “real” just because their p-values were less than 0.05.
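The "1 in 20" figure follows directly from the significance threshold, assuming the 20 tests are independent and all null:

```python
alpha = 0.05       # significance threshold
n_predictors = 20  # predictors with no real effect on the outcome

# expected number of false positives: one per 20 tests at alpha = 0.05
expected_false_positives = alpha * n_predictors

# chance that at least one test comes out "significant" by luck alone
prob_at_least_one = 1 - (1 - alpha) ** n_predictors

print(round(expected_false_positives), round(prob_at_least_one, 2))  # 1 0.64
```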
What if you get a very large logistic regression coefficient?
In our example above, a very large coefficient and standard error can occur, for instance, if we want to study the effect of smoking on heart disease and the large majority of participants in our sample are non-smokers. This is because highly skewed predictors are more likely to produce a logistic model with perfect separation.
Therefore, some variability in the independent variable X is required in order to study its effect on the outcome Y. So make sure you understand your data well enough before modeling them.
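To see why perfect separation blows up the coefficient, consider a toy dataset where every smoker has the outcome and every non-smoker does not (data entirely hypothetical). The log-likelihood keeps improving as the coefficient grows, so maximum likelihood never settles on a finite estimate:

```python
import math

# perfectly separated toy data: smoking status predicts the outcome exactly
x = [0, 0, 0, 1, 1, 1]
y = [0, 0, 0, 1, 1, 1]

def log_likelihood(beta0, beta1):
    """Bernoulli log-likelihood of the data under a logistic model."""
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1 / (1 + math.exp(-(beta0 + beta1 * xi)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return ll

# larger and larger slopes keep increasing the likelihood (toward 0),
# so the fitted coefficient and its standard error diverge
for beta1 in (1, 5, 10, 20):
    print(beta1, round(log_likelihood(-beta1 / 2, beta1), 4))
```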