R-squared is a measure of how well a linear regression model fits the data. It can be interpreted as the proportion of variance of the outcome Y explained by the linear regression model.
For a least-squares model that includes an intercept, it is a number between 0 and 1 (0 ≤ R² ≤ 1). The closer its value is to 1, the more of the variability in Y the model explains, and R² = 0 means that the model explains none of the variability in the outcome Y.
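To make "proportion of variance explained" concrete, here is a minimal sketch that computes R-squared from the usual definition, 1 minus the residual sum of squares divided by the total sum of squares. The text above does not spell out that formula, so treat it as an assumption, and note that the helper name `r_squared` and the toy data are made up for illustration.

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of the variation in y explained by the fitted values y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)     # variation left unexplained by the model
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation of y around its mean
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                          # perfect predictions: 1.0
print(r_squared(y, np.full_like(y, y.mean())))  # always predicting the mean: 0.0
```

The two extremes match the interpretation above: predictions that reproduce Y exactly give 1, and predictions that ignore the data and return the mean of Y give 0.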
On the other hand, the correlation coefficient r is a measure that quantifies the strength of the linear relationship between 2 variables.
r is a number between -1 and 1 (-1 ≤ r ≤ 1):
- A value of r close to -1 indicates a strong negative linear relationship between the 2 variables (when one increases, the other tends to decrease, and vice versa)
- A value of r close to 0 indicates that the 2 variables are not linearly correlated (no linear relationship exists between them)
- A value of r close to 1 indicates a strong positive linear relationship between the 2 variables (when one increases, the other tends to increase too)
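The three cases in this list can be reproduced with a small sketch using NumPy's `np.corrcoef`. The data are invented for illustration and deliberately noise-free, so the values land at the extremes rather than merely close to them.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]      # y rises with x
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]    # y falls as x rises
r_zero = np.corrcoef(x, (x - 3) ** 2)[0, 1]  # symmetric U-shape, no linear trend

print(r_pos, r_neg, r_zero)
```

In the last case the two variables are clearly related, but not linearly, which is why r only speaks to the linear part of a relationship.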
Here are 3 plots that show the relationship between 2 variables with different correlation coefficients:
- The left one was drawn with a coefficient r = 0.80
- The middle one with r = -0.09
- And the right one with r = -0.76:
Below we will discuss the relationship between r and R² in the context of linear regression without diving too deep into the mathematical details.
We start with the special case of a simple linear regression and then discuss the more general case of a multiple linear regression.
R-squared vs r in the case of a simple linear regression
We’ve seen that both r and R-squared measure the strength of the linear relationship between 2 variables, so how do they relate in the case of a simple linear regression?
When we’re dealing with a simple linear regression:
Y = β₀ + β₁X + ε
R-squared will be the square of the correlation between the independent variable X and the outcome Y:
R² = Cor(X, Y)²
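This identity is easy to check numerically. The sketch below simulates data (the coefficients, sample size, and seed are arbitrary assumptions), fits a simple linear regression with `np.polyfit`, and compares R-squared computed from the residuals with the squared correlation between X and Y.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.5 + 2.0 * x + rng.normal(size=200)  # simulated outcome with noise

# Least-squares fit of a straight line; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

# R-squared from the residuals, and r from the raw data.
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]

print(r2, r ** 2)  # the two numbers agree up to floating-point error
```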
R-squared vs r in the case of multiple linear regression
In simple linear regression we had 1 independent variable X and 1 dependent variable Y, so calculating the correlation between X and Y was no problem.
In multiple linear regression we have more than 1 independent variable, so there is no single correlation coefficient r between the X's and Y that we could square.
When dealing with multiple linear regression:
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + … + ε
R-squared will be the square of the correlation between the predicted/fitted values of the linear regression (Ŷ) and the outcome (Y):
R² = Cor(Ŷ, Y)²
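The same kind of check works with several predictors. This is a minimal sketch on simulated data, assuming an ordinary least-squares fit with an intercept obtained through `np.linalg.lstsq`; the three predictors and their coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 3))  # three simulated predictors
y = 0.5 + X @ np.array([1.0, -2.0, 0.7]) + rng.normal(size=n)

# Add a column of ones for the intercept, then solve the least-squares problem.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta  # fitted values

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r_fit = np.corrcoef(y_hat, y)[0, 1]  # correlation between fitted and observed

print(r2, r_fit ** 2)  # the two numbers agree up to floating-point error
```

Here the fitted values act as a single summary of all the predictors, which is what makes a correlation with Y possible again.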
Note that in the special case of the simple linear regression, Ŷ is a linear function of X, so:
Cor(X, Ŷ) = ±1 (the sign is that of the fitted slope)
|Cor(X, Y)| = Cor(Ŷ, Y)
Which is why, in that special case:
R² = Cor(Ŷ, Y)² = Cor(X, Y)²
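A quick numerical check of this special case, run on simulated data with a negative slope (an arbitrary choice made for illustration). Because Ŷ is a linear function of X, its correlation with X has magnitude 1 and takes the sign of the slope, so with a negative slope Cor(X, Y) and Cor(Ŷ, Y) differ in sign while their squares still agree.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 3.0 - 2.0 * x + rng.normal(size=200)  # simulated data with a negative slope

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

cor_x_yhat = np.corrcoef(x, y_hat)[0, 1]  # -1 here, because the slope is negative
cor_x_y = np.corrcoef(x, y)[0, 1]         # negative
cor_yhat_y = np.corrcoef(y_hat, y)[0, 1]  # positive, same magnitude as cor_x_y

print(cor_x_yhat, cor_x_y, cor_yhat_y)
print(cor_x_y ** 2, cor_yhat_y ** 2)      # equal squares, so R-squared is unaffected
```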