Relationship Between r and R-squared in Linear Regression

R-squared is a measure of how well a linear regression model fits the data. It can be interpreted as the proportion of variance of the outcome Y explained by the linear regression model.

It is a number between 0 and 1 (0 ≤ R² ≤ 1). The closer its value is to 1, the more variability the model explains; R² = 0 means that the model explains none of the variability in the outcome Y.
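
In formula form, this "proportion of variance explained" is usually written in terms of sums of squares (where Ŷ denotes the model's fitted values and Ȳ the mean of Y):

R² = 1 − SS_res / SS_tot, with SS_res = Σ(Y − Ŷ)² and SS_tot = Σ(Y − Ȳ)²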

On the other hand, the correlation coefficient r is a measure that quantifies the strength and direction of the linear relationship between 2 variables.

r is a number between -1 and 1 (-1 ≤ r ≤ 1):

  • A value of r close to -1 indicates a negative linear relationship between the 2 variables (when one increases, the other decreases, and vice versa)
  • A value of r close to 0 indicates that the 2 variables are not correlated (no linear relationship exists between them)
  • A value of r close to 1 indicates a positive linear relationship between the 2 variables (when one increases, the other increases as well)

Here are 3 plots that show the relationship between 2 variables with different correlation coefficients:

  • The left one was drawn with a coefficient r = 0.80
  • The middle one with r = -0.09
  • And the right one with r = -0.76:
[Figure: three scatter plots with different correlation coefficients]
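
To make such plots concrete, here is a minimal sketch (assuming NumPy; the function name simulate_pair, the seed, and the sample size are ours for illustration) that simulates pairs of variables with a chosen correlation and checks the sample r:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily

def simulate_pair(r, n=1000):
    """Draw n points (x, y) whose population correlation is r."""
    x = rng.standard_normal(n)
    # Mixing x with independent noise yields Cor(x, y) = r in the population
    y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)
    return x, y

for target_r in (0.80, -0.09, -0.76):
    x, y = simulate_pair(target_r)
    sample_r = np.corrcoef(x, y)[0, 1]
    print(f"target r = {target_r:+.2f}, sample r = {sample_r:+.2f}")
```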

Below we will discuss the relationship between r and R² in the context of linear regression, without diving too deep into the mathematical details.

We start with the special case of a simple linear regression and then discuss the more general case of a multiple linear regression.

R-squared vs r in the case of a simple linear regression

We’ve seen that both r and R-squared measure the strength of the linear relationship between 2 variables, so how do they relate in the case of a simple linear regression?

When we’re dealing with a simple linear regression:

Y = β₀ + β₁X + ε

R-squared will be the square of the correlation between the independent variable X and the outcome Y:

R² = Cor(X, Y)²
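
As a quick numerical check, here is a sketch (NumPy only; the data, seed, and coefficient values are made up for illustration) that fits a simple linear regression and compares the model's R² with Cor(X, Y)²:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = 2.0 + 1.5 * x + rng.standard_normal(200)   # Y = β₀ + β₁X + ε

b1, b0 = np.polyfit(x, y, deg=1)               # least-squares slope and intercept
y_hat = b0 + b1 * x                            # fitted values Ŷ

ss_res = np.sum((y - y_hat) ** 2)              # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(r_squared)                               # model R²
print(np.corrcoef(x, y)[0, 1] ** 2)            # Cor(X, Y)² — same value
```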

R-squared vs r in the case of multiple linear regression

In simple linear regression we had 1 independent variable X and 1 dependent variable Y, so calculating the correlation between X and Y was straightforward.

In multiple linear regression we have several independent variables X₁, X₂, …, so a single correlation coefficient r between the predictors and Y is no longer defined.

When dealing with multiple linear regression:

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + … + ε

R-squared will be the square of the correlation between the predicted/fitted values of the linear regression (Ŷ) and the outcome (Y):

R² = Cor(Ŷ, Y)²
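
Here is the analogous check for the multiple-regression case (again NumPy only, with made-up data, seed, and coefficients): fit by least squares, compute R², and compare it with the squared correlation between Ŷ and Y:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = rng.standard_normal((n, 3))                          # 3 predictors
y = 1.0 + X @ np.array([0.5, -1.2, 2.0]) + rng.standard_normal(n)

X_design = np.column_stack([np.ones(n), X])              # prepend intercept column
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)      # least-squares coefficients
y_hat = X_design @ beta                                  # fitted values Ŷ

ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(r_squared)                                         # model R²
print(np.corrcoef(y_hat, y)[0, 1] ** 2)                  # Cor(Ŷ, Y)² — same value
```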

Note that in the special case of the simple linear regression, the fitted values Ŷ = β₀ + β₁X are a linear function of X, so:
Cor(X, Ŷ) = ±1 (with the sign of the slope β₁)
So:
Cor(X, Y) = ±Cor(Ŷ, Y)

Which is why, in that special case, squaring removes the sign and:
R² = Cor(Ŷ, Y)² = Cor(X, Y)²
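
A short sketch illustrating this sign point (synthetic data and seed are ours; the slope is made negative on purpose):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100)
y = 3.0 - 2.0 * x + rng.standard_normal(100)   # negative slope on purpose

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

print(np.corrcoef(x, y_hat)[0, 1])             # -1.0: Ŷ is a linear function of X
print(np.corrcoef(x, y)[0, 1] ** 2)            # these two squared correlations...
print(np.corrcoef(y_hat, y)[0, 1] ** 2)        # ...are equal
```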

Further reading