Understand Regularized Regression

Regularized regression is a regression method with an additional constraint designed to deal with a large number of independent variables (a.k.a. predictors). It does so by penalizing the size of the regression coefficients, which shrinks the coefficients of unimportant variables towards zero.

The objective of regularization is to end up with a model:

  • That is simple and interpretable.
  • That generalizes well beyond the sample of our study.
  • Whose coefficients won’t change much if we replicate the study.

Other methods that also deal with a large number of variables in a regression setting include stepwise selection and best subset selection, both of which come up again below when we discuss the advantages of regularization.

Below, we will first explain how regularization works, then we will discuss its advantages and limitations.

How regularized regression works

Regularized regression works exactly like ordinary (linear or logistic) regression but with an additional constraint whose objective is to shrink unimportant regression coefficients towards zero.

And because these coefficients can be either positive or negative, simply minimizing their sum will not work. Instead, we penalize one of the following quantities:

  1. The sum of the absolute values of the regression coefficients: we call this method L1 regularization (a.k.a. LASSO regression).
  2. The sum of the squares of the coefficients: we call this method L2 regularization (a.k.a. ridge regression).

And because of this tiny difference, these 2 methods will end up behaving very differently.
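
To make the two penalties concrete, here is a sketch of the penalized objectives for the linear-regression case, written in standard textbook notation (RSS is the residual sum of squares of an ordinary linear regression, the β_j are the p regression coefficients, and λ is the penalty parameter discussed further below):

  LASSO (L1):  \min_{\beta} \; \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|
  Ridge (L2):  \min_{\beta} \; \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^{2}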

Difference between L1 and L2 regularization

The biggest difference between L1 and L2 regularization is that L1 will shrink some coefficients to exactly zero (practically excluding them from the model), making it behave as a variable selection method.

In contrast, because L2 minimizes the sum of the squares of the coefficients, it shrinks larger coefficients much more than smaller ones, so coefficients that are already close to zero are barely shrunk further. Therefore, with L2 regularization, we end up with a model that has a lot of coefficients close to, but not exactly, zero.
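
The contrast is easy to see on simulated data. Below is a minimal sketch using Python and scikit-learn (simulated data and an arbitrary penalty value, purely for illustration): we simulate 50 predictors of which only 5 truly matter, fit LASSO and ridge with the same penalty, and count how many coefficients each method sets exactly to zero.

  # Illustration only: simulated data, arbitrary penalty value (alpha = 1.0)
  import numpy as np
  from sklearn.datasets import make_regression
  from sklearn.linear_model import Lasso, Ridge

  # 50 predictors, only 5 of which truly affect the outcome
  X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                         noise=10.0, random_state=0)

  lasso = Lasso(alpha=1.0).fit(X, y)   # alpha plays the role of the penalty lambda
  ridge = Ridge(alpha=1.0).fit(X, y)

  print("LASSO coefficients set exactly to zero:", int(np.sum(lasso.coef_ == 0)), "out of 50")
  print("Ridge coefficients set exactly to zero:", int(np.sum(ridge.coef_ == 0)), "out of 50")

In a run like this, ridge keeps all 50 coefficients non-zero (many of them just very small), while LASSO typically zeroes out most of the uninformative ones.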

So is L1 better than L2 regularization?

Not necessarily.

LASSO (L1 regularization) is better when we want to select a few variables out of a large pool of candidates, for instance for exploratory analysis or when we want a simple, interpretable model. It will also perform better (have a higher prediction accuracy) than ridge regression in situations where a small number of independent variables are good predictors of the outcome and the rest are not that important.

Ridge regression (L2 regularization) performs better than LASSO when we have a large number of variables (or even all of them), each contributing a little to predicting the outcome.

So how would we know in what situation we are?

Well, it certainly depends on the problem at hand. This should be determined case by case, using expert knowledge and an extensive literature review.

How much shrinkage should we apply?

As we discussed above, regularized regression shrinks coefficients by applying a certain penalty. We can control how big this penalty is by using different values of a parameter called lambda (λ).

The larger the value of λ, the bigger the penalty, and the smaller the regression coefficients will be.

λ can range from zero (no penalty) to infinity (where the penalty is large enough that the algorithm is forced to shrink all coefficients to zero).
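
As a rough illustration of this behavior (again a sketch with simulated data, not a recommendation of specific values), here is what happens to LASSO coefficients as we increase the penalty in scikit-learn, where λ is called alpha:

  import numpy as np
  from sklearn.datasets import make_regression
  from sklearn.linear_model import Lasso

  X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                         noise=10.0, random_state=0)

  # Larger lambda -> bigger penalty -> smaller (and eventually all-zero) coefficients
  for lam in [0.1, 1, 10, 100, 1000]:
      model = Lasso(alpha=lam, max_iter=10000).fit(X, y)
      print("lambda =", lam,
            "| sum of |coefficients| =", round(float(np.sum(np.abs(model.coef_))), 1),
            "| non-zero coefficients =", int(np.sum(model.coef_ != 0)))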

Note that we cannot use the same dataset to both select the best λ and test the final model (built using the best λ). This is considered data dredging as we will be using the same data to come up with a hypothesis and to test it.

One way to get around this problem is to use k-fold cross-validation to decide on which λ to use.

How can cross-validation help in selecting the best λ?

A k-fold cross-validation divides the sample into k groups. It runs k times, each time using 1 of the groups as the validation set and the remaining (k − 1) groups as the training set.

The training sets are used to build the models with different lambdas and the validation sets are used to check the accuracy of these models.

Once the best λ is selected, we rerun the regularized model using the best λ on all of the sample data and report its results.
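
In scikit-learn, LassoCV wraps this whole procedure: it cross-validates over a grid of candidate λ values and then refits on the full sample with the best one. The sketch below (simulated data, default grid of 100 candidate values) is only meant to show the workflow described above.

  import numpy as np
  from sklearn.datasets import make_regression
  from sklearn.linear_model import LassoCV

  X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                         noise=10.0, random_state=0)

  # 10-fold cross-validation over 100 candidate lambdas (called alphas here);
  # the final model is then refit on all the data using the best lambda
  cv_model = LassoCV(cv=10, n_alphas=100, random_state=0).fit(X, y)

  print("best lambda:", cv_model.alpha_)
  print("non-zero coefficients in the final model:", int(np.sum(cv_model.coef_ != 0)))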

Note: Don’t forget to standardize your variables:
Because regularization shrinks coefficients, it affects larger coefficients more than smaller ones. So the scale on which each variable is measured plays a very important role in how much its coefficient gets shrunk. Standardizing deals with this problem by putting all variables on the same scale.
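
One convenient way to do this (a sketch, assuming scikit-learn and simulated data) is to standardize inside a pipeline, so the scaling step is always applied before the penalty:

  from sklearn.datasets import make_regression
  from sklearn.linear_model import Lasso
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler

  X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                         noise=10.0, random_state=0)
  X[:, 0] *= 1000  # pretend the first variable is measured on a much larger scale

  # StandardScaler puts every variable on the same scale before LASSO penalizes it
  model = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)
  print(model.named_steps["lasso"].coef_[:5])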

Advantages and limitations of regularized regression

Advantages of regularization

1. L1 regularization produces a simple interpretable model

As discussed above, LASSO regression can be considered a variable selection method. It takes as input a large number of independent variables and outputs a simple, more interpretable model that only contains the most important predictors of the outcome.

Note that L2 regularization (ridge regression) does not share this advantage, as it outputs a model that contains all the independent variables with many of their coefficients close to, but not equal to, zero. Therefore, ridge regression is not very useful for interpreting the relationship between the predictors and the outcome.

2. The regularized model generalizes better

The core idea of regularization is to minimize the effect of unimportant predictors by shrinking their coefficients. This keeps the model from fitting the noise in our sample, which means that it will generalize better than an ordinary linear or logistic regression.
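
As a quick, purely illustrative check of this claim (simulated data, an arbitrary penalty value), the sketch below fits an ordinary linear regression and a ridge regression on the same training data, where most of the 100 predictors are pure noise, and compares their R² on held-out test data:

  from sklearn.datasets import make_regression
  from sklearn.linear_model import LinearRegression, Ridge
  from sklearn.model_selection import train_test_split

  # 100 predictors, only 10 informative: plenty of noise for OLS to overfit
  X, y = make_regression(n_samples=120, n_features=100, n_informative=10,
                         noise=20.0, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                      random_state=0)

  ols = LinearRegression().fit(X_train, y_train)
  ridge = Ridge(alpha=10.0).fit(X_train, y_train)

  print("OLS   test R^2:", round(ols.score(X_test, y_test), 3))
  print("Ridge test R^2:", round(ridge.score(X_test, y_test), 3))

In setups like this one, the regularized model usually scores noticeably better on the test set.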

3. L1 regularization is computationally faster than other variable selection methods

In particular, it is computationally faster than stepwise and best subset selection, as these 2 methods have to fit many regression models while LASSO has to fit only 1. This is certainly an advantage when the number of predictors to choose from, or the sample size, is very large.

4. Regularization still works when the number of predictors exceeds the number of observations

Unlike other variable selection methods, regularized regression still works when the number of independent variables exceeds the number of observations (for regularized linear regression) or the number of events (for regularized logistic regression).

Another example of a method that still works with high dimensional data is forward stepwise selection.
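
A tiny sketch of the point above (simulated data, scikit-learn): LASSO still fits with 500 candidate predictors and only 100 observations, a setting where ordinary least squares has no unique solution.

  from sklearn.datasets import make_regression
  from sklearn.linear_model import Lasso

  # 500 candidate predictors but only 100 observations (p > n)
  X, y = make_regression(n_samples=100, n_features=500, n_informative=10,
                         noise=5.0, random_state=0)

  model = Lasso(alpha=1.0, max_iter=10000).fit(X, y)
  print("non-zero coefficients:", int((model.coef_ != 0).sum()))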

Limitations of regularization

1. A larger dataset with a simple model is better than a small dataset with a complex model

Collecting more data is almost always the answer when we want more accurate and more generalizable models. So regularization does not replace the need for larger sample sizes when we need them.

2. Variable selection using L1 regularization does not replace expert knowledge

Selecting variables according to expert knowledge (based on theory and past studies) is better than using LASSO or other automated methods of selection. Remember that variables judged important based on expert knowledge should still be included in the model even if they are not statistically related to the outcome, an option not available when running regularized regression.

3. No p-values for the regression coefficients

The coefficients of a regularized regression do not come with standard errors and p-values that can be interpreted as easily as those of an ordinary linear or logistic regression.

Here’s a bad idea to work around this problem:

The idea is to use LASSO to select important variables and then use these variables as inputs in another linear/logistic regression model, interpreting the coefficients and p-values of that second model.

This reasoning is flawed for the same reason you should not run a hypothesis test on each candidate variable and then only include those with a p-value < 0.2, for example, in the final model. In both of these examples, the problem is multiple testing (which the p-values of the final model do not account for). So you end up reading inflated results, and variables that are not truly related to the outcome can show up as statistically significant.

Further reading