$\newcommand{\hb}{\hat\beta}$$\newcommand{\hbl}{\hat\beta_\lambda}$$\newcommand{\tb}{\tilde \beta}$$\newcommand{\L}{\mathcal L}$$\newcommand{\l}{\lambda}$$\newcommand{\e}{\varepsilon}$$\newcommand{\0}{\mathbf 0}$$\newcommand{\Lam}{\Lambda}$$\newcommand{\g}{\gamma}$$\newcommand{\D}{\mathcal D}$$\newcommand{\ht}{\hat\theta}$$\newcommand{\a}{\alpha}$In this post I’m going to explore ridge regression as a constrained optimization, and I’ll do so by applying some techniques from convex optimization. The problem I want to solve is finding
$$
\tb = \underset{b \in \mathbb R^p}{\text{argmin }} \|y - Xb\|^2 \\
\text{subject to} \hspace{2mm} \|b\|^2 \leq r^2
$$
for some known $r > 0$ (I’m excluding $r = 0$ since that just means $\tb = \0$ which isn’t very interesting). I will be assuming that $X \in \mathbb R^{n\times p}$ is full column rank so that the unconstrained argmin $\hb$ uniquely exists. I’ll now prove a few things about this optimization problem that will help me generalize it. Before I do, I’ll rewrite it in more standard notation for optimization. Much of this post relies on chapter 5 of Boyd and Vandenberghe’s Convex Optimization.
I’ll let $f_0$ be my objective function and I’ll use $x$ as the variable (so in this case $f_0(x) = \|y - Xx\|^2$ which is a little awkward, but it’s worth it to be consistent with $f_0(x)$ as the objective). For my constraints I’ll suppose I have $f_j(x) \leq 0$, $j=1,\dots,m$. In this case I just have $f_1(x) = \|x\|^2 - r^2$ as my single constraint so $m=1$. I will let $\D = \{x \in \mathbb R^p : f_j(x) \leq 0, j=1,\dots,m\}$ be the set of feasible points and
$$
p^\star := \inf_x \;\{f_0(x) : x\in\D\}
$$
is the optimal value. I’ll always assume that the set of feasible points is non-empty and that $p^\star$ is finite just to avoid special case reasoning.
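Throughout the post it may help to have a concrete instance of this problem to poke at. Here’s a minimal numerical sketch: the synthetic data, the radius $r$, and the use of scipy’s generic SLSQP solver are all my own choices for illustration and are not part of the derivation itself.

```python
# A concrete instance of the primal problem: synthetic data, an arbitrary
# radius r, and scipy's generic SLSQP solver, purely for illustration.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                  # full column rank with probability 1
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)
r = 1.0                                      # chosen small so the constraint binds

f0 = lambda b: np.sum((y - X @ b) ** 2)      # objective ||y - Xb||^2
constraint = {"type": "ineq", "fun": lambda b: r**2 - b @ b}  # r^2 - ||b||^2 >= 0

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # unconstrained argmin
beta_tilde = minimize(f0, np.zeros(p), constraints=[constraint], method="SLSQP").x

print("||beta_hat||   =", np.linalg.norm(beta_hat))
print("||beta_tilde|| =", np.linalg.norm(beta_tilde))  # roughly r when the constraint binds
```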
Result 1: a function of the form $g(x) = x^T A x + b^T x + c$ with $A$ symmetric is convex if and only if $A$ is PSD, and is strictly convex iff $A$ is PD.
Pf: Let $\g \in (0,1)$ and set $\delta = 1 - \g$ for convenience. Consider
$$
\begin{aligned}&\g g(x) + \delta g(y) - g(\g x + \delta y) \\&
= \g x^T A x + \g b^T x + \delta y^T A y + \delta b^T y - (\g x + \delta y)^T A (\g x + \delta y) - b^T(\g x + \delta y) \\&
= \g(1-\g)x^T A x + \delta(1-\delta)y^T A y - 2\g\delta x^T A y.\end{aligned}
$$
Here the linear and constant terms cancel because $\g + \delta = 1$, and I used the symmetry of $A$ to combine the cross terms. Now $\g(1-\g) = \delta(1-\delta) = \g\delta$ so I can factor out the scalar and I then have
$$
= \g\delta\left[x^T A x - 2 x^T A y + y^T A y\right] \\
= \g\delta (x - y)^T A (x - y).
$$
$\g\delta > 0$ so this is guaranteed to be non-negative for all $x, y$ if and only if $A$ is PSD, and it is strictly positive whenever $x \neq y$ if and only if $A$ is PD.
$\square$
By my assumption that $X$ has full column rank I know that $X^TX$ is PD, so $f_0$ is strictly convex; applying the same result with $A = I$ shows $f_1$ is strictly convex too.
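Here’s a quick numerical sanity check of the identity from Result 1 on randomly generated inputs; the particular $A$, $b$, $c$, and points are arbitrary choices of mine.

```python
# Random sanity check of the identity in Result 1:
# gamma*g(x) + delta*g(y) - g(gamma*x + delta*y) = gamma*delta*(x-y)' A (x-y).
import numpy as np

rng = np.random.default_rng(1)
p = 4
M = rng.normal(size=(p, p))
A = M.T @ M                                   # symmetric PSD (PD with probability 1)
b, c = rng.normal(size=p), rng.normal()
g = lambda x: x @ A @ x + b @ x + c

x, y = rng.normal(size=p), rng.normal(size=p)
gam = 0.3
delta = 1 - gam
lhs = gam * g(x) + delta * g(y) - g(gam * x + delta * y)
rhs = gam * delta * (x - y) @ A @ (x - y)
print(np.isclose(lhs, rhs))                   # True
```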
I’ll now prove two helper lemmas for my constraints.
Lemma 1: If $f : \mathbb R^p \to\mathbb R$ is convex, a set of the form $S_c := \{x : f(x) \leq c\}$ for $c \in \mathbb R$ is convex if it is nonempty.
Pf: let $x,y\in S_c$ and take $\g \in (0,1)$. Then by $f$ being convex I have
$$
f(\g x + (1-\g)y) \leq \g f(x) + (1-\g) f(y) \\
\leq \g c + (1-\g) c = c
$$
which means $f(\g x + (1-\g) y) \leq c$, so $\g x + (1-\g)y \in S_c$.
$\square$
Lemma 2: let $\{S_\a : \a \in A\}$ be an indexed collection of convex sets with $S := \bigcap_{\alpha\in A} S_\alpha \neq \emptyset$. Then $S$ is convex.
Pf: let $x, y \in S$ and for $\g \in (0,1)$ consider $\g x + (1-\g)y$. Each $S_\a$ is convex, so $\g x + (1-\g)y \in S_\a$ for every $\a \in A$, and therefore $\g x + (1-\g)y \in S$.
$\square$
The point of Lemmas 1 and 2 is that the set of feasible points $\{x : f_j(x) \leq 0, j=1,\dots,m\}$ is convex. The next two results use this to show that local minima are not a problem for me and that I’ll have a single global optimum.
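As a tiny numerical illustration of Lemmas 1 and 2 for my single ball constraint, convex combinations of feasible points stay feasible; the particular points and radius below are arbitrary.

```python
# Convex combinations of two points in the ball ||x|| <= r remain in the ball,
# as Lemmas 1 and 2 guarantee; the points and r are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
p, r = 3, 1.0
f1 = lambda x: x @ x - r**2                   # constraint function

x = rng.normal(size=p); x *= 0.8 * r / np.linalg.norm(x)   # ||x|| = 0.8 r
y = rng.normal(size=p); y *= 0.5 * r / np.linalg.norm(y)   # ||y|| = 0.5 r

print(all(f1(gam * x + (1 - gam) * y) <= 0 for gam in np.linspace(0, 1, 11)))  # True
```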
Result 2: For a problem of the form
$$
\underset{x \in \mathbb R^p}{\text{minimize}} \;\;f_0(x) \\
\text{subject to} \;\; f_j(x) \leq 0,\;\; j=1,\dots,m
$$
with $f_j$ convex for $j=0,\dots,m$, any local minimum is a global minimum.
Pf: I’ll be following the proof given in Boyd and Vandenberghe’s Convex Optimization. Suppose $x$ is locally optimal, meaning it is in $\D$ and there is some $R > 0$ such that for all $z \in \D$, $\|x - z\| \leq R \implies f_0(z) \geq f_0(x)$.
Suppose $x$ is not globally optimal, so there is some $y \in \D$ with $f_0(y) < f_0(x)$. If $\|y - x\| \leq R$ then the local optimality of $x$ would give $f_0(y) \geq f_0(x)$, a contradiction, so it must be that $\|y - x\| > R$. Define a new point $z$ by
$$
z := \theta y + (1 - \theta)x, \hspace{5mm} \theta := \frac{R}{2\|y - x\|}.
$$
Thus $z$ is on the line segment connecting $y$ and $x$, and since by Lemma 1 each set $\{w : f_j(w) \leq 0\}$ is convex and $\D$ is the intersection of these sets, by Lemma 2 $\D$ is convex, which means $z \in \D$. Furthermore,
$$
\|z - x\| = \theta\|y - x\| = \frac R2
$$
so $z$ is a feasible point within $R$ of $x$, which means $f_0(z) \geq f_0(x)$.
Finally, since $f_0$ is convex I have
$$
f_0(z) \leq \theta f_0(y) + (1 - \theta)f_0(x) < f_0(x)
$$
where the last inequality holds because $\theta \in (0, 1/2)$ (since $\|y - x\| > R$), so the middle term is a weighted average of $f_0(x)$ and something strictly smaller than it, and that average must be less than $f_0(x)$. But this contradicts $f_0(z) \geq f_0(x)$, so it must be that $x$ is globally optimal.
$\square$
Result 3: Suppose $f_0 : \mathbb R^p \to \mathbb R$ is strictly convex and I am minimizing $f_0$ subject to $f_j(x) \leq 0$ for convex $f_j$, $j = 1,\dots,m$. Then if $f_0$ has a feasible minimizer, that point is unique.
Pf: Let $x$ be a feasible minimizer of $f_0$ and suppose there exists another minimizer $y$ with $y\neq x$. By Lemmas 1 and 2 the set of feasible points is convex so for any $\g \in (0,1)$ $\g x + (1-\g)y$ is feasible, and
$$
f_0(\g x + (1-\g)y) < \g f_0(x) + (1-\g)f_0(y) = f_0(x)
$$
by the assumption that $f_0(x) = f_0(y)$. This means that $\g x + (1-\g)y$ is a feasible point with a value less than that of $x$ which contradicts the minimality of $x$. Therefore such a $y$ can’t exist.
$\square$
This is good news for me as it means that any minimum that I find will do the trick. Now I’ll show that in my case $p^\star$ is attained so by Result 3 I have a unique argmin.
Lemma 3: If $S\subset\mathbb R$ is nonempty, closed, and bounded (i.e. compact by Heine-Borel) then $\inf S \in S$.
Pf: $S$ being bounded means that $\inf S \in \mathbb R$ so it’s at least possible that $\inf S\in S$. Next, suppose $\inf S\notin S$ so $\inf S \in S^c$, the complement of $S$. $S$ being closed means $S^c$ is open, so there is some neighborhood around $\inf S$ entirely contained in $S^c$. Every point in this neighborhood greater than $\inf S$ is a lower bound to $S$ but this contradicts $\inf S$ being the greatest lower bound, therefore it must be that $\inf S \in S$. $\square$
Now let $S_r = \{x : \|x\| \leq r\}$, the closed ball of radius $r$ in $\mathbb R^p$; in my problem $\D = S_r$. $S_r$ is compact and $f_0$ is continuous so the image $f_0(S_r)$ is also compact. By Lemma 3 I know $p^\star = \inf f_0(S_r) \in f_0(S_r)$ so there is some element of $S_r$ that maps to $p^\star$, and by Result 3 it is unique. This element is $\tb$.
Introducing the dual problem
I’m going to find $\tb$ by solving the dual problem. Let
$$
\L(x, \l) = f_0(x) + \sum_{j=1}^m \lambda_j f_j(x)
$$
be the unconstrained Lagrangian. The dual problem is to find
$$
d^\star := \sup_{\l \succeq 0}\,\inf_x \,\L(x,\l)
$$
where $g(\l) := \inf_x \,\L(x,\l)$ is called the Lagrangian dual function. One intuition for the dual problem that I like is as follows (this is in chapter 5 of B&V). Let $I_- : \mathbb R\to\{0,\infty\}$ be defined by
$$
I_-(x) = \begin{cases} \infty & x > 0 \\ 0 & x \leq 0\end{cases}
$$
so $I_-$ assigns no penalty to non-positive values and an infinite penalty to positive ones; in other words it is the indicator function of the non-positive reals. I could try to solve
$$
\min f_0(x) + \sum_{j=1}^m I_-(f_j(x))
$$
which would correctly give me $p^\star$ but this is not really any better than just solving the primal problem directly. I can make my life easier if instead I relax these step functions to linear penalties for constraint violation, which gives me $\L(x, \l)$. One consequence of this relaxation is that
$$
\lambda_j f_j(x) \leq I_-(f_j(x))
$$
for all $x \in \text{dom}(f_j)$ whenever $\l_j \geq 0$, so $\L$ is a lower bound on the hard-penalized objective above. Thus for a fixed $\l$ I have a certain penalty on being outside the constraints, and by maximizing $g$ I consider the worst case in terms of my penalty. The next result gives a key relationship between the primal and dual problems (and note how convexity doesn’t matter for it).
Result 4: weak duality: $g(\l) \leq p^\star$ for all $\l \succeq 0$ and $d^\star \leq p^\star$.
Pf: Let $\tilde x$ be a feasible point and let $\l \succeq 0$ be otherwise arbitrary. This means
$$
\sum_{j=1}^m \l_jf_j(\tilde x) \leq 0
$$
so
$$
\L(\tilde x, \l) = f_0(\tilde x) + \sum_{j=1}^m \l_jf_j(\tilde x) \leq f_0(\tilde x).
$$
This means
$$
g(\l) = \inf_x \L(x, \l) \leq \L(\tilde x, \l) \leq f_0(\tilde x)
$$
so for any $\l \succeq 0$, $g(\l)$ is a lower bound to the set $\{f_0(x) : x\in \D\}$. By the definition of the infimum this means $g(\l) \leq p^\star$, and then by the definition of the supremum this means $d^\star = \sup_{\l\succeq 0} g(\l) \leq p^\star$.
$\square$
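To see weak duality numerically on my ridge problem, here’s a sketch that compares $g(\l)$ (computed by generic unconstrained minimization of the Lagrangian) with $p^\star$ (computed by a generic constrained solver). The data, radius, and solver choices are mine and just for illustration.

```python
# Weak duality on the ridge problem: g(lambda) <= p_star for several lambda >= 0.
# p_star and g are computed with generic scipy optimizers on synthetic data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, p, r = 50, 3, 1.0
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

f0 = lambda b: np.sum((y - X @ b) ** 2)
f1 = lambda b: b @ b - r**2

# the primal optimal value
p_star = minimize(f0, np.zeros(p), method="SLSQP",
                  constraints=[{"type": "ineq", "fun": lambda b: -f1(b)}]).fun

# the dual function: unconstrained minimum of the Lagrangian over b
g = lambda lam: minimize(lambda b: f0(b) + lam * f1(b), np.zeros(p), method="BFGS").fun

for lam in [0.0, 0.5, 2.0, 10.0]:
    print(f"lambda = {lam:5.1f}   g(lambda) = {g(lam):9.3f}   p_star = {p_star:9.3f}")
```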
Another perspective on this is as follows. Instead of minimizing over $x$ first, I can begin by considering the worst case in terms of the penalty:
$$
\sup_{\l \succeq 0} \L(x, \l) = \sup_{\l\succeq 0} \left[f_0(x) + \sum_{j=1}^m \l_jf_j(x)\right].
$$
If all of the constraints are satisfied then each $\l_j f_j(x) \leq 0$, so the supremum is attained at $\l = \0$ and equals $f_0(x)$. But if even one constraint is not satisfied, so $f_k(x) > 0$ for some $k \geq 1$, then taking $\l_k\to\infty$ leads to an infinite value. Thus
$$
\sup_{\l \succeq 0} \L(x, \l) = \begin{cases}f_0(x) & x \in \D \\ \infty & x \notin \D\end{cases}
$$
so then
$$
p^\star = \inf_x\,\sup_{\l\succeq 0}\L(x, \l) .
$$
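A quick numerical look at this: at a feasible point the supremum over $\l$ is just $f_0(x)$ (attained at $\l = \0$), while at an infeasible point $\L(x,\l)$ grows without bound. The synthetic data and the two test points below are arbitrary choices.

```python
# sup over lambda of L(x, lambda): bounded (and equal to f_0(x)) at a feasible x,
# unbounded at an infeasible x. The two test points are arbitrary.
import numpy as np

rng = np.random.default_rng(4)
n, p, r = 50, 3, 1.0
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

L = lambda b, lam: np.sum((y - X @ b) ** 2) + lam * (b @ b - r**2)
b_feas = np.full(p, 0.5 * r / np.sqrt(p))     # ||b|| = r/2, feasible
b_infeas = np.full(p, 2.0 * r / np.sqrt(p))   # ||b|| = 2r, infeasible

for lam in [0.0, 10.0, 1e3, 1e6]:
    print(f"lambda = {lam:9.0f}   L(feasible) = {L(b_feas, lam):10.2f}   "
          f"L(infeasible) = {L(b_infeas, lam):14.2f}")
# the feasible column is largest at lambda = 0; the infeasible column diverges
```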
I showed already that weak duality always holds, i.e. $d^\star \leq p^\star$, but if I am to use this I would really like strong duality to hold, which is that $d^\star = p^\star$. This last computation showed that strong duality is all about whether or not I can exchange a sup and an inf. The dual problem begins with minimizing $\L$ over $x$ so it is looking at the worst best-case scenario, while in terms of the penalty the primal problem is looking at the best worst-case penalty. Weak duality shows that when I’m always considering best case scenarios with varying penalties like I do for $g$, I’ll get a lower bound to the values I see when I start with worst-case penalties, and strong duality is about when they coincide.
The next result is one reason why the dual problem is attractive.
Result 5: $g$ is concave.
In particular, even if any of the $f_j$ are nonconvex (including $j=0$), $g$ is always concave, and the only constraint in the dual problem is $\l\succeq 0$, which is convex.
Pf: Let $\l,\nu\succeq0$ and $\g\in(0,1)$ and consider
$$
g(\g\l + (1-\g)\nu) = \inf_x \L(x, \g\l + (1-\g)\nu).
$$
$$
\begin{aligned}\L(x, \g\l + (1-\g)\nu) &= f_0(x) + \sum_{j=1}^m (\g\l_j + (1-\g)\nu_j) f_j(x) \\&
= (\gamma + 1 - \gamma)f_0(x) + \sum_{j=1}^m (\g\l_j + (1-\g)\nu_j) f_j(x) \\&
= \gamma \L(x, \l) + (1-\g)\L(x, \nu).\end{aligned}
$$
Note that for any fixed $y$,
$$
\begin{aligned}\g \inf_x \L(x, \l) + (1-\g)\inf_x \L(x, \nu) &= \g g(\l) + (1-\g)g(\nu) \\&
\leq \g \L(y, \l) + (1-\g)\L(y,\nu)\end{aligned}
$$
so by the definition of the infimum
$$
\g g(\l) + (1-\g)g(\nu) \leq \inf_x \left[\g \L(x, \l) + (1-\g)\L(x,\nu)\right] = g(\g \l + (1-\g)\nu)
$$
which proves that $g$ is concave.
$\square$
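Here’s a quick midpoint check of this concavity on the ridge dual, again with arbitrary synthetic data and a generic optimizer for the inner minimization.

```python
# Midpoint check of concavity of g on the ridge dual:
# g((lambda + nu)/2) >= (g(lambda) + g(nu))/2 for a couple of arbitrary lambda, nu.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, p, r = 50, 3, 1.0
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

L = lambda b, lam: np.sum((y - X @ b) ** 2) + lam * (b @ b - r**2)
g = lambda lam: minimize(lambda b: L(b, lam), np.zeros(p), method="BFGS").fun

for lam, nu in [(0.3, 7.0), (1.0, 20.0)]:
    print(g(0.5 * (lam + nu)) >= 0.5 * (g(lam) + g(nu)))   # True, True
```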
In my case $f_0$ and $f_1$ are convex and Slater’s condition holds: there is a strictly feasible point, i.e. one with $f_1(x) < 0$ (I can just take $x = \0$ since $r > 0$). This means I have strong duality, so I can work with the dual problem to actually get a solution to the primal problem rather than just a bound on it (though bounds can still be useful).
Using this for ridge regression
I’ll now make this specific to my problem and return to my linear regression notation. My Lagrangian is
$$
\L(b, \l) = \|y - Xb\|^2 + \l(\|b\|^2 - r^2)
$$
and
$$
g(\l) = \inf_b \left[\|y - Xb\|^2 + \l(\|b\|^2 - r^2)\right].
$$
I can complete the square in $b$ to get a nice form for $\L$: expanding and rearranging,
$$
\L(b, \l) = y^Ty - 2b^TX^Ty + b^T(X^TX + \l I)b - \l r^2.
$$
Letting $A_\l = X^TX + \l I$ (and noting it is PD for any $\l \geq 0$) I can now complete the square to get
$$
\L(b, \l) = (b - \hbl)^TA_\l(b - \hbl) + y^T\left(I - XA_\l^{-1} X^T\right)y - \l r^2
$$
for $\hbl = (X^TX + \l I)^{-1}X^T y$. This makes finding $g$ easy, as this is just a convex paraboloid in $b$ so the infimum over $b$ is uniquely attained for $b = \hbl$ and then
$$
g(\l) = y^T\left(I - XA_\l^{-1} X^T\right)y - \l r^2.
$$
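Here’s a small check that this closed form for $g$ agrees with directly minimizing the Lagrangian over $b$ numerically; the synthetic data and the $\l$ values are arbitrary.

```python
# The closed form for g(lambda) versus a direct numerical minimization of the
# Lagrangian over b; they should agree up to optimizer tolerance.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, p, r = 50, 3, 1.0
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

def g_closed(lam):
    A = X.T @ X + lam * np.eye(p)
    return y @ (y - X @ np.linalg.solve(A, X.T @ y)) - lam * r**2

def g_numeric(lam):
    L = lambda b: np.sum((y - X @ b) ** 2) + lam * (b @ b - r**2)
    return minimize(L, np.zeros(p), method="BFGS").fun

for lam in [0.0, 1.0, 5.0]:
    print(np.isclose(g_closed(lam), g_numeric(lam), rtol=1e-4))   # True each time
```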
I can confirm the concavity of $g$ by letting $X = UDV^T$ be the SVD of $X$ so
$$
XA^{-1}_\l X^T = UD(D^2 + \l I)^{-1}DU^T
$$
and then
$$
g(\l) = y^Ty - y^TUD(D^2 + \l I)^{-1}DU^Ty - \l r^2.
$$
I could use matrix calculus but I can also take advantage of the fact that $D^2 + \l I$ is diagonal and therefore has an easy inverse. Letting $z = U^Ty$ I have
$$
\begin{aligned}g(\l) &= y^Ty - \l r^2 - \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \l} z_j^2 \\&
\implies g'(\l) = -r^2 + \sum_j \frac{d_j^2z_j^2}{(d_j^2 + \l)^2} \\&
\implies g^{\prime\prime}(\l) = -2\sum_j \frac{d_j^2z_j^2}{(d_j^2 + \l)^3} \leq 0.\end{aligned}
$$
Furthermore $g$ is strictly concave if $z = U^Ty \neq \0$. If $U^Ty = \0$ then this means $y$ is completely orthogonal to the column space of $X$ and $\hb = \tb = \0$ which isn’t very interesting. Additionally if $y$ has a continuous distribution then this happens with probability 0, so in any practical situation $U^Ty\neq \0$ and $g$ will be strictly concave.
That said, I’m now looking to maximize $g$ over $\l \in [0,\infty)$ so I’ll want to consider roots of $g’$ and $g$’s behavior at the boundaries of that interval. Note that if $g’$ has a root then that root is necessarily the maximum: $g$ is concave so any critical point is a local max, and applying Results 2 and 3 to $-g$ (which is strictly convex when $U^Ty \neq \0$) means this root is in fact the unique global max.
$g’$ can be put back into matrix notation as
$$
g^\prime(\l) = y^TUD(D^2 + \l I)^{-2}DU^Ty - r^2 \\
= \|V(D^2 + \l I)^{-1}DU^Ty\|^2 - r^2 \\
= \|\hbl\|^2 - r^2.
$$
$g^{\prime\prime} < 0$ means that $g’$ is monotonically decreasing (and that $\|\hbl\|^2$ is monotonically decreasing as a function of $\l$).
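As a sanity check, here’s a comparison of $\|\hbl\|^2 - r^2$ against a central finite difference of the closed-form $g$; the synthetic data, $\l$, and step size are arbitrary choices.

```python
# g'(lambda) = ||beta_hat_lambda||^2 - r^2, checked against a central finite
# difference of the closed-form g; lambda and the step size are arbitrary.
import numpy as np

rng = np.random.default_rng(7)
n, p, r = 50, 3, 1.0
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

def g(lam):
    A = X.T @ X + lam * np.eye(p)
    return y @ (y - X @ np.linalg.solve(A, X.T @ y)) - lam * r**2

def beta(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lam, h = 2.0, 1e-5
finite_diff = (g(lam + h) - g(lam - h)) / (2 * h)
analytic = beta(lam) @ beta(lam) - r**2
print(np.isclose(finite_diff, analytic, rtol=1e-4))   # True
```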
$g’$ having a root entirely depends on whether the constraint is active or not. If $\|\hb\| < r$ (i.e. the constraint is inactive) then $g'(0) < 0$, and since $g’$ is monotonically decreasing it stays negative and never has a root, so the maximum has to come from the boundary behavior of $g$ on $[0,\infty)$. $g(\l) \to -\infty$ as $\l\to\infty$ since $y^T(I - XA_\l^{-1}X^T)y \to y^Ty$ while the $-\l r^2$ term dominates, so the maximum will necessarily be at $\l = 0$ and $\tb = \hb$. In other words, if the unconstrained minimizer $\hb$ just happens to be within the ball of radius $r$, then I just get $\tb = \hb$ as I would hope and my constraint doesn’t actually matter. Similarly, if $\|\hb\| = r$ then technically the constraint is active and I’ll have $g'(0) = 0$ so there is one root, but nothing changes and I still have $\tb = \hb$. If $y$ has a continuous distribution then $\hb$ does too and this boundary case occurs with probability $0$.

But if the constraint is active with $\|\hb\| > r$ then I’ll see actual shrinkage. $g’$ will have exactly one root, say $\hat\l$ (one exists since $\|\hbl\| \to 0$ as $\l\to\infty$, and it is unique by monotonicity), and I know $\hat\l > 0$ since $\|\hb\| > r$ means $g'(0) > 0$. $\hat \l$ can be found as the unique solution to $\|\hbl\| = r$. I already know that the optimal value of $b$ given $\l$ is $\hbl$, so in this case I’ll have $\tb = \hb_{\hat \l} = (X^TX + \hat \l I)^{-1}X^Ty$. This also shows how knowing $r$ is equivalent to knowing $\hat\l$; in practice neither is known, which is why ridge regression is usually paired with something like cross validation to pick $\hat\l$, but here $r$ is given so I don’t need anything extra.
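Putting this together, here’s a sketch of how I’d actually compute $\tb$: if the constraint is active, find $\hat\l$ as the root of $\|\hbl\| - r$ with a bracketing root finder. The synthetic data, the radius, and the upper bracket of $10^6$ are assumptions made purely for illustration.

```python
# Computing beta_tilde: if ||beta_hat|| <= r keep it, otherwise solve
# ||beta_hat_lambda|| = r for lambda_hat with a bracketing root finder.
# The data, r, and the upper bracket of 1e6 are assumptions for illustration.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(8)
n, p, r = 50, 3, 1.0
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

beta = lambda lam: np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_hat = beta(0.0)

if np.linalg.norm(beta_hat) <= r:
    lam_hat, beta_tilde = 0.0, beta_hat       # constraint inactive
else:
    # ||beta_hat_lambda|| is decreasing in lambda, so a sign change is bracketed
    lam_hat = brentq(lambda lam: np.linalg.norm(beta(lam)) - r, 0.0, 1e6)
    beta_tilde = beta(lam_hat)

print("lambda_hat     =", lam_hat)
print("||beta_hat||   =", np.linalg.norm(beta_hat))
print("||beta_tilde|| =", np.linalg.norm(beta_tilde))   # equals r when active
```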
This is an example of complementary slackness as in the KKT conditions. I have strong duality so
$$
f_0(\tb) = g(\hat\l) \leq f_0(\tb) + \hat\l f_1(\tb) \leq f_0(\tb)
$$
where the first inequality holds because $g(\hat\l)$ is an infimum over $b$ (evaluated here at $b = \tb$) and the last holds because $\hat\l \geq 0$ and $f_1(\tb) \leq 0$. This means that $\hat\l f_1(\tb) = 0$, so at my optimal point $(\tb,\hat\l)$ I’ll have either $\hat\l = 0$ or $f_1(\tb) = 0$ (or both). This reflects how the Lagrange multiplier is zero when the constraint is inactive but can be nonzero when the constraint is active, so the behavior I saw above, with $\hat\l = 0$ when the constraint is inactive, is exactly what complementary slackness predicts.
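And a quick numerical check of complementary slackness in both regimes, using a large and then a small radius so the constraint is first inactive and then active (same caveats as above about the synthetic setup).

```python
# Complementary slackness: lambda_hat * f_1(beta_tilde) is zero whether the
# constraint is inactive (large r) or active (small r).
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(9)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)
beta = lambda lam: np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_hat = beta(0.0)

for r in [10.0, 1.0]:                          # inactive, then active
    if np.linalg.norm(beta_hat) <= r:
        lam_hat, beta_tilde = 0.0, beta_hat
    else:
        lam_hat = brentq(lambda lam: np.linalg.norm(beta(lam)) - r, 0.0, 1e6)
        beta_tilde = beta(lam_hat)
    slack = lam_hat * (beta_tilde @ beta_tilde - r**2)
    print(f"r = {r:4.1f}   lambda_hat = {lam_hat:8.4f}   "
          f"lambda_hat * f1(beta_tilde) = {slack:.2e}")
```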
Direction of shrinkage
Now I’m interested in understanding the way that $\hbl$ moves from $\hb$. Here’s a result that will help. Let $C = X^TX$ and by assumption $C$ is invertible. Then
$$
\begin{aligned}\hb_\l &= (C + \l I)^{-1}X^Ty \\&
= (C + \l I)^{-1} C \cdot C^{-1}X^Ty \\&
= \left[C(I + \l C^{-1})\right]^{-1}C \hb \\&
= (I + \l C^{-1})^{-1}\hb\end{aligned}
$$
so if I let $S_\l = (I + \l C^{-1})^{-1}$ I’ll have
$$
\hb_\l = S_\l \hb.
$$
I have $X = UDV^T$ as the SVD of $X$ and I’ll take $D^2 = \Lam$ so $C = V\Lam V^T$. I have $d_1^2 \geq \dots \geq d_p^2 > 0$ as the eigenvalues of $C$. This means
$$
S_\l = (I + \l V\Lam^{-1} V^T)^{-1} = V(I + \l \Lam^{-1})^{-1}V^T
$$
i.e. $S_\l$ has the same eigenvectors as $C$ but the eigenvalues of $S_\l$ are
$$
\zeta_j := \frac{1}{1 + \frac{\l}{d_j^2}} = \frac{d_j^2}{d_j^2 + \l}.
$$
All $d_j^2 > 0$ and $\l \geq 0$ so $0 < \zeta_j \leq 1$; if $\l > 0$ then all $\zeta_j \in (0,1)$.
So if I apply this matrix to some vector $z \in \mathbb R^p$, I get
$$
V\,\text{diag}(\zeta_1,\dots,\zeta_p)V^Tz.
$$
$V^Tz$ is a change of basis to get the coordinates of $z$ w.r.t. $V$. Then once in this basis I shrink the coordinates by the $\zeta_j$, so if $V = (v_1 \mid \dots \mid v_p)$ then the coordinate corresponding to $v_1$ is shrunk the least and $v_p$ is shrunk the most. I then undo this change of basis via the final multiplication by $V$.
For $\hb$, I know $\hb = VD^{-1}U^Ty$ so w.r.t. $V$ I have $V^T\hb = D^{-1}U^Ty$, so the $j$th coordinate is $d_j^{-1}U_j^Ty$ and this is shrunk by $\zeta_j = \frac{d_j^2}{d_j^2 + \l}$. This means that as $d_j$ gets smaller, the $j$th coordinate of $\hb$ w.r.t. $V$ gets bigger (assuming the column space itself is fixed so $U^Ty$ isn’t changing) while $\zeta_j$ gets smaller, so that coordinate is shrunk more. In particular, $\hb$ is shrunk more along $v_j$ for larger $j$. Recalling that the loss can be written as a quadratic form in $X^TX = C$, this makes sense: the most important directions for the loss correspond to the leading eigenvectors of $C$, namely the $v_j$ with small $j$, so the least important directions, and therefore the best directions to shrink in, are the $v_j$ with large $j$.
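Here’s a numerical check of this picture: $\hbl = S_\l \hb$, and in the $V$ basis each coordinate of $\hb$ gets multiplied by $\zeta_j$. The correlated two-predictor design and the value of $\l$ are arbitrary choices echoing the plots below.

```python
# beta_hat_lambda = S_lambda beta_hat, and in the V basis each coordinate is
# multiplied by zeta_j = d_j^2 / (d_j^2 + lambda). Correlated design so d_1 >> d_2.
import numpy as np

rng = np.random.default_rng(10)
n, p, lam = 100, 2, 5.0
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n),
                     z + 0.1 * rng.normal(size=n)])   # two highly correlated predictors
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

C = X.T @ X
beta_hat = np.linalg.solve(C, X.T @ y)
beta_lam = np.linalg.solve(C + lam * np.eye(p), X.T @ y)

S = np.linalg.inv(np.eye(p) + lam * np.linalg.inv(C))
print(np.allclose(beta_lam, S @ beta_hat))             # True: beta_hat_lambda = S beta_hat

U, d, Vt = np.linalg.svd(X, full_matrices=False)
zeta = d**2 / (d**2 + lam)
print(zeta)                                            # shrinkage factors, zeta_1 > zeta_2
print((Vt @ beta_lam) / (Vt @ beta_hat))               # matches zeta coordinate-wise
```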
Below are two plots for $p=2$, one showing when $\hb$ is feasible and $\tb = \hb$, and the other showing what happens when the constraint is active and how the shrinkage happens mostly along $v_2$ (plotted in green) rather than $v_1$ (plotted in purple). For this example there is some correlation between the two predictors so $d_1$ is a decent bit larger than $d_2$. This leads to the shrinkage being almost entirely along $v_2$.