A column space view of linear regression via the SVD

$\newcommand{\e}{\varepsilon}$$\newcommand{\E}{\text{E}}$$\newcommand{\C}{\mathcal C}$$\newcommand{\0}{\mathbf 0}$Consider the linear model $y = X\beta+\e$ with $\E\e = \0$, $\E \e\e^T = \sigma^2 I$, and $X\in\mathbb R^{n\times p}$ full column rank. In this post I'm going to explore a purely linear-algebraic interpretation of finding $\hat y$ and $\hat\beta$.

$\C(X)$ is a $p$-dimensional subspace of $\mathbb R^n$ given by $X(\mathbb R^p)$, i.e. it's the image of $\mathbb R^p$ under the linear transformation given by $X$. Since $X$ is full column rank the null space $N(X)$ is just $\{\mathbf 0\}$, so $Xv = Xw \implies v = w$, which makes $X$ an injection. This means $X : \mathbb R^p \to \C(X)$ is a bijection, or in other words every element of $\C(X)$ is the image of exactly one element of $\mathbb R^p$. This lets me think of elements of $\mathbb R^p$ as coordinates indexing $\C(X)$, so I can think of approximating $y$ as either choosing an element of $\C(X)$ (namely $\hat y$) or equivalently choosing a coordinate in $\mathbb R^p$ (which turns out to be $\hat\beta$).

Since $X$ is full column rank its columns are a basis for $\C(X)$ but they’re not the nicest one since they’re probably not orthogonal, so I’ll get a better one. To that end, let $X = UDV^T$ be the thin SVD of $X$ so $U \in \mathbb R^{n\times p}$ has orthonormal columns.

I claim that $U$ gives an orthonormal basis for $\C(X)$.

Pf: the columns of $U$ are orthonormal from the SVD, so I just need to show that they form a basis, i.e. that they are linearly independent and span $\C(X)$. Orthonormality implies linear independence, so the span is the only possible point of concern.

$Uw = UDV^TVD^{-1}w = X(VD^{-1}w)$ so any linear combination of the columns of $U$ is in $\C(X)$, therefore $\C(U)\subseteq \C(X)$. Similarly $Xv = U(DV^Tv)$ so $\C(X)\subseteq \C(U)$ which together with the previous line means $\C(U) = \C(X)$. This proves $U$ is an orthonormal basis for $\C(X)$.

$\square$
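
Here's a quick numerical sanity check of this claim in R; the small random matrix below is just for illustration and nothing depends on the particular numbers.

# small random full column rank matrix and the U from its thin SVD
x0 <- matrix(rnorm(10 * 4), 10, 4)
u0 <- svd(x0)$u

# projecting the columns of X onto C(U) changes nothing, so C(X) is inside C(U)
sum((u0 %*% (t(u0) %*% x0) - x0)^2)

# projecting the columns of U onto C(X) changes nothing, so C(U) is inside C(X)
sum((x0 %*% solve(t(x0) %*% x0, t(x0) %*% u0) - u0)^2)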

I can now extend $U$ to an orthonormal basis $\tilde U = (U\mid U_\perp)$ for all of $\mathbb R^n$ (and by the orthonormality of $\tilde U$ I'll have $U^TU_\perp = \mathbf 0$). One possible construction is to generate $n-p$ iid draws $Z_1,\dots,Z_{n-p}$ from $\mathcal N(\mathbf 0, I_n)$ and then apply Gram-Schmidt to $(U\mid Z_1 \mid Z_2 \mid \dots \mid Z_{n-p})$; since the $Z_i$ are almost surely linearly independent of each other and of the columns of $U$, this yields such a $\tilde U$ almost surely.

Here’s an example of this in R using the pracma library for Gram-Schmidt.

set.seed(132)
n <- 10
p <- 4
x <- matrix(rnorm(n*p), n, p)
u <- svd(x)$u

# extend to basis for all of R^{10}
u_z <- cbind(u, matrix(rnorm(n*(n-p)), n, n-p))
u_tilde <- pracma::gramSchmidt(u_z)$Q

# is u still the first p columns?
sum((u - u_tilde[,1:p])^2)

# orthonormal?
sum((t(u_tilde) %*% u_tilde - diag(n))^2)
sum((u_tilde %*% t(u_tilde) - diag(n))^2)

Now that I have a basis for $\mathbb R^n$ I can write $y = \tilde U{\alpha \choose \gamma}$ where I’ve partitioned the coordinate of $y$ w.r.t. $\tilde U$ into the first $p$ places ($\alpha$) and the subsequent $n-p$ places ($\gamma$). This means $y = U\alpha + U_\perp\gamma$.
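
To make this concrete, here's a short continuation of the R chunk above; the response y is just simulated for illustration.

# simulate a response vector (any y in R^n would do here)
y <- drop(x %*% rnorm(p) + rnorm(n))

# coordinates of y w.r.t. u_tilde, split into alpha and gamma
coord <- drop(t(u_tilde) %*% y)
alpha <- coord[1:p]
gamma <- coord[(p + 1):n]

# y is recovered as U alpha + U_perp gamma
sum((u %*% alpha + u_tilde[, (p + 1):n] %*% gamma - y)^2)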

Next I want the part of $y$ that's in $\C(X)$, and because the columns of $U$ and $U_\perp$ are orthogonal to each other this is just $U\alpha$. In other words, $U^Ty = \alpha$ recovers the coordinate (w.r.t. $U$) of the part of $y$ in $\C(X)$. To actually get $\hat y$, I just need to map that coordinate back to the corresponding vector in $\C(X)$, which is
$$
U\alpha = UU^Ty.
$$
So there it is: $H:=UU^T$ is the matrix that sends $y$ to the part of $y$ in $\C(X)$. Note that $H$ is square, symmetric, and idempotent since $H^2 = UU^TUU^T = UU^T = H$ (using $U^TU = I_p$). This makes it a valid projection matrix. Additionally,
$$
UU^T = UDV^TVD^{-2}V^TVDU^T = X(X^TX)^{-1}X^T
$$
so this is the exact same hat matrix as usual. But while I think $X(X^TX)^{-1}X^T$ is rather mysterious looking, $H = UU^T$ is very sensible: it’s composing two easy pieces, extracting the coordinate of a vector via $U^T$, and then sending that to a column space element via $U$.
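
Continuing the R chunk from before, here's a quick check that the two forms of the hat matrix agree and that $H$ really is a projection.

h_svd <- u %*% t(u)
h_usual <- x %*% solve(t(x) %*% x, t(x))

sum((h_svd - h_usual)^2)          # same matrix
sum((h_svd - t(h_svd))^2)         # symmetric
sum((h_svd %*% h_svd - h_svd)^2)  # idempotent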

$\alpha$ is the coordinate of $\hat y$ w.r.t. $U$, but I may reasonably want the coordinate with $X$ as my basis instead, and I can get it by another change of basis. I want a vector $c$ such that $Xc = UU^Ty$, and I know there's an exact (and unique) solution because $UU^Ty$ is in $\C(X)$ and $X$ is a bijection between $\mathbb R^p$ and $\C(X)$. Solving for $c$, I have
$$
\begin{aligned}
Xc &= UU^Ty \\
\implies UDV^Tc &= UU^Ty \\
\implies DV^Tc &= U^Ty \\
\implies c &= VD^{-1}U^Ty \\
&= VD^{-2}V^TVDU^Ty \\
&= (X^TX)^{-1}X^Ty = \hat\beta.
\end{aligned}
$$
(The second implication comes from left-multiplying by $U^T$ and using $U^TU = I_p$.)
In other words, $\hat\beta$ is exactly the coordinate of $\hat y$ that I get when I use $X$ as my basis rather than $U$.
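
And still in the same R session, I can check that $VD^{-1}U^Ty$ agrees with the usual least squares coefficients (using the simulated y from above).

s <- svd(x)
beta_svd <- s$v %*% (t(s$u) %*% y / s$d)   # V D^{-1} U^T y
beta_ols <- solve(t(x) %*% x, t(x) %*% y)  # (X^T X)^{-1} X^T y
beta_lm  <- coef(lm(y ~ x - 1))            # same thing via lm()

sum((beta_svd - beta_ols)^2)
sum((beta_svd - beta_lm)^2)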

In conclusion, I showed that the SVD gives me an orthonormal basis $U$ for $\C(X)$, and this lets me find $\hat y$ as $UU^Ty$. And the coordinate of $\hat y$ w.r.t. the basis $X$ is exactly $\hat\beta$, so I can justly think of point estimation in linear regression as finding the coordinate of the part of $y$ in $\C(X)$.
