$\newcommand{\e}{\varepsilon}$$\newcommand{\1}{\mathbf 1}$$\newcommand{\one}{\mathbf 1}$$\newcommand{\0}{\mathbf 0}$In this post I’m going to work with simple linear regression in matrix form. Most treatments of simple linear regression that I see (e.g. the Wikipedia article on it) avoid matrix algebra, but I think there’s still value in this approach. To that end, I’m considering the model $y = X\beta + \e$ where $X = \left(\1 \mid x \right)$ for some $x \in \mathbb R^n$ and $\beta = {\beta_0 \choose \beta_1}$. I’ll also assume $X$ is full rank, which in this particular case means $x \notin \text{span }\1$. I’ll sometimes refer to the general case where $X \in \mathbb R^{n\times p}$, and there I’ll assume $n \geq p$.
I know in general
$$
\hat\beta = (X^TX)^{-1}X^Ty
$$
and because $X^TX$ is $2\times 2$ I can explicitly work out what this is.
$$
X^TX = {\1^T\choose x^T}\left(\1 \;\; x \right) = \left(\begin{array}{cc} n &\1^Tx \\ \1^Tx & x^Tx \end{array}\right)
$$
so, using the well-known formula for the inverse of a $2\times 2$ matrix,
$$
(X^TX)^{-1} = \left(\begin{array}{cc} n &\1^Tx \\ \1^Tx & x^Tx \end{array}\right)^{-1} \\
= \frac{1}{nx^Tx - (\1^Tx)^2}\left(\begin{array}{cc} x^Tx & -\1^Tx \\ -\1^Tx & n \end{array}\right).
$$
The scalar out front is the reciprocal of $\det X^TX = nx^Tx - (\1^Tx)^2$, and I can put this determinant in a much more interpretable form.
$$
nx^Tx - (\1^Tx)^2 = nx^Tx - x^T\1\1^Tx = nx^T\left(I - \frac 1n\1\1^T\right)x
$$
so letting $S = I - \frac 1n\1\1^T$ I have that $\det X^TX = nx^TS x$, i.e. it’s a quadratic form with $S$. I’m going to digress and discuss $S$ a bit (I find this matrix comes up often with linear regression, such as with leverage scores).
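Before that digression, here’s a quick numerical sanity check of $\det X^TX = nx^TSx$. This is just a sketch in numpy; the simulated $x$ and the variable names are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])       # X = (1 | x)
S = np.eye(n) - np.ones((n, n)) / n        # S = I - (1/n) 1 1^T

lhs = np.linalg.det(X.T @ X)
rhs = n * (x @ S @ x)
print(np.isclose(lhs, rhs))                # True
```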
Comments on $S$
$S$ is a really neat matrix. It’s square and symmetric, and
$$
\begin{aligned}S^2 &= \left(I - \frac 1n\1\1^T\right)\left(I - \frac 1n\1\1^T\right) = I - \frac 2n\1\1^T + \frac 1{n^2}\1\1^T\1\1^T \\&
= I - \frac 2n\1\1^T + \frac 1n \1\1^T = S\end{aligned}
$$
so $S$ is also idempotent. This means that $S$ is a projection matrix. To see what space it projects into, suppose $Sv = \0$ for some $v\neq \0$. This means $v - \bar v\1 = \0$ which implies $v \in \text{span }\1$. Thus $S$ has a 1-dimensional null space spanned by $\1$, and by the rank-nullity theorem $S$ is rank $n-1$. This shows that $S$ projects into the $(n-1)$-dimensional space orthogonal to $\1$. To check this, I can note that for any $x \in \mathbb R^n$
$$
\one^T S x = \one^T x - \frac 1n \one^T\one\one^Tx = \one^Tx - \one^T x = 0
$$
so $Sx$ is indeed orthogonal to $\one$, and if $v\perp\one$ then $Sv = v - \frac 1n\one\one^Tv = v$ so $S$ leaves $v$ alone. Because $S^TS = S^2 = S$ I have $x^TSx = x^TS^TSx = \|Sx\|^2$, which lets me interpret $\det X^TX$ as ($n$ times) the squared length of the part of $x$ not in $\text{span } \1$, and $\det X^TX = 0$ if and only if $x$ is in $\text{span } \1$. Thus my assumption of $X$ being full rank is equivalent to $x^T S x > 0$, and this is also exactly what I need for $X^TX$ to be invertible.
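These projection properties are easy to confirm numerically; here’s a small sketch on made-up data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
S = np.eye(n) - np.ones((n, n)) / n

print(np.allclose(S @ S, S))                      # idempotent
print(np.allclose(S @ np.ones(n), 0))             # 1 spans the null space
v = x - x.mean()                                  # a vector orthogonal to 1
print(np.allclose(S @ v, v))                      # S leaves it alone
print(np.isclose(x @ S @ x, np.linalg.norm(S @ x) ** 2))  # quadratic form = ||Sx||^2
```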
I can also interpret $S$ as an operator giving the sample covariance:
$$
\begin{aligned}y^T S x &= y^Tx - \frac 1n (y^T\1)(x^T\1) \\&
= y^Tx - n \bar x\bar y \\&
= y^Tx - n \bar x\bar y - n \bar x \bar y + n\bar x\bar y \\&
= \sum_i x_i y_i - \sum_i x_i \bar y - \sum_i y_i \bar x + \sum_i \bar x \bar y\\&
= \sum_i (x_iy_i - x_i\bar y - y_i \bar x + \bar x \bar y) \\&
= \sum_i (x_i - \bar x)(y_i - \bar y) \\&
= n\cdot \text{cov}(x, y)\end{aligned}
$$
i.e. $\frac 1n y^T S x$ is exactly the sample covariance of the vectors $x$ and $y$ (using $n$ instead of $n-1$ in the denominator for simplicity); this additionally means $\text{var}(x) =\frac 1n x^TSx =\frac 1n\|Sx\|^2$ which confirms the usual interpretation of the sample variance as measuring the spread around the mean.
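To convince myself of this numerically, here’s a sketch comparing $\frac 1n y^TSx$ and $\frac 1n x^TSx$ against numpy’s own (biased, denominator $n$) sample covariance and variance; the data are simulated and the names are mine.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x, y = rng.normal(size=n), rng.normal(size=n)
S = np.eye(n) - np.ones((n, n)) / n

cov_xy = np.cov(x, y, bias=True)[0, 1]        # denominator n, matching the text
print(np.isclose(y @ S @ x / n, cov_xy))      # True
print(np.isclose(x @ S @ x / n, x.var()))     # np.var also uses denominator n
```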
An additional consequence of this that I could have proved in various ways (e.g. Cauchy-Schwarz, or by noting that as a projection matrix all of its eigenvalues are $0$ or $1$) is that $S$ is positive semi-definite (PSD), since for any vector $x$ I have $x^TSx = n\,\text{var}(x) \geq 0$.
The final interpretation I’ll discuss gets more at how $S$ arises as a projection matrix. Suppose I have two vectors $u$ and $v$, and I want to project $u$ into the space spanned by $v$. This means I’m looking for something of the form $\hat u = av$ for $a\in\mathbb R$. I can think about the triangle formed by $\hat u$, $u$, and the error $u -\hat u$ as shown in the figure below.
This is to be an orthogonal projection so I’ll need
$$
\begin{aligned}&\hat u \perp (u - \hat u) \\&
\implies av^T(u - av) = 0 \\&
\implies au^Tv - a^2 v^Tv = 0 \\&
\implies a(u^Tv - av^Tv) = 0 \\&
\implies a = \frac{u^Tv}{v^Tv} \;\;\text{ or } \;\;a = 0.\end{aligned}
$$
If $a = 0$ then I’ll have $\hat u = \0$ which happens when $u\perp v$. But in this case I’ll also have $u^Tv = 0$ so I can just say $a = \frac{u^Tv}{v^Tv}$ without losing any solutions.
This also means
$$
\hat u = av = \frac{u^Tv}{v^Tv}v = v(v^Tv)^{-1}v^T u
$$
so $v(v^T v)^{-1}v^T$ is the rank 1 projection matrix that sends a vector $u$ to $\hat u$. The matrix $I - v(v^Tv)^{-1}v^T$ gives the part that’s left over after this projection.
The result of all this is that I can note that
$$
\frac 1n\one\one^T = \one(\one^T\one)^{-1}\one^T
$$
so the rank $1$ matrix $\frac 1n\one\one^T$ is exactly the projection onto $\one$, and then $S = I - \frac 1n\1\1^T$ gives the residual part of this projection. This means $\frac 1n \|Sx\|^2 = \text{var}(x)$ is the residual MSE for a regression of $x$ on $\1$.
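Here’s a quick sketch confirming that $\frac 1n\one\one^T$ is the projection onto $\one$ and that $\frac 1n\|Sx\|^2$ is the mean squared residual from regressing $x$ on $\one$ (simulated data, my own names).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
ones = np.ones(n)

P1 = np.outer(ones, ones) / (ones @ ones)     # v (v^T v)^{-1} v^T with v = 1
print(np.allclose(P1, np.ones((n, n)) / n))   # same matrix as (1/n) 1 1^T

resid = x - P1 @ x                            # residuals of regressing x on 1, i.e. x - x_bar
print(np.isclose((resid @ resid) / n, x.var()))  # residual MSE equals var(x)
```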
Back to Linear Regression
Altogether, I have the tidy result that $\det X^TX = nx^TS x$ and I now have a few different ways to think about what $S$ does as an operator. I’ll actually get $\hat\beta$ now.
$$
\hat\beta = \frac{1}{nx^T S x}\left(\begin{array}{cc} x^Tx & -\1^Tx \\ -\1^Tx & n \end{array}\right){\1^Ty \choose x^Ty}
$$
so
$$
\hat\beta_1 = \frac{nx^Ty - x^T\1\1^Ty}{nx^T S x} = \frac{x^T S y}{x^T S x} = \frac{\langle Sx, Sy\rangle}{\|Sx\|^2}.
$$
This gives a few interpretations of $\hat\beta_1$. The last equation there shows that $\hat\beta_1$ is what I get from projecting $Sy$ onto $Sx$, or equivalently that $\hat\beta_1$ is the result of first regressing $x$ and $y$ on $\1$ and then regressing $y$’s residuals onto $x$’s residuals. This all makes sense as the slope doesn’t care about the location, so it’s not too weird that I’m effectively centering $x$ and $y$ and then looking at the remaining agreement between the two. From my result about $S$ in terms of sample variances and covariances I also have
$$
\hat\beta_1 = \frac{\text{cov}(x, y)}{\text{var}(x)} = \frac{\text{cov}(x, y)\cdot\text{sd}(y)}{\text{sd}(x) \cdot \text{sd}(y) \cdot \text{sd}(x)} \\
= \text{cor}(x, y)\frac{\text{sd}(y)}{\text{sd}(x)}
$$
which is another common form for $\hat \beta_1$ (assuming $\text{sd}(y) >0$; if $\text{sd}(y) = 0$ I’ll just have $\hat\beta_1 = 0$ as $y \in \text{span }\1$). This says that the slope is determined by taking the correlation coefficient and scaling it to have the right units by the ratio of the spread of $y$ to $x$. If $x$ and $y$ have been normalized to have unit variances then $\hat\beta_1$ is simply the correlation coefficient itself.
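Here’s a sketch checking these forms of the slope against a generic least squares solver; the simulated data and the true coefficients ($1.5$ and $2$) are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = 1.5 + 2.0 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
S = np.eye(n) - np.ones((n, n)) / n

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]       # reference fit
slope_S   = (x @ S @ y) / (x @ S @ x)                 # x^T S y / x^T S x
slope_cov = np.cov(x, y, bias=True)[0, 1] / x.var()   # cov(x, y) / var(x)
slope_cor = np.corrcoef(x, y)[0, 1] * y.std() / x.std()
print(np.allclose([slope_S, slope_cov, slope_cor], beta_hat[1]))  # True
```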
For $\hat\beta_0$, I have
$$
\hat\beta_0 = \frac{x^Tx\1^Ty - x^T\1x^Ty}{nx^TS x}
$$
and I can do a bit of rearranging to make this more palatable. First substituting in $nS = nI - \1\1^T$, and using the fact that $x^T x > 0$ (otherwise $x \in \text{span } \1$ which is forbidden by my assumption that $X$ is full rank),
$$
\begin{aligned}\frac{x^Tx\1^Ty - x^T\1x^Ty}{n x^Tx - \1^Tx x^T\1} &= \frac{\1^Ty - \1^Tx(x^Tx)^{-1}x^Ty}{\1^T\1 - \1^Tx(x^Tx)^{-1}x^T\1} \\&
= \frac{\1^T (I - x(x^Tx)^{-1}x^T)y}{\1^T(I - x(x^Tx)^{-1}x^T)\1} \\&
= \frac{\1^T R_xy}{\1^T R_x \1}\end{aligned}
$$
where
$$
R_x = I - x(x^Tx)^{-1}x^T
$$
is the matrix that projects into the space orthogonal to $\text{span }x$. Thus $\hat\beta_0$ is no different than $\hat\beta_1$ in that it considers the projection of the residuals of a regression of $y$ on $x$ onto the residuals of a regression of $\1$ on $x$.
I can use some algebraic trickery to get the usual form for $\hat\beta_0$ from here.
$$
\hat\beta_0 = \frac{y^T\1x^Tx - y^Tx\1^Tx}{nx^TS x} = \frac{y^T(\1x^T - x\1^T)x}{n x^TSx}.
$$
I can write
$$
\1x^T - x\1^T = \1x^T - \frac 1n \1x^T\1\1^T - x\1^T + \frac 1n \1\1^Tx\1^T \\
= \1x^T S - S x\1^T
$$
so
$$
\hat\beta_0 = \frac{y^T(\1x^T S - S x\1^T)x}{n x^TSx} \\
= \frac{y^T\1x^T S x - y^TSx\1^Tx}{n x^T S x} \\
= \frac{y^T\1}{n} - \frac{y^TS x}{x^T S x} \cdot \frac{\1^T x}{n} \\
= \bar y - \hat\beta_1 \bar x.
$$
To interpret this, I can note that if $x$ is centered or the slope is zero then $\hat\beta_0$ is simply $\bar y$. It’s also worth noting that this rearranges to
$$
\hat\beta_0 + \hat\beta_1\bar x = \bar y
$$
which just says that the intercept is chosen so that at the mean of $x$ I’m at the mean of $y$; in other words, the line is forced to pass through the center of the point cloud as $(\bar x, \bar y)$ is always on it.
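And a matching sketch for the intercept, checking both the $R_x$ form and $\bar y - \hat\beta_1\bar x$ against a direct fit (again on made-up data with names of my own).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n) + 3.0
y = 1.5 + 2.0 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
ones = np.ones(n)

b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
R_x = np.eye(n) - np.outer(x, x) / (x @ x)            # projection orthogonal to span{x}
print(np.isclose((ones @ R_x @ y) / (ones @ R_x @ ones), b0))  # True
print(np.isclose(y.mean() - b1 * x.mean(), b0))                # True
```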
Loss surface
Next I want to look at the loss surface. For a given $b \in \mathbb R^p$ ($p = 2$ in my case although I’ll be treating this in general for now) I have
$$
\ell(b) = \|y – Xb\|^2
$$
as my loss. I can expand this and complete the square in $b$ (assuming $X^TX$ is invertible, which I have been assuming) to find
$$
\ell(b) = b^TX^TXb - 2b^TX^Ty + y^Ty \\
= (b - \hat \beta)^TX^TX(b - \hat \beta) + y^T(I - H)y
$$
for $H = X(X^TX)^{-1}X^T$. I want to work out the shape of $\ell$. The $y^T(I-H)y$ term only shifts $\ell$ up and down by a constant so WLOG I can drop that.
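Before working out the geometry, here’s a sketch verifying the completed square numerically for an arbitrary $b$ (simulated data and names are mine).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 2
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.5, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
H = X @ np.linalg.inv(X.T @ X) @ X.T                  # hat matrix

b = rng.normal(size=p)                                # any candidate coefficient vector
lhs = np.sum((y - X @ b) ** 2)                        # ||y - Xb||^2
rhs = (b - beta_hat) @ X.T @ X @ (b - beta_hat) + y @ (np.eye(n) - H) @ y
print(np.isclose(lhs, rhs))                           # True
```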
$X^TX$ is symmetric so by the spectral theorem I can factor it as $X^TX =V\Lambda V^T$ where $V$ is orthogonal and $\Lambda$ is diagonal with its entries in decreasing order. $X^TX$ being invertible means it’s PD so all the eigenvalues in $\Lambda$ are positive.
I can consider the change of variables $z = \Lambda^{1/2}V^T(b - \hat\beta)$ so (up to that constant)
$$
\ell(b) = (b - \hat \beta)^TX^TX(b - \hat \beta) = z^T z.
$$
Let $L(c) = \{z \in \mathbb R^p : z^Tz = c\}$ give the level sets of $z^Tz$. If $c < 0$ then $L(c) = \emptyset$, $L(0) = \{\0\}$, and otherwise $L(c)$ is a sphere centered at $\0$; the spheres increase in radius as $c$ increases (since $\sqrt c$ is exactly the radius) so they are nested. This means that overall, in terms of $z$, I’m looking at a paraboloid; this is easily seen in my case with $p=2$ since if I imagine placing each level set at its corresponding height, the shape I’ll end up building is exactly a paraboloid.
When I undo the change of variables, I’m applying the inverse transformation $b = V\Lambda^{-1/2}z + \hat\beta$. $\Lambda^{-1/2}$ will leave the directions of the axes unaffected but will shrink or expand some directions, and this turns the level sets into ellipsoids. The ellipsoids will have their longest axis be the one corresponding to $\lambda_p$ since $\lambda_p^{-1/2}$ is the largest. Then $V$ rotates these ellipsoids and adding $\hat\beta$ shifts them to be centered at $\hat\beta$. The upshot is that $\ell$ is a paraboloid centered at $\hat\beta$, which confirms that $\hat\beta$ is the unique global minimum, and the shape of this paraboloid is determined by the eigenvectors and eigenvalues of $X^TX$. The ellipsoids will be the longest in the direction given by the eigenvector of $X^TX$ with the smallest eigenvalue.
This has very interesting consequences for ill-conditioned regressions. As my feature matrix $X$ becomes increasingly close to being low rank, I’ll have $\lambda_p \to 0$ (and perhaps others as well), which will result in $\lambda_p^{-1/2} \to \infty$. This means that the ellipsoidal contours will get more and more elongated; in the limit, the contours will open up and no longer be enclosed. If $p=2$ I can visualize it: I’ll see the ellipses elongate until they eventually open up into parallel lines, and the minimum is now a line rather than a point. This is a visual representation of the fact that as it becomes harder and harder to separate out the effects of my predictors, there is an increasingly large region of nearly identical loss corresponding to increasing some coefficients while decreasing others.
The figure below shows the contour $\{b^TX^TXb = 1\}$ for some simulated data as the standard deviation of my predictor $x$ goes to zero, which puts $x$ closer and closer to being in $\text{span } \1$.
Another place this shows up is the following: suppose $\text{E}(\e) = \0$ and $\text{Var}(\e) = \sigma^2 I$. Consider the error $\hat\beta – \beta$. I have
$$
\text{E}(\hat\beta) = (X^TX)^{-1}X^T\text{E}(y) = (X^TX)^{-1}X^TX\beta = \beta
$$
so $\hat\beta$ is unbiased, and
$$
\text{Var}(\hat\beta) = (X^TX)^{-1}X^T\text{Var}(y)X(X^TX)^{-1} \\
= \sigma^2 (X^TX)^{-1}.
$$
Thus $\hat\beta - \beta$ is a random vector with mean $\0$ and variance $\sigma^2(X^TX)^{-1}$. There is a standard result that if $z$ is a random vector with mean $\mu$ and covariance matrix $\Sigma$ then for a matrix $A$
$$
\text{E}(z^T A z) = \text{tr}(A\Sigma) + \mu^TA\mu.
$$
Proof: I can use the fact that the trace of a scalar is the scalar itself, so
$$
\text{E}(z^TA z) = \text{E}\left[\text{tr}(z^T A z)\right] \\
= \text{E}\left[\text{tr}(A z z^T)\right].
$$
By the linearity of expectation I can exchange $\text{E}$ and $\text{tr}$ so
$$
\begin{aligned}\text{E}(z^TA z) &= \text{tr} \left[\text{E}(Azz^T)\right] \\&
= \text{tr} \left[A\text{E}(zz^T)\right] \\&
= \text{tr} \left[A\left(\Sigma + \mu\mu^T\right)\right] \\&
= \text{tr} \left[A\Sigma + A\mu\mu^T\right] \\&
= \text{tr}(A\Sigma) + \mu^TA\mu\end{aligned}
$$
as desired.
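A Monte Carlo sketch of this lemma, with a $\mu$, $\Sigma$, and $A$ of my own arbitrary choosing (so the match is only up to simulation error):

```python
import numpy as np

rng = np.random.default_rng(6)
p = 3
mu = np.array([1.0, -2.0, 0.5])
B = rng.normal(size=(p, p))
Sigma = B @ B.T                                   # an arbitrary covariance matrix
A = rng.normal(size=(p, p))                       # the lemma holds for any A

z = rng.multivariate_normal(mu, Sigma, size=200_000)
lhs = np.mean(np.einsum("ij,jk,ik->i", z, A, z))  # average of z^T A z over the draws
rhs = np.trace(A @ Sigma) + mu @ A @ mu
print(lhs, rhs)                                   # close, up to Monte Carlo error
```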
Applying this, I have
$$
\text{E}\|\hat\beta - \beta\|^2 = \sigma^2 \text{tr}(X^TX)^{-1}.
$$
Now $X^TX = V\Lambda V^T$ so
$$
\text{tr}(X^TX)^{-1} = \text{tr}(\Lambda^{-1}V^TV) = \sum_{j=1}^p \frac{1}{\lambda_j}
$$
i.e.
$$
\text{E}\|\hat\beta - \beta\|^2 = \sigma^2 \sum_{j=1}^p \frac{1}{\lambda_j}.
$$
Thus the more ill-conditioned the problem, i.e. the closer to being unidentifiable as indicated by having elongated contours, the higher the mean squared error is.
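Here’s a Monte Carlo sketch of this mean squared error formula on simulated data; the noise level, true coefficients, and replication count are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma = 100, 0.5
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([1.5, 2.0])

lam = np.linalg.eigvalsh(X.T @ X)                 # eigenvalues of X^T X
theory = sigma**2 * np.sum(1.0 / lam)             # sigma^2 * sum(1 / lambda_j)

errs = []
for _ in range(10_000):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    errs.append(np.sum((beta_hat - beta) ** 2))
print(np.mean(errs), theory)                      # close, up to Monte Carlo error
```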
Low rank feature matrices
If $X$ is actually low rank I’ll need to change my formulation a little bit. I will consider the set of least squares estimators (LSEs)
$$
\text{LSE}(y, X) := \underset{b\in\mathbb R^p}{\text{argmin }}\|y – Xb\|^2.
$$
This problem is convex no matter the rank of $X$ so any root of the first derivative w.r.t. $b$ is a global minimum; taking derivatives shows that this is equivalent to solving the so-called normal equations $X^TXb = X^Ty$, i.e. $\text{LSE}(y,X)$ is the set of solutions to that linear equation.
If $X$ is full rank then $X^TX$ is invertible so $\text{LSE}(y,X) = \{\hat\beta\}$. But if $X$ is low rank then there are infinitely many solutions. I’ll use pseudoinverses to produce one solution and then I’ll characterize the remaining ones. I’ll only need to consider pseudoinverses of a diagonal matrix, so if $D = \text{diag}(d_1,\dots,d_p)$ then the pseudoinverse is the diagonal matrix $D^+$ where
$$
D^+_{ii} = \begin{cases}d_i^{-1} & d_i \neq 0 \\ 0 & d_i = 0\end{cases}.
$$
If $D$ is actually invertible then $D^{-1} = D^+$.
Let $X = UDV^T$ be the (thin) SVD of $X$, so $U \in \mathbb R^{n\times p}$ has orthonormal columns, $D \in \mathbb R^{p\times p}$ is diagonal, and $V \in \mathbb R^{p\times p}$ is orthogonal. Letting $r$ be the rank of $X$, I’ll write
$$
D = \left(\begin{array}{c|c} D_r & \0 \\ \hline \0 & \0\end{array}\right)
$$
so
$$
D^+ = \left(\begin{array}{c|c} D_r^{-1} & \0 \\ \hline \0 & \0\end{array}\right).
$$
I’m looking to solve
$$
X^TXb = X^Ty \iff VD^2V^Tb = VDU^Ty \\
\iff D^2V^Tb = DU^Ty.
$$
Noting that $D^2D^+ = D$ I can guess that a solution is $\hat b = VD^+U^Ty$, which I can verify to be true:
$$
X^TX\hat b = VD^2V^TVD^+U^Ty = VDU^Ty = X^Ty
$$
as desired. This shows that $\text{LSE}(y,X)$ is never empty, and if $X$ is full rank then all $d_i > 0$ so $D^+ = D^{-1}$ and this makes $\hat b = \hat\beta$.
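Here’s a sketch of this construction on a deliberately rank-deficient $X$ (I just duplicate a column to force low rank); it checks that $\hat b = VD^+U^Ty$ satisfies the normal equations and agrees with numpy’s pseudoinverse solution. The $10^{-10}$ cutoff below is just an arbitrary numerical tolerance of mine.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x, x])           # rank 2, but p = 3
y = rng.normal(size=n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
d_plus = np.array([1.0 / di if di > 1e-10 else 0.0 for di in d])  # D^+
b_hat = Vt.T @ (d_plus * (U.T @ y))               # V D^+ U^T y

print(np.allclose(X.T @ X @ b_hat, X.T @ y))      # normal equations hold
print(np.allclose(b_hat, np.linalg.pinv(X) @ y))  # agrees with the pseudoinverse solution
```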
To characterize the other possible solutions, I’ll return to completing the square in $b$. I can’t use $(X^TX)^{-1}$ as before but I can use $\hat b$. I’ll just replace $\hat \beta$ with $\hat b$ and see what happens.
$$
\begin{aligned}(b - \hat b)^TX^TX(b - \hat b) &= b^TX^TXb - 2b^TX^TX\hat b + \hat b^TX^TX\hat b \\&
= b^TX^TXb - 2b^TVD^2V^TVD^+U^Ty + y^TUD^+V^TVD^2V^TVD^+U^Ty \\&
= b^TX^TXb -2b^TVDU^Ty + y^TU\left(\begin{array}{c|c}I_r & \0 \\ \hline \0 & \0\end{array}\right)U^Ty \\&
= b^TX^TXb -2b^TX^Ty + y^TU_rU_r^Ty\end{aligned}
$$
where $U_r$ is the first $r$ columns of $U$. If $r = p$ then that last term will be $y^TUU^Ty = y^THy$ as before, but now I’ve generalized this. I’m trying to get this to match $\|y - Xb\|^2 = y^Ty - 2b^TX^Ty + b^TX^TXb$ so I just need to subtract that last term and add $y^Ty$, which gives me
$$
\|y - Xb\|^2 = (b - \hat b)^TX^TX(b - \hat b) + y^T(I - U_rU_r^T)y.
$$
This means that every element of $\text{LSE}(y,X)$ satisfies $(b - \hat b)^TX^TX(b - \hat b) = \|X(b - \hat b)\|^2 = 0$, i.e. $X(b - \hat b) = \0$. In other words,
$$
\text{LSE}(y,X) = \hat b + N(X)
$$
where $N(X)$ is the null space of $X$. By the rank-nullity theorem $\dim N(X) = p – r$ and $Xv_{j} = \0$ for $j = r+1, \dots, p$ (where $v_j$ is the $j$th column of $V$) which means $\{v_{r+1},\dots,v_p\}$ is a linearly independent set of $p-r$ elements in $N(X)$ and is therefore a basis. This means that I can ultimately characterize the least squares estimators as
$$
\text{LSE}(y, X) = \hat b + \text{span }\{v_{r+1}, \dots, v_p\}.
$$
What’s interesting about this is that I already discussed how $v_{r+1}$ through $v_p$ (the eigenvectors of $X^TX$ with eigenvalue $0$) are the axes of the elliptical contours that get stretched into infinite valleys when $X$ is low rank, so this exactly agrees with the picture from that section. Also note that for $j = r+1,\dots,p$,
$$
\hat b^Tv_j = y^TUD^+V^Tv_j = y^TUD^+e_j = 0
$$
(where $e_j$ is the $j$th standard basis vector), so $\hat b$ is orthogonal to $\text{span }\{v_{r+1}, \dots, v_p\}$ and this is not a redundant specification of this affine space.
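To close, here’s a sketch checking this characterization numerically: perturbing $\hat b$ along a trailing right singular vector doesn’t change the loss, and $\hat b$ is orthogonal to that direction (the same kind of artificially low-rank $X$ as above).

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x, x])           # rank r = 2 with p = 3
y = rng.normal(size=n)

b_hat = np.linalg.pinv(X) @ y                     # the V D^+ U^T y solution
U, d, Vt = np.linalg.svd(X, full_matrices=False)
v3 = Vt[-1]                                       # right singular vector with d_3 ~ 0

print(np.allclose(X @ v3, 0))                     # v3 spans the null space of X
b_alt = b_hat + 2.7 * v3                          # another least squares solution
print(np.isclose(np.sum((y - X @ b_hat) ** 2),
                 np.sum((y - X @ b_alt) ** 2)))   # same loss
print(np.isclose(b_hat @ v3, 0))                  # b_hat is orthogonal to v3
```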