$\newcommand{\Var}{\operatorname{Var}}$$\newcommand{\Cov}{\operatorname{Cov}}$$\newcommand{\one}{\mathbf 1}$$\newcommand{\0}{\mathbf 0}$$\newcommand{\e}{\varepsilon}$$\newcommand{\E}{\operatorname{E}}$In this post I’m going to discuss leverage scores in linear regression. In particular, I’ll show how the leverage scores can be computed as a Mahalanobis distance, I’ll give a couple of interpretations for them, and then I’ll finish by showing how they appear in the leave-one-out cross-validated error (LOOCV error) and Cook’s distance, two useful statistics that involve considering the model when fit with and without particular observations. I will follow this up with a second post where I’ll explore the leverage scores with random predictors, but for now the predictors will be considered nonrandom (or equivalently I have conditioned on them). Sometimes I will want to consider the covariate matrix without the intercept column, so I will have $X \in \mathbb R^{n\times p}$ as the “nontrivial” predictors and then $Z = (\one\mid X)$ will be the full $n\times(p+1)$ model matrix. I will assume $Z$, and therefore $X$, are full column rank. My model is therefore $y = Z\beta + \e$ with $\e\sim\mathcal N(\0,\sigma^2 I_n)$.
The hat matrix is defined $H = Z(Z^TZ)^{-1}Z^T$ and is so called because $\hat y = Hy$. The objects of study here are the leverage scores $h_1,\dots,h_n$ defined by $h_i = H_{ii}$.
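To make the results below easy to poke at numerically, here is a minimal numpy sketch with made-up Gaussian predictors (the sizes and the seed are arbitrary choices of mine, not anything from the theory); I’ll reuse the variables it defines in small checks throughout the post.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))            # made-up "nontrivial" predictors
Z = np.column_stack([np.ones(n), X])   # full n x (p+1) model matrix

H = Z @ np.linalg.solve(Z.T @ Z, Z.T)  # hat matrix Z (Z^T Z)^{-1} Z^T
h = np.diag(H)                         # leverage scores h_1, ..., h_n

# An equivalent, more numerically stable route: for a thin QR Z = QR we have
# H = Q Q^T, so h_i is the squared norm of the i-th row of Q.
Q, _ = np.linalg.qr(Z)
assert np.allclose(h, np.sum(Q**2, axis=1))
```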
I’ll prove a couple useful properties of $H$ to get started. Right off the bat I can see $H^T=H$ and $H^2 = H$ so this means $H$ is symmetric and idempotent and is therefore a projection matrix. Furthermore,
$$
\text{tr}(H) = \text{tr}(Z(Z^TZ)^{-1}Z^T) \\
= \text{tr}((Z^TZ)^{-1}Z^TZ) = \text{tr}(I_{p+1}) = p+1.
$$
Next, I’ll show that every eigenvalue of $H$ is either $0$ or $1$, and the fact that the trace is the sum of the eigenvalues then shows that the multiplicity of $1$ as an eigenvalue is $p+1$. Letting $(\lambda, v)$ be an eigenpair of $H$, I have $Hv = \lambda v$. Using idempotence I then have
$$
\lambda v = Hv = H^2v = H(Hv) = H(\lambda v) = \lambda^2 v.
$$
$v$ being an eigenvector means $v\neq \0$ so WLOG I can take $v_1\neq 0$. This means
$$
\lambda v_1 = \lambda^2 v_1 \\
\implies \lambda = \lambda^2.
$$
This holds for any eigenvalue of $H$ and therefore every eigenvalue is $0$ or $1$.
I’ll now turn my attention to the leverage scores. One quick result is $p+1 = \text{tr}(H) = \sum_i h_i$, but more substantial is the next result.
Result 1: $0 \leq h_i \leq 1$.
Pf: By idempotence and symmetry
$$
h_i = (H^2)_{ii} = \sum_{j=1}^n H_{ij}^2 = h_i^2 + \sum_{j\neq i}H_{ij}^2 \\
\implies h_i - h_i^2 = h_i(1-h_i) = \sum_{j\neq i}H_{ij}^2 \geq 0.
$$
The polynomial $x\mapsto x(1-x)$ is non-negative exactly when $0 \leq x \leq 1$ so this shows that for all $i\in\{1,\dots,n\}$ I have $h_i \in [0,1]$.
$\square$
This holds regardless of whether or not I’ve used an intercept. I’ll sharpen this to $h_i \in \left[\frac 1n, 1\right]$ in the next result.
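Continuing the numpy sketch from above, here is a quick numerical sanity check of the trace identity, Result 1, and the sharper lower bound to come (the tolerances are just for floating point slop):

```python
# tr(H) = sum_i h_i = p + 1, and each h_i lies in [0, 1] (in fact [1/n, 1] with an intercept).
assert np.isclose(h.sum(), p + 1)
assert np.all(h >= 1 / n - 1e-12) and np.all(h <= 1 + 1e-12)
```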
Result 2: Assuming an intercept,
$$
h_i = \frac 1n + (x_i - \bar x)^T(X^TSX)^{-1}(x_i-\bar x)
$$
where $S = I_n - \frac 1n\one\one^T$ is the matrix that projects onto the space perpendicular to $\one$.
Pf: By definition
$$
h_i = z_i^T(Z^TZ)^{-1}z_i.
$$
$$
Z^TZ = \left(\begin{array}{c|c}n & \one^TX \\ \hline X^T\one & X^TX\end{array}\right)
$$
so, using the standard formula for the inverse of a $2\times 2$ block matrix (the bottom-right block of the inverse is the inverse of the Schur complement $X^TX - n\bar x\bar x^T = X^TSX$), I have
$$
(Z^TZ)^{-1} = \left(\begin{array}{c|c}
\frac 1n + \bar x^T(X^TSX)^{-1}\bar x & -\bar x^T(X^TSX)^{-1} \\ \hline
-(X^TSX)^{-1}\bar x & (X^TSX)^{-1}
\end{array}\right)
$$
where $\bar x = \frac 1n X^T\one$ is the vector of column means. This only holds when $X^TSX$ is invertible, so let me check that. Suppose $v^TX^TSXv = u^TSu = \|Su\|^2 = 0$ for some $v\neq\0$, where I’ve substituted $u = Xv$ and used the symmetry and idempotence of $S$. Then $Su = \0$, so $u$ lies in the null space of $S$, which is one-dimensional and spanned by $\one$. But $X$ has full column rank, so $u = Xv \neq \0$, and my assumption that $Z$ is full rank means $\one \notin \text{ColSpace}(X)$, so $Xv$ cannot be a nonzero multiple of $\one$. This is a contradiction, so even though $X^TSX$ could be singular in general, under my assumptions it is positive definite and in particular invertible.
I can then compute
$$
\begin{aligned}h_i &= z_i^T(Z^TZ)^{-1}z_i = \frac 1n + \bar x^T(X^TSX)^{-1}\bar x - 2 x_i^T(X^TSX)^{-1}\bar x + x_i^T(X^TSX)^{-1}x_i \\&
= \frac 1n + (x_i - \bar x)^T(X^TSX)^{-1}(x_i - \bar x).\end{aligned}
$$
This shows that the leverage score is determined by the Mahalanobis distance of $x_i$ from the mean, weighted by the inverse of the centered scatter matrix $X^TSX = (SX)^T(SX)$ (using the idempotence and symmetry of $S$), which is $n$ times the empirical covariance matrix of the predictors. This distance will be large when $x_i - \bar x$ is long and closely aligned with a small-eigenvalue eigenvector of $X^TSX$, which means that after centering each column (via $X\mapsto SX$) $x_i$ sits in a low-density part of the point cloud of the data (e.g. if the $x_i$ are roughly Gaussian so the data forms a football-shaped cloud, the distance will be large when $x_i - \bar x$ points well outside that football).
$\square$
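Result 2 is easy to check against the diagonal of $H$, continuing the sketch from above:

```python
# Result 2: h_i = 1/n + (x_i - xbar)^T (X^T S X)^{-1} (x_i - xbar).
Xc = X - X.mean(axis=0)                        # column-centered predictors, i.e. S X
M = np.linalg.inv(Xc.T @ Xc)                   # (X^T S X)^{-1}
h_mahal = 1 / n + np.einsum("ij,jk,ik->i", Xc, M, Xc)
assert np.allclose(h, h_mahal)
```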
Corollary: in simple linear regression (so $p = 1$) I have
$$
h_i = \frac 1n + \frac{(x_i-\bar x)^2}{\sum_{j=1}^n (x_j - \bar x)^2}
$$
Pf: this follows immediately from Result 2 and the fact that $v^TSv = \sum_{j=1}^n (v_j - \bar v)^2$ for $v \in \mathbb R^n$.
$\square$
This corollary lets me explicitly construct an example of a regression with a leverage score of $1$. Continuing to keep $p=1$, so $Z = (\mathbf 1\mid x)$ with $x \in \mathbb R^n$ non-constant (for the full rank assumption), let $x_1 = a$ and $x_2 = \dots = x_n = b$ for $a\neq b$. Intuitively, $x_1$ is the only point keeping $Z$ full rank, so the fitted line will pass through the mean of the $y$ values at $x = b$ and exactly through $(x_1, y_1)$. This means $x_1$ has as much influence as it possibly could, and I will see this reflected in $h_1 = 1$.
I’ll have $n\bar x = a + (n-1)b$ thus
$$
x_1 - \bar x = a - \frac an - \frac{(n-1)b}{n} \\
= \frac{n-1}{n}(a-b)
$$
and for $i \neq 1$
$$
x_i - \bar x = b - \frac an - \frac{(n-1)b}{n} \\
= \frac{b-a}{n}.
$$
This means
$$
h_1 = \frac 1n + \frac{\left(\frac{n-1}{n}\right)^2(a-b)^2}{\left(\frac{n-1}{n}\right)^2(a-b)^2 + \frac{(n-1)}{n^2}(a-b)^2} \\
= \frac 1n + \frac{\frac{n-1}{n}}{\frac{n-1}{n} + \frac{1}{n}} = 1
$$
where the second line divides the numerator and denominator through by $\frac{n-1}{n}(a-b)^2$. So $h_1 = 1$ is confirmed, and dropping this observation would make $Z^TZ$ singular.
$\square$
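Here is that example coded up, with made-up values for $n$, $a$, and $b$ (only numpy from the earlier sketch is needed):

```python
# The p = 1 example: one point at x = a, the other n - 1 points at x = b.
n1, a, b = 10, 5.0, 1.0                         # arbitrary made-up values
x = np.full(n1, b)
x[0] = a
Z1 = np.column_stack([np.ones(n1), x])
h1 = np.diag(Z1 @ np.linalg.solve(Z1.T @ Z1, Z1.T))
print(h1[0])    # 1.0 up to floating point noise
print(h1[1:])   # the remaining scores are all equal, by symmetry
```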
I can now get a few other quick interpretations of $h_i$ before I show its role in a couple common statistics.
First I’ll note that
$$
\Cov(y, \hat y) = \sigma^2 H \\
\implies \Cov(y_i, \hat y_i) = \sigma^2 h_i.
$$
This is interesting because it shows that observations with larger leverage scores have predictions that are more tightly coupled with their targets. This coupling has nothing to do with the actual value of $y_i$ and instead is based purely on the predictors. As $h_i$ decreases to $0$, the prediction $\hat y_i$ becomes less and less correlated with $y_i$ (and, with Gaussian errors, closer to independent of it). Analogously, letting $e = y-\hat y$ be the residuals,
$$
\Var(e) = (I-H)\Var(y)(I-H)^T = \sigma^2(I-H)(I-H)^T = \sigma^2(I-H)
$$
so $\Var(e_i) = (1-h_i)\sigma^2$, where the last equality in the display uses the symmetry and idempotence of $I-H$. This shows that the residual is less and less random as the leverage score increases to $1$.
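A quick Monte Carlo check of both facts, continuing the sketch from above with a made-up $\beta$ and $\sigma$ (the number of replications and the index are arbitrary):

```python
# Simulate many draws of y from the model and compare empirical moments at index i
# to sigma^2 h_i and sigma^2 (1 - h_i).
sigma, beta = 2.0, rng.normal(size=p + 1)
reps, i = 100_000, 0
Y = Z @ beta + sigma * rng.normal(size=(reps, n))   # each row is one draw of y
Yhat = Y @ H                                        # H is symmetric, so each row is (H y)^T
print(np.cov(Y[:, i], Yhat[:, i])[0, 1], sigma**2 * h[i])        # close for large reps
print(np.var(Y[:, i] - Yhat[:, i]), sigma**2 * (1 - h[i]))       # likewise
```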
A similar result is
$$
\frac{\partial \hat y}{\partial y} = \frac{\partial}{\partial y} Hy = H
$$
so $H$ is the matrix of first derivatives of $\hat y$ w.r.t. $y$. This means
$$
\frac{\partial \hat y_i}{\partial y_i} = h_i
$$
which is very similar to the covariance result in that the larger $h_i$ is, the more sensitive $\hat y_i$ is to changes in $y_i$.
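Continuing the previous snippet, the derivative version of the same check is just linearity of $y\mapsto Hy$: bumping $y_i$ moves $\hat y_i$ by exactly $h_i$ times the bump.

```python
# Bump y_i by delta and watch yhat_i move by h_i * delta.
y = Z @ beta + sigma * rng.normal(size=n)   # a single made-up draw of y
delta = 0.5
y_bumped = y.copy()
y_bumped[i] += delta
print((H @ y_bumped)[i] - (H @ y)[i], h[i] * delta)   # equal up to floating point
```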
Although I’m not focusing on it, the covariance and partial derivative approaches also provide interpretations for the off-diagonal elements of $H$ which is nice too.
I’ll finish this post by showing how the leverage scores appear in the LOOCV error and Cook’s distance. The following result will be useful for both.
Result 3: Let $\hat y_{(i)}$ be the prediction for $y_i$ when the model was fit on all $n-1$ observations except $(y_i, x_i)$. Define $e_{(i)} = y_i - \hat y_{(i)}$ to be the corresponding residual.
Claim:
$$
e_{(i)} = \frac{e_i}{1-h_i}
$$
where $e_i$ is the residual when unit $i$ is included in fitting the model, and $h_i = z_i^T(Z^TZ)^{-1}z_i$ is the usual leverage score for unit $i$ when all $n$ observations are considered.
Pf: I will begin by finding $\hat\beta_{(i)}$, the point estimate of $\beta$ when observation $i$ is not used. In my post here I showed how to update a linear regression with a new data point. This is the same idea but instead I’m removing a data point. As I do there, I’ll treat $x_i$ and $z_i$ as column vectors. Using the fact that $Z^TZ = \sum_j z_jz_j^T$ I have
$$
\hat\beta_{(i)} = (Z^TZ - z_iz_i^T)^{-1}(Z^Ty - y_iz_i).
$$
Letting $C = Z^TZ$ for convenience I can use Sherman-Morrison to conclude
$$
(Z^TZ - z_iz_i^T)^{-1} = C^{-1} + \frac{C^{-1}z_iz_i^TC^{-1}}{1 - z_i^TC^{-1}z_i} \\
= C^{-1} + \frac{C^{-1}z_iz_i^TC^{-1}}{1 - h_i}.
$$
This also gives some more intuition for what it means to have $h_i = 1$ for some $i$. Sherman-Morrison says $Z^TZ - z_iz_i^T$ is invertible exactly when $1 - z_i^T(Z^TZ)^{-1}z_i \neq 0$, or equivalently when $h_i < 1$. This means that if I have a leverage score of $1$ then that observation is the only thing between me and a low rank $Z$, so it is effectively determining the fit because there is no unique $\hat\beta$ without it. This fits with the example of $h_1=1$ and the result that $h_i=1 \implies \Var(e_i) = 0$ from earlier.
Moving on, and assuming $h_i<1$, I have
$$
\hat\beta_{(i)} = (Z^TZ - z_iz_i^T)^{-1}(Z^Ty - y_iz_i) \\
= \hat\beta + \frac{C^{-1}z_iz_i^T\hat\beta}{1 - h_i} - y_iC^{-1}z_i - y_i \frac{C^{-1}z_iz_i^TC^{-1}z_i}{1 - h_i}.
$$
I want
$$
y_i - \hat y_{(i)} = y_i - z_i^T\hat\beta_{(i)}
$$
so, continuing to plug in $h_i = z_i^TC^{-1}z_i$, I have
$$
y_i - z_i^T\hat\beta_{(i)} = y_i - z_i^T\hat\beta - \frac{h_iz_i^T\hat\beta}{1 - h_i} + y_ih_i + y_i \frac{h_i^2}{1 - h_i}.
$$
Plugging in $\hat y_i = z_i^T\hat\beta$ and combining the $y_i$ terms via $y_ih_i + y_i\frac{h_i^2}{1 - h_i} = \frac{y_ih_i}{1-h_i}$ leads to
$$
e_{(i)} = y_i - z_i^T\hat\beta_{(i)} = e_i + \frac{h_ie_i}{1-h_i} = \frac{e_i}{1-h_i}
$$
as desired.
$\square$
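Result 3 is easy to verify by brute force, continuing the earlier snippets: refit without a row and compare.

```python
# Result 3: e_i / (1 - h_i) equals the residual from actually refitting without row j.
e = y - H @ y                                       # full-fit residuals
j = 3                                               # any index works
mask = np.arange(n) != j
beta_loo, *_ = np.linalg.lstsq(Z[mask], y[mask], rcond=None)
print(y[j] - Z[j] @ beta_loo, e[j] / (1 - h[j]))    # agree up to floating point
```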
This result is very interesting because it shows that there is a closed form for the error in predicting a point with that observation left out (i.e. I don’t need to refit the model, which could be very expensive). This leads me to my result for the LOOCV mean squared error.
Corollary: assuming all $h_i < 1$,
$$
\text{LOOCV error} = \frac 1n \sum_{i=1}^n e_{(i)}^2 = \frac 1n \sum_{i=1}^n \left(\frac{e_i}{1-h_i}\right)^2.
$$
Pf: plug in the result of Result 3.
$\square$
Being able to compute an out-of-sample loss statistic with only a single model fit is pretty slick.
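Continuing the snippets above, here is the shortcut against $n$ explicit refits:

```python
# LOOCV error via the leverage shortcut vs. n brute-force refits.
loocv_shortcut = np.mean((e / (1 - h)) ** 2)

loo_resids = []
for j in range(n):
    mask = np.arange(n) != j
    b_j, *_ = np.linalg.lstsq(Z[mask], y[mask], rcond=None)
    loo_resids.append(y[j] - Z[j] @ b_j)
print(loocv_shortcut, np.mean(np.square(loo_resids)))   # identical up to floating point
```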
Result 4: Cook’s distance for observation $i$ is defined as
$$
D_i = \frac{\sum_{j=1}^n (\hat y_j - \hat y_{j(i)})^2}{(p+1)s^2}
$$
with $s^2 = \frac 1{n-(p+1)}e^Te$ being the usual MSE and $\hat y_{j(i)}$ being the prediction for $y_j$ when the model was fit without unit $i$ (the $p+1$ appears because $\beta$ has $p+1$ elements here, intercept included, instead of the usual $p$).
Claim:
$$
D_i = \frac{e_i^2h_i}{(p+1)s^2(1-h_i)^2}.
$$
Pf: In Result 3 I wanted $y_i - \hat y_{(i)}$. I now need $y_j - \hat y_{j(i)}$. From the proof of Result 3 I know
$$
\hat y_{j(i)} = z_j^T\hat\beta_{(i)} \\
= z_j^T\left(\hat\beta + \frac{C^{-1}z_i\hat y_i}{1-h_i} - y_iC^{-1}z_i - y_i\frac{C^{-1}z_ih_i}{1-h_i}\right)
$$
where $C = Z^TZ$ as before. Distributing $z_j^T$, using $z_j^TC^{-1}z_i = H_{ji} = H_{ij}$, and combining terms I have
$$
\hat y_{j(i)} = \hat y_j - \frac{H_{ij}e_i}{1-h_i}
$$
(which is a nice result in and of itself). This means
$$
\sum_{j=1}^n (\hat y_j - \hat y_{j(i)})^2 = \left(\frac{e_i}{1-h_i}\right)^2 \sum_jH_{ij}^2.
$$
$H$ is idempotent and symmetric so
$$
h_i = (H^2)_{ii} = \sum_j H_{ij}^2
$$
which means
$$
\sum_{j=1}^n (\hat y_j - \hat y_{j(i)})^2 = \frac{e_i^2h_i}{(1-h_i)^2}
$$
and the result follows.
$\square$
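And a last numerical check of Result 4 against the definition, continuing the snippets above (the index is again arbitrary):

```python
# Cook's distance via the closed form vs. the definition with an explicit refit.
s2 = e @ e / (n - (p + 1))                                  # the MSE s^2
D_formula = e**2 * h / ((p + 1) * s2 * (1 - h) ** 2)

j = 7
mask = np.arange(n) != j
b_j, *_ = np.linalg.lstsq(Z[mask], y[mask], rcond=None)
D_direct = np.sum((H @ y - Z @ b_j) ** 2) / ((p + 1) * s2)  # sum_k (yhat_k - yhat_{k(j)})^2 / ((p+1) s^2)
print(D_direct, D_formula[j])                               # agree up to floating point
```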
I’ll follow this with a post on the $h_i$ when $X$ is random.