$\newcommand{\e}{\varepsilon}$$\newcommand{\Var}{\operatorname{Var}}$$\newcommand{\x}{\mathbf x}$$\newcommand{\one}{\mathbf 1}$$\newcommand{\0}{\mathbf 0}$In my previous post I stopped right before working out the REML objective function, so I'll start with that, and then I'll check my work by fitting a random slopes model using my equations and comparing with the output of lme4. All the notation in this post is defined in part 1, but as a quick reminder my model is
$$
y = X\beta + Z\gamma + \e
$$
with
$$
\gamma \sim \mathcal N(\0, \sigma^2 \Omega^{-1}) \perp \e\sim\mathcal N(\0, \sigma^2 I)
$$
and I’m parameterizing $\Omega$ with a vector $\theta$.
Using maximum likelihood for $(\hat\sigma^2, \hat\theta)$, as I did in the previous post, makes some things easier, like likelihood ratio tests, but $\hat\sigma^2$ is known to be biased: it underestimates the true variability because its denominator of $n$ is too large. Intuitively, the fitted mean is estimated from the same data, so the data cluster more tightly around it than around the true mean, and dividing by $n$ doesn't account for the degrees of freedom used up by that fit.
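Here's a quick simulation of that bias in the simplest iid normal setting (this snippet is just my own illustration, not part of the mixed model): dividing the squared deviations from the sample mean by $n$ systematically undershoots the true variance of $1$, while dividing by $n-1$ doesn't.

set.seed(42)
n <- 10
sims <- replicate(1e4, {
  x <- rnorm(n)                        # true variance is 1
  xbar <- mean(x)
  c(ml = sum((x - xbar)^2) / n,        # divide by n: the ML-style estimate
    unbiased = sum((x - xbar)^2) / (n - 1))
})
rowMeans(sims)  # the "ml" row comes out noticeably below 1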
Restricted maximum likelihood (REML) can be used to help with this. An improper uniform prior is put on $\beta$ and it is integrated out so then the likelihood that is optimized is only a function of $\sigma^2$ and $\theta$. The marginal distribution of $y$ after integrating out $\gamma$ is
$$
y \mid \beta \sim \mathcal N(X\beta, \sigma^2(I-S_\theta)^{-1})
$$
where I’m now including the dependence on $\beta$. This means that the density for REML, which I’ll denote by $p_r$, is
$$
\begin{aligned}p_r(y\mid \theta,\sigma^2) &= \int_{\mathbb R^p} \mathcal N(y\mid X\beta, \sigma^2(I-S_\theta)^{-1}) \,\text d\beta \\&
= \frac{|I-S_\theta|^{1/2}}{(2\pi\sigma^2)^{n/2}}\int_{\mathbb R^p} \exp\left(-\frac 1{2\sigma^2} (y-X\beta)^T(I-S_\theta)(y-X\beta)\right) \,\text d\beta .\end{aligned}
$$
I want to turn this into a quadratic form in $\beta$ so I’ll expand inside the exponential and complete the square. I’ll let $R_\theta = I-S_\theta$ and $A_\theta = (X^T R_\theta X)^{-1}X^TR_\theta$ to make the notation easier ($A_\theta$ gives the weighted pseudoinverse of $X$ and I’m using “$R$” for “residual” since this matrix gives the residuals of a generalized ridge regression on $Z$). This gives me
$$
(y-X\beta)^TR_\theta(y-X\beta) = y^TR_\theta y - 2 \beta^TX^TR_\theta y + \beta^TX^T R_\theta X\beta.
$$
For $v\neq \0$ consider $v^TX^TR_\theta Xv = u^T R_\theta u$ with $u = Xv$. Since $X$ is full column rank $v\neq \0 \implies u \neq \0$, and then $u^TR_\theta u > 0$ since $R_\theta $ is PD. This means that $X^T R_\theta X$ is also PD and therefore invertible. This means I can complete the square to get
$$
(y-X\beta)^TR_\theta (y-X\beta) =
(\beta - A_\theta y)^TX^TR_\theta X(\beta - A_\theta y) +
y^T(R_\theta - R_\theta XA_\theta )y.
$$
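As a sanity check on the completing-the-square step, here's a small numerical sketch (with arbitrary made-up inputs and a random PD matrix standing in for $R_\theta$) confirming both sides agree:

set.seed(2)
n <- 7; p <- 2
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
beta <- rnorm(p)
R <- solve(crossprod(matrix(rnorm(n^2), n)) + diag(n))  # a PD stand-in for I - S_theta
A <- solve(t(X) %*% R %*% X) %*% t(X) %*% R             # the weighted pseudoinverse A_theta

lhs <- t(y - X %*% beta) %*% R %*% (y - X %*% beta)
rhs <- t(beta - A %*% y) %*% t(X) %*% R %*% X %*% (beta - A %*% y) +
  t(y) %*% (R - R %*% X %*% A) %*% y
c(lhs, rhs)  # identical up to floating point error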
Plugging this in,
$$
\begin{aligned}p_r(y\mid \theta, \sigma^2) &= \frac{|R_\theta |^{1/2}}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac 1{2\sigma^2}y^T(R_\theta - R_\theta XA_\theta )y\right) \times \\&
\int_{\mathbb R^p} \exp\left(-\frac 1{2\sigma^2} (\beta - A_\theta y)^TX^TR_\theta X(\beta - A_\theta y)\right)\,\text d\beta.\end{aligned}
$$
The integral is now that of the kernel of a multivariate Gaussian with covariance $\sigma^2(X^TR_\theta X)^{-1}$, so it evaluates to $(2\pi\sigma^2)^{p/2}|X^TR_\theta X|^{-1/2}$ and altogether
$$
p_r(y\mid \theta,\sigma^2) = \frac{|R_\theta |^{1/2}}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac 1{2\sigma^2}y^T(R_\theta - R_\theta XA_\theta )y\right) \cdot \frac{(2\pi\sigma^2)^{p/2}}{|X^TR_\theta X|^{1/2}}.
$$
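As another quick check (my own sketch, using the mvtnorm package for the multivariate normal density): with $p=1$ the integral over $\beta$ is one-dimensional, so I can compare this closed form against numerical integration.

library(mvtnorm)  # for dmvnorm
set.seed(1)
n <- 6; sigma2 <- 1.3
X <- cbind(rnorm(n))                                     # p = 1 so the integral is one-dimensional
R <- solve(crossprod(matrix(rnorm(n^2), n)) + diag(n))   # PD stand-in for R_theta = I - S_theta
y <- rnorm(n)

# left-hand side: integrate N(y | X beta, sigma^2 R^{-1}) over beta
lhs <- integrate(function(b) sapply(b, function(bi)
  dmvnorm(y, mean = drop(X) * bi, sigma = sigma2 * solve(R))), -Inf, Inf)$value

# right-hand side: the closed form above
A <- solve(t(X) %*% R %*% X) %*% t(X) %*% R
rhs <- sqrt(det(R)) / (2 * pi * sigma2)^(n / 2) *
  exp(-t(y) %*% (R - R %*% X %*% A) %*% y / (2 * sigma2)) *
  (2 * pi * sigma2)^(1 / 2) / sqrt(det(t(X) %*% R %*% X))
c(lhs, as.numeric(rhs))  # these agree to numerical precision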
For interpreting this, I can note that
$$
H_R := X(X^TR_\theta X)^{-1}X^TR_\theta
$$
is the projection matrix for a GLS regression of $y$ on $X$ with working covariance $R_\theta ^{-1} = (I-S_\theta)^{-1}$, which (up to $\sigma^2$) is exactly the marginal variance of $y$ (after integrating out $\gamma$ but before integrating out $\beta$). Thus the quadratic form in the exponential is
$$
y^TR_\theta (I-H_R)y = \langle (I-S_\theta)y, (I - H_R)y\rangle
$$
so this gives the agreement between the residuals of a ridge regression of $y$ on $Z$ and a GLS regression of $y$ on $X$. The density will be small when this inner product is large, which happens when both residual vectors are large and the angle between them is small. If both residuals are small in norm then $y$ will be more likely, which corresponds to how a more likely $y$ here lies near both the column space of $X$ and the column space of $Z$ (with the appropriate shrinkage and weighting). Comparing this with the marginal likelihood after integrating out just $\gamma$, I can see that the column spaces of both $X$ and $Z$ now appear, reflecting that both $\beta$ and $\gamma$ have been integrated out.
Taking logs, I have
$$
\begin{aligned}\ell_r(\theta, \sigma^2\mid y) &= -\frac 1{2\sigma^2}y^T(R_\theta - R_\theta X(X^TR_\theta X)^{-1}X^TR_\theta )y - \frac{n-p}{2}\log\sigma^2 \\&
+ \frac 12 \log\det R_\theta - \frac 12 \log\det \left(X^TR_\theta X\right)\end{aligned}
$$
as my REML log likelihood. To get $\hat\sigma^2$ I have
$$
\begin{aligned}\frac{\partial \ell_r}{\partial \sigma^2} &= \frac 1{2\sigma^4}y^T(R_\theta - R_\theta X(X^TR_\theta X)^{-1}X^TR_\theta )y - \frac{n-p}{2\sigma^2} \stackrel{\text{set}}= 0\\&
\implies \frac 1{n-p}y^T(R_\theta - R_\theta X(X^TR_\theta X)^{-1}X^TR_\theta )y = \hat\sigma^2.\end{aligned}
$$
When I don’t use REML the coefficient here is $\frac 1n$, so this $\frac 1{n-p}$ is how REML avoids underestimating the variability.
I can now profile out $\sigma^2$ from $\ell_r$ to get $\ell_{rp}$ which is given by
$$
\ell_{rp}(\theta\mid y) = - \frac{n-p}{2}\log\hat\sigma^2_\theta + \frac 12 \log\det R_\theta - \frac 12 \log\det \left(X^TR_\theta X\right)
$$
(up to a constant). Optimizing this gives me $\hat\theta$ and then I can get $\hat\beta$ and $\hat\sigma^2$ and the rest proceeds as before.
Application: random slopes
I’ll now check these equations (from my previous post and this one) by fitting a random slopes model on the sleepstudy dataset and comparing the results with those from lme4. I won’t be worrying about efficiency or numerical stability, so I’ll be explicitly inverting matrices and all that.
For a random slopes model I have $m$ groups and I want to fit a model with a group-level intercept and slope term. I’ll have $n_i$ units in group $i$ for $n = \sum_{i=1}^m n_i$ observations in all. The slope data for group $i$ will be collected into $\x_i \in \mathbb R^{n_i}$. I’ll use $x_{ij} = (\x_i)_j$ to simplify my notation. In scalar form my model is
$$
y_{ij} = \beta_0 + \alpha_i + (\beta_1 + \eta_i) x_{ij} + \e_{ij}
$$
with $\alpha_i$ as the random intercept, $\eta_i$ as the random slope, and $j=1,\dots,n_i$ indexing units within a group. Written in matrix notation I have
$$
y = X\beta + Z\gamma + \e
$$
with
$$
X = \left(\begin{array}{cc}\one_{n_1} & \x_1 \\ \vdots & \vdots \\ \one_{n_m} & \x_m\end{array}\right) \in \mathbb R^{n\times 2}
$$
giving the features for the fixed-effects part, and
$$
Z = \left(\begin{array}{c|c}\bigoplus_{i=1}^m \one_{n_i} & \bigoplus_{i=1}^m \x_i\end{array}\right) \in \mathbb R^{n\times (2m)}.
$$
My random effects are $\gamma = {\alpha \choose \eta}$ so the multiplication $Z\gamma$ picks out the correct elements of $\alpha$ and $\eta$, and multiplies that element of $\eta$ by the right $x_{ij}$ value.
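To make the block structure concrete, here's a tiny sketch with a made-up grouping (the data frame and names are just illustrative); model.matrix builds exactly this kind of $Z$, with the first $m$ columns being group indicators and the last $m$ columns the group-specific $x$ values.

toy <- data.frame(g = factor(rep(c("a", "b"), times = c(2, 3))),
                  x = c(1, 2, 1, 2, 3))
Z_toy <- model.matrix(~ 0 + g + g:x, data = toy)
Z_toy  # columns: indicator for group a, indicator for group b, x within a, x within b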
For the distributions of my random effects and error I’ll have
$$
\gamma = {\alpha \choose \eta} \sim \mathcal N\left({\0 \choose \0}, \left(\begin{array}{c|c}\sigma^2_\alpha I & \rho\sigma_\alpha\sigma_\eta I \\ \hline \rho \sigma_\alpha\sigma_\eta I & \sigma^2_\eta I \end{array}\right)\right) \perp \e \sim \mathcal N(\0, \sigma^2 I)
$$
so $\sigma^2_\alpha$ and $\sigma^2_\eta$ are the marginal variances of $\alpha$ and $\eta$ respectively, and $\rho$ is the correlation between $\alpha_i$ and $\eta_i$. $\rho$ is useful because it allows the predictor $x$ to be translated without changing the model (Bates et al. mention this on page 7 of the lme4 manual; I’ll show later why this is the case).
In the previous post I wanted to have $\Var(\gamma) = \sigma^2 \Omega^{-1}$ which means in this case I have
$$
\Omega^{-1} = \left(\begin{array}{c|c}\sigma^2_\alpha/\sigma^2 I & \rho\sigma_\alpha\sigma_\eta/\sigma^2 I \\ \hline \rho \sigma_\alpha\sigma_\eta/\sigma^2 I & \sigma^2_\eta/\sigma^2 I \end{array}\right).
$$
This is a $2\times 2$ block matrix so I could work out the inverse to explicitly get $\Omega$ but I’ll just numerically invert it in my example at the end. I don’t want to have $\sigma^2$s in $\Omega$ so I’ll parameterize this by
$$
\theta = (\sigma_\alpha/\sigma, \sigma_\eta/\sigma, \rho) \in [0, \infty)^2\times[-1,1].
$$
The advantage to taking $\rho$ to be the correlation versus the covariance is that its bounds don’t depend on the variances in question.
The variance of $y$ is therefore
$$
\Var(y) = \sigma^2(Z\Omega^{-1}Z^T + I)
$$
which is exactly what I had in my previous post. I now can use my previous results to fit this model, but before I do I want to examine the actual structure of $\Var(y)$ to see what my random slopes assumption does to the covariance matrix.
$\Var(y)$ is block diagonal because I’m still assuming independence between groups, so I’ll look at just a single block, which I’ll denote by $\Psi$ (dropping the group subscript on $\one_{n_i}$ and $\x_i$ within the block). I can work out
$$
\Psi = \sigma^2_\alpha \one\one^T + \sigma^2_\eta \x\x^T + \rho\sigma_\alpha\sigma_\eta(\one\x^T + \x\one^T) + \sigma^2 I.
$$
The $\sigma^2_\alpha \one\one^T + \sigma^2 I$ part is easy to interpret, and is what comes from the random intercepts plus error: there is a baseline level of correlation between all observations within this group but the independent individual-level noise (i.e. $\e$) prevents this from being constrained to a subspace (which is what would happen if my covariance matrix was just the singular $\sigma^2_\alpha \one\one^T$).
The more complicated part is $\sigma^2_\eta \x\x^T + \rho\sigma_\alpha\sigma_\eta(\one\x^T + \x\one^T)$. I’ll interpret it by looking at a particular element of $\Psi$.
$$
\begin{aligned}\Psi_{ij}&= \sigma^2_\alpha + \sigma^2 \delta_{ij} + \sigma^2_\eta x_ix_j + \rho\sigma_\alpha\sigma_\eta (x_i + x_j) \\&
= \sigma^2_\alpha + \sigma^2 \delta_{ij} + (\sigma_\eta x_i)(\sigma_\eta x_j) + \rho\sigma_\alpha(\sigma_\eta x_i + \sigma_\eta x_j) \\&
= \sigma^2_\alpha(1 - \rho^2) + \sigma^2 \delta_{ij} + (\sigma_\eta x_i + \rho\sigma_\alpha)(\sigma_\eta x_j + \rho\sigma_\alpha)\end{aligned}
$$
where $\delta_{ij}$ is the Kronecker delta, and this form suggests an interpretation via kernel functions. This is also where $\rho\neq 0$ can be seen to give me translation invariance in the $x$s: adding a constant to each $x_i$ can be absorbed into the $\rho\sigma_\alpha$ term (by adjusting $\rho$ and $\sigma_\alpha$), so the same family of covariance matrices is still reachable. If $\rho$ were fixed at $0$ there’d be no way to absorb that shift.
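Here's a small numerical check (my own sketch with made-up parameter values) that the matrix form of $\Psi$ above and the elementwise kernel form agree:

s_a <- 1.5; s_e <- 0.7; rho <- 0.4; s2 <- 0.3   # made-up sigma_alpha, sigma_eta, rho, sigma^2
x <- c(0, 1, 2, 3)
ones <- rep(1, length(x))

# matrix form of one block of Var(y)
Psi_mat <- s_a^2 * tcrossprod(ones) + s_e^2 * tcrossprod(x) +
  rho * s_a * s_e * (tcrossprod(ones, x) + tcrossprod(x, ones)) + s2 * diag(length(x))

# elementwise kernel form from the last line above
Psi_ker <- outer(x, x, function(xi, xj)
  s_a^2 * (1 - rho^2) + (s_e * xi + rho * s_a) * (s_e * xj + rho * s_a)) + s2 * diag(length(x))

max(abs(Psi_mat - Psi_ker))  # 0 up to floating point error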
Define a function $k : \mathbb R\times\mathbb R \to \mathbb R$ by
$$
k(x_i, x_j) = (\sigma_\eta x_i + \rho\sigma_\alpha)(\sigma_\eta x_j + \rho\sigma_\alpha).
$$
Result 1: $k$ is positive semidefinite, which makes it a valid kernel function.
Pf: Let $\x \in \mathbb R^M$ and consider the matrix $K$ with $K_{ij} = k(x_i, x_j)$, where $x_i$ denotes the $i$th element of $\x$. For simplicity I’ll write $k(x,x') = (ax+b)(ax' + b)$. I know
$$
K = a^2 \x\x^T + ab(\x\one^T + \one\x^T) + b^2\one\one^T
$$
(I basically had this before I even defined $k$). Letting $v \in \mathbb R^M$ I have
$$
\begin{aligned}v^TKv &= a^2(v^T\x)^2 + 2ab (v^T\x)(v^T\one) + b^2 (v^T\one)^2 \\&
= (a v^T\x + b v^T\one)^2 \geq 0\end{aligned}
$$
for any $v \in \mathbb R^M$, which makes $K$ PSD. $K$ is not strictly PD in general though, since if $v \perp \text{span}\{\x, \one\}$, which is at most a 2-dimensional space, then I’ll have $v^T K v = 0$ even if $v \neq \0$. This means $\text{rank}(K) \leq 2$, so unless $M \leq 2$ I can’t have $K$ be PD.
$\square$
I could also have established that $k$ is PSD by noting that $k(x,x') = \langle \varphi(x), \varphi(x')\rangle$ with the feature map $\varphi(x) = \sigma_\eta x + \rho\sigma_\alpha$, so $K$ is a Gram matrix.
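A quick numerical confirmation of these claims (again just a sketch with arbitrary values of $a$ and $b$): the eigenvalues of $K$ are nonnegative and only one is nonzero here, consistent with the bound $\text{rank}(K) \leq 2$.

set.seed(3)
a <- 0.8; b <- -1.2                     # arbitrary stand-ins for sigma_eta and rho * sigma_alpha
x <- rnorm(10)
K <- outer(x, x, function(xi, xj) (a * xi + b) * (a * xj + b))
ev <- eigen(K, symmetric = TRUE)$values
round(ev, 10)   # all >= 0; only the leading eigenvalue is nonzero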
Looking at it this way shows that the covariance is large and positive when $x_i$ and $x_j$ are large with the same signs, and is large and negative when $x_i$ and $x_j$ are large with different signs. This reflects how, when I choose to use random slopes, I’m expecting some groups to grow faster than the usual rate and others slower, but either way the covariance between observations increases in magnitude as their $x$ values get larger.
It’s worth commenting on how this is not an isotropic kernel. An isotropic kernel like the squared exponential kernel $(\x, \x') \mapsto \exp(-c \|\x-\x'\|^2)$ depends only on the distance between its inputs, so for sorted inputs its largest values sit on the diagonal and decay in a “banded” fashion away from it. With my kernel here I will still get a fairly prominent diagonal, but not always to the same extent since it’s only PSD rather than PD, and the actual magnitudes of the input values still matter, so I don’t get a banded decay away from the diagonal.
Here are 4 example covariance matrices with this kernel, just to get a sense of what they tend to look like. Each is made with a linear sequence of length $50$ for $\x$. Red values are lower and white values are higher.
This shows how when all of $\x$ is positive then the covariance is at its largest for the final values, while when $\x$ has both positive and negative values then the covariance is at its largest at both extremes.
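Plots like these can be reproduced with something like the following sketch (the particular $\x$ ranges and parameter values here are my own illustrative choices, not necessarily the exact ones used for the figure):

plot_psi <- function(x, s_a = 1, s_e = 1, rho = 0.5, s2 = 0.1) {
  # one block of Var(y), using the kernel form worked out above
  Psi <- outer(x, x, function(xi, xj)
    s_a^2 * (1 - rho^2) + (s_e * xi + rho * s_a) * (s_e * xj + rho * s_a)) +
    s2 * diag(length(x))
  image(Psi, col = colorRampPalette(c("red", "white"))(50), axes = FALSE)
}

op <- par(mfrow = c(2, 2), mar = c(1, 1, 1, 1))
plot_psi(seq(0, 1, length.out = 50))            # all of x positive
plot_psi(seq(0, 10, length.out = 50))           # positive with a wider range
plot_psi(seq(-5, 5, length.out = 50))           # x spans negative and positive values
plot_psi(seq(-1, 1, length.out = 50), rho = 0)  # no intercept-slope correlation
par(op)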
I’ll now actually fit this model. I’ll use both ML and REML to compare the two. I won’t worry about the BLUPs because the parameter estimates are the real sticking point.
library(lme4)

mod_ml   <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy, REML = FALSE)
mod_reml <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy, REML = TRUE)

## setting up my data ~~~~~~~~~~~~~~~~~~~~
X <- cbind(1, sleepstudy$Days)
Z <- model.matrix(~ 0 + Subject + Subject:Days, data = sleepstudy)
y <- sleepstudy$Reaction

logdet <- function(A) as.numeric(determinant(A)$mod)

## defining functions for R, beta, sigma2, and theta ~~~~~~~~~~
Rfunc <- function(theta, Z) {
  # reminder: theta = (s_alpha/sigma, s_eta/sigma, rho)
  m <- ncol(Z) / 2
  Omega_inv <- rbind(
    cbind(theta[1]^2 * diag(m), prod(theta) * diag(m)),
    cbind(prod(theta) * diag(m), theta[2]^2 * diag(m))
  )
  diag(nrow(Z)) - Z %*% solve(t(Z) %*% Z + solve(Omega_inv)) %*% t(Z)
}

bfunc <- function(R, X, y) {  # beta hat
  solve(t(X) %*% R %*% X) %*% t(X) %*% R %*% y
}

s2func <- function(R, X, y, REML) {  # sigma^2 hat
  bhat <- bfunc(R, X, y)
  if(!REML) {
    t(y - X %*% bhat) %*% R %*% (y - X %*% bhat) / length(y)
  } else {
    t(y) %*% (R - R %*% X %*% solve(t(X) %*% R %*% X) %*% t(X) %*% R) %*% y / (length(y) - ncol(X))
  }
}

# the actual objective function for theta, \ell_p or \ell_{rp}
negloglik <- function(theta, X, y, Z, REML) {
  R <- Rfunc(theta, Z)
  if(!REML) {
    out <- .5 * logdet(R) - (length(y) / 2) * log(s2func(R, X, y, REML))
  } else {
    out <- .5 * logdet(R) - .5 * logdet(t(X) %*% R %*% X) -
      (length(y) - ncol(X)) / 2 * log(s2func(R, X, y, REML))
  }
  -as.numeric(out)
}

## optimizing the log likelihoods ~~~~~~~~~~~~~~~~~~~
opt_ml   <- optim(c(.1, .1, .1), negloglik, X = X, y = y, Z = Z, REML = FALSE,
                  lower = c(0.001, 0.001, -1), upper = c(Inf, Inf, 1), method = "L-BFGS-B")
opt_reml <- optim(c(.1, .1, .1), negloglik, X = X, y = y, Z = Z, REML = TRUE,
                  lower = c(0.001, 0.001, -1), upper = c(Inf, Inf, 1), method = "L-BFGS-B")

## collecting the results and comparing to lme4 ~~~~~~~~~~~~~~~~
results <- function(theta_hat, REML, X, y, Z) {
  Rhat <- Rfunc(theta_hat, Z)
  s2 <- s2func(Rhat, X, y, REML)
  out <- list(
    beta_hat = bfunc(Rhat, X, y),
    s2 = s2,
    s2_alpha = theta_hat[1]^2 * s2,
    s2_eta = theta_hat[2]^2 * s2,
    rho = theta_hat[3]
  )
  lapply(out, as.numeric)  # the helper funcs return matrices
}

results(opt_ml$par, FALSE, X, y, Z)
summary(mod_ml)
results(opt_reml$par, TRUE, X, y, Z)
summary(mod_reml)
Everything agrees so I’m good to go.