Mixed models from scratch: REML and random slopes

In my previous post I stopped right before working out the REML objective function so I’ll start with that, and then I’ll check my work by fitting a random slopes model using my equations and comparing with the output of lme4. All the notation in this post is defined in part 1, but as a quick reminder my model is
$$y = X\beta + Z\gamma + \varepsilon$$

with

$$\gamma \sim \mathcal N(0, \sigma^2 \Omega^{-1}), \qquad \varepsilon \sim \mathcal N(0, \sigma^2 I),$$

and I'm parameterizing $\Omega$ with a vector $\theta$.

When I use maximum likelihood for $(\hat\sigma^2, \hat\theta)$, like I did in the previous post, it can make some things easier, like likelihood ratio tests, but $\hat\sigma^2$ is known to be biased: it underestimates the true variability because its denominator is too large. Intuitively, the fitted mean is estimated from the same data, so the data tend to cluster more tightly around it than a denominator of $n$ accounts for.
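
To make this concrete, here's a quick simulation I'm adding (not part of the original post) for the plain fixed-effects case: dividing the residual sum of squares by $n$ systematically underestimates the true $\sigma^2$, while dividing by $n - p$ does not.

## bias illustration (my addition): plain linear model, true sigma^2 = 1, p = 3 coefficients
set.seed(1)
n <- 20; p <- 3
sims <- replicate(5000, {
   X <- cbind(1, matrix(rnorm(n * (p - 1)), n))
   y <- X %*% rnorm(p) + rnorm(n)
   rss <- sum(lm.fit(X, y)$residuals^2)
   c(ml = rss / n, reml_style = rss / (n - p))
})
rowMeans(sims)  # the ML-style estimate averages near (n - p) / n = 0.85, the other near 1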

Restricted maximum likelihood (REML) can be used to help with this. An improper uniform prior is put on β, and β is integrated out, so the likelihood that is optimized is only a function of $\sigma^2$ and $\theta$. The marginal distribution of y after integrating out γ is

$$y \mid \beta \sim \mathcal N\!\left(X\beta,\ \sigma^2 (I - S_\theta)^{-1}\right)$$

where I'm now including the dependence on β. This means that the density for REML, which I'll denote by $p_r$, is

$$p_r(y \mid \theta, \sigma^2) = \int_{\mathbb R^p} p_{\mathcal N}\!\left(y \mid X\beta,\ \sigma^2 (I - S_\theta)^{-1}\right) d\beta = \lvert I - S_\theta \rvert^{1/2} (2\pi\sigma^2)^{-n/2} \int_{\mathbb R^p} \exp\!\left(-\frac{1}{2\sigma^2}(y - X\beta)^T (I - S_\theta)(y - X\beta)\right) d\beta.$$
I want to turn this into a quadratic form in β so I'll expand inside the exponential and complete the square. I'll let $R_\theta = I - S_\theta$ and $A_\theta = (X^T R_\theta X)^{-1} X^T R_\theta$ to make the notation easier ($A_\theta$ gives the weighted pseudoinverse of X, and I'm using “R” for “residual” since this matrix gives the residuals of a generalized ridge regression on Z). This gives me

$$(y - X\beta)^T R_\theta (y - X\beta) = y^T R_\theta y - 2\beta^T X^T R_\theta y + \beta^T X^T R_\theta X \beta.$$

For $v \neq 0$ consider $v^T X^T R_\theta X v = u^T R_\theta u$ with $u = Xv$. Since X is full column rank, $v \neq 0 \implies u \neq 0$, and then $u^T R_\theta u > 0$ since $R_\theta$ is PD. This means that $X^T R_\theta X$ is also PD and therefore invertible, so I can complete the square to get

$$(y - X\beta)^T R_\theta (y - X\beta) = (\beta - A_\theta y)^T X^T R_\theta X (\beta - A_\theta y) + y^T\left(R_\theta - R_\theta X A_\theta\right) y.$$
Plugging this in,

$$p_r(y \mid \theta, \sigma^2) = \lvert R_\theta \rvert^{1/2} (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2} y^T\left(R_\theta - R_\theta X A_\theta\right) y\right) \times \int_{\mathbb R^p} \exp\!\left(-\frac{1}{2\sigma^2}(\beta - A_\theta y)^T X^T R_\theta X (\beta - A_\theta y)\right) d\beta.$$

The integral is now that of the kernel of a multivariate Gaussian so all together

$$p_r(y \mid \theta, \sigma^2) = \lvert R_\theta \rvert^{1/2} (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2} y^T\left(R_\theta - R_\theta X A_\theta\right) y\right) (2\pi\sigma^2)^{p/2} \lvert X^T R_\theta X \rvert^{-1/2}.$$
For interpreting this, I can note that

$$H_R := X (X^T R_\theta X)^{-1} X^T R_\theta$$

is the projection matrix for a GLS regression of y on X with weights given by $R_\theta^{-1} = (I - S_\theta)^{-1}$, which (up to the factor $\sigma^2$) is exactly the marginal variance of y (after integrating out γ but before integrating out β). Thus the quadratic form in the exponential is

$$y^T R_\theta (I - H_R) y = \left\langle (I - S_\theta) y,\ (I - H_R) y \right\rangle$$

so this measures the agreement between the residuals of a ridge regression of y on Z and the residuals of a GLS regression of y on X. The density will be small when this is large, which happens when both residuals are large and the angle between them is small. If both residuals are small in norm then y will be more likely, which corresponds to how a more likely y here lies close to both the column space of X and the column space of Z (with the appropriate shrinkage and weighting). Comparing this with the marginal likelihood after just integrating out γ, I can see that now both X and Z appear as column spaces, which reflects that both β and γ have been integrated out.
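
As a quick sanity check of that identity (my own addition, using the $S_\theta = Z(Z^T Z + \Omega)^{-1} Z^T$ form that the Rfunc helper below also uses), the two expressions agree numerically on random inputs:

## sanity check (my addition): the REML quadratic form equals the inner product of
## the ridge residual (I - S) y and the GLS residual (I - H_R) y
set.seed(2)
n <- 8; q <- 3
X <- cbind(1, rnorm(n)); Z <- matrix(rnorm(n * q), n); y <- rnorm(n)
S <- Z %*% solve(t(Z) %*% Z + diag(q)) %*% t(Z)  # ridge hat matrix, taking Omega = I for simplicity
R <- diag(n) - S
A <- solve(t(X) %*% R %*% X) %*% t(X) %*% R      # A_theta, the weighted pseudoinverse
H <- X %*% A                                     # H_R
c(t(y) %*% (R - R %*% X %*% A) %*% y,            # quadratic form from the REML density
  sum((R %*% y) * ((diag(n) - H) %*% y)))        # <(I - S_theta) y, (I - H_R) y>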

Taking logs, I have

$$\ell_r(\theta, \sigma^2 \mid y) = -\frac{1}{2\sigma^2} y^T\left(R_\theta - R_\theta X (X^T R_\theta X)^{-1} X^T R_\theta\right) y - \frac{n - p}{2}\log\sigma^2 + \frac 12 \log\det R_\theta - \frac 12 \log\det\!\left(X^T R_\theta X\right)$$

as my REML log likelihood. To get $\hat\sigma^2$ I have

$$\frac{\partial \ell_r}{\partial \sigma^2} = \frac{1}{2\sigma^4} y^T\left(R_\theta - R_\theta X (X^T R_\theta X)^{-1} X^T R_\theta\right) y - \frac{n - p}{2\sigma^2} \overset{\text{set}}{=} 0 \implies \hat\sigma^2 = \frac{1}{n - p}\, y^T\left(R_\theta - R_\theta X (X^T R_\theta X)^{-1} X^T R_\theta\right) y.$$

When I don't use REML I have a coefficient of $\frac 1n$ here, so this $\frac{1}{n - p}$ is how I'm avoiding underestimating the variability.

I can now profile out $\sigma^2$ from $\ell_r$ to get $\ell_{rp}$, which is given by

$$\ell_{rp}(\theta \mid y) = -\frac{n - p}{2}\log\hat\sigma^2_\theta + \frac 12 \log\det R_\theta - \frac 12 \log\det\!\left(X^T R_\theta X\right)$$

(up to a constant). Optimizing this gives me $\hat\theta$, and then I can get $\hat\beta$ and $\hat\sigma^2$ and the rest proceeds as before.

Application: random slopes

I’ll now check these equations (from my previous post and this one) by fitting a random slopes model on the sleepstudy dataset and comparing the results with those from lme4. I won’t be worrying about efficiency or numerical stability so I’ll be explicitly inverting matrices and all that.

For a random slopes model I have m groups and I want to fit a model with a group-level intercept and slope term. I'll have $n_i$ units in group i for $n = \sum_{i=1}^m n_i$ observations in all. The slope data for group i will be collected into $x_i \in \mathbb R^{n_i}$. I'll use $x_{ij} = (x_i)_j$ to simplify my notation. In scalar form my model is

$$y_{ij} = \beta_0 + \alpha_i + (\beta_1 + \eta_i) x_{ij} + \varepsilon_{ij}$$

with $\alpha_i$ as the random intercept, $\eta_i$ as the random slope, and $j = 1, \dots, n_i$ indexing units within a group. Written in matrix notation I have

$$y = X\beta + Z\gamma + \varepsilon$$

with

$$X = \begin{pmatrix} 1_{n_1} & x_1 \\ \vdots & \vdots \\ 1_{n_m} & x_m \end{pmatrix} \in \mathbb R^{n \times 2}$$

giving the features for the fixed-effects part, and

$$Z = \begin{pmatrix} \bigoplus_{i=1}^m 1_{n_i} & \bigoplus_{i=1}^m x_i \end{pmatrix} \in \mathbb R^{n \times (2m)}.$$

My random effects are $\gamma = \binom{\alpha}{\eta}$, so the multiplication Zγ picks out the correct elements of α and η, and multiplies that element of η by the right $x_{ij}$ value.
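
To see this block structure concretely, here's a tiny made-up example (my addition, not from the post) with m = 2 groups of 3 units each, built the same way the sleepstudy Z is built below:

## sketch (my addition): the columns of Z for a toy dataset with m = 2 groups
d <- data.frame(g = factor(rep(1:2, each = 3)), x = c(0, 1, 2, 0, 1, 2))
model.matrix(~ 0 + g + g:x, data = d)
## the first m = 2 columns are the indicators 1_{n_i}, the last m = 2 are the x_i,
## each zero outside its own group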

For the distributions of my random effects and error I'll have

$$\gamma = \binom{\alpha}{\eta} \sim \mathcal N\!\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}\sigma_\alpha^2 I & \rho\sigma_\alpha\sigma_\eta I \\ \rho\sigma_\alpha\sigma_\eta I & \sigma_\eta^2 I\end{pmatrix}\right), \qquad \varepsilon \sim \mathcal N(0, \sigma^2 I)$$

so $\sigma_\alpha^2$ and $\sigma_\eta^2$ are the marginal variances of α and η respectively, and ρ is the correlation between $\alpha_i$ and $\eta_i$. ρ is useful because it allows the predictor x to be translated without changing the model (Bates et al. mention this on page 7 of the lme4 manual; I'll show later why this is the case).

In the previous post I wanted to have $\text{Var}(\gamma) = \sigma^2 \Omega^{-1}$, which means in this case I have

$$\Omega^{-1} = \begin{pmatrix}(\sigma_\alpha^2/\sigma^2) I & (\rho\sigma_\alpha\sigma_\eta/\sigma^2) I \\ (\rho\sigma_\alpha\sigma_\eta/\sigma^2) I & (\sigma_\eta^2/\sigma^2) I\end{pmatrix}.$$

This is a 2×2 block matrix so I could work out the inverse to explicitly get Ω, but I'll just numerically invert it in my example at the end. I don't want to have $\sigma^2$s in Ω so I'll parameterize this by

$$\theta = (\sigma_\alpha/\sigma,\ \sigma_\eta/\sigma,\ \rho) \in [0, \infty)^2 \times [-1, 1].$$
The advantage to taking ρ to be the correlation versus the covariance is that its bounds don’t depend on the variances in question.

The variance of y is therefore

$$\text{Var}(y) = \sigma^2\left(Z\Omega^{-1}Z^T + I\right)$$
which is exactly what I had in my previous post. I now can use my previous results to fit this model, but before I do I want to examine the actual structure of Var(y) to see what my random slopes assumption does to the covariance matrix.

Var(y) is block diagonal because I'm still assuming independence between groups, so I'll look at just a single block which I'll denote by Ψ. I can work out

$$\Psi = \sigma_\alpha^2 1 1^T + \sigma_\eta^2 x x^T + \rho\sigma_\alpha\sigma_\eta\left(1 x^T + x 1^T\right) + \sigma^2 I.$$

The $\sigma_\alpha^2 1 1^T + \sigma^2 I$ part is easy to interpret, and is what comes from the random intercepts plus error: there is a baseline level of correlation between all observations within this group, but the independent individual-level noise (i.e. ε) prevents this from being constrained to a subspace (which is what would happen if my covariance matrix was just the singular $\sigma_\alpha^2 1 1^T$).

The more complicated part is $\sigma_\eta^2 x x^T + \rho\sigma_\alpha\sigma_\eta(1 x^T + x 1^T)$. I'll interpret it by looking at a particular element of Ψ:

$$\begin{aligned}\Psi_{ij} &= \sigma_\alpha^2 + \sigma^2\delta_{ij} + \sigma_\eta^2 x_i x_j + \rho\sigma_\alpha\sigma_\eta(x_i + x_j) \\ &= \sigma_\alpha^2 + \sigma^2\delta_{ij} + (\sigma_\eta x_i)(\sigma_\eta x_j) + \rho\sigma_\alpha(\sigma_\eta x_i + \sigma_\eta x_j) \\ &= \sigma_\alpha^2(1 - \rho^2) + \sigma^2\delta_{ij} + (\sigma_\eta x_i + \rho\sigma_\alpha)(\sigma_\eta x_j + \rho\sigma_\alpha)\end{aligned}$$

where $\delta_{ij}$ is the Kronecker delta, and this form suggests an interpretation via kernel functions. This is also where $\rho \neq 0$ can be seen to give me translation invariance in the xs, since adding a constant to each $x_i$ can be canceled out via the $\rho\sigma_\alpha$ term. If ρ = 0 there'd be no way to cancel that change out.
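
Here's a quick numeric check of that claim (a sketch I'm adding, with arbitrary parameter values): shifting x by a constant c and replacing the random intercept α with α + cη (which changes $\sigma_\alpha$ and ρ but keeps them valid, so long as ρ is free to be nonzero) reproduces exactly the same Ψ.

## check (my addition): translating x and absorbing the shift into alpha + c0 * eta
## leaves Psi unchanged
psi <- function(x, s_a, s_e, rho, s2) {
   n <- length(x)
   s_a^2 * matrix(1, n, n) + s_e^2 * outer(x, x) +
      rho * s_a * s_e * (outer(x, rep(1, n)) + outer(rep(1, n), x)) + s2 * diag(n)
}
x <- 0:9; s_a <- 2; s_e <- 1.5; rho <- 0.3; s2 <- 1; c0 <- 4
s_a_new <- sqrt(s_a^2 + 2 * c0 * rho * s_a * s_e + c0^2 * s_e^2)  # sd of alpha + c0 * eta
rho_new <- (rho * s_a * s_e + c0 * s_e^2) / (s_a_new * s_e)       # corr(alpha + c0 * eta, eta)
max(abs(psi(x - c0, s_a_new, s_e, rho_new, s2) - psi(x, s_a, s_e, rho, s2)))  # essentially 0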

Define a function $k : \mathbb R \times \mathbb R \to \mathbb R$ by

$$k(x_i, x_j) = (\sigma_\eta x_i + \rho\sigma_\alpha)(\sigma_\eta x_j + \rho\sigma_\alpha).$$

Result 1: k is positive semidefinite, which makes it a valid kernel function.

Pf: Let $x \in \mathbb R^M$ and consider the matrix K with $K_{ij} = k(x_i, x_j)$. For simplicity I'll say $k(x, x') = (ax + b)(ax' + b)$. I know

$$K = a^2 x x^T + ab\left(x 1^T + 1 x^T\right) + b^2 1 1^T$$

(I basically had this before I even defined k). Letting $v \in \mathbb R^M$ I have

$$v^T K v = a^2 (v^T x)^2 + 2ab (v^T x)(v^T 1) + b^2 (v^T 1)^2 = \left(a v^T x + b v^T 1\right)^2 \geq 0$$

for any $v \in \mathbb R^M$, which makes K PSD. K is not strictly PD in general though: if $v \perp \text{span}\{x, 1\}$, which is at most a 2-dimensional space, then $v^T K v = 0$ even if $v \neq 0$. This means $\text{rank}(K) \leq 2$, so unless $M \leq 2$ I can't have K be PD.

◻

I could also have established that k is PSD by noting that $k(x, x') = \langle \varphi(x), \varphi(x')\rangle$ with the feature map $\varphi(x) = \sigma_\eta x + \rho\sigma_\alpha$, so K is a Gram matrix.
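
Here's a quick numeric check of Result 1 and the Gram-matrix view, with arbitrary values of a and b (my own addition):

## check (my addition): K from the proof matches the Gram matrix of phi, and its
## eigenvalues are nonnegative up to floating-point error
set.seed(3)
a <- 1.5; b <- -0.7
x <- rnorm(10)
K <- a^2 * outer(x, x) + a * b * (outer(x, rep(1, 10)) + outer(rep(1, 10), x)) +
   b^2 * matrix(1, 10, 10)
all.equal(K, outer(a * x + b, a * x + b))  # TRUE: K is the Gram matrix of phi(x) = a * x + b
min(eigen(K, symmetric = TRUE)$values)     # nonnegative up to numerical noise, so K is PSD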

Looking at it this way shows that the covariance is large and positive when $x_i$ and $x_j$ are large with the same sign, and is large and negative when $x_i$ and $x_j$ are large with opposite signs. This reflects how when I choose to use random slopes I'm expecting that some groups would grow faster than the usual rate, or perhaps slower than the usual rate, but that either way there'd be an increasing covariance between xs as they get larger.

It’s worth commenting on how this is not an isotropic kernel. An isotropic kernel like the squared exponential kernel $(x, x') \mapsto \exp(-\gamma \lVert x - x'\rVert^2)$ depends only on the distance between its inputs, so the covariance matrix is diagonally dominant and decays in a “banded” fashion away from the diagonal. With my kernel here I will still get a decently prominent diagonal, but not always to the same extent since the kernel is only PSD rather than PD, and the actual magnitudes of the input values still matter, so I don't get a banded decay away from the diagonal.

Here are 4 example covariance matrices with this kernel, just to get a sense of what they tend to look like. Each is made with a linear sequence of length 50 for x. Red values are lower and white values are higher.

[Figure: heatmaps of the four example covariance matrices]

This shows how when all of x is positive then the covariance is at its largest for the final values, while when x has both positive and negative values then the covariance is at its largest at both extremes.
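
Since the original images don't reproduce here, this is roughly how such heatmaps could be generated (my own sketch with made-up parameter values, using base R's image()):

## sketch (my addition): heatmaps of Psi for a length-50 linear sequence x,
## with red = low and white = high
plot_psi <- function(x, s_a = 1, s_e = 1, rho = 0.5, s2 = 1) {
   n <- length(x)
   Psi <- s_a^2 * matrix(1, n, n) + s_e^2 * outer(x, x) +
      rho * s_a * s_e * (outer(x, rep(1, n)) + outer(rep(1, n), x)) + s2 * diag(n)
   image(Psi, col = colorRampPalette(c("red", "white"))(100), axes = FALSE)
}
par(mfrow = c(1, 2))
plot_psi(seq(0, 5, length.out = 50))   # all of x positive: largest covariance at the far corner
plot_psi(seq(-3, 3, length.out = 50))  # x crossing zero: largest covariance at both extremes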

I’ll now actually fit this model. I’ll use both ML and REML to compare the two. I won’t worry about the BLUPs because the parameter estimates are the real sticking point.

library(lme4)
mod_ml <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy, REML=FALSE)
mod_reml <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy, REML=TRUE)

## setting up my data ~~~~~~~~~~~~~~~~~~~~
X <- cbind(1, sleepstudy$Days)  # fixed effects: intercept and Days
Z <- model.matrix(~ 0 + Subject + Subject:Days, data = sleepstudy)  # first m cols: subject indicators; last m: subject-specific Days
y <- sleepstudy$Reaction
logdet <- function(A) as.numeric(determinant(A)$mod)  # log|A|, computed on the log scale to avoid overflow

## defining functions for R, beta, sigma2, and theta ~~~~~~~~~~
Rfunc <- function(theta, Z) {
   # reminder: theta = (s_alpha/sigma, s_eta/sigma, rho)
   m <- ncol(Z)/2
   Omega_inv <- rbind(
      cbind(theta[1]^2 * diag(m), prod(theta) * diag(m)),
      cbind(prod(theta) * diag(m), theta[2]^2 * diag(m))
   )
   # R_theta = I - S_theta = I - Z (Z'Z + Omega)^{-1} Z'
   diag(nrow(Z)) - Z %*% solve(t(Z) %*% Z + solve(Omega_inv)) %*% t(Z)
}

bfunc <- function(R, X, y) { # beta hat
   solve(t(X) %*% R %*% X) %*% t(X) %*% R %*% y
}

s2func <- function(R, X, y, REML) { # sigma^2 hat
   bhat <- bfunc(R, X, y)
   if(!REML) {
      t(y - X %*% bhat) %*% R %*% (y - X %*% bhat) / length(y)
   } else {
      t(y) %*% (R - R %*% X %*% solve(t(X) %*% R %*% X) %*% t(X) %*% R) %*% y /
         (length(y) - ncol(X))
   }
}

# the actual objective function for theta, \ell_p or \ell_{rp}
negloglik <- function(theta, X, y, Z, REML) {
   R <- Rfunc(theta, Z)
   if(!REML) {
      out <- .5 * logdet(R) - (length(y)/2) * log(s2func(R, X, y, REML))
   } else {
      out <- .5 * logdet(R) -.5 * logdet(t(X) %*% R %*% X) -
         (length(y)-ncol(X))/2 * log(s2func(R, X, y, REML))
   }
   -as.numeric(out)
}

## optimizing the log likelihoods ~~~~~~~~~~~~~~~~~~~
opt_ml <- optim(c(.1, .1, .1), negloglik, X=X, y=y, Z=Z, REML=FALSE,
      lower = c(0.001,0.001,-1), upper=c(Inf,Inf,1), method="L-BFGS-B")
opt_reml <- optim(c(.1, .1, .1), negloglik, X=X, y=y, Z=Z, REML=TRUE,
      lower = c(0.001,0.001,-1), upper=c(Inf,Inf,1), method="L-BFGS-B")

## collecting the results and comparing to lme4 ~~~~~~~~~~~~~~~~
results <- function(theta_hat, REML, X, y, Z) {
   Rhat <- Rfunc(theta_hat, Z)
   s2 <- s2func(Rhat, X, y, REML)
   out <- list(
      beta_hat = bfunc(Rhat, X, y), s2 = s2, s2_alpha = theta_hat[1]^2 * s2,
      s2_eta = theta_hat[2]^2 * s2, rho = theta_hat[3]
   )
   lapply(out, as.numeric) # the helper funcs return matrices
}

results(opt_ml$par, FALSE, X, y, Z)
summary(mod_ml)

results(opt_reml$par, TRUE, X, y, Z)
summary(mod_reml)
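
As an aside (my addition), a more compact way to pull the relevant numbers out of lme4 than the full summary() output is to use its accessors:

## compact comparison (my addition) using lme4's accessors
fixef(mod_reml)    # fixed-effect estimates, to compare with beta_hat
sigma(mod_reml)^2  # residual variance, to compare with s2
VarCorr(mod_reml)  # random-effect standard deviations and correlation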

Everything agrees so I’m good to go.
