$\newcommand{\one}{\mathbf 1}$In this post I’m going to consider the following situation: there is a latent, unobserved continuous-state Markov chain $Z := (Z_1,\dots,Z_n)$ and I observe a binary sequence $X := (X_1,X_2,\dots,X_n)$ where the $X_i$ are conditionally independent given $Z$, with each $X_i$ depending only on $Z_i$. In particular, I’ll have the chain begin with $Z_1 \mid \sigma^2 \sim\mathcal N(0, \sigma^2)$ and then for $i \geq 2$ take $Z_i \mid Z_{i-1}, \sigma^2 \sim \mathcal N(Z_{i-1}, \sigma^2)$. For the observed values I’ll have
$$
X_i | Z_i \sim \text{Bern}(s(Z_i)) \hspace{5mm} \text{for} \hspace{5mm} s(z) = \frac{1}{1 + e^{-z}}.
$$
My goal is to experiment with inferring $Z$ via MCMC. $s$ is monotonic, so when $Z$ wanders towards larger values I’ll see lots of $1$s in $X$, and vice versa when $Z$ is negative. $\sigma^2$ governs how rapidly the chain can change and how far from $0$ the start is likely to be: if $\sigma^2$ is large I’ll expect to see lots of success probabilities $s(Z_i)$ near $0$ and $1$, since the chain will be more likely to take big values. If $\sigma^2$ is small then many of the probabilities will be near $s(0) = \frac 12$. Note
$$
E(Z_i) = E\left(E(Z_i \mid Z_{i-1})\right) = E(Z_{i-1}) = \dots = E(Z_1) = 0
$$
so marginally every point in the Markov chain has the same mean of $0$. But because of the correlation and $\sigma^2$ the chain doesn’t actually have to spend much time near $0$.
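A similar calculation for the variance makes that second point concrete: since each step adds independent $\mathcal N(0, \sigma^2)$ noise,
$$
\operatorname{Var}(Z_i) = \operatorname{Var}(Z_{i-1}) + \sigma^2 = \dots = \operatorname{Var}(Z_1) + (i-1)\sigma^2 = i\sigma^2,
$$
so even though every $Z_i$ has mean $0$, its marginal standard deviation grows like $\sqrt{i}\,\sigma$ and the chain can wander far from the origin.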
The following graphical model explicitly gives the conditional relationships in $(X,Z)$ that I’ve been describing:
Below is a simulation of this process and the resulting data set that I’ll be working with.
set.seed(13)

n <- 100
sigma <- 1.2

z <- numeric(n)
z[1] <- rnorm(1, 0, sigma)
for(i in 2:n) {
  z[i] <- rnorm(1, z[i-1], sigma)
}
x <- rbinom(n, 1, plogis(z))

plot(z, type="l", lwd=2, xlab="Time",
     main = bquote("Example data with" ~ sigma^2 ~ "=" ~ .(sigma^2)))
points(z, col=ifelse(x==1, "red", "blue"), pch=19, cex=.75)
legend("bottomleft", c("x = 1", "x = 0", "z"), col=c(2, 4, 1),
       pch=c(19,19,NA), lwd=c(NA,NA,2), bty="n")
I can specify the joint distribution of $Z$ in a different way. Let $u \mid \sigma^2 \sim \mathcal N(\mathbf 0, \sigma^2 I_n)$ be a vector of independent increments and set $Z_1 = u_1$. Then I can view $Z_i \mid Z_{i-1},\sigma^2 \sim \mathcal N(Z_{i-1}, \sigma^2)$ as $Z_i := Z_{i-1} + u_{i}$, i.e. $Z_{i-1}$ plus an independent $\mathcal N(0, \sigma^2)$ perturbation. Expanding $Z_{i-1}$ recursively I arrive at
$$
Z_i = \sum_{1 \leq j \leq i} u_j.
$$
Letting
$$
L = \left( \begin{array}{ccccc} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & 1 & 1 & \cdots & 1\end{array}\right)
$$
so that $L$ is the $n \times n$ lower triangular matrix of all $1$s, I can now write
$$
Z = Lu
$$
so
$$
Z \mid \sigma^2 \sim \mathcal N(\mathbf 0, \sigma^2 LL^T).
$$
$L$ is triangular so its eigenvalues are its diagonal elements, which are all $1$, so $L$ (and hence $L^T$) is full rank. For any $v \neq \mathbf 0$ I then have $v^TLL^Tv = \|L^Tv\|^2 > 0$, so $LL^T$ is positive definite (PD).
I can work out what $LL^T$ is too. $(LL^T)_{ij} = \langle L_i, L_j\rangle$ where $L_i$ and $L_j$ are the $i$th and $j$th rows of $L$. Both rows are identically $1$ up through the $\min\{i,j\}$th element, after which their elementwise product is zero. This means $\langle L_i, L_j\rangle = \min\{i,j\}$ so $(LL^T)_{ij} = \min\{i,j\}$.
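As a quick numerical sanity check (a rough aside in base R, not part of the derivation), both the cumulative-sum representation $Z = Lu$ and the $\min\{i,j\}$ form of $LL^T$ are easy to verify:

# aside: check that L applied to u is a running sum and (L L^T)_{ij} = min(i, j)
n_chk <- 6
L_mat <- matrix(0, n_chk, n_chk)
L_mat[lower.tri(L_mat, diag = TRUE)] <- 1        # lower-triangular matrix of 1s

u <- rnorm(n_chk)
all.equal(drop(L_mat %*% u), cumsum(u))          # TRUE: L u is cumsum(u)

K_chk <- L_mat %*% t(L_mat)
all(K_chk == outer(1:n_chk, 1:n_chk, pmin))      # TRUE: entries are min(i, j)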
Now define a function $k : \mathbb N \times \mathbb N \to \mathbb R$ by $k(i,j) = \min\{i,j\}$. I just showed that the Gram matrix formed by $(k(i,j))_{i,j\in\{1,\dots,n\}}$ is PD but I can show something stronger:
Claim: $k$ is a PD kernel on $\mathbb N$.
Pf: $k$ is clearly symmetric so positive definiteness is the only potential concern. Let $S \subset \mathbb N$ be finite with $m$ elements, written in increasing order as $s_1 < s_2 < \dots < s_m$, and consider the matrix $K_S := (k(s_i,s_j))_{i,j\in\{1,\dots,m\}}$. I need to show that this matrix is PD. Taking the elements of $S$ in sorted order is WLOG, since reordering them amounts to conjugating $K_S$ by a permutation matrix and that doesn’t change the eigenvalues.
Define a matrix
$$
T_S = \left(
\begin{array}{ccccc}
\sqrt{s_1} & 0 & 0 & \dots & 0 \\
\sqrt{s_1} & \sqrt{s_2 - s_1} & 0 & \dots & 0 \\
\sqrt{s_1} & \sqrt{s_2 - s_1} & \sqrt{s_3 - s_2} & \dots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
\sqrt{s_1} & \sqrt{s_2 - s_1} & \sqrt{s_3 - s_2} & \dots &\sqrt{s_m - s_{m-1}}
\end{array}
\right)
$$
and since the $s_l$ are strictly increasing, none of these square roots is applied to a negative number. Let $T_i$ and $T_j$ be rows of $T_S$ and WLOG take $i \leq j$. Then
$$
\begin{aligned}
(T_ST_S^T)_{ij} &= \langle T_i, T_j\rangle = \sum_{l=1}^m T_i(l)\,T_j(l) = \sum_{l=1}^i T_i(l)^2 \\
&= s_1 + (s_2 - s_1) + (s_3 - s_2) + \dots + (s_i - s_{i-1}) \\
&= s_i = \min\{s_i, s_j\}.
\end{aligned}
$$
The middle step uses that $T_i(l) = 0$ for $l > i$ and $T_j(l) = T_i(l)$ for $l \leq i$. Thus $(K_S)_{ij} = (T_ST_S^T)_{ij}$ for all $i,j$, so $K_S = T_ST_S^T$. $T_S$ is triangular with strictly positive diagonal elements so it’s full rank, and by the same argument as with $L$ this means $K_S$ is PD.
$\square$
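Here’s a small empirical check of the claim (again just an aside): build the Gram matrix on an arbitrary increasing subset of $\mathbb N$, confirm its eigenvalues are positive, and confirm the $T_S$ factor from the proof reproduces it.

# aside: the min kernel is PD on an arbitrary finite subset of N
s <- c(2, 5, 6, 11, 20)                      # an arbitrary increasing subset S
K_S <- outer(s, s, pmin)
min(eigen(K_S, symmetric = TRUE)$values)     # strictly positive

m <- length(s)
T_S <- matrix(0, m, m)
for (i in 1:m) T_S[i, 1:i] <- sqrt(diff(c(0, s))[1:i])  # rows as in the proof
all.equal(T_S %*% t(T_S), K_S)               # TRUE: K_S = T_S T_S^T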
The point of this is that $k$ is a valid kernel, so I can view $Z$ as being sampled from a Gaussian process $\mathcal{GP}(0, \sigma^2 \cdot k)$ with index set $\mathbb N$, where the mean function is the zero function and $\sigma^2 \cdot k$ is the covariance function. One cool thing about this is that all that really matters for the distribution of $Z$ is the covariance matrix $K := LL^T$ (equivalently, the kernel $k$), so I could easily swap a different kernel in here and get a different model. The particular kernel that I’m using isn’t foundational to the general set-up.
This is all the motivation I need to try a Bayesian approach so I’ll look at the posterior distribution $Z \mid X$. From Bayes’ theorem I know
$$
\begin{aligned}
\pi(z , \sigma^2\mid x) &\propto p(x \mid z, \sigma^2)\, \pi(z, \sigma^2) \\
&= p(x \mid z)\, \pi(z\mid\sigma^2)\,\pi(\sigma^2).
\end{aligned}
$$
I’ll only need ratios of posterior densities so I won’t need to worry about the marginal likelihood $p(x)$. $p(x|z)$ is a joint Bernoulli likelihood and $\pi(z|\sigma^2)$ is a multivariate Gaussian (it’s my GP prior) so I just need a prior on $\sigma^2$ to proceed. So far there’s not much context for whether I expect $Z$ to be typically near $0$ or not, but I do believe that arbitrarily large $\sigma^2$ are not reasonable so I’ll take $\sigma^2 \sim \text{IG}(a, b)$ for some known constants $a$ and $b$ (in practice I’ll use $a=5$ and $b=10$). I’m actually specifying the joint posterior of $(Z, \sigma^2)\mid X$ here but I’ll be able to obtain samples from the marginal posterior $Z\mid X$ by just taking the $Z$ coordinates of my joint samples.
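To get a feel for what $\text{IG}(5, 10)$ says about $\sigma^2$, here’s a quick sketch (base R has no inverse-gamma functions, but if $\sigma^2 \sim \text{IG}(a,b)$ then $1/\sigma^2 \sim \text{Gamma}(a, \text{rate}=b)$):

# aside: summarize the IG(5, 10) prior on sigma^2 by sampling its reciprocal
sigma2_prior <- 1 / rgamma(1e5, shape = 5, rate = 10)
10 / (5 - 1)                                 # prior mean = 2.5
quantile(sigma2_prior, c(.025, .5, .975))    # most of the mass is on modest values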
For $p(x|z)$ I can take logs, rearrange, and make use of the fact that $s$ and $p \mapsto \log\frac p{1-p}$ are inverses to get
$$
\begin{aligned}
\log p(x \mid z) &= \sum_{i=1}^n x_i \log s(z_i) + (1-x_i)\log(1-s(z_i)) \\
&= x^Tz - \sum_{i=1}^n \log(1 + e^{z_i}).
\end{aligned}
$$
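This identity is easy to spot-check numerically against R’s built-in Bernoulli log-density (an aside, using the simulated $x$ and $z$ from above):

# aside: the simplified form matches the Bernoulli log-likelihood
all.equal(sum(dbinom(x, 1, plogis(z), log = TRUE)),
          sum(x * z) - sum(log(1 + exp(z))))   # TRUE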
For $\pi(z|\sigma^2)$ I have $Z \mid \sigma^2 \sim \mathcal N\left(\mathbf 0, \sigma^2 K\right)$ so (ignoring constants)
$$
\log \pi(z\mid\sigma^2 ) = -\frac 1{2\sigma^2}z^TK^{-1}z - \frac n2 \log \sigma^2 - \frac 12 \log\det K .
$$
This is a mathematically concise form but I’m going to work on it to make it faster to evaluate. $K = LL^T$ and $L$ is triangular with all $1$s on the diagonal so $\det K = (\det L)^2 = 1$, therefore $\log\det LL^T = 0$. Furthermore, suppose $Lv = w$ for some vectors $v$ and $w$. I can find $L^{-1}$ by solving this system for $v$. The first equation is easy as $v_1 = w_1$. Next is $v_1 + v_2 = w_2$ so plugging in the first equation gives me $v_2 = w_2 - w_1$. Continuing, I need to solve $v_1+v_2+v_3 = w_3$ for $v_3$. But I can recognize that $v_1+v_2 = w_2$ so it’s just $v_3 = w_3 - w_2$. In general, for $i \geq 2$, I’ll have $v_i = w_i - w_{i-1}$ which means that $L^{-1}$ is the lag-one differencing matrix, i.e.
$$
L^{-1} = \left(\begin{array}{ccccccc}1 & 0 & 0 & \cdots& 0 & 0 & 0 \\
-1 & 1 & 0 & \cdots & 0 & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots& \vdots \\
0&0&0&\cdots&-1&1&0 \\
0 &0 &0 &\cdots & 0 &-1 &1
\end{array}\right)
$$
This means that
$$
L^{-1}z = \left(\begin{array}{c}z_1 \\ z_2 - z_1 \\ \vdots \\ z_n - z_{n-1}\end{array}\right)
$$
and therefore
$$
z^TK^{-1}z = z_1 ^2 + \sum_{i= 2}^n (z_i - z_{i-1})^2
$$
which will be much faster to compute than a general quadratic form. I could have arrived at this directly from my initial specification $Z_i \mid Z_{i-1}, \sigma^2 \sim \mathcal N(Z_{i-1}, \sigma^2)$, but it’s a nice sanity check of the GP formulation.
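A quick numerical check of the simplification (an aside, using the simulated $z$ from above):

# aside: z' K^{-1} z reduces to z_1^2 plus the sum of squared first differences
K <- outer(1:n, 1:n, pmin)
all.equal(drop(t(z) %*% solve(K) %*% z),
          z[1]^2 + sum(diff(z)^2))   # TRUE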
Finally, for my prior I have
$$
\pi(\sigma^2) \propto (\sigma^2)^{-a - 1}e^{-b / \sigma^2}.
$$
Plugging all the densities in and still ignoring constants, I have
$$
\log \pi(z, \sigma^2\mid x) = x^Tz - \sum_i \log(1 + e^{z_i})
-\frac {z_1^2 + \sum_{i= 2}^n (z_i - z_{i-1})^2}{2\sigma^2} - \left(\frac n2 + a + 1\right) \log \sigma^2 - b\sigma^{-2}
$$
with the additional note that this is $-\infty$ if $\sigma^2 \leq 0$.
For convenience let $\ell(z, \sigma^2\mid x) = \log \pi(z, \sigma^2\mid x)$. One nice feature of $\ell$ is that it is concave in $z$, i.e. the posterior is log-concave in $z$:
$$
\frac{\partial}{\partial z_j} \sum_i \log(1 + e^{z_i}) = \frac{e^{z_j}}{1 + e^{z_j}} = s(z_j)
$$
so
$$
\frac{\partial \ell}{\partial z} = x - s(z) - \sigma^{-2}K^{-1}z.
$$
Then
$$
\frac{\partial s(z)_i}{\partial z_j} = \begin{cases}s(z_i)(1 - s(z_i)) & i = j \\ 0 & \text{o.w.}\end{cases}.
$$
This means
$$
\frac{\partial^2 \ell}{\partial z^2} = - \text{diag}\left[s(z) \odot (1 - s(z))\right] - \sigma^{-2}K^{-1}
$$
which is strictly negative definite ($\odot$ denotes element-wise multiplication). So $\ell$ is strictly concave in $z$ and I don’t have to worry about difficult multimodal structure in $Z$; it’ll be harder to screw up the sampling.
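Here’s a rough sketch of a sanity check on the gradient formula (an aside; it only uses the terms of $\ell$ that depend on $z$, picks an arbitrary $\sigma^2$, and builds $K^{-1}$ densely since it’s just a check):

# aside: compare the analytic gradient in z to central finite differences
sigma2_chk <- 1.5                              # arbitrary value for the check
K_inv <- solve(outer(1:n, 1:n, pmin))
ell_z <- function(zz) sum(x * zz) - sum(log(1 + exp(zz))) -
  drop(t(zz) %*% K_inv %*% zz) / (2 * sigma2_chk)
grad_analytic <- x - plogis(z) - K_inv %*% z / sigma2_chk

eps <- 1e-6
grad_numeric <- sapply(1:n, function(j) {
  e_j <- replace(numeric(n), j, eps)
  (ell_z(z + e_j) - ell_z(z - e_j)) / (2 * eps)
})
all.equal(drop(grad_analytic), grad_numeric, tolerance = 1e-5)  # TRUE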
Before I actually do my sampling, I’m going to use the reparameterization $\xi := \log \sigma^2$, which will let me use a standard Metropolis sampler with an all-Gaussian proposal distribution since all my parameters are now unbounded. This means $\sigma^2 = e^{\xi}$, and the change of variables contributes a Jacobian factor of $d\sigma^2/d\xi = e^{\xi}$ to the density, so all together
$$
\log \pi(z, \xi \mid x) = x^Tz - \sum_i \log(1 + e^{z_i})
-\frac 12 e^{-\xi}\left(z_1^2 + \sum_{i= 2}^n (z_i - z_{i-1})^2\right) - \left(\frac n2 + a\right) \xi - be^{-\xi}.
$$
In R I’ll use Geyer and Johnson’s mcmc library. Here’s the code I used:
library(mcmc)
set.seed(119)

a_prior <- 5
b_prior <- 10

# log unnormalized posterior with xi
lupost_xi <- function(params, x, a = a_prior, b = b_prior) {
  n <- length(params) - 1  # minus 1 for xi
  z <- params[1:n]
  xi <- params[n + 1]
  sum(z * x) - sum(log(1 + exp(z))) -
    .5 * exp(-xi) * (z[1]^2 + sum((z[-1] - z[-n])^2)) -
    (n/2 + a) * xi - b * exp(-xi)
}

z_init <- ifelse(x == 1, 2, -2) + rnorm(n, 0, .5)
xi_init <- log(b_prior / (a_prior - 1))  # log prior mean

scale_z <- .2
scale_xi <- .3
out <- metrop(lupost_xi, c(z_init, xi_init), nbatch = 1e6, blen = 1,
              x = x, scale = rep(c(scale_z, scale_xi), c(n, 1)))
cat("Accept: ", out$accept * 100, "%\n", sep = "")
Checking to see how well the chains mixed:
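A sketch of one way such trace plots can be drawn from `out$batch` (which holds the draws, one column per parameter in the order the initial values were passed to `metrop`), overlaying the true $Z_i$ values:

# aside: trace plots for a few Z_i with the true values overlaid
z_draws <- out$batch[, 1:n]                 # columns 1..n are the Z coordinates
keep <- seq(1, nrow(z_draws), by = 1000)    # thin purely for plotting speed
idx <- c(1, 25, 50, 75, 100)                # a handful of arbitrary time points
par(mfrow = c(length(idx), 1), mar = c(2, 4, 1, 1))
for (i in idx) {
  plot(keep, z_draws[keep, i], type = "l", xlab = "", ylab = bquote(Z[.(i)]))
  abline(h = z[i], lwd = 2)                 # true value as a flat horizontal line
}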
The flat horizontal lines are the true corresponding $Z_i$ values. These didn’t do a terrible job and seem to have explored about where they were supposed to, but there’s definitely a lot of autocorrelation there so my effective sample size is much smaller than it looks. I also only ran one chain since I know the ground truth here.
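One way to quantify that autocorrelation penalty, assuming the coda package is available:

# aside: effective sample sizes of the Z chains
library(coda)
ess <- effectiveSize(as.mcmc(out$batch[, 1:n]))
summary(ess)   # far below the nominal number of iterations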
Here’s the final result:
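Here’s a sketch of one way such a summary could be drawn, using posterior means and pointwise 95% intervals from the `z_draws` matrix above:

# aside: posterior mean and pointwise 95% band for Z, with the truth overlaid
z_mean <- colMeans(z_draws)
z_band <- apply(z_draws, 2, quantile, probs = c(.025, .975))

plot(z, type = "l", lwd = 2, xlab = "Time", ylab = "z", ylim = range(z, z_band))
lines(z_mean, col = "red", lwd = 2)
lines(z_band[1, ], col = "red", lty = 2)
lines(z_band[2, ], col = "red", lty = 2)
legend("bottomleft", c("true z", "posterior mean", "95% interval"),
       col = c(1, 2, 2), lwd = c(2, 2, 1), lty = c(1, 1, 2), bty = "n")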
It’s interesting how insensitive the posterior is to movement in $Z$ that doesn’t lead to any changes in $X$. For example, around $i = 15$ the $Z_i$ dip down fairly close to zero, but since no $X_i = 0$ occurs there, the data gives no indication that the dip happened. The uncertainty also gets bigger in stretches where $X_i$ is constant, which makes sense: when $X_i$ is flipping back and forth between $0$ and $1$ I have a pretty good idea that $Z$ is near $0$, but when the $X_i$ are constant for a while, $|Z_i|$ could be extremely large and there wouldn’t really be a way of telling. It’s also interesting how sensitive the model is to a single “unexpected” $X_i$ that’s different from the surrounding ones; the model tries very hard to get close to those. Overall this didn’t do too badly!