$\newcommand{\E}{\text{E}}$$\newcommand{\Var}{\text{Var}}$$\newcommand{\Y}{\mathcal Y}$Here I’m going to introduce Monte Carlo maximum likelihood estimation (MCMLE) and I’ll use it to fit a logistic regression.
Suppose I’m modeling a discrete $y \in \Y$ and I decide to do so by coming up with several statistics that I feel capture the useful parts of $y$, say
$$
s(y) = \left(s_1(y), \dots, s_p(y)\right)
$$
so $s : \Y \to \mathbb R^p$. I’ll then say that
$$
f(y|\theta) = P(Y=y|\theta) \propto \exp\left(\theta^Ts(y)\right)
$$
so any aspects of $y$ not described by $s$ are ignored, and $\theta\in\mathbb R^p$ is my parameter of interest.
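For instance, if $y$ were a binary sequence, I might decide that all I care about is how many ones it has and how often two ones sit next to each other. A hypothetical sketch of such an $s$ in R (these statistics are made up purely for illustration):

```r
# A hypothetical statistic function for a binary sequence y in {0,1}^n:
# p = 2 statistics, the number of ones and the number of adjacent 1-1 pairs.
s_example <- function(y) {
  c(ones = sum(y), pairs = sum(y[-1] * y[-length(y)]))
}
s_example(c(1, 1, 0, 1, 1, 1))  # ones = 5, pairs = 3
```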
In order to get this to be a proper distribution it needs to sum to $1$, so I’ll have
$$
f(y|\theta) = \frac{\exp\left(\theta^Ts(y)\right)}{\sum\limits_{z\in \Y} \exp(\theta^Ts(z))}.
$$
I’ll write this as
$$
f(y|\theta) = \frac{q_\theta(y)}{k(\theta)}
$$
where $k$ is the partition function and my goal then is to find
$$
\hat\theta = \underset{\theta}{\text{argmax }} f(y|\theta).
$$
MCMLE can be used to do this optimization when I can’t evaluate $k(\theta)$. One common use-case is when I’m modeling a random graph so $y$ is an $n\times n$ (say) adjacency matrix. I could pick a few statistics that characterize a particular random graph, say
$$
f(y|\theta) \propto \exp\left(\theta_1 \cdot \text{\#edges}(y) + \theta_2 \cdot \text{\#twostars}(y) + \theta_3 \cdot \text{\#triangles}(y)\right)
$$
for example (this is the “triad” model), but then the partition function will involve a sum over all $n\times n$ adjacency matrices which is generally intractable.
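To make $s$ concrete in this case, all three statistics can be read straight off a symmetric 0/1 adjacency matrix with an empty diagonal; here’s a minimal sketch (the function name and layout are mine, not from any package):

```r
# Triad-model statistics for an undirected simple graph given as a
# symmetric 0/1 adjacency matrix `y` with zeros on the diagonal.
s_triad <- function(y) {
  deg <- rowSums(y)
  c(edges     = sum(y) / 2,                    # each edge appears twice in y
    twostars  = sum(choose(deg, 2)),           # paths of length two
    triangles = sum(diag(y %*% y %*% y)) / 6)  # tr(y^3) counts each triangle six times
}
```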
MCMLE to the rescue
The key idea behind MCMLE is that I may not be able to evaluate $k(\theta)$, but frequently I can sample from $f(y|\theta)$ so I can use that combined with the law of large numbers (LLN) to approximate $k(\theta) / k(\theta_0)$ for some fixed $\theta_0$.
Using various properties of the argmax I can do the following series of manipulations:
$$
\hat \theta = \text{argmax}_{\theta} f(y|\theta) \\
=\text{argmax}_{\theta} \log f(y|\theta) \\
=\text{argmax}_{\theta} \log f(y|\theta) - \log f(y|\theta_0)
$$
for some constant $\theta_0$. Continuing,
$$
= \text{argmax}_\theta \log q_\theta(y) - \log k(\theta) - \log q_{\theta_0}(y) + \log k(\theta_0) \\
= \text{argmax}_\theta \log \frac{q_\theta(y)}{q_{\theta_0}(y)} - \log \frac{k(\theta)}{k(\theta_0)}.
$$
I have $q_\theta(y) = \exp(\theta^Ts(y))$ so
$$
\log \frac{q_\theta(y)}{q_{\theta_0}(y)} = (\theta - \theta_0)^Ts(y)
$$
which is a pretty tidy form.
Now if I let
$$
B(\theta, \theta_0) = \log \frac{k(\theta)}{k(\theta_0)}
$$
then I could find $\hat \theta$ by
$$
\hat\theta = \text{argmax}_\theta \, (\theta - \theta_0)^Ts(y) - B(\theta, \theta_0)
$$
and it’ll turn out that I can estimate $B$ in a pretty straightforward way.
I’ve assumed that I can sample $Y_1^*, \dots, Y_m^* \stackrel{\text{iid}}\sim f(\cdot|\theta_0)$. This isn’t that big of an assumption because I can evaluate $f(y | \theta) / f(y' | \theta)$ since $k(\theta)$ cancels out, so worst case I can use MCMC with a Metropolis-type algorithm to get these samples (the draws are then no longer independent, but an LLN still holds for well-behaved chains). For graphs that’s typically what’s necessary.
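To sketch what that looks like, here’s a minimal Metropolis sampler for a binary $y$ that only ever touches the ratio $q_\theta(y')/q_\theta(y)$. I’m assuming a symmetric single-coordinate “flip” proposal, and the function and argument names are mine; a real sampler would also want burn-in and mixing diagnostics:

```r
# Metropolis sampling from f(. | theta) using only the unnormalized
# ratio q_theta(y') / q_theta(y), so k(theta) never appears.
# `...` is extra arguments to the statistic function `s` (e.g. a data matrix).
metropolis_sample <- function(theta, s, y_init, n_keep, thin = 100, ...) {
  y <- y_init
  out <- vector("list", n_keep)
  for (i in 1:(n_keep * thin)) {
    y_prop <- y
    j <- sample(length(y), 1)
    y_prop[j] <- 1 - y_prop[j]                              # flip one coordinate
    log_ratio <- sum(theta * (s(y_prop, ...) - s(y, ...)))  # log of the q ratio
    if (log(runif(1)) < log_ratio) y <- y_prop              # accept/reject
    if (i %% thin == 0) out[[i / thin]] <- y                # keep every `thin`-th state
  }
  out
}
```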
Note then that
$$
\begin{aligned}E_{\theta_0}\left(\frac{q_\theta(Y)}{q_{\theta_0}(Y)}\right) &= \sum_{z \in \Y} \frac{q_\theta(z)}{q_{\theta_0}(z)} f(z|\theta_0) \\&
= \sum_{z\in\Y} \frac{q_\theta(z)}{q_{\theta_0}(z)} \frac{q_{\theta_0}(z)}{k(\theta_0)} \\&
= \frac{1}{k(\theta_0)} \sum_{z} q_\theta(z) \\&
= \frac{k(\theta)}{k(\theta_0)} \sum_{z} \frac{q_\theta(z)}{k(\theta)} \\&
= \frac{k(\theta)}{k(\theta_0)} = \exp B(\theta, \theta_0).\end{aligned}
$$
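Here’s a quick numerical check of that identity, and of the Monte Carlo approximation to come, on a toy space small enough to enumerate. Everything in it (the statistics, the parameter values) is made up purely for illustration:

```r
# Check E_{theta0}[ q_theta(Y) / q_theta0(Y) ] = k(theta) / k(theta0)
# exactly by enumeration and approximately by Monte Carlo.
s_toy   <- function(y) c(sum(y), y[1] * y[2])            # made-up statistics
Yspace  <- as.matrix(expand.grid(0:1, 0:1, 0:1))         # all 8 elements of {0,1}^3
q       <- function(theta) apply(Yspace, 1, function(z) exp(sum(theta * s_toy(z))))
theta0  <- c(0.2, -0.5)
theta   <- c(1.0, 0.3)
k_ratio <- sum(q(theta)) / sum(q(theta0))                # exact k(theta) / k(theta0)

set.seed(1)
idx    <- sample(nrow(Yspace), 1e5, replace = TRUE, prob = q(theta0))  # draws from f(.|theta0)
ratios <- q(theta)[idx] / q(theta0)[idx]                 # q_theta(Y_i^*) / q_theta0(Y_i^*)
c(exact = k_ratio, monte_carlo = mean(ratios))           # these should be close
```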
This means I can estimate $B(\theta, \theta_0)$ via
$$
\hat B(\theta, \theta_0, \{Y_i^*\}) = \log\left(\frac 1m\sum_{i=1}^m \frac{q_\theta(Y_i^*)}{q_{\theta_0}(Y_i^*)}\right) \\
= \log\left(\frac 1m\sum_{i=1}^m \exp\left((\theta - \theta_0)^T s(Y_i^*)\right)\right)
$$
and this is a consistent estimator via the LLN. This is of the “log-mean-exp” form, which can overflow if computed naively, so it’s worth using a carefully written implementation.
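The usual fix is to shift by the maximum before exponentiating. A minimal sketch of that trick (in the actual code below I’ll just call statnet.common’s log_mean_exp for this job):

```r
# A minimal log-mean-exp: subtract the max so the largest term is exp(0) = 1.
log_mean_exp <- function(x) {
  m <- max(x)
  m + log(mean(exp(x - m)))
}
log_mean_exp(c(1000, 1001, 1002))  # ~1001.31, while the naive log(mean(exp(x))) returns Inf
```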
All together, the objective function becomes
$$
g(\theta; \theta_0, y, \{Y_i^*\}) = (\theta - \theta_0)^Ts(y) - \log\left(\frac 1m\sum_{i=1}^m \exp \left((\theta - \theta_0)^Ts(Y_i^*)\right)\right)
$$
so this gives me the following algorithm:
1. initialize $\theta^{(0)}$
2. for $t = 1, \dots, T$ or until convergence:
a. sample $Y_1^*,\dots,Y_m^* \stackrel{\text{iid}}\sim f(\cdot | \theta^{(t-1)})$
b. obtain $\theta^{(t)} = \text{argmax}_{\theta} \;g(\theta; \theta^{(t-1)}, y, \{Y^*_i\})$.
Geyer and Thompson (1992) discuss how the approximation deteriorates for a fixed $m$ as $\|\theta-\theta_0\|$ grows, so they recommend an iterative approach where small improvements are made to each successive $\theta^{(t)}$ rather than trying to find the exact argmax at each step. This is still justified because the log-likelihood is globally concave (a property of exponential families).
Convergence could be defined by checking if $\text{max}_{\theta} \,g(\theta; \theta^{(t-1)}, y, \{Y^*_i\}) < \varepsilon$ for some $\varepsilon > 0$. This works because $g(\theta^{(t-1)}; \theta^{(t-1)}, y, \{Y^*_i\}) = 0$, so the maximum is always non-negative, and if it’s tiny then the step from $\theta^{(t-1)}$ to $\theta^{(t)}$ can only have improved the approximate log-likelihood by a tiny amount. Nevertheless I’m still going to check for convergence the old-fashioned way via $\|\theta^{(t)} - \theta^{(t-1)}\| < \varepsilon$, just because I’m not actually finding the maximum at each step and I want this to be robust to any other departures that I consider.
Logistic regression
I’ll now try this out by fitting a logistic regression with it. I’ll have
$$
f(y|\theta) = \prod_{i=1}^n \theta_i^{y_i}(1-\theta_i)^{1-y_i} \\
= \exp\left(\sum_i \left[y_i \,\text{logit}(\theta_i) + \log(1-\theta_i)\right]\right).
$$
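It’s easy to check numerically that these two forms agree; R’s qlogis is the logit function, and the particular numbers below are arbitrary:

```r
# The product form and the exponential form of the Bernoulli likelihood agree.
theta_i <- c(0.2, 0.7, 0.5)   # arbitrary success probabilities
y_i     <- c(1, 0, 1)         # arbitrary binary outcomes
c(product     = prod(theta_i^y_i * (1 - theta_i)^(1 - y_i)),
  exponential = exp(sum(y_i * qlogis(theta_i) + log(1 - theta_i))))  # both 0.03
```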
Now assuming I have a full column rank $n\times p$ matrix of covariates $X$, I can set $\theta = \frac{1}{1+e^{-X\beta}}$ (elementwise), so now I’ll write this as conditioned on $\beta$. I’ll omit the conditioning on $X$ for simplicity. Note that
$$
\text{logit}\left(\frac{1}{1+e^{-z}}\right) = z
$$
(i.e. those are inverses of each other) so I have $\text{logit}(\theta_i) = x_i^T\beta$. Additionally,
$$
\begin{aligned}\log(1-\theta_i) &= \log\left(1 - \frac{1}{1+e^{-x_i^T\beta}}\right) \\&
= \log\left(\frac{e^{-x_i^T\beta}}{1+e^{-x_i^T\beta}}\right) \\&
= \log\left(\frac{1}{1+e^{x_i^T\beta}}\right) = - \log\left(1+e^{x_i^T\beta}\right)\end{aligned}
$$
which is the negative softplus function. Plugging this all in I get
$$
f(y|\beta) = \exp\left(\sum_i \left[y_i x_i^T\beta - \log\left(1+e^{x_i^T\beta}\right)\right] \right) \\
= \frac{\exp(\beta^TX^Ty)}{\prod_{i=1}^n (1 + e^{x_i^T\beta})}.
$$
This means $s(y) = X^Ty$ which makes sense as the parts of $y$ that matter for this are the inner products of $y$ with each predictor.
As written the denominator is not hard to evaluate, but I can do my best to fix that.
Suppose I have
$$
\prod_{i=1}^n (1+e^{a_i}).
$$
If I expand this I’ll get
$$
1 + e^{a_1} + \dots + e^{a_n} + e^{a_1 + a_2} + e^{a_1+a_3} + \dots + e^{a_1 + \dots + a_n}.
$$
Note that the exponents correspond to every value of $z^Ta$ for $z \in \{0,1\}^n$, so I can write this as
$$
\sum_{z \in \{0,1\}^n} e^{z^Ta}.
$$
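This expansion is easy to verify by brute force for a small $n$ (the values of $a$ below are arbitrary):

```r
# prod(1 + exp(a)) versus the sum of exp(z^T a) over all z in {0,1}^n.
a  <- c(0.5, -1.2, 2, 0.1)
zs <- as.matrix(expand.grid(rep(list(0:1), length(a))))  # all 2^n binary vectors
c(product = prod(1 + exp(a)), sum = sum(exp(zs %*% a)))  # these match
```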
Plugging this into $f$ with $a = X\beta$ (so that $z^Ta = \beta^TX^Tz = \beta^Ts(z)$) and $\Y = \{0,1\}^n$, I get
$$
f(y|\beta) = \frac{\exp\left(\beta^Ts(y)\right)}{\sum\limits_{z \in \Y} \exp\left(\beta^Ts(z)\right)}
$$
which is exactly the form that I was looking for. That is a scary looking sum as written so I will now pretend that I don’t know how to efficiently evaluate it and I’ll use MCMLE to fit this.
Here’s the generic code to run MCMLE:
```r
require(statnet.common) # for log_mean_exp

# Simulates y via a provided `y_simulator` and returns a closure that
# gives B(theta). `...` is for extra arguments to `y_simulator` and `s`
# and will typically just be the data matrix `x`. It is assumed that
# `y_simulator` returns a list of the simulated Y_i^*.
make_Bfunc <- function(theta0, s, nsim, y_simulator, ...) {
  smtx <- vapply(y_simulator(theta0, nsim, ...), s, numeric(length(theta0)), ...)
  function(theta) {
    statnet.common::log_mean_exp(t(theta - theta0) %*% smtx)
  }
}

# I'll actually maximize this
objfunc <- function(theta, theta0, s, y, B, ...) {
  sum((theta - theta0) * s(y, ...)) - B(theta)
}

# `...` is extra arguments to `s` and `y_simulator`.
# `maxit` is for the inner optimizer.
# `do.trace`: if a positive integer, print progress every `do.trace` iterations.
MCMLE <- function(y, theta_init, s, y_simulator, nsim, ..., nsteps,
                  maxit=25, max_theta=1e6, tol=1e-8, do.trace=FALSE) {
  theta_curr <- theta_init
  for(i in 1:nsteps) {
    B <- make_Bfunc(theta_curr, s, nsim, y_simulator, ...)
    opt <- optim(
      theta_curr, objfunc, theta0=theta_curr, s=s, y=y, B=B, ...,
      control=list(fnscale=-1, maxit=maxit)
    )
    theta_new <- opt$par
    if(do.trace > 0 && i %% do.trace == 0) {
      print(i)
      print(opt)
    }
    if(any(abs(theta_new) > max_theta)) {
      stop("`theta` is escaping to infinity. Try decreasing `maxit`.")
    }
    if(sum((theta_new - theta_curr)^2) < tol) {
      cat("Converged in", i, "iterations.\n")
      break
    }
    theta_curr <- theta_new
  }
  opt$par
}
```
And now I’ll apply this to a logistic regression. I’m going to try this on a simulated data set.
```r
s_logiregr <- function(y, x) {
  t(x) %*% y
}

ysim_logiregr <- function(beta0, nsim, x) {
  replicate(nsim, rbinom(nrow(x), 1, plogis(x %*% beta0)), simplify = FALSE)
}

set.seed(1)
n <- 100
p <- 5
x <- matrix(rnorm(n * p), n, p)
betas <- runif(p, -2, 2)
y <- rbinom(n, 1, plogis(x %*% betas))

beta_hat_mcmle <- MCMLE(
  y=y, theta_init=sample(c(-.1, .1), ncol(x), TRUE),
  s=s_logiregr, y_simulator=ysim_logiregr, nsim=5000,
  x=x, nsteps=100, maxit=30
)
beta_hat_glm <- glm(y ~ x - 1, family=binomial())$coef
round(rbind(beta_hat_mcmle, beta_hat_glm), 3)
```
This converged in 9 iterations and took just a second or two on my laptop. The results are below.
$$
\begin{array}{c|ccccc}
& 1 & 2 & 3 & 4 & 5 \\ \hline
\hat\beta_{\text{MCMLE}} & 0.163 & 0.984 & -0.491 & 1.845 & -2.126 \\
\hat\beta_{\text{GLM}} & 0.167 & 0.979 & -0.483 & 1.837 & -2.140
\end{array}
$$
So there it is. That’s just a super simple example, but it’s neat to be able to replicate a trusted result like that from glm with this new procedure.
In my next post on the subject, I’ll generalize a lot of this because it turns out that this can all be carried out in a general exponential family and the discreteness of $y$ doesn’t actually matter. So I can actually use MCMLE to fit any GLM, not just a logistic regression, and I can do more complicated things like account for dependence between observations.