$\newcommand{\e}{\varepsilon}$Suppose I’ve got a linear model $y = X\beta + \e$ with $\e \sim \mathcal N(0, \sigma^2 I)$ and $X\in\mathbb R^{n\times p}$ of full column rank. In this post I’m going to explore the role of $\sigma^2$ in estimating $\beta$ and a few ways to get rid of it, i.e. I’ll be officially treating it like the nuisance parameter that it can be. I’ll use the parameterization $\tau := 1 / \sigma^2$ to make things tidier. I will be rather casual about when I’m thinking of the parameters $\beta$ and $\tau$ as specific fixed unknowns versus as arguments to the likelihoods, but it should be clear from context.
When dealing with the likelihood $f(y\mid \beta, \tau, X)$ two of the ways that I can turn this into something that doesn’t depend on $\tau$ are by optimization, i.e. considering
$$
f_p(y \mid \beta, X) := \sup_{\tau > 0} f(y \mid \beta, \tau, X)
$$
(the subscript “$p$” refers to this being the profiled likelihood) or by integration, i.e.
$$
f_\pi(y \mid \beta, X) := \int_0^\infty f(y \mid \beta, \tau, X)\, \pi(\tau)\,\text d\tau
$$
for some prior $\pi(\tau)$. In other words, for each $\beta$ under consideration I can get rid of $\tau$ by considering the most likely scenario over all possible $\tau$ as given by $f_p$, or by considering the average case as given by $f_\pi$. Both are sensible and are commonly used for various things (e.g. in a mixed model the usual procedure involves both of these). I’ll use $\ell_p = \log f_p$ and $\ell_\pi = \log f_\pi$.
I already know that $\hat\beta = (X^TX)^{-1}X^Ty$ is the unique value of $\beta$ that maximizes the joint likelihood, so I’ll be looking to see if any of these other methods give different results. Putting priors on $\beta$ and integrating it out can lead to different results for $\hat\tau$, such as with REML vs ML, so even though the formula for $\hat\beta$ doesn’t involve $\tau$ I don’t think it’s obvious a priori that none of this affects the point estimate of $\beta$. I’ve gone through this process with mixed models in my post here.
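Throughout I’ll sanity-check the algebra numerically. Here’s a minimal setup sketch (the variable names and simulated data are my own arbitrary choices, not anything canonical): simulate from the model and compute the OLS estimate that the later checks will be compared against.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
sigma2 = 1.5                                      # true noise variance, tau = 1 / sigma2
X = rng.normal(size=(n, p))                       # full column rank with probability 1
beta_true = np.array([1.0, -2.0])
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # (X^T X)^{-1} X^T y
rss = lambda b: np.sum((y - X @ b) ** 2)          # ||y - X b||^2
```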
Profiling
I’ll begin with profiling.
I want to maximize $f$ over $\tau$ for a fixed $\beta$ and I can do that in the usual way by taking logs to get $\ell$ and then using derivatives.
$$
\frac{\partial \ell}{\partial \tau} = -\frac{1}{2}\|y-X\beta\|^2 + \frac{n}{2\tau}
$$
which when set to $0$ yields
$$
\hat\tau(\beta) = \frac n{\|y-X\beta\|^2}.
$$
Furthermore,
$$
\frac{\partial^2 \ell}{\partial\tau^2} = -\frac{n}{2\tau^2} < 0
$$
for all valid $\tau$, which means $\ell$ is concave in $\tau$, so $\hat\tau(\beta)$ is indeed the maximizer.
This is nice because I’m looking to find the global optimum of $f$ in terms of both $\beta$ and $\tau$, but for any candidate $\beta$ I now know the unique best $\tau$ for that $\beta$ (in general with multivariate optimization there’s no reason to expect a closed form solution for the optimum of some inputs as a function of others, but in this case everything worked out nicely). Thus the maximum of $f_p(\beta) = f(\beta, \hat\tau(\beta))$ over $\beta$ will be the same as the maximum of $f(\beta,\tau)$ when optimizing over $(\beta,\tau)$ jointly.
Plugging $\hat\tau(\beta)$ in, I find
$$
\begin{aligned}\ell_p(\beta) &= \ell(\beta,\hat\tau(\beta)) = -\frac {\hat \tau}{2}\|y-X\beta\|^2 + \frac n 2 \log\hat \tau - \frac n2 \log 2\pi \\&
= -\frac n2 + \frac n2 \log \frac n {\|y-X\beta\|^2} - \frac n2 \log 2\pi \\&
= -\frac n2 \log \|y-X\beta\|^2 - \frac n2 \left(1 - \log n + \log 2\pi\right)\end{aligned}
$$
Interestingly, even though I’ve now eliminated the nuisance parameter, the argmax $\hat\beta^{(p)}$ has not changed: $\ell_p$ is a strictly decreasing function of $\|y-X\beta\|^2$, so maximizing it is just least squares again. In retrospect this isn’t a big surprise, as the $\beta$ that maximizes $\ell$ does not depend on $\tau$.
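Here’s a quick numerical sketch of that claim, reusing the simulated X, y and the rss helper from above (the optimizer and starting point are just convenient choices): maximizing the profile log-likelihood recovers the OLS estimate.

```python
from scipy.optimize import minimize

def ell_p(beta):
    # profile log-likelihood, dropping the constant -(n/2)(1 - log n + log(2*pi))
    return -0.5 * n * np.log(rss(beta))

res = minimize(lambda b: -ell_p(b), x0=np.zeros(p))
print(res.x, beta_ols)      # should agree up to optimizer tolerance
print(n / rss(beta_ols))    # hat tau at the MLE, i.e. 1 / hat sigma^2
```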
Integrating
What if instead I integrated $\tau$ out? I’ll consider two priors. First will be putting an improper uniform prior on $\tau$ so $\pi(\tau)\propto 1$. I’ll denote the resulting marginal density and log-likelihood by $f_U$ and $\ell_U$ respectively (“U” is for uniform).
This leads to the following:
$$
\begin{aligned}f_U(y \mid \beta, X) &= \int_{0}^\infty f(y\mid \beta, \tau, X)\pi(\tau)\,\text d\tau \\&
= (2\pi)^{-n/2}\int_0^\infty \tau^{n/2} \exp\left(-\frac \tau2\|y-X\beta\|^2\right)\,\text d\tau.\end{aligned}
$$
Let $z = \frac \tau 2 \|y - X\beta\|^2$ so $\text dz = \frac 1 2 \|y - X\beta\|^2\,\text d\tau$. This means
$$
\begin{aligned}f_U(y | \beta, X) &= (2\pi)^{-n/2} \left(\frac 12 \|y-X\beta\|^2\right)^{-n/2-1}\int_0^\infty z^{n/2} \exp\left(-z\right)\,\text dz \\&
= \left( \|y-X\beta\|^2\right)^{-n/2-1} \frac{2\Gamma(n/2+1)}{\pi^{n/2}} \\&
\propto \|y-X\beta\|^{-n-2}.\end{aligned}
$$
Maximizing $f_U$ over $\beta$ is again equivalent to minimizing $\|y – X\beta\|^2$ so this leads to the same $\hat\beta$ as OLS.
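As a sketch of a sanity check on that closed form (again reusing the setup above, with two arbitrary $\beta$ values of my own choosing), the ratio of numerically integrated marginals should match the corresponding ratio of $\left(\|y-X\beta\|^2\right)^{-n/2-1}$ terms:

```python
from scipy.integrate import quad

def f_U_numeric(beta):
    # integrate the joint likelihood over tau under a flat prior pi(tau) ∝ 1
    integrand = lambda tau: (2 * np.pi) ** (-n / 2) * tau ** (n / 2) * np.exp(-0.5 * tau * rss(beta))
    val, _ = quad(integrand, 0, np.inf, epsabs=0)   # relative tolerance only; values are tiny
    return val

b1, b2 = np.array([1.0, -2.0]), np.array([1.2, -1.8])
print(f_U_numeric(b1) / f_U_numeric(b2))            # numeric ratio
print((rss(b1) / rss(b2)) ** (-(n / 2) - 1))        # closed-form ratio
```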
What if instead I use the conjugate prior $\tau \sim \Gamma(a,b)$, with shape $a$ and rate $b$? Then the prior density will be
$$
\pi(\tau; a,b) = \frac{b^a}{\Gamma(a)}\tau^{a-1}e^{-b\tau}
$$
which leads me to
$$
\begin{aligned}f_\Gamma(y | \beta, X) &= \int_0^\infty f(y | \tau, \beta, X) \pi(\tau)\,\text d\tau \\&
\propto \int_0^\infty \tau^{n/2}\exp\left(-\frac{\tau}{2}\|y-X\beta\|^2\right) \cdot \tau^{a-1}e^{-b\tau}\,\text d\tau \\&
= \int_0^\infty \tau^{n/2+a-1}\exp\left(-\frac{\tau}{2}\|y-X\beta\|^2-b\tau\right) \,\text d\tau \\&
=\int_0^\infty \tau^{n/2+a-1}\exp\left(-c \tau\right) \,\text d\tau\end{aligned}
$$
where $c := \frac 12 \|y-X\beta\|^2 + b$ is a constant with respect to $\tau$. Note $c > 0$ and define $z = c\tau$ so $\text dz = c\,\text d\tau$ and then
$$
\begin{aligned}&\int_0^\infty \tau^{n/2+a-1}\exp\left(-c \tau\right) \,\text d\tau \\&
= \int_0^\infty \left(\frac z c\right)^{n/2+a-1} e^{-z} c^{-1} \,\text dz\\&
= c^{-(n + 2a)/2}\int_0^\infty z^{n/2+a-1} e^{-z}\,\text dz \\&
= c^{-(n + 2a)/2}\Gamma(n/2+a)\\&
\propto \left[\frac 12 \|y-X\beta\|^2 + b\right]^{-(n+2a)/2} \\&
\propto \left[\frac 1{2b} \|y-X\beta\|^2 + 1\right]^{-(n+2a)/2}\end{aligned}
$$
which can be recognized as the kernel of a multivariate $t$ distribution with $\mu = X\beta$, $\Sigma = \frac ba I$, and $\nu = 2a$ degrees of freedom. Surprisingly enough, since this is again a decreasing function of $\|y-X\beta\|^2$, the argmax is still the same as in OLS.
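Here’s a sketch of the corresponding numerical check (reusing the setup above, with hypothetical hyperparameters that I picked arbitrarily): the closed form $c^{-(n/2+a)}\Gamma(n/2+a)$ should match the integral computed numerically.

```python
from scipy.integrate import quad
from scipy.special import gamma

a_shape, b_rate = 2.0, 1.0              # hypothetical Gamma(a, b) hyperparameters

def integral_numeric(beta):
    c = 0.5 * rss(beta) + b_rate
    integrand = lambda tau: tau ** (n / 2 + a_shape - 1) * np.exp(-c * tau)
    val, _ = quad(integrand, 0, np.inf, epsabs=0)   # relative tolerance only
    return val

def integral_closed(beta):
    c = 0.5 * rss(beta) + b_rate
    return c ** (-(n / 2 + a_shape)) * gamma(n / 2 + a_shape)

print(integral_numeric(beta_ols), integral_closed(beta_ols))   # should agree
```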
Conclusions
So I’ve tried 3 different ways of finding a $\hat\beta$ and every time I got the same thing. At this point I’m pretty suspicious that $\hat\beta$ really doesn’t care about $\tau$ so I’ll wrap up this post by thinking about why this might be the case.
I can get a good intuition for this by considering the geometry of the log likelihood
$$
\ell(\beta, \tau \mid y, X) = -\frac \tau 2 \|y - X\beta\|^2 + \frac n2 \log \tau - \frac n2 \log 2\pi.
$$
I will use the term $\tau$-space to refer to the axis that $\tau$ lives on, the half-line $(0,\infty)$ in this case, and $\beta$-space will be the $p$-dimensional space that contains $\beta$.
If I fix $\tau$ at some value so I’m only thinking of $\ell$ as a function of $\beta$, I’m taking a slice at a fixed height in $\tau$-space, and the result is a quadratic in $\beta$ which geometrically is a (downward-opening) paraboloid. Changing the height of the $\tau$ slice only scales and vertically shifts this quadratic, so all of the slices share exactly the same contours in $\beta$-space, and in particular the same maximizer.
Another way to see this is to look at the gradient for $\beta$, $\nabla_\beta \ell$.
$$
\nabla_\beta \ell = \frac{\partial}{\partial \beta} \ell = \tau (X^Ty - X^TX\beta)
$$
so at a given point $\beta$, the gradients for different values of $\tau$ are not only collinear but point in the exact same direction in $\beta$-space, since $\tau > 0$ only rescales them. Because $\tau$’s role is only to scale and shift the surface “vertically”, the direction of a step towards $\hat\beta$ is always the same, so it makes sense that $\hat\beta$ doesn’t depend on $\tau$. Getting rid of $\tau$ by profiling just picks a single gradient from this collinear set, while integrating is like averaging these gradients over $\tau$ (this is heuristically exchanging differentiation and integration, since I’m acting like the average of the gradients is the gradient of the average). Either way, since every gradient involved is collinear, the result will be too.
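A tiny sketch of that collinearity, again with the simulated X and y from above and an arbitrary point beta0 of my own choosing: normalizing the gradient at two different values of $\tau$ gives the same unit vector.

```python
beta0 = np.array([0.3, 0.7])                    # arbitrary point in beta-space
grad = lambda tau: tau * (X.T @ y - X.T @ X @ beta0)

g_small, g_large = grad(0.1), grad(5.0)
print(g_small / np.linalg.norm(g_small))
print(g_large / np.linalg.norm(g_large))        # same direction, different length
```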
Just for fun, here’s a visualization of the joint likelihood with $p=2$ for some simulated data. The MLE is the green dot, and this shows how the likelihood is unimodal. I can see how each $\tau$ slice is a quadratic form (with orientation given by $X^TX$), and I can see how $\tau$ only interacts with $\beta$ via $\|y - X\beta\|^2$, in that every $\beta$ slice with $\beta$ on a contour of $\|y - X\beta\|^2$ is the same.
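For reference, here’s a rough sketch (assuming matplotlib, and not the code behind the figure above) of one way to draw a slice of this surface: contours of $\ell$ over $\beta$-space at the fixed height $\tau = \hat\tau(\hat\beta)$, with the MLE marked in green.

```python
import matplotlib.pyplot as plt

tau_hat = n / rss(beta_ols)
b1_grid = np.linspace(beta_ols[0] - 1, beta_ols[0] + 1, 200)
b2_grid = np.linspace(beta_ols[1] - 1, beta_ols[1] + 1, 200)
B1, B2 = np.meshgrid(b1_grid, b2_grid)
# log-likelihood at the tau_hat slice, dropping terms that don't depend on beta
Z = np.array([[-0.5 * tau_hat * rss(np.array([u, v])) for u in b1_grid] for v in b2_grid])

plt.contour(B1, B2, Z, levels=20)
plt.plot(beta_ols[0], beta_ols[1], "go")        # MLE
plt.xlabel(r"$\beta_1$")
plt.ylabel(r"$\beta_2$")
plt.show()
```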
For further reading, I found the paper “Integrated Likelihood Methods for Eliminating Nuisance Parameters” (1999) by Berger, Liseo, and Wolpert to be very interesting.