$\newcommand{\bern}{\text{Bern}}$$\newcommand{\vp}{\varphi}$$\newcommand{\e}{\varepsilon}$$\newcommand{\P}{\mathcal P}$$\newcommand{\Cov}{\text{Cov}}$$\newcommand{\E}{\text{E}}$$\newcommand{\one}{\mathbf 1}$$\newcommand{\s}{\mathbf s}$$\newcommand{\Disc}{\text{Disc}}$In this post I’m going to explore correlations between finitely-supported discrete variables.
Bernoulli case
I’ll begin with $X\sim\bern(\theta)$ and $Y\sim\bern(\vp)$ and I’ll explore the possible joint distributions of $(X,Y)$. There are four probabilities to determine, corresponding to $P(X=x, Y=y)$ for $x,y\in\{0,1\}$, but all four values are not free to vary.
I will let $p = P(X=1,Y=1)$. Since the marginals are fixed, I know
$$
\begin{aligned}\theta &= P(X=1) = P(X=1,Y=0) + P(X=1,Y=1) \\&
= P(X=1,Y=0) + p \\&
\implies P(X=1,Y=0) = \theta-p.\end{aligned}
$$
Similarly, I find $P(X=0,Y=1) = \vp - p$. Finally, I know
$$
\begin{aligned}1 &= P(X=0,Y=0) + (\theta-p) + (\vp-p) + p \\&
\implies P(X=0,Y=0) = 1-\theta-\vp+p.\end{aligned}
$$
All together, this gives me the following joint distribution, in terms of the marginals and the one free parameter $p$:
$$
\begin{array}{cc|c} 1 - \theta - \vp + p & \vp - p & 1-\theta \\ \theta - p & p & \theta \\ \hline 1-\vp & \vp & \end{array}
$$
The next thing to note is that $p$ is not free to vary over all of $[0,1]$. First of all, I’ll need
$$
0 \leq p \leq \min\{\theta,\vp\}
$$
since, for example, $\{X=1, Y=1\}\subseteq \{X=1\}$ so by the monotonicity of measures $p = P(X=1,Y=1) \leq P(X=1) = \theta$.
This isn’t enough, though: if $\theta$ and $\vp$ are large enough that $\theta+\vp > 1$, then $1-\theta-\vp$ is negative, which forces $p$ to be large. Intuitively this makes sense because if $\theta,\vp\approx 1$ it wouldn’t be possible for $p$ to be tiny. The actual constraint here comes from $P(X=0,Y=0)\geq 0$:
$$
1-\theta-\vp+p \geq 0 \implies p \geq \theta+\vp-1
$$
so all together I have
$$
\max\{\theta+\vp-1, 0\} \leq p \leq \min\{\theta,\vp\}.
$$
I’ll refer to a value of $p$ as being _feasible_ if it satisfies these inequalities. $\mathcal P(\theta,\vp)\subseteq [0,1]$ will denote the set of feasible $p$. In this case I simply have
$$
\mathcal P(\theta,\vp) = \left[\max\{\theta+\vp-1, 0\} , \min\{\theta,\vp\}\right].
$$
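These bounds are simple enough to check numerically. Here is a quick Python sketch (the function name is my own):

```python
def feasible_interval(theta, phi):
    """Endpoints of the feasible set P(theta, phi) for p = P(X=1, Y=1)."""
    lo = max(theta + phi - 1, 0.0)
    hi = min(theta, phi)
    return lo, hi

print(feasible_interval(0.75, 0.5))  # → (0.25, 0.5)
```

Note that the independence value $p = \theta\vp = 0.375$ lands inside this interval, which previews Result 1 below.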
Result 1: $\P(\theta,\vp)$ is never empty.
Pf: First I’ll show that $\min\{\theta,\vp\} < \max\{\theta+\vp-1, 0\}$ is impossible, so the interval is nonempty; then I’ll exhibit a particular element that is always in it.
WLOG I’ll suppose $\theta\leq\vp$ since this is symmetric in $\vp$ and $\theta$.
Case I: $\theta+\vp-1\leq 0$. Then the supposed inequality reads $\theta < 0$, which is impossible.
Case II: $\theta+\vp-1>0$. This means
$$
\theta < \theta+\vp-1 \implies \vp > 1
$$
which is also impossible. So it must be that $\mathcal P(\theta,\vp)$ contains at least one point, and now I’ll produce such a point.
It’s always possible that $X\perp Y$ and this leads to $p = \theta\vp$. I’ll show this value is always feasible. Since $\theta,\vp\in[0,1]$ I have $\theta\vp\leq\theta$ and $\theta\vp\leq\vp$ so $\theta\vp\leq\min\{\theta,\vp\}$. For the other side, $\theta\vp \geq 0$ is always true, so I just need to consider $\theta\vp$ versus $\theta + \vp - 1$. I’ll note that
$$
\theta + \vp - \theta\vp = \theta + \vp(1-\theta)
$$
is a convex combination of $1$ and $\vp$ and therefore lies in $[\vp,1]$. In particular, this means
$$
\theta + \vp - \theta\vp \leq 1 \\
\implies \theta\vp \geq \theta+\vp-1
$$
so $\theta\vp\in\mathcal P(\theta,\vp)$.
$\square$
The next question that arises is whether $\theta\vp$ is ever the only feasible point.
Result 2: conditions for $\P(\theta,\vp)$ to be a singleton.
If $\P(\theta,\vp)$ is a singleton then the only point it contains is $\theta\vp$.
$\P(\theta,\vp)$ is an interval so I can explore this by looking at its length, i.e.
$$
\ell(\theta,\vp) = \min\{\theta,\vp\} - \max\{\theta+\vp-1, 0\}.
$$
This is just a scalar-valued function of two bounded inputs so I can visualize it to see exactly what’s going on. I’ll keep $\vp$ along the $x$ axis to be consistent with the dimensions of $P$.
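The figure can be recreated by evaluating $\ell$ on a grid (e.g. feeding the result to matplotlib’s `imshow`); here is a small numpy sketch that also locates the grid maximum. The grid resolution is an arbitrary choice of mine:

```python
import numpy as np

# ell(theta, phi): the length of the feasible interval, evaluated over [0,1]^2
theta, phi = np.meshgrid(np.linspace(0, 1, 201), np.linspace(0, 1, 201),
                         indexing="ij")
ell = np.minimum(theta, phi) - np.maximum(theta + phi - 1, 0)

# locate the maximizer on the grid
i, j = np.unravel_index(ell.argmax(), ell.shape)
print(theta[i, j], phi[i, j], ell[i, j])  # the maximum sits at theta = phi = 1/2
```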
This shows me that $\ell$ attains a unique maximum at $\theta = \vp = \frac 12$, and is minimized (at zero) whenever $(\theta,\vp)$ lies on the boundary of $[0,1]^2$. I’ll prove this now.
For the maximum, I’ll break this into four cases.
Case 1a: $\theta \leq \vp$ and $\theta+\vp - 1 \leq 0 \iff \theta \leq 1 - \vp$. This reduces $\ell$ to $\ell(\theta,\vp) = \theta$, so I want to maximize $\theta$ over the set $\{\theta \leq \vp\} \cap \{\theta \leq 1-\vp\}$ in $[0,1]^2$. These two restrictions leave me with the triangle with vertices at $(0,0)$, $(0,1)$, and $\left(\frac 12, \frac 12\right)$, over which $\theta$ is maximized at the vertex $\left(\frac 12, \frac 12\right)$, i.e. at $\theta=\vp=\frac 12$.
Case 1b: $\theta \leq \vp$ and $\theta+\vp - 1 \geq 0 \iff \theta \geq 1 - \vp$. Now $\ell(\theta, \vp) = \theta - (\theta+\vp-1) = 1-\vp$. This means I want to minimize $\vp$ over the feasible set in $[0,1]^2$, which now is $\{\theta \leq \vp\}\cap\{\theta \geq 1-\vp\}$. This again leaves me with a triangular subset of $[0,1]^2$, except now the triangle points left, so the unique minimizer of $\vp$, which is the maximizer of $\ell$, is again $\theta=\vp=\frac 12$.
The other two cases lead to identical analyses where $\ell$ becomes simple and the optimum occurs at the edge of the feasible space, and in every case it is $\left(\frac 12, \frac 12\right)$ which confirms what the figure shows, namely that $\theta=\vp=\frac 12$ is the unique maximizer of $\ell$. And in this case we have
$$
\P\left(\frac 12, \frac 12\right) = \left[0, \frac 12\right].
$$
Now for the minimizers: I’ll consider the same cases.
Case 1a: $\theta \leq \vp$ and $\theta+\vp – 1 \leq 0$ leads to $\ell(\theta,\vp) = \theta$ which now I’m trying to minimize. It’s feasible for $\theta = 0$ so the minima here are all points of the form $\{\theta = 0, \vp \in [0,1]\}$.
The other cases lead to analogous results, so ultimately $\ell$ is minimized on the boundary of $[0,1]^2$, where it attains its minimal value of $0$.
As an example, suppose $\theta = 0$ and $\vp = \frac 13$. Then the resulting joint distribution is
$$
\begin{array}{cc|c}
\frac 23 & \frac 13 & 1 \\
0 & 0 & 0 \\
\hline
\frac 23 & \frac 13 &
\end{array}
$$
which is the only possibility, so $\P\left(0, \frac 13\right)$ is indeed a singleton.
$\square$
Result 3: Independence is equivalent to being uncorrelated for the Bernoulli case.
Pf: I have
$$
\Cov(X,Y) = \E(XY) - (\E X)(\E Y) \\
= p - \theta\vp.
$$
This means $\Cov(X,Y) = 0 \iff p = \theta\vp$, which in turn holds exactly when $X\perp Y$, since $p = \theta\vp$ makes every cell of the table factor as $P(X=i, Y=j) = P(X=i)P(Y=j)$. Thus for the Bernoulli case the covariance really does measure how close $X$ and $Y$ are to being independent: if $p$ is greater than what I’d expect under independence then there is positive correlation between $X$ and $Y$, and vice versa if $p < \theta\vp$.
$\square$
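This identity is easy to verify cell-by-cell from the joint table of the first section. A quick sketch (the helper name and example margins are my own):

```python
def bernoulli_cov(theta, phi, p):
    """Cov(X, Y) computed directly from the four cells of the Bernoulli joint table."""
    joint = {(0, 0): 1 - theta - phi + p, (0, 1): phi - p,
             (1, 0): theta - p, (1, 1): p}
    e_xy = sum(x * y * q for (x, y), q in joint.items())
    return e_xy - theta * phi

theta, phi = 0.75, 0.5
assert bernoulli_cov(theta, phi, theta * phi) == 0.0   # p = theta*phi: uncorrelated
print(bernoulli_cov(theta, phi, 0.5))  # → 0.125: positive correlation at p = min{theta, phi}
```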
Finitely Supported Variables
I’ll now generalize this to finitely supported discrete distributions. For a vector $\eta \in [0,1]^{k+1}$ with $\sum_{i=0}^k \eta_i = 1$ (I’m starting my indexing at zero), I’ll use $Z \sim \Disc(\eta)$ to mean $P(Z = i) = \eta_i$ for $i=0,1,\dots,k$.
As in the Bernoulli case I’ll have one row and column of parameters that are not free, so I will have $X\sim\Disc(\tilde \theta)$ and $Y\sim\Disc(\tilde \vp)$ where $\tilde \theta \in [0,1]^{m+1}$ and $\tilde \vp \in [0,1]^{n+1}$. The full joint distribution is then given by the $(m+1)\times(n+1)$ matrix $\tilde P$ with $\tilde P_{ij} = P(X=i, Y=j)$ for $i=0,\dots,m$ and $j=0,\dots,n$. I will be working with $\theta = (\theta_1,\dots,\theta_m)^T$ and $\vp = (\vp_1,\dots,\vp_n)^T$ which lead to my $m\times n$ matrix $P$ of free parameters. I’ll let $\one$ be a vector of ones of the appropriate size. My notation means $\tilde\theta^T\one = 1$ and $\theta^T\one \leq 1$, for example.
The constraints become
$$
\mathbf O \leq P \\
P\one \leq \theta \\
P^T\one \leq \vp \\
\one^TP\one \geq \theta^T\one + \vp^T\one - 1
$$
where the inequalities are elementwise (the first says every cell of the free block is nonnegative, generalizing $0\leq p$ from before).
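These elementwise constraints translate directly into a feasibility check. A numpy sketch (the function name and tolerance are my own choices):

```python
import numpy as np

def is_feasible(P, theta, phi, tol=1e-12):
    """Check whether the free block P satisfies the constraints above."""
    P, theta, phi = np.asarray(P), np.asarray(theta), np.asarray(phi)
    return bool(
        np.all(P >= -tol)                                   # entries are probabilities
        and np.all(P.sum(axis=1) <= theta + tol)            # P 1 <= theta
        and np.all(P.sum(axis=0) <= phi + tol)              # P^T 1 <= phi
        and P.sum() >= theta.sum() + phi.sum() - 1 - tol    # 1^T P 1 >= theta^T 1 + phi^T 1 - 1
    )

theta, phi = np.array([0.3, 0.2]), np.array([0.25, 0.25])
print(is_feasible(np.outer(theta, phi), theta, phi))  # → True: the independent joint works
```

Running it on $\theta\vp^T$ previews Result 4 below.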
Result 4: $\P(\theta,\vp)$ always contains at least $P_0:=\theta\vp^T$, which corresponds to $X\perp Y$.
Pf: I just need to verify that $P_0$ satisfies all the inequalities. Clearly $P_0 = \theta\vp^T \geq \mathbf O$, and
$$
P_0\one = (\vp^T\one)\theta \leq \theta
$$
since $\vp$ sums to at most one. Similarly
$$
P_0^T\one = (\theta^T\one)\vp \leq \vp.
$$
Finally, consider
$$
\begin{aligned}\theta^T\one + \vp^T\one - \one^TP_0\one &= \theta^T\one + \vp^T\one - (\theta^T\one)(\vp^T\one) \\&
= \theta^T\one \cdot 1 + (1 - \theta^T\one)\vp^T\one \leq 1\end{aligned}
$$
since this is a convex combination of $1$ and $\vp^T\one$, so overall I have
$$
\one^T P_0 \one \geq \theta^T\one + \vp^T\one - 1
$$
which means $P_0\in\P(\theta,\vp)$.
$\square$
Result 5: $\P(\theta,\vp)$ is convex.
Pf: Let $P,Q\in\P(\theta,\vp)$ and consider $R = \alpha P + (1-\alpha)Q$ for $\alpha \in [0,1]$. Checking the inequalities, I have
$$
R\one = \alpha P\one + (1-\alpha)Q\one \leq \alpha \theta + (1-\alpha)\theta = \theta
$$
and similarly I get $R^T\one \leq \vp$. For the third one,
$$
\begin{aligned}\one^TR\one &= \alpha \one^TP\one + (1-\alpha)\one^TQ\one \\&
\geq \alpha (\theta^T\one + \vp^T\one - 1) + (1-\alpha)(\theta^T\one + \vp^T\one - 1) \\&
= \theta^T\one + \vp^T\one – 1\end{aligned}
$$
so $R \in \P(\theta,\vp)$ which makes $\P(\theta,\vp)$ convex.
$\square$
Intuitively this makes sense because $\P(\theta,\vp)$ is the intersection of a number of half-spaces given by my various linear inequalities, so since there is always at least one feasible point, I know I’ll either have a singleton or a convex polytope.
Result 6: If $0 < \tilde\theta,\tilde\vp < 1$ (element-wise for both) then for any $\e > 0$ there is an element of $\P(\theta,\vp)$ within $\e$ of $P_0 := \theta\vp^T$, measured w.r.t. the Frobenius norm $\|\cdot\|_F$.
Pf: I’ll explicitly construct such an element. The construction perturbs a $2\times 2$ block, so I’ll assume $m,n\geq 2$ (the lower-dimensional cases are similar). Take
$$
\delta_0 = \min\left\{\frac{\min\{\theta_1,\vp_1\} - \theta_1\vp_1}{2}, \frac{\theta_1\vp_2}{2}, \frac{\theta_2\vp_1}{2}, \frac{\min\{\theta_2,\vp_2\} - \theta_2\vp_2}{2}\right\}
$$
and note $\delta_0 > 0$ by the assumption that $0 < \theta_1,\theta_2,\vp_1,\vp_2< 1$. Define
$$
E(\delta_0) = \left(\begin{array}{c|c}\begin{array}{cc}\;\;\delta_0&-\delta_0\\-\delta_0&\;\;\delta_0\end{array} & \mathbf O \\ \hline \mathbf O & \mathbf O \end{array}\right) \in \mathbb R^{m\times n}.
$$
Consider now $Q(\delta) := P_0 + E(\delta)$, which is $P_0$ with its leading $2\times 2$ block perturbed by $\pm \delta$. I’ll now show that $Q(\delta_0) \in \P(\theta,\vp)$.
Let $i,j\in\{1,2\}$. Suppose $i=j$. Then I have
$$
Q(\delta_0)_{ii} = \theta_i\vp_i + \delta_0 \\
\leq \theta_i\vp_i + \frac{\min\{\theta_i,\vp_i\} - \theta_i\vp_i}{2} \\
= \frac{\min\{\theta_i,\vp_i\} + \theta_i\vp_i}{2} \\
< \min\{\theta_i,\vp_i\}
$$
where the last line is because $0<\theta_i,\vp_i<1$ implies $\theta_i\vp_i < \min\{\theta_i,\vp_i\}$. This means this cell hasn’t exceeded its bound, and since the row and column sums are unchanged (via the signs in $E(\delta_0)$), $Q(\delta_0)_{ii}$ satisfies the constraints for $i=1,2$.
Now suppose $i\neq j$. Then
$$
Q(\delta_0)_{ij} = \theta_i\vp_j - \delta_0 \\
\geq \theta_i\vp_j - \theta_i\vp_j/2 \\
= \theta_i\vp_j/2 > 0
$$
since $\theta_i,\vp_j > 0$. This shows that the off-diagonal elements are still positive and therefore $Q(\delta_0) \in \P(\theta,\vp)$, and if $0 < \delta < \delta_0$ then $Q(\delta)\in \P$ too.
Let $\e > 0$. I have
$$
\|Q(\delta) - P_0\|_F = \|E(\delta)\|_F = 2\delta
$$
so taking $0 < \delta < \min\{\e/2, \delta_0\}$ I can produce an element of $\P$ that is within $\e$ of $P_0$ w.r.t. $\|\cdot\|_F$.
$\square$
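The construction in this proof is easy to carry out numerically. A sketch, with interior margins I chose arbitrarily for illustration:

```python
import numpy as np

theta = np.array([0.3, 0.2, 0.1])    # strictly interior reduced margins (sums < 1)
phi = np.array([0.25, 0.25, 0.2])
P0 = np.outer(theta, phi)

# delta_0 from the proof, then perturb the leading 2x2 block of P_0 by +/- delta
delta0 = min((min(theta[0], phi[0]) - theta[0] * phi[0]) / 2, theta[0] * phi[1] / 2,
             theta[1] * phi[0] / 2, (min(theta[1], phi[1]) - theta[1] * phi[1]) / 2)
delta = delta0 / 2
Q = P0.copy()
Q[:2, :2] += delta * np.array([[1.0, -1.0], [-1.0, 1.0]])

assert np.allclose(Q.sum(axis=1), P0.sum(axis=1))  # row sums unchanged
assert np.allclose(Q.sum(axis=0), P0.sum(axis=0))  # column sums unchanged
print(np.linalg.norm(Q - P0))  # Frobenius distance 2*delta, as in the proof
```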
Result 7: Covariance and independence for discrete $X$ and $Y$.
Let $\s_n = (1, 2,\dots, n)^T$. Then
$$
\begin{aligned}\Cov(X,Y) &= \E(XY) - (\E X)(\E Y) \\&
= \sum_{x=1}^m\sum_{y=1}^n xyP_{xy} - \left(\sum_{x=1}^m x \theta_x\right)\left(\sum_{y=1}^n y \vp_y\right) \\&
= \s_m^TP\s_n - \s_m^T\theta\vp^T\s_n \\&
= \s_m^T\left(P - \theta\vp^T\right)\s_n\end{aligned}
$$
which shows that $\Cov(X,Y)$ is like a measure of how far $P$ is from $P_0$, the distribution I’d have if it was actually the case that $X\perp Y$.
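This matrix form agrees with the direct double-sum; here’s a quick check on a random full joint (numpy; the shape and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.random((3, 4))
J /= J.sum()                      # a random full joint on {0,1,2} x {0,1,2,3}

theta = J.sum(axis=1)[1:]         # reduced margins (index 0 dropped)
phi = J.sum(axis=0)[1:]
P = J[1:, 1:]                     # free block
s_m, s_n = np.arange(1, 3), np.arange(1, 4)

cov_matrix_form = s_m @ (P - np.outer(theta, phi)) @ s_n
x, y = np.arange(3), np.arange(4)
cov_direct = x @ J @ y - (x @ J.sum(axis=1)) * (y @ J.sum(axis=0))
assert abs(cov_matrix_form - cov_direct) < 1e-12
```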
If $m = n = 1$ as in the first section, then it is the case that $\Cov(X,Y) = 0 \iff X\perp Y$. That is no longer the case here, as the following example shows:
$$
\begin{array}{ccc|c} \frac 16 & \frac 1{18} & \frac 19 & \frac 13 \\
\frac 1{18} & \frac 19 & \frac 16 & \frac 13 \\
\frac 19 & \frac 16 & \frac 1{18} & \frac 13 \\ \hline \frac 13 & \frac 13 & \frac 13 & \end{array}.
$$
Call this joint distribution $Q$. Here $\tilde \theta = \tilde \vp = \frac 13 \one$ so $P_0 = \frac 19 \one\one^T$, and clearly $Q \neq P_0$. Yet $\E X = \E Y = 1$ while $\E(XY) = 1\cdot 1\cdot \frac 19 + 1\cdot 2\cdot \frac 16 + 2\cdot 1\cdot \frac 16 + 2\cdot 2\cdot \frac 1{18} = 1$, so $\Cov(X,Y) = 0$ even though $X$ and $Y$ are dependent.
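A numerical sanity check of a joint of this kind, with uniform $\frac 13$ margins, zero covariance, and dependence:

```python
import numpy as np

# a joint on {0,1,2}^2 with uniform margins and zero covariance, yet not independent
Q = np.array([[1/6,  1/18, 1/9],
              [1/18, 1/9,  1/6],
              [1/9,  1/6,  1/18]])
s = np.arange(3)

assert np.allclose(Q.sum(axis=1), 1/3) and np.allclose(Q.sum(axis=0), 1/3)
cov = s @ Q @ s - (s @ Q.sum(axis=1)) * (s @ Q.sum(axis=0))
print(cov)                                   # numerically zero
print(np.allclose(Q, np.full((3, 3), 1/9)))  # → False: Q is not P_0
```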
Visualizing $\P(\theta,\vp)$
I’ll finish this by visualizing how $\P$ changes as $\theta$ and $\vp$ change. I’ll have $X\in\{0,1\}$ and $Y\in\{0,1,2\}$ so $P$ has two free parameters.
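As a sketch of what this visualization involves, the feasible region can be rasterized over the two free parameters $(p_1, p_2)$ and then drawn with something like matplotlib’s `imshow`. The margins and grid resolution below are arbitrary choices of mine:

```python
import numpy as np

def feasible_mask(theta, phi, n=400):
    """Boolean grid over [0,1]^2 marking feasible (p1, p2) = (P(X=1,Y=1), P(X=1,Y=2))."""
    p1, p2 = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
    return ((p1 <= phi[0]) & (p2 <= phi[1])           # column constraints P^T 1 <= phi
            & (p1 + p2 <= theta)                      # row constraint P 1 <= theta
            & (p1 + p2 >= theta + phi.sum() - 1))     # P(X=0, Y=0) >= 0

mask = feasible_mask(0.5, np.array([0.3, 0.3]))
print(mask.mean())  # fraction of the unit square that is feasible
```

Nonnegativity of $p_1, p_2$ is automatic here since the grid only covers $[0,1]^2$.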