$\newcommand{\Om}{\Omega}\newcommand{\F}{\mathscr F}\newcommand{\one}{\mathbf 1}\newcommand{\R}{\mathbb R}\newcommand{\e}{\varepsilon}\newcommand{\E}{\operatorname{E}}\newcommand{\Var}{\operatorname{Var}}\newcommand{\convas}{\stackrel{\text{a.s.}}{\to}}\newcommand{\w}{\omega}\newcommand{\N}{\mathbb N}\newcommand{\convp}{\stackrel{\text{p}}{\to}}$In this post I’m going to introduce almost sure convergence for sequences of random variables, compare it with convergence in probability, and prove a form of the strong law of large numbers (SLLN). I’ll finish with an application to random walks.
Let $(\Om, \F, P)$ be a probability space and let $\{X_n : n\in\N\}$ be a collection of random variables from $\Om$ to $\R$. The sequence $(X_n)$ is defined to converge almost surely (“a.s.”) to some limiting random variable $X$, denoted $X_n \convas X$, if for almost all $\w\in\Om$ $\lim_{n\to\infty} X_n(\w)$ exists and is equal to $X(\w)$. More succinctly,
$$
X_n \convas X \iff P(\lim X_n = X) = 1.
$$
Thinking of $\{X_n : n\in\N\}$ as a discrete time stochastic process, $X_n\convas X$ means that almost all sample paths converge to the output of a measurable function $X$. In the simplest case the limiting random variable is a.s. constant and this just means almost all sample paths have the same limit. The figure below gives an example of this: each of the three lines represents a single sample path $(X_1(\w), X_2(\w), \dots)$ coming from a process with $X_n \convas 1$, and indeed all three paths can be seen to converge to that limit. These were generated by taking the cumulative mean of iid $\mathcal N(1,1)$ random variables so the a.s. convergence is a consequence of the SLLN.
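For reference, here's a minimal sketch (Python with NumPy; not necessarily the exact code behind the figure) of how sample paths like these can be simulated: draw iid $\mathcal N(1,1)$ variables and track the running mean along each path.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_paths = 1000, 3

# iid N(1, 1) draws, one row per sample path
draws = rng.normal(loc=1.0, scale=1.0, size=(n_paths, n_steps))

# X_n(w) = (1/n) * sum_{i <= n} draws_i(w): the running mean along each path
paths = np.cumsum(draws, axis=1) / np.arange(1, n_steps + 1)

print(paths[:, -1])  # each path's final value should be close to 1
```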
I’ll start by proving that $\lim X_n$ is a measurable function when it exists.
Result 1: if $Y_1, Y_2, \dots$ is a sequence of Borel functions from $\Omega$ to $\R$, then $S : \Om\to\R$ with $S(\w) = \limsup Y_n(\w)$ is Borel.
Pf: I will prove this by showing that, for arbitrary $c\in\R$, the preimage $\{S \leq c\} := \{\w \in \Om : S(\w) \leq c\}$ can be written using countable unions and intersections of measurable sets. The closure of $\sigma$-algebras w.r.t. such operations then guarantees $\{S\leq c\}$ is measurable, and since $\{(-\infty, c] : c \in \R\}$ generates the Borel $\sigma$-algebra, checking these preimages is enough for $S$ to be Borel.
Fix $c\in \R$ and consider the preimage $\{S \leq c\}$. If this set is empty then it is trivially measurable, so WLOG I can assume it is nonempty and choose some arbitrary $\w\in\{S\leq c\}$.
I have $S(\w) = \limsup Y_n(\w)$ so by definition
$$
S(\w) = \lim_{n\to\infty} \sup_{k\geq n} Y_k(\w).
$$
By the definition of a limit this means $\forall \e > 0$ there is an $N\in \N$ such that $n \geq N \implies |\sup_{k\geq n} Y_k(\w) - S(\w)| < \e$.
The sequence of “upper tail” suprema $\{\sup_{k\geq n} Y_k(\w) : n \in \N\}$ is nonincreasing since moving from $n$ to $n+1$ just drops $Y_n(\w)$ from the set being supremized over, which can never increase the supremum. In particular each tail supremum sits at or above the limit $S(\w)$, which means
$$
|\sup_{k\geq n} Y_k(\w) - S(\w)| = \sup_{k\geq n} Y_k(\w) - S(\w)
$$
so for $n\geq N$ I have
$$
\sup_{k\geq n} Y_k(\w) < S(\w) + \e \leq c + \e
$$
where the 2nd inequality is because of my particular choice of $\w$.
If the $\sup$ of a collection is bounded from above then so is every element, so I now have the following: for my arbitrary $c\in\R$ and $\w\in\{S\leq c\}$, and for any $j \in \N_+$ (taking $\e = j^{-1}$ trades the arbitrary $\e$ for a countable family), there is some $N_j\in\N$ such that
$$
Y_k(\w) \leq c + j^{-1} \;\;\; \forall k\geq N_j.
$$
Let
$$
A_c := \bigcap_{j \geq 1}\bigcup_{N \geq 1} \bigcap_{k \geq N} \{Y_k \leq c + j^{-1}\}.
$$
I just showed that for every $j \geq 1$ there is an $N\in \N$ such that $\w \in \{Y_k \leq c + j^{-1}\}$ for all $k\geq N$, which means $\w\in A_c$ so $\{S\leq c\} \subseteq A_c$.
Conversely, let $\w \in A_c$. Then by the definition of $A_c$, for every $j \geq 1$ there is at least one $N\in\N$ such that $k \geq N \implies Y_k(\w) \leq c + j^{-1}$. This means $\sup_{k\geq N} Y_k(\w) \leq c + j^{-1}$, and since the tail suprema are nonincreasing the limit is bounded by this too, so $S(\w) \leq c + j^{-1}$. Since $j\geq 1$ was arbitrary,
$$
S(\w) = \limsup Y_k(\w) \leq c
$$
This shows that $A_c\subseteq\{S\leq c\}$, so $A_c = \{S\leq c\}$, and since $A_c$ is built entirely out of Borel sets, $\{S \leq c\}$ is Borel.
$\square$
I’ll use without proof the analogous result that $\w\mapsto \liminf Y_n(\w)$ is also measurable.
I now want to prove the following result:
Result 2: For any collection of Borel functions $\{Y_n : n \in \N\}$ both the set
$$
E := \{\w\in\Om : \lim_{n\to\infty} Y_n(\w) \text{ exists} \}
$$
and the function $Z : \Om\to\R$ defined by
$$
Z(\w) = \begin{cases}\lim_{n\to\infty} Y_n(\w) & \w \in E \\ Y_1(\w) & \text{o.w.}\end{cases}
$$
are measurable.
This result is good because it means that I don't have to worry about measurability in a statement like $P(\lim X_n = X)=1$: there's only a null set where the limit may not exist, and I can just send those $\w$ to $X_1(\w)$ without affecting anything.
Pf: $\lim Y_n(\w)$ existing is equivalent to $\limsup Y_n(\w) = \liminf Y_n(\w)$. Both of these functions are measurable so the difference $D : \Om\to\R$ given by $D(\w) = \limsup Y_n(\w) - \liminf Y_n(\w)$ is also measurable. $\{0\}$ is a Borel set so
$$
D^{-1}(\{0\}) = E \in \F.
$$
For $Z$, I will first prove the following. Let $f$ and $g$ be Borel and define
$$
h(\w) = \begin{cases} f(\w) & \w \in A \\ g(\w) & \text{o.w.}\end{cases}
$$
for some $A\in\F$. Fix $c \in \R$ and take $H = \{h\leq c\}\subseteq \Om$. Then
$$
H = H \cap (A\cup A^c) = (H\cap A) \cup (H \cap A^c).
$$
By construction $H\cap A = \{\w \in A : h(\w) \leq c\}$ and for all of these $\w$ I have $h(\w) = f(\w)$, so $H\cap A = \{f \leq c\}\cap A$, which is measurable. The same argument applied to $H \cap A^c$ shows that set is also measurable, so $H$ is the union of two measurable sets and is therefore measurable.
Applying this to $Z$: on $E$ I have $\lim_n Y_n = \limsup Y_n$, which is measurable, and $Y_1$ is also measurable, so the piecewise result above (with $f = \limsup Y_n$, $g = Y_1$, and $A = E$) applies and $Z$ is therefore Borel.
$\square$
I will compare this to convergence in probability, which as a reminder (see e.g. my post here) is defined by
$$
X_n \convp X \iff \forall \e > 0 \;\lim_n P(|X_n - X| > \e) = 0.
$$
Convergence in probability measures the size of the subset of $\Om$ leading to $|X_n - X| > \e$, and requires that the measure of these sets heads to zero as I advance along the sequence. The big difference between convergence a.s. and convergence in probability is that $X_n \convp X$ isn't about the individual trajectories. It can be the case that $\lim X_n$ exists almost nowhere and yet $X_n$ still converges in probability to something. The next result gives an explicit example of this.
Result 3: $X_n\convp X$ does not imply $X_n \convas X$.
Pf: I’ll prove this by explicitly constructing a sequence of random variables with these properties. Take $\Omega = [0,1)$, $\F = \mathbb B_{[0,1)}$ (so it’s the Borel $\sigma$-algebra generated by the open subsets of $[0,1)$), and $P = \lambda$, i.e. the Lebesgue measure (which is a probability measure on this sample space).
Now for each positive $n \in \N$ let $\mathcal P_n$ be the partition of $\Om$ formed by cutting $\Om$ up into $2^n$ adjacent half-open intervals, so $\mathcal P_n$ contains $2^n$ sets of the form $A_{nk} := [k/2^n, (k+1)/2^n)$ for $k = 0, \dots, 2^n-1$. Now I’ll create a triangular array of random variables $\{X_{nk} : n = 1, 2, \dots, k = 0, \dots, 2^n-1\}$ with $X_{nk} = \one_{A_{nk}}$ (all of the $A_{nk}$ are Borel so the $X_{nk}$ are all Borel too). I’ll flatten this into a sequence $(Y_m)$ by enumerating first along $k$ within each value of $n$ and then along $n$, so
$$
(Y_m) = (X_{10}, X_{11}, X_{20}, X_{21}, X_{22}, X_{23}, X_{30},\dots ).
$$
Fix $\e \in (0,1)$ (for $\e\geq 1$ the probability below is simply $0$, so there is nothing to check). If $Y_m$ came from level $n$ of the array then $P(|Y_m| > \e) = P(Y_m = 1) = P(A_{nk}) = 2^{-n}$. Since $m\to\infty$ forces $n\to\infty$, I get $P(|Y_m| > \e) \to 0$ for this arbitrary $\e$, which shows $Y_m\convp 0$.
But for almost sure convergence there's a problem. Any $\w\in\Om$ lies in exactly one element of each $\mathcal P_n$, so in each row of the triangular array (I'm thinking of $n$ indexing rows and $k$ columns) all of the $X_{nk}(\w) = 0$ except for the single one taking a value of $1$ on the piece of the partition containing $\w$. This is the case for every $n\in\N$, so $(Y_m(\w))$ hits both $0$ and $1$ infinitely often and therefore this trajectory doesn't converge to anything. This is true for every $\w\in\Om$, so in fact there is not a single element of the sample space with a convergent trajectory, and in particular $(Y_m)$ does not converge almost surely (or even pointwise anywhere) to any random variable.
$\square$
The crux of this proof was that convergence in probability doesn’t care about which $\w$s it’s measuring at each time. In this case the $A_{nk}$ had monotonically decreasing measure but the particular $\w$ in them jumped around so no individual trajectories converged even though at any given time $m$ the fraction of $\Om$ leading to $Y_m = 1$ was decreasing.
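To make this concrete, here's a small numerical sketch (Python with NumPy; purely illustrative, with my own indexing helper) of the construction: a fixed trajectory keeps hitting $1$ once per level of the array, while the probability of any individual late indicator being $1$ is tiny.

```python
import numpy as np

def trajectory(omega, n_levels):
    """Evaluate Y_1, Y_2, ... (up to level n_levels) at a single omega in [0, 1)."""
    vals = []
    for n in range(1, n_levels + 1):
        for k in range(2 ** n):                    # A_{nk} = [k/2^n, (k+1)/2^n)
            vals.append(1.0 if k / 2 ** n <= omega < (k + 1) / 2 ** n else 0.0)
    return np.array(vals)

# Each individual trajectory hits 1 exactly once per level, so it never converges
print(trajectory(0.37, n_levels=10).sum())  # = 10, one hit per level

# ... but P(Y_m = 1) = 2^{-n} -> 0: Monte Carlo check for the level-10, k = 0 indicator
rng = np.random.default_rng(0)
omegas = rng.uniform(0.0, 1.0, size=50_000)
print(np.mean(omegas < 2.0 ** -10), 2.0 ** -10)  # estimate vs exact
```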
The converse is true, though:
Result 4: $X_n\convas X \implies X_n\convp X$.
Pf: Fix $\e>0$. Define
$$
E_m = \{|X_m - X| > \e\}
$$
and
$$
A_n = \bigcup_{m \geq n} E_m
$$
so $A_n$ collects the $\w$ where there is at least one $m\geq n$ such that $X_m(\w)$ and $X(\w)$ are more than $\e$ apart. All of these sets are measurable so I don't need to worry about that.
$A_n = E_n \cup A_{n+1}$ so the sequence $(A_n)$ is nonincreasing and settles on
$$
\bigcap_{n\geq 1} A_n = \bigcap_{n\geq 1}\bigcup_{m\geq n} E_m = \limsup E_m.
$$
$\limsup E_m$ is exactly the $\w\in\Om$ that are in infinitely many of the $E_m$. If $\w$ is in infinitely many $E_m$ then it cannot be the case that $\lim X_n(\w) = X(\w)$ since $X_n(\w)$ never settles within $\e$ of $X(\w)$. But since $X_n \convas X$ it must be that this set has measure zero, i.e. $P(\limsup E_n) = 0$.
By the continuity of $P$ and the nesting of the $A_n$, I have
$$
0 = P(\limsup E_m) = P(\lim A_n) = \lim P(A_n) .
$$
$P(A_n) = P(E_n \cup A_{n+1}) \geq P(E_n)$ so this means
$$
0 = \lim P(E_n) = \lim P(|X_n - X| > \e)
$$
so $X_n\convp X$.
$\square$
I’m building up to proving a form of the strong law of large numbers but I’ll need a standard helper result first:
Borel-Cantelli lemma. Let $A_1, A_2, \dots \in \F$. Then
$$
\sum_{n=1}^\infty P(A_n) < \infty \implies P(\limsup A_n) = 0.
$$
In other words, this result says that if the probabilities of these events decay fast enough then almost all sample outcomes are only in finitely many of them.
Pf: I’ll build a new sequence of events $B_1, B_2, \dots$ where
$$
B_n = \bigcup_{m \geq n} A_m
$$
(this is very reminiscent of the proof of Result 4).
Then $B_1 \supseteq B_2 \supseteq \dots$ and $\bigcap_{n\geq 1} B_n = \limsup A_n$ by definition, so by the continuity of measure
$$
\lim P(B_n) = P(\limsup A_n).
$$
Also,
$$
P(B_n) = P\left(\bigcup_{m \geq n} A_m\right) \leq \sum_{m\geq n} P(A_m)
$$
by subadditivity. By assumption $\sum_{n=1}^\infty P(A_n) < \infty$ so these tail sums $\sum_{m\geq n} P(A_m)$ must converge to zero. This means
$$
\lim P(B_n) = P(\limsup A_n) = 0.
$$
$\square$
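As a quick numerical illustration (a toy sketch; I make the $A_n$ independent purely to keep the simulation simple, even though the lemma needs no independence): with $P(A_n) = n^{-2}$, which is summable, each simulated outcome ends up in only a handful of the $A_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_events, n_samples = 100_000, 20

# P(A_n) = 1/n^2, so sum_n P(A_n) = pi^2/6 < infinity
probs = 1.0 / np.arange(1, n_events + 1) ** 2

# simulate independent events A_1, ..., A_{n_events} for each of n_samples outcomes
hits = rng.uniform(size=(n_samples, n_events)) < probs

print(hits.sum(axis=1))  # number of A_n each outcome falls into: a small finite count
```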
I’m now ready to prove my main result.
SLLN: Let $X_1, X_2, \dots$ be an iid sequence of random variables with $\E X_1^4 < \infty$, and let $S_n = \sum_{i\leq n} X_i$. Then $S_n / n \convas \E X_1$.
The full SLLN only requires $\E |X_1| < \infty$, but the fourth moment assumption allows a much simpler proof, which is why I'm doing this version even though the assumption is strictly stronger. In a future post I'll go through the proof of the full SLLN.
Pf: Since $S_n/n - \E X_1 = \frac 1n\sum_{i\leq n} (X_i - \E X_1)$, WLOG I can take $\E X_1 = 0$ and show $S_n /n \convas 0$. Fix $\e > 0$. Then
$$
P(|S_n/n| > \e) = P(|S_n/n|^4 > \e^4) \leq \frac{\E S_n^4}{n^4\e^4}
$$
by Markov’s inequality (I prove this in this post). Expanding $S_n^4$ via the multinomial theorem I have
$$
S_n^4 = \sum_{k_1+\dots+k_n=4} \binom{4}{k_1, \dots, k_n} \prod_{m=1}^n X_m^{k_m}.
$$
Since the power of $4$ is small relative to the number of terms $n$, most of these indices will be zero. I'll be taking the expected value of $S_n^4$, and by independence any term containing some $X_i$ raised to the first power factors into a product with a standalone $\E X_i = 0$ in it, so I only need the terms built entirely from $X_i^2$ and $X_i^4$ factors.
4th moment terms arise when one $k_m=4$, and in those cases all the other $k_\ell=0$. This means I’ll have a $\sum_{i=1}^n X_i^4$ term (the multinomial coefficient for all of these is just $1$).
For the 2nd powers I need exactly two of the $k_m$ to be $2$, so the multinomial coefficient is $6$ and overall I get
$$
\frac 12 \cdot 6\sum_{i\neq j} X_i^2X_j^2
$$
where the $\frac 12$ is because my indexing double counts each term.
Putting this all together I have
$$
\E S_n^4 = n \E X_1^4 + 3 n(n-1) (\E X_1^2)^2
$$
which means
$$
P(|S_n/n| > \e) \leq \frac{n \E X_1^4 + 3 n(n-1) (\E X_1^2)^2}{n^4\e^4} \leq \e^{-4}\left( n^{-3}\E X_1^4 + 3n^{-2} (\E X_1^2)^2\right).
$$
The right hand side is summable so this means
$$
\sum_{n=1}^\infty P(|S_n/n| > \e) < \infty.
$$
Taking $A_n = \{|S_n/n| > \e\}$, I can use Borel-Cantelli to conclude
$$
P(\limsup A_n) = 0.
$$
But $\limsup A_n$ is exactly the set of $\w$ that land in infinitely many of these events, so for this fixed $\e$ almost every $\w$ has $|S_n(\w)/n| \leq \e$ for all but finitely many $n$. Running over $\e = 1/k$ for $k\in\N$ and taking the union of the resulting countably many null sets gives a single null set off of which $S_n(\w)/n \to 0$, so $S_n/n\convas 0$.
$\square$
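As a sanity check on the moment identity above, here's a small Monte Carlo sketch (Python with NumPy; the Uniform$(-1,1)$ increments are my own illustrative choice, giving $\E X_1 = 0$, $\E X_1^2 = 1/3$ and $\E X_1^4 = 1/5$) comparing a simulated $\E S_n^4$ to $n\E X_1^4 + 3n(n-1)(\E X_1^2)^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 10, 200_000

# iid Uniform(-1, 1) increments: E X = 0, E X^2 = 1/3, E X^4 = 1/5
x = rng.uniform(-1.0, 1.0, size=(n_sims, n))

mc_fourth_moment = (x.sum(axis=1) ** 4).mean()           # Monte Carlo estimate of E S_n^4
exact = n * (1 / 5) + 3 * n * (n - 1) * (1 / 3) ** 2     # formula from the proof (= 32 here)

print(mc_fourth_moment, exact)  # the two values should be close
```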
One way to think of this that I like is using random walks. Given an iid collection of random variables $(X_n)_{n\in\N}$, I can form a random walk $(S_n)_{n\in\N}$ by the cumulative addition of $(X_n)$, so $S_n = \sum_{i\leq n} X_i$. Typically the idea is that $(X_n)$ represents some white noise process and I'm (discretely, in this case) integrating $(X_n)$ to get $(S_n)$, which wanders around according to the fluctuations of the $X_n$. Taking $\E X_1 = 0$, if I assume $\E |X_1| < \infty$ then the SLLN applies so $S_n / n \convas 0$. In terms of sample paths, this means that almost all sample space elements have trajectories $S_n(\w)$ that grow sublinearly, i.e. are dominated by $n$. It's certainly not the case that every trajectory is eventually squashed to $0$ after dividing by $n$, but the set of outcomes leading to non-squashed trajectories has measure zero.
As an example, if $X_i \stackrel{\text{iid}}\sim\mathcal N(0,1)$, so $(S_n)$ is a discrete time Brownian motion, then the SLLN tells me that $S_n/n\convas 0$. But there are trajectories with, say, $X_i(\w) \geq 2$ for all $i\in\N$, which keep $S_n(\w)/n$ from ever converging to $0$ (here it would stay at or above $2$). Because $\E |X_1| < \infty$ and the $X_i$ are iid, though, consistently producing large jumps in the same direction like this is rare enough that the whole collection of sample space elements with such trajectories has measure zero.
For comparison, if the $X_i$ were iid Cauchy, so $\E |X_1| = \infty$, then the SLLN doesn’t hold and intuitively this makes sense because no matter how far along I am in a trajectory, there’s a high enough probability of a massive shock that moves me to a new location so I don’t get convergence on a large subset of my sample space. Below is a plot of one trajectory of such a random walk.
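For what it's worth, here's a minimal sketch (Python with NumPy and matplotlib; not necessarily how the actual plot was produced) of simulating one such trajectory of $S_n/n$ with Cauchy increments:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_steps = 10_000

x = rng.standard_cauchy(size=n_steps)                  # iid Cauchy(0, 1) increments
s_over_n = np.cumsum(x) / np.arange(1, n_steps + 1)    # S_n / n along one sample path

plt.plot(s_over_n)
plt.xlabel("n")
plt.ylabel("S_n / n")
plt.show()
```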
If $X\sim\text{Cauchy}(0,1)$ (i.e. location of $0$ and scale of $1$) then the characteristic function of $X$ is $\varphi_X(t) = e^{-|t|}$ (as a total side comment, the distribution being symmetric around zero implies that the characteristic function is purely real-valued as I explore in this post). Then
$$
\varphi_{S_n/n}(t) = \varphi_{S_n}(t/n) = \varphi_X(t/n)^n = \left(e^{-|t|/n}\right)^n = e^{-|t|}
$$
so $S_n/n \sim \text{Cauchy}(0,1)$ too. In words, the heavy tails of the Cauchy distribution are strong enough to overpower the effect of averaging: at every step of the random walk, even after dividing by $n$, $S_n/n$ has exactly the same distribution as $X_1$, so there is no convergence to a constant (almost surely, or even in probability). This shows the importance of the $\E|X_1| < \infty$ condition.
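A rough numerical check of this (illustrative only): simulate many Cauchy random walks, form $S_n/n$ for a moderately large $n$, and compare the empirical quartiles to the exact Cauchy$(0,1)$ quartiles of $-1$, $0$, and $1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 1_000, 10_000

x = rng.standard_cauchy(size=(n_sims, n))
s_over_n = x.mean(axis=1)  # S_n / n for each simulated walk

# quartiles of Cauchy(0, 1) are -1, 0, 1; averaging hasn't tightened the distribution at all
print(np.quantile(s_over_n, [0.25, 0.5, 0.75]))
```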