Almost sure convergence and the strong law of large numbers

In this post I’m going to introduce almost sure convergence for sequences of random variables, compare it with convergence in probability, and prove a form of the strong law of large numbers (SLLN). I’ll finish with an application to random walks.

Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{X_n : n \in \mathbb{N}\}$ be a collection of random variables from $\Omega$ to $\mathbb{R}$. The sequence $(X_n)$ is defined to converge almost surely ("a.s.") to some limiting random variable $X$, denoted $X_n \xrightarrow{\text{a.s.}} X$, if for almost all $\omega \in \Omega$, $\lim_{n\to\infty} X_n(\omega)$ exists and is equal to $X(\omega)$. More succinctly,

$$X_n \xrightarrow{\text{a.s.}} X \iff P\left(\lim X_n = X\right) = 1.$$

Thinking of $\{X_n : n \in \mathbb{N}\}$ as a discrete time stochastic process, $X_n \xrightarrow{\text{a.s.}} X$ means that almost all sample paths converge to the output of a measurable function $X$. In the simplest case the limiting random variable is a.s. constant and this just means almost all sample paths have the same limit. The figure below gives an example of this: each of the three lines represents a single sample path $(X_1(\omega), X_2(\omega), \dots)$ coming from a process with $X_n \xrightarrow{\text{a.s.}} 1$, and indeed all three paths can be seen to converge to that limit. These were generated by taking the cumulative mean of iid $\mathcal{N}(1,1)$ random variables, so the a.s. convergence is a consequence of the SLLN.

[Figure: almost sure convergence of three sample paths to 1]
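As a quick aside, here's a minimal sketch of how paths like these can be simulated (this is my own reconstruction with numpy, not the code behind the figure; the seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_steps, n_paths = 1000, 3

# Each path is the running mean of iid N(1, 1) draws; by the SLLN
# almost every such path converges to the common mean of 1.
draws = rng.normal(loc=1.0, scale=1.0, size=(n_paths, n_steps))
paths = np.cumsum(draws, axis=1) / np.arange(1, n_steps + 1)

print(paths[:, -1])  # all three final values should sit near 1
```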

I'll start by proving that $\lim X_n$ is a measurable function when it exists.

Result 1: if $Y_1, Y_2, \dots$ is a sequence of Borel functions from $\Omega$ to $\mathbb{R}$, then $S : \Omega \to \mathbb{R}$ with $S(\omega) = \limsup Y_n(\omega)$ is Borel.

Pf: I will prove this by showing that $\{S \le c\} := \{\omega \in \Omega : S(\omega) \le c\}$ for arbitrary $c \in \mathbb{R}$ is equal to a countable union and intersection of measurable sets. The closure of $\sigma$-algebras w.r.t. such operations then guarantees $\{S \le c\}$ to be measurable. Since $\{(-\infty, c] : c \in \mathbb{R}\}$ generates the Borel $\sigma$-algebra, this is sufficient for the result.

Fix $c \in \mathbb{R}$ and consider the preimage $\{S \le c\}$. If this set is empty then it is measurable, so WLOG I can assume it is nonempty and choose some arbitrary $\omega \in \{S \le c\}$.

I have $S(\omega) = \limsup Y_n(\omega)$, so by definition

$$S(\omega) = \lim_{n\to\infty} \sup_{k \ge n} Y_k(\omega).$$

By the definition of a limit this means for every $\varepsilon > 0$ there is an $N \in \mathbb{N}$ such that $n \ge N \implies \left|\sup_{k \ge n} Y_k(\omega) - S(\omega)\right| < \varepsilon$.

The sequence of "upper tail" suprema $\left\{\sup_{k \ge n} Y_k(\omega) : n \in \mathbb{N}\right\}$ is nonincreasing since at each step $n$ the supremum either stays the same (if it is determined by values beyond $Y_n(\omega)$) or decreases (if $Y_n(\omega)$ was the sole value responsible for the supremum). This monotonicity means

$$\left|\sup_{k \ge n} Y_k(\omega) - S(\omega)\right| = \sup_{k \ge n} Y_k(\omega) - S(\omega)$$

so for $n \ge N$ I have

$$\sup_{k \ge n} Y_k(\omega) < S(\omega) + \varepsilon \le c + \varepsilon$$

where the second inequality is because of my particular choice of $\omega$.

When the sup of a collection is bounded from above then so is every element, so I now have that, for my arbitrary $c \in \mathbb{R}$ and $\omega \in \{S \le c\}$, for any $j \in \mathbb{N}^+$ ($j^{-1}$ is now playing the role of $\varepsilon$ and removes the dependence on an arbitrary constant) there is some $N_j \in \mathbb{N}$ such that

$$Y_k(\omega) \le c + j^{-1} \quad \forall k \ge N_j.$$

Let

$$A_c := \bigcap_{j \ge 1} \bigcup_{N \ge 1} \bigcap_{k \ge N} \left\{Y_k \le c + j^{-1}\right\}.$$

I just showed that for every $j \ge 1$ there is an $N \in \mathbb{N}$ such that $\omega \in \{Y_k \le c + j^{-1}\}$ for all $k \ge N$, which means $\omega \in A_c$, so $\{S \le c\} \subseteq A_c$.

Conversely, let $\omega \in A_c$. Then by the same reasoning I know that for every $j \ge 1$ there is at least one $N \in \mathbb{N}$ such that $k \ge N \implies Y_k(\omega) \le c + j^{-1}$. This means $\sup_{k \ge N} Y_k(\omega) \le c + j^{-1}$, and therefore

$$S(\omega) = \limsup Y_k(\omega) \le c$$

by monotonicity and the fact that $j$ was arbitrary. This shows that $A_c \subseteq \{S \le c\}$, so $A_c = \{S \le c\}$. $A_c$ is built entirely out of Borel sets, so this shows that $\{S \le c\}$ is Borel.

◻

I'll use without proof the analogous result that $\omega \mapsto \liminf Y_n(\omega)$ is also measurable.

I now want to prove the following result:

Result 2: For any collection of Borel functions $\{Y_n : n \in \mathbb{N}\}$, both the set

$$E := \left\{\omega \in \Omega : \lim_{n\to\infty} Y_n(\omega) \text{ exists}\right\}$$

and the function $Z : \Omega \to \mathbb{R}$ defined by

$$Z(\omega) = \begin{cases} \lim_{n\to\infty} Y_n(\omega) & \omega \in E \\ Y_1(\omega) & \text{o.w.} \end{cases}$$

are measurable.

This result is good because it means that I don't have to worry about measurability when I have something like $P(\lim X_n = X) = 1$: there's only a null set where the limit may not exist, and I can just send those $\omega$ to $X_1(\omega)$ without affecting anything.

Pf: $\lim Y_n(\omega)$ existing is equivalent to $\limsup Y_n(\omega) = \liminf Y_n(\omega)$. Both of these functions are measurable, so the function $h : \Omega \to \mathbb{R}$ given by $h(\omega) = \limsup Y_n(\omega) - \liminf Y_n(\omega)$ is also measurable. $\{0\}$ is a Borel set, so

$$h^{-1}(\{0\}) = E \in \mathcal{F}.$$
For $Z$, I will first prove the following. Let $f$ and $g$ be Borel and define

$$h(\omega) = \begin{cases} f(\omega) & \omega \in A \\ g(\omega) & \text{o.w.} \end{cases}$$

for some $A \in \mathcal{F}$. Fix $c \in \mathbb{R}$ and take $H = \{h \le c\} \subseteq \Omega$. Then

$$H = H \cap (A \cup A^c) = (H \cap A) \cup (H \cap A^c).$$

By construction $H \cap A = \{\omega \in A : h(\omega) \le c\}$, and for all of these $\omega$ I have $h(\omega) = f(\omega)$, so $H \cap A = \{f \le c\} \cap A$ and that is measurable. The same argument applied to $H \cap A^c$ shows that set is also measurable, so $H$ is the union of two measurable sets and by the closure of $\mathcal{F}$ under unions $H$ is also measurable.

Applying this to $Z$: on $E$ I have $\lim_n Y_n = \limsup Y_n$, which is measurable, and $Y_1$ is also measurable, so the above result applies with $f = \limsup Y_n$, $g = Y_1$, and $A = E$, and $Z$ is therefore Borel.

◻

I will compare this to convergence in probability, which as a reminder (see e.g. my post here) is defined by

$$X_n \xrightarrow{p} X \iff \forall \varepsilon > 0, \; \lim_{n\to\infty} P(|X_n - X| > \varepsilon) = 0.$$

Convergence in probability measures the size of the subset of $\Omega$ leading to $|X_n - X| > \varepsilon$, and requires that the measure of these sets heads to zero as I advance along the sequence. The big difference between convergence a.s. and convergence in probability is that $X_n \xrightarrow{p} X$ isn't about the individual trajectories. It can be the case that $\lim X_n$ exists almost nowhere and yet $X_n$ still converges in probability to something. The next result gives an explicit example of this.

Result 3: $X_n \xrightarrow{p} X$ does not imply $X_n \xrightarrow{\text{a.s.}} X$.

Pf: I'll prove this by explicitly constructing a sequence of random variables with these properties. Take $\Omega = [0,1)$, $\mathcal{F} = \mathcal{B}_{[0,1)}$ (the Borel $\sigma$-algebra generated by the open subsets of $[0,1)$), and $P = \lambda$, i.e. the Lebesgue measure (which is a probability measure on this sample space).

Now for each positive $n \in \mathbb{N}$ let $\mathcal{P}_n$ be the partition of $\Omega$ formed by cutting $\Omega$ up into $2^n$ adjacent half-open intervals, so $\mathcal{P}_n$ contains $2^n$ sets of the form $A_{nk} := [k/2^n, (k+1)/2^n)$ for $k = 0, \dots, 2^n - 1$. Now I'll create a triangular array of random variables $\{X_{nk} : n = 1, 2, \dots, \; k = 0, \dots, 2^n - 1\}$ with $X_{nk} = \mathbf{1}_{A_{nk}}$ (all of the $A_{nk}$ are Borel so the $X_{nk}$ are all Borel too). I'll flatten this into a sequence $(Y_m)$ by enumerating first along $k$ within each value of $n$ and then along $n$, so

$$(Y_m) = (X_{10}, X_{11}, X_{20}, X_{21}, X_{22}, X_{23}, X_{30}, \dots).$$

Fix $\varepsilon > 0$. For $\varepsilon < 1$ I have $P(|Y_m| > \varepsilon) = P(Y_m = 1) = P(A_{nk}) = 2^{-n}$ for the corresponding $n$ (and for $\varepsilon \ge 1$ the probability is simply zero). Since $m \to \infty$ implies $n \to \infty$, $P(|Y_m| > \varepsilon) \to 0$ for this arbitrary $\varepsilon > 0$, which shows $Y_m \xrightarrow{p} 0$.

But for almost sure convergence there's a problem. Any $\omega \in \Omega$ will be in exactly one element of each $\mathcal{P}_n$, so in each row of the triangular array (I'm thinking of $n$ indexing rows and $k$ columns) I'll have all of the $X_{nk}(\omega) = 0$ except for the single one taking a value of $1$ for the piece of the partition containing $\omega$. This is the case for every $n \in \mathbb{N}$, so $(Y_m(\omega))$ hits both $0$ and $1$ infinitely often and therefore this trajectory doesn't converge to anything. This is true for every $\omega \in \Omega$, so in fact there is not a single element of the sample space with a convergent trajectory. This means $(Y_m)$ has no almost sure (or even pointwise) limit.

◻

The crux of this proof was that convergence in probability doesn't care about which $\omega$s it's measuring at each time. In this case the $A_{nk}$ had measure decreasing to zero, but the particular $\omega$ in them jumped around, so no individual trajectories converged even though at any given time $m$ the fraction of $\Omega$ leading to $Y_m = 1$ was shrinking.
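Here's a small sketch simulating this "typewriter" sequence (the helper name and parameters are my own, purely for illustration): every row of the array catches any fixed $\omega$ exactly once, so the flattened trajectory keeps returning to $1$ even as the probability of that event shrinks.

```python
import numpy as np

def typewriter_rows(omega: float, n_rows: int) -> list[list[int]]:
    """Row n holds X_nk(omega) = 1 if omega is in [k/2^n, (k+1)/2^n)."""
    return [
        [1 if k / 2**n <= omega < (k + 1) / 2**n else 0 for k in range(2**n)]
        for n in range(1, n_rows + 1)
    ]

rows = typewriter_rows(omega=0.3, n_rows=10)

# Each row contains exactly one 1, so the flattened sequence (Y_m(omega))
# hits 1 infinitely often and cannot converge...
print([sum(row) for row in rows])             # [1, 1, 1, ..., 1]
# ...yet P(Y_m = 1) = 2^-n within row n, which heads to zero.
print([1 / 2**n for n in range(1, 11)][-3:])  # [~0.004, ~0.002, ~0.001]
```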

The converse is true, though:

Result 4: $X_n \xrightarrow{\text{a.s.}} X \implies X_n \xrightarrow{p} X$

Pf: Fix $\varepsilon > 0$. Define

$$E_m = \{|X_m - X| > \varepsilon\}$$

and

$$A_n = \bigcup_{m \ge n} E_m,$$

so $A_n$ collects the $\omega$ where there is at least one $m \ge n$ such that $X_m(\omega)$ and $X(\omega)$ are more than $\varepsilon$ apart. All of these sets are measurable so I don't need to worry about that.

$A_n = E_n \cup A_{n+1}$, so the sequence $(A_n)$ is nonincreasing and settles on

$$\bigcap_{n \ge 1} A_n = \bigcap_{n \ge 1} \bigcup_{m \ge n} E_m = \limsup E_m.$$

$\limsup E_m$ is exactly the set of $\omega \in \Omega$ that are in infinitely many of the $E_m$. If $\omega$ is in infinitely many $E_m$ then it cannot be the case that $\lim X_n(\omega) = X(\omega)$, since $X_n(\omega)$ never settles within $\varepsilon$ of $X(\omega)$. But since $X_n \xrightarrow{\text{a.s.}} X$ it must be that this set has measure zero, i.e. $P(\limsup E_n) = 0$.

By the continuity of $P$ and the nesting of the $A_n$, I have

$$0 = P(\limsup E_m) = P\left(\lim A_n\right) = \lim P(A_n).$$

$P(A_n) = P(E_n \cup A_{n+1}) \ge P(E_n)$, so this means

$$0 = \lim P(E_n) = \lim P(|X_n - X| > \varepsilon)$$

so $X_n \xrightarrow{p} X$.

◻
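Result 4 can also be seen numerically. A hedged sketch (numpy; the seed, tolerance, and sizes are arbitrary choices of mine) using the cumulative-mean process from the first figure, which converges a.s. to $1$: the estimated $P(|X_n - 1| > \varepsilon)$ falls as $n$ grows, exactly as convergence in probability demands.

```python
import numpy as np

# Estimate P(|X_n - 1| > eps) for X_n = mean of n iid N(1,1) draws,
# a process that converges a.s. (hence in probability) to 1.
rng = np.random.default_rng(1)  # arbitrary seed
reps, n_max, eps = 10_000, 1000, 0.1
draws = rng.normal(1.0, 1.0, size=(reps, n_max))
X = np.cumsum(draws, axis=1) / np.arange(1, n_max + 1)

for n in (10, 100, 1000):
    print(n, np.mean(np.abs(X[:, n - 1] - 1.0) > eps))
# Expected output: roughly 0.75, 0.32, 0.002 (shrinking toward zero).
```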

I’m building up to proving a form of the strong law of large numbers but I’ll need a standard helper result first:

Borel-Cantelli lemma. Let $A_1, A_2, \dots \in \mathcal{F}$. Then

$$\sum_{n=1}^\infty P(A_n) < \infty \implies P(\limsup A_n) = 0.$$

In other words, this result says that if the probabilities of these events decay fast enough then almost all sample outcomes are in only finitely many of them.

Pf: I'll build a new sequence of events $B_1, B_2, \dots$ where

$$B_n = \bigcup_{m \ge n} A_m$$
(this is very reminiscent of the proof of Result 4).

Then $B_1 \supseteq B_2 \supseteq \cdots$ and $\bigcap_n B_n = \limsup A_m$ by definition. By the continuity of measure,

$$\lim P(B_n) = P(\limsup A_n).$$

Also,

$$P(B_n) = P\left(\bigcup_{m \ge n} A_m\right) \le \sum_{m \ge n} P(A_m)$$

by subadditivity. By assumption $\sum_{n=1}^\infty P(A_n) < \infty$, so these tail sums $\sum_{m \ge n} P(A_m)$ must converge to zero. This means

$$\lim P(B_n) = P(\limsup A_n) = 0.$$
◻
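As a concrete illustration (a toy example of my own, not part of the proof): on $([0,1), \mathcal{B}_{[0,1)}, \lambda)$ take $A_n = [0, n^{-2})$, so $\sum_n P(A_n) = \pi^2/6 < \infty$. Borel-Cantelli says almost every $\omega$ lands in only finitely many $A_n$, and here that's easy to count directly:

```python
import numpy as np

# A_n = [0, 1/n^2): omega is in A_n iff n < omega**-0.5, so each
# omega > 0 belongs to only finitely many events, matching Borel-Cantelli.
rng = np.random.default_rng(2)  # arbitrary seed
for omega in rng.uniform(0, 1, size=5):
    n_events = int(np.ceil(omega ** -0.5)) - 1  # count of n with omega < 1/n^2
    print(f"omega = {omega:.4f} lies in {n_events} of the A_n")
```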

I’m now ready to prove my main result.

SLLN: Let $X_1, X_2, \dots$ be an iid sequence of random variables with $E X_1^4 < \infty$, and let $S_n = \sum_{i \le n} X_i$. Then $S_n / n \xrightarrow{\text{a.s.}} E X_1$.

The full SLLN only requires $E|X_1| < \infty$, but the fourth moment assumption allows a much simpler proof, which is why I'm only doing this version. In a future post I'll go through the proof of the full SLLN.

Pf: I'll be considering terms like $|S_n/n - E X_1|$, so WLOG I can just take $E X_1 = 0$ and show $S_n/n \xrightarrow{\text{a.s.}} 0$. Fix $\varepsilon > 0$. Then

$$P(|S_n/n| > \varepsilon) = P(|S_n/n|^4 > \varepsilon^4) \le \frac{E S_n^4}{n^4 \varepsilon^4}$$

by Markov's inequality (I prove this in this post). Expanding $S_n^4$ via the multinomial theorem I have

$$S_n^4 = \sum_{k_1 + \dots + k_n = 4} \binom{4}{k_1, \dots, k_n} \prod_{m=1}^n X_m^{k_m}.$$
Since the power of $4$ is small relative to the number of terms $n$, many of these indices will be zero. I'll be taking the expected value of $S_n^4$, and by independence any term containing some $X_i$ raised to the first power factors out an $E X_i = 0$ and vanishes, so I only need to figure out which terms have only $X_i^2$ or $X_i^4$ factors in them.

4th moment terms arise when one $k_m = 4$, and in those cases all the other $k$'s are $0$. This means I'll have a $\sum_{i=1}^n X_i^4$ term (the multinomial coefficient for all of these is just $1$).

For the 2nd powers I need exactly two of the $k_m$ to be $2$, so the multinomial coefficient is $6$ and overall I get

$$\frac{1}{2} \cdot 6 \sum_{i \ne j} X_i^2 X_j^2$$

where the $\frac{1}{2}$ is because my indexing double counts each term.

Putting this all together I have

$$E S_n^4 = n E X_1^4 + 3 n (n-1) (E X_1^2)^2$$

which means

$$P(|S_n/n| > \varepsilon) \le \frac{n E X_1^4 + 3 n (n-1) (E X_1^2)^2}{n^4 \varepsilon^4} \le \varepsilon^{-4} \left(n^{-3} E X_1^4 + 3 n^{-2} (E X_1^2)^2\right).$$
The right hand side is summable, so this means

$$\sum_{n=1}^\infty P(|S_n/n| > \varepsilon) < \infty.$$

Taking $A_n = \{|S_n/n| > \varepsilon\}$, I can use Borel-Cantelli to conclude

$$P(\limsup A_n) = 0.$$
But $\limsup A_n$ is exactly the subset of $\Omega$ with $\omega$ that are in infinitely many of these events, so almost every $\omega$ has $|S_n(\omega)/n| \le \varepsilon$ for all but finitely many $n$. Running this argument over $\varepsilon = 1/j$ for $j \in \mathbb{N}$ and taking the union of the resulting countably many null sets shows that almost all sample paths converge to zero, so $S_n/n \xrightarrow{\text{a.s.}} 0$.

◻
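The moment identity at the heart of the proof is easy to sanity-check numerically. A minimal sketch, assuming $X_i \sim \mathcal{N}(0,1)$ (so $E X_1^2 = 1$ and $E X_1^4 = 3$, making the exact value $3n + 3n(n-1) = 3n^2$):

```python
import numpy as np

# Monte Carlo check of E S_n^4 = n E X_1^4 + 3 n (n-1) (E X_1^2)^2
# with X_i ~ N(0,1), where the right hand side reduces to 3 n^2.
rng = np.random.default_rng(3)  # arbitrary seed
n, reps = 20, 200_000
s = rng.normal(size=(reps, n)).sum(axis=1)

print(np.mean(s ** 4))              # Monte Carlo estimate, ~1200 +/- ~20
print(n * 3 + 3 * n * (n - 1) * 1)  # exact: 3 * 20^2 = 1200
```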

One way to think of this that I like is using random walks. Given an iid collection of random variables $(X_n)_{n \in \mathbb{N}}$, I can form a random walk $(S_n)_{n \in \mathbb{N}}$ by the cumulative addition of $(X_n)$, so $S_n = \sum_{i \le n} X_i$. Typically the idea is that $(X_n)$ represents some white noise process and I'm (discretely, in this case) integrating $(X_n)$ to get $(S_n)$, which wanders around according to the fluctuations of the $X_n$. Taking $E X_1 = 0$, if I assume $E|X_1| < \infty$ then the SLLN applies, so $S_n/n \xrightarrow{\text{a.s.}} 0$. In terms of sample paths, this means that almost all sample space elements have trajectories that are dominated by $n$. It's certainly not the case that all trajectories are eventually squashed to $0$, but the set of outcomes that lead to non-squashed trajectories is measure zero.

As an example, if $X_i \overset{\text{iid}}{\sim} \mathcal{N}(0,1)$, so $(S_n)$ is a discrete time Brownian motion, then the SLLN tells me that $S_n/n \xrightarrow{\text{a.s.}} 0$. But there are trajectories that have $X_i(\omega) \ge 2$ for all $i \in \mathbb{N}$ and lead to $S_n(\omega)/n$ not converging to $0$, or even diverging to $\pm\infty$. But because $E|X_1| < \infty$ and the $X_i$ are iid, consistently producing large jumps in the same direction is rare enough that the whole collection of sample space elements with these trajectories has measure zero.

For comparison, if the $X_i$ were iid Cauchy, so $E|X_1| = \infty$, then the SLLN doesn't hold. Intuitively this makes sense: no matter how far along I am in a trajectory, there's a high enough probability of a massive shock that moves me to a new location, so I don't get convergence on a large subset of my sample space. Below is a plot of one trajectory of such a random walk.

[Figure: a scaled Cauchy random walk does not converge almost surely]

If $X \sim \text{Cauchy}(0,1)$ (i.e. location of $0$ and scale of $1$) then the characteristic function of $X$ is $\varphi_X(t) = e^{-|t|}$ (as a total side comment, the distribution being symmetric around zero implies that the characteristic function is purely real-valued, as I explore in this post). Then

$$\varphi_{S_n/n}(t) = \varphi_{S_n}(t/n) = \varphi_X(t/n)^n = e^{-|t|}$$

so $S_n/n \sim \text{Cauchy}(0,1)$ too. In words, the heavy tails of the Cauchy distribution are strong enough to overpower the effect of averaging: at every step in the random walk, even with division by $n$, the distribution is the same as that of $X_1$, so no convergence happens in general. This shows the importance of the $E|X_1| < \infty$ condition.
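This contrast shows up immediately in simulation. A hedged sketch (numpy; the seed and sizes are arbitrary choices of mine): the sample quartiles of $S_n/n$ shrink like $n^{-1/2}$ for normal steps but stay pinned at the Cauchy quartiles $\pm 1$ no matter how large $n$ gets.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
n, reps = 1000, 10_000

# S_n / n for normal steps concentrates near 0 ...
normal_avg = rng.normal(size=(reps, n)).mean(axis=1)
# ... while for Cauchy steps S_n / n is again Cauchy(0,1) at every n.
cauchy_avg = rng.standard_cauchy(size=(reps, n)).mean(axis=1)

print(np.quantile(normal_avg, [0.25, 0.75]))  # ~ +/- 0.021 (= 0.6745 / sqrt(n))
print(np.quantile(cauchy_avg, [0.25, 0.75]))  # ~ +/- 1, the Cauchy(0,1) quartiles
```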
