In this post I’m going to introduce almost sure convergence for sequences of random variables, compare it with convergence in probability, and prove a form of the strong law of large numbers (SLLN). I’ll finish with an application to random walks.
Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $(X_n)_{n \ge 1}$ be a collection of random variables from $\Omega$ to $\mathbb{R}$. The sequence is defined to converge almost surely (“a.s.”) to some limiting random variable $X$, denoted $X_n \xrightarrow{\text{a.s.}} X$, if for almost all $\omega \in \Omega$, $\lim_{n\to\infty} X_n(\omega)$ exists and is equal to $X(\omega)$. More succinctly,

$$P\left(\lim_{n\to\infty} X_n = X\right) = 1.$$
Thinking of $(X_n)$ as a discrete time stochastic process, $X_n \xrightarrow{\text{a.s.}} X$ means that almost all sample paths converge to the output of a measurable function $X$. In the simplest case the limiting distribution is a.s. constant and this just means almost all sample paths have the same limit. The figure below gives an example of this: each of the three lines represents a single sample path coming from a process with $X_n \xrightarrow{\text{a.s.}} \mu$ for a constant $\mu$, and indeed all three paths can be seen to converge to that limit. These were generated by taking the cumulative mean of iid random variables with mean $\mu$, so the a.s. convergence is a consequence of the SLLN.

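A figure like the one above can be produced with a few lines of simulation; here is a minimal sketch of the same experiment, where the distribution and sizes are my own choices (iid Exponential(1) draws, so the limit is $\mu = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three sample paths of the cumulative mean of iid draws. I'm assuming
# Exponential(1) variables here, so the SLLN limit is E[X_1] = 1.
n_steps, n_paths = 10_000, 3
draws = rng.exponential(scale=1.0, size=(n_paths, n_steps))
cum_means = np.cumsum(draws, axis=1) / np.arange(1, n_steps + 1)

# By the end, each path should be close to the a.s. limit of 1.
print(np.round(cum_means[:, -1], 3))
```

Plotting each row of `cum_means` against the step index reproduces the picture: early-on the paths fluctuate wildly, then settle onto the common limit.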
I’ll start by proving that the limit of a sequence of measurable functions is itself measurable when it exists.
Result 1: if $(f_n)_{n \ge 1}$ is a sequence of Borel functions from $\Omega$ to $\mathbb{R}$, then $f$ with $f(\omega) = \limsup_{n\to\infty} f_n(\omega)$ is Borel (I'll take the $\limsup$ to be finite everywhere to keep things simple).
Pf: I will prove this by showing that $f^{-1}((a, \infty))$ for arbitrary $a \in \mathbb{R}$ is equal to a countable union and intersection of measurable sets. The closure of $\sigma$-algebras w.r.t. such operations then guarantees $f^{-1}((a, \infty))$ to be measurable. Since $\{(a, \infty) : a \in \mathbb{R}\}$ generates the Borel $\sigma$-algebra this is sufficient for the result (and also necessary).
Fix $a \in \mathbb{R}$ and consider the preimage $f^{-1}((a, \infty))$. If this set is empty then it is measurable, so WLOG I can assume I can choose some arbitrary $\omega \in f^{-1}((a, \infty))$.
I have $f(\omega) > a$ so by definition

$$\limsup_{n\to\infty} f_n(\omega) = \lim_{N\to\infty} \sup_{n \ge N} f_n(\omega) > a.$$
This limit being strictly greater than $a$ means there is an $\epsilon > 0$ such that $\lim_{N\to\infty} \sup_{n \ge N} f_n(\omega) > a + \epsilon$.
The sequence of “upper tail” suprema $s_N = \sup_{n \ge N} f_n(\omega)$ is nonincreasing since at each step I’ll either keep the supremum the same (if it comes from values past $N$) or it’ll decrease (if $f_N(\omega)$ was the sole value responsible for the supremum). This monotonicity means

$$\lim_{N\to\infty} s_N = \inf_N s_N,$$

so for every $N$ I have

$$s_N \ge \inf_N s_N = \lim_{N'\to\infty} s_{N'} > a + \epsilon,$$

where the 2nd inequality is because of my particular choice of $\epsilon$.
When every element of a collection is bounded from above by some constant then its supremum is too; taking the contrapositive, a supremum exceeding $a + \epsilon$ forces at least one element to exceed $a + \epsilon$. Choosing a positive integer $m$ with $1/m < \epsilon$ (this $1/m$ is now playing the role of $\epsilon$ and removes the dependency on an arbitrary constant), I now have that, for my arbitrary $\omega$, for any $N$ there is some $n \ge N$ such that $f_n(\omega) > a + 1/m$.
Let

$$A_m = \bigcap_{N=1}^{\infty} \bigcup_{n=N}^{\infty} f_n^{-1}\left(\left(a + \tfrac{1}{m}, \infty\right)\right).$$
I just showed that for every $N$ there is an $n \ge N$ such that $f_n(\omega) > a + 1/m$, which means $\omega \in A_m$, so $f^{-1}((a, \infty)) \subseteq \bigcup_{m=1}^{\infty} A_m$.
Alternatively, let $\omega \in \bigcup_{m=1}^{\infty} A_m$. Then I know there is at least one $m$ such that $\omega \in A_m$, i.e. for every $N$ there is some $n \ge N$ with $f_n(\omega) > a + 1/m$. This means $\sup_{n \ge N} f_n(\omega) \ge a + 1/m$ for every $N$, therefore

$$f(\omega) = \lim_{N\to\infty} \sup_{n \ge N} f_n(\omega) \ge a + \tfrac{1}{m} > a$$

by monotonicity. This shows that $\omega \in f^{-1}((a, \infty))$, so $\bigcup_m A_m \subseteq f^{-1}((a, \infty))$ and hence $f^{-1}((a, \infty)) = \bigcup_m A_m$. This set is built entirely out of Borel sets so this shows that $f$ is Borel.
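The nonincreasing tail suprema at the heart of this proof are easy to check numerically. Here is a toy sequence of my own (not from the post) that oscillates forever but has $\limsup = 1$:

```python
import numpy as np

# a_n = (-1)^n (1 + 1/n): no limit exists, but limsup_n a_n = 1.
n = np.arange(1, 2001)
a = (-1.0) ** n * (1 + 1 / n)

# Tail suprema s_N = sup_{n >= N} a_n, computed as a reversed running maximum.
tail_sups = np.maximum.accumulate(a[::-1])[::-1]

# The s_N sequence is nonincreasing, and its limit (= infimum) is the limsup.
assert np.all(np.diff(tail_sups) <= 0)
print(tail_sups[0], tail_sups[-1])
```

The first tail supremum is the global supremum $1.5$ (attained at $n = 2$), and the later ones decrease toward the $\limsup$ of $1$.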
I’ll use without proof the analogous result that $\liminf_n f_n$ is also measurable.
I now want to prove the following result:
Result 2: For any collection of Borel functions $(f_n)_{n \ge 1}$, both the set

$$C = \left\{\omega : \lim_{n\to\infty} f_n(\omega) \text{ exists}\right\}$$

and the function $f$ defined by

$$f(\omega) = \begin{cases} \lim_{n\to\infty} f_n(\omega) & \omega \in C \\ 0 & \omega \notin C \end{cases}$$

are measurable.
This result is good because it means that I don’t have to worry about measurability when I have something like $X_n \xrightarrow{\text{a.s.}} X$, since there’s only a null set where the limit may not exist and I can just send those $\omega$ to $0$ without affecting anything.
Pf: $\lim_n f_n(\omega)$ existing is equivalent to $\limsup_n f_n(\omega) = \liminf_n f_n(\omega)$ (to avoid $\infty - \infty$ issues I’ll implicitly work with finite limits here). Both of these functions are measurable so the function $g$ given by $g = \limsup_n f_n - \liminf_n f_n$ is also measurable. $\{0\}$ is a Borel set so

$$C = g^{-1}(\{0\})$$

is measurable.
For $f$, I will first prove the following. Let $h_1$ and $h_2$ be Borel and define

$$h(\omega) = \begin{cases} h_1(\omega) & \omega \in B \\ h_2(\omega) & \omega \notin B \end{cases}$$

for some measurable set $B$. Fix $a$ and take $H = h^{-1}((a, \infty))$. Then

$$H = (H \cap B) \cup (H \cap B^c).$$

By construction $H \cap B = B \cap h_1^{-1}((a, \infty))$, since for all of these $\omega$ I have $h(\omega) = h_1(\omega)$, so $H \cap B$ is an intersection of measurable sets and that means it is measurable. The same argument applied to $B^c$ shows that $H \cap B^c = B^c \cap h_2^{-1}((a, \infty))$ is also measurable, so $H$ is the union of two measurable sets and by the closure of $\mathcal{F}$ is also measurable.
Applying this to $f$: on $C$ I have $f = \limsup_n f_n$, which is measurable, and the constant function $0$ is also measurable, so my above result with $B = C$, $h_1 = \limsup_n f_n$, and $h_2 = 0$ applies to $f$, and $f$ is therefore Borel.
I will compare this to convergence in probability, which as a reminder (see e.g. my post here) is defined by

$$X_n \xrightarrow{p} X \iff \lim_{n\to\infty} P(|X_n - X| > \epsilon) = 0 \text{ for every } \epsilon > 0.$$
Convergence in probability measures the size of the subset of $\Omega$ leading to $|X_n - X| > \epsilon$, and requires that the measure of these sets heads to zero as I advance along the sequence. The big difference between convergence a.s. and convergence in probability is that convergence in probability isn’t about the individual trajectories. It can be the case that $\lim_n X_n(\omega)$ exists almost nowhere and yet $X_n$ still converges in probability to something. The next result gives an explicit example of this.
Result 3: $X_n \xrightarrow{p} X$ does not imply $X_n \xrightarrow{\text{a.s.}} X$.
Pf: I’ll prove this by explicitly constructing a sequence of random variables with these properties. Take $\Omega = [0, 1)$, $\mathcal{F} = \mathcal{B}([0,1))$ (so it’s the Borel $\sigma$-algebra generated by the open subsets of $[0,1)$), and $P = \lambda$, i.e. the Lebesgue measure (which is a probability measure on this sample space).
Now for each positive integer $m$ let $\mathcal{P}_m$ be the partition of $\Omega$ formed by cutting $[0, 1)$ up into $m$ adjacent half-open intervals, so $\mathcal{P}_m$ contains sets of the form $\left[\frac{j-1}{m}, \frac{j}{m}\right)$ for $j = 1, \dots, m$. Now I’ll create a triangular array of random variables $X_{m,j} = \mathbf{1}_{[\frac{j-1}{m}, \frac{j}{m})}$ (all of the intervals are Borel so the $X_{m,j}$ are all Borel too). I’ll flatten this into a sequence by enumerating first along $j$ within each value of $m$ and then along $m$, so

$$X_1 = X_{1,1}, \quad X_2 = X_{2,1}, \quad X_3 = X_{2,2}, \quad X_4 = X_{3,1}, \quad \dots$$
Fix $\epsilon \in (0, 1)$ (larger $\epsilon$ only makes the probabilities below smaller). Then for $X_n = X_{m,j}$, the corresponding probability is $P(|X_n - 0| > \epsilon) = P(X_{m,j} = 1) = \frac{1}{m}$. $n \to \infty$ implies $m \to \infty$, so $P(|X_n| > \epsilon) \to 0$ for this arbitrary $\epsilon$, which shows $X_n \xrightarrow{p} 0$.
But for almost sure convergence there’s a problem. Any $\omega$ will be in exactly one element of each $\mathcal{P}_m$, so in each row of the triangular array (I’m thinking of $m$ indexing rows and $j$ indexing columns) I’ll have all of the $X_{m,j}(\omega) = 0$ except for the single one taking a value of $1$ for the piece of the partition containing $\omega$. This is the case for every $m$ so $X_n(\omega)$ hits both $0$ and $1$ infinitely often and therefore this trajectory doesn’t converge to anything. This is true for every $\omega$ so in fact there is not a single element of the sample space with a convergent trajectory. This means $(X_n)$ doesn’t converge in the almost sure / pointwise sense.
The crux of this proof was that convergence in probability doesn’t care about which $\omega$s it’s measuring at each time. In this case the sets $\{X_n = 1\}$ had nonincreasing measure, but the particular $\omega$ in them jumped around, so no individual trajectories converged even though at any given time the fraction of $\Omega$ leading to $X_n = 1$ was shrinking.
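This “typewriter” construction is easy to play with in code. Here is my own sketch (the truncation depth `n_rows` is a free parameter) showing both behaviors at once: the per-variable probabilities shrink, while a fixed trajectory keeps hitting $1$ once per row:

```python
import numpy as np

def typewriter_trajectory(omega, n_rows):
    """X_n(omega) for the flattened triangular array of interval indicators."""
    traj = []
    for m in range(1, n_rows + 1):      # row m partitions [0, 1) into m pieces
        for j in range(1, m + 1):
            traj.append(1 if (j - 1) / m <= omega < j / m else 0)
    return np.array(traj)

traj = typewriter_trajectory(0.3, n_rows=200)

# Convergence in probability: P(X_{m,j} = 1) = 1/m -> 0 along the sequence...
# ...but this fixed trajectory has exactly one 1 per row, so it hits both
# 0 and 1 infinitely often and cannot converge pointwise.
print(traj.sum(), len(traj))
```

The trajectory has exactly `n_rows` ones among its $200 \cdot 201 / 2 = 20100$ entries, and the ones never stop arriving as the depth grows.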
The implication does hold in the other direction, though:
Result 4: $X_n \xrightarrow{\text{a.s.}} X \implies X_n \xrightarrow{p} X$.
Pf: Fix $\epsilon > 0$. Define

$$A_n = \{\omega : |X_n(\omega) - X(\omega)| > \epsilon\}$$

and

$$B_N = \bigcup_{n=N}^{\infty} A_n,$$

so $B_N$ collects the $\omega$ where there is at least one $n \ge N$ such that $X_n(\omega)$ and $X(\omega)$ are more than $\epsilon$ apart. All of these sets are measurable so I don’t need to worry about that.
$B_{N+1} \subseteq B_N$, so the sequence is nonincreasing and settles on

$$B_\infty := \bigcap_{N=1}^{\infty} B_N = \bigcap_{N=1}^{\infty} \bigcup_{n=N}^{\infty} A_n.$$
$B_\infty$ is exactly the set of $\omega$ that are in infinitely many of the $A_n$. If $\omega$ is in infinitely many $A_n$ then it cannot be the case that $X_n(\omega) \to X(\omega)$, since $X_n(\omega)$ never settles within $\epsilon$ of $X(\omega)$. But since $X_n \xrightarrow{\text{a.s.}} X$ it must be that this set has measure zero, i.e. $P(B_\infty) = 0$.
By the continuity of $P$ and the nesting of the $B_N$, I have

$$\lim_{N\to\infty} P(B_N) = P(B_\infty) = 0,$$

so this means

$$\lim_{N\to\infty} P(|X_N - X| > \epsilon) = \lim_{N\to\infty} P(A_N) \le \lim_{N\to\infty} P(B_N) = 0$$

(using $A_N \subseteq B_N$), so $X_n \xrightarrow{p} X$.
I’m building up to proving a form of the strong law of large numbers but I’ll need a standard helper result first:
Borel–Cantelli lemma. Let $(A_n)_{n \ge 1}$ be events with $\sum_{n=1}^{\infty} P(A_n) < \infty$. Then

$$P\left(\bigcap_{N=1}^{\infty} \bigcup_{n=N}^{\infty} A_n\right) = 0.$$
In other words, this result says that if the probabilities of these events decay fast enough then almost all sample outcomes are only in finitely many of them.
Pf: I’ll build a new sequence of events $(B_N)$ where

$$B_N = \bigcup_{n=N}^{\infty} A_n$$

(this is very reminiscent of the proof of Result 4).

Then $B_N \downarrow \bigcap_N B_N$. By the continuity of measure

$$P\left(\bigcap_{N=1}^{\infty} \bigcup_{n=N}^{\infty} A_n\right) = \lim_{N\to\infty} P(B_N) \le \lim_{N\to\infty} \sum_{n=N}^{\infty} P(A_n)$$

by subadditivity. By assumption $\sum_n P(A_n) < \infty$ so these tail sums must converge to zero. This means $P\left(\bigcap_N \bigcup_{n \ge N} A_n\right) = 0$.
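A quick Monte Carlo illustration of the lemma (the events here are my own toy choice): take independent events with $P(A_n) = 1/n^2$, which is summable, and count how many occur per simulated outcome:

```python
import numpy as np

rng = np.random.default_rng(1)

# Independent events A_n with P(A_n) = 1/n^2 (summable), simulated for
# many outcomes at once: row = outcome, column = event index n.
N, n_outcomes = 10_000, 1_000
p = 1.0 / np.arange(1, N + 1) ** 2
occurs = rng.random((n_outcomes, N)) < p

counts = occurs.sum(axis=1)
# Average number of events per outcome is about sum 1/n^2 = pi^2/6 ~ 1.64,
# and occurrences past the early indices essentially stop: each outcome
# lands in only finitely many of the A_n.
print(counts.mean(), occurs[:, 100:].sum())
```

Every simulated outcome is in only a handful of the events, and almost all of those occurrences come from small $n$, matching the “only finitely many” conclusion.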
I’m now ready to prove my main result.
SLLN: Let $(X_n)_{n \ge 1}$ be an iid sequence of random variables with $E[X_1^4] < \infty$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then $\bar{X}_n \xrightarrow{\text{a.s.}} E[X_1]$.
The full SLLN only requires $E|X_1| < \infty$, but this is a much simpler proof, which is why I’m just doing this one even though it requires a stronger assumption. In a future post I’ll go through the proof of the full SLLN.
Pf: I’ll be considering terms like $E[(X_i - E[X_1])^4]$, so WLOG I can just take $E[X_1] = 0$ and show $\bar{X}_n \xrightarrow{\text{a.s.}} 0$. Writing $S_n = \sum_{i=1}^{n} X_i$, I have

$$P(|\bar{X}_n| > \epsilon) = P(|S_n| > n\epsilon) \le \frac{E[S_n^4]}{n^4 \epsilon^4}$$

by Markov’s inequality (I prove this in this post). Expanding $S_n^4$ via the multinomial theorem I have

$$S_n^4 = \sum_{k_1 + \dots + k_n = 4} \binom{4}{k_1, \dots, k_n} \prod_{i=1}^{n} X_i^{k_i}.$$
Since the power of $4$ is small relative to the number $n$ of summands, most of these indices $k_i$ will be zero. I’ll be taking the expected value of $S_n^4$, and I know $E[X_i] = 0$, so by independence any term containing some $X_i$ raised to the first power has expectation zero; I only need to figure out which terms only have $X_i^4$ or $X_i^2 X_j^2$ in them (terms with an $X_i^3$ necessarily pair it with a lone $X_j$, so those vanish too).
4th moment terms arise when one $k_i = 4$, and in those cases all the other $k_j = 0$. This means I’ll have a term $n E[X_1^4]$ (the multinomial coefficient for all of these is just $1$).
For the 2nd powers I need exactly two of the $k_i$ to be $2$, so the multinomial coefficient is $\binom{4}{2,2} = 6$ and overall I get

$$6 \cdot \frac{n(n-1)}{2} \cdot E[X_1^2]^2 = 3n(n-1) E[X_1^2]^2,$$

where the division by $2$ is because indexing over ordered pairs $(i, j)$ double counts each term.
Putting this all together I have

$$E[S_n^4] = n E[X_1^4] + 3n(n-1) E[X_1^2]^2 \le C n^2$$

for some constant $C > 0$, which means

$$P(|\bar{X}_n| > \epsilon) \le \frac{C n^2}{n^4 \epsilon^4} = \frac{C}{n^2 \epsilon^4}.$$
The right hand side is summable so this means

$$\sum_{n=1}^{\infty} P(|\bar{X}_n| > \epsilon) < \infty.$$
Taking $A_n = \{|\bar{X}_n| > \epsilon\}$, I can use Borel–Cantelli to conclude

$$P\left(\bigcap_{N=1}^{\infty} \bigcup_{n=N}^{\infty} A_n\right) = 0.$$
But $\bigcap_N \bigcup_{n \ge N} A_n$ is exactly the subset of $\Omega$ with $\omega$ that are in infinitely many of these events. If almost all $\omega$ are only in finitely many of them, then almost every $\omega$ has $|\bar{X}_n(\omega)| \le \epsilon$ eventually. Since $\epsilon$ is arbitrary (run over $\epsilon = 1/k$ for $k = 1, 2, \dots$ and take the union of the countably many null sets), I have shown that almost all sample paths converge to zero, so $\bar{X}_n \xrightarrow{\text{a.s.}} 0$.
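The exact fourth moment identity $E[S_n^4] = n E[X_1^4] + 3n(n-1)E[X_1^2]^2$ behind this bound can be sanity checked by simulation. Here I use Rademacher ($\pm 1$) variables, my own choice, where both moments equal $1$ so the identity predicts $3n^2 - 2n$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo estimate of E[S_n^4] for mean-zero iid X_i (Rademacher here,
# so E[X^2] = E[X^4] = 1 and the identity predicts 3 n^2 - 2 n).
n, n_sims = 50, 100_000
x = rng.choice([-1.0, 1.0], size=(n_sims, n))
s4 = x.sum(axis=1) ** 4

predicted = 3 * n**2 - 2 * n    # 7400 for n = 50
print(s4.mean(), predicted)
```

The point of the proof is visible in `predicted`: the fourth moment of $S_n$ grows like $n^2$, not $n^4$, which is what makes $P(|\bar X_n| > \epsilon)$ summable.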
One way to think of this that I like is using random walks. Given an iid collection of random variables $(Z_i)_{i \ge 1}$, I can form a random walk by the cumulative addition of the $Z_i$, so $X_n = \sum_{i=1}^{n} Z_i$. Typically the idea is that $(Z_i)$ represents some white noise process and I’m (discretely, in this case) integrating to get $X_n$, which wanders around according to the fluctuations of the $Z_i$. Taking $\bar{Z}_n = X_n / n$, if I assume $E[Z_1^4] < \infty$ then the SLLN applies, so $X_n / n \xrightarrow{\text{a.s.}} E[Z_1]$. In terms of sample paths, this means that almost all sample space elements have trajectories that grow like $n E[Z_1]$, with deviations dominated by $n$. It’s certainly not the case that all trajectories of $X_n / n$ are eventually squashed to $E[Z_1]$, but the set of outcomes that lead to non-squashed trajectories is measure zero.
As an example, if $Z_i \overset{\text{iid}}{\sim} N(0, 1)$, so $X_n$ is a discrete time brownian motion, then the SLLN tells me that $X_n / n \xrightarrow{\text{a.s.}} 0$. But there are trajectories that have, say, $Z_i \ge 1$ for all $i$ and lead to $X_n / n$ not converging on $0$, or even to $X_n$ diverging to $\infty$. But because $P(Z_i \ge 1) < 1$ and the $Z_i$ are iid, it’s rare enough to consistently produce large enough jumps in the same direction that the whole collection of sample space elements with these trajectories is measure zero.
For comparison, if the $Z_i$ were iid Cauchy, so $E|Z_1| = \infty$, then the SLLN doesn’t hold, and intuitively this makes sense because no matter how far along I am in a trajectory, there’s a high enough probability of a massive shock that moves me to a new location, so I don’t get convergence on a large subset of my sample space. Below is a plot of one trajectory of such a random walk.

If $Z \sim \text{Cauchy}(0, 1)$ (i.e. location of $0$ and scale of $1$) then the characteristic function of $Z$ is $\varphi_Z(t) = e^{-|t|}$ (as a total side comment, the distribution being symmetric around zero implies that the characteristic function is purely real-valued, as I explore in this post). Then

$$\varphi_{X_n / n}(t) = \varphi_{X_n}(t / n) = \prod_{i=1}^{n} \varphi_Z(t / n) = \left(e^{-|t| / n}\right)^n = e^{-|t|},$$
so $X_n / n \sim \text{Cauchy}(0, 1)$ too. In words, the heavy tails of the Cauchy distribution are strong enough to overpower the effect of averaging: at every step in the random walk, even with division by $n$, the distribution of $X_n / n$ is the same as that of a single $Z$, so no convergence happens in general. This shows the importance of the $E|X_1| < \infty$ condition.
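This stability is visible numerically. Here is a sketch (sizes are my own choices) comparing the spread of $X_n / n$ across many Cauchy random walks at different times: a Cauchy(0, 1) variable has quartiles at $\pm 1$, so the interquartile range should hover around $2$ at every $n$ instead of shrinking the way a Gaussian walk’s would:

```python
import numpy as np

rng = np.random.default_rng(3)

# Many independent Cauchy random walks; X_n / n is again Cauchy(0, 1)
# at every n, so its interquartile range stays near 2 as n grows.
n_paths, n = 2_000, 2_048
z = rng.standard_cauchy((n_paths, n))
running_means = z.cumsum(axis=1) / np.arange(1, n + 1)

for k in (16, 256, 2_048):
    q25, q75 = np.percentile(running_means[:, k - 1], [25, 75])
    print(k, q75 - q25)
```

Running the same experiment with `rng.standard_normal` instead shows the interquartile range of $X_n / n$ collapsing like $1/\sqrt{n}$, which is the SLLN at work.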