$\newcommand{\Om}{\Omega}\newcommand{\w}{\omega}\newcommand{\F}{\mathscr F}\newcommand{\R}{\mathbb R}\newcommand{\e}{\varepsilon}\newcommand{\convd}{\stackrel{\text d}\to}\newcommand{\convp}{\stackrel{\text p}\to}\newcommand{\convas}{\stackrel{\text {a.s.}}\to}\newcommand{\E}{\operatorname{E}}\newcommand{\Var}{\operatorname{Var}}\newcommand{\one}{\mathbf 1}$In this post I’m going to introduce two modes of convergence for random variables. I’ll begin by defining convergence in distribution and convergence in probability, after which I’ll prove some results about them. I’ll finish by proving the standard weak law of large numbers (I’ll follow the path to this result as laid out in Durrett’s Probability: Theory and Examples).
Let $(\Om, \F, P)$ be a probability space and let $X_1, X_2, \dots, X : \Om\to\R$ be random variables. I’ll use $F_n$ for the CDF of $X_n$ and $F$ for the CDF of $X$.
The $\{X_n\}$ converge in distribution to $X$, denoted $X_n\convd X$, if
$$
F_n(x)\to F(x)
$$
for all $x\in\R$ at which $F$ is continuous. One quirk of convergence in distribution is that I don’t actually need the $X_n$ and $X$ to be defined on a common probability space. The central limit theorem (which involves convergence in distribution) gives many examples of this: I could have every $X_n$ discrete with a finite sample space, but the CDFs converge to that of a Gaussian supported on an uncountable sample space. Nevertheless, I’ll assume a common probability space throughout to make the comparisons easier.
Another interesting thing about convergence in distribution is that it only concerns continuity points of $F$, so I can have $X_n\convd X$ without $F_n$ converging to $F$ in a more typical pointwise way, such as in terms of the sup norm $\|F_n - F\|_\infty = \sup_{x\in\mathbb R} |F_n(x) - F(x)|$. For example, take $X_n \sim \text{Unif}\left(-\frac 1{2n}, \frac 1{2n}\right)$ so
$$
F_n(x) = \begin{cases} 0 & x < -\frac 1{2n} \\ nx + \frac 12 & -\frac 1{2n} \leq x \leq \frac 1{2n} \\ 1 & x > \frac 1{2n}\end{cases}.
$$
For $x < 0$ I’ll have $F_n(x) \to 0$ and for $x > 0$ I’ll have $F_n(x)\to 1$, so for $x\neq 0$ I have $F_n(x)\to F(x)$ with $F(x) = \one_{x \geq 0}$, which is the CDF of a point mass at $0$. $F$ is not continuous at $0$ so the behavior there doesn’t matter for convergence in distribution, and this establishes that $X_n\convd 0$. But $F_n(0) = \frac 12$ for each $n$ so $\lim_{n\to\infty} F_n(0) = \frac 12 \neq F(0) = 1$, and in fact $\lim_n \|F_n - F\|_\infty = \frac 12$. It’s also interesting that $\lim_n F_n$ is not even a valid CDF as it’s not right continuous.
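If I want to see this numerically, here’s a quick sketch (using `scipy.stats.uniform`; the evaluation points $\pm 0.01$ are an arbitrary choice of mine):

```python
from scipy.stats import uniform

# X_n ~ Unif(-1/(2n), 1/(2n)); scipy's uniform(loc, scale) is Unif(loc, loc + scale)
def F_n(x, n):
    return uniform(loc=-1 / (2 * n), scale=1 / n).cdf(x)

for n in [1, 10, 100, 1000]:
    # F_n at a negative point, at 0, and at a positive point
    print(n, F_n(-0.01, n), F_n(0.0, n), F_n(0.01, n))

# As n grows the first value heads to 0 and the last to 1, but F_n(0) is
# exactly 1/2 for every n, so the sup-norm distance to F = 1{x >= 0} stays at 1/2.
```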
The $\{X_n\}$ converge in probability to $X$, denoted $X_n\convp X$, if for all $\e>0$
$$
\lim _{n\to\infty}P(|X_n-X| > \e) = 0.
$$
The idea behind this is that I fix some threshold $\e>0$ and then measure the size of the region of $\Omega$ that leads to $X_n$ being more than $\e$ away from $X$. As $n\to\infty$ the measure of this disagreement region goes to zero.
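To make the definition concrete, here’s a small Monte Carlo sketch; the particular sequence $X_n = X + Z/n$ with $X, Z$ iid standard normal is my own choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1
n_samples = 100_000

# A common probability space: draw omega once, then X and the noise Z are both
# functions of it.  With X_n = X + Z / n I have |X_n - X| = |Z| / n, so X_n
# should converge to X in probability.
X = rng.standard_normal(n_samples)
Z = rng.standard_normal(n_samples)

for n in [1, 10, 100]:
    X_n = X + Z / n
    print(n, np.mean(np.abs(X_n - X) > eps))  # estimate of P(|X_n - X| > eps)

# The estimated probabilities shrink toward 0 as n grows.
```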
I’ll now show that convergence in probability implies convergence in distribution. I’ll use the following helper lemma which intuitively says that if a sample outcome leads to a random variable $Y$ being at most some number $y$, then for any other RV $Z$ that same sample outcome will either have $Z$ at most $y+\e$ or I’ll have $Z$ further than $\e$ from $Y$ (or both can hold). This will help me relate one-sided probabilities like CDFs, which I need for convergence in distribution, to measures of distance like I have with convergence in probability. I’ll sometimes follow the proofs as given in this Wikipedia article.
Lemma: for two RVs $Y$ and $Z$, any $y\in\R$, and any $\e>0$,
$$
P(Y\leq y) \leq P(Z \leq y + \e) + P(|Z-Y| > \e).
$$
Pf: Let $\w$ be an element of the event $\{Y\leq y\}$, so $Y(\w)\leq y$. Suppose $\w \notin \{|Z-Y|>\e\}$ so $|Z(\w) - Y(\w)| \leq \e$. Then
$$
-\e \leq Z(\w) - Y(\w) \leq \e \\
\implies y - \e \leq Z(\w) + y - Y(\w) \leq y + \e.
$$
$y - Y(\w) \geq 0$ by my choice of $\w$, so if $Z(\w)$ plus some nonnegative number is less than or equal to $y+\e$, it must be that $Z(\w)\leq y+\e$. This shows that
$$
\w\notin \{|Z-Y|>\e\}\implies \w \in \{Z \leq y + \e\}
$$
so all together it is always the case that
$$
\w \in \{|Z-Y| > \e\}\cup \{Z \leq y+\e\}.
$$
$\w$ was an arbitrary element of $\{Y\leq y\}$ so this means
$$
\{Y\leq y\} \subseteq \{Z \leq y+\e\} \cup \{|Z-Y| > \e\}
$$
and I can then use the monotonicity of $P$ and the union bound to get
$$
P(Y\leq y) \leq P(Z \leq y+\e) + P(|Z-Y|>\e).
$$
$\square$
I’ll now apply this to prove my first result.
Result 1: $X_n\convp X \implies X_n\convd X$.
Pf: I’ll use the lemma twice so I can get upper and lower bounds. For the first one I’ll take $Y = X_n$ and $Z = X$ with $y = x$, while for the second one I’ll do $Y = X$, $Z = X_n$, and $y = x-\e$. Thus
$$
P(X_n \leq x) \leq P(X \leq x + \e) + P(|X_n - X| > \e) \\
P(X \leq x - \e) \leq P(X_n \leq x) + P(|X_n - X| > \e)
$$
so combining these two gives me
$$
P(X \leq x - \e) - P(|X_n - X| > \e) \le P(X_n\leq x) \leq P(X \leq x + \e) + P(|X_n - X| > \e)
$$
or, in terms of CDFs,
$$
F(x-\e) - P(|X_n - X| > \e) \leq F_n(x) \leq F(x+\e) + P(|X_n - X| > \e)
$$
for an arbitrary $x$ (with $F$ continuous there) and $\e>0$. Since $P(|X_n-X| > \e)\to 0$ by assumption, taking $\liminf$ and $\limsup$ as $n\to\infty$ (I don’t yet know that $\lim_n F_n(x)$ exists) gives
$$
F(x-\e) \leq \liminf_{n\to\infty} F_n(x) \leq \limsup_{n\to\infty} F_n(x) \leq F(x+\e).
$$
I’ve assumed that $F$ is continuous at $x$ so letting $\e \searrow 0$ squeezes both the $\liminf$ and the $\limsup$ to $F(x)$, which gives
$$
F(x) = \lim_{n\to\infty} F_n(x)
$$
and thus $X_n \convd X$ by definition.
$\square$
Result 2: $X_n\convd X \not\Rightarrow X_n\convp X$
Pf: by example. Suppose $X$ has a distribution that is symmetric about $0$ on $\R$, so $P(X \leq x) = P(X \geq -x)$. Take $X_n = -X$ for $n=1,2,\dots$. Then
$$
F_n(x) = P(X_n \leq x) = P(-X \leq x) \\
= P(X \geq -x) = P(X \leq x) = F(x)
$$
so $F_n = F$ which trivially means $X_n\convd X$.
But $|X_n - X| = |-X - X| = 2|X|$ for every $n$, so
$$
P(|X_n - X| > \e) = P(|X| > \e/2)
$$
and unless $|X|$ is a point mass at zero, there are $\e>0$ such that this is strictly positive (and there’s no dependence on $n$ so it doesn’t decrease with $n\to\infty$).
A concrete example of this is if $X\sim\mathcal N(0,1)$ so $-X\sim\mathcal N(0,1)$ too. Then if $\e=4$, say,
$$
P(|X_n-X| > \e) = P(|X| > 2) \approx 0.05 > 0.
$$
$\square$
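Here’s a quick numerical version of this counterexample (a sketch; the sample size is arbitrary):

```python
import numpy as np
from scipy.stats import norm

# X ~ N(0,1) and X_n = -X for every n: the distributions are identical, so
# X_n -> X in distribution trivially, but |X_n - X| = 2|X| never shrinks.
eps = 4.0
exact = 2 * norm.sf(eps / 2)          # P(|X| > eps/2) = 2 * P(X > 2)

rng = np.random.default_rng(1)
X = rng.standard_normal(1_000_000)
X_n = -X
mc = np.mean(np.abs(X_n - X) > eps)   # Monte Carlo estimate, same for every n

print(exact, mc)                      # both about 0.0455, with no dependence on n
```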
This example highlights the difference between convergence in distribution vs probability. When $X_n\convd X$, this means that for every $x\in\R$ for which $F$ is continuous, I’ll have the measure of $X_n^{-1}((-\infty, x])$ approach the measure of $X^{-1}((-\infty, x])$. In words, this means the size of the region of $\Omega$ that leads to $X_n \leq x$ converges on being the same size as the region of $\Om$ with $X\leq x$. But I don’t have to have the same $\w$ in these regions, and that’s what I saw in Result 2. To be more explicit, if I have $X\sim\mathcal N(0,1)$ and $X_1 = X_2 =\dots = -X$ so (trivially) $X_n \convd X$, for many $\w$ I will have $X_n(\w)$ very far from $X(\w)$ since $X_n(\w) = -X(\w)$, but that doesn’t matter for convergence in distribution. For convergence in probability, however, I need the mass of $\w$s leading to $X_n(\w)$ and $X(\w)$ being far apart to diminish. In summary, the reason that convergence in distribution is weaker is that I can have two random variables with similar sizes of preimages for sets like $(-\infty, x]$ yet those random variables can take different values on many particular $\w$s. If I require that the size of the pointwise disagreement of these random variables diminishes, then I’ll also have agreement on the measures of preimages of sets like $(-\infty, x]$, so convergence in distribution is implied.
The converse does hold in one particular case:
Result 3: if $X_n\convd c$ for a constant $c\in\R$ then $X_n \convp c$.
Pf:
$$
\begin{aligned}P(|X_n - c| > \e) &= 1 - P(|X_n - c| \leq \e) \\&
= 1 - P(c-\e \leq X_n \leq c + \e) \\&
= 1 - F_n(c + \e) + F_n(c-\e) - P(X_n = c-\e) \\&
\leq 1 - F_n(c + \e) + F_n(c-\e).\end{aligned}
$$
By assumption $F_n \to F$ at all continuity points of $F$, where $F(x) = \one_{x\geq c}$ is the CDF of the constant $c$. $F$ is continuous for all $x\neq c$, which includes $c + \e$ and $c-\e$. This gives me $F_n(c+\e) \to F(c+\e) = 1$ and $F_n(c-\e)\to F(c-\e) = 0$ so all together
$$
0 \leq \lim_{n\to\infty}P(|X_n - c| > \e) \leq \lim_{n\to\infty} \left[1 - F_n(c + \e) + F_n(c-\e)\right] = 0
$$
so $X_n\convp c$.
$\square$
I’m now going to build up to the standard weak law of large numbers, but first I’ll prove some inequalities and use those to establish a weaker law of large numbers.
Inequality 1: Markov
Claim: if $X \geq 0$ almost surely then
$$
P(X \geq x) \leq \frac{\E X}{x}
$$
for $x > 0$ (if $x=0$ I’ll have $P(X\geq 0) = 1$ so I’m fine excluding that trivial case).
Pf: $X$ is nonnegative a.s. so by definition
$$
\E X = \int_\Om X \,\text dP = \sup\left\{\int_\Om \varphi\,\text dP : \varphi \in S_+ \text{ and } \varphi \leq X\right\}
$$
where $S_+$ is the set of nonnegative simple functions on $\Omega$ (i.e. functions that are linear combinations, with nonnegative coefficients, of indicator functions for a finite collection of measurable sets [and the use of measurable sets makes them measurable]). This $\sup$ always exists although it may be infinite, but that doesn’t matter here as that just gives a trivial bound of $P(X \geq x) \leq \infty$. Then for any particular $\varphi \in S_+$ with $\varphi \leq X$ I’ll have
$$
\E X \geq \int \varphi \,\text dP.
$$
Consider now
$$
A := X^{-1}([x, \infty)) = \{\w \in \Om : X(\w) \geq x \}.
$$
$X$ is Borel so this set is measurable, and I’ll now take my particular $\varphi : \Om \to \R$ to be
$$
\varphi(\w) = x \one_{A}(\w)
$$
so $\varphi$ is $x$ on $A$ and $0$ elsewhere. $\varphi \in S_+$ and by construction $\varphi \leq X$ so
$$
\E X \geq \int_\Om \varphi\,\text dP = x P(A) \\
\implies \frac{\E X}{x} \geq P(X \geq x).
$$
$\square$
I really like this proof because it’s so direct from the definition. $X$ is nonnegative a.s. so $\E X$ is just the $\sup$ over all the integrals of all simple functions that approximate $X$ from below. I picked one such simple function in particular and got my inequality.
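As a numerical sanity check of Markov’s inequality, here’s a sketch with an $\text{Exponential}(1)$ variable (an arbitrary choice with $\E X = 1$):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(scale=1.0, size=1_000_000)  # nonnegative, E[X] = 1

for x in [0.5, 1.0, 2.0, 5.0]:
    lhs = np.mean(X >= x)   # estimate of P(X >= x); the true tail is exp(-x)
    rhs = X.mean() / x      # Markov bound E[X] / x
    print(x, lhs, rhs, lhs <= rhs)
```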
I’ll now use this to establish my next inequality.
Inequality 2: Chebyshev
Claim: if $\E X = \mu$ and $\Var X = \sigma^2$ and both exist and are finite with $\sigma > 0$, then for any $\e > 0$
$$
P(|X - \mu| \geq \sigma \e) \leq \frac{1}{\e^2}
$$
Pf: Let $Z = |X-\mu|^2$. $Z$ is non-negative so by Markov’s inequality I have
$$
P(|X-\mu| \geq \sigma \e) = P(Z \geq \sigma^2 \e^2) \leq \frac{\E Z}{\sigma^2 \e^2}.
$$
But $\E Z$ is just $\Var X$ by definition so
$$
P(|X-\mu| \geq \sigma \e) \leq \frac{\Var X}{\sigma^2 \e^2} = \frac 1{\e^2}.
$$
$\square$
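And similarly for Chebyshev, a quick sketch with a $\text{Gamma}(2, 1)$ variable (an arbitrary choice with $\mu = 2$ and $\sigma^2 = 2$):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.gamma(shape=2.0, scale=1.0, size=1_000_000)  # mu = 2, sigma^2 = 2
mu, sigma = X.mean(), X.std()

for eps in [1.0, 2.0, 3.0]:
    lhs = np.mean(np.abs(X - mu) >= sigma * eps)  # P(|X - mu| >= sigma * eps)
    print(eps, lhs, 1 / eps**2, lhs <= 1 / eps**2)

# The empirical tail sits below the 1/eps^2 bound in each case.
```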
I’m now equipped to prove my first law of large numbers.
2nd moment WLLN: Suppose $X_1,X_2,\dots$ is an uncorrelated sequence with a common finite mean $\E X_i = \mu$ and $\Var X_i = \sigma^2_i \leq C < \infty$ for all $i$. Then $\bar X_n \convp \mu$ where $\bar X_n = \frac 1n\sum_{i=1}^n X_i$.
Pf: by the linearity of expectation I know $\E \bar X_n = \mu$ for all $n$, and since the $\{X_i\}$ are uncorrelated with a common mean I have
$$
\begin{aligned}\Var \bar X_n &= \E(\bar X_n^2) - \mu^2 \\&
= -\mu^2 + n^{-2}\E\left(\sum_{i,j} X_iX_j\right) \\&
= -\mu^2 + n^{-2}\left(\sum_i \E X_i^2 + \sum_{i \neq j} (\E X_i)(\E X_j)\right) \\&
= -\mu^2 + n^{-2}\left(\sum_i \E X_i^2 + n(n-1)\mu^2\right) \\&
= n^{-2}\sum_i (\E X_i^2 - \mu^2) = \bar \sigma^2_n / n\end{aligned}
$$
where $\bar \sigma_n^2 = \frac 1n \sum_{i=1}^n \sigma^2_i$. Since $\sigma^2_i \leq C$ for all $i$ I know $\bar\sigma^2_n \leq C$ for all $n$ as well. I can now apply Chebyshev to get
$$
\begin{aligned}P(|\bar X_n - \mu| > \e) &= P\left(|\bar X_n - \mu| > \left[\frac{\e}{\bar \sigma_n / \sqrt n }\right]\cdot \bar \sigma_n / \sqrt n\right) \\&
\leq \frac{\bar\sigma^2_n}{n\e^2} \leq \frac{C}{n\e^2} \to 0\end{aligned}
$$
as $n\to\infty$.
$\square$
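Here’s a small simulation of this result (a sketch; the choice of Bernoulli($0.3$) summands, $\e = 0.05$, and the values of $n$ are all mine):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, eps, reps = 0.3, 0.05, 20_000

for n in [10, 100, 1000, 10_000]:
    # reps independent copies of X-bar_n for iid Bernoulli(0.3) summands
    xbar = rng.binomial(n, mu, size=reps) / n
    emp = np.mean(np.abs(xbar - mu) > eps)   # estimate of P(|X-bar_n - mu| > eps)
    bound = 0.3 * 0.7 / (n * eps**2)         # Chebyshev bound with C = Var = 0.21
    print(n, emp, bound)

# Both the empirical probability and the (loose) bound head to 0 as n grows.
```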
This result is nice because it’s really quick and easy to prove (and is a little more general than requiring the $\{X_i\}$ to be iid), but it requires the $X_i$ to have finite second moments, and it turns out I can do better.
My goal will be to prove the following:
Theorem: WLLN: suppose $\{X_i : i\in\mathbb N\}$ is an iid sequence with $\E|X_1| < \infty$ and $\mu := \E X_1$ (which by assumption is well-defined). Then $\bar X_n \convp \mu$. I’ll be following Durrett’s journey to this theorem as in Probability: Theory and Examples chapter 2. Durrett proves this by first establishing a couple of theorems making use of truncations, which is how I’ll avoid needing to assume a finite second moment. For a random variable $X$ and real-valued $M>0$, the truncation of $X$ to $M$ is
$$
\tilde X := X \one_{|X| \leq M}
$$
so $\tilde X$ is equal to $X$ when $|X|\leq M$ and is zero outside of $X^{-1}([-M,M])$. One consequence is that $\tilde X$ is always integrable:
$$
\E |\tilde X| = \int_{\Omega} |X|\,\one_{|X| \leq M}\,\text dP = \int_{|X| \leq M} |X|\,\text dP \leq M.
$$
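In code, truncation is just a masked copy of the variable; here’s a sketch with a standard Cauchy sample (which has $\E|X| = \infty$) and an arbitrary cutoff $M$:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_cauchy(1_000_000)           # E|X| is infinite
M = 50.0
X_trunc = np.where(np.abs(X) <= M, X, 0.0)   # X * 1{|X| <= M}

print(np.mean(np.abs(X_trunc)))              # finite, and at most M by construction
```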
Theorem 1 (2.2.11 in Durrett) Let $(X_{nk})$ be a triangular array of random variables so $n=1,2,\dots$ and $k = 1,\dots, n$. Assume independence within each row so $X_{nk} \perp X_{nj}$ for $k\neq j$. Let $(b_n)$ be a positive sequence with $b_n\to\infty$ and take $\tilde X_{nk}$ to be the truncation of $X_{nk}$ to $b_n$, i.e.
$$
\tilde X_{nk} = X_{nk}\one_{|X_{nk}| \leq b_n}.
$$
Since $b_n\to\infty$ the intervals that I’m truncating to (i.e. $[-b_n, b_n]$) are getting bigger and bigger.
I’ll assume two additional conditions that (as Durrett comments) make the proof easy, as they’re basically exactly what I need to prove the result.
Assume
1. $\sum_{k=1}^n P(|X_{nk}| > b_n) \to 0$
and
2. $b_n^{-2}\sum_{k=1}^n \E\tilde X^2_{nk} \to 0$.
In condition (1), $P(|X_{nk}| > b_n)$ is the probability that $X_{nk}$ is outside the interval I’m truncating to, so this condition says these tail probabilities vanish quickly enough that even their sum over the $n$th row goes to zero as $n\to\infty$. As an example, if my triangular array has iid standard Cauchy rows and $b_n = n$, so I have $\sum_{k=1}^n P(|X_{nk}| > b_n) = n P(|X_1| > n)$, then via the CDF of a standard Cauchy
$$
n P(|X_1| > n) = n \left(1 - \frac 1\pi \arctan(n) + \frac 1\pi \arctan(-n)\right) \to \frac 2\pi \neq 0
$$
so in this case the tails are heavy enough and $b_n$ grows slowly enough that I fail to drive the sum to zero. But if I take $b_n = n^2$ then even with a Cauchy distribution I still have $n P(|X_1| > n^2) \to 0$ so the condition holds (although this doesn’t rescue a law of large numbers for Cauchy RVs: with $b_n = n^2$ the theorem’s conclusion only concerns $(S_n - a_n)/n^2$, not $S_n/n$).
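Checking the Cauchy computation numerically (a sketch using `scipy.stats.cauchy`):

```python
import numpy as np
from scipy.stats import cauchy

for n in [10, 100, 1000, 10_000]:
    # P(|X_1| > t) = 2 * P(X_1 > t) by symmetry; compare b_n = n vs b_n = n^2
    print(n, n * 2 * cauchy.sf(n), n * 2 * cauchy.sf(n**2))

print(2 / np.pi)
# With b_n = n the product stalls near 2/pi ~ 0.6366; with b_n = n^2 it vanishes.
```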
For condition (2), since $\Var X = \E X^2 - (\E X)^2$, $\E X^2$ is an upper bound on the variance of $X$, so I can think of this assumption as saying that the total second moment of the truncated row is negligible compared to $b_n^2$.
Then, with all of these assumptions, if I let $S_n = \sum_{k=1}^n X_{nk}$ be the sum of the $n$th row and $a_n = \sum_{k=1}^n \E\tilde X_{nk}$ be the mean of the truncated row sum, I claim
$$
(S_n - a_n) / b_n \convp 0.
$$
Pf: Let $\tilde S_n = \sum_{k=1}^n \tilde X_{nk}$ be the row sum of the truncated triangular array.
Let $\w\in\Om$ be such that $|S_n(\w) - a_n| > \e b_n$. If $S_n(\w) = \tilde S_n(\w)$ then I’ll have $|\tilde S_n(\w) - a_n| > \e b_n$, so either $S_n \neq \tilde S_n$ or $\tilde S_n$ is more than $\e b_n$ away from $a_n$ (or both). This means that, as events,
$$
\left\{\left\vert \frac{S_n - a_n}{b_n}\right\vert > \e \right\} \subseteq \left\{S_n \neq \tilde S_n\right\}\cup\left\{\left\vert \frac{\tilde S_n - a_n}{b_n}\right\vert > \e\right\}
$$
therefore
$$
P\left(\left\vert \frac{S_n - a_n}{b_n}\right\vert > \e\right) \leq P(S_n \neq \tilde S_n) + P\left(\left\vert \frac{\tilde S_n - a_n}{b_n}\right\vert > \e\right) .
$$
I now need to bound each term of this to show that the left hand side converges to zero.
For the first term, I can use the fact that $S_n\neq \tilde S_n$ happens only if at least one $X_{nk}$ is truncated so
$$
P(S_n \neq \tilde S_n) \leq P\left(\bigcup_{k=1}^n\{X_{nk} \neq \tilde X_{nk}\} \right) \leq \sum_{k=1}^n P(X_{nk} \neq \tilde X_{nk})
$$
by the union bound. By my definition of truncation
$$
P(X_{nk} \neq \tilde X_{nk}) = P(|X_{nk}| > b_n)
$$
so
$$
P(S_n \neq \tilde S_n) \leq \sum_{k=1}^n P(|X_{nk}| > b_n) \to 0
$$
by my assumption (1). To recap this section, I’m bounding the probability that the row sum is different from the row sum of the truncated array. I did this by noting that the event $\{S_n\neq\tilde S_n\}$ is contained within the event $\{\text{at least one }X_{nk}\text{ is truncated}\}$, and in turn I can bound that by the sum of each individual probability of truncation. And by assumption this heads to zero.
For the other term, I have $|\tilde S_n - a_n|$ and $\E \tilde S_n = a_n$ so I want to use Chebyshev’s inequality, but I need to confirm that $\tilde S_n$ has a finite variance first. For that, I can expand $\tilde S_n^2$ to get
$$
\E \tilde S_n^2 = \sum_{i,j=1}^n \E \tilde X_{ni}\tilde X_{nj}.
$$
When $i=j$ I have
$$
\E\tilde X_{ni}^2 = \int_{|X_{ni}| \leq b_n} X^2_{ni}\,\text dP \leq b_n^2 < \infty
$$
which means all of the squared terms have finite expectations (as a consequence of the truncation). And by Cauchy-Schwarz
$$
[\E(\tilde X_{ni}\tilde X_{nj})]^2 \leq \E(\tilde X_{ni}^2)\E(\tilde X_{nj}^2) < \infty
$$
so all together this means $\E \tilde S_n^2 < \infty$, therefore $\Var \tilde S_n$ is finite and well-defined and I can use Chebyshev. Doing so gives me
$$
P\left(\left\vert\frac{\tilde S_n - a_n}{b_n}\right\vert > \e\right) \leq \e^{-2}\Var\left(\frac{\tilde S_n - a_n}{b_n}\right) = \e^{-2}b_n^{-2}\Var\tilde S_n.
$$
I can now apply the assumption of independence within each row of the array to get
$$
\Var\tilde S_n = \sum_{k=1}^n \Var \tilde X_{nk}.
$$
Furthermore, for a random variable $X$ I know $\Var X = \E X^2 - (\E X)^2 \leq \E X^2$ so
$$
P\left(\left\vert\frac{\tilde S_n - a_n}{b_n}\right\vert > \e\right) \leq \e^{-2} b_n^{-2}\sum_{k=1}^n \E\tilde X_{nk}^2 \to 0
$$
by assumption (2). In summary, for this section I showed that the truncation made it so that $\tilde S_n$ is guaranteed to have a finite variance, and since $|\tilde S_n – a_n|$ is a deviation from the mean, I can use Chebyshev to bound this in terms of the sum of the 2nd moments of each truncated term in the current row of the array. And by my initial assumption, this sum is dominated by $b_n^2$.
All together I just showed that $P(|S_n - a_n| > \e b_n)$ is bounded by the sum of two terms that each converge to $0$ as $n\to\infty$, which establishes that $(S_n - a_n)/b_n \convp 0$.
$\square$
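To get a feel for the two terms in this bound, here’s a sketch that estimates each of them for rows of iid classical Pareto($3/2$) variables with $b_n = n$ (the distribution, which has a finite mean but infinite variance, is my own choice):

```python
import numpy as np

alpha, reps = 1.5, 1000
rng = np.random.default_rng(7)

for n in [100, 1000, 10_000]:
    # reps simulated rows X_{n1}, ..., X_{nn}; numpy's pareto is Lomax, so add 1
    X = rng.pareto(alpha, size=(reps, n)) + 1.0
    truncated = np.where(X <= n, X, 0.0)        # row-wise truncation to b_n = n
    term1 = np.mean(np.any(X > n, axis=1))      # estimate of P(S_n != S~_n)
    term2 = np.mean(truncated**2) / n           # b_n^{-2} sum_k E[X~_{nk}^2]
    print(n, term1, term2)

# Both the truncation-mismatch probability and the normalized second-moment
# sum shrink as n grows, which is what conditions (1) and (2) ask for.
```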
I now want to use this to get a result for a single sequence.
Theorem 2 (2.2.12 in Durrett) Let $X_1,X_2,\dots$ be iid with
$$
x P(|X_1| > x) \to 0
$$
as $x\to\infty$. Also let $S_n = \sum_{i\leq n} X_i$ and $\mu_n = \E(X_1\one_{|X_1|\leq n})$ (so $\mu_n$ is the mean of the truncation of $X_1$ to $[-n,n]$). Then $S_n/n - \mu_n \convp 0$.
Pf: I want to use Theorem 1 so I’ll construct a triangular array $(X_{nk})$ with $X_{nk} = X_k$, so the $n$th row of my array is just the first $n$ elements of $X_1,X_2,\dots$. By the iid assumption I have independence within rows (although not between rows, since elements are repeated, but Theorem 1 only asks for independence within each row). I will take $b_n = n$ which satisfies the requirement that $b_n>0$ and $b_n\to\infty$, and my truncations are then $\tilde X_{nk} = X_k\one_{|X_k|\leq n}$.
In order to apply Theorem 1 I need to confirm the two conditions. For the first one,
$$
\sum_{k=1}^nP(|X_{nk}| > b_n) = \sum_{k=1}^nP(|X_{k}| > n) = n P(|X_1| > n) \to 0
$$
by assumption. For the second assumption I have
$$
n^{-2}\sum_{k=1}^n \E\tilde X_{nk}^2 = n^{-1}\E\tilde X_{n1}^2
$$
and I need this to converge to zero. Durrett uses the fact that for a nonnegative RV $Y$ and real number $p>0$, $\E Y^p = \int_0^\infty py^{p-1}P(Y>y)\,\text dy$, so
$$
\E\tilde X_{n1}^2 = \int_0^\infty 2y P(|\tilde X_{n1}| > y)\,\text dy
$$
where I’ve used $Y = |\tilde X_{n1}|$ to get nonnegativity. $\tilde X_{n1} = X_1\one_{|X_1| \leq n}$ so $P(|\tilde X_{n1}| > y)= 0$ for $y \geq n$. Additionally, for $y\leq n$
$$
P(|\tilde X_{n1}| > y) = P(|X_1| \in (y, n])
$$
since if $|X_1|$ exceeds $n$ then it gets truncated to zero. This can be written as a difference of values of the CDF of $|X_1|$, so
$$
P(|X_1| \in (y, n]) = P(|X_1| \leq n) - P(|X_1| \leq y) \\
= P(|X_1| > y) - P(|X_1| > n).
$$
by taking complements. This means
$$
\begin{aligned}n^{-1}\E\tilde X_{n1}^2 &= n^{-1}\int_0^n 2y \left[P(|X_1| > y) - P(|X_1| > n)\right]\,\text dy \\&
= n^{-1}\int_0^n 2y P(|X_1| > y)\,\text dy - nP(|X_1| > n ) \\&
\leq n^{-1}\int_0^n 2y P(|X_1| > y)\,\text dy\end{aligned}
$$
and by the assumption of $x P(|X_1| > x)\to 0$, the inequality becomes increasingly sharp as $n\to\infty$. I now need to show that this converges to $0$ as $n$ increases.
Let $g(y) = 2y P(|X_1| > y)$. I need to show that the average value of $g$ over $[0,n]$ converges to zero. Intuitively, as Durrett says, this makes sense because by assumption $g(y)\to 0$ as $y\to\infty$ so $g$ flattens out and the averaging includes more and more small values.
To show this rigorously, $P(|X_1| > y) \in [0,1]$ so $0 \leq g(y) \leq 2y$ everywhere, but also $g(y)\to 0$ as $y\to\infty$. This means that there is some $y_0 >0$ such that $y > y_0 \implies g(y) < 1$, so on $[0, y_0]$ $g$ is bounded by $2y_0$ while on $(y_0, \infty)$ it is bounded by $1$, so all together $M := \sup g \leq \max\{2 y_0, 1\} < \infty$, i.e. $g$ is bounded everywhere. Now I can let $g_n(y) = g(ny)$ so I’ve got
$$
g_n(y) = 2ny P(|X_1| > ny).
$$
Letting $y = nx$ I have
$$
n^{-1}\int_0^n g(y)\,\text dy = \int_0^1 g_n(x)\,\text dx.
$$
$g$ being bounded means $g_n$ is too, and $g_n(x) = g(nx) \to 0$ as $n\to\infty$ for each fixed $x\in(0,1]$ (with $g_n(0) = g(0) = 0$). Since $g_n$ is bounded and I’m integrating on a finite interval, I can take $h(x) = M$ as my dominating function and the dominated convergence theorem lets me conclude $\int_0^1 g_n(x)\,\text dx \to 0$ as $n\to\infty$.
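Here’s a numerical illustration of this averaging argument (a sketch; the tail $P(|X_1| > y) = \min(1, y^{-3/2})$ of a classical Pareto($3/2$) variable is my own choice: it has a finite mean, an infinite variance, and satisfies $y\,P(|X_1|>y)\to 0$):

```python
from scipy.integrate import quad

def tail(y):
    # P(|X_1| > y) for a classical Pareto(3/2) variable supported on [1, infinity)
    return 1.0 if y <= 1.0 else y ** -1.5

def g(y):
    return 2 * y * tail(y)

for n in [10, 100, 1000, 10_000]:
    integral, _ = quad(g, 0, n, points=[1.0])  # integral_0^n 2y P(|X_1| > y) dy
    print(n, integral / n)

# The averages n^{-1} * integral decay toward 0 (roughly like 4 / sqrt(n) here),
# matching the dominated convergence argument.
```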
I’ve now shown that all the requirements for Theorem 1 hold. I have $a_n = n \mu_n$ so Theorem 1 says
$$
\frac{S_n - a_n}{b_n} = S_n/n - \mu_n \convp 0
$$
as desired.
$\square$
I’m now ready to finish this off and prove the WLLN theorem that I set out to establish.
I’ll restate it:
Theorem: WLLN: Let $X_1,X_2,\dots$ be iid with $\E |X_1| < \infty$. Take $S_n = \sum_{k\leq n} X_k$ and let $\mu = \E X_1$. I claim $S_n/n\convp \mu$.
Pf: I first need to establish that
$$
x P(|X_1| > x) \to 0
$$
so that I can apply Theorem 2. For any $x > 0$ I know
$$
0 \leq x P(|X_1| > x) = \int_{|X_1| > x} x \,\text dP \\
\leq \int_{|X_1| > x} |X_1|\,\text dP = \E(|X_1|\one_{|X_1| > x}).
$$
I now want to show
$$
\E(|X_1|\one_{|X_1| > x})\to 0\; \text{ as }\; x\to\infty.
$$
I’ll do this via the dominated convergence theorem (DCT). Let $(x_n)$ be an arbitrary sequence that converges to $\infty$ (WLOG I’ll take all elements to be positive). Define
$$
f_n(\w) = |X_1(\w)|\one_{|X_1(\w)| > x_n}
$$
and note $f_n\to 0$ pointwise as $n\to\infty$. I need an integrable function $g$ that bounds $f_n$ for each $n$ and $\w$, and since $\E|X_1|<\infty$ I can just take $g(\w) = |X_1(\w)|$. Then I have
$$
\lim_{n\to\infty} \int f_n\,\text dP = \int \lim_n f_n \,\text dP = 0.
$$
This was for an arbitrary sequence $x_n\to\infty$, therefore $x P(|X_1| > x)\to 0$.
I can now apply Theorem 2 to conclude that
$$
S_n / n - \mu_n \convp 0
$$
with $\mu_n = \E(X_1\one_{|X_1| \leq n})$. But I want to show that $S_n / n - \mu \convp 0$ so I need the limit of $\mu_n$. For this I can use the DCT again. I have
$$
\lim_{n\to\infty} \int X_1 \one_{|X_1| \leq n}\,\text dP
$$
and the sequence of functions $X_1 \one_{|X_1| \leq n}$ is bounded in absolute value by the integrable function $|X_1|$. This means I can exchange the limit and integral so
$$
\lim_{n\to\infty} \int X_1 \one_{|X_1| \leq n}\,\text dP = \int X_1\,\text dP = \mu.
$$
I now have confirmed that $\mu_n\to\mu$ so I have
$$
S_n/n - \mu_n \convp 0 \\
\mu_n\to \mu.
$$
By Slutsky’s theorem I know
$$
(S_n / n - \mu_n) + \mu_n = S_n/n \convd \mu
$$
and $\mu$ is constant so by Result 3 $S_n/n \convp \mu$.
$\square$
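To close the loop numerically, here’s a sketch of the WLLN in a case where the 2nd moment version doesn’t apply: a classical Pareto($3/2$) distribution (my choice) with $\E X_1 = 3$ but $\Var X_1 = \infty$.

```python
import numpy as np

# Classical Pareto with shape 3/2 and scale 1 (numpy's pareto is Lomax, so add 1):
# E[X_1] = 3 is finite but Var X_1 is infinite, so the 2nd moment WLLN doesn't
# apply here while the full WLLN does.
alpha, mu, eps, reps = 1.5, 3.0, 0.25, 1000
rng = np.random.default_rng(6)

for n in [100, 1000, 10_000]:
    samples = rng.pareto(alpha, size=(reps, n)) + 1.0
    xbar = samples.mean(axis=1)                  # reps independent copies of S_n / n
    print(n, np.mean(np.abs(xbar - mu) > eps))   # estimate of P(|S_n/n - mu| > eps)

# The estimates still head to 0, just more slowly than with a finite variance.
```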
At this point I’ve established some basic results about convergence in distribution and probability. In my next post on this I’ll introduce almost sure convergence, compare that with convergence in probability, and prove the strong law of large numbers.