Monte Carlo Methods


Let \(\lambda \geq 0\) be a probability density function on \(G \subset \mathbb{R}^{d}\) such that

\[ \int_{G} \lambda(\omega) d \omega=1 \]

For example:

\[ \lambda(\omega)=\frac{1}{|G|} \]

if \(G\) is bounded. The expectation of a function \(g\) on \(G\) is defined as

\[ \mathbb{E} g:=\int_{G} g(\omega) \lambda(\omega) d \omega \]

and, for any \(h=h\left(\omega_{1}, \omega_{2}, \ldots, \omega_{n}\right): G \times G \times \cdots \times G \to \mathbb{R}\),

\[ \mathbb{E}_{n} h:=\int_{G \times G \times \ldots \times G} h\left(\omega_{1}, \omega_{2}, \ldots, \omega_{n}\right) \lambda\left(\omega_{1}\right) \lambda\left(\omega_{2}\right) \ldots \lambda\left(\omega_{n}\right) d \omega_{1} d \omega_{2} \ldots d \omega_{n} \]

A basic result

Lemma 9

For any \(g \in L^{\infty}(G)\), we have

\[\begin{split} \mathbb{E}_{n}\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2}=\left\{\begin{array}{c} \frac{1}{n} \mathbb{E}\left((\mathbb{E} g-g)^{2}\right) \leq \frac{1}{n} \sup _{\omega, \omega^{\prime} \in G}\left|g(\omega)-g\left(\omega^{\prime}\right)\right|^{2} \\ \frac{1}{n}\left(\mathbb{E}\left(g^{2}\right)-(\mathbb{E}(g))^{2}\right) \leq \frac{1}{n} \mathbb{E}\left(g^{2}\right) \leq \frac{1}{n}\|g\|_{L^{\infty}}^{2}. \end{array}\right. \end{split}\]

Proof. First note that

\[\begin{split} \begin{aligned} \left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2} &=\frac{1}{n^{2}}\left(n \mathbb{E} g-\sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2}=\frac{1}{n^{2}}\left(\sum_{i=1}^{n}\left(\mathbb{E} g-g\left(\omega_{i}\right)\right)\right)^{2} \\ &=\frac{1}{n^{2}} \sum_{i, j=1}^{n}\left(\mathbb{E} g-g\left(\omega_{i}\right)\right)\left(\mathbb{E} g-g\left(\omega_{j}\right)\right) \\ &=\frac{I_{1}}{n^{2}}+\frac{I_{2}}{n^{2}} \end{aligned} \end{split}\]

with

\[ I_{1}=\sum_{i=1}^{n}\left(\mathbb{E} g-g\left(\omega_{i}\right)\right)^{2}, \quad I_{2}=\sum_{i \neq j}^{n}\left((\mathbb{E} g)^{2}-\mathbb{E}(g)\left(g\left(\omega_{i}\right)+g\left(\omega_{j}\right)\right)+g\left(\omega_{i}\right) g\left(\omega_{j}\right)\right) \]

Consider \(I_{1}\): for any \(i\),

\[ \mathbb{E}_{n}\left(\mathbb{E} g-g\left(\omega_{i}\right)\right)^{2}=\mathbb{E}\left((\mathbb{E} g-g)^{2}\right) \]

Thus,

\[ \mathbb{E}_{n} I_{1}=n \mathbb{E}\left((\mathbb{E} g-g)^{2}\right) \]

For \(I_{2}\), note that

\[ \mathbb{E}_{n} g\left(\omega_{i}\right)=\mathbb{E}_{n} g\left(\omega_{j}\right)=\mathbb{E}(g)\]

and, for \(i \neq j\),

\[\begin{split} \begin{aligned} \mathbb{E}_{n}\left(g\left(\omega_{i}\right) g\left(\omega_{j}\right)\right) &=\int_{G \times G \times \cdots \times G} g\left(\omega_{i}\right) g\left(\omega_{j}\right) \lambda\left(\omega_{1}\right) \lambda\left(\omega_{2}\right) \cdots \lambda\left(\omega_{n}\right) d \omega_{1} d \omega_{2} \cdots d \omega_{n} \\ &=\int_{G \times G} g\left(\omega_{i}\right) g\left(\omega_{j}\right) \lambda\left(\omega_{i}\right) \lambda\left(\omega_{j}\right) d \omega_{i} d \omega_{j} \\ &=\mathbb{E}\left(g\left(\omega_{i}\right)\right) \mathbb{E}\left(g\left(\omega_{j}\right)\right)=[\mathbb{E}(g)]^{2} \end{aligned} \end{split}\]

Thus

\[ \mathbb{E}_{n}\left(I_{2}\right)=\sum_{i \neq j}^{n}\left((\mathbb{E} g)^{2}-\mathbb{E}(g)\left(\mathbb{E}_{n}\left(g\left(\omega_{i}\right)\right)+\mathbb{E}_{n}\left(g\left(\omega_{j}\right)\right)\right)+\mathbb{E}_{n}\left(g\left(\omega_{i}\right) g\left(\omega_{j}\right)\right)\right)=0 \]

Consequently, we obtain the following two formulas for \(\mathbb{E}_{n}\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2}:\)

\[\begin{split} \mathbb{E}_{n}\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2}=\frac{1}{n^{2}} \mathbb{E}_{n} I_{1}=\left\{\begin{array}{c} \frac{1}{n} \mathbb{E}\left((\mathbb{E} g-g)^{2}\right) \\ \frac{1}{n}\left(\mathbb{E}\left(g^{2}\right)-(\mathbb{E} g)^{2}\right) \end{array}\right. \end{split}\]

Based on the first formula above, since

\[ |g(\omega)-\mathbb{E} g|=\left|\int_{G}(g(\omega)-g(\tilde{\omega})) \lambda(\tilde{\omega}) d \tilde{\omega}\right| \leq \sup _{\omega, \omega^{\prime} \in G}\left|g(\omega)-g\left(\omega^{\prime}\right)\right| \]

it holds that

\[ \mathbb{E}_{n}\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2} \leq \frac{1}{n} \sup _{\omega, \omega^{\prime} \in G}\left|g(\omega)-g\left(\omega^{\prime}\right)\right|^{2} \]

Due to the second formula above,

\[ \mathbb{E}_{n}\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2} \leq \frac{1}{n} \mathbb{E}\left(g^{2}\right) \leq \frac{1}{n}\|g\|_{L^{\infty}}^{2} \]

which completes the proof. \(\square\)

We can combine this with Markov's inequality to prove a high-probability result.

Corollary 1

Under the assumptions of the preceding lemma, we have

\[ \mathbb{P}_{n}\left[\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2}>\frac{k}{n}\|g\|_{L^{\infty}}^{2}\right]<\frac{1}{k} \]

where \(\mathbb{P}_{n}\) denotes the probability with respect to the product density \(\lambda\left(\omega_{1}\right) \cdots \lambda\left(\omega_{n}\right)\).

Proof. By Markov's inequality,

\[ \mathbb{P}_{n}\left[\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2}>\epsilon\right] \leq \epsilon^{-1} \mathbb{E}_{n}\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2} \leq \frac{1}{n \epsilon}\|g\|_{L^{\infty}}^{2} \]

Taking \(\epsilon=\frac{k}{n}\|g\|_{L^{\infty}}^{2}\) gives the claim. \(\square\)

This corollary implies that the set of \(\omega_{i}\) where the estimate \(n^{-1} \sum_{i=1}^{n} g\left(\omega_{i}\right)\) is far from the desired value \(\mathbb{E} g\) is small.
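To see the \(1/n\) decay of Lemma 9 numerically, here is a minimal sketch; the domain \(G=[0,1]^{2}\) with the uniform density and the particular bounded \(g\) below are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: G = [0,1]^2, lambda = 1 (uniform), g bounded but arbitrary.
def g(w):
    return np.sin(2 * np.pi * w[:, 0]) * w[:, 1]

# Reference value of E g from one very large sample.
Eg = g(rng.random((10**6, 2))).mean()

for n in [10, 100, 1000, 10000]:
    # Average the squared error over many independent n-samples to
    # approximate E_n (E g - n^{-1} sum_i g(w_i))^2.
    mse = np.mean([(Eg - g(rng.random((n, 2))).mean()) ** 2 for _ in range(200)])
    print(f"n={n:6d}  mean squared error ≈ {mse:.2e}")
```

The printed errors shrink roughly tenfold each time \(n\) grows tenfold, matching the rate \(\frac{1}{n}\left(\mathbb{E}\left(g^{2}\right)-(\mathbb{E} g)^{2}\right)\).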

The practical usefulness of this algorithm depends upon the existence of a repeatable process (for instance some physical process) which generates \(\omega\) according to a desired distribution \(\mu\).

The precise meaning of this last statement is essentially that the strong law of large numbers holds. Specifically, if \(\omega_{1}, \ldots, \omega_{n}, \ldots\) is an infinite sequence generated by the process, and \(A \subset \Omega\) is any measurable set, then

\[ \lim _{n \rightarrow \infty} \frac{1}{n} \sum_{i=1}^{n} \chi_{A}\left(\omega_{i}\right)=\mu(A) \]

Generating \(n\) independent samples means generating \(\omega_{1}, \ldots, \omega_{n}\) from \(\mu^{n}\) according to the above notion. The existence of a realizable process generating samples from a probability distribution, and the practical use of such processes, is an interesting topic at the intersection of statistics, physics, and computer science. In addition, statistics/probability theory studies how to take samples from one probability distribution and transform them into samples from another distribution.
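As a minimal illustration of such a transformation, the sketch below uses inverse transform sampling (a standard technique; the density and the set \(A\) are hypothetical choices) to turn uniform samples into samples from another distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target density lambda(w) = 2w on [0,1]; its CDF is F(w) = w^2, so
# F^{-1}(u) = sqrt(u) turns Uniform(0,1) samples into samples from lambda.
w = np.sqrt(rng.random(10**6))

# Check the strong-law statement with the indicator of A = [0, 1/2]:
# n^{-1} sum_i chi_A(w_i) should approach mu(A) = F(1/2) = 0.25.
print("empirical frequency:", np.mean(w <= 0.5))
print("mu(A):              ", 0.25)
```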

Application

Lemma 10

Let

\[ f(x)=\int_{G} g(x, \theta) \lambda(\theta) d \theta=\mathbb{E}(g) \]

with \(\lambda(\theta) \geq 0\) and \(\|\lambda\|_{L^{1}(G)}=1\). For any \(n \geq 1\), there exist \(\theta_{i}^{*} \in G\) such that

\[ \left\|f-f_{n}\right\|_{L^{2}(\Omega)}^{2} \leq \frac{1}{n} \int_{G}\|g(\cdot, \theta)\|_{L^{2}(\Omega)}^{2} \lambda(\theta) d \theta=\frac{1}{n} \mathbb{E}\left(\|g(\cdot, \theta)\|_{L^{2}(\Omega)}^{2}\right) \]

where \(\|g(\cdot, \theta)\|_{L^{2}(\Omega)}^{2}=\int_{\Omega}[g(x, \theta)]^{2} d \mu(x)\), and

\[ f_{n}(x)=\frac{1}{n} \sum_{i=1}^{n} g\left(x, \theta_{i}^{*}\right) \]

Proof. By the integral representation with the probability density \(\lambda(\theta)\),

\[ f(x)=\mathbb{E}(g(x, \cdot)) \]

By Lemma 9,

\[ \mathbb{E}_{n}\left(\left(\mathbb{E}(g(x, \cdot))-\frac{1}{n} \sum_{i=1}^{n} g\left(x, \theta_{i}\right)\right)^{2}\right) \leq \frac{1}{n} \mathbb{E}\left(g^{2}(x, \cdot)\right) \]

Integrating over \(x \in \Omega\) and interchanging the order of integration gives

\[ \mathbb{E}_{n}\left(h\left(\theta_{1}, \theta_{2}, \cdots, \theta_{n}\right)\right) \leq \frac{1}{n} \mathbb{E}\left(\int_{\Omega} g^{2} d \mu(x)\right), \]

where

\[ h\left(\theta_{1}, \theta_{2}, \cdots, \theta_{n}\right)=\int_{\Omega}\left(\mathbb{E}(g(x, \cdot))-\frac{1}{n} \sum_{i=1}^{n} g\left(x, \theta_{i}\right)\right)^{2} d \mu(x) \]

Since \(\mathbb{E}_{n}(1)=1\) and \(\mathbb{E}_{n}(h) \leq \frac{1}{n} \mathbb{E}\left(\int_{\Omega} g^{2} d \mu(x)\right)\), there exists \(\left(\theta_{1}^{*}, \theta_{2}^{*}, \cdots, \theta_{n}^{*}\right) \in G \times G \times \cdots \times G\) such that

\[ h\left(\theta_{1}^{*}, \theta_{2}^{*}, \cdots, \theta_{n}^{*}\right) \leq \frac{1}{n} \int_{\Omega} \mathbb{E}\left(g^{2}\right) d \mu(x) \]

Otherwise, if \(h\left(\theta_{1}, \theta_{2}, \cdots, \theta_{n}\right)>\frac{1}{n} \int_{\Omega} \mathbb{E}\left(g^{2}\right) d \mu(x)\) for every tuple, then \(\mathbb{E}_{n}(h)>\frac{1}{n} \mathbb{E}\left(\int_{\Omega} g^{2} d \mu(x)\right)\), a contradiction. This implies that

\[ \left\|f-f_{n}\right\|_{L^{2}(\Omega)}^{2} \leq \frac{1}{n} \int_{G}\|g(\cdot, \theta)\|_{L^{2}(\Omega)}^{2} \lambda(\theta) d \theta, \]

which completes the proof. \(\square\)
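The proof is an averaging argument: since the average of \(h\) over all \(n\)-tuples is at most the bound, some tuple must do at least as well as the average. A minimal numerical sketch of this, with the illustrative (hypothetical) choices \(g(x, \theta)=\cos (\theta x)\), \(\lambda\) uniform on \(G=[0,2 \pi]\), and \(\Omega=[0,1]\) with Lebesgue measure:

```python
import numpy as np

rng = np.random.default_rng(2)

n, trials = 20, 500
x = np.linspace(1e-6, 1, 400)                 # grid on Omega = [0, 1]

# Illustrative kernel g(x, theta) = cos(theta x), lambda uniform on [0, 2*pi];
# then f(x) = E g(x, .) = sin(2*pi*x) / (2*pi*x) in closed form.
f = np.sin(2 * np.pi * x) / (2 * np.pi * x)

# The bound (1/n) E(||g(., theta)||^2), estimated with a large theta sample.
theta_ref = rng.uniform(0, 2 * np.pi, 20000)
bound = (np.cos(np.outer(theta_ref, x)) ** 2).mean() / n

# h(theta_1, ..., theta_n) for many random tuples; the minimum beats the mean.
h = [((f - np.cos(np.outer(rng.uniform(0, 2 * np.pi, n), x)).mean(axis=0)) ** 2).mean()
     for _ in range(trials)]
print(f"mean h = {np.mean(h):.3e}  <=  bound = {bound:.3e}")
print(f"min  h = {np.min(h):.3e}   (some tuple achieves the bound)")
```

Among the sampled tuples, the smallest value of \(h\) lies well below the bound, exactly as the existence argument predicts.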

We also have a more general version of the above lemma.

Lemma 11

Let

\[ f(x)=\int_{G} g(x, \theta) \lambda(\theta) d \theta=\mathbb{E}(g) \]

with \(\lambda \geq 0\) and \(\|\lambda\|_{L^{1}(G)}=1\). For any \(n \geq 1\), there exist \(\theta_{i}^{*} \in G\) such that

\[ \left\|f-f_{n}\right\|_{H^{m}(\Omega)}^{2} \leq \frac{1}{n} \int_{G}\|g(\cdot, \theta)\|_{H^{m}(\Omega)}^{2} \lambda(\theta) d \theta=\frac{1}{n} \mathbb{E}\left(\|g(\cdot, \theta)\|_{H^{m}(\Omega)}^{2}\right) \]

where

\[ f_{n}(x)=\frac{1}{n} \sum_{i=1}^{n} g\left(x, \theta_{i}^{*}\right) \]

In particular, if

\[ \left|D^{\alpha} g(x, \theta)\right| \leq C, \quad \forall x, \theta,|\alpha| \leq m \]

then, since the number of multi-indices \(\alpha\) with \(|\alpha| \leq m\) in \(d\) dimensions is \(\binom{m+d}{m}\),

\[\begin{split} \left\|f-f_{n}\right\|_{H^{m}(\Omega)} \leq C\left(\begin{array}{c} m+d \\ m \end{array}\right)^{1 / 2}|\Omega|^{1 / 2} n^{-1 / 2} \end{split}\]

For \(f(x)=\int_{G} g(x, \theta) \rho(\theta) d \theta\) with \(\|\rho\|_{L^{1}(G)} \neq 1\), let \(\lambda(\theta)=\frac{\rho(\theta)}{\|\rho\|_{L^{1}(G)}}\). Thus,

\[ f(x)=\|\rho\|_{L^{1}(G)} \int_{G} g(x, \theta) \lambda(\theta) d \theta \]

with \(\|\lambda\|_{L^{1}(G)}=1\), and we can apply the above two lemmas to the rescaled integrand \(\|\rho\|_{L^{1}(G)}\, g(x, \theta)\).
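For example (an illustrative choice of weight, not from the notes): if \(\rho(\theta)=e^{-|\theta|}\) on \(G=\mathbb{R}\), then \(\|\rho\|_{L^{1}(\mathbb{R})}=2\), so \(\lambda(\theta)=\frac{1}{2} e^{-|\theta|}\) is the Laplace density and

\[ f(x)=2 \int_{\mathbb{R}} g(x, \theta) \lambda(\theta) d \theta, \]

and Lemma 10 applied with the integrand \(2 g(x, \theta)\) yields \(\left\|f-f_{n}\right\|_{L^{2}(\Omega)}^{2} \leq \frac{4}{n} \mathbb{E}\left(\|g(\cdot, \theta)\|_{L^{2}(\Omega)}^{2}\right)\).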

Integral representations of functions

Fourier representation

Consider the Fourier transform:

\[ \hat{f}(\omega)=\frac{1}{(2 \pi)^{d}} \int_{\mathbb{R}^{d}} e^{-i \omega \cdot x} f(x) d x \quad \forall \omega \in \mathbb{R}^{d} \]

We write \(\hat{f}(\omega)=e^{i \beta(\omega)}|\hat{f}(\omega)| .\) By the Fourier inversion formula,

\[ f(x)=\int_{\mathbb{R}^{d}} e^{i \omega \cdot x} \hat{f}(\omega) d \omega=\int_{\mathbb{R}^{d}} e^{i(\omega \cdot x+\beta(\omega))}|\hat{f}(\omega)| d \omega \]

Since \(f(x)\) is real-valued, for every \(x\),

\[\begin{split} \begin{aligned} f(x) &=\operatorname{Re} \int_{\mathbb{R}^{d}} e^{i \omega \cdot x} \hat{f}(\omega) d \omega \\ &=\operatorname{Re} \int_{\mathbb{R}^{d}} e^{i \omega \cdot x} e^{i \beta(\omega)}|\hat{f}(\omega)| d \omega \\ &=\int_{\mathbb{R}^{d}} \cos (\omega \cdot x+\beta(\omega))|\hat{f}(\omega)| d \omega \end{aligned} \end{split}\]

Then we have

\[ f(x)=\int_{\mathbb{R}^{d}} k(x, \omega) d \omega \]

with

\[ k(x, \omega)=\cos (\omega \cdot x+\beta(\omega))|\hat{f}(\omega)| \]

and

\[ |k(x, \omega)| \leq|\hat{f}(\omega)|=:\rho(\omega) \]

Theorem 11

Let \(G=\mathbb{R}^{d}\) and \(\rho(\omega)=|\hat{f}(\omega)|\). There exist \(\omega_{i}^{*} \in \mathbb{R}^{d}\) such that

\[ \int_{\Omega}\left(f(x)-f_{n}(x)\right)^{2} d \mu(x) \leq \frac{\|\hat{f}\|_{L^{1}}}{n} \int_{\mathbb{R}^{d}}|\hat{f}(\omega)| d \omega=\frac{1}{n}\|\hat{f}\|_{L^{1}}^{2} \]

where \(\beta_{i}^{*}=\beta\left(\omega_{i}^{*}\right)\) and

\[ f_{n}(x)=\frac{\|\hat{f}\|_{L^{1}}}{n} \sum_{i=1}^{n} \frac{\left|\hat{f}\left(\omega_{i}^{*}\right)\right| \cos \left(\omega_{i}^{*} \cdot x+\beta_{i}^{*}\right)}{\rho\left(\omega_{i}^{*}\right)}=\frac{\|\hat{f}\|_{L^{1}}}{n} \sum_{i=1}^{n} \cos \left(\omega_{i}^{*} \cdot x+\beta_{i}^{*}\right) \]

Note that

\[ f_{n}(x)=\frac{\|\hat{f}\|_{L^{1}}}{n} \sum_{i=1}^{n} \cos \left(\omega_{i}^{*} \cdot x+\beta_{i}^{*}\right) \in \mathrm{N}(\sigma, n) \]

with

\[ \sigma(t)=\cos (t) \]
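A minimal numerical sketch of this construction. The choice \(f(x)=e^{-x^{2} / 2}\) in \(d=1\) is illustrative: with the convention above, \(\hat{f}(\omega)=\frac{1}{\sqrt{2 \pi}} e^{-\omega^{2} / 2}\) is real and positive, so \(\beta \equiv 0\), \(\|\hat{f}\|_{L^{1}}=1\), and \(\lambda=|\hat{f}|\) is the standard normal density. The sketch samples random \(\omega_{i}\) rather than finding the optimal \(\omega_{i}^{*}\), so it exhibits the \(1/n\) rate on average:

```python
import numpy as np

rng = np.random.default_rng(3)

# f(x) = exp(-x^2/2); its transform (with the 1/(2*pi)^d convention) is the
# standard normal density, so ||hat f||_{L^1} = 1 and beta(omega) = 0.
def f(x):
    return np.exp(-x**2 / 2)

x = np.linspace(-3, 3, 601)                       # Omega = [-3, 3]
for n in [10, 100, 1000, 10000]:
    omega = rng.standard_normal(n)                # samples from lambda = |hat f|
    fn = np.cos(np.outer(omega, x)).mean(axis=0)  # (||hat f||/n) * sum cos(w_i x)
    err2 = ((f(x) - fn) ** 2).mean() * 6.0        # squared L2(Omega) error, |Omega| = 6
    print(f"n={n:6d}  squared L2 error ≈ {err2:.2e}")
```

This works because \(\mathbb{E} \cos (\omega x)=e^{-x^{2} / 2}\) when \(\omega \sim N(0,1)\), which is exactly the representation \(f(x)=\int k(x, \omega) d \omega\) above.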

Double Fourier representation

Assume that \(\sigma\) is a locally Riemann integrable function with \(\sigma \in L^{1}(\mathbb{R})\), so that the Fourier transform

\[ \hat{\sigma}(a)=\frac{1}{2 \pi} \int_{\mathbb{R}} \sigma(t) e^{-i a t} d t \]

is well-defined and continuous. Since \(\sigma\) is non-zero, \(\hat{\sigma}\) is not identically zero, and hence \(\hat{\sigma}(a) \neq 0\) for some \(a \neq 0\). Via the change of variables \(t=\omega \cdot x+b\), \(d t=d b\), this means that for all \(x\) and \(\omega\), we have

\[\begin{split} \begin{aligned} 0 \neq \hat{\sigma}(a) &=\frac{1}{2 \pi} \int_{\mathbb{R}} \sigma(\omega \cdot x+b) e^{-i a(\omega \cdot x+b)} d b \\ &=e^{-i a \omega \cdot x} \frac{1}{2 \pi} \int_{\mathbb{R}} \sigma(\omega \cdot x+b) e^{-i a b} d b \end{aligned} \end{split}\]

and so

\[ e^{i a \omega \cdot x}=\frac{1}{2 \pi \hat{\sigma}(a)} \int_{\mathbb{R}} \sigma(\omega \cdot x+b) e^{-i a b} d b \]

Likewise, since the growth condition also implies that \(\sigma^{(k)} \in L^{1}\), we can differentiate the above expression under the integral with respect to \(x\).
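A quick numerical check of this identity, with the hypothetical choice \(\sigma(t)=e^{-t^{2} / 2}\) (for which \(\hat{\sigma}(a)=\frac{1}{\sqrt{2 \pi}} e^{-a^{2} / 2}\) under the convention above) and arbitrary test values of \(a\), \(\omega\), \(x\):

```python
import numpy as np

# Illustrative activation sigma(t) = exp(-t^2/2), whose transform under the
# 1/(2*pi) convention is hat{sigma}(a) = exp(-a^2/2) / sqrt(2*pi).
def sigma(t):
    return np.exp(-t**2 / 2)

a, omega, x = 1.0, 0.7, 2.0                       # arbitrary nonzero test values
sigma_hat_a = np.exp(-a**2 / 2) / np.sqrt(2 * np.pi)

# Truncated Riemann sum for (1/(2*pi*hat{sigma}(a))) * int sigma(w.x + b) e^{-iab} db.
b = np.linspace(-30, 30, 200001)
lhs = (sigma(omega * x + b) * np.exp(-1j * a * b)).sum() * (b[1] - b[0])
lhs /= 2 * np.pi * sigma_hat_a

print(lhs)                        # ≈ exp(i * a * omega * x)
print(np.exp(1j * a * omega * x))
```

Both printed values agree to many digits; the truncation of the \(b\)-integral to \([-30,30]\) is the only source of error.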

This allows us to write the Fourier mode \(e^{i a \omega \cdot x}\) as an integral of neuron output functions. Replacing \(\omega\) with \(a^{-1} \omega\) in this identity and substituting it into the Fourier representation of \(f\) (note that our assumptions imply \(\hat{f} \in L^{1}\), so this is rigorously justified for a.e. \(x\)), we get

\[ f(x)=\int_{\mathbb{R}^{d}} e^{i \omega \cdot x} \hat{f}(\omega) d \omega=\int_{\mathbb{R}^{d}} \int_{\mathbb{R}} \frac{1}{2 \pi \hat{\sigma}(a)} \sigma\left(a^{-1} \omega \cdot x+b\right) \hat{f}(\omega) e^{-i a b} d b d \omega=\int_{\mathbb{R}^{d} \times \mathbb{R}} k(x, \theta) d \theta \]

where \(\theta=(\omega, b)\) and

\[ k(x, \theta)=\frac{1}{2 \pi \hat{\sigma}(a)} \sigma\left(a^{-1} \omega \cdot x+b\right) \hat{f}(\omega) e^{-i a b} \]

Thus we have

\[\begin{split} \begin{aligned} |k(x, \theta)| & \leq \frac{1}{2 \pi|\hat{\sigma}(a)|} \max _{x \in \Omega}\left|\sigma\left(a^{-1} \omega \cdot x+b\right)\right||\hat{f}(\omega)| \\ & \leq h(\omega, b)|\hat{f}(\omega)|=\rho(\theta) \end{aligned} \end{split}\]

where

\[ h(\omega, b)=\max _{x \in \Omega}\left|\sigma\left(a^{-1} \omega \cdot x+b\right)\right| \]

where the constant \(\frac{1}{2 \pi|\hat{\sigma}(a)|}\) has been absorbed. Thus, following the discussion in the last section, the next step is to analyze \(h(\omega, b)\); we take this up in the next section. Before that, let us introduce a special case of the above representation when the activation function is periodic.