Monte Carlo Methods¶
Let \(\lambda \geq 0\) be a probability density function on \(G \subset \mathbb{R}^{d}\) such that
\[
\int_{G} \lambda(\omega)\, d\omega=1.
\]
For example, \(\lambda(\omega)=\frac{1}{|G|}\) if \(G\) is bounded. The expectation of \(g\) with respect to \(\lambda\) is defined as
\[
\mathbb{E} g=\int_{G} g(\omega) \lambda(\omega)\, d \omega,
\]
and, for any \(h=h\left(\omega_{1}, \omega_{2}, \ldots, \omega_{n}\right): G \times G \times \cdots \times G \mapsto \mathbb{R}\),
\[
\mathbb{E}_{n}(h)=\int_{G^{n}} h\left(\omega_{1}, \ldots, \omega_{n}\right) \lambda\left(\omega_{1}\right) \cdots \lambda\left(\omega_{n}\right)\, d \omega_{1} \cdots d \omega_{n}.
\]
A basic result¶
Lemma 9
For any \(g \in L^{\infty}(G)\), we have
\[
\mathbb{E}_{n}\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2} \leq \frac{1}{n}\, \mathbb{E}\left(g^{2}\right) \leq \frac{1}{n}\|g\|_{L^{\infty}(G)}^{2}.
\]
Proof. First note that
\[
\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2}=\frac{1}{n^{2}}\left(\sum_{i=1}^{n}\left(\mathbb{E} g-g\left(\omega_{i}\right)\right)\right)^{2}=\frac{1}{n^{2}}\left(I_{1}+I_{2}\right)
\]
with
\[
I_{1}=\sum_{i=1}^{n}\left(\mathbb{E} g-g\left(\omega_{i}\right)\right)^{2}, \quad I_{2}=\sum_{i \neq j}\left((\mathbb{E} g)^{2}-\mathbb{E}(g)\left(g\left(\omega_{i}\right)+g\left(\omega_{j}\right)\right)+g\left(\omega_{i}\right) g\left(\omega_{j}\right)\right).
\]
Consider \(I_{1}\): for any \(i\),
\[
\mathbb{E}_{n}\left(\mathbb{E} g-g\left(\omega_{i}\right)\right)^{2}=\mathbb{E}\left(g^{2}\right)-(\mathbb{E} g)^{2}.
\]
Thus,
\[
\mathbb{E}_{n}\left(I_{1}\right)=n\left(\mathbb{E}\left(g^{2}\right)-(\mathbb{E} g)^{2}\right).
\]
For \(I_{2}\), note that
\[
\mathbb{E}_{n}\left(g\left(\omega_{i}\right)\right)=\mathbb{E} g
\]
and, for \(i \neq j\),
\[
\mathbb{E}_{n}\left(g\left(\omega_{i}\right) g\left(\omega_{j}\right)\right)=(\mathbb{E} g)^{2}.
\]
Thus
\[
\mathbb{E}_{n}\left(I_{2}\right)=0.
\]
Consequently, we obtain the following two formulas for \(\mathbb{E}_{n}\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2}\):
\[
\mathbb{E}_{n}\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2}=\frac{1}{n}\left(\mathbb{E}\left(g^{2}\right)-(\mathbb{E} g)^{2}\right)=\frac{1}{n}\, \mathbb{E}\left((g-\mathbb{E} g)^{2}\right).
\]
Based on the first formula above, since \((\mathbb{E} g)^{2} \geq 0\), it holds that
\[
\mathbb{E}_{n}\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2} \leq \frac{1}{n}\, \mathbb{E}\left(g^{2}\right).
\]
Due to the second formula above, together with \(\mathbb{E}\left((g-\mathbb{E} g)^{2}\right)=\mathbb{E}\left(g^{2}\right)-(\mathbb{E} g)^{2} \leq \mathbb{E}\left(g^{2}\right) \leq\|g\|_{L^{\infty}(G)}^{2}\),
\[
\mathbb{E}_{n}\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2} \leq \frac{1}{n}\|g\|_{L^{\infty}(G)}^{2},
\]
which completes the proof. \(\square\)
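The \(\frac{1}{n}\) rate in Lemma 9, with the exact constant \(\mathbb{E}\left(g^{2}\right)-(\mathbb{E} g)^{2}\), is easy to observe numerically. Below is a minimal Python sketch; the choices \(G=(0,1)\), \(\lambda \equiv 1\), and \(g(\omega)=\cos^{2}(2 \pi \omega)\) are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Test integrand on G = (0, 1) with the uniform density lambda = 1.
def g(w):
    return np.cos(2 * np.pi * w) ** 2

Eg, Eg2 = 0.5, 0.375        # E g = 1/2 and E(g^2) = 3/8, in closed form
var = Eg2 - Eg ** 2         # E(g^2) - (E g)^2 = 1/8

for n in [10, 100, 1000]:
    # Average over many independent n-sample draws to estimate E_n(...)^2.
    est = g(rng.random((10000, n))).mean(axis=1)
    mse = np.mean((Eg - est) ** 2)
    print(f"n={n:5d}  E_n(error^2) ~ {mse:.2e}   exact 1/n constant: {var/n:.2e}")
```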
Of course, we can use this to prove a high probability result.
Corollary 1
Under the assumptions of the preceding lemma, we have, for any \(\epsilon>0\),
\[
\overline{\mathbb{P}}\left[\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2}>\epsilon\right] \leq \frac{1}{n \epsilon}\|g\|_{L^{\infty}}^{2}.
\]
Proof. By Markov's inequality and Lemma 9,
\[
\overline{\mathbb{P}}\left[\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2}>\epsilon\right] \leq \epsilon^{-1}\, \overline{\mathbb{E}}\left(\mathbb{E} g-\frac{1}{n} \sum_{i=1}^{n} g\left(\omega_{i}\right)\right)^{2} \leq \frac{1}{n \epsilon}\|g\|_{L^{\infty}}^{2}. \qquad \square
\]
This corollary implies that the set of \(\omega_{i}\) where the estimate \(n^{-1} \sum_{i=1}^{n} g\left(\omega_{i}\right)\) is far from the desired value \(\mathbb{E} g\) is small.
The practical usefulness of this algorithm depends upon the existence of a repeatable process (for instance some physical process) which generates \(\omega\) according to a desired distribution \(\mu\).
The precise meaning of this last statement is essentially that the strong law of large numbers holds. Specifically, if \(\omega_{1}, \ldots, \omega_{n}, \ldots\) is an infinite sequence generated by the process, and \(A \subset \Omega\) is any measurable set, then
\[
\lim _{n \rightarrow \infty} \frac{\left|\left\{i: 1 \leq i \leq n,\ \omega_{i} \in A\right\}\right|}{n}=\mu(A).
\]
Generating \(n\) independent samples means generating \(\omega_{1}, \ldots, \omega_{n}\) from \(\mu^{n}\) according to the above notion. The existence of a realizable process generating samples from a probability distribution, and the practical use of such processes is an interesting topic in the intersection of statistics, physics, and computer science. In addition, statistics/probability theory studies how to take samples from one probability distribution and transform them to samples from another distribution.
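As a small illustration of that last point, inverse transform sampling converts uniform samples into samples from another distribution via the inverse of its cumulative distribution function. A minimal sketch; the exponential target below is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse transform sampling: if U ~ Uniform(0,1) and F is a target CDF,
# then F^{-1}(U) is distributed according to F.
tau = 2.0                       # rate of the exponential target (example)
u = rng.random(100_000)
x = -np.log(1.0 - u) / tau      # F^{-1}(u) for F(x) = 1 - exp(-tau x)

print(x.mean(), 1 / tau)        # sample mean vs. exact mean 1/tau
```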
Application¶
Lemma 10
Let
\[
f(x)=\int_{G} g(x, \theta) \lambda(\theta)\, d \theta
\]
with \(\lambda(\theta) \geq 0\) and \(\|\lambda(\theta)\|_{L^{1}(G)}=1\). For any \(n \geq 1\), there exist \(\theta_{i}^{*} \in G\) such that
\[
\left\|f-f_{n}\right\|_{L^{2}(\Omega)}^{2} \leq \frac{1}{n} \int_{G}\|g(\cdot, \theta)\|_{L^{2}(\Omega)}^{2} \lambda(\theta)\, d \theta,
\]
where \(\|g(\cdot, \theta)\|_{L^{2}(\Omega)}^{2}=\int_{\Omega}[g(x, \theta)]^{2} d \mu(x)\), and
\[
f_{n}(x)=\frac{1}{n} \sum_{i=1}^{n} g\left(x, \theta_{i}^{*}\right).
\]
Proof. Since \(\lambda(\theta)\) is a probability density on \(G\),
\[
f(x)=\int_{G} g(x, \theta) \lambda(\theta)\, d \theta=\mathbb{E}(g(x, \cdot)).
\]
By Lemma 9, for each fixed \(x\),
\[
\mathbb{E}_{n}\left(f(x)-\frac{1}{n} \sum_{i=1}^{n} g\left(x, \theta_{i}\right)\right)^{2} \leq \frac{1}{n}\, \mathbb{E}\left(g^{2}(x, \cdot)\right),
\]
and
\[
\mathbb{E}_{n}(h) \leq \frac{1}{n} \int_{\Omega} \mathbb{E}\left(g^{2}(x, \cdot)\right) d \mu(x)
\]
by taking the integral over \(\Omega\) and exchanging the order of integration, where
\[
h\left(\theta_{1}, \theta_{2}, \cdots, \theta_{n}\right)=\int_{\Omega}\left(f(x)-\frac{1}{n} \sum_{i=1}^{n} g\left(x, \theta_{i}\right)\right)^{2} d \mu(x).
\]
Since \(\mathbb{E}_{n}(1)=1\) and \(\mathbb{E}_{n}(h) \leq \frac{1}{n} \int_{\Omega} \mathbb{E}\left(g^{2}(x, \cdot)\right) d \mu(x)\), there exists \(\left(\theta_{1}^{*}, \theta_{2}^{*}, \cdots, \theta_{n}^{*}\right) \in G \times G \times \cdots \times G\) such that
\[
h\left(\theta_{1}^{*}, \theta_{2}^{*}, \cdots, \theta_{n}^{*}\right) \leq \frac{1}{n} \int_{\Omega} \mathbb{E}\left(g^{2}(x, \cdot)\right) d \mu(x).
\]
Otherwise, we would have \(h\left(\theta_{1}, \theta_{2}, \cdots, \theta_{n}\right)>\frac{1}{n} \int_{\Omega} \mathbb{E}\left(g^{2}(x, \cdot)\right) d \mu(x)\) for every \(\left(\theta_{1}, \cdots, \theta_{n}\right)\), which would force \(\mathbb{E}_{n}(h)>\frac{1}{n} \int_{\Omega} \mathbb{E}\left(g^{2}(x, \cdot)\right) d \mu(x)\), a contradiction. Since \(\int_{\Omega} \mathbb{E}\left(g^{2}(x, \cdot)\right) d \mu(x)=\int_{G}\|g(\cdot, \theta)\|_{L^{2}(\Omega)}^{2} \lambda(\theta)\, d \theta\), this implies that
\[
\left\|f-f_{n}\right\|_{L^{2}(\Omega)}^{2} \leq \frac{1}{n} \int_{G}\|g(\cdot, \theta)\|_{L^{2}(\Omega)}^{2} \lambda(\theta)\, d \theta,
\]
which completes the proof. \(\square\)
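A numerical sketch of Lemma 10 follows; all concrete choices here are assumptions made for illustration. With \(g(x, \theta)=\cos (\theta x)\), \(\lambda\) the standard Gaussian density on \(G=\mathbb{R}\), and \(\mu\) the uniform probability measure on \(\Omega=(0,1)\), one has \(f(x)=e^{-x^{2} / 2}\); the average of \(\left\|f-f_{n}\right\|_{L^{2}(\Omega)}^{2}\) over random draws decays like \(1/n\), so in particular some draw \(\left(\theta_{1}^{*}, \ldots, \theta_{n}^{*}\right)\) achieves the bound.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(0.0, 1.0, 201)     # grid on Omega = (0, 1), mu uniform
f = np.exp(-x ** 2 / 2)            # f(x) = E_{theta ~ N(0,1)} cos(theta x)

for n in [10, 100, 1000]:
    errs = []
    for _ in range(200):                                # independent draws
        theta = rng.standard_normal(n)                  # theta_i ~ lambda
        fn = np.cos(np.outer(x, theta)).mean(axis=1)    # (1/n) sum_i g(x, theta_i)
        errs.append(np.mean((f - fn) ** 2))             # ~ ||f - f_n||^2_{L^2}
    print(f"n={n:5d}  mean error^2 = {np.mean(errs):.2e}")
```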
We also have a more general version of the above lemma.
Lemma 11
Let
with \(\|\lambda(\theta)\|_{L^{1}(\Theta)}=1\). For any \(n \geq 1\), there exist \(\theta_{i}^{*} \in \Theta\) such that
where
In particular, if
Then
For any \(f(x)=\int_{G} g(x, \theta) \rho(\theta)\, d \theta\) with \(\|\rho\|_{L^{1}(\Theta)} \neq 1\), let \(\lambda(\theta)=\frac{\rho(\theta)}{\|\rho\|_{L^{1}(\Theta)}}\). Thus,
\[
f(x)=\|\rho\|_{L^{1}(\Theta)} \int_{G} g(x, \theta) \lambda(\theta)\, d \theta
\]
with \(\|\lambda(\theta)\|_{L^{1}(\Theta)}=1\), and we can apply the above two lemmas to the given function \(f(x)\).
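For instance (an illustrative choice), if \(\rho(\theta)=e^{-|\theta|^{2} / 2}\) on \(\Theta=\mathbb{R}^{d}\), then \(\|\rho\|_{L^{1}(\Theta)}=(2 \pi)^{d / 2}\) and \(\lambda\) is the standard Gaussian density; applying Lemma 10 to \(\|\rho\|_{L^{1}(\Theta)}\, g\) gives
\[
\left\|f-f_{n}\right\|_{L^{2}(\Omega)}^{2} \leq \frac{(2 \pi)^{d}}{n} \int_{G}\|g(\cdot, \theta)\|_{L^{2}(\Omega)}^{2} \lambda(\theta)\, d \theta.
\]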
Integral representations of functions¶
Fourier representation¶
Consider the Fourier transform
\[
\hat{f}(\omega)=\frac{1}{(2 \pi)^{d}} \int_{\mathbb{R}^{d}} e^{-i \omega \cdot x} f(x)\, d x.
\]
We write \(\hat{f}(\omega)=e^{i \theta(\omega)}|\hat{f}(\omega)|\). By the Fourier inversion formula,
\[
f(x)=\int_{\mathbb{R}^{d}} e^{i \omega \cdot x} \hat{f}(\omega)\, d \omega=\int_{\mathbb{R}^{d}} e^{i(\omega \cdot x+\theta(\omega))}|\hat{f}(\omega)|\, d \omega.
\]
Since \(f(x)\) is real-valued, taking the real part gives, for every \(x\),
\[
f(x)=\int_{\mathbb{R}^{d}} \cos (\omega \cdot x+\theta(\omega))|\hat{f}(\omega)|\, d \omega.
\]
Then we have
\[
f(x)=\int_{\mathbb{R}^{d}} g(x, \omega) \lambda(\omega)\, d \omega
\]
with
\[
\lambda(\omega)=\frac{|\hat{f}(\omega)|}{\|\hat{f}\|_{L^{1}\left(\mathbb{R}^{d}\right)}}
\]
and
\[
g(x, \omega)=\|\hat{f}\|_{L^{1}\left(\mathbb{R}^{d}\right)} \cos (\omega \cdot x+\theta(\omega)).
\]
Theorem 11
There exist \(\omega_{i} \in \mathbb{R}^{d}\) (that is, we take \(G=\mathbb{R}^{d}\) in Lemma 10) such that
\[
\left\|f-f_{n}\right\|_{L^{2}(\Omega)} \leq \frac{\|\hat{f}\|_{L^{1}\left(\mathbb{R}^{d}\right)}}{\sqrt{n}},
\]
where
\[
f_{n}(x)=\frac{\|\hat{f}\|_{L^{1}\left(\mathbb{R}^{d}\right)}}{n} \sum_{i=1}^{n} \cos \left(\omega_{i} \cdot x+\theta\left(\omega_{i}\right)\right).
\]
Note that
\[
\|g(\cdot, \omega)\|_{L^{2}(\Omega)} \leq\|\hat{f}\|_{L^{1}\left(\mathbb{R}^{d}\right)}
\]
with \(\mu\) a probability measure on \(\Omega\), so the bound follows from Lemma 10.
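A one-dimensional sketch of this construction; the example function is an assumption made here. The function \(f(x)=\frac{1}{1+x^{2}}\) has \(\hat{f}(\omega)=\frac{1}{2} e^{-|\omega|} \geq 0\) under the convention above, so \(\theta(\omega)=0\), \(\|\hat{f}\|_{L^{1}(\mathbb{R})}=1\), and \(\lambda\) is the Laplace density.

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = 1/(1 + x^2), hat f(omega) = exp(-|omega|)/2, so lambda = Laplace(0, 1).
x = np.linspace(-3.0, 3.0, 401)
f = 1.0 / (1.0 + x ** 2)

for n in [10, 100, 1000]:
    omega = rng.laplace(0.0, 1.0, size=n)            # omega_i ~ lambda
    fn = np.cos(np.outer(x, omega)).mean(axis=1)     # (1/n) sum_i cos(omega_i x)
    print(f"n={n:5d}  max |f - f_n| = {np.abs(f - fn).max():.3f}")
```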
Double Fourier representation¶
Assume that \(\sigma\) is a locally Riemann integrable function and \(\sigma \in L^{1}(\mathbb{R})\), and thus the Fourier transform of \(\sigma\) is well-defined and continuous. Since \(\sigma\) is non-zero and its Fourier transform
\[
\hat{\sigma}(a)=\frac{1}{2 \pi} \int_{\mathbb{R}} \sigma(t) e^{-i a t}\, d t
\]
is continuous, this implies that \(\hat{\sigma}(a) \neq 0\) for some \(a \neq 0\). Via a change of variables \(t=\omega \cdot x+b\) and \(d t=d b\), this means that for all \(x\) and \(\omega\), we have
\[
\hat{\sigma}(a)=\frac{1}{2 \pi} \int_{\mathbb{R}} \sigma(\omega \cdot x+b) e^{-i a(\omega \cdot x+b)}\, d b,
\]
and so
\[
e^{i a \omega \cdot x}=\frac{1}{2 \pi \hat{\sigma}(a)} \int_{\mathbb{R}} \sigma(\omega \cdot x+b) e^{-i a b}\, d b.
\]
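This identity can be checked numerically for a concrete activation; here \(\sigma(t)=e^{-t^{2} / 2}\) is an assumed example, for which \(\hat{\sigma}(a)=\frac{1}{\sqrt{2 \pi}} e^{-a^{2} / 2}\) under the convention \(\hat{\sigma}(a)=\frac{1}{2 \pi} \int_{\mathbb{R}} \sigma(t) e^{-i a t}\, d t\).

```python
import numpy as np

# Verify e^{i a (w.x)} = (2 pi sigma_hat(a))^{-1} Int sigma(w.x + b) e^{-i a b} db
# for the Gaussian sigma(t) = exp(-t^2/2), sigma_hat(a) = exp(-a^2/2)/sqrt(2 pi).
a, wx = 1.0, 0.91                   # a != 0 and a sample value of w . x
b = np.linspace(-30.0, 30.0, 200_001)
integrand = np.exp(-(wx + b) ** 2 / 2) * np.exp(-1j * a * b)
integral = integrand.sum() * (b[1] - b[0])       # Riemann sum over b

sigma_hat = np.exp(-a ** 2 / 2) / np.sqrt(2 * np.pi)
print(np.exp(1j * a * wx))                 # left-hand side
print(integral / (2 * np.pi * sigma_hat))  # right-hand side, agrees numerically
```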
Likewise, since the growth condition also implies that \(\sigma^{(k)} \in L^{1}\), we can differentiate the above expression under the integral with respect to \(x\).
This allows us to write the Fourier mode \(e^{i a \omega \cdot x}\) as an integral of neuron output functions. We substitute this into the Fourier representation of \(f\) (note that the assumption we make implies that \(\hat{f} \in L^{1}\) so this is rigorously justified for a.e. \(x\) ) to get
\[
f(x)=\int_{\mathbb{R}^{d}} e^{i \omega \cdot x} \hat{f}(\omega)\, d \omega=\int_{\mathbb{R}^{d}} \int_{\mathbb{R}} \frac{1}{2 \pi \hat{\sigma}(a)} \sigma\left(a^{-1} \omega \cdot x+b\right) \hat{f}(\omega) e^{-i a b}\, d b\, d \omega=\int_{\mathbb{R}^{d} \times \mathbb{R}} k(x, \theta)\, d \theta,
\]
where \(\theta=(\omega, b)\) and
\[
k(x, \theta)=\frac{1}{2 \pi \hat{\sigma}(a)} \sigma\left(a^{-1} \omega \cdot x+b\right) \hat{f}(\omega) e^{-i a b}.
\]
Thus we have
\[
f(x)=\int_{\mathbb{R}^{d} \times \mathbb{R}} \sigma\left(a^{-1} \omega \cdot x+b\right) h(\omega, b)\, d \omega\, d b,
\]
where
\[
h(\omega, b)=\hat{f}(\omega) e^{-i a b}
\]
if we ignore the coefficient \(\frac{1}{2 \pi \hat{\sigma}(a)}\). Thus, following the discussion in the last section, the next step is to analyze \(h(\omega, b)\), which we will discuss in the next section. Before that, let us introduce a special case of the above representation when the activation function is periodic.