Stochastic gradient descent method and convergence theory¶
The following optimization problem is the most common case in machine learning.
Problem
\[
\min_{x} f(x), \qquad f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x).
\]
One version of the stochastic gradient descent (SGD) algorithm is:
Algorithm 4 (SGD)
Input: initialization \(x_0\), learning rate \(\eta_t\).
For: t = 0,1,2,\(\dots\)
Randomly pick \(i_t \in \{1, 2, \cdots, N\}\) independently with probability \(\frac{1}{N}\) and update
\[
x_{t+1} = x_{t} - \eta_t \nabla f_{i_t}(x_t).
\]
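The iteration above can be sketched in a few lines of Python. The callables grad_fi(i, x) and lr(t) below are hypothetical placeholders for \(\nabla f_i\) and the learning rate \(\eta_t\); this is a minimal sketch, not a full implementation.

```python
import numpy as np

def sgd(grad_fi, x0, N, lr, T, rng=None):
    """Run the SGD iteration of Algorithm 4 for T steps.

    grad_fi(i, x): gradient of the i-th component f_i at x (user-supplied).
    lr(t): learning rate eta_t at step t.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for t in range(T):
        i_t = rng.integers(N)              # pick i_t uniformly from {0, ..., N-1}
        x = x - lr(t) * grad_fi(i_t, x)    # x_{t+1} = x_t - eta_t * grad f_{i_t}(x_t)
    return x
```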
Convergence of SGD¶
Theorem 2
Assume that each \(f_i(x)\) is \(\lambda\)-strongly convex and \(\|\nabla f_i(x)\| \le M\) for some \(M >0\). If we take \(\eta_t = \frac{a}{\lambda (t+1)}\) with sufficiently large \(a\) such that
then
where \(e_t = \|x_t - x^*\|\).
Proof. The \(L^2\) error of SGD can be written as
The third line comes from the fact that
and
Note that when \(t=0\), we have
based on the assumption.
In the case of SGD, by the inductive hypothesis,
which completes the proof. ◻
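To make the \(\mathcal{O}\!\left(\frac{1}{t+1}\right)\) rate concrete, here is a rough single-trajectory check on a toy one-dimensional problem with hypothetical quadratic components \(f_i(x) = \frac{\lambda}{2}(x - c_i)^2\). Since the bound in the theorem holds in expectation, one run only tracks the rate approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, a = 1.0, 2.0                      # strong convexity constant and step constant a
centers = rng.normal(size=100)         # c_i in f_i(x) = (lam/2) * (x - c_i)^2
x_star = centers.mean()                # minimizer of f(x) = (1/N) sum_i f_i(x)

x, errors = 5.0, []
for t in range(10_000):
    i = rng.integers(len(centers))
    eta = a / (lam * (t + 1))          # eta_t = a / (lambda * (t + 1))
    x -= eta * lam * (x - centers[i])  # grad f_i(x) = lam * (x - c_i)
    errors.append((x - x_star) ** 2)

# e_t^2 * (t + 1) should stay roughly bounded if e_t^2 decays like 1/(t+1)
for t in (99, 999, 9_999):
    print(t, errors[t], errors[t] * (t + 1))
```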
SGD with mini-batch¶
First, we introduce a natural extension of the SGD algorithm discussed above by using mini-batches.
Algorithm 5 (SGD with mini-batch)
Input: initialization \(x_0\), learning rate \(\eta_t\).
For: t = 0,1,2,\(\dots\)
Randomly pick \(B_t \subset \{1, 2, \cdots, N\}\) independently with probability \(\frac{m!(N-m)!}{N!}\) and \(\# B_t = m\), then update
\[
x_{t+1} = x_{t} - \eta_t g_t,
\]
where
\[
g_t = \frac{1}{m} \sum_{i \in B_t} \nabla f_i(x_t).
\]
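A minimal Python sketch of this mini-batch variant is given below; as before, grad_fi(i, x) and lr(t) are hypothetical placeholders. Drawing \(m\) distinct indices uniformly at random gives every size-\(m\) subset the probability \(\frac{m!(N-m)!}{N!}\).

```python
import numpy as np

def minibatch_sgd(grad_fi, x0, N, m, lr, T, rng=None):
    """Sketch of Algorithm 5: each step draws B_t uniformly among all size-m subsets."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for t in range(T):
        # m distinct indices chosen uniformly: each subset has probability m!(N-m)!/N!
        B_t = rng.choice(N, size=m, replace=False)
        g_t = sum(grad_fi(i, x) for i in B_t) / m   # (1/m) sum_{i in B_t} grad f_i(x_t)
        x = x - lr(t) * g_t
    return x
```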
Now we introduce the SGD algorithm with mini-batches sampled without replacement, which is the most commonly used version of SGD in machine learning.
Algorithm 6 (Shuffle SGD with mini-batch)
Input: learning rate \(\eta_k\), mini-batch size \(m\), parameter initialization \(x_{0}\) and denote \(M = \lceil \frac{N}{m} \rceil\).
For Epoch \(k = 1,2,\dots\)
Randomly pick \(B_t \subset \{1, 2, \cdots, N \}\) without replacement
with \(\# B_t = m\) for \(t = 1,2,\cdots,M\).
For mini-batch \(t = 1:M\)
Compute the gradient on \(B_{t}\):
\[
g_t = \frac{1}{m} \sum_{i \in B_t} \nabla f_i(x).
\]
Update \(x\):
\[
x \leftarrow x - \eta_k \, g_t.
\]
EndFor
To “randomly pick \(B_t \subset \{1, 2, \cdots, N \}\) without replacement with \(\# B_t = m\) for \(t = 1,2,\cdots,M\)”, we usually just shuffle the index set randomly and then take consecutive blocks of \(m\) elements from the shuffled set, as sketched in the code below. That is why we call this algorithm shuffled SGD; it is the most commonly used version of SGD in machine learning.
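The following sketch shows this shuffle-then-chunk strategy; grad_fi(i, x) and lr(k) are again hypothetical placeholders, and the last chunk of an epoch may hold fewer than \(m\) indices when \(m\) does not divide \(N\).

```python
import numpy as np

def shuffle_sgd(grad_fi, x0, N, m, lr, num_epochs, rng=None):
    """Sketch of Algorithm 6: reshuffle the indices every epoch, then sweep them in chunks of m."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for k in range(1, num_epochs + 1):
        perm = rng.permutation(N)                 # random shuffle of the index set
        for start in range(0, N, m):              # consecutive chunks give B_1, ..., B_M
            B_t = perm[start:start + m]           # last chunk may be smaller than m
            g_t = sum(grad_fi(i, x) for i in B_t) / len(B_t)
            x = x - lr(k) * g_t                   # learning rate eta_k is fixed within epoch k
    return x
```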
Remark
Let us recall a general machine learning loss function
\[
L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(h(X_i; \theta), Y_i\big),
\]
where \(\{(X_i, Y_i)\}_{i=1}^N\) are the data pairs. For example, \(\ell(\cdot, \cdot)\) can be the cross-entropy loss and \(h(x; \theta) = p(x;\theta)\), as we discussed in Section 2.2.1. Thus, we have the following corresponding relation:
\[
f(x) \sim L(\theta), \qquad f_i(x) \sim \ell\big(h(X_i; \theta), Y_i\big), \qquad x \sim \theta.
\]
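As a small illustration of this correspondence, one may write the following sketch, where the model h(X_i, theta) is a hypothetical function returning a probability vector and each cross-entropy summand plays the role of \(f_i\).

```python
import numpy as np

def cross_entropy(p, y):
    """ell(p, y) = -log p_y for a probability vector p and class label y."""
    return -np.log(p[y])

def f_i(theta, X_i, Y_i, h):
    """The role of f_i(x): the loss on the single data pair (X_i, Y_i), with x <-> theta."""
    return cross_entropy(h(X_i, theta), Y_i)

def total_loss(theta, data, h):
    """L(theta) = (1/N) sum_i ell(h(X_i; theta), Y_i), i.e. f(x) = (1/N) sum_i f_i(x)."""
    return np.mean([f_i(theta, X_i, Y_i, h) for X_i, Y_i in data])
```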